Commit Graph

82 Commits

Author SHA1 Message Date
Tomasz Grabiec
7d0f4c10a2 test: tablets: Add test for failed streaming being fenced away 2023-12-06 18:37:01 +01:00
Patryk Jędrzejczak
cd7b282db6 test: ManagerClient: introduce servers_add
We add a new function - servers_add - that allows adding multiple
servers concurrently to a cluster. It makes use of a concurrent
bootstrap now supported in the raft-based topology.

servers_add doesn't have the replace_cfg parameter. The reason is
that we don't support concurrent replace operations, at least for
now.

There is an implementation detail in ScyllaCluster.add_servers. We
cannot simply do multiple calls to add_server concurrently. If we
did that in an empty cluster, every node would take itself as the
only seed and start a new cluster. To solve this, we introduce a
new field - initial_seed. It is used to choose one of the servers
as a seed for all servers added concurrently to an empty cluster.

Note that the add_server calls in asyncio.gather in add_servers
cannot race with each other when setting initial_seed because
there is only one thread.

In the future, we will also start all initial servers concurrently
in ScyllaCluster.install_and_start. The changes in this commit were
designed in a way that will make changing install_and_start easy.
2023-11-24 09:39:01 +01:00
Patryk Jędrzejczak
aca90e6640 test: ManagerClient: introduce _create_server_add_data
We introduce this function to avoid code duplication. After the
following commits, it will also be used in the new
ManagerClient.servers_add function.
2023-11-24 09:39:01 +01:00
Patryk Jędrzejczak
9775b1c12d test: server_add: wait until the node being replaced is dead
In the following commits, we make the topology coordinator reject
join requests if the node being replaced is considered alive by the
gossiper. Before making this change, we need to adapt the testing
framework so that we don't have flaky replace operations that fail
because the node being replaced hasn't been marked as dead yet. We
achieve this by waiting until all other running nodes see the node
being replaced as dead in all replace operations.
2023-11-21 12:39:16 +01:00
Patryk Jędrzejczak
18ed89f760 test: server_add: add support for expected errors
After this change, if we try to add a server and it fails with an
expected error, the add_server function will not throw. Also, the
server will be correctly installed and stopped.

Two issues are motivating this feature.

The first one is that if we want to add a server while expecting
an error, we have to do it in two steps:
- call server_add with the start parameter set to False,
- call server_start with the expected_error parameter.
It is quite inconvenient.

The second one is that we want to be able to test the replace
operation when it is considered incorrect, for example when we try
to replace an alive node. To do this, we would have to remove
some assertions from ScyllaCluster.add_server. However, we should
not remove them because they give us clear information when we
write an incorrect test. After adding the expected_error parameter,
we can ignore these assertions only when we expect an error. In
this way, we enable testing failing replace operations without
sacrificing the testing framework's protection.
2023-11-21 12:39:16 +01:00
Paweł Zakrzewski
a0dcc154c1 test: add the auth_cluster test suite
This commit adds the auth_cluster test suite to test a custom scenario
involving password authentication:
- create a cluster of 2 nodes with password authentication
- down one node
- the other node should refuse login stating that it couldn't reach
  QUORUM

References ScyllaDB OSS #2339
2023-11-13 14:04:28 +01:00
Kamil Braun
7dcee7de02 test/pylib: implement expected_error for decommission and removenode
You can now pass `expected_error` to `ManagerClient.decommission_node`
and `ManagerClient.remove_node`. Useful in combination with error
injections, for example.

Closes scylladb/scylladb#15650
2023-10-17 16:25:43 +03:00
Kamil Braun
05ede7a042 test/pylib: always return a response from put_json
In 20ff2ae5e1 mutating endpoints were
changed to use PUT. But some of them return a response, and I forgot to
provide `response_type` parameter to `put_json` (which causes
`RESTClient` to actually obtain the response). These endpoints now
return `None`.

Fix this.

Closes scylladb/scylladb#15674
2023-10-09 14:35:04 +03:00
Kamil Braun
d3bc0d47e0 test/pylib: always return data as JSON from endpoints
Some endpoint handlers return JSON, some return text, some return empty
responses.

Reduce the number of different handler types by making the text case a
subcase of the JSON case. This also simplifies some code on the
`ManagerClient` side, which would have to deserialize data from text
(because some endpoint handlers would serialize data into text for no
particular reason). And it will allow reducing boilerplate in later
commits even further.
2023-10-06 11:24:02 +02:00
Kamil Braun
f848d7b5c0 test/pylib: use JSON data to pass expected_error in server_start
Most other endpoints receive data through request body as JSON, this one
endpoint is an exception for some reason. Make it consistent with
others.
2023-10-06 10:55:45 +02:00
Kamil Braun
20ff2ae5e1 test/pylib: use PUT instead of GET for mutating endpoints
`ScyllaClusterManager` registers a bunch of HTTP endpoints which
`ManagerClient` uses to perform operations on a cluster during
a topology test.

The endpoints were inconsistently using verbs, like using GET for
endpoints that would have side effects. Use PUT for these.
2023-10-06 10:55:45 +02:00
Kamil Braun
33463df7d2 test/pylib: fix some type errors 2023-10-06 10:55:45 +02:00
Botond Dénes
70e26e5a10 test/pylib: add REST methods to get node exe and workdir paths 2023-09-22 02:53:15 -04:00
Botond Dénes
7e7101c180 Revert "Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes"
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.

The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.

Fixes: #15371
2023-09-13 10:45:37 +03:00
Botond Dénes
46e37436d0 test/pylib: add REST methods to get node exe and workdir paths 2023-09-11 07:02:14 -04:00
Aleksandra Martyniuk
ede8182dd4 test: fix types and variable names in wait_for_host_down
Fix types and variable names in ManagerClient::wait_for_host_down
and related methods.
2023-09-05 15:01:59 +02:00
Mikołaj Grzebieluch
a031a14249 tests: add asynchronous log browsing functionality
Add a class that handles log file browsing with the following features:
* mark: returns "a mark" to the current position of the log.
* wait_for: asynchronously checks if the log contains the given message.
* grep: returns a list of lines matching the regular expression in the log.

Add a new endpoint in `ManagerClient` to obtain the scylla logfile path.

Fixes #14782

Closes #14834
2023-08-25 14:19:09 +02:00
Kamil Braun
169d19e5b0 Merge 'raft topology: support --ignore-dead-nodes in removenode and replace' from Patryk Jędrzejczak
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.

The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.

Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.

Fixes #15025

Closes #15113

* github.com:scylladb/scylladb:
  test: add test_raft_ignore_nodes
  test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
  raft topology: pass ignore_nodes to {replace, remove}_with_repair
  raft topology: exec_global_command: add ignore_nodes to exclude_nodes
  raft topology: exec_global_command: change type of exclude_nodes
  topology_state_machine: extend request_param with a set of raft ids
  raft topology: set ignore_nodes in raft_removenode and raft_replace
  utils: introduce split_comma_separated_list
  raft topology: add the ignore_nodes column to system.topology
2023-08-22 18:04:59 +02:00
Kamil Braun
cdc3cd2b79 Merge 'raft: add fencing tests' from Petr Gusev
In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.

The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.

Closes #14901

* github.com:scylladb/scylladb:
  test_fencing: add test_fence_hints
  test.py: output the skipped tests
  test.py: add skip_mode decorator and fixture
  test.py: add mode fixture
  hints: add debug log for dropped hints
  hints: send_one_hint: extend the scope of file_send_gate holder
  pylib: add ScyllaMetrics
  hints manager: add send_errors counter
  token_metadata: add debug logs
  fencing: add simple data plane test
  random_tables.py: add counter column type
  raft topology: don't increment version when transitioning to node_state::normal
2023-08-22 16:28:21 +02:00
Patryk Jędrzejczak
6818d13f7d test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
ManagerClient.remove_node allows passing ignore_dead only as
List[IPAddress]. However, raft_removenode currently supports
only host ids. To write a test that passes ignore_dead to
ManagerClient.remove_node in the Raft topology mode, we allow
passing ignore_dead as List[HostId].

Note that we don't want to use List[IPAddress | HostId] because
mixing IP addresses and host ids fails anyway. See
ss::remove_node.set(...) in api::set_storage_service.
2023-08-22 14:19:09 +02:00
Petr Gusev
0b7a90dff6 pylib: add ScyllaMetrics
This patch adds facilities to work
with Scylla metrics from test.py tests.
The new metrics property was added to
ManagerClient, its query method
sends a request to Scylla metrics
endpoint and returns and object
to conveniently access the result.

ScyllaMetrics is copy-pasted from
test_shedding.py. It's difficult
to reuse code between 'new' and 'old'
styles of tests, we can't just import
pylib in 'old' tests because of some
problems with python search directories.
A past commit of mine that attempted
to solve this problem was rejected on review.
2023-08-22 14:31:04 +04:00
Gleb Natapov
517f6bfa8a test: add rebuild test
Add simple rebuild test that makes sure that rebuild operation does not
fail.
2023-08-10 16:46:13 +03:00
Konstantin Osipov
df97135583 test.py: forward the optional property file when creating a server
To support multi-DC tests we need to provide a property
file when creating a server.
Forward it from the test client to test.py.

Closes #14683
2023-08-02 13:45:19 +02:00
Alejo Sanchez
2194d8864b test/pylib: remove redundant method
The ManagerClient.get_cql method is defined twice. Remove one and fix
the assert.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-18 13:33:46 +02:00
Kamil Braun
3464877276 test: manager_client: make con_gen for ManagerClient.__init__ nonoptional
`ManagerClient` is given a function that is used to create CQL
connections to the Scylla cluster. For some reason it was typed as
`Optional` even though it was never passed `None`. Fix it.
2023-07-12 11:44:15 +02:00
Kamil Braun
2032d7dbe4 test: scylla_cluster: return the new IP from change_ip API
Also simplify the API by getting rid of `ActionReturn` and returning
errors through exceptions (which are correctly forwarded to the client
for some time already).
2023-07-06 10:24:46 +02:00
Kamil Braun
b38dcba6ed test: pylib: increase checking period for get_alive_endpoints
`server_sees_others` and similar functions periodically call
`get_alive_endpoints`. The period was `.1` seconds, increase it to `.5`
to reduce the log spam (I checked empirically that `.5` is usually how
long it takes in dev mode on my laptop.)
2023-06-20 13:03:46 +02:00
Kamil Braun
ae92932240 test: pylib: manager_client: get_cql() helper 2023-06-20 13:03:46 +02:00
Kamil Braun
e02249f0cd test: pylib: ScyllaCluster: server pause/unpause API 2023-06-20 13:03:46 +02:00
Kamil Braun
7e56388721 test: pylib: ScyllaCluster: generalize config type for server_add
Generalize from `dict[str, str]` to `dict[str, Any]`.
2023-05-29 11:03:36 +02:00
Kamil Braun
f581282625 test: topology_experimental_raft: test check_and_repair_cdc API 2023-05-08 16:49:01 +02:00
Alejo Sanchez
11561a73cb test/pylib: ManagerClient helpers to wait for...
server to see other servers after start/restart

When starting/restarting a server, provide a way to wait for the server
to see at least n other servers.

Also leave the implementation methods available for manual use and
update previous tests, one to wait for a specific server to be seen, and
one to wait for a specific server to not be seen (down).

Fixes #13147

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13438
2023-04-20 14:22:31 +02:00
Tomasz Grabiec
041ee3ffdd test: pylib: Add a way to create cql connections with particular coordinators
Usage:

  await manager.driver_connect(server=servers[0])
  manager.cql.execute(f"...", execution_profile='whitelist')
2023-04-13 21:23:03 +02:00
Alejo Sanchez
e3b462507d test/pylib: topology: support clusters of initial size 0
To allow tests with custom clusters, allow configuration of initial
cluster size of 0.

Add a proof-of-concept test to be removed later.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13342
2023-03-31 11:17:58 +02:00
Petr Gusev
e407956e9f scylla_cluster.py: add start flag to server_add
Sometimes when creating a node it's useful
to just install it and not start. For example,
we may want to try to start it later with
expected error.

The ScyllaServer.install method has been made
exception safe, if an exception occurs, it
reverts to the original state. This allows
to not duplicate the try/except logic
in two of its call sites.
2023-03-24 16:08:17 +04:00
Petr Gusev
794d0e4000 ServerInfo: drop host_id
We are going to allow the
ScyllaCluster.add_server function not to
start the server if the caller has requested
that with a special parameter. The host_id
can only be obtained from a running node, so
add_server won't be able to return it in
this case. I've grepped the tests for host_id
and there doesn't seem to be any
reference to it in the code.
2023-03-24 16:08:17 +04:00
Petr Gusev
8e3392c64f scylla_cluster.py: add config to server_add
Sometimes when creating a node it's useful
to pass a custom node config.
2023-03-24 16:08:17 +04:00
Petr Gusev
c1d0ee2bce scylla_cluster.py: add expected_error to server_start
Sometimes it's useful to check that the node has failed
to start for a particular reason. If server_start can't
find expected_error in the node's log or if the
node has started without errors, it throws an exception.
2023-03-24 16:08:11 +04:00
Konstantin Osipov
4ace19928d raft: (test) test ip address change 2023-03-10 19:52:40 +03:00
Botond Dénes
e55f475db1 Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.

The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.

Closes #12765

* github.com:scylladb/scylladb:
  test/pylib: use larger timeout for decommission/removenode
  test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT
2023-02-13 16:30:24 +02:00
Nadav Har'El
2653865b34 Merge 'test.py: improve test failure handling' from Kamil Braun
Improve logging by printing the cluster at the end of each test.

Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.

Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.

Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.

Closes #12652

* github.com:scylladb/scylladb:
  test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
  test/topology: don't drop random_tables keyspace after a failed test
  test/pylib: mark cluster as dirty after a failed test
  test: pylib, topology: don't perform operations after test on a dirty cluster
  test/pylib: print cluster at the end of test
2023-02-12 12:13:25 +02:00
Kamil Braun
54f85c641d test/pylib: use larger timeout for decommission/removenode
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.

The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.
2023-02-10 15:56:31 +01:00
Kamil Braun
fde6ad5fc0 test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT
Use a more generic name since the constant will also be used as timeout
for decommission and removenode.
2023-02-10 15:56:31 +01:00
Asias He
fc60484422 test: Increase START_TIMEOUT
It is observed that CI machine is slow to run the test. Increase the
timeout of adding servers.
2023-02-03 21:15:08 +08:00
Kamil Braun
a9dbd89478 test/pylib: mark cluster as dirty after a failed test
We don't expect the cluster to be functioning at all after a failed
test. The whole cluster might have crashed, for example. In these
situations the framework would report multiple errors (one for the
actual failure, another for a failed post-condition check because the
cluster was down) which would only obscure the report and make debugging
harder. It's also not safe in general to reuse the cluster in another
test - if the test previous failed, we should not assume that it's in a
valid state.

Therefore, mark the cluster as dirty after a failed test. This will let
us recycle the cluster based on the dirty flag and it will disable
post-condition check after a failed test (which is only done on
non-dirty clusters).

To implement this in topology tests, we use the
`pytest_runtest_makereport` hook which executes after a test finishes
but before fixtures finish. There we store a test-failed flag in a stash
provided by pytest, then access the flag in the `manager` fixture.
2023-02-02 16:35:55 +01:00
Kamil Braun
f4b56cddde test/pylib: print cluster at the end of test
- print the cluster used by the test in `after_test`
- if cluster setup fails in `before_test`, print the cluster together
  with the exception (`after_test` is not executed if `before_test`
  fails)
2023-02-02 15:59:02 +01:00
Kamil Braun
d134c458e5 test/pylib: increase timeout when waiting for cluster before test
Increase the timeout from default 5 minutes to 10 minutes.
Sent as a workaround for #12546 to unblock next promotions.

Closes #12547
2023-01-17 21:03:09 +02:00
Benny Halevy
7d0d9e28f1 test: pylib: ServerInfo: add host_id
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:07 +02:00
Alejo Sanchez
d632e1aa7a test/pytest: add missing import, remove unused import
Add missed import time and remove unused name import.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12446
2023-01-08 17:38:46 +02:00
Petr Gusev
1c23390f12 test.py, allow to specify the node's command line in test
An optional parameter cmdline has been added to
the ManagerClient.server_add method.
It allows you to override the default parameters
set by the SCYLLA_CMDLINE_OPTIONS variable
by changing, adding or deleting individual
items. To change or add a parameter just specify
its name and value one after the other.
To remove parameter use the special keyword
__remove__ as a value. To set a parameter
without a value (such as --overprovisioned)
use the special keyword __missing__ as the value.
2023-01-03 15:24:54 +03:00