pip_packages is an associative array, which in bash is constructed
as ([key]=value...). In our case the value is often empty (indicating
no version constraint). Shellcheck warns against it, since `[key]= x`
could be a mistype of `[key]=x`. It's not in our case, but shellcheck
doesn't know that.
Make shellcheck happier by specifying the empty values explicitly.
Closesscylladb/scylladb#22990
Add verbose logging to identify failing test combinations in multi-DC
setup:
- Log replication factor (RF) and consistency level (CL) for each test
iteration
- Add validation checks for empty result sets
Improve error handling:
- Before indexing in a list, use `assert` to check for its emptiness
- Use assertion failures instead of exceptions for clearer test diagnostics
This change helps debug test failures by showing which RF/CL
combinations cause inconsistent results between zero-token and regular
nodes.
Refs scylladb/scylladb#22967
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22968
The test simulates the cluster getting stuck during upgrade to raft
topology due to majority loss, and then verifies that it's possible to
get out of the situation by performing recovery and redoing the upgrade.
Fixes: #17410Closesscylladb/scylladb#17675
* https://github.com/scylladb/scylladb:
test/topology_experimental_raft: add test_topology_upgrade_stuck
test.py: bump minimum python version to 3.11
test.py: move gather_safely to pylib utils
cdc: generation: don't capture token metadata when retrying update
test.py: topology: ignore hosts when waiting for group0 consistency
raft: add error injection that drops append_entries
topology_coordinator: add injection which makes upgrade get stuck
because we don't care about the exact output of grep, let's silence
its output. also, no need to check for the string is empty, so let's
just use the status code of the grep for the return value of the
function, more idiomatic this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22737
In some cases the paused/unpaused node can hang not after 30s timeout.
This make the test flaky. Change the condition to always check the
coordinator's log if there is a hung node.
Add `stop_after_streaming` to the list of error injections which can
cause a node's hang.
Also add a wait for a new coordinator election in cluster events
which cause such elections.
Closesscylladb/scylladb#22825
`slice.is_reversed()` was falsely flagged as accessing moved data, since
the underlying enum_set remains valid after move. However, to improve code
clarity and silence the warning, now reference `command->slice` directly
instead, which is guaranteed to be valid as the move target.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22971
Eliminate one try/catch block around call to wr.close()
by using coroutine::as_future.
Mark error paths as `[[unlikely]]`.
Use `coroutine::return_exception_ptr` to avoid rethrowing
the final exception.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#22831
Before these changes, the script didn't update the listed pip packages
if they were already installed. If the latest version of Scylla started
using new features and required an updated Python driver, for example,
the developers (and possibly the user) were forced to update it manually.
In this commit, we modify the script so that it updates the installed
packages when run. This should make things easier for everyone.
Closesscylladb/scylladb#22912
Replace complex boolean expression:
```py
not driver_response_future.has_more_pages or not all_pages
```
with clearer equivalent:
```py
driver_response_future.has_more_pages and all_pages
```
The new expression is more intuitive as it directly checks for both
conditions (having more pages and wanting all pages) rather than using double
negation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22969
Before the limited voters feature, the "raft_ignore_nodes" test was
relying upon the fact that all nodes will become voters.
With the limited voters feature, the test needs to be adjusted to
ensure that we do not lose the majority of the cluster. This could
happen when there are 7 nodes, but only 5 of them are voters - then if
we kill 3 nodes randomly we might end up with only 2 voters left.
Therefore we need to ensure that we only stop the appropriate number of
voter nodes. So we need to determine which nodes became voters and which
ones are non-voters, and select the nodes to be stopped based on that.
That means with 7 nodes and 5 voters, we can stop up to 2 voter nodes,
but at least one of the stopped nodes must be a non-voter.
Fixes: scylladb/scylladb#22902
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969Closesscylladb/scylladb#22904
Now that we support suite subfolders, there is no
need to create an own suite for topology_tasks and topology_random_failures.
Closesscylladb/scylladb#22879
* https://github.com/scylladb/scylladb:
test.py: merge topology_tasks suite into topology_custom suite
test.py: merge topology_random_failures suite into topology_customs
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.
This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in bug free
manner.
Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also
performs a refactoring to populate all init data to system tables even
before group0 creation(via `raft_initialize_discovery_leader` function).
Now because `raft_initialize_discovery_leader` happens before the group
0 creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.
Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.
By moving `raft_initialize_discovery_leader` logic to happen before
starting group0 server, and not as group0 command post server start, we
also get rid of the potential problem of init group0 command not being
the 1st command on the server. Hence ensuring full integrity as expected
by programmer.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#21114Closesscylladb/scylladb#22484
* https://github.com/scylladb/scylladb:
storage_service: Remove the variable _manage_topology_change_kind_from_group0
storage_service: fix indentation after the previous commit
raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
service/raft: Refactor mutation writing helper functions.
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.
Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.
Fixes#22715
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#22922
Replace boost::range::find() calls with std::ranges::find(). This
change reduces external dependencies and modernizes the codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22942
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is about detecting table
was dropped and skipping to the next candidate, if any.
Fixes#21859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22933
This PR improves and refactors the test.topology.util new_test_keyspace generator
and adds a corresponding create_new_test_keyspace function to be used by most if not
all topology unit tests in order to standardize the way the tests create keyspaces
and to mitigate the python driver create keyspace retry issue: https://github.com/scylladb/python-driver/issues/317Fixes#22342Fixes#21905
Refs https://github.com/scylladb/scylla-enterprise/issues/5060
* No backport required, though may be desired to stabilize CI also in release branches.
Closesscylladb/scylladb#22399
* github.com:scylladb/scylladb:
test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace
test/repair: create_table_insert_data_for_repair: create keyspace with unique name
topology_tasks/test_tablet_tasks: use new_test_keyspace
topology_tasks/test_node_ops_tasks: use new_test_keyspace
topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace
topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace
topology_custom/test_view_build_status: use new_test_keyspace
topology_custom/test_truncate_with_tablets: use new_test_keyspace
topology_custom/test_topology_failure_recovery: use new_test_keyspace
topology_custom/test_tablets_removenode: use create_new_test_keyspace
topology_custom/test_tablets_migration: use new_test_keyspace
topology_custom/test_tablets_merge: use new_test_keyspace
topology_custom/test_tablets_intranode: use new_test_keyspace
topology_custom/test_tablets_cql: use new_test_keyspace
topology_custom/test_tablets2: use *new_test_keyspace
topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function
topology_custom/test_tablets: use new_test_keyspace
topology_custom/test_table_desc_read_barrier: use new_test_keyspace
topology_custom/test_shutdown_hang: use new_test_keyspace
topology_custom/test_select_from_mutation_fragments: use new_test_keyspace
topology_custom/test_rpc_compression: use new_test_keyspace
topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace
topology_custom/test_raft_snapshot_truncation: use create_new_test_keyspace
topology_custom/test_raft_no_quorum: use new_test_keyspace
topology_custom/test_raft_fix_broken_snapshot: use new_test_keyspace
topology_custom/test_query_rebounce: use new_test_keyspace
topology_custom/test_not_enough_token_owners: use new_test_keyspace
topology_custom/test_node_shutdown_waits_for_pending_requests: use new_test_keyspace
topology_custom/test_node_isolation: use create_new_test_keyspace
topology_custom/test_mv_topology_change: use new_test_keyspace
topology_custom/test_mv_tablets_replace: use new_test_keyspace
topology_custom/test_mv_tablets_empty_ip: use new_test_keyspace
topology_custom/test_mv_tablets: use new_test_keyspace
topology_custom/test_mv_read_concurrency: use new_test_keyspace
topology_custom/test_mv_fail_building: use new_test_keyspace
topology_custom/test_mv_delete_partitions: use new_test_keyspace
topology_custom/test_mv_building: use new_test_keyspace
topology_custom/test_mv_backlog: use new_test_keyspace
topology_custom/test_mv_admission_control: use new_test_keyspace
topology_custom/test_major_compaction: use new_test_keyspace
topology_custom/test_maintenance_mode: use new_test_keyspace
topology_custom/test_lwt_semaphore: use new_test_keyspace
topology_custom/test_ip_mappings: use new_test_keyspace
topology_custom/test_hints: use new_test_keyspace
topology_custom/test_group0_schema_versioning: use new_test_keyspace
topology_custom/test_data_resurrection_after_cleanup: use new_test_keyspace
topology_custom/test_read_repair_with_conflicting_hash_keys: use new_test_keyspace
topology_custom/test_read_repair: use new_test_keyspace
topology_custom/test_compacting_reader_tombstone_gc_with_data_in_memtable: use new_test_keyspace
topology_custom/test_commitlog_segment_data_resurrection: use new_test_keyspace
topology_custom/test_change_replication_factor_1_to_0: use new_test_keyspace
topology/test_tls: test_upgrade_to_ssl: use new_test_keyspace
test/topology/util: new_test_keyspace: drop keyspace only on success
test/topology/util: refactor new_test_keyspace
test/topology/util: CREATE KEYSPACE IF NOT EXISTS
test/topology/util: new_test_keyspace: accept ManagerClient
Replace boost::accumulate() calls with std::ranges::fold_left(). This
change reduces external dependencies and modernizes the codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22924
Replace boost::find() calls with std::ranges::find() and std::ranges::contains()
to leverage modern C++ standard library features. This change reduces external
dependencies and modernizes the codebase.
The following changes were made:
- Replaced boost::find() with std::ranges::find() where index/iterator is needed
- Used std::ranges::contains() for simple element presence checks
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22920
wrapped writer in seastar::futurize_invoke to make sure that the close() for the mutation_reader can be executed before destruction.
Fixes#22790Closesscylladb/scylladb#22812
The lambda which dumps the diagnostics for each semaphore, is static.
Considering that said lambda captures a local (writeln) by reference, this
is wrong on two levels:
* The writeln captured on the shard which happens to initialize this
static, will be used on all shards.
* The writeln captured on the first dump, will be used on later dumps,
possibly triggering a segfault.
Drop the `static` to make the lambda local and resolve this problem.
Fixes: scylladb/scylladb#22756Closesscylladb/scylladb#22776
Recently, when running Alternator tests we get hundreds of warnings like
the following from basically all test files:
/usr/lib/python3.12/site-packages/botocore/crt/auth.py:59:
DeprecationWarning: datetime.datetime.utcnow() is deprecated and
scheduled for removal in a future version. Use timezone-aware objects
to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
/usr/local/lib/python3.12/site-packages/pytest_elk_reporter.py:299:
DeprecationWarning: datetime.datetime.utcnow() is deprecated and
scheduled for removal in a future version. Use timezone-aware objects
to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
These warnings all come from two libraries that we use in the tests -
botocore is used by Alternator tests, and elk reporter is a plugin that
we don't actually use, but it is installed by dtest and we often see
it in our runs as well. These warnings have zero interest to us - not
only do we not care if botocore uses some deprecated Python APIs and
will need to be updated in the future, all these warnings are hiding
*real* warnings about deprecated things we actually use in our own
test code.
The patch modifies test/pytest.ini (used by all our Python tests,
including but not limited to Alternator tests) to ignore deprecation
warnings from *inside* these two libraries, botocore and elk_reporter.
After this patch, test/alternator/run finishes without any warnings
at all. test/cqlpy does still have a few warnings left, which earlier
were hidden by the thousands of spammy warning eliminated in this patch.
We fix one of these warnings in this patch:
ResultSet indexing support will be removed in 4.0.
Consider using ResultSet.one()
by doing exactly what the warning recommended.
Some deprecation warnings in test/cqlpy remain in calls to
get_query_trace(). The "blame" for these warning is misplaced - this
function is part of the cassandra driver, but Python seems to think it's
part of our test code so I can't avoid them with the pytest.ini trick,
I'm not sure why. So I don't know yet how to eliminate these last warnings.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22881
Modify CMake configuration to only apply "-Xclang" options when building
with the Clang compiler. These options are Clang-specific and can cause
errors or warnings when used with other compilers like g++.
This change:
- Adds compiler detection to conditionally apply Clang-specific flags
- Prevents build failures when using non-Clang compilers
Previously, the build system would apply these flags universally, which
could lead to compilation errors with other compilers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22899
Use std::to_underlying() when comparing unsigned types with enumeration values
to fix type mismatch warnings in GCC-14. This specifically addresses an issue in
utils/advanced_rpc_compressor.hh where comparing a uint8_t with 0 triggered a
'-Werror=type-limits' warning:
```
error: comparison is always false due to limited range of data type [-Werror=type-limits]
if (x < 0 || x >= static_cast<underlying>(type::COUNT))
~~^~~
```
Using std::to_underlying() provides clearer type semantics and avoids these kind
of comparison warnings. This change improves code readability while maintaining
the same behavior.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22898