Commit Graph

46837 Commits

Author SHA1 Message Date
Tomasz Grabiec
f1bda8d4c1 tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.

There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.

If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.

If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.

The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
94b5165ac7 tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.

We want to avoid over-allocation of tablets when table is created,
which would then be reduced by the scheduler's scaling logic. Not just
to avoid wasteful migrations post table creation, but to respect the
per-shard goal. To respect the per-shard goal, the algorithm will no
longer be as simple as looking at hints, and we want to share the
algorithm between the scheduler and initial tablet allocator. So
invoke the scheduler to get the tablet count when table is created.
2025-02-19 14:40:07 +01:00
Tomasz Grabiec
dd68c1e526 tablets: load_balancer: Determine desired count from size separately from count from options
For debugging purposes. Later we will want to know which rule
determined the count.
2025-02-19 14:40:07 +01:00
Tomasz Grabiec
e4c5e2ab55 tablets: load_balancer: Determine resize decision from target tablet count
The flow is simpler this way, since the decision cannot now be
mismatched with target tablet count.
2025-02-19 14:40:07 +01:00
Tomasz Grabiec
35192e2d6f tablets: load_balancer: Allow splits even if table stats not available
This is in preparation for using the sizing plan during table creation
where we never have size stats, and hints are the only determining
factor for target tablet count.
2025-02-19 14:40:07 +01:00
Tomasz Grabiec
d3ffea77e6 tablets: load_balancer: Extract make_sizing_plan()
Resize plan making will now happen in two stages:

  1) Determine desired tablet counts per table (sizing plan)
  2) Schedule resize decisions

We need intermediate step in the resize plan making, which gives us
the planned tablet counts, so that we can plug this part of the
algorithm into initial tablet allocation on table construction.

We want decisisons made by the scheduler to be consistent with
decisions made on table creation. We want to avoid over-allocation of
tablets when table is created, which would then be reduced by the
scheduler. Not just to avoid wasteful migrations post table creation,
but to respect the per-shard goal. To respect the per-shard goal, the
algorithm will no longer be as simple as looking at hints, and we want
to share the algorithm between the scheduler and initial tablet
allocator.

Also, this sizing plan will be later plugged into a virtual table for
observability.
2025-02-19 14:40:06 +01:00
Tomasz Grabiec
33db0d4fea tablets: Add formatter for resize_decision::way_type 2025-02-19 14:39:40 +01:00
Tomasz Grabiec
b7e5919fdd tablets: load_balancer: Simplify resize_urgency_cmp()
Logic is preserved since target tablet size is constant for all
tables.

Dropping d.target_max_tablet_size() will allow us to move it
to the load_balancer scope.
2025-02-19 14:39:40 +01:00
Tomasz Grabiec
997007a2df tablets: load_balancer: Keep config items as instance members
It fits preexisting pattern for other config items, and makes the code
less cluttered because we don't have to carry config items across
calls.
2025-02-19 14:39:39 +01:00
Tomasz Grabiec
ce959818a3 locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology() 2025-02-19 14:38:50 +01:00
Tomasz Grabiec
f043c83ba5 tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
Currently the scale is applied post rounding up of tablet count so
that tablet count per shard is at least 1. In order to be able to use
the scale to increase tablet count per shard, we need to apply it
prior to division by RF, otherwise we will overshoot per-shard tablet
replica count.

Example:

  4 nodes, -c1, rf=3, initial_tablets_scale=10

  Before: initial_tablet_count=20, tablet-per-shard=15
  After: initial_tablet_count=14, tablets-per-shard=10.5
2025-02-19 14:38:50 +01:00
Tomasz Grabiec
2463e524ed tablets: Set default initial tablet count scale to 10
This will result in new tables having at least 10 tablet replicas per
shard by default.

We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.

Fixes #21967

In some tests, we explicity set the initial scale to 1 as some of the
existing tests assume 1 compaction group per shard.

test.py uses a lower default. Having many tablets per shard slows down
certain topology operations like decommission/replace/removenode,
where the running time is proportional to tablet count, not data size,
because constant cost (latency) of migration dominates. This latency
is due to group0 operations and barriers. This is especially
pronounced in debug mode. Scheduler allows at most 2 migrations per
shard, so this latency becomes a determining factor for decommission
speed.

To avoid this problem in tests, we use lower default for tablet count per
shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we
could compensate by allowing more concurrency when migrating small
tablets, but there's no infrastructure for that yet.

I observed that with 10 tablets per shard, debug-mode
topology_custom.mv/test_mv_topology_change starts to time-out during
removenode (30 s).
2025-02-19 14:38:50 +01:00
Tomasz Grabiec
8eedb551b5 tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
To insert preemption points later.
2025-02-19 14:38:49 +01:00
Tomasz Grabiec
eef18d879c tablets: load_balancer: Extract get_schema_and_rs()
For better readability.
2025-02-19 14:38:49 +01:00
Tomasz Grabiec
9d600dd783 tablets: load_balancer: Drop test_mode
tablets_test is now creating proper schema in the database, so
test_mode is no longer needed.
2025-02-19 14:38:48 +01:00
yangpeiyu2_yewu
0de232934a mutation_writer/multishard_writer.cc: wrap writer into futurize_invoke
wrapped writer in seastar::futurize_invoke to make sure that the close() for the mutation_reader can be executed before destruction.

Fixes #22790

Closes scylladb/scylladb#22812
2025-02-19 13:00:45 +02:00
Pavel Emelyanov
d79eec2e76 sstable: Unfriend sstable_directory class
It was only needed there for create_pending_deletion_log() method to get
private "_storage" from sstable. Now it's all gone and friendship can be
broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
96a867c869 sstable_directory: Move sstable_directory::pending_delete_result
... to where it belongs -- to the filesystem storage driver itself.
Continuation of the previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
f6de6d6887 sstable_directory: Calculate prefixes outside of create_pending_deletion_log()
The method in question walks the list of sstables and accumulates
sstables' prefixes into a set on pending_delete_result object. The set
in question is not used at all in this method and is in fact alien to it
-- the p.d._result object is used by the filesystem storage driver as
atomic deletion prepare/commit transparent context.

Said that, move the whole pending_delete_result to where it belongs and
relax the create_pending_deletion_log() to only return the log
directory path string.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
b0c1a77528 sstable_directory: Introduce local pending_delete_log variable
This is simply to reduce the churn in the next patch, nothing special
here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
5b92c4549e sstable_directory: Relax toc file dumping to deletion log
The current code takes sstable prefix() (e.g. the /foo/bar string), then
trims from its fron the basedir (e.g. the /foo/ string) and then writes
the remainder, a slash and TOC component name (e.g. the xxx-TOC.txt
string). The final result is "bar/xxx-TOC.txt" string.

The taking into account sstable.toc_filename() renders into
sstable.prefix + \slash + component-name, the above result can be
achieved by trimming basedir directory from toc_filename().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Botond Dénes
820f196a49 replica/database: setup_scylla_memory_diagnostics_producer() un-static semaphore dump lambda
The lambda which dumps the diagnostics for each semaphore, is static.
Considering that said lambda captures a local (writeln) by reference, this
is wrong on two levels:
* The writeln captured on the shard which happens to initialize this
  static, will be used on all shards.
* The writeln captured on the first dump, will be used on later dumps,
  possibly triggering a segfault.

Drop the `static` to make the lambda local and resolve this problem.

Fixes: scylladb/scylladb#22756

Closes scylladb/scylladb#22776
2025-02-19 12:22:16 +03:00
Nadav Har'El
a7bf36831c test: remove spammy deprecation warnings
Recently, when running Alternator tests we get hundreds of warnings like
the following from basically all test files:

    /usr/lib/python3.12/site-packages/botocore/crt/auth.py:59:
    DeprecationWarning: datetime.datetime.utcnow() is deprecated and
    scheduled for removal in a future version. Use timezone-aware objects
    to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).

    /usr/local/lib/python3.12/site-packages/pytest_elk_reporter.py:299:
    DeprecationWarning: datetime.datetime.utcnow() is deprecated and
    scheduled for removal in a future version. Use timezone-aware objects
    to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).

These warnings all come from two libraries that we use in the tests -
botocore is used by Alternator tests, and elk reporter is a plugin that
we don't actually use, but it is installed by dtest and we often see
it in our runs as well. These warnings have zero interest to us - not
only do we not care if botocore uses some deprecated Python APIs and
will need to be updated in the future, all these warnings are hiding
*real* warnings about deprecated things we actually use in our own
test code.

The patch modifies test/pytest.ini (used by all our Python tests,
including but not limited to Alternator tests) to ignore deprecation
warnings from *inside* these two libraries, botocore and elk_reporter.

After this patch, test/alternator/run finishes without any warnings
at all. test/cqlpy does still have a few warnings left, which earlier
were hidden by the thousands of spammy warning eliminated in this patch.

We fix one of these warnings in this patch:

    ResultSet indexing support will be removed in 4.0.
    Consider using ResultSet.one()

by doing exactly what the warning recommended.

Some deprecation warnings in test/cqlpy remain in calls to
get_query_trace(). The "blame" for these warning is misplaced - this
function is part of the cassandra driver, but Python seems to think it's
part of our test code so I can't avoid them with the pytest.ini trick,
I'm not sure why. So I don't know yet how to eliminate these last warnings.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22881
2025-02-19 12:15:51 +03:00
Avi Kivity
45b2026209 service: raft: drop unused dependency from group0_state_machine_merger.hh
Reduces dependency load.

Closes scylladb/scylladb#22781
2025-02-19 12:14:58 +03:00
Kefu Chai
d1f117620a build: restrict -Xclang options to Clang compiler only
Modify CMake configuration to only apply "-Xclang" options when building
with the Clang compiler. These options are Clang-specific and can cause
errors or warnings when used with other compilers like g++.

This change:
- Adds compiler detection to conditionally apply Clang-specific flags
- Prevents build failures when using non-Clang compilers

Previously, the build system would apply these flags universally, which
could lead to compilation errors with other compilers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22899
2025-02-19 12:13:35 +03:00
Kefu Chai
d384b0a63e utils: use std::to_underlying() when appropriate
Use std::to_underlying() when comparing unsigned types with enumeration values
to fix type mismatch warnings in GCC-14. This specifically addresses an issue in
utils/advanced_rpc_compressor.hh where comparing a uint8_t with 0 triggered a
'-Werror=type-limits' warning:
```
    error: comparison is always false due to limited range of data type [-Werror=type-limits]
    if (x < 0 || x >= static_cast<underlying>(type::COUNT))
        ~~^~~
```
Using std::to_underlying() provides clearer type semantics and avoids these kind
of comparison warnings. This change improves code readability while maintaining
the same behavior.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22898
2025-02-19 12:12:28 +03:00
Benny Halevy
cc281ff88d test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace
and return the keyspace unique name to the caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 09:35:33 +02:00
Aleksandra Martyniuk
f8e4198e72 service: tasks: hold token_metadata_ptr in tablet_virtual_task
Hold token_metadata_ptr in tablet_virtual_task methods that iterate
over tablets, to keep the tablet_map alive.

Fixes: https://github.com/scylladb/scylladb/issues/22316.

Closes scylladb/scylladb#22740
2025-02-19 09:33:53 +02:00
Dusan Malusev
4e6ea232d2 docs: add instruction for installing cassandra-stress
Signed-off-by: Dusan Malusev <dusan.malusev@scylladb.com>

Closes scylladb/scylladb#21723
2025-02-19 09:25:16 +02:00
Benny Halevy
cbe79b20f7 test/repair: create_table_insert_data_for_repair: create keyspace with unique name
and return it to the caller

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:56:07 +02:00
Benny Halevy
9829b1594f topology_tasks/test_tablet_tasks: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:52:59 +02:00
Benny Halevy
12f85ce57c topology_tasks/test_node_ops_tasks: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:52:59 +02:00
Benny Halevy
0564e95c51 topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:52:59 +02:00
Benny Halevy
46b1850f0c topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:52:59 +02:00
Benny Halevy
b810791fbb topology_custom/test_view_build_status: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:52:59 +02:00
Benny Halevy
2d4af01281 topology_custom/test_truncate_with_tablets: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:52:58 +02:00
Benny Halevy
16ef78075c topology_custom/test_topology_failure_recovery: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
96d327fb83 topology_custom/test_tablets_removenode: use create_new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
f30e4c6917 topology_custom/test_tablets_migration: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
20f7eda16e topology_custom/test_tablets_merge: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
5ff3153912 topology_custom/test_tablets_intranode: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
e59aca66bf topology_custom/test_tablets_cql: use new_test_keyspace
And create_new_test_keyspace when we need drop
to be explicit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
6b37d04aa9 topology_custom/test_tablets2: use *new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
0b88ea9798 topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
649e68c6db topology_custom/test_tablets: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
005ceb77d3 topology_custom/test_table_desc_read_barrier: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
50a8f5c1c0 topology_custom/test_shutdown_hang: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
4fd6c2d24e topology_custom/test_select_from_mutation_fragments: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
72bc4016e7 topology_custom/test_rpc_compression: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00
Benny Halevy
47326d01b7 topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-19 08:43:35 +02:00