The limit is enforced by controlling the average per-shard tablet replica
count in a given DC, which is in turn controlled by the per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
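
A minimal sketch of the scaling rule described above, not the actual
load balancer code. All function names are hypothetical, and the
assumption that a DC asks for limit / average when it is over the limit
is an illustration only:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Scale factor one DC would ask for (assumed formula): 1.0 while the
// average per-shard replica count is within the limit, otherwise
// limit / average.
double dc_scale_factor(double avg_replicas_per_shard, double limit) {
    if (avg_replicas_per_shard <= limit) {
        return 1.0;
    }
    return limit / avg_replicas_per_shard;
}

// A table with replicas in several DCs takes the lowest requested factor.
double table_scale_factor(const std::vector<double>& dc_factors) {
    if (dc_factors.empty()) {
        return 1.0;
    }
    return *std::min_element(dc_factors.begin(), dc_factors.end());
}

// Apply the factor, then round up to the nearest power of 2. The rounding
// is what can overshoot the goal, by less than a factor of 2.
std::size_t scaled_tablet_count(std::size_t tablet_count, double factor) {
    auto target = static_cast<std::size_t>(std::ceil(tablet_count * factor));
    std::size_t rounded = 1;
    while (rounded < target) {
        rounded <<= 1;
    }
    return rounded;
}
```

With a limit of 100 and averages of 200 and 400 in two DCs, the factors
are 0.5 and 0.25, the table uses 0.25, and 64 tablets become 16.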
The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior when choosing the initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.
We want to avoid over-allocation of tablets when a table is created,
which would then be reduced by the scheduler's scaling logic. Not just
to avoid wasteful migrations post table creation, but to respect the
per-shard goal. To respect the per-shard goal, the algorithm will no
longer be as simple as looking at hints, and we want to share the
algorithm between the scheduler and initial tablet allocator. So
invoke the scheduler to get the tablet count when a table is created.
This is in preparation for using the sizing plan during table creation
where we never have size stats, and hints are the only determining
factor for target tablet count.
Resize plan making will now happen in two stages:
1) Determine desired tablet counts per table (sizing plan)
2) Schedule resize decisions
We need an intermediate step in the resize plan making, which gives us
the planned tablet counts, so that we can plug this part of the
algorithm into initial tablet allocation on table construction.
We want decisions made by the scheduler to be consistent with
decisions made on table creation. We want to avoid over-allocation of
tablets when a table is created, which would then be reduced by the
scheduler. Not just to avoid wasteful migrations post table creation,
but to respect the per-shard goal. To respect the per-shard goal, the
algorithm will no longer be as simple as looking at hints, and we want
to share the algorithm between the scheduler and initial tablet
allocator.
Also, this sizing plan will be later plugged into a virtual table for
observability.
Logic is preserved since target tablet size is constant for all
tables.
Dropping d.target_max_tablet_size() will allow us to move it
to the load_balancer scope.
Currently the scale is applied post rounding up of tablet count so
that tablet count per shard is at least 1. In order to be able to use
the scale to increase tablet count per shard, we need to apply it
prior to division by RF, otherwise we will overshoot per-shard tablet
replica count.
Example:
4 nodes, -c1, rf=3, initial_tablets_scale=10
Before: initial_tablet_count=20, tablets-per-shard=15
After: initial_tablet_count=14, tablets-per-shard=10.5
This will result in new tables having at least 10 tablet replicas per
shard by default.
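
The numbers in the example can be reproduced with the sketch below. It
is an illustration of the arithmetic only: the function and struct
names are hypothetical, and any later rounding of the tablet count
(such as to a power of 2) is left out.

```cpp
#include <cstddef>

struct sizing {
    std::size_t tablet_count;
    double replicas_per_shard;
};

// Before: round the tablet count up so that every shard gets at least one
// replica, and only then multiply by the scale.
sizing scale_after_rounding(std::size_t shards, std::size_t rf, std::size_t scale) {
    std::size_t base = (shards + rf - 1) / rf;  // ceil(shards / rf)
    std::size_t count = base * scale;
    return {count, static_cast<double>(count * rf) / shards};
}

// After: multiply by the scale first, then divide by RF and round up.
sizing scale_before_division(std::size_t shards, std::size_t rf, std::size_t scale) {
    std::size_t count = (shards * scale + rf - 1) / rf;  // ceil(shards * scale / rf)
    return {count, static_cast<double>(count * rf) / shards};
}
```

For 4 shards (4 nodes with -c1), rf=3 and scale=10, the first function
gives 20 tablets and 15 replicas per shard, the second gives 14 tablets
and 10.5 replicas per shard.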
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes #21967
In some tests, we explicitly set the initial scale to 1 as some of the
existing tests assume 1 compaction group per shard.
test.py uses a lower default. Having many tablets per shard slows down
certain topology operations like decommission/replace/removenode,
where the running time is proportional to tablet count, not data size,
because constant cost (latency) of migration dominates. This latency
is due to group0 operations and barriers. This is especially
pronounced in debug mode. Scheduler allows at most 2 migrations per
shard, so this latency becomes a determining factor for decommission
speed.
To avoid this problem in tests, we use a lower default for tablet count per
shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we
could compensate by allowing more concurrency when migrating small
tablets, but there's no infrastructure for that yet.
I observed that with 10 tablets per shard, debug-mode
topology_custom.mv/test_mv_topology_change starts to time-out during
removenode (30 s).
Wrap the writer in seastar::futurize_invoke to make sure that close() on the mutation_reader can be executed before destruction.
Fixes #22790
Closes scylladb/scylladb#22812
It was only needed there for create_pending_deletion_log() method to get
private "_storage" from sstable. Now it's all gone and friendship can be
broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... to where it belongs -- to the filesystem storage driver itself.
Continuation of the previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question walks the list of sstables and accumulates
sstables' prefixes into a set on pending_delete_result object. The set
in question is not used at all in this method and is in fact alien to it
-- the p.d._result object is used by the filesystem storage driver as
atomic deletion prepare/commit transparent context.
That said, move the whole pending_delete_result to where it belongs and
relax the create_pending_deletion_log() to only return the log
directory path string.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current code takes the sstable prefix() (e.g. the /foo/bar string), then
trims the basedir from its front (e.g. the /foo/ string) and then writes
the remainder, a slash and the TOC component name (e.g. the xxx-TOC.txt
string). The final result is the "bar/xxx-TOC.txt" string.
Taking into account that sstable.toc_filename() renders into
sstable.prefix + slash + component-name, the above result can be
achieved by trimming the basedir directory from toc_filename().
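
The equivalence can be checked with plain strings. This is a sketch
only: relative_to() is a hypothetical helper, not a function from the
sstable code, and std::string stands in for whatever path type the
real code uses.

```cpp
#include <string>

// Hypothetical helper: strip basedir from the front of path if it is there,
// otherwise return path unchanged.
std::string relative_to(const std::string& basedir, const std::string& path) {
    if (path.compare(0, basedir.size(), basedir) == 0) {
        return path.substr(basedir.size());
    }
    return path;
}

// Old approach: trim the prefix, then append a slash and the component name.
std::string toc_entry_old(const std::string& basedir, const std::string& prefix,
                          const std::string& component) {
    return relative_to(basedir, prefix) + "/" + component;
}

// New approach: trim the already assembled TOC file name.
std::string toc_entry_new(const std::string& basedir, const std::string& toc_filename) {
    return relative_to(basedir, toc_filename);
}
```

With basedir "/foo/", prefix "/foo/bar" and component "xxx-TOC.txt",
both functions return "bar/xxx-TOC.txt".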
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lambda which dumps the diagnostics for each semaphore is static.
Considering that said lambda captures a local (writeln) by reference, this
is wrong on two levels:
* The writeln captured on the shard which happens to initialize this
static, will be used on all shards.
* The writeln captured on the first dump, will be used on later dumps,
possibly triggering a segfault.
Drop the `static` to make the lambda local and resolve this problem.
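
A reduced illustration of the pattern after the fix. The names are
hypothetical and the real code deals with semaphores and shards, not
strings; the point is only that the inner lambda is rebuilt on every
call, so it always refers to the current call's writeln:

```cpp
#include <sstream>
#include <string>
#include <vector>

std::string dump_diagnostics(const std::vector<std::string>& semaphores) {
    std::ostringstream os;
    auto writeln = [&os](const std::string& line) { os << line << '\n'; };

    // Had this been declared `static const auto`, it would be initialized
    // only once, on the first call, and keep a reference to that call's
    // writeln. Later calls would then use a dangling reference.
    const auto dump_one = [&writeln](const std::string& name) {
        writeln("semaphore " + name);
    };

    for (const auto& name : semaphores) {
        dump_one(name);
    }
    return os.str();
}
```

Calling dump_diagnostics() repeatedly is safe here because nothing
captured by reference outlives the call.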
Fixes: scylladb/scylladb#22756
Closes scylladb/scylladb#22776
Recently, when running Alternator tests we get hundreds of warnings like
the following from basically all test files:
/usr/lib/python3.12/site-packages/botocore/crt/auth.py:59:
DeprecationWarning: datetime.datetime.utcnow() is deprecated and
scheduled for removal in a future version. Use timezone-aware objects
to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
/usr/local/lib/python3.12/site-packages/pytest_elk_reporter.py:299:
DeprecationWarning: datetime.datetime.utcnow() is deprecated and
scheduled for removal in a future version. Use timezone-aware objects
to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
These warnings all come from two libraries that we use in the tests -
botocore is used by Alternator tests, and elk reporter is a plugin that
we don't actually use, but it is installed by dtest and we often see
it in our runs as well. These warnings have zero interest to us - not
only do we not care if botocore uses some deprecated Python APIs and
will need to be updated in the future, all these warnings are hiding
*real* warnings about deprecated things we actually use in our own
test code.
The patch modifies test/pytest.ini (used by all our Python tests,
including but not limited to Alternator tests) to ignore deprecation
warnings from *inside* these two libraries, botocore and elk_reporter.
After this patch, test/alternator/run finishes without any warnings
at all. test/cqlpy does still have a few warnings left, which earlier
were hidden by the thousands of spammy warnings eliminated in this patch.
We fix one of these warnings in this patch:
ResultSet indexing support will be removed in 4.0.
Consider using ResultSet.one()
by doing exactly what the warning recommended.
Some deprecation warnings in test/cqlpy remain in calls to
get_query_trace(). The "blame" for these warnings is misplaced - this
function is part of the cassandra driver, but Python seems to think it's
part of our test code so I can't avoid them with the pytest.ini trick,
I'm not sure why. So I don't know yet how to eliminate these last warnings.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#22881
Modify CMake configuration to only apply "-Xclang" options when building
with the Clang compiler. These options are Clang-specific and can cause
errors or warnings when used with other compilers like g++.
This change:
- Adds compiler detection to conditionally apply Clang-specific flags
- Prevents build failures when using non-Clang compilers
Previously, the build system would apply these flags universally, which
could lead to compilation errors with other compilers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22899
Use std::to_underlying() when comparing unsigned types with enumeration values
to fix type mismatch warnings in GCC-14. This specifically addresses an issue in
utils/advanced_rpc_compressor.hh where comparing a uint8_t with 0 triggered a
'-Werror=type-limits' warning:
```
error: comparison is always false due to limited range of data type [-Werror=type-limits]
if (x < 0 || x >= static_cast<underlying>(type::COUNT))
~~^~~
```
Using std::to_underlying() provides clearer type semantics and avoids this kind
of comparison warning. This change improves code readability while maintaining
the same behavior.
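
A sketch of a range check written with std::to_underlying(). The
enumeration and the function are hypothetical and are not the code from
utils/advanced_rpc_compressor.hh. std::to_underlying() is C++23, so the
sketch carries an equivalent fallback for older standards:

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

#ifdef __cpp_lib_to_underlying
using std::to_underlying;  // available since C++23
#else
// Fallback with the same behavior as std::to_underlying.
template <typename E>
constexpr std::underlying_type_t<E> to_underlying(E e) noexcept {
    return static_cast<std::underlying_type_t<E>>(e);
}
#endif

// Hypothetical enumeration; COUNT is a sentinel one past the last real value.
enum class algo : std::uint8_t { raw, lz4, zstd, COUNT };

// x is unsigned, so only the upper bound needs checking. An `x < 0` test
// would always be false, which is what -Wtype-limits complains about.
constexpr bool is_valid_algo(std::uint8_t x) {
    return x < to_underlying(algo::COUNT);
}
```

Here is_valid_algo() accepts 0 through 2 and rejects 3 and above.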
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22898