Compare commits

2045 Commits

Author SHA1 Message Date
Dani Tweig
42629175b5 Update urgent_issue_reminder.yml - run daily
The action will run daily, alerting about urgent issues not touched in the last 7 days.
2025-08-20 14:49:33 +03:00
Botond Dénes
d20304fdf8 Merge 'test.py: dtest: port next_gating tests from commitlog_test.py' from Evgeniy Naydanov
Copy `commitlog_test.py` from scylla-dtest test suite and make it work with `test.py`

As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `skip`, `skip_if`, and `xfail` markers.

test.py uses `commitlog` directory instead of dtest's `commitlogs`.

Also, add `commitlog_segment_size_in_mb: 32` option to test_stop_failure_policy to make _provoke_commitlog_failure
work.

Tests `test_total_space_limit_of_commitlog_with_large_limit` and `test_total_space_limit_of_commitlog_with_medium_limit` use too much disk space and take too long to run.  Keep them in scylla-dtest for now.

Enable the test in `suite.yaml` (run in dev mode only.)

Additional modifications to test.py/dtest shim code:
- add ScyllaCluster.flush() method
- add ScyllaNode.stress() method
- add tools/files.py::corrupt_file() function
- add tools/data.py::run_query_with_data_processing() function
- copy some assertions from dtest

Also add the missing mode restriction for the auth_test.py file.

Closes scylladb/scylladb#24946

* github.com:scylladb/scylladb:
  test.py: dtest: remove slow and greedy tests from commitlog_test.py
  test.py: dtest: make commitlog_test.py run using test.py
  test.py: dtest: add ScyllaCluster.flush() method
  test.py: dtest: add ScyllaNode.stress() method
  test.py: dtest: add tools/data.py::run_query_with_data_processing() function
  test.py: dtest: add tools/files.py::corrupt_file() function
  test.py: dtest: copy some assertions from dtest
  test.py: dtest: copy unmodified commitlog_test.py
2025-08-19 17:25:07 +03:00
Michał Chojnowski
c1b513048c sstables/types.hh: fix fmt::formatter<sstables::deletion_time>
Obvious typo.

Fixes scylladb/scylladb#25556

Closes scylladb/scylladb#25557
2025-08-19 17:21:18 +03:00
Botond Dénes
66db95c048 Merge 'Preserve PyKMIP logs from failed KMIP tests' from Nikos Dragazis
This PR extends the `tmpdir` class with an option to preserve the directory if the destructor is called during stack unwinding. It also uses this feature in KMIP tests, where the tmpdir contains PyKMIP server logs, which may be useful when diagnosing test failures.
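
The idea can be approximated in Python with a context manager. This is only an illustrative analogue, not the C++ `tmpdir` class from `test/lib`; the class name `PreservingTmpdir` and the `preserve_on_exception` parameter are hypothetical.

```python
import os
import shutil
import tempfile


class PreservingTmpdir:
    """Temporary directory that is kept on disk if the block raises."""

    def __init__(self, preserve_on_exception=False):
        self.preserve_on_exception = preserve_on_exception
        self.path = tempfile.mkdtemp()

    def __enter__(self):
        return self.path

    def __exit__(self, exc_type, exc, tb):
        # Rough analogue of "destructor called during stack unwinding":
        # exc_type is set only when the block exits via an exception.
        if exc_type is not None and self.preserve_on_exception:
            return False  # keep the directory (and any logs in it)
        shutil.rmtree(self.path, ignore_errors=True)
        return False  # never swallow the exception
```

A failing test would leave the directory behind for inspection, while a passing test cleans up as usual.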

Fixes #25339.

Not so important to be backported.

Closes scylladb/scylladb#25367

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Preserve tmpdir from failing KMIP tests
  test/lib: Add option to preserve tmpdir on exception
2025-08-19 13:17:29 +03:00
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field is set to its default value of 0. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair
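
A minimal Python sketch of how these three variables could fit together. The `SSTable` class and `is_repaired()` helper are hypothetical names for illustration, and the explicit exclusion of 0 is an assumption derived from the "default value 0" remark above, not a statement about the actual C++ code.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID


@dataclass
class SSTable:
    repaired_at: int = 0                   # on-disk field; 0 means "never repaired"
    being_repaired: Optional[UUID] = None  # in-memory field; repair that selected it


def is_repaired(sst: SSTable, sstables_repaired_at: int) -> bool:
    # Assumption: 0 is excluded explicitly so that a never-repaired sstable
    # is not treated as repaired when sstables_repaired_at is also 0.
    return sst.repaired_at != 0 and sst.repaired_at <= sstables_repaired_at
```

With `sstables_repaired_at == 5`, an sstable stamped 3 or 5 counts as repaired, while one stamped 6 or 0 does not.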

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Dawid Pawlik
50eeb11c84 .gitignore: add rust target
When using automatic Rust build tools in an IDE,
the files generated in the `rust/target/` directory
were treated by git as unstaged changes.

After the change, the generated files will not
pollute the git changes interface.

Closes scylladb/scylladb#25389
2025-08-19 13:09:18 +03:00
Dawid Mędrek
6a71461e53 treewide: Fix spelling errors
The errors were spotted by our GitHub Actions.

Closes scylladb/scylladb#24822
2025-08-19 13:07:43 +03:00
libo2_yewu
fa84e20b7a scripts/coverage.py: correct the coverage report path
The `path/name` directory does not exist and needs to be created first.

Signed-off-by: libo-sober <libo_sober@163.com>

Closes scylladb/scylladb#25480
2025-08-19 13:01:49 +03:00
Avi Kivity
41475858aa storage_proxy: endpoint_filter(): fix rack count confusion
endpoint_filter() is used by batchlog to select nodes to replicate
to.

It contains an unordered_multimap data structure that maps rack names
to nodes.

It misuses std::unordered_map::bucket_count() to count the number of
racks. While values that share a key in a multimap will definitely
be in the same bucket, it's possible for values that don't share a
key to share a bucket. Therefore bucket_count() undercounts the
number of racks.

Fix this by using a more accurate data structure: a map of a set.

The patch changes validated.bucket_count() to validated.size()
and validated.size() to a new variable nr_validated.

The patch does cause an extra two allocations per rack (one for the
unordered_map node, one for the unordered_set bucket vector), but
this is only used for logged batches, so it is amortized over all
the mutations in the logged batch.
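
The shape of the new data structure can be sketched in Python. This is an illustration only, not the C++ code in storage_proxy; the function name `group_by_rack` and the input format are assumptions, while `validated` and `nr_validated` mirror the names used above.

```python
from collections import defaultdict


def group_by_rack(endpoints):
    """endpoints: iterable of (rack, node) pairs."""
    validated = defaultdict(set)  # rack name -> set of nodes ("a map of a set")
    nr_validated = 0              # total number of nodes, tracked separately
    for rack, node in endpoints:
        if node not in validated[rack]:
            validated[rack].add(node)
            nr_validated += 1
    return validated, nr_validated
```

Here `len(validated)` is the exact number of racks, independent of how the hash table lays out its buckets.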

Closes scylladb/scylladb#25493
2025-08-19 11:58:39 +03:00
Dawid Mędrek
2227eb48bb test/cqlpy/test_cdc.py: Add validation test for re-attached log tables
When the user disables CDC on a table, the CDC log table is not removed.
Instead, it's detached from the base table, and it functions as a normal
table (with some differences). If that log table lives up to the point
when the user re-enables CDC on the base table, instead of creating a new
log table, the old one is re-attached to the base.

For more context on that, see commit:
scylladb/scylladb@adda43edc7.

In this commit, we add validation tests that check whether the changes
on the base table after disabling CDC are reflected on the log table
after re-enabling CDC. The definition of the log table should be the same
as if CDC had never been disabled.

Closes scylladb/scylladb#25071
2025-08-19 10:15:41 +02:00
Botond Dénes
f8b79d563a Merge 's3: Minor refactoring and beautification of S3 client and tests' from Ernest Zaslavsky
This pull request introduces minor code refactoring and aesthetic improvements to the S3 client and its associated test suite. The changes focus on enhancing readability, consistency, and maintainability without altering any functional behavior.

No backport is required, as the modifications are purely cosmetic and do not impact functionality or compatibility.

Closes scylladb/scylladb#25490

* github.com:scylladb/scylladb:
  s3_client: relocate `req` creation closer to usage
  s3_client: reformat long logging lines for readability
  s3_test: extract file writing code to a function
2025-08-18 18:48:42 +03:00
Aleksandra Martyniuk
a10e241228 replica: lower severity of failure log
Flush failure with seastar::named_gate_closed_exception is expected
if a respective compaction group was already stopped.

Lower the severity of a log in dirty_memory_manager::flush_one
for this exception.

Fixes: https://github.com/scylladb/scylladb/issues/25037.

Closes scylladb/scylladb#25355
2025-08-18 13:30:42 +03:00
Avi Kivity
96956e48c4 Merge 'utils: stall_free: detect clear_gently method of const payload types' from Benny Halevy
Currently, when a container or smart pointer holds a const payload
type, utils::clear_gently does not detect the object's clear_gently
method as the method is non-const and requires a mutable object,
as in the following example in class tablet_metadata:
```
    using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>;
    using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>;
```

That said, when a container is cleared gently the elements it holds
are destroyed anyhow, so we'd like to allow to clear them gently before
destruction.

This change still doesn't allow directly calling utils::clear_gently
on const objects.

Also adds the respective unit tests.

Fixes #24605
Fixes #25026

* This is an optimization that is not strictly required to backport (as https://github.com/scylladb/scylladb/pull/24618 dealt with clear_gently of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>` well enough)

Closes scylladb/scylladb#24606

* github.com:scylladb/scylladb:
  utils: stall_free: detect clear_gently method of const payload types
  utils: stall_free: clear gently a foreign shared ptr only when use_count==1
2025-08-18 12:52:02 +03:00
Evgeniy Naydanov
ab1a093d94 test.py: dtest: remove slow and greedy tests from commitlog_test.py
Tests test_total_space_limit_of_commitlog_with_large_limit and
test_total_space_limit_of_commitlog_with_medium_limit use too much
disk space and take too long to run.  Keep them in
scylla-dtest for now.
2025-08-18 09:42:13 +00:00
Evgeniy Naydanov
647043d957 test.py: dtest: make commitlog_test.py run using test.py
As a part of the porting process, remove unused imports and
markers, remove non-next_gating tests and tests marked with
`skip`, `skip_if`, and `xfail` markers.

test.py uses `commitlog` directory instead of dtest's
`commitlogs`.

Remove test_stop_failure_policy test because the way it
provokes a commitlog failure (changing file permissions) doesn't
work on CI.

Enable the test in suite.yaml (run in dev mode only)
2025-08-18 09:42:13 +00:00
Evgeniy Naydanov
5f6e083124 test.py: dtest: add ScyllaCluster.flush() method
2025-08-18 09:42:13 +00:00
Evgeniy Naydanov
c378dc3fab test.py: dtest: add ScyllaNode.stress() method
2025-08-18 09:42:13 +00:00
Evgeniy Naydanov
6f42019900 test.py: dtest: add tools/data.py::run_query_with_data_processing() function
2025-08-18 09:42:13 +00:00
Evgeniy Naydanov
2c4f2de3b0 test.py: dtest: add tools/files.py::corrupt_file() function
2025-08-18 09:42:13 +00:00
Evgeniy Naydanov
80b797e376 test.py: dtest: copy some assertions from dtest
Copy assertions required for commitlog_test.py:
  - assert_almost_equal
  - assert_row_count
  - assert_row_count_in_select_less
  - assert_lists_equal_ignoring_order
2025-08-18 09:42:13 +00:00
Evgeniy Naydanov
1a2d132456 test.py: dtest: copy unmodified commitlog_test.py
2025-08-18 09:42:13 +00:00
Pavel Emelyanov
4f55af9578 Merge 'test.py: pytest: support --mode/--repeat in a common way for all tests' from Evgeniy Naydanov
Implement repetition of files using `pytest_collect_file` hook: run file collection as many times as needed to cover all `--mode`/`--repeat` combinations.  Store build mode and run ID to the stash of repeated item.
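
The expansion of `--mode`/`--repeat` into individual runs can be sketched as follows. This is a simplified model rather than the actual hook code; `expand_runs` is a hypothetical helper, and the numbering of run IDs from 1 is an assumption.

```python
from itertools import product


def expand_runs(test_file, modes, repeat):
    """Yield one (file, mode, run_id) triple per --mode/--repeat combination."""
    for mode, run_id in product(modes, range(1, repeat + 1)):
        yield test_file, mode, run_id
```

For two modes and `--repeat=2` this yields four collections of the same file, each carrying the build mode and run ID that the real hook would store in the item stash.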

Some additional changes done:

- Add `TestSuiteConfig` class to handle all operations with `test_config.yaml`
- Add support for `run_first` option in `test_config.yaml`
- Move disabled test logic to `pytest_collect_file` hook.

These changes allow us to remove the custom logic for `--mode`, `--repeat`, and disabled tests in the code for C++ tests and prepare for switching the Python/CQLApproval/Topology tests to the pytest runner.

Also, this PR includes required refactoring changes and fixes:

- Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: `base.py`, `boost.py`, and `unit.py`
- Remove unused imports in `test.py`
- Use the constant for `"suite.yaml"` string
- Some test suites have their own test runners based on pytest, and they don't need all the machinery we use for `test.py`.  Move all code related to the `test.py` framework to `test/pylib/runner.py` and use it as a plugin conditionally (by using the `SCYLLA_TEST_RUNNER` env variable.)
- Add `cwd` parameter to `run_process()` methods in `resource_gather` module to avoid using `os.chdir()` (and sort parameters in the same order as in `subprocess.Popen`.)
- `extra_scylla_cmdline_options` is a list of command-line arguments, and each argument should be a separate item.  A few configuration files have the `--reactor-backend` option added in a format that doesn't follow this rule.

This PR is a refactoring step for https://github.com/scylladb/scylladb/pull/25443

Closes scylladb/scylladb#25465

* github.com:scylladb/scylladb:
  test.py: pytest: support --mode/--repeat in a common way for all tests
  test.py: pytest: streamline suite configuration handling
  test.py: refactor: remove unused imports in test.py
  test.py: fix run with bare pytest after merge of scylladb/scylladb#24573
  test.py: refactor: move framework-related code to test.pylib.runner
  test.py: resource_gather: add cwd parameter to run_process()
  test.py: refactor: use proper format for extra_scylla_cmdline_options
2025-08-18 12:24:04 +03:00
Avi Kivity
e9928b31b8 Merge 'sstables/trie: add BTI key translation routines' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25396
Next part: implementing sstable index writers and readers on top of the abstract trie writers/readers.

The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.

This series provides translation routines for ring positions and clustering positions
from Scylla's native in-memory structures to BTI's byte-comparable encoding.

This translation is performed whenever a new decorated key or clustering block
is added to a BTI index, and whenever a BTI index is queried for a range of positions.

For a description of the encoding, see
fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)

The translation logic, with all the fragment awareness, lazy
evaluation and avoidable copies, is fairly bloated for the common cases
of simple and small keys. This is a potential optimization target for later.

No backports needed, new functionality.

Closes scylladb/scylladb#25506

* github.com:scylladb/scylladb:
  sstables/trie: add BTI key translation routines
  tests/lib: extract generate_all_strings to test/lib
  tests/lib: extract nondeterministic_choice_stack to test/lib
  sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file
  sstables/mx: move clustering_info from writer.cc to types.hh
  sstables/trie: allow `comparable_bytes_iterator` to return a mutable span
  dht/ring_position: add ring_position_view::weight()
2025-08-18 11:55:26 +03:00
Asias He
082bc70a0a replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
It helps to hide the compaction_group_views from repair subsystem.
2025-08-18 11:01:22 +08:00
Asias He
be15972006 compaction: Move compaction_reenabler to compaction_reenabler.hh
So it can be used without bringing the whole
compaction/compaction_manager.hh.
2025-08-18 11:01:22 +08:00
Asias He
cac4940129 topology_coordinator: Make rpc::remote_verb_error to warning level
This could happen in case the peer node is in shutdown. This is not
something we cannot recover from. The log level should be warning instead
of error, which our dtest catches as a test failure.

This was observed in test_repair_one_node_alter_rf dtest.
2025-08-18 11:01:22 +08:00
Asias He
76316f44a7 repair: Add metrics for sstable bytes read and skipped from sstables
scylla_repair_inc_sst_skipped_bytes: Total number of bytes skipped from
sstables for incremental repair on this shard.

scylla_repair_inc_sst_read_bytes : Total number of bytes read from
sstables for incremental repair on this shard.
2025-08-18 11:01:22 +08:00
Asias He
b0364fcba3 test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
Disable incremental repair so that the second repair can still work on
the repaired data set.
2025-08-18 11:01:22 +08:00
Asias He
ad5275fd4c test.py: Add tests for tablet incremental repair
The following tests are added for tablet incremental repair:

- Basic incremental repair

- Basic incremental repair with error

- Minor compaction and incremental repair

- Major compaction and incremental repair

- Scrub compaction and incremental repair

- Cleanup/Upgrade compaction and incremental repair

- Tablet split and incremental repair

- Tablet merge and incremental repair
2025-08-18 11:01:21 +08:00
Asias He
0d7e518a26 repair: Add tablet incremental repair support
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field is set to its default value of 0. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results

    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472
2025-08-18 11:01:21 +08:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch adds incremental_repair support in compaction.

- The sstables are split into repaired and unrepaired sets.

- The repaired and unrepaired sets are compacted separately.

- The repaired_at from sstable and sstables_repaired_at from
  system.tablets table are used to decide if a sstable is repaired or
  not.

- Different compaction tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
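
The split into two compaction sets can be pictured with a short Python sketch. The helper name `split_for_compaction` and the `(name, repaired_at)` input format are hypothetical, and treating 0 as "never repaired" is an assumption rather than a quote from the compaction code.

```python
def split_for_compaction(sstables, sstables_repaired_at):
    """Split (name, repaired_at) pairs into repaired and unrepaired sets."""
    repaired, unrepaired = [], []
    for name, repaired_at in sstables:
        # Assumed rule: 0 means "never repaired"; otherwise compare with the
        # tablet's sstables_repaired_at value from system.tablets.
        if repaired_at != 0 and repaired_at <= sstables_repaired_at:
            repaired.append(name)
        else:
            unrepaired.append(name)
    return repaired, unrepaired
```

Each returned list would then be offered to compaction on its own, so that repaired and unrepaired data never end up in the same output sstable.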
2025-08-18 11:01:21 +08:00
Evgeniy Naydanov
e44b26b809 test.py: pytest: support --mode/--repeat in a common way for all tests
Implement repetition of files using pytest_collect_file hook: run
file collection as many times as needed to cover all --mode/--repeat
combinations.  Also move disabled test logic to this hook.

Store build mode and run_id in pytest item stashes.

Simplify support of C++ tests: remove redundant facade abstraction and put
all code into 3 files: base.py, boost.py, and unit.py

Add support for `run_first` option in test_config.yaml
2025-08-17 15:26:23 +00:00
Evgeniy Naydanov
bffb6f3d01 test.py: pytest: streamline suite configuration handling
Move test_config.yaml handling code from common_cpp_conftest.py to
TestSuiteConfig class in test/pylib/runner.py
2025-08-17 12:32:36 +00:00
Evgeniy Naydanov
a2a59b18a3 test.py: refactor: remove unused imports in test.py
Also use the constant for "suite.yaml" string.
2025-08-17 12:32:36 +00:00
Evgeniy Naydanov
a188523448 test.py: fix run with bare pytest after merge of scylladb/scylladb#24573
To run tests with bare pytest command we need to have almost the
same set of options as test.py because we reuse code from test.py.

scylladb/scylladb#24573 added `--pytest-arg` option to test.py but
not to test/conftest.py which breaks running Python tests using
bare pytest command.
2025-08-17 12:32:35 +00:00
Evgeniy Naydanov
600d05471b test.py: refactor: move framework-related code to test.pylib.runner
Some test suites have their own test runners based on pytest, and they
don't need all the machinery we use for test.py.  Move all code related to
test.py framework to test/pylib/runner.py and use it as a plugin
conditionally (by using TEST_RUNNER variable.)
2025-08-17 12:32:35 +00:00
Evgeniy Naydanov
f2619d2bb0 test.py: resource_gather: add cwd parameter to run_process()
Also sort the arguments in the Popen call to match its signature.
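
Why a `cwd` parameter is preferable to `os.chdir()` can be shown with a small sketch. The `run_process` below is a simplified, hypothetical stand-in and not the `resource_gather` method itself; only `subprocess.run` and its `cwd` argument are standard library facts.

```python
import os
import subprocess
import sys
import tempfile


def run_process(args, cwd=None):
    # cwd is applied to the child only; the parent's working directory
    # is untouched, unlike with os.chdir().
    return subprocess.run(args, cwd=cwd, capture_output=True, text=True, check=True)


with tempfile.TemporaryDirectory() as workdir:
    before = os.getcwd()
    child = run_process([sys.executable, "-c", "import os; print(os.getcwd())"], cwd=workdir)
    child_cwd = child.stdout.strip()
    same_dir = os.path.realpath(child_cwd) == os.path.realpath(workdir)
    after = os.getcwd()
```

The child reports `workdir` as its working directory while the parent stays where it was, which matters when several tests run in the same process.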
2025-08-17 12:32:35 +00:00
Evgeniy Naydanov
cb4d9b8a09 test.py: refactor: use proper format for extra_scylla_cmdline_options
`extra_scylla_cmdline_options` is a list of commandline arguments
and each argument should be a separate item.  A few configuration
files have the `--reactor-backend` option added in a format that doesn't
follow this rule.
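
A sketch of the rule in Python. The malformed entry used here (an option and its value packed into one string) is an assumed example, since the exact format found in the configuration files is not shown above; `normalize_options` is a hypothetical helper.

```python
import shlex


def normalize_options(options):
    """Split entries that pack several arguments into one string."""
    normalized = []
    for item in options:
        normalized.extend(shlex.split(item))
    return normalized
```

For instance, `["--reactor-backend linux-aio", "--smp=2"]` becomes `["--reactor-backend", "linux-aio", "--smp=2"]`, which is the form a process launcher expects in its argument vector.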
2025-08-17 12:32:35 +00:00
Michał Chojnowski
413dcf8891 sstables/trie: add BTI key translation routines
This file provides translation routines for ring positions and clustering positions
from Scylla's native in-memory structures to BTI's byte-comparable encoding.

This translation is performed whenever a new decorated key or clustering block
is added to a BTI index, and whenever a BTI index is queried for a range of positions.

For a description of the encoding, see
fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)

The translation logic, with all the fragment awareness, lazy
evaluation and avoidable copies, is fairly bloated for the common cases
of simple and small keys. This is a potential optimization target for later.
2025-08-15 11:13:00 +02:00
Pavel Emelyanov
f689d41747 Merge 'db/hints: Improve logs' from Dawid Mędrek
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.
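
As a sketch, the format above corresponds to a helper like the following. It is written in Python for brevity; the real logging is done in C++, and the function name `format_hint_log` as well as the sample values are hypothetical.

```python
def format_hint_log(class_name, host_id, function_name, message):
    # <class_name>[<destination host ID>]:<function_name>: <message>
    return f"{class_name}[{host_id}]:{function_name}: {message}"


line = format_hint_log("hint_sender", "host-1", "some_function", "example message")
```

Every line built this way starts with the class and destination host ID, so logs can be filtered per endpoint.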

Fixes scylladb/scylladb#25466

Backport:
There is no risk in backporting these changes. They only have
impact on the logs. On the other hand, they might prove helpful
when debugging an issue in hinted handoff.

Closes scylladb/scylladb#25470

* github.com:scylladb/scylladb:
  db/hints: Add new logs
  db/hints: Adjust log levels
  db/hints: Improve logs
2025-08-15 09:34:29 +03:00
Patryk Jędrzejczak
03cc34e3a0 test: test_maintenance_socket: use cluster_con for driver sessions
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.

In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.

Fixes scylladb/scylla-enterprise#5601

This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.

Closes scylladb/scylladb#25510
2025-08-15 09:32:20 +03:00
Pavel Emelyanov
05d8d94257 Merge 'test.py: Add -k=EXPRESSION pytest argument support for boost tests.' from Artsiom Mishuta
Follow-up PR after the quick fix https://github.com/scylladb/scylladb/pull/25394
Should be merged only after https://github.com/scylladb/scylla-pkg/pull/5414

Since boost tests run via pure pytest, we can finally run tests using
the -k=EXPRESSION pytest argument. The expression is applied to the "test
function" names, so it is possible to run the subset of test functions that match a pattern across all boost tests.

The --skip and -k arguments are mutually exclusive
because -k extends the --skip functionality.

examples:
```
./build/release/test/boost/auth_passwords_test --list_content
passwords_are_salted*
correct_passwords_authenticate*
incorrect_passwords_do_not_authenticate*

./test.py --mode=dev  -k="correct" -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::incorrect_passwords_do_not_authenticate.dev.1
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1

./test.py --mode=dev  -k="not incorrect and not passwords_are_salted" -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1

./test.py --mode=dev  --skip=incorrect --skip=passwords_are_salted -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1

./test.py --mode=dev  -k="correct and not incorrect" -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1
```
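
One way to enforce that `-k` and `--skip` are mutually exclusive is an `argparse` exclusive group. This is a sketch under the assumption that the options are parsed with `argparse`; it is not taken from test.py, and `build_parser` and the `dest` name are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="test.py")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-k", dest="expression", help="pytest-style selection expression")
    group.add_argument("--skip", action="append", default=[], help="pattern to skip")
    return parser
```

Passing both `-k` and `--skip` to this parser makes `argparse` print an error and exit, while either option alone is accepted.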

Closes scylladb/scylladb#25400

* github.com:scylladb/scylladb:
  test.py: add -k=EXPRESSION pytest argument support for boost tests.
  test.py: small refactoring of how boost test arguments make
2025-08-15 09:24:56 +03:00
Jenkins Promoter
d4ce070168 Update pgo profiles - aarch64
2025-08-15 05:03:28 +03:00
Jenkins Promoter
c0f691f4d9 Update pgo profiles - x86_64
2025-08-15 04:56:11 +03:00
Michał Chojnowski
5e76708335 tests/lib: extract generate_all_strings to test/lib
This util will be used in another test file in a later commit,
so hoist it to `test/lib`.
2025-08-14 22:38:38 +02:00
Taras Veretilnyk
30ff5942c6 database_test: fix race in test_drop_quarantined_sstables
The test_drop_quarantined_sstables test could fail due to a race between
compaction and quarantining of SSTables. If compaction selects
an SSTable before it is moved to quarantine, and change_state is called during
compaction, the SSTable may already be removed, resulting in a
std::filesystem_error due to missing files.

This patch resolves the issue by wrapping the quarantine operation inside
run_with_compaction_disabled(). This ensures compaction is paused on the
compaction group view while SSTables are being quarantined, preventing the
race.

Additionally, this patch updates the test to quarantine up to 1/5 of the
SSTables instead of a single random one, and increases the number of
generated SSTables to improve the test scenario.

Fixes scylladb/scylladb#25487

Closes scylladb/scylladb#25494
2025-08-14 20:23:42 +03:00
Taras Veretilnyk
367eaf46c5 keys: from_nodetool_style_string don't split single partition keys
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.

Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'

Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.
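
The schema-aware split described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual C++ code: the function name `split_nodetool_key` and the `partition_key_columns` parameter are hypothetical.

```python
def split_nodetool_key(key: str, partition_key_columns: int) -> list[str]:
    """Split a nodetool-style key string into partition key components.

    Hypothetical helper: a single-column partition key is returned whole,
    so colons inside it are preserved. Compound keys are still split on
    every colon, which is the limitation that remains open.
    """
    if partition_key_columns == 1:
        return [key]
    return key.split(":")
```

With a single-column key, `split_nodetool_key("a:b", 1)` keeps `a:b` intact. With a two-column key, a value such as `a:b:c` still yields three components, which is the unresolved compound-key case mentioned in the Refs note.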

Closes scylladb/scylladb#24829
2025-08-14 19:52:04 +03:00
Avi Kivity
1ef6697949 Merge 'service/vector_store_client: Add live configuration update support' from Karol Nowacki
Enable runtime updates of vector_store_uri configuration without
requiring server restart.
This allows dynamically enabling, disabling, or switching the vector search service endpoint on the fly.

To improve clarity, the seastar::http::experimental::client is now wrapped in a private http_client class that also holds the host, address, and port information.

Tests have been added to verify that the client correctly handles transitions between enabled/disabled states and successfully switches traffic to a new endpoint after a configuration update.

Closes: VECTOR-102

No backport is needed as this is a new feature.

Closes scylladb/scylladb#25208

* github.com:scylladb/scylladb:
  service/vector_store_client: Add live configuration update support
  test/boost/vector_store_client_test.cc: Refactor vector store client test
  service/vector_store_client: Refactor host_port struct created
  service/vector_store_client: Refactor HTTP request creation
2025-08-14 19:45:06 +03:00
Avi Kivity
fe6e1071d3 Merge 'locator: util: optimize describe_ring' from Benny Halevy
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop.

This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took:
Before: 30 milliseconds (2.6 microseconds per range)
After:  24 milliseconds (2.1 microseconds per range)

Add respective unit test for vnode keyspace
and for tablets.

Fixes #24887

* backport up to 2025.1 as describe_ring slowness was hit in the field with large clusters

Closes scylladb/scylladb#24889

* github.com:scylladb/scylladb:
  locator: util: optimize describe_ring
  locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas
  vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map
  locator: effective_replication_map: get_natural_replicas: get is_vnode param
  test: cluster: test_repair: add test_vnode_keyspace_describe_ring
2025-08-14 19:39:17 +03:00
Ernest Zaslavsky
a0016bd0cc s3_client: relocate req creation closer to usage
Move the creation of the `req` object to the point where it is
actually used, improving code clarity and reducing premature
initialization.
2025-08-14 16:18:43 +03:00
Ernest Zaslavsky
6ef2b0b510 s3_client: reformat long logging lines for readability
Break up excessively long logging statements to improve readability
and maintain consistent formatting across the codebase.
2025-08-14 16:18:43 +03:00
Ernest Zaslavsky
29960b83b5 s3_test: extract file writing code to a function
Reduce code doing the same over and over again by extracting file writing code to a function
2025-08-14 16:18:43 +03:00
Artsiom Mishuta
fcd511a531 test.py: add -k=EXPRESSION pytest argument support for boost tests.
Since boost tests run via pure pytest, we can finally run tests using
the -k=EXPRESSION pytest argument. The expression is applied to the "test
function" name, so it is possible to run the subset of test functions that match the expression across all boost tests.

The --skip and -k arguments are mutually exclusive,
because -k extends the --skip functionality.

examples:
./build/release/test/boost/auth_passwords_test --list_content
passwords_are_salted*
correct_passwords_authenticate*
incorrect_passwords_do_not_authenticate*

./test.py --mode=dev  -k="correct" -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::incorrect_passwords_do_not_authenticate.dev.1
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1

./test.py --mode=dev  -k="not incorrect and not passwords_are_salted" -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1

./test.py --mode=dev  --skip=incorrect --skip=passwords_are_salted -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1

./test.py --mode=dev  -k="correct and not incorrect" -vv test/boost/auth_passwords_test.cc
PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1
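
The selections in the examples above follow from substring matching: "correct" is also a substring of "incorrect". A simplified Python model of that behaviour is sketched below. It is not pytest's implementation (pytest's real matcher also handles case-insensitivity, markers and more), and the names `matches`, `select` and `TESTS` are made up for this illustration.

```python
import re

# Test function names as listed by --list_content above.
TESTS = [
    "passwords_are_salted",
    "correct_passwords_authenticate",
    "incorrect_passwords_do_not_authenticate",
]

def matches(expression: str, test_name: str) -> bool:
    """Evaluate a -k style expression against one test name.

    Every word other than and/or/not is treated as a substring test.
    """
    def replace(match: re.Match) -> str:
        word = match.group(0)
        if word in ("and", "or", "not"):
            return word
        return str(word in test_name)

    translated = re.sub(r"[A-Za-z_][A-Za-z0-9_]*", replace, expression)
    # The translated string only contains True/False, and/or/not and parentheses.
    return eval(translated, {"__builtins__": {}})

def select(expression: str) -> list[str]:
    """Return the test names selected by the expression."""
    return [name for name in TESTS if matches(expression, name)]
```

Under this model, `select("correct")` returns both the correct and the incorrect test, while `select("correct and not incorrect")` returns only `correct_passwords_authenticate`.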
2025-08-14 14:45:40 +02:00
Artsiom Mishuta
d589f36645 test.py: small refactoring of how boost test arguments make
During the migration of boost tests to pytest, a big portion of the logic was
carried over "as is", including bad code and bugs.

This PR refactors the function that builds the arguments for the pytest command:

1) refactor how modes are provided
2) refactor how --skip is provided
3) remove the shlex.split workaround
2025-08-14 14:45:28 +02:00
Abhinav Jha
a0ee5e4b85 raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE
The RAFT_TEST_CASE macro creates 2 test cases, one of them with random 20%
packet loss, named name_drops. The framework makes hard-coded assumptions about
the leader, which don't hold well in case of packet losses.

This short term fix disables the packet drop variant of the specified test.
It should be safe to re-enable it once the whole framework is re-worked to
remove these hard coded assumptions.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23816

Closes scylladb/scylladb#25489
2025-08-14 13:15:16 +02:00
Dawid Mędrek
6f1fb7cfb5 db/hints: Add new logs
We're adding new logs in just a few places that may however prove
important when debugging issues in hinted handoff in the future.
2025-08-14 11:45:24 +02:00
Dawid Mędrek
d7bc9edc6c db/hints: Adjust log levels
Some of the logs could be clogging Scylla's logs, so we demote their
level to a lower one.

On the other hand, some of the logs would most likely not do that,
and they could be useful when debugging -- we promote them to debug
level.
2025-08-14 11:45:24 +02:00
Dawid Mędrek
2327d4dfa3 db/hints: Improve logs
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.
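
As an illustration only, the format above can be expressed as a small Python helper. The real logging lives in the C++ classes `hint_endpoint_manager` and `hint_sender`; the function name `format_hint_log` and the sample host ID and function name in the note below are hypothetical.

```python
def format_hint_log(class_name: str, host_id: str, function_name: str, message: str) -> str:
    """Build a log line in the <class_name>[<host ID>]:<function_name>: <message> format."""
    return f"{class_name}[{host_id}]:{function_name}: {message}"
```

For example, `format_hint_log("hint_sender", "1234", "send_one_file", "sending file")` produces `hint_sender[1234]:send_one_file: sending file`.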
2025-08-14 11:45:04 +02:00
Anna Stuchlik
841ba86609 doc: document support for new z3 instance types
This commit adds new z3 instances we now support to the list of GCP instance types.

Fixes https://github.com/scylladb/scylladb/issues/25438

Closes scylladb/scylladb#25446
2025-08-14 10:59:45 +02:00
Avi Kivity
66173c06a3 Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy
Remove support for generating numerical sstable generation for new sstables.
Loading such sstables is still supported but new sstables are always created with a uuid generation.
This is possible since:
* All live versions (since 5.4 / f014ccf369) now support uuid sstable generations.
* The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / 6da758d74c) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`.

Fixes #24248

* Enhancement, no backport needed

Closes scylladb/scylladb#24512

* github.com:scylladb/scylladb:
  streaming: stream_blob: use the table sstable_generation_generator
  replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
  sstables: sstable_generation_generator: stop tracking highest generation
  replica: table: get rid of update_sstables_known_generation
  sstables: sstable_directory: stop tracking highest_generation
  replica: distributed_loader: stop tracking highest_generation
  sstables: sstable_generation: get rid of uuid_identifiers bool class
  sstables_manager: drop uuid_sstable_identifiers
  feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
  test: cql_query_test: add test_sstable_load_mixed_generation_type
  test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
  test: database_test: move table_dir helper to test/lib/test_utils
2025-08-14 11:54:33 +03:00
Anna Stuchlik
1e5659ac30 doc: add the information about ScyllaDB C# Driver
This commit adds the driver to the list of ScyllaDB drivers,
including the information about:
- CDC integration (not available)
- Tablets (supported)

Fixes https://github.com/scylladb/scylladb/issues/25495

Closes scylladb/scylladb#25498
2025-08-14 11:29:52 +03:00
Patryk Jędrzejczak
6ad2b71d04 Merge 'LWT: communicate RPC errors to the user' from Petr Gusev
Currently, if the accept or prepare verbs fail on the replica side, the user only receives a generic error message of the form "something went wrong for this table", which provides no insight into the root cause. Additionally, these error messages are not logged by default, requiring the user to restart the node with trace or debug logging to investigate the issue.

This PR improves error handling for the accept and prepare verbs by preserving and propagating the original error messages, making it easier to diagnose failures.

backport: not needed, not a bug

Closes scylladb/scylladb#25318

* https://github.com/scylladb/scylladb:
  test_tablets_lwt: add test_error_message_for_timeout_due_to_uncertainty
  storage_proxy: preserve accept error messages
  storage_proxy: preserve prepare error message
  storage_proxy: fix log message
  exceptions.hh: fix message argument passing
  exceptions: add constructors that accept explicit error messages
2025-08-14 10:23:32 +02:00
Nadav Har'El
2d3c0eb25a test/alternator: speed up test_ttl_expiration_lsi_key
The Alternator test test_ttl.py::test_ttl_expiration_lsi_key is
currently the second-slowest test/alternator test, running a "whopping"
2.6 seconds (the total of two parameterizations - with vnodes and
tables).

This patch reduces it to 0.9 seconds.

The fix is simple: Unfortunately, tests that need to wait for actual
TTL expiration take time, but the test framework configures the TTL
scanner to have a period of half a second, so the wait should be on
average around 0.25 seconds. But the test code by mistake slept 1.2
seconds between retries. We even had a good "sleep" variable for the
amount of time we should sleep between retries, but forgot to use it.

So after lowering the sleep between retries, this test is still not
instantaneous - it still needs to wait up to 0.5 seconds for the
expirations to occur - but it's almost 3 times faster than before.
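
The retry pattern discussed above can be sketched as a generic polling loop. This is not the test's actual code: `wait_until` and its parameters are hypothetical names, and the point is only that the pause between retries comes from the `sleep` argument rather than a hard-coded value.

```python
import time

def wait_until(condition, timeout: float, sleep: float) -> bool:
    """Poll condition() until it returns True or the timeout expires.

    The pause between retries is the caller-supplied `sleep`, so it can be
    kept well below the period of whatever background scanner is awaited.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(sleep)
    # One last check after the deadline has passed.
    return condition()
```

With a scanner period of half a second, a call such as `wait_until(item_expired, timeout=10, sleep=0.1)` (where `item_expired` is a hypothetical check) would notice the expiration soon after it happens instead of up to 1.2 seconds later.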

While working on this test, I also used the opportunity to update its
comment which excused why we are testing LSI and not GSI. Its
suggestions of what is planned for GSI have already become a reality,
so let's update the comment to say so.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25386
2025-08-14 11:21:52 +03:00
Pavel Emelyanov
eaec7c9b2e Merge 'cql3: add default replication strategy to create_keyspace_statement' from Dario Mirovic
When creating a new keyspace, both replication strategy and replication
factor must be stated. For example:
`CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };`

This syntax is verbose, and in all but some testing scenarios
`NetworkTopologyStrategy` is used.

This patch allows skipping replication strategy name, filling it with
`NetworkTopologyStrategy` when that happens. The following syntax is now
valid:
`CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };`
and will give the same result as the previous, more explicit one.

Fixes https://github.com/scylladb/scylladb/issues/16029

Backport is not needed. This is an enhancement for future releases.

Closes scylladb/scylladb#25236

* github.com:scylladb/scylladb:
  docs/cql: update documentation for default replication strategy
  test/cqlpy: add keyspace creation default strategy test
  cql3: add default replication strategy to `create_keyspace_statement`
2025-08-14 11:18:36 +03:00
Andrzej Jackowski
bf8be01086 test: audit: add logging of get_audit_log_list and set_of_rows_before
Without those logs, analysing some test failures is difficult.

Refs: scylladb/scylladb#25442

Closes scylladb/scylladb#25485
2025-08-14 09:53:05 +03:00
Ernest Zaslavsky
dd51e50f60 s3_client: add memory fallback in chunked_download_source
Introduce fallback logic in `chunked_download_source` to handle
memory exhaustion. When memory is low, feed the `deque` with only
one uncounted buffer at a time. This allows slow but steady progress
without getting stuck on the memory semaphore.

Fixes: https://github.com/scylladb/scylladb/issues/25453
Fixes: https://github.com/scylladb/scylladb/issues/25262

Closes scylladb/scylladb#25452
2025-08-14 09:52:10 +03:00
Michał Chojnowski
72818a98e0 tests/lib: extract nondeterministic_choice_stack to test/lib
This util will be used in another test file in a later commit,
so hoist it to `test/lib`.
2025-08-14 02:06:34 +02:00
Michał Chojnowski
0ffe336887 sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file
In a later commit, this concept will be used in a place that's not
dependent on trie traversal routines. So extract it to its own header.
2025-08-14 02:06:34 +02:00
Michał Chojnowski
30dad06c9a sstables/mx: move clustering_info from writer.cc to types.hh
We will use this type as the input to the BTI row index writer.
Since it will be implemented in other translation units,
the definition of the type has to be moved to a header.
2025-08-14 02:06:33 +02:00
Michał Chojnowski
347e5c534a sstables/trie: allow comparable_bytes_iterator to return a mutable span
`comparable_bytes_iterator` is a concept for iterating over the
fragments of a key translated to BTI encoding.
In `trie_traversal.hh`, those fragments are
`std::span<const std::byte>`, because the traversal routines
have no use for modifying the fragments.

But in a later commit we will also have to deal with encoded
keys during row index writes, and the row index writer will want
to modify the bytes, to nudge the mismatch byte by one in order
to obtain a key separator.

Let's extend this concept to allow both span<const byte>
and span<byte>, so that it can be used in both situations.
2025-08-14 01:54:57 +02:00
Michał Chojnowski
4fb841346b dht/ring_position: add ring_position_view::weight()
This will be useful for the translation of ring positions
to BTI encoding.
We will use it in a later commit.
2025-08-14 01:54:57 +02:00
Wojciech Mitros
2ece08ba43 test: run mv tests depending on metrics on a standalone instance
The test_base_partition_deletion_with_metrics test case (and the batch
variant) uses the metric of view updates done during its runtime to check
if we didn't perform too many of them. The test runs in the cqlpy suite,
which runs all test cases sequentially on one Scylla instance. Because
of this, if another test case starts a process which generates view
updates and doesn't wait for it to finish before it exits, we may
observe too many view updates in test_base_partition_deletion_with_metrics
and fail the test.
In all test cases we make sure that all tables that were created
during the test are dropped at the end. However, that doesn't
stop the view building process immediately, so the issue can happen
even if we drop the view. I confirmed it by adding a test just before
test_base_partition_deletion_with_metrics which builds a big
materialized view and drops it at the end - the metrics check still failed.

The issue could be caused by any of the existing test cases where we create
a view and don't wait for it to be built. Note that even if we start adding
rows after creating the view, some of them may still be included in the view
building, as the view building process is started asynchronously. In such
a scenario, the view building also doesn't cause any issues with the data in
these tests - writes performed after view creation generate view updates
synchronously when they're local (and we're running a single Scylla server),
so the corresponding view updates generated during view building are redundant.

Because we have many test cases which could be causing this issue, instead
of waiting for the view building to finish in every single one of them, we
move the susceptible test cases to be run on separate Scylla instances, in
the "cluster" suite. There, no other test cases will influence the results.

Fixes https://github.com/scylladb/scylladb/issues/20379

Closes scylladb/scylladb#25209
2025-08-13 15:08:50 +03:00
Petr Gusev
3f287275b8 test_tablets_lwt: add test_error_message_for_timeout_due_to_uncertainty 2025-08-13 14:03:57 +02:00
Petr Gusev
8bd936b72c storage_proxy: preserve accept error messages 2025-08-13 13:43:12 +02:00
Petr Gusev
00c25d396f storage_proxy: preserve prepare error message 2025-08-13 13:43:12 +02:00
Petr Gusev
0724fafe47 storage_proxy: fix log message 2025-08-13 13:40:09 +02:00
Petr Gusev
ffaee20b62 exceptions.hh: fix message argument passing
The message argument is usually taken from a temporary variable
constructed with the format() function. It is more efficient to
pass it by value and move it along the constructor chain.
2025-08-13 13:39:52 +02:00
Benny Halevy
50abeb1270 locator: util: optimize describe_ring
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint
information in an unordered_map instead of looking
them up in every inner-loop.

This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per
node, yielding 11520 ranges and 9 replicas per range, describe_ring took
Before: 30 milliseconds (2.6 microseconds per range)
After:  24 milliseconds (2.1 microseconds per range)
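
The caching idea can be sketched in Python as below. This is a simplified model, not the C++ implementation: `describe_ring` here takes a hypothetical `lookup_endpoint_info` callback, and a dict stands in for the unordered_map.

```python
def describe_ring(ranges, lookup_endpoint_info):
    """Attach endpoint details to every replica of every range.

    `ranges` is an iterable of (token_range, replica_hosts) pairs and
    `lookup_endpoint_info` is a (possibly expensive) per-host lookup.
    Each host is looked up once and then served from the cache.
    """
    cache = {}
    result = []
    for token_range, replicas in ranges:
        details = []
        for host in replicas:
            if host not in cache:
                cache[host] = lookup_endpoint_info(host)
            details.append(cache[host])
        result.append((token_range, details))
    return result
```

With 11520 ranges and 9 replicas per range over 45 nodes, this shape performs 45 lookups instead of 103680, with the inner loop reduced to a map access.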

Add respective unit test of describe_ring for tablets.
A unit test for vnodes already exists in
test/nodetool/test_describering.py

Fixes #24887

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:42:25 +03:00
Benny Halevy
60d2cc886a locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas
First, let get_all_ranges return all vnode ranges
with a corrected wrapping range covering the [last token, first token)
range, such that all ranges' start tokens are vnode tokens
and must be in the vnode replication map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:42:23 +03:00
Benny Halevy
195d02d64e vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map
Prevent a crash, especially in the is_vnode=true case,
if the key_token is not found in the map.
Rather than the undefined behavior when dereferencing the
end() iterator, throw an internal error with additional
logging about the search logic and parameters.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:41:03 +03:00
Benny Halevy
4d646636f2 locator: effective_replication_map: get_natural_replicas: get is_vnode param
Some callers, like `construct_range_to_endpoint_map` for describe_ring,
or `get_secondary_ranges` for alternator ttl pass vnode tokens (the
vnodes' start token), and therefore can benefit from the fast lookup
path in `vnode_effective_replication_map::do_get_replicas`.
Otherwise the vnode token is binary-searched in sorted_tokens using
token_metadata::first_token().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:41:00 +03:00
Benny Halevy
f22a870a04 test: cluster: test_repair: add test_vnode_keyspace_describe_ring
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:39:40 +03:00
Yaniv Michael Kaul
b75799c21c skip instead of xfail test_change_replication_factor_1_to_0
It's a waste of good machine time to xfail this rather than just skip.
It takes >3m just to run the test and xfail.
We have a marker for it, we know why we skip it.

Fixes: https://github.com/scylladb/scylladb/issues/25310
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#25311
2025-08-13 10:32:22 +02:00
Ernest Zaslavsky
380c73ca03 s3_client: make memory semaphore acquisition abortable
Add `abort_source` to the `get_units` call for the memory semaphore
in the S3 client, allowing the acquisition process to be aborted.

Fixes: https://github.com/scylladb/scylladb/issues/25454

Closes scylladb/scylladb#25469
2025-08-13 08:48:55 +03:00
Jenkins Promoter
2de91d43d5 Update pgo profiles - x86_64 2025-08-13 07:52:17 +03:00
Jenkins Promoter
647d9fe45d Update pgo profiles - aarch64 2025-08-13 07:43:38 +03:00
Dario Mirovic
2ac37b4fde docs/cql: update documentation for default replication strategy
Update create-keyspace-statement section of ddl.rst since `class` is no longer mandatory.
Add an example for keyspace creation without specifying `class`.

Refs: #16029
2025-08-13 01:52:00 +02:00
Dario Mirovic
ef63d343ba test/cqlpy: add keyspace creation default strategy test
Add a test case for create keyspace default replication strategy.
It is expected that the default replication strategy is `NetworkTopologyStrategy`.

Refs: #16029
2025-08-13 01:52:00 +02:00
Dario Mirovic
bc8bb0873d cql3: add default replication strategy to create_keyspace_statement
When creating a new keyspace, both replication strategy and replication
factor must be stated. For example:
`CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };`

This syntax is verbose, and in all but some testing scenarios
`NetworkTopologyStrategy` is used.

This patch allows skipping replication strategy name, filling it with
`NetworkTopologyStrategy` when that happens. The following syntax is now
valid:
`CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };`
and will give the same result as the previous, more explicit one.
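
The defaulting rule can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not the cql3 code: `with_default_strategy` is a hypothetical name and the replication options are represented as a plain dict.

```python
DEFAULT_STRATEGY = "NetworkTopologyStrategy"

def with_default_strategy(replication: dict) -> dict:
    """Return replication options with 'class' filled in when it is omitted."""
    options = dict(replication)  # do not modify the caller's dict
    options.setdefault("class", DEFAULT_STRATEGY)
    return options
```

Under this model, `{'replication_factor': 3}` and `{'class': 'NetworkTopologyStrategy', 'replication_factor': 3}` produce the same options, while an explicitly stated class is left untouched.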

Fixes #16029
2025-08-13 01:51:53 +02:00
Botond Dénes
72b2bbac4f pgo/pgo.py: use tablet repair API for repair
Since a1d7722 tablet keyspaces are not allowed to be repaired via the
old /storage_service/repair_async/{keyspace} API, instead the new
/storage_service/tablets/repair API has to be used. Adjust the repair
code and also add await_completion=true: the script just waits
for the repair to finish immediately after starting it.

Closes scylladb/scylladb#25455
2025-08-12 20:32:19 +03:00
Petr Gusev
ff89c03c7f exceptions: add constructors that accept explicit error messages
To improve debuggability, we need to propagate original error messages
from Paxos verbs to the user. This change adds constructors that take
an error message directly, enabling better error reporting.

Additionally, functions such as write_timeout_to_read,
write_failure_to_read etc are updated to use these message-based
constructors. These functions are used in storage_proxy::cas to
convert between different error types, and without this change,
they could lose the original error message during conversion.
2025-08-12 16:31:05 +02:00
Taras Veretilnyk
b7097b2993 database_test: fix abandoned futures in test_drop_quarantined_sstables
The lambda passed to do_with_cql_env_thread() in test_drop_quarantined_sstables
was mistakenly written as a coroutine.
This change replaces co_await with .get() calls on futures
and changes lambda return type to void.

Fixes scylladb/scylladb#25427

Closes scylladb/scylladb#25431
2025-08-12 13:31:06 +03:00
Patryk Jędrzejczak
a1b2f99dee Merge 'test: test_mv_backlog: fix to consider internal writes' from Michael Litvak
The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics.

The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match.
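
The class of bug can be illustrated with a small Python sketch over Prometheus-style text lines. This is not the ScyllaMetrics code: `metric_total` is a hypothetical function, and summing the matching values is an assumption made for the example.

```python
import re

# name{label="value",...} number
LINE = re.compile(r'^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def metric_total(lines, name, shard=None):
    """Sum the values of all lines matching `name` (and `shard`, if given)."""
    total = 0.0
    for line in lines:
        m = LINE.match(line.strip())
        if m is None or m.group("name") != name:
            continue
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels") or ""))
        if shard is not None and labels.get("shard") != str(shard):
            continue
        # Keep iterating: stopping after the first match would drop
        # other lines of the same shard that differ in other labels.
        total += float(m.group("value"))
    return total
```

A shard can have several lines for the same metric name that differ only in other labels, so returning after the first match undercounts.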

The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time.

Fixes https://github.com/scylladb/scylladb/issues/23139

backport to improve CI stability - test only change

Closes scylladb/scylladb#25279

* https://github.com/scylladb/scylladb:
  test: test_mv_backlog: fix to consider internal writes
  test/pylib/rest_client: fix ScyllaMetrics filtering
2025-08-12 10:05:15 +02:00
Wojciech Przytuła
7600ccfb20 Fix link to ScyllaDB manual
The link would point to outdated OS docs. I fixed it to point to up-to-date Enterprise docs.

Closes scylladb/scylladb#25328
2025-08-12 10:33:06 +03:00
Avi Kivity
ac1f6aa0de auth: resource: simplify some range transformations
Supply the member function directly to std::views::transform,
rather than going through a lambda.

Closes scylladb/scylladb#25419
2025-08-12 10:30:06 +03:00
Karol Nowacki
22a133df9b service/vector_store_client: Add live configuration update support
Enable runtime updates of vector_store_uri configuration without
requiring server restart.
This allows dynamically enabling, disabling, or switching the vector search node endpoint on the fly.
2025-08-12 08:12:53 +02:00
Karol Nowacki
152274735e test/boost/vector_store_client_test.cc: Refactor vector store client test
Consolidate consecutive setup functions into a dedicated helper.
Extract test table creation into a separate function.
Remove redundant assertions to improve clarity.
2025-08-12 08:12:53 +02:00
Karol Nowacki
858c423501 service/vector_store_client: Refactor host_port struct created
This new struct groups the host and port.
2025-08-12 08:12:53 +02:00
Karol Nowacki
dd147cd8e5 service/vector_store_client: Refactor HTTP request creation
Introduce lightweight wrapper for seastar::http::experimental::client
This wrapper simplifies request creation by automatically injecting the host name.
2025-08-12 08:12:53 +02:00
Tomasz Grabiec
9fd312d157 Merge 'row_cache: add memtable overlap checks elision optimization for tombstone gc' from Botond Dénes
https://github.com/scylladb/scylladb/issues/24962 introduced memtable overlap checks to cache tombstone GC. This was observed to be very strict and greatly reduce the effectiveness of tombstone GC in the cache, especially for MV workloads, which regularly recycle old timestamp into new writes, so the memtable often has smaller min live timestamp than the timestamp of the tombstones in the cache.

When creating a new memtable, save a snapshot of the tombstone gc state. This snapshot is used later to exclude this memtable from overlap checks for tombstones, whose token have an expiry time larger than that of the tombstone, meaning: all writes in this memtable were produced at a point in time when the current tombstone has already expired. This has the following implications:
* The partition the tombstone is part of was already repaired at the time the memtable was created.
* All writes in the memtable were produced *after* this tombstone's expiry time, these writes cannot be possibly relevant for this tombstone.

Based on this, such memtables are excluded from the overlap checks. With adequately frequent memtable flushes -- so that the tombstone gc state snapshot is refreshed -- most memtables should be excluded from overlap checks, greatly helping the cache's tombstone GC efficiency.

Fixes: https://github.com/scylladb/scylladb/issues/24962

Fixes a regression introduced by https://github.com/scylladb/scylladb/pull/23255 which was backported to all releases, needs backport to all releases as well

Closes scylladb/scylladb#25033

* github.com:scylladb/scylladb:
  docs/dev/tombstone.md: document the memtable overlap check elision optimization
  test/boost/row_cache_test: add test for memtable overlap check elision
  db/cache_mutation_reader: obtain gc-before and min-live-ts lazily
  mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result
  db/cache_mutation_reader: use max_purgeable::can_purge()
  replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine()
  replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
  compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
  replica/table: propagate gc_state to memtable_list
  replica/memtable_list: add tombstone_gc_state* member
  replica/memtable: add tombstone_gc_state_snapshot
  tombstone_gc: introduce tombstone_gc_state_snapshot
  tombstone_gc: extract shared state into shared_tombstone_gc_state
  tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value
  tombstone_gc: fold get_group0_gc_time() into its caller
  tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time()
  tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time()
  tombstone_gc: refactor get_or_greate_repair_history_for_table()
  replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
  db/read_context: return max_purgeable from get_max_purgeable()
  compaction/compaction_garbage_collector: add formatter for max_purgeable
  mutation: move definition of gc symbols to compaction.cc
  compaction/compaction_garbage_collector: refactor max_purgeable into a class
  test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
  test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
  test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
2025-08-11 23:54:59 +02:00
Michał Chojnowski
3017dbb204 sstables/trie: add trie traversal routines
`trie::node_reader`, added in a previous series, contains
encoding-aware logic for traversing a single node
(or a batch of nodes) during a trie search.

This commit adds encoding-agnostic functions which drive
the `trie::node_reader` in a loop to traverse the whole branch.

Together, the added functions (`traverse`, `step`, `step_back`)
and the data structure they modify (`ancestor_trail`) constitute
a trie cursor. We might later wrap them into some `trie_cursor`
class, but regardless of whether we are going to do that,
keeping them (also) as free functions makes them easier to test.

Closes scylladb/scylladb#25396
2025-08-11 19:15:09 +03:00
Botond Dénes
660ea9202a docs/dev/tombstone.md: document the memtable overlap check elision optimization 2025-08-11 17:20:12 +03:00
Botond Dénes
65c770f21a test/boost/row_cache_test: add test for memtable overlap check elision 2025-08-11 17:20:12 +03:00
Botond Dénes
7adbb1bd17 db/cache_mutation_reader: obtain gc-before and min-live-ts lazily
Obtaining the gc-before time, or the min-live timestamps (with the
expiry threshold) is not always trivial, so defer it until we know it is
needed. Not all reads will attempt to garbage-collect tombstones, these
reads can now avoid this work.
The downside is that the partition key has to be copied and stored, as
it is necessary for obtaining the min-live timestamp later.
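
The deferral described above can be sketched as follows. This is a minimal illustration in Python rather than the C++ of the actual patch; the class name `LazyPerPartition` and the `compute` callback are hypothetical and only stand in for the lookup of the gc-before time or min-live timestamp:

```python
from typing import Callable, Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class LazyPerPartition(Generic[K, V]):
    """Defers an expensive per-partition lookup until it is first needed."""

    def __init__(self, partition_key: K, compute: Callable[[K], V]) -> None:
        # The key has to be stored, because the lookup happens later.
        self._partition_key = partition_key
        self._compute = compute
        self._value: Optional[V] = None
        self._computed = False

    def get(self) -> V:
        # Run the lookup at most once, and only if somebody asks for it.
        if not self._computed:
            self._value = self._compute(self._partition_key)
            self._computed = True
        return self._value  # type: ignore[return-value]
```

A read that never tries to garbage-collect a tombstone never calls `get()`, so it pays only for storing the key.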
2025-08-11 17:20:12 +03:00
Botond Dénes
f4b0c384fb mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result
Use the optimized can_purge() check instead of the old stricter
direct timestamp comparison method.
2025-08-11 17:20:12 +03:00
Botond Dénes
92e8d2f9b2 db/cache_mutation_reader: use max_purgeable::can_purge()
Use the optimized can_purge() check instead of the old stricter
direct timestamp comparison method.
2025-08-11 17:20:12 +03:00
Botond Dénes
4e15d32151 replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine()
To combine the max purgeable values, instead of just combining the
timestamp values. The former way is still correct, but loses the
timestamp explosion optimization, which allows the cache reader to drop
timestamps from the overlap checks.
2025-08-11 17:20:12 +03:00
Botond Dénes
bd32d41cad replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
Use the newly introduced expiry_treshold field of max_purgeable, to help
exclude memtables from the overlap check if possible.
2025-08-11 17:20:12 +03:00
Botond Dénes
cfac9691ff compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
Allow possibly avoiding overlap checks in the case where the source of
the min-live timestamp is known to only contain data which was written
*after* the expiry threshold. The expiry threshold is the upper bound of
tombstone.deletion_time that was already expired at the time of
obtaining this expiry threshold value. This means that any write originating
after this point in time was generated at a time when such a
tombstone was already expired. Hence these writes are not relevant for
the purposes of overlap checks with the tombstone and so their min-live
timestamp can be ignored.
This is important for MV workloads, where writes generated now can have
timestamps going far back in time, possibly blocking tombstone GC of
much older [shadowable] tombstones.
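
A minimal sketch of the check described above, written in Python for brevity. The names `MaxPurgeable`, `Tombstone` and `can_purge` follow the commit text, but the field layout, the use of `None` for "no threshold known" and the strictness of the comparisons are assumptions, not the actual C++ implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tombstone:
    timestamp: int      # write timestamp of the tombstone
    deletion_time: int  # wall-clock time at which the deletion happened


@dataclass(frozen=True)
class MaxPurgeable:
    timestamp: int                          # min-live timestamp of the source
    expiry_threshold: Optional[int] = None  # assumed: None means "unknown"

    def can_purge(self, t: Tombstone) -> bool:
        # Overlap check elision: every write in the source was generated
        # when this tombstone had already expired, so the source's
        # min-live timestamp is ignored.
        if self.expiry_threshold is not None and t.deletion_time < self.expiry_threshold:
            return True
        # Otherwise fall back to the plain timestamp comparison.
        return t.timestamp < self.timestamp
```

In the MV scenario from the message, a memtable may hold a fresh write with a very old timestamp (say 50) while the tombstone has timestamp 100. The plain comparison blocks the purge; with an expiry threshold later than the tombstone's deletion time, the memtable no longer blocks it.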
2025-08-11 17:20:11 +03:00
Patryk Jędrzejczak
e14c5e3890 Merge 'raft: enforce odd number of voters in group0' from Emil Maskovsky
raft: enforce odd number of voters in group0

Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and an odd number of voters is preferred because an even number doesn't add reliability but introduces
the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least three groups).
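
The arithmetic behind this argument can be sketched as below. This is an illustration in Python, not the group0 voter calculator itself; `enforce_odd` is a hypothetical helper that simply drops one voter when the candidate count is even:

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a quorum."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a quorum is still reachable."""
    return voters - majority(voters)


def enforce_odd(candidates: int) -> int:
    """Hypothetical helper: use one voter fewer when the count is even."""
    if candidates <= 0:
        return 0
    return candidates if candidates % 2 == 1 else candidates - 1


# Columns: candidates, majority, tolerated failures, voters after enforcement.
for n in range(1, 7):
    print(n, majority(n), tolerated_failures(n), enforce_odd(n))
```

Three and four voters both tolerate a single failure, but four voters can split 2/2, and neither half reaches the majority of 3.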

Fixes: scylladb/scylladb#23266

No backport: This is a new change that is to be only deployed in the new version, so it will not be backported.

Closes scylladb/scylladb#25332

* https://github.com/scylladb/scylladb:
  raft: enforce odd number of voters in group0
  test/raft: adapt test_tablets_lwt.py for odd voter number enforcement
  test/raft: adapt test_raft_no_quorum.py for odd voter enforcement
2025-08-11 15:44:21 +02:00
Benny Halevy
23ac80fc6b utils: stall_free: detect clear_gently method of const payload types
Currently, when a container or smart pointer holds a const payload
type, utils::clear_gently does not detect the object's clear_gently
method as the method is non-const and requires a mutable object,
as in the following example in class tablet_metadata:
```
    using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>;
    using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>;
```

That said, when a container is cleared gently the elements it holds
are destroyed anyhow, so we'd like to allow to clear them gently before
destruction.

This change still doesn't allow directly calling utils::clear_gently
on const objects.

Also add respective unit tests.

Fixes #24605

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-11 14:22:01 +03:00
Benny Halevy
cb9db2f396 utils: stall_free: clear gently a foreign shared ptr only when use_count==1
Unlike clear_gently of SharedPtr, clear_gently of a
`foreign_ptr<shared_ptr<T>>` calls clear_gently on the contained object
even if it's still shared and may still be in use.

This change examines the foreign shared pointer's use_count
and calls clear_gently on the shared object only when
its use_count reaches 1.

Fixes #25026

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-11 14:21:32 +03:00
Tomasz Grabiec
f7c001deff Merge 'key: clustering_bounds_comparator: avoid thread_local initialization guard overhead' from Avi Kivity
I noticed clustering_bounds_comparator was running an unnecessary
thread_local initialization guard. This series switches the variable
to constinit initialization, removing the guard.

Performance measurements (perf-simple-query) show an unimpressive
20 instruction per op reduction. However, each instruction counts!

Before:

```
throughput:
	mean=   203642.54 standard-deviation=1102.99
	median= 204328.69 median-absolute-deviation=955.56
	maximum=204624.13 minimum=202222.19
instructions_per_op:
	mean=   42097.59 standard-deviation=40.07
	median= 42111.83 median-absolute-deviation=30.65
	maximum=42139.88 minimum=42044.91
cpu_cycles_per_op:
	mean=   22664.81 standard-deviation=131.28
	median= 22581.10 median-absolute-deviation=111.57
	maximum=22832.30 minimum=22553.24
```

After:

```
throughput:
	mean=   204397.73 standard-deviation=2277.71
	median= 204942.95 median-absolute-deviation=2191.54
	maximum=207588.30 minimum=202162.80
instructions_per_op:
	mean=   42087.21 standard-deviation=27.30
	median= 42092.75 median-absolute-deviation=20.33
	maximum=42108.33 minimum=42041.51
cpu_cycles_per_op:
	mean=   22589.79 standard-deviation=219.24
	median= 22544.82 median-absolute-deviation=191.98
	maximum=22835.11 minimum=22303.52
```

(Very) minor performance improvement, no backport suggested.

Closes scylladb/scylladb#25259

* github.com:scylladb/scylladb:
  keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit
  keys: make empty creation clustering_key_prefix constexpr
  managed_bytes: make empty managed_bytes constexpr friendly
  keys: clustering_bounds_comparator: make _empty_prefix a prefix
2025-08-11 13:20:38 +02:00
Anna Stuchlik
1322f301f6 doc: add support for RHEL 10
This commit adds RHEL 10 to the list of supported platforms.

Fixes https://github.com/scylladb/scylladb/issues/25436

Closes scylladb/scylladb#25437
2025-08-11 13:13:37 +02:00
Israel Fruchter
2da26d1fc1 Update tools/cqlsh submodule (v6.0.26)
* tools/cqlsh 02ec7c57...aa1a52c1 (6):
  > build-push.yaml: upgrade cibuildwheel to latest
  > build-push.yml: skip python 3.8 and PyPy builds
  > cqlshlib: make NetworkTopologyStrategy default for autocomplete
  > default to setuptools_scm based version when not packaged
  > chore(deps): update pypa/cibuildwheel action to v2.23.0

Closes scylladb/scylladb#25420
2025-08-11 13:07:47 +03:00
Artsiom Mishuta
dac04a5b97 fix(test.py) incorrect markers argument in boost tests
The pytest markers argument can contain spaces, e.g. "not unstable".
To pass such an argument properly in a CLI (bash) command, we should wrap it
in double quotes, because the command is split with shlex.split, which
separates on spaces. Although we do not support markers in C++ tests, we
still pass all pytest arguments.

tested locally on command:
./tools/toolchain/dbuild  ./test.py --markers="not unstable" test/boost/auth_passwords_test.cc

before change: no tests ran in 1.12s
after: 8 passed in 2.45s
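
The effect of the quoting can be reproduced with `shlex.split` alone. The command strings below are made up for illustration and are not the ones test.py builds:

```python
import shlex

# Without quotes the marker expression is split into two arguments.
unquoted = shlex.split('pytest -m not unstable')

# With double quotes it survives as a single argument.
quoted = shlex.split('pytest -m "not unstable"')

# shlex.quote() produces a safely quoted form of an arbitrary value.
rebuilt = shlex.split(f"pytest -m {shlex.quote('not unstable')}")

print(unquoted)  # ['pytest', '-m', 'not', 'unstable']
print(quoted)    # ['pytest', '-m', 'not unstable']
print(rebuilt)   # ['pytest', '-m', 'not unstable']
```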

Closes scylladb/scylladb#25394
2025-08-11 10:43:34 +03:00
Patryk Jędrzejczak
7b77c6cc4a docs: Raft recovery procedure: recommend verifying participation in Raft recovery
This instruction adds additional safety. The faster we notice that
a node didn't restart properly, the better.

The old gossip-based recovery procedure had a similar recommendation
to verify that each restarting node entered `RECOVERY` mode.

Fixes #25375

This is a documentation improvement. We should backport it to all
branches with the new recovery procedure, so 2025.2 and 2025.3.

Closes scylladb/scylladb#25376
2025-08-11 09:21:29 +03:00
Avi Kivity
f49b63f696 tools: toolchain: dbuild: forward container registry credentials
Docker hub rate-limits unauthenticated image pulls, so forward
the host's credentials to the container. This prevents rate limit
errors when running nested containers.

Try the locations for the credentials in order and bind-mount the
first that exists to a location that gets picked up.
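
The "first that exists" selection can be sketched as follows. dbuild is a shell script, so this Python version is only an illustration of the logic; the function names are hypothetical, and both the candidate list and the mount target are supplied by the caller because the message does not name them:

```python
import os
from typing import Iterable, List, Optional


def first_existing(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that is an existing file, if any."""
    for path in candidates:
        expanded = os.path.expanduser(os.path.expandvars(path))
        if os.path.isfile(expanded):
            return expanded
    return None


def bind_mount_args(candidates: Iterable[str], target: str) -> List[str]:
    """Build a read-only bind-mount option, or nothing if no file was found."""
    source = first_existing(candidates)
    if source is None:
        return []
    return ["-v", f"{source}:{target}:ro"]
```

When no credentials file exists, the argument list is empty and the container runs unauthenticated as before.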

Verified with `podman login --get-login docker.io` in the container.

Closes scylladb/scylladb#25354
2025-08-11 09:05:57 +03:00
Botond Dénes
3b1f414fcf replica/table: propagate gc_state to memtable_list 2025-08-11 07:09:19 +03:00
Botond Dénes
9d00d7e08d replica/memtable_list: add tombstone_gc_state* member
To be passed down to the memtable.
2025-08-11 07:09:19 +03:00
Botond Dénes
ef8a21b4cf replica/memtable: add tombstone_gc_state_snapshot
To be used for possibly excluding the memtable from overlap checks with
the cache/sstables, in memtable_list::get_max_purgeable().
2025-08-11 07:09:19 +03:00
Botond Dénes
ab633590f1 tombstone_gc: introduce tombstone_gc_state_snapshot
Returns gc-before times, identical to what tombstone_gc_state would have
returned at the point of taking the snapshot.
2025-08-11 07:09:14 +03:00
Botond Dénes
614d17347a tombstone_gc: extract shared state into shared_tombstone_gc_state
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
2025-08-11 07:09:14 +03:00
Botond Dénes
3a54379330 tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value
No reason for it to be a shared pointer, or even a pointer at all. When
the pointer is not initialized, gc_clock::time_point::min() is used as
the group0 gc time, so we can just replace with a gc_clock::time_point
value initialized to min() and do away with an unnecessary indirection
as well as an allocation. This latter will be even more important after
the next patches.
2025-08-11 07:09:14 +03:00
Botond Dénes
aa43396aac tombstone_gc: fold get_group0_gc_time() into its caller
It has just one caller. This fold makes the code simpler and facilitates
further patching.
2025-08-11 07:09:14 +03:00
Botond Dénes
faa2b5b4d4 tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time()
Its only caller. Makes the code simpler and facilitates further
patching.
2025-08-11 07:09:13 +03:00
Botond Dénes
e9d211bbcd tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time()
Its only caller. Makes the code simpler and facilitates further
patching.
2025-08-11 07:09:13 +03:00
Botond Dénes
b9f0cabead tombstone_gc: refactor get_or_create_repair_history_for_table()
This method has 3 lookups into the reconcile history maps in the worst
case. Reduce to just one. Makes the code more streamlined and prepares
the groundwork for the next patch.
2025-08-11 07:09:13 +03:00
Botond Dénes
1d3a3163a3 replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
Also change the return type to max_purgeable, instead of a raw
timestamp. Prepares for further patching of this code.
2025-08-11 07:09:13 +03:00
Botond Dénes
5d69ef5e8b db/read_context: return max_purgeable from get_max_purgeable()
Instead of just the timestamp. Soon more fields will be used.
2025-08-11 07:09:13 +03:00
Botond Dénes
1d2cc6ef12 compaction/compaction_garbage_collector: add formatter for max_purgeable
It is more than just a timestamp already, and it is about to receive
some additional fields.
2025-08-11 07:09:13 +03:00
Botond Dénes
6078c15116 mutation: move definition of gc symbols to compaction.cc
We are used to symbol definitions being grouped in one .cc file, but a
symbol's declaration and definition living in separate modules
(subfolders) is surprising.
Relocate always_gc, never_gc, can_always_purge and can_never_purge to
compaction/compaction.cc, from mutation/mutation_partition.cc. The
declarations of these symbols are in
compaction/compaction_garbage_collector.hh.
2025-08-11 07:09:13 +03:00
Botond Dénes
ef7d49cd21 compaction/compaction_garbage_collector: refactor max_purgeable into a class
Make members private, add getters and constructors.
This struct will get more functionality soon, so class is a better fit.
2025-08-11 07:09:13 +03:00
Botond Dénes
c150bdd59c test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
This test currently uses gc_grace_seconds=0. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
2025-08-11 07:09:13 +03:00
Botond Dénes
c052f2ad1d test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
This test will soon need to be changed to use tombstone-gc=repair. This
cannot work as of now, as the test uses a single-node cluster.
The options are the following:
* Make it use more than one node
* Make repair work with single node clusters
* Rewrite in C++ where repair can be done synthetically

We chose the last option, it is the simplest one both in terms of code
and runtime footprint.

The new test is in test/boost/row_cache_test.cc
Two changes were made during the migration:
* Change the name to
  test_populating_reader_tombstone_gc_with_data_in_memtable
  to better express which cache component this test is targeting;
* Use NullCompactionStrategy on the table instead of disabling
  auto-compaction.
2025-08-11 07:09:13 +03:00
Botond Dénes
e4c048ada1 test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
These tests currently use tombstone-gc=immediate. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
2025-08-11 07:09:13 +03:00
Asias He
2ecd42f369 feature_service: Add TABLET_INCREMENTAL_REPAIR feature 2025-08-11 10:10:08 +08:00
Asias He
b226ad2f11 tablet_allocator: Add tablet_force_tablet_count_increase and decrease
It is useful to increase and decrease the tablet count in the test for
tablet split and merge testing.
2025-08-11 10:10:08 +08:00
Asias He
1bf59ebba0 repair: Add incremental helpers
This adds the helpers which are needed by both repair and compaction to
add incremental repair support.
2025-08-11 10:10:08 +08:00
Asias He
b86f554760 sstable: Add being_repaired to sstable
This in-memory field is set by incremental repair when the sstable
participates in the repair.
2025-08-11 10:10:08 +08:00
Asias He
f50cd94429 sstables: Add set_repaired_at to metadata_collector 2025-08-11 10:10:08 +08:00
Asias He
ac9d33800a mutation_compactor: Introduce add operator to compaction_stats
It is needed to combine two compactions.
2025-08-11 10:10:07 +08:00
Asias He
5377f87e5a tablet: Add sstables_repaired_at to system.tablets table
It is used to store the repaired_at for each tablet.
2025-08-11 10:10:07 +08:00
Asias He
8db18ac74e test: Fix drain api in task_manager_client.py
The POST method should be used.
2025-08-11 10:10:07 +08:00
Avi Kivity
6daa6178b1 scripts: pull_github_pr.sh: reject unintended submodule changes
It is easy for submodule changes to slip through during rebase (if
the developer uses the terrible `git add -u`  command) and
for a maintainer to miss it (if they don't go over each change after
a rebase).

Protect against such mishaps by checking if a submodule was updated
(or .gitmodules itself was changed) and aborting the operation.

If the pull request title contains "submodule", assume the operation
was intended.

Allow bypassing the check with --allow-submodule.
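
The decision logic described above can be sketched as follows. The real check lives in a shell script; this Python function, its name and its parameters are hypothetical, and the case-insensitive title match is an assumption:

```python
from typing import Iterable, List


def unintended_submodule_changes(
    changed_paths: Iterable[str],
    submodule_paths: Iterable[str],
    pr_title: str,
    allow_submodule: bool = False,
) -> List[str]:
    """Return the changed paths that should abort the pull, if any."""
    # An explicit override or a title mentioning "submodule" means the
    # change is assumed to be intended.
    if allow_submodule or "submodule" in pr_title.lower():
        return []
    # Watch every submodule path plus .gitmodules itself.
    watched = set(submodule_paths) | {".gitmodules"}
    return sorted(p for p in changed_paths if p in watched)
```

A non-empty result means the operation should be aborted; an empty list lets it proceed.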

Closes scylladb/scylladb#25418
2025-08-10 11:48:34 +03:00
Michael Litvak
276a09ac6e test: test_mv_backlog: fix to consider internal writes
The test executes a single write, fetching metrics before and after the
write, and expects the total throttled writes count to be increased
exactly by one.

However, other internal writes (compaction for example) may be executed
during this time and be throttled, causing the metrics to be increased
by more than expected.

To address this, we filter the metrics by the scheduling group label of
the user write, to filter out the compaction writes that run in the
compaction scheduling group.

Fixes scylladb/scylladb#23139
2025-08-10 10:31:02 +02:00
Michael Litvak
5c28cffdb4 test/pylib/rest_client: fix ScyllaMetrics filtering
In the ScyllaMetrics `get` function, when requesting the value for a
specific shard, it is expected to return the sum of all values of
metrics for that shard that match the labels.

However, it would return the value of the first matching line it finds
instead of summing all matching lines.

For example, if we have two lines for one shard like:
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2

The result of this call would be 1 instead of 3:
get('some_metric', shard="0")

We fix this to sum all matching lines.

The filtering of lines by labels is fixed to allow specifying only some
of the labels. Previously, for a line to match the filter, either the
filter had to be empty, or all the labels in the metric line had to be
specified in the filter parameter and match its value, which is
unexpected and breaks when more labels are added.

We also simplify the function signature and the implementation - instead
of having the shard as a separate parameter, it can be specified as a
label, like any other label.
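
The fixed behaviour can be sketched as below. This is a simplified stand-in, not the code in test/pylib/rest_client.py: the free function `get`, its signature and the regular expressions are assumptions made for illustration:

```python
import re
from typing import Iterable, Optional

METRIC_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][\w:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def get(lines: Iterable[str], name: str, **labels: str) -> Optional[float]:
    """Sum all samples of `name` whose labels include every requested label."""
    total = None
    for line in lines:
        m = METRIC_LINE.match(line.strip())
        if m is None or m.group('name') != name:
            continue  # comments, blank lines and other metrics
        line_labels = dict(LABEL.findall(m.group('labels') or ''))
        # Only the requested labels have to match; extra labels are ignored.
        if all(line_labels.get(k) == v for k, v in labels.items()):
            total = (total or 0.0) + float(m.group('value'))
    return total
```

With the two sample lines from the message, `get(lines, 'some_metric', shard="0")` sums both and yields 3.0, while adding `scheduling_group_name="sl:default"` narrows the result to 2.0.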
2025-08-10 10:16:00 +02:00
Avi Kivity
c2a2e11c40 Merge 'Prepare the way for incremental repair' from Botond Dénes
With incremental repair, each replica::compaction_group will have 3 logical compaction groups, repaired, repairing and unrepaired. The definition of group is a set of sstables that can be compacted together. The logical groups will share the same instance of sstable_set, but each will have its own logical sstable set. Existing compaction::table_state is a view for a logical compaction group. So it makes sense that each replica::compaction_group will have multiple views. Each view will provide to compaction layer only the sstables that belong to it. That way, we preserve the existing interface between replica and compaction layer, where each compaction::table_state represents a single logical group.
The idea is that all the incremental repair knowledge is confined to the repair and replica layers. Compaction doesn't need to know about it: it just works on logical groups, and what each one represents doesn't matter from the perspective of the subsystem. This is the best way forward to not violate layers and reduce the maintenance burden in the long run.
We also proceed to rename table_state to compaction_group_view, since it's a better description. Working with multiple terms is confusing. A placeholder for implementing the sstable classifier is also left in tablet_storage_group_manager; for the time being, all sstables will go to the unrepaired logical set, which preserves the current behavior.
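
The relationship between the shared sstable set and its logical views can be sketched as follows. This is a Python illustration only; `CompactionGroup`, `RepairState` and `placeholder_classifier` are hypothetical names, not the classes in the replica layer:

```python
from enum import Enum
from typing import Callable, List


class RepairState(Enum):
    REPAIRED = "repaired"
    REPAIRING = "repairing"
    UNREPAIRED = "unrepaired"


class CompactionGroup:
    """Owns one shared sstable list and exposes one logical view per state."""

    def __init__(self, classify: Callable[[str], RepairState]) -> None:
        self._sstables: List[str] = []  # shared by all views
        self._classify = classify

    def add(self, sstable: str) -> None:
        self._sstables.append(sstable)

    def view(self, state: RepairState) -> List[str]:
        # A view hands the compaction layer only the sstables of its group.
        return [s for s in self._sstables if self._classify(s) is state]


def placeholder_classifier(sstable: str) -> RepairState:
    # Mirrors the placeholder described above: everything is unrepaired.
    return RepairState.UNREPAIRED
```

With the placeholder classifier the repaired and repairing views stay empty, so compaction sees exactly the same sstables as before.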

New functionality, no backport required

Closes scylladb/scylladb#25287

* github.com:scylladb/scylladb:
  test: Add test that compaction doesn't cross logical group boundary
  replica: Introduce views in compaction_group for incremental repair
  compaction: Allow view to be added with compaction disabled
  replica: Futurize retrieval of sstable sets in compaction_group_view
  treewide: Futurize estimation of pending compaction tasks
  replica: Allow compaction_group to have more than one view
  Move backlog tracker to replica::compaction_group
  treewide: Rename table_state to compaction_group_view
  tests: adjust for incremental repair
2025-08-09 17:21:17 +03:00
Emil Maskovsky
7c54401d3d raft: enforce odd number of voters in group0
Implement odd number voter enforcement in the group0 voter calculator to
ensure proper Raft consensus behavior. Raft consensus requires a majority
of voters to make decisions, and an odd number of voters is preferred
because an even number doesn't add reliability but introduces
the risk of scenarios where no group can make progress. If an even number
of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to
commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least
three groups).

Fixes: scylladb/scylladb#23266
2025-08-08 19:49:20 +02:00
Emil Maskovsky
29ddb2aa18 test/raft: adapt test_tablets_lwt.py for odd voter number enforcement
The test_lwt_timeout_while_creating_paxos_state_table was failing after
implementing odd number voter enforcement in the group0 voter calculator.

Previously with 2 nodes:
- 2 nodes → 2 voters → stop 1 node → 1/2 voters (no quorum) → expected Raft timeout

With odd voter count enforcement:
- 2 nodes → 1 voter → stop 1 node → 0/1 voters → Cassandra availability error

This change updates the test to use 3 nodes instead of 2, ensuring proper
no-quorum scenarios:
- 3 nodes → 3 voters → stop 2 nodes → 1/3 voters (no quorum) → Raft timeout

The test now correctly validates LWT timeout behavior while being compatible
with the odd number voter enforcement requirement.
2025-08-08 19:49:10 +02:00
Emil Maskovsky
7fc75aff3e test/raft: adapt test_raft_no_quorum.py for odd voter enforcement
Update the no-quorum cluster tests to work correctly with the new odd
number voter enforcement in the group0 voter calculator. The tests now
properly account for the changed voter counts when validating no-quorum
scenarios.
2025-08-08 19:48:58 +02:00
Anna Stuchlik
f3d9d0c1c7 doc: add new and removed metrics to the 2025.3 upgrade guide
This commit adds the list of new and removed metrics to the already existing upgrade guide
from 2025.2 to 2025.3.

Fixes https://github.com/scylladb/scylladb/issues/24697

Closes scylladb/scylladb#25385
2025-08-08 13:25:51 +02:00
Avi Kivity
ab45a0edb5 Update seastar submodule
* seastar 60b2e7da...1520326e (36):
  > Merge 'http/client: Fix content length body overflow check (and a bit more)' from Pavel Emelyanov
    test/http: Add test for http_content_length_data_sink
    test/http: Implement some missing methods for memory data sink
    http/client: Fix content length body overflow check
    http/client: Fix misprint in overflow exception message
  > dns: Use TCP connection data_sink directly
  > iostream: Update "used stream" check for output_stream::detach()
  > Update dpdk submodule
  > rpc: server::process: coroutinize
  > iostream: Remove deprecated constructor
  > Merge 'foreign_ptr: add unwrap_on_owner_shard method' from Benny Halevy
    foreign_ptr: add unwrap_on_owner_shard method
    foreign_ptr: release: check_shard with SEASTAR_DEBUG_SHARED_PTR
  > enum: Replace static_assert() with concept
  > rpc: reindent connection::negotiate()
  > rpc: client: use structured binding
  > rpc.cc: reindent
  > queue: Remove duplicating static assertion
  > Merge 'rpc: client: convert main loop to a coroutine' from Avi Kivity
    rpc: client::loop(): restore indentation
    rpc: client: coroutinize client::loop()
    rpc: client: split main loop function
  > Merge 'treewide: replace remaining std::enable_if with constraints' from Avi Kivity
    optimized_optional: replace std::enable_if with constraint
    log: replace std::enable_if with constraint
    rpc: replace std::enable_if with constraint
    when_all: replace std::enable_if with constraints
    transfer: replace std::enable_if with constraints
    sstring: replace std::enable_if with constraint
    simple-stream: replace std::enable_if with constraints
    shared_ptr: replace std::enable_if with constraints
    sharded: replace std::enable_if with constraints for sharded_has_stop
    sharded: replace std::enable_if with constraints for peering_sharded_service
    scollectd: replace std::enable_if with constraints for type inference
    scollectd: replace std::enable_if with constraints for ser/deser
    metrics: replace std::enable_if with constraints
    chunked_fifo: replace std::enable_if with constraint
    future: replace std::enable_if with constraints
  > websocket: Avoid sending scattered_message to output_stream
  > websocket: Remove unused scattered_message.hh inclusion
  > aio: Squash aio_nowait_supported into fs_info::nowait_works
  > Merge 'reactor: coroutinize spawn()' from Avi Kivity
    reactor: restore indentation for spawn()
    reactor: coroutinize spawn()
  > modules: export coroutine facilities
  > Merge 'reactor: coroutinize some file-related functions' from Avi Kivity
    reactor: adjust indentation
    reactor: coroutinize reactor::make_pipe()
    reactor: coroutinize reactor::inotify_add_watch()
    reactor: coroutinize reactor::read_directory()
    reactor: coroutinize reactor::file_type()
    reactor: coroutinize reactor::chmod()
    reactor: coroutinize reactor::link_file()
    reactor: coroutinize reactor::rename_file()
    reactor: coroutinize open_file_dma()
  > memory: inline disable_abort_on_alloc_failure_temporarily
  > Merge 'addr2line timing and optimizations' from Travis Downs
    addr2line: add basic timing support
    addr2line: do a quick check for 0x in the line
    addr2line: don't load entire file
    addr2line: typing fixing
  > posix: Replace static_assert with concept
  > tls: Push iovec with the help of put(vector<temporary_buffer>)
  > io_queue: Narrow down friendship with reactor
  > util: drop concepts.hh
  > reactor: Re-use posix::to_timespec() helper
  > Fix incorrect defaults for io queue iops/bandwidth
  > net: functions describing ssl connection
  > Add label values to the duplicate metrics exception
  > Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov
    test: Add unit test for cross-sched-groups wakeups
    test: Add unit test for fair CPU scheduling
    test: Add unit test for basic supergrops manipulations
    test: Add perf test for context switch latency
    scheduling: Add an internal method to get group's supergroup
    reactor: Add supergroup get_shares() API
    reactor: Add supergroup::set_shares() API
    reactor: Create scheduling groups in supergroups
    reactor: Supergroups destroying API
    reactor: Supergroups creating API
    reactor: Pass parent pointer to task_queue from caller
    reactor: Wakeup queue group on child activation
    reactor: Add pure virtual sched_entity::run_tasks() method
    reactor: Make task_queue_group be sched_entity too
    reactor: Split task_queue_group::run_some_tasks()
    reactor: Count and limit supergroup children
    reactor: Link sched entity to its parent
    reactor: Switch activate(task_queue*) to work on sched_entity
    reactor: Move set_shares() to sched_entity()
    reactor: Make account_runtime() work with sched_entity
    reactor: Make insert_activating_task_queue() work on sched_entity
    reactor: Make pop_active_task_queue() work on sched_entity
    reactor: Make insert_active_task_queue() work on sched_entity
    reactor: Move timings to sched_entity
    reactor: Move active bit to sched_entity
    reactor: Move shares to sched_entity
    reactor: Move vruntime to sched_entity
    reactor: Introduce sched_entity
    reactor: Rename _activating_task_queues -> _activating
    reactor: Remove local atq* variable
    reactor: Rename _active_task_queues -> _active
    reactor: Move account_runtime() to task_queue_group
    reactor: Move vruntime update from task_queue into _group
    reactor: Simplify task_queue_group::run_some_tasks()
    reactor: Move run_some_tasks() into task_queue_group
    reactor: Move insert_activating_task_queues() into task_queue_group
    reactor: Move pop_active_task_queue() into task_queue_group
    reactor: Move insert_active_task_queue() into task_queue_group
    reactor: Introduce and use task_queue_group::activate(task_queue)
    reactor: Introduce task_queue_group::active()
    reactor: Wrap scheduling fields into task_queue_group
    reactor: Simplify task_queue::activate()
    reactor: Rename task_queue::activate() -> wakeup()
    reactor: Make activate() method of class task_queue
    reactor: Make task_queue::run_tasks() return bool
    reactor: Simplify task_queue::run_tasks()
    reactor: Make run_tasks() method of class task_queue
  > Fix hang in io_queue for big write ioproperties numbers
  > split random io buffer size in 2 options
  > reactor: document run_in_background
  > Merge 'Add io_queue unit test for checking request rates' from Robert Bindar
    Add unit test for validating computed params in io_queue
    Move `disk_params` and `disk_config_params` to their own unit
    Add an overload for `disk_config_params::generate_config`

Closes scylladb/scylladb#25404
2025-08-08 12:24:39 +03:00
Benny Halevy
49e3b2827f streaming: stream_blob: use the table sstable_generation_generator
No need to start a local generator.
Can just use the table's sstable generation generator
to make new sstables now that it's stateless and doesn't
depend on the highest generation found.

Note that tablet_stream_files_handler used uuid generations
unconditionally from inception
(4018dc7f0d).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
de8a199f79 replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
No need to start a local sharded generator.
Can just use the table's sstable generation generator
to make new sstables now that it's stateless and doesn't
depend on the highest generation found (including the uploaded
sstables).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
13f4e27cb9 sstables: sstable_generation_generator: stop tracking highest generation
It is unused by now.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
0a20834d2a replica: table: get rid of update_sstables_known_generation
It is not needed anymore.
With that database::_sstable_generation_generator can
be a regular member rather than optional and initialized
later.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
42cb25c470 sstables: sstable_directory: stop tracking highest_generation
It is not needed anymore as we always generate
uuid generations.

Convert sstable_directory_test_table_simple_empty_directory_scan
to use the newly added empty() method instead of
checking the highest generation seen.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
b01524c5a3 replica: distributed_loader: stop tracking highest_generation
It is not needed anymore as we always generate
uuid generations.

Move highest_generation_seen(sharded<sstables::sstable_directory>& directory)
to sstables/sstable_directory module.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
6cc964ef16 sstables: sstable_generation: get rid of uuid_identifiers bool class
Now that all call sites enable uuid_identifiers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
43ee9c0593 sstables_manager: drop uuid_sstable_identifiers
It is returning constant sstables::uuid_identifiers::yes now,
so let the callers just use the constant (to be dropped
in a following patch).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
0ad1898f0a feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
The feature is supported by all live versions since
version 5.4 / 2024.1.

(Although up to 6da758d74c
it could be disabled using the config option)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:15 +03:00
Botond Dénes
70aa81990b Merge 'Alternator - add the ability to write, not just read, system tables' from Nadav Har'El
In commit 44a1daf we added the ability to read Scylla system tables with Alternator. This feature is useful, among other things, in tests that want to read Scylla's configuration through the system table system.config. But tests often want to modify system.config, e.g., to temporarily reduce some threshold to make tests shorter. Until now, this was not possible.

This series adds support for writing to system tables through Alternator, along with examples of tests using this capability (and utility functions to make it easy).

Because the ability to write to system tables may have non-obvious security consequences, it is turned off by default and needs to be enabled with a new configuration option "alternator_allow_system_table_write".

No backports are necessary - this feature is only intended for tests. We may later decide to backport if we want to backport new tests, but I think the probability we'll want to do this is low.

Fixes #12348

Closes scylladb/scylladb#19147

* github.com:scylladb/scylladb:
  test/alternator: utility functions for changing configuration
  alternator: add optional support for writing to system table
  test/alternator: reduce duplicated code
2025-08-08 09:13:15 +03:00
Raphael S. Carvalho
beaaf00fac test: Add test that compaction doesn't cross logical group boundary
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:01 +03:00
Raphael S. Carvalho
d351b0726b replica: Introduce views in compaction_group for incremental repair
Wired the unrepaired, repairing and repaired views into compaction_group.

Also the repaired filter was wired, so tablet_storage_group_manager
can implement the procedure to classify the sstable.

Based on this classifier, we can decide which view a sstable belongs
to, at any given point in time.

Additionally, we made changes to compaction_group_view
to return only sstables that belong to the underlying view.

From this point on, repaired, repairing and unrepaired sets are
connected to compaction manager through their views. And that
guarantees sstables on different groups cannot be compacted
together.
The repairing view specifically has compaction disabled on it altogether;
we can revert this later if we want, to allow repairing sstables
to be compacted with one another.

The benefit of this logical approach is having the classifier
as the single source of truth. Otherwise, we'd need to keep the
sstable location consistent with global metadata, creating
complexity.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
61cb02f580 compaction: Allow view to be added with compaction disabled
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
9d3755f276 replica: Futurize retrieval of sstable sets in compaction_group_view
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.

Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.
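
The locking strategy above can be illustrated with a minimal sketch. It
is not the Scylla implementation: it uses Python's asyncio in place of
Seastar, a plain asyncio.Lock in place of the write side of
compaction_state::lock, and the class and method names are hypothetical.
It only shows why retrieval, selection and registration have to happen
under one lock once retrieval can yield.

```python
import asyncio


class CompactionState:
    """Hypothetical stand-in for the per-group compaction state."""

    def __init__(self, sstables):
        self.sstables = list(sstables)
        self.compacting = set()
        self.lock = asyncio.Lock()

    async def get_sstables(self):
        # Retrieval is asynchronous, so other tasks may run at this point.
        await asyncio.sleep(0)
        return [s for s in self.sstables if s not in self.compacting]

    async def pick_for_compaction(self, count):
        # Hold the lock across retrieval, selection and registration so
        # that two tasks can never select the same sstable.
        async with self.lock:
            candidates = await self.get_sstables()
            selected = candidates[:count]  # stand-in for the strategy
            self.compacting.update(selected)
            return selected


async def main():
    state = CompactionState(["a", "b", "c", "d"])
    return await asyncio.gather(
        state.pick_for_compaction(2),
        state.pick_for_compaction(2),
    )


first, second = asyncio.run(main())
```

Without the lock, both tasks could observe the same candidate list
across the yield in get_sstables() and register overlapping selections.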

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
20c3301a1a treewide: Futurize estimation of pending compaction tasks
This is to allow futurization of compaction_group_view method that
retrieves sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00
Raphael S. Carvalho
af3592c658 replica: Allow compaction_group to have more than one view
In order to support incremental repair, we'll allow each
replica::compaction_group to have two logical compaction groups
(or logical sstable sets), one for repaired, another for unrepaired.

That means we have to adapt a few places to work with
compaction_group_view instead, such that no logical compaction
group is missed when doing table or tablet wide operations.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00
Raphael S. Carvalho
e78295bff1 Move backlog tracker to replica::compaction_group
Since there will be only one physical sstable set, it makes sense to move
backlog tracker to replica::compaction_group. With incremental repair,
it still makes sense to compute backlog accounting for both logical sets,
since the compound backlog influences the overall read amplification,
and the total backlog across repaired and unrepaired sets can help
drive decisions like giving up on incremental repair when the unrepaired
set is almost as large as the repaired set, causing an amplification
of 2.

Also it's needed for correctness because a sstable can move quickly
across the logical sets, and having one tracker for each logical
set could cause the sstable to not be erased in the old set it
belonged to.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view of a compaction group, it makes sense
to rename it accordingly.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00
Asias He
acc367c522 tests: adjust for incremental repair
The separation of sstables into the logical repaired and unrepaired
virtual sets requires some adjustments for certain tests, in particular
for those that look at number of compaction tasks or number of sstables.
The following tests need adjustment:
* test/cluster/tasks/test_tablet_tasks.py
* test/boost/memtable_test.cc

The adjustments are done in such a way that they accommodate both the
case where there is separate repaired/unrepaired states and when there
isn't.
2025-08-08 06:49:17 +03:00
Andrei Chekun
5c095558b1 test.py: add timeout option for the whole run
Add the ability to limit the execution time of a single test in pytest.
Add --session-timeout to limit the execution time of the test.py and/or
pytest session.

Closes scylladb/scylladb#25185
2025-08-07 21:06:14 +03:00
Avi Kivity
2b8f5d128a Merge 'GCP Key Provider: Fix authentication issues' from Nikos Dragazis
* Fix discovery of application default credentials by using fully expanded pathnames (no tildes).
* Fix grant type in token request with user credentials.

Fixes #25345.

Closes scylladb/scylladb#25351

* github.com:scylladb/scylladb:
  encryption: gcp: Fix the grant type for user credentials
  encryption: gcp: Expand tilde in pathnames for credentials file
2025-08-07 20:50:12 +03:00
Dani Tweig
0ade762654 Adding action call to update Jira issue status
Add actions that will change the relevant Jira issue status based on the linked PR changes.

Closes scylladb/scylladb#25397
2025-08-07 15:55:58 +03:00
Benny Halevy
3f44dba014 sstables: make_entry_descriptor: make regex non-greedy
With greedy matching, an sstable path in a snapshot
directory with a tag that resembles a name-<uuid>
would match the dir regular expression as the longest match,
while a non-greedy regular expression would correctly match
the real keyspace and table as the shortest match.

Also, add a regression unit test reproducing the issue and
validating the fix.
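
The difference can be reproduced with a small sketch. The pattern below
is a simplified, hypothetical stand-in for the one in
make_entry_descriptor (not the actual expression), and the path is made
up; only the leading `.*` versus `.*?` differs between the two variants.

```python
import re

UUID = r"[0-9a-f]{32}"
# Simplified tail: /<keyspace>/<table>-<uuid> followed by anything.
TAIL = rf"/([^/]+)/([^/]+)-({UUID})(?:/.*)?$"
GREEDY = re.compile(r"^.*" + TAIL)
NON_GREEDY = re.compile(r"^.*?" + TAIL)


def keyspace_and_table(pattern, path):
    m = pattern.match(path)
    return (m.group(1), m.group(2)) if m else None


# A snapshot whose tag ("backup-<uuid>") looks like a table directory.
path = (
    "/data/ks/tbl-" + "0123456789abcdef" * 2
    + "/snapshots/backup-" + "fedcba9876543210" * 2
    + "/me-1-big-Data.db"
)

greedy_result = keyspace_and_table(GREEDY, path)
non_greedy_result = keyspace_and_table(NON_GREEDY, path)
```

Here the greedy variant reports ("snapshots", "backup") as the keyspace
and table, while the non-greedy variant reports ("ks", "tbl").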

Fixes #25242

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25323
2025-08-07 15:35:11 +03:00
Avi Kivity
8164f72f6e Merge 'Separate local_effective_replication_map from vnode_effective_replication_map' from Benny Halevy
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.

However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.

Refs #22733

* No backport required

Closes scylladb/scylladb#25222

* github.com:scylladb/scylladb:
  locator: abstract_replication_strategy: implement local_replication_strategy
  locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
  locator: abstract_replication_map: rename make_effective_replication_map
  locator: abstract_replication_map: rename calculate_effective_replication_map
  replica: database: keyspace: rename {create,update}_effective_replication_map
  locator: effective_replication_map_factory: rename create_effective_replication_map
  locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
  locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
  keyspace: rename get_vnode_effective_replication_map
  dht: range_streamer: use naked e_r_m pointers
  storage_service: use naked e_r_m pointers
  alternator: ttl: use naked e_r_m pointers
  locator: abstract_replication_strategy: define is_local
2025-08-07 12:51:43 +03:00
Nadav Har'El
6f415b2f10 Merge 'test/cqlpy: Adjust test_describe.py to work against Cassandra' from Dawid Mędrek
We adjust most of the tests in `cqlpy/test_describe.py`
so that they work against both Scylla and Cassandra.
This PR doesn't cover all of them, just those I authored.

Refs scylladb/scylladb#11690

Backport: not needed. This is effectively a code cleanup.

Closes scylladb/scylladb#25060

* github.com:scylladb/scylladb:
  test/cqlpy/test_describe.py: Adjust test_create_role_with_hashed_password_authorization to work with Cassandra
  test/cqlpy/test_describe.py: Adjust test_desc_restore to work with Cassandra
  test/cqlpy/test_describe.py: Mark Scylla-only tests as such
2025-08-07 12:43:04 +03:00
Avi Kivity
90eb6e6241 Merge 'sstables/trie: implement BTI node format serialization and traversal' from Michał Chojnowski
This is the next part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25154
Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here.

The new code added here is not used for anything yet, but it's posted as a separate PR
to keep things reviewably small.

This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes.
It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR)
into the on-disk format, and the logic for traversing the on-disk nodes during a read.

New functionality, no backporting needed.

Closes scylladb/scylladb#25317

* github.com:scylladb/scylladb:
  sstables/trie: add tests for BTI node serialization and traversal
  sstables/trie: implement BTI node traversal
  sstables/trie: implement BTI serialization
  utils/cached_file: add get_shared_page()
  utils/cached_file: replace a std::pair with a named struct
2025-08-07 12:15:42 +03:00
Benny Halevy
02b922ac40 test: cql_query_test: add test_sstable_load_mixed_generation_type
Test that we can load sstables with mixed, numerical and uuid
generation types, and verify the expected data.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-07 12:04:23 +03:00
Benny Halevy
9b65856a26 test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
It's a generic helper that can be used by all tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-07 12:04:23 +03:00
Benny Halevy
7c9ce235d7 test: database_test: move table_dir helper to test/lib/test_utils
It's a generic helper that can be used by all tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-07 12:04:23 +03:00
Nadav Har'El
d632599a92 Merge 'test.py: native pytest repeats' from Andrei Chekun
Previously, repeats were executed by launching pytest once per repeat.
That was resource-consuming, since pytest performed test discovery
each time. Now all repeats are done inside one pytest process.

Backport for 2025.3 is needed, since this functionality is framework only, and 2025.3 is affected by these slow repeats as well.

Closes scylladb/scylladb#25073

* github.com:scylladb/scylladb:
  test.py: add repeats in pytest
  test.py: add directories and filename to the log files
  test.py: rename log sink file for boost tests
  test.py: better error handling in boost facade
2025-08-06 18:18:03 +03:00
Dawid Pawlik
b284961a95 scripts: fetch the name of the author of the PR
The `pull_github_pr.sh` script has been fetching the username
from the owner of the source branch.
The owner of the branch is not always the author of the PR.
For example, the branch might come from a fork managed by an organization
or a group of people.
This led to the author in merge commits being referred to as `null`
(if the name was not set for the group) or to a name being mentioned
that does not belong to the author of the patch.

Instead of looking for the owner of the source branch, the script should
look for the name of the PR's author.
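
As an illustration only (Python rather than the shell script itself),
the distinction maps onto two different fields of a GitHub pull request
object: `user` is the PR's author, while `head.user` is the owner of
the source branch. Which fields the script reads exactly is an
assumption here, and the sample data is made up.

```python
def branch_owner_login(pr):
    # Owner of the fork the source branch lives in; may be an
    # organization rather than a person.
    return pr["head"]["user"]["login"]


def pr_author_login(pr):
    # The account that opened the pull request.
    return pr["user"]["login"]


# Hand-written sample: a PR opened by a person from an organization fork.
sample_pr = {
    "user": {"login": "alice"},
    "head": {"user": {"login": "some-org"}},
}

owner = branch_owner_login(sample_pr)
author = pr_author_login(sample_pr)
```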

Closes scylladb/scylladb#25363
2025-08-06 16:45:38 +03:00
Benny Halevy
5e5e63af10 scylla-sstable: print_query_results_json: continue loop if row is disengaged
Otherwise it is accessed right when exiting the if block.
Add a unit test reproducing the issue and validating the fix.

Fixes #25325

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25326
2025-08-06 16:44:51 +03:00
Szymon Malewski
eb11485969 test/alternator: enable more relevant logs in CI.
This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%.
This affects running alternator tests only with `test.py`, not with `test/alternator/run`.

Closes #24645

Closes scylladb/scylladb#25327
2025-08-06 16:37:25 +03:00
Nikos Dragazis
ee92fcc078 encryption_at_rest_test: Preserve tmpdir from failing KMIP tests
The KMIP tests start a local PyKMIP server and configure it to write
logs in the test's temporary directory (`tmpdir`). However, the tmpdir
is a RAII object that deletes the directory once it goes out of scope,
causing PyKMIP server logs to be lost on test failures.

To assist with debugging, preserve the whole directory if the test
failed with an exception. Allow the user to disable this by setting the
SCYLLA_TEST_PRESERVE_TMP_ON_EXCEPTION environment variable.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-08-06 16:29:19 +03:00
Benny Halevy
6dbbb80aae locator: abstract_replication_strategy: implement local_replication_strategy
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.

However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.

Note that everywhere_replication_strategy is not abstracted in a similar
way, although it could be, since the plan is to get rid of it
once all system keyspaces are converted to local or tablets replication
(and propagated everywhere if needed using raft group0).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:05:11 +03:00
Benny Halevy
8bde507232 locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
create_effective_replication_map need not know about the internals of
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
8d4ac97435 locator: abstract_replication_map: rename make_effective_replication_map
to make_vnode_effective_replication_map_ptr since
it is specific to vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
babb4a41a8 locator: abstract_replication_map: rename calculate_effective_replication_map
to calculate_vnode_effective_replication_map since
it is specific to vnode-based range calculations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
34b223f6f9 replica: database: keyspace: rename {create,update}_effective_replication_map
to *_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
688bd4fd43 locator: effective_replication_map_factory: rename create_effective_replication_map
to create_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
cbad497859 locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
to static_effective_replication_map_ptr, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
2ab44e871b locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
to global_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:49 +03:00
Benny Halevy
bd62421c05 keyspace: rename get_vnode_effective_replication_map
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:40:43 +03:00
Benny Halevy
33f34c8c32 dht: range_streamer: use naked e_r_m pointers
Prepare for following patch that will separate
the local effective replication map from
vnode_effective_replication_map.

The caller is responsible for keeping the
effective_replication_map_ptr alive while
in use by low-level async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:34:23 +03:00
Benny Halevy
d6d434b1c2 storage_service: use naked e_r_m pointers
Prepare for following patch that will separate
the local effective replication map from
vnode_effective_replication_map.

The caller is responsible for keeping the
effective_replication_map_ptr alive while
in use by low-level async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:34:23 +03:00
Benny Halevy
59375e4751 alternator: ttl: use naked e_r_m pointers
Prepare for following patch that will separate
the local effective replication map from
vnode_effective_replication_map.

The caller is responsible for keeping the
effective_replication_map_ptr alive while
in use by low-level async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:34:23 +03:00
Benny Halevy
ec85678de1 locator: abstract_replication_strategy: define is_local
Prepare for specializing the local replication strategy,
local effective replication map, et al. by defining
an is_local() predicate, similar to uses_tablets().

Note that is_vnode_based() still applies to local replication
strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:34:23 +03:00
Nikos Dragazis
1eb99fb5f5 test/lib: Add option to preserve tmpdir on exception
Extend the tmpdir class with an option to preserve the directory if the
destructor is called during stack unwinding (i.e., uncaught exception).
To be used in tests where the tmpdir contains non-temporary resources
that may help in diagnosing test failures (e.g., logs from external
services such as PyKMIP).
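
The idea can be sketched in Python as an analogy. The actual change is
in the C++ tmpdir class and relies on detecting stack unwinding in the
destructor; the context manager below and its parameter name are
hypothetical and only mirror the behavior.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def tmpdir(preserve_on_exception=False):
    path = tempfile.mkdtemp()
    try:
        yield path
    except BaseException:
        # Leaving the block because of an exception: keep the directory
        # if requested, so logs inside it survive for debugging.
        if not preserve_on_exception:
            shutil.rmtree(path, ignore_errors=True)
        raise
    else:
        shutil.rmtree(path, ignore_errors=True)


def run_failing_test(preserve):
    """Simulates a failing test and returns the directory it used."""
    seen = None
    try:
        with tmpdir(preserve_on_exception=preserve) as d:
            seen = d
            raise RuntimeError("simulated test failure")
    except RuntimeError:
        pass
    return seen


kept = run_failing_test(preserve=True)
removed = run_failing_test(preserve=False)
kept_exists = os.path.isdir(kept)
removed_exists = os.path.isdir(removed)
shutil.rmtree(kept, ignore_errors=True)  # clean up after the demo
```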

This will be used in the next patch.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-08-06 13:07:52 +03:00
Pavel Emelyanov
0616407be5 Merge 'rest_api: add endpoint which drops all quarantined sstables' from Taras Veretilnyk
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
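
A request URL for the endpoint could be assembled as in the sketch
below. The path and parameter names are the ones described above; the
helper name and the base URL (localhost on port 10000) are assumptions
for illustration, and the request itself would be sent with POST.

```python
from urllib.parse import urlencode


def drop_quarantined_url(base, keyspace=None, tables=None):
    """Builds the URL; `tables` is a list joined with commas."""
    params = {}
    if keyspace:
        params["keyspace"] = keyspace
    if tables:
        params["tables"] = ",".join(tables)
    url = base + "/storage_service/drop_quarantined_sstables"
    return url + ("?" + urlencode(params) if params else "")


# Assumed base URL of the node's REST API.
BASE = "http://localhost:10000"
everything = drop_quarantined_url(BASE)
scoped = drop_quarantined_url(BASE, keyspace="ks", tables=["t1", "t2"])
```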

Fixes scylladb/scylladb#19061

Backport is not required; this is new functionality.

Closes scylladb/scylladb#25063

* github.com:scylladb/scylladb:
  docs: Add documentation for the nodetool dropquarantinedsstables command
  nodetool: add command for dropping quarantine sstables
  rest_api: add endpoint which drops all quarantined sstables
2025-08-06 11:55:15 +03:00
Nadav Har'El
10588958e0 test/alternator: add regression test for keep-alive support
An Alternator user complained about suspiciously many new connections being
opened, which raised a suspicion that maybe Alternator doesn't support
HTTP and HTTPS keep-alive (allowing a client to reuse the same connection
for multiple requests). It turns out that we never had a regression test
that this feature actually works (and doesn't break), so this patch adds
one.

The test confirms that Alternator's connection reuse (keep-alive) feature
actually works correctly. Of course, only if the driver really tries to
reuse a connection - which is a separate question and needs testing on
the driver side (scylladb/alternator-load-balancing#82).

The test sends two requests using Python's "requests" library which can
normally reuse connections (it uses a "connection pool"), and checks if the
connection was really reused. Unfortunately "requests" doesn't give us
direct knowledge of whether or not it reused a connection, so we check
this using simple monkey-patching. I actually tried multiple other
approaches before settling on this one. The approach needs to work
on both HTTP and HTTPS, and also on AWS DynamoDB.

Importantly, the test checks both keep-alive and non-keep-alive cases.
This is very important for validating the test itself and its tricky
monkey-patching code: The test is meant to detect when the socket is not
reused for the second request, so we want to also check the non-keep-
alive case where we know the socket isn't reused, to see the test code
really detected this situation.
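
The check "was the connection reused, and does the detection also
notice when it was not" can be sketched in a self-contained way. This
is not the test's approach: instead of monkey-patching the "requests"
library, the sketch uses only Python's standard library and lets a
throwaway local server report the client's source port, which stays the
same only when the socket is reused. All names are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PortEchoHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allow persistent connections

    def do_GET(self):
        # Reply with the client's source port for this connection.
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def observed_client_ports(reuse_connection):
    server = ThreadingHTTPServer(("127.0.0.1", 0), PortEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    connections = []
    ports = []
    try:
        for _ in range(2):
            if not connections or not reuse_connection:
                connections.append(http.client.HTTPConnection(host, port))
            conn = connections[-1]
            conn.request("GET", "/")
            ports.append(int(conn.getresponse().read()))
        return ports
    finally:
        for conn in connections:
            conn.close()
        server.shutdown()
        server.server_close()


keep_alive_ports = observed_client_ports(reuse_connection=True)
separate_ports = observed_client_ports(reuse_connection=False)
```

As in the test, the non-reuse case matters as much as the reuse case:
it shows that the detection really reports two distinct connections
when the socket is not reused.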

By default, this test runs (like all of Alternator's test suite) on HTTP
sockets. Running this test with "test/alternator/run --https" will run
it on HTTPS sockets. The test currently passes on both HTTP and HTTPS.
It also passes on AWS DynamoDB ("test/alternator/run --aws")

Fixes #23067

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25202
2025-08-06 11:41:21 +03:00
Avi Kivity
630b3d31bb storage_proxy: reduce allocations in send_to_live_endpoints()
send_to_live_endpoints() computes sets of endpoints to
which we send mutations - remote endpoints (where we send
to each set as a whole, using forwarding), and local endpoints,
where we send directly. To make handling regular, each local
endpoint is treated as its own set. Thus, each local endpoint
and each datacenter receive one RPC call (or local call if the
coordinator is also a replica).
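
The grouping described above can be sketched as follows. This is an
illustration in Python, not the C++ code in storage_proxy; the function
name and the input shape (pairs of endpoint and datacenter) are
assumptions.

```python
from collections import defaultdict


def group_targets(replicas, local_dc):
    """Returns a list of endpoint groups; each group gets one call."""
    remote_by_dc = defaultdict(list)
    local_groups = []
    for endpoint, dc in replicas:
        if dc == local_dc:
            # Each local endpoint is its own group (sent to directly).
            local_groups.append([endpoint])
        else:
            # Remote endpoints are grouped per datacenter (forwarding).
            remote_by_dc[dc].append(endpoint)
    return local_groups + list(remote_by_dc.values())


replicas = [
    ("n1", "dc1"), ("n2", "dc1"), ("n3", "dc1"),
    ("n4", "dc2"), ("n5", "dc2"),
]
groups = group_targets(replicas, local_dc="dc1")
```

Note that the datacenter name is only needed while grouping; the result
holds just the replica sets, which is the observation the patch builds
on.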

These sets are maintained in a std::unordered_map (for remote endpoints)
and a vector with the same value_type as the map (for local endpoints).
The key part of the vector payload is initialized to the empty string.

We simplify this by noting that the datacenter name is never used
after this computation, so the vector can hold just the replica sets,
without the fake datacenter name. The downstream variable `all` is
adjusted to point just to the replica set as well.

As a reward for our efforts, the vector's contents become nothrow
move constructible (no string), and we can convert it to a small_vector,
which reduces allocations in the common case of RF<=3.

The reduction in allocations is visible in perf-simple-query --write
results:

```
before 165080.62 tps ( 60.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   53438 insns/op,   26705 cycles/op,        0 errors)

after  164513.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   53347 insns/op,   26761 cycles/op,        0 errors)
```

The instruction count reduction is a not very impressive 70/op:

before
```
instructions_per_op:
	mean=   53412.22 standard-deviation=32.12
	median= 53420.53 median-absolute-deviation=20.32
	maximum=53462.23 minimum=53290.06
```

after
```
instructions_per_op:
	mean=   53350.32 standard-deviation=32.38
	median= 53353.71 median-absolute-deviation=13.60
	maximum=53415.20 minimum=53222.24
```

Perhaps the extra code from small_vector defeated some inlining,
which negated some of the gain from the reduced allocations. Perhaps
a build with full profiling will gain it back (my builds were without
pgo).

Closes scylladb/scylladb#25270
2025-08-06 11:28:20 +03:00
Karol Nowacki
032e8f9030 test/boost/vector_store_client_test.cc: Fix flaky tests
The vector_store_client_test was observed to be flaky, sometimes hanging while waiting for a response from HTTP server.

Problem:
The default load balancing algorithm (in Seastar's posix_server_socket_impl::accept) could route an incoming connection to a different shard than the one executing the test.
Because the HTTP server is a non-sharded service running only on the test's originating shard, any connection submitted to another shard would never be handled, causing the test client to hang waiting for response.

Solution:
The patch resolves the issue by explicitly selecting the fixed-CPU load balancing algorithm.
This ensures that incoming connections are always handled on the same shard where the HTTP server is running.

Closes scylladb/scylladb#25314
2025-08-06 11:24:51 +03:00
Taras Veretilnyk
bcb90c42e4 docs: Sort commands list in nodetool.rst
Fixes scylladb/scylladb#25330

Closes scylladb/scylladb#25331
2025-08-06 11:20:53 +03:00
Nikos Dragazis
b1d5a67018 encryption: gcp: Fix the grant type for user credentials
Exchanging a refresh token for an access token requires the
"refresh_token" grant type [1].

[1] https://datatracker.ietf.org/doc/html/rfc6749#section-6
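
For illustration, a token request body with the correct grant type
could be built as in the sketch below. The parameter names come from
RFC 6749; the helper name is hypothetical, the values are placeholders,
and passing the client credentials in the body is one of the options
the RFC permits, assumed here for simplicity.

```python
from urllib.parse import parse_qs, urlencode


def refresh_token_request_body(client_id, client_secret, refresh_token):
    # application/x-www-form-urlencoded body for the token endpoint.
    return urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })


body = refresh_token_request_body("my-client", "my-secret", "my-refresh-token")
fields = parse_qs(body)
```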

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-08-06 10:39:17 +03:00
Nadav Har'El
fa86405b1f test/alternator: utility functions for changing configuration
Now that the previous patch made it possible to write to system tables
in Alternator tests, this patch introduces utility functions for changing
the configuration - scylla_config_write() in addition to the
scylla_config_read() we already had, and scylla_config_temporary() to
temporarily change a configurable parameter and then restore it to its
old value.
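
The temporary-change pattern can be sketched as a context manager. This
is not the code of scylla_config_temporary(): its real signature is not
shown here, and a plain dict stands in for the system.config table, so
all names below are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def config_temporary(config, name, value):
    old = config[name]      # read the current value
    config[name] = value    # write the temporary value
    try:
        yield
    finally:
        config[name] = old  # always restore, even if the test fails


# A dict standing in for the configuration table, with made-up values.
config = {"query_tombstone_page_limit": "10000"}
with config_temporary(config, "query_tombstone_page_limit", "10"):
    inside = config["query_tombstone_page_limit"]
after = config["query_tombstone_page_limit"]
```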

This patch adds a silly test that temporarily modifies the
query_tombstone_page_limit configuration parameter. Later we can
add more tests that use the new test functions for more "serious"
testing of real features. In particular, we don't have an Alternator
test for the max_concurrent_requests_per_shard configuration - and
I want to write one.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-06 10:02:24 +03:00
Nadav Har'El
a896e2dbb9 alternator: add optional support for writing to system table
In commit 44a1daf we added the ability to read system tables through
the DynamoDB API (actually, the Scan and Query requests only).
This ability is useful for tests, and can also be useful to users who
want to read information that is only available through system tables.

This patch adds support also for *writing* into system tables. This will
be useful for Alternator tests, where we want to temporarily change
some live-updatable configuration option - and so far haven't been
able to do that like we did in some cql-pytest tests.

For reasons explained in issue #23218, only superuser roles are allowed to
write to system tables - it is not enough for the role to be granted
MODIFY permissions on the system table or on ALL KEYSPACES. Moreover,
the ability to modify system tables carries special risks, so this
patch only allows writes to the system tables if a new configuration
option "alternator_allow_system_table_write" is turned on. This option is
turned off by default.

This patch also includes a test for this new configuration-writing
capability. The test scripts test/alternator/run and test.py now
run Scylla with alternator_allow_system_table_write turned on, but
the new test can also run without this option, and will be skipped
in that case (to allow running the test suite against some manually-
run instance of Scylla).

Fixes: #12348

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-06 10:00:04 +03:00
Nadav Har'El
5913498fff test/alternator: reduce duplicated code
Four tests had almost identical code to read an item from Scylla
configuration (using the system.config system table). It's time
to make this into a new utility function, scylla_config_read().

This is a good time to do it, because in a later patch I want
to also add a similar function to *write* into the configuration.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-06 09:56:47 +03:00
Nadav Har'El
d46dda0840 Merge 'cql, vector_search: implement read path' from null
This pull request adds ANN OF queries.

The patch contains:

- CQL syntax for ORDER BY `vector_column_name` ANN OF `vector_literal` clause of SELECT statements.
- implementation of external ANN queries (using vector-store service)
- tests

Example syntax:

```
SELECT comment
    FROM cycling.comments_vs
    ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
    LIMIT 3;
```
Limit can be between 1 and 1000 - same as for Cassandra.

Co-authored-by: @janpiotrlakomy @smoczy123
Fixes: VECTOR-48
Fixes: VECTOR-46

Closes scylladb/scylladb#24444

* github.com:scylladb/scylladb:
  cql3/statements: implement external `ANN OF` queries
  vector_store_client: implement ann_error_visitor
  test/cqlpy: check ANN queries disallow filtering properly
  cassandra_tests: translate vector_invalid_query_test
  cassandra_tests: copy vector_invalid_query_test from Cassandra
  vector_index: make parameter names case insensitive
  cql3/statements: add `ANN OF` queries support to select statements
  cql/Cql.g: extend the grammar to allow for `ANN OF` queries
  cql3/raw: add ANN ordering to the raw statement layer
2025-08-06 09:53:38 +03:00
Nikos Dragazis
77cc6a7bad encryption: gcp: Expand tilde in pathnames for credentials file
The GCP host searches for application default credentials in known
locations within the user's home directory using
`seastar::file_exists()`. However, this function does not perform tilde
expansion in pathnames.

Replace tildes with the home directory from the HOME environment
variable.
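
A minimal sketch of such an expansion, in Python for illustration (the
actual fix is in C++, and the function name and example paths are
hypothetical). It handles only a leading "~" and leaves the path
untouched when HOME is not set.

```python
def expand_tilde(path, environ):
    """Replaces a leading '~' with the HOME directory, if known."""
    if path == "~" or path.startswith("~/"):
        home = environ.get("HOME")
        if home:
            return home + path[1:]
    return path


env = {"HOME": "/home/scylla"}
expanded = expand_tilde("~/some/credentials.json", env)
untouched = expand_tilde("/etc/some/credentials.json", env)
no_home = expand_tilde("~/some/credentials.json", {})
```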

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-08-06 09:46:08 +03:00
Avi Kivity
bb922b2aa9 Merge 'truncate: change check for write during truncate into a log warning' from Ferenc Szili
TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised.

The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names and the offending replay positions which caused the check to fail.

This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that:
- all data written before TRUNCATE starts is deleted
- none of the data after TRUNCATE completes is deleted

Fixes: #25173
Fixes: #25013

Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1

Closes scylladb/scylladb#25174

* github.com:scylladb/scylladb:
  truncate: add test for truncate with concurrent writes
  truncate: change check for write during truncate into a log warning
2025-08-06 00:03:37 +03:00
Michał Chojnowski
9930cd59eb sstables/trie: add tests for BTI node serialization and traversal
Adds tests which check that nodes serialized by `bti_node_sink`
are readable by `bti_node_reader` with the right result.

(Note: there are no tests which check compatibility of the encoded nodes
with Cassandra or with handwritten hexdumps. There are only tests
for mutual compatibility between Scylla's writers and readers.
This can be considered a gap in testing.)
2025-08-05 21:48:24 +02:00
Pavel Emelyanov
10056a8c6d Merge 'Simplify credential reload: remove internal expiration checks' from Ernest Zaslavsky
This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm.

To resolve this, we remove the expiration check (and any other checks) from the `reload` method, assuming that whoever calls this method knows what they are doing.
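
The difference between the two behaviours can be sketched as follows. This is an illustrative Python model, not the actual C++ code. The names `Provider`, `reload_conditional` and the injected `fetch`/`now` callables are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Credentials:
    token: str
    expires_at: float  # seconds, on the same clock as `now()`


class Provider:
    def __init__(self, fetch, now):
        self._fetch = fetch  # callable returning fresh Credentials
        self._now = now      # callable returning the current time
        self.creds = fetch()

    def reload_conditional(self):
        # Old behaviour as described: credentials that are not yet expired
        # are returned unchanged, so an early renewal is a no-op.
        if self._now() < self.creds.expires_at:
            return self.creds
        self.creds = self._fetch()
        return self.creds

    def reload(self):
        # New behaviour: the caller asked for a refresh, so always fetch.
        self.creds = self._fetch()
        return self.creds
```

With a timer that fires one hour before expiration, `reload_conditional` hands back the same credentials and the timer fires again immediately, while `reload` returns credentials with a new expiration.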

Fixes: https://github.com/scylladb/scylladb/issues/25044

Should be backported to 2025.3 since we need this fix for the restore

Closes scylladb/scylladb#24961

* github.com:scylladb/scylladb:
  s3_creds: code cleanup
  s3_creds: Make `reload` unconditional
  s3_creds: Add test exposing credentials renewal issue
2025-08-05 17:49:13 +03:00
Michael Litvak
faebfdf006 test/cluster/test_tablets_colocation: fix flaky test
When restarting the server in the test, wait for it to become ready
before requesting tablet repair.

Fixes scylladb/scylladb#25261

Closes scylladb/scylladb#25263
2025-08-05 15:36:03 +02:00
Avi Kivity
4c785b31c7 Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El
Before this series, the "system.clients" virtual table listed active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series also adds Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using, without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field.

Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator *connections*, we will list the currently active *requests*.

The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch is a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing test of CQL's system.clients.

Fixes #24993

This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug in it can cause problems for every Alternator request.

Closes scylladb/scylladb#25178

* github.com:scylladb/scylladb:
  test/cqlpy: slightly strengthen test for system.clients
  generic_server: use utils::scoped_item_list
  docs/alternator: document the system.clients system table in Alternator
  alternator: add test for Alternator clients in system.clients
  alternator: list active Alternator requests in system.clients
  utils: unit test for utils::scoped_item_list
  utils: add a scoped_item_list utility class
  utils: add "fatal" version of utils::on_internal_error()
2025-08-05 15:55:41 +03:00
Ferenc Szili
33488ba943 truncate: add test for truncate with concurrent writes
test_validate_truncate_with_concurrent_writes checks if truncate deletes
all the data written before the truncate starts, and does not delete any
data after truncate completes.
2025-08-05 13:54:14 +02:00
Jan Łakomy
447c66f4ec cql3/statements: implement external ANN OF queries
Implement execution of `ANN OF` queries using the vector_store service.

Throw invalid_request_exception with specific message using
the ann_error_visitor when ANN request returns no result.

Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
Co-authored-by: Michał Hudobski <michal.hudobski@scylladb.com>
2025-08-05 12:34:48 +02:00
Dawid Pawlik
7a826b79d9 vector_store_client: implement ann_error_visitor
Implement ann_error_visitor managing error messages depending on
ANN error type received.
2025-08-05 12:34:48 +02:00
Dawid Pawlik
74f603fe99 test/cqlpy: check ANN queries disallow filtering properly
Add tests checking if filtering with clustering column
or using index is disallowed while performing ANN query.
2025-08-05 12:34:48 +02:00
Pavel Emelyanov
5fcdf948d9 doc: Update system.clients schema with scheduling_group cell
It was added by 9319d65971 (db/virtual_tables: add scheduling group
column to system.clients) recently.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25294
2025-08-05 10:16:20 +03:00
Michał Chojnowski
85964094f6 sstables/trie: implement BTI node traversal
This commit implements routines for traversal of BTI nodes in their
on-disk format.
The `node_reader` concept is currently unused (i.e. not asserted by any
template).

It will only be used in the next PR, which will implement trie cursor
routines parametrized by `node_reader`.
But I'm including it in this PR to make it clear which functions
will be needed by the higher layer.
2025-08-05 00:56:50 +02:00
Michał Chojnowski
302adfb50d sstables/trie: implement BTI serialization
This commit introduces code responsible for serializing
trie nodes (`writer_node`) into the on-disk BTI format,
as described in:
f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md
2025-08-05 00:56:50 +02:00
Michał Chojnowski
6fe7dbaedc utils/cached_file: add get_shared_page()
BTI index is page-aware. It's designed to be read in page units.

Thus, we want a `cached_file` accessor which explicitly requests
a whole page, preferably without copying it.

`cached_file` already works in terms of reference-counted pages,
underneath. This commit only adds some accessors which let
us request those reference-counted page pointers more directly.
2025-08-05 00:56:50 +02:00
Michał Chojnowski
58d768e383 utils/cached_file: replace a std::pair with a named struct
Cosmetic change. For clarity.
2025-08-05 00:55:32 +02:00
Artsiom Mishuta
4b975668f6 tiering (test.py): introduce tiering labels
Introduce tiering marks:
1 “unstable” - for unstable tests that will continue running every night and generate up-to-date statistics with failures, without failing the “Main” verification path (scylla-ci, Next)

2 “nightly” - for tests that are quite old, stable, test functionality that is unlikely to be changed or affected by other features, are partially covered by other tests, verify non-critical functionality, have not found any issues or regressions, are too long to run on every PR, and can be dropped from the CI run.

Set 7 long tests (according to statistics in Elastic) as nightly (these 8 tests took 20% of the CI run,
about 4 hours without parallelization).
Set 1 test as unstable (as an example of marker usage).

Closes scylladb/scylladb#24974
2025-08-04 15:38:16 +03:00
Ferenc Szili
268ec72dc9 truncate: change check for write during truncate into a log warning
TRUNCATE TABLE performs a memtable flush and then discards the sstables
of the table being truncated. It collects the highest replay position
for both of these. When the highest replay position of the discarded
sstables is higher than the highest replay position of the flushed
memtable, that means that we have had writes during truncate which have
been flushed to disk independently of the truncate process. We check for
this and trigger an on_internal_error() which throws an exception,
informing the user that writing data concurrently with TRUNCATE TABLE is
not advised.

The problem with this is that truncate is also called from DROP KEYSPACE
and DROP TABLE. These are raft operations and exceptions thrown by them
are caught by the (...) exception handler in the raft applier fiber,
which then exits leaving the node without the ability to execute
subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It
also outputs the keyspace/table names, the truncated_at timepoint, and the
offending replay positions which caused the check to fail.
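
The check described above can be sketched roughly like this. It is an illustrative Python model, not the actual C++ code. The function name, the logger name and the representation of replay positions as comparable tuples are assumptions:

```python
import logging

log = logging.getLogger("truncate_sketch")  # hypothetical logger name


def check_writes_during_truncate(ks, table, flushed_rp, discarded_rp):
    """Return True if the discarded sstables are newer than the flushed memtable.

    Replay positions are modelled as comparable tuples, e.g. (segment, offset).
    A hit is reported as a warning instead of raising, so callers such as
    DROP TABLE are not interrupted by an exception.
    """
    if discarded_rp > flushed_rp:
        log.warning(
            "writes during truncate of %s.%s: sstables rp %s > memtable rp %s",
            ks, table, discarded_rp, flushed_rp,
        )
        return True
    return False
```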

Fixes: #25173
Fixes: #25013
2025-08-04 12:24:50 +02:00
Piotr Dulikowski
ec7832cc84 Merge 'Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Patryk Jędrzejczak
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
  `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
  first).

These steps are not intuitive and more complicated than they could be.

In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.

Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.

Fixes scylladb/scylladb#25015

The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.

Closes scylladb/scylladb#25032

* github.com:scylladb/scylladb:
  docs: document the option to set recovery_leader later
  test: delay setting recovery_leader in the recovery procedure tests
  gossip: add recovery_leader to gossip_digest_syn
  db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
  db/config, gms/gossiper: change recovery_leader to UUID
  db/config, utils: allow using UUID as a config option
2025-08-04 08:29:32 +02:00
Ernest Zaslavsky
837475ec6f s3_creds: code cleanup
Remove unnecessary code that is no longer used
2025-08-04 09:26:11 +03:00
Ernest Zaslavsky
e4ebe6a309 s3_creds: Make reload unconditional
Assume that any caller invoking `reload` intends to refresh credentials.
Remove conditional logic that checks for expiration before reloading.
2025-08-03 17:41:35 +03:00
Ernest Zaslavsky
68855c90ca s3_creds: Add test exposing credentials renewal issue
Add a test demonstrating that renewing credentials does not update
their expiration. After requesting credentials again, the expiration
remains unchanged, indicating no actual update occurred.
2025-08-03 17:41:25 +03:00
Avi Kivity
1c25aa891b Merge 'storage_proxy.cc: get_cas_shard: fallback to the primary replica shard' from Petr Gusev
Currently, `get_cas_shard` uses `sharder.shard_for_reads` to decide which shard to use for LWT execution—both on replicas and the coordinator.

If the coordinator is not a replica, `shard_for_reads` returns a default shard (shard 0). There are at least two problems with this:
* shard 0 can become overloaded, because all LWT coordinators-but-not-replicas are served on it.
* mismatch with replicas: the default shard doesn't match what `shard_for_reads` returns on replicas. This hinders the "same shard for client and server" RPC level optimization.

In this PR we change `get_cas_shard` to use a primary replica shard if the current node is not a replica. This guarantees that all LWT coordinators for the same tablet will be served on the same shard. This is important for LWT coordinator locks (`paxos::paxos_state::get_cas_lock`). Also, if all tablet replicas on different nodes live on the same shard, RPC optimization will make sure that no additional `smp::submit_to` will be needed on server side.
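
A rough sketch of the fallback, in Python for brevity. The function below is a simplified model, not the real `get_cas_shard`: it assumes the tablet's replicas are given as `(node, shard)` pairs with the primary replica first, and it ignores details such as nodes with different shard counts.

```python
def get_cas_shard(this_node, tablet_replicas):
    """Pick the shard for LWT execution on `this_node` (illustrative model).

    tablet_replicas: list of (node, shard) pairs, primary replica first
    (an assumption of this sketch).
    """
    for node, shard in tablet_replicas:
        if node == this_node:
            # This node is a replica: use the shard that owns the tablet here.
            return shard
    # Not a replica: fall back to the primary replica's shard instead of a
    # fixed default such as shard 0.
    _primary_node, primary_shard = tablet_replicas[0]
    return primary_shard
```

Every non-replica coordinator for the same tablet then picks the same shard, which is what the coordinator locks rely on.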

backport: not needed, since this fix applies only to LWT over tablets, and this feature is not released yet

Closes scylladb/scylladb#25224

* github.com:scylladb/scylladb:
  test_tablets_lwt.py: make tests rf_rack_valid
  test_tablets_lwt: add test_lwt_coordinator_shard
  storage_proxy.cc: get_cas_shard: fallback to the primary replica shard
  sharder: add try_get_shard_for_reads method
2025-08-01 23:07:25 +03:00
Avi Kivity
8b1bf46086 Merge 'sstables: introduce trie_writer' from Michał Chojnowski
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).

As of this patch, the new code isn't used for anything yet,
but we introduce it separately from its users to keep PRs small enough
for reviewability.

This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:

1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).

It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.

This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:

1. The writer operates on chains of characters, rather than single characters.

   In Cassandra's implementation, the writer creates one node per character.
   A single long key can be translated to thousands of nodes.
   We create only one node per key. (Actually we split very long keys into
   a few nodes, but that's arbitrary and beside the point).

   For BTI's partition key index this doesn't matter.
   Since it only stores a minimal unique prefix of each key,
   and the trie is very balanced (due to token randomness),
   the average number of new characters added per key is very close to 1 anyway.
   (And the string-based logic might actually be a small pessimization, since
   manipulating a 1-byte string might be costlier than manipulating a single byte).

   But the row index might store arbitrarily long entries, and in that case the
   character-based logic might result in catastrophically bad performance.
   For reference: when writing a partition index, the total processing cost
   of a single node in the trie_writer is on the order of 800 instructions.
   Total processing cost of a single tiny partition during an `upgradesstables`
   operation is on the order of 10000 instructions. A small INSERT is on the
   order of 40000 instructions.

   So processing a single 1000-character clustering key in the trie_writer
   could cost as much as 20 INSERTs, which is scary. Even 100-character keys
   can be very expensive. With extremely long keys like that, the string-based
   logic is more than ~100x cheaper than character-based logic.
   (Note that only *new* characters matter here. If two index entries share a
   prefix, that prefix is only processed once. And the index is only populated
   with the minimal prefix needed to distinguish neighbours. So in practice,
   long chains might not happen often. But still, they are possible).

   I don't know if it makes sense to care about this case, but I figured the
   potential for problems is too big to ignore, so I switched to chain-based logic.

2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
   than a full page after revising the estimate, Cassandra splits it in a
   different way than us.
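
The cost estimate in item 1 can be reproduced with a short calculation. The constants are the order-of-magnitude figures quoted above; the function names are only illustrative:

```python
INSTR_PER_NODE = 800       # trie_writer cost per node (figure quoted above)
INSTR_PER_INSERT = 40_000  # cost of a small INSERT (figure quoted above)


def char_based_cost(new_chars: int) -> int:
    """Character-based logic: one node per new character."""
    return new_chars * INSTR_PER_NODE


def cost_in_inserts(new_chars: int) -> float:
    """Express the character-based cost in units of small INSERTs."""
    return char_based_cost(new_chars) / INSTR_PER_INSERT
```

For 1000 new characters this gives 800,000 instructions, i.e. 20 small INSERTs. For 100 new characters it is still 2 INSERTs.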

For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.

The serialization logic is passed to trie_writer via a template parameter.

There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).

Refs scylladb/scylladb#19191

New functionality, no backporting needed.

Closes scylladb/scylladb#25154

* github.com:scylladb/scylladb:
  sstables: introduce trie_writer
  utils/bit_cast: add object_representation()
2025-08-01 20:23:24 +03:00
Andrei Chekun
c0d652a973 test.py: change boost test stdout to use filehandler instead of pipe
With the current implementation, if pytest is killed, it will not be
able to write the stdout from the boost test. With the new approach, the
output should be updated while the test is executing, instead of being
written at the end of the test.

Closes scylladb/scylladb#25260
2025-08-01 15:05:00 +03:00
Michał Jadwiszczak
10214e13bd storage_service, group0_state_machine: move SL cache update from topology_state_load() to load_snapshot()
Currently the service levels cache is unnecessarily updated in every
call of `topology_state_load()`.
But it is enough to reload it only when a snapshot is loaded.
(The cache is also already updated when there is a change to one of
`service_levels_v2`, `role_members`, `role_attributes` tables.)

Fixes scylladb/scylladb#25114
Fixes scylladb/scylladb#23065

Closes scylladb/scylladb#25116
2025-08-01 13:41:08 +02:00
Jan Łakomy
8b2ed0f014 cassandra_tests: translate vector_invalid_query_test
Translate vector_invalid_query_test which tests parsing of ANN OF syntax.

Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
2025-08-01 12:08:50 +02:00
Jan Łakomy
eec47d9059 cassandra_tests: copy vector_invalid_query_test from Cassandra
Copy over and comment out this test's code from Cassandra so that it can be translated later.
2025-08-01 12:08:50 +02:00
Dawid Pawlik
b29e6870fa vector_index: make parameter names case insensitive
The custom index class name 'vector_index' and its similarity function
options should be case insensitive.

Before the patch the similarity functions had to be written in
SCREAMING_SNAKE_CASE which was not commonly and intuitively used.
Furthermore the Cassandra translated tests used the options written in
snake_case and as we wanted to translate them exactly, we had to be able
to use lower case option.
2025-08-01 12:08:50 +02:00
Jan Łakomy
5fecad0ec8 cql3/statements: add ANN OF queries support to select statements
Add parsing of `ANN OF` queries to the `select_statement` and
`indexed_table_select_statement` classes.
Add a placeholder for the implementation of external ANN queries.

Rename `should_create_view` to `view_should_exist` as it is used
not only to check if the view should be created but also if
the view has been created.

Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
2025-08-01 12:08:50 +02:00
Taras Veretilnyk
15e3980693 docs: Add documentation for the nodetool dropquarantinedsstables command
Fixes scylladb/scylladb#19061
2025-08-01 11:46:33 +02:00
Nikos Dragazis
2656fca504 test: Use in-memory SQLite for PyKMIP server
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.

This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).

Fix this by configuring the server to use an in-memory SQLite.
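
As a generic illustration (not the PyKMIP configuration itself, which is not shown here), Python's `sqlite3` module opens an in-memory database with the special `:memory:` path, so schema creation never touches the disk. The table names and columns below are made up:

```python
import sqlite3


def create_schema(conn: sqlite3.Connection, n_tables: int = 16) -> None:
    """Create `n_tables` dummy tables, mimicking a server's schema setup."""
    for i in range(n_tables):
        conn.execute(f"CREATE TABLE t{i} (id INTEGER PRIMARY KEY, payload BLOB)")
    conn.commit()


# ":memory:" keeps the whole database in RAM: no journal or data file syncs.
conn = sqlite3.connect(":memory:")
create_schema(conn)
table_count = conn.execute(
    "SELECT count(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
```

With an on-disk database, the figures above (16 tables, three fdatasync(2) calls each) would add up to 48 syncs for the same schema.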

Fixes #24842.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#24995
2025-08-01 12:11:27 +03:00
Nadav Har'El
2431f92967 alternator, test: add reproducer for issue about immediate LWT timeout
This patch adds a reproducer for issue #16261, where it was reported
that when Alternator read-modify-write (using LWT) operations to the
same partition are sent to different nodes, sometimes the operation
fails immediately, with an InternalServerError claiming to be a "timeout",
although this happens almost immediately (after a few milliseconds),
not after any real timeout.

The test uses 3 nodes, and 3 threads which send RMW operations to different
items in the same partition, and usually (though not with 100% certainty)
it reaches the InternalServerError in around 100 writes by each thread.
This InternalServerError looks like:

    Internal server error: exceptions::mutation_write_timeout_exception
    (Operation timed out for alternator_alternator_Test_1719157066704.alternator_Test_1719157066704 - received only 1 responses from 2 CL=LOCAL_SERIAL.)

The test also prints how much time it took for the request to fail,
for example:
    In incrementing 1,0 on node 1: error after 0.017074108123779297
This is 0.017 seconds - it's not the cas_contention_timeout_in_ms
timeout (1 second) or any other timeout.

If we enable trace logging, adding to topology_experimental_raft/suite.yaml
    extra_scylla_cmdline_options: ["--logger-log-level", "paxos=trace"]
we get the following TRACE-level message in the log:

    paxos - CAS[0] accept_proposal: proposal is partially rejected

This again shows the problem is "uncertainty" (partial rejection) and not
a timeout.

Refs #16261

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19445
2025-08-01 11:58:52 +03:00
Aleksandra Martyniuk
e607ef10cd api: storage_service: do not log the exception that is passed to user
The exceptions that are thrown by the tasks started with API are
propagated to users. Hence, there is no need to log it.

Remove the logs about exception in user started tasks.

Fixes: https://github.com/scylladb/scylladb/issues/16732.

Closes scylladb/scylladb#25153
2025-08-01 09:49:51 +03:00
Nadav Har'El
edc15a3cf5 test/cqlpy: slightly strengthen test for system.clients
We already have a rather rudimentary test for system.clients listing CQL
connections. However, as written the test will pass if system.clients is
empty :-) So let's strengthen the test to verify that there must be at
least one CQL connection listed in system.clients. Indeed, the test runs
the "SELECT FROM system.clients" over one CQL connection, so surely that
connection must be present.

This patch doesn't strengthen this test in any other way - it still has
just one connection, not many, it still doesn't validate the values of
most of the columns, and it is still written to assume the Scylla server
is running on localhost and not running any other workload in parallel.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:32:19 +03:00
Nadav Har'El
ce0ee27422 generic_server: use utils::scoped_item_list
A previous patch introduced utils::scoped_item_list, which maintains
a list of items - such as a list of ongoing connections - automatically
removing the item from the list when its handle is destroyed. The list
can also be iterated "gently" (without risking stalls when the list is
long).

The implementation of this class was based on very similar code in
generic_server.hh / generic_server.cc. So in this patch we change
generic_server to use the new scoped_item_list, and drop its own copy
of the duplicated logic of maintaining the list and iterating gently
over it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:32:14 +03:00
Nadav Har'El
70c94ac9dd docs/alternator: document the system.clients system table in Alternator
Add to docs/alternator/new-apis.md a full description of the
`system.clients` support in Alternator that was added in the previous
patches.

Although arguably *all* Scylla system tables should work on Alternator
and do not need to be individually documented, I believe that this
specific table is interesting to document. This is because some of
the attributes in this table have non-obvious and Alternator-specific
meanings. Moreover, there's even a difference in what each individual
item in the table represents (it represents active requests, not entire
connections as in CQL).

While editing the system tables section of new-apis.md, this patch also slightly
improves its formatting.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:05 +03:00
Nadav Har'El
5baa4c40fd alternator: add test for Alternator clients in system.clients
This patch adds a regression test for the feature added in the previous patch,
i.e., that the system.clients virtual table also lists ongoing Alternator requests.

The new test reads the system.clients system table using an Alternator Scan
request, so it should see its own request - at least - in the result. It
verifies that it sees Alternator requests (at least one), and that these
requests have the expected fields set, and for a couple of fields, we
even know which value to expect (the "client_type" field is "alternator",
and the "ssl_enabled" field depends on whether the test is checking an
http:// or https:// URL (you can try both in test/alternator/run - by
using or not using the "--https" parameter)).

The new test fails before the previous patch (because system.clients
will not list any Alternator connection), and passes after it.

As all tests in test_system_tables.py for Scylla-specific system tables,
this test is marked scylla_only and skipped when running on AWS DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:05 +03:00
Nadav Har'El
c14b9c5812 alternator: list active Alternator requests in system.clients
Today, the "system.clients" virtual table lists active connections (and
their various properties, like client address, logged in username and
client version) only for CQL requests. In this patch we make Alternator
active clients also be listed on this virtual table.

Unlike CQL where logged in username applies to a complete connection,
in the Alternator API, different requests, theoretically signed by
different users, can arrive over the same HTTP connection. So instead of
listing the currently open *connections*, we list the currently active
*requests*.

This means that when scanning system.clients, you will only see requests
which are being handled right now - and not inactive HTTP connections.
I think this is good enough (besides being the correct thing to do) - one
of the goals of this system.clients is to be able to see what kind of
drivers are being used by the user (the "driver_name" field in the
system.clients) - on a busy server there will always be some (even many)
requests being handled, so we'll always have plenty of requests to see
in system.clients.

By the way, note that for Alternator requests, what we use for the
"driver_name" is the request's User-Agent header. AWS SDKs typically
write the driver's name, its version, and often a lot of other
information in that header. For example, Boto3 sends a User-Agent
looking like:

    Boto3/1.38.46 md/Botocore#1.38.46 md/awscrt#0.24.2
    ua/2.1 os/linux#6.15.4-100.fc41.x86_64 md/arch#x86_64
    lang/python#3.13.5 md/pyimpl#CPython m/N,P,b,D,Z
    cfg/retry-mode#legacy Botocore/1.38.46 Resource

A functional test for the new feature - adding Alternator requests to
the system.clients table - will be in the next patch.

Fixes #24993

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:05 +03:00
Nadav Har'El
20b31987e1 utils: unit test for utils::scoped_item_list
The previous patch introduced a new utility class, utils::scoped_item_list.
This patch adds a comprehensive unit test for the new class.

We test basic usage of scoped_item_list, its size() and empty() methods,
how items are removed from the list when their handle goes out of scope,
how a handle's move constructor works, how items can be read and written
through their handles, and finally that removing an item during a
for_each_gently() iteration doesn't break the iteration.

One thing I still didn't figure out how to properly test is how removing
an item during *multiple* iterations that run concurrently fixes
multiple iterators. I believe the code is correct there (we just have a
list of ongoing iterations - instead of just one), but haven't found
yet a way to reproduce this situation in a test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:04 +03:00
Nadav Har'El
186e6d3ce0 utils: add a scoped_item_list utility class
In a later patch, we'll want Alternator to maintain a list of ongoing
requests, and be able to list them when the system.clients table is
read. This patch introduces a new container, utils::scoped_item_list<T>,
that will help Alternator do that:

  1. Each request adds an item to the list, and receives a handle;
     When that handle goes out of scope the item is automatically
     deleted from the list.
  2. Also a method is provided for iterating over the list of items
     without risking a stall if the list is very long.
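
The two properties above can be sketched in Python as follows. This is only a model of the idea, not the C++ `utils::scoped_item_list<T>` API: the class and method names are hypothetical, and the iteration here does not yield to a scheduler the way a "gentle" iteration would.

```python
class ScopedItemList:
    """A list whose items disappear when their handle is released."""

    def __init__(self):
        self._items = {}   # insertion-ordered: key -> value
        self._next_key = 0

    def emplace(self, value):
        key = self._next_key
        self._next_key += 1
        self._items[key] = value
        return Handle(self, key)

    def __len__(self):
        return len(self._items)

    def for_each(self, fn):
        # Iterate over a snapshot of the keys so that items removed while
        # iterating are skipped instead of breaking the loop.
        for key in list(self._items):
            if key in self._items:
                fn(self._items[key])


class Handle:
    """Owns one item; usable as a context manager to scope its lifetime."""

    def __init__(self, owner, key):
        self._owner = owner
        self._key = key

    def release(self):
        if self._owner is not None:
            self._owner._items.pop(self._key, None)
            self._owner = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.release()
```

A request handler would hold the handle for the duration of the request, for example `with requests.emplace(info): ...`, and the item is gone once the block exits.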

The new scoped_item_list<T> is heavily based on similar code that is
integrated inside generic_server.hh, which is used by CQL to similarly
maintain a list of active connections and their properties. However,
unfortunately that code is deeply integrated into the generic_server
class, and Alternator can't use generic_server because it uses Seastar's
HTTP server which isn't based on generic_server.

In contrast, the container defined in this patch is stand-alone and does
not depend on Alternator in any way. In a later patch in this series we
will modify generic_server to use the new scoped_item_list<> instead of
having that feature inside it.

The next patch is a unit test for the new class we are adding in this
patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:04 +03:00
Nadav Har'El
33476c7b06 utils: add "fatal" version of utils::on_internal_error()
utils::on_internal_error() is a wrapper for Seastar's on_internal_error()
which does not require a logger parameter - because it always uses one
logger ("on_internal_error"). Not needing a unique logger is especially
important when using on_internal_error() in a header file, where we
can't define a logger.

Seastar also has another similar function, on_fatal_internal_error(),
for which we forgot to implement a "utils" version (without a logger
parameter). This patch fixes that oversight.

In the next patch, we need to use on_fatal_internal_error() in a header
file, so the "utils" version will be useful. We will need the fatal
version because we will encounter an unexpected situation during server
destruction, and if we let the regular on_internal_error() just throw
an exception, we'll be left in an undefined state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:04 +03:00
Patryk Jędrzejczak
e53dc7ca86 Merge 'remove unused function and simplify some qp code.' from Gleb Natapov
No backport needed since these are cleanups.

Closes scylladb/scylladb#25258

* https://github.com/scylladb/scylladb:
  qp: fold prepare_one function into its only caller
  qp: co-routinize prepare_one function
  cql3: drop unused function
2025-07-31 18:19:47 +02:00
Taras Veretilnyk
1d6808aec4 topology_coordinator: Make tablet_load_stats_refresh_interval configurable
This commit introduces a config option 'tablet_load_stats_refresh_interval_in_seconds'
that allows overriding the default value without using error injection.

Fixes scylladb/scylladb#24641

Closes scylladb/scylladb#24746
2025-07-31 14:31:55 +03:00
Gleb Natapov
041011b2ee qp: fold prepare_one function into its only caller 2025-07-31 14:12:34 +03:00
Gleb Natapov
715f1d994f qp: co-routinize prepare_one function 2025-07-31 14:11:17 +03:00
Michał Chojnowski
c8682af418 sstables: introduce trie_writer
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).

As of this patch, the new code isn't used for anything yet,
but we introduce it separately from its users to keep PRs small enough
for reviewability.

This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:

1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).

It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.

This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:

1. The writer operates on chains of characters, rather than single characters.

   In Cassandra's implementation, the writer creates one node per character.
   A single long key can be translated to thousands of nodes.
   We create only one node per key. (Actually we split very long keys into
   a few nodes, but that's arbitrary and beside the point).

   For BTI's partition key index this doesn't matter.
   Since it only stores a minimal unique prefix of each key,
   and the trie is very balanced (due to token randomness),
   the average number of new characters added per key is very close to 1 anyway.
   (And the string-based logic might actually be a small pessimization, since
   manipulating a 1-byte string might be costlier than manipulating a single byte).

   But the row index might store arbitrarily long entries, and in that case the
   character-based logic might result in catastrophically bad performance.
   For reference: when writing a partition index, the total processing cost
   of a single node in the trie_writer is on the order of 800 instructions.
   Total processing cost of a single tiny partition during an `upgradesstables`
   operation is on the order of 10000 instructions. A small INSERT is on the
   order of 40000 instructions.

   So processing a single 1000-character clustering key in the trie_writer
   could cost as much as 20 INSERTs, which is scary. Even 100-character keys
   can be very expensive. With extremely long keys like that, the string-based
   logic is more than ~100x cheaper than character-based logic.
   (Note that only *new* characters matter here. If two index entries share a
   prefix, that prefix is only processed once. And the index is only populated
   with the minimal prefix needed to distinguish neighbours. So in practice,
   long chains might not happen often. But still, they are possible).

   I don't know if it makes sense to care about this case, but I figured the
   potential for problems is too big to ignore, so I switched to chain-based logic.

2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
   than a full page after revising the estimate, Cassandra splits it in a
   different way than us.
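
The "20 INSERTs" estimate in item 1 follows from the per-node and per-INSERT figures quoted there. A quick check, assuming (as the message does) one node per new character in the character-based design; the constant and function names are made up for this sketch:

```python
# Orders of magnitude quoted in the commit message, not measured here.
NODE_COST = 800        # instructions per trie_writer node
INSERT_COST = 40_000   # instructions per small INSERT


def char_based_cost(new_chars):
    # One node per new character, as in the character-based design.
    return new_chars * NODE_COST


cost_1000 = char_based_cost(1000)             # 1000-character clustering key
inserts_equivalent = cost_1000 / INSERT_COST  # the same cost, counted in INSERTs
```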

For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.

The serialization logic is passed to trie_writer via a template parameter.

There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
2025-07-31 12:51:37 +02:00
Calle Wilund
43f7eecf9e compress: move compress.cc/hh to sstables/compressor
Fixes #22106

Moves the shared compress components to sstables, and renames them to
match the class type.

Adjust includes, removing redundant/unneeded ones where possible.

Closes scylladb/scylladb#25103
2025-07-31 13:10:41 +03:00
Pavel Emelyanov
34608450c5 Merge 'qos: don't populate effective service level cache until auth is migrated to raft' from Piotr Dulikowski
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.

Fixes: scylladb/scylladb#24963

Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).

Closes scylladb/scylladb#25188

* github.com:scylladb/scylladb:
  test: sl: verify that legacy auth is not queried in sl to raft upgrade
  qos: don't populate effective service level cache until auth is migrated to raft
2025-07-31 13:05:27 +03:00
Botond Dénes
7e27157664 replica/table: add_sstables_and_update_cache(): remove error log
The plural overload of this method logs an error when the sstable add
fails. This is unnecessary, the caller is expected to catch and handle
exceptions. Furthermore, this unconditional error log results in
sporadic test failures, due to the unexpected error in the logs on
shutdown.

Fixes: #24850

Closes scylladb/scylladb#25235
2025-07-31 12:34:40 +03:00
Jan Łakomy
e69e0cb546 cql/Cql.g: extend the grammar to allow for ANN OF queries
Extend `orderByClause` so that it can accept the `ORDER BY 'column_name' ANN OF 'vector_literal'` syntax.

Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
2025-07-31 11:11:24 +02:00
Jan Łakomy
d073a4c1fa cql3/raw: add ANN ordering to the raw statement layer
Extend `orderings_type` to include ANN ordering.

Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
2025-07-31 11:11:24 +02:00
Petr Gusev
3500a10197 scylla_cluster.py: add try_get_host_id
Tests sometimes fail in ScyllaCluster.add_server on the
'replaced_srv.host_id' line because host_id is not resolved yet. In
this commit we introduce functions try_get_host_id and get_host_id
that resolve it when needed.

Closes scylladb/scylladb#25177
2025-07-31 10:37:06 +02:00
Patryk Jędrzejczak
c41f0e6da9 Merge 'generic server: 2 step shutdown' from Sergey Zolotukhin
This PR implements the solution proposed in scylladb/scylladb#24481

Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.

The updated shutdown process is as follows:

1. Initial Shutdown Phase
   * Close the accept gate to block new incoming connections.
   * Abort all accept() calls.
   * For all active connections:
      * Close only the input side of the connection to prevent new requests.
      * Keep the output side open to allow responses to be sent.

2. Drain Phase
   * Wait for all in-progress requests to either complete or fail.

3. Final Shutdown Phase
   * Fully close all connections.
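
The three phases above can be sketched with asyncio as follows. This is a simplified model and not the generic_server code: the Connection class and two_step_shutdown function are hypothetical, and the accept gate and accept() aborts from phase 1 are left out. Only the method names shutdown_input and shutdown_output are taken from the commit list below.

```python
import asyncio


class Connection:
    """Hypothetical connection with separately closable input and output."""

    def __init__(self):
        self.input_open = True
        self.output_open = True

    def shutdown_input(self):
        self.input_open = False

    def shutdown_output(self):
        self.output_open = False


async def two_step_shutdown(connections, in_flight, log):
    # 1. Initial phase: stop new requests, keep the output side open.
    for c in connections:
        c.shutdown_input()
    log.append("input closed")
    # 2. Drain phase: wait for in-progress requests to complete or fail.
    await asyncio.gather(*in_flight, return_exceptions=True)
    log.append("drained")
    # 3. Final phase: fully close all connections.
    for c in connections:
        c.shutdown_output()
    log.append("closed")


async def demo():
    log = []
    conn = Connection()

    async def request():
        await asyncio.sleep(0.01)
        # The response can still be sent: output stays open while draining.
        log.append(f"response sent, output_open={conn.output_open}")

    task = asyncio.ensure_future(request())
    await two_step_shutdown([conn], [task], log)
    return log, conn


log, conn = asyncio.run(demo())
```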

Fixes scylladb/scylladb#24481

Closes scylladb/scylladb#24499

* https://github.com/scylladb/scylladb:
  test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`,  decrease request timeout.
  generic_server: Two-step connection shutdown.
  transport: consmetic change, remove extra blanks.
  transport: Handle sleep aborted exception in sleep_until_timeout_passes
  generic_server: replace empty destructor with `= default`
  generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
  generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
  test: Add test for query execution during CQL server shutdown
2025-07-31 10:32:30 +02:00
Nadav Har'El
78c10af960 test/cqlpy: add reproducer for INSERT JSON .. IF NOT EXISTS bug
This patch adds an xfailing test reproducing a bug where, when adding
an IF NOT EXISTS to an INSERT JSON statement, the IF NOT EXISTS is
ignored.

This bug has been known for 4 years (issue #8682) and even has a FIXME
referring to it in cql3/statements/update_statement.cc, but until now
we didn't have a reproducing test.

The tests in this patch also show that this bug is specific to
INSERT JSON - regular INSERT works correctly - and also that
Cassandra works correctly (and passes the test).

Refs #8682

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25244
2025-07-30 20:14:50 +03:00
Piotr Smaron
8d5249420b Update seastar submodule
* seastar 60b2e7da...7c32d290 (14):
  > posix: Replace static_assert with concept
  > tls: Push iovec with the help of put(vector<temporary_buffer>)
  > io_queue: Narrow down friendship with reactor
  > util: drop concepts.hh
  > reactor: Re-use posix::to_timespec() helper
  > Fix incorrect defaults for io queue iops/bandwidth
  > net: functions describing ssl connection
  > Add label values to the duplicate metrics exception
  > Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov
    test: Add unit test for cross-sched-groups wakeups
    test: Add unit test for fair CPU scheduling
    test: Add unit test for basic supergrops manipulations
    test: Add perf test for context switch latency
    scheduling: Add an internal method to get group's supergroup
    reactor: Add supergroup get_shares() API
    reactor: Add supergroup::set_shares() API
    reactor: Create scheduling groups in supergroups
    reactor: Supergroups destroying API
    reactor: Supergroups creating API
    reactor: Pass parent pointer to task_queue from caller
    reactor: Wakeup queue group on child activation
    reactor: Add pure virtual sched_entity::run_tasks() method
    reactor: Make task_queue_group be sched_entity too
    reactor: Split task_queue_group::run_some_tasks()
    reactor: Count and limit supergroup children
    reactor: Link sched entity to its parent
    reactor: Switch activate(task_queue*) to work on sched_entity
    reactor: Move set_shares() to sched_entity()
    reactor: Make account_runtime() work with sched_entity
    reactor: Make insert_activating_task_queue() work on sched_entity
    reactor: Make pop_active_task_queue() work on sched_entity
    reactor: Make insert_active_task_queue() work on sched_entity
    reactor: Move timings to sched_entity
    reactor: Move active bit to sched_entity
    reactor: Move shares to sched_entity
    reactor: Move vruntime to sched_entity
    reactor: Introduce sched_entity
    reactor: Rename _activating_task_queues -> _activating
    reactor: Remove local atq* variable
    reactor: Rename _active_task_queues -> _active
    reactor: Move account_runtime() to task_queue_group
    reactor: Move vruntime update from task_queue into _group
    reactor: Simplify task_queue_group::run_some_tasks()
    reactor: Move run_some_tasks() into task_queue_group
    reactor: Move insert_activating_task_queues() into task_queue_group
    reactor: Move pop_active_task_queue() into task_queue_group
    reactor: Move insert_active_task_queue() into task_queue_group
    reactor: Introduce and use task_queue_group::activate(task_queue)
    reactor: Introduce task_queue_group::active()
    reactor: Wrap scheduling fields into task_queue_group
    reactor: Simplify task_queue::activate()
    reactor: Rename task_queue::activate() -> wakeup()
    reactor: Make activate() method of class task_queue
    reactor: Make task_queue::run_tasks() return bool
    reactor: Simplify task_queue::run_tasks()
    reactor: Make run_tasks() method of class task_queue
  > Fix hang in io_queue for big write ioproperties numbers
  > split random io buffer size in 2 options
  > reactor: document run_in_background
  > Merge 'Add io_queue unit test for checking request rates' from Robert Bindar
    Add unit test for validating computed params in io_queue
    Move `disk_params` and `disk_config_params` to their own unit
    Add an overload for `disk_config_params::generate_config`

Closes scylladb/scylladb#25254
2025-07-30 16:44:18 +03:00
Patryk Jędrzejczak
5ce16488c9 Merge 'test/cqlpy: two small fixes for "--release" feature' from Nadav Har'El
This small series fixes two small bugs in the "--release" feature of test/cqlpy/run and test/alternator/run, which allows a developer to run single-node functional tests against any past release of Scylla. The two patches fix:

1. Allow "run --release" to be used when Scylla has not even been built from source.
2. Fix a mistake in choosing the most recent release when only a ".0" and RC releases are available. This is currently the case for the 2025.2 branch, which is why I discovered the bug now.
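
For the second fix, one way to order releases so that a ".0" final release ranks above its release candidates is sketched below. This is not the code in fetch_scylla.py: the "X.Y.Z-rcN" version format and the names release_key and most_recent are assumptions made for illustration.

```python
import re


def release_key(version):
    # Split e.g. "2025.2.0-rc1" into numeric parts plus an optional rc number.
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-rc(\d+))?", version)
    if m is None:
        raise ValueError(f"unrecognized version: {version}")
    major, minor, patch, rc = m.groups()
    # A final release (no rc suffix) sorts after all of its release candidates.
    is_final = rc is None
    return (int(major), int(minor), int(patch), is_final, int(rc or 0))


def most_recent(versions):
    return max(versions, key=release_key)


latest = most_recent(["2025.2.0-rc0", "2025.2.0-rc1", "2025.2.0"])
```

Comparing the raw strings instead would rank "2025.2.0-rc1" above "2025.2.0", which is the kind of mistake a ".0 plus RCs only" branch can expose.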

Fixes #25223

This patch only affects developer's experience if using the test/cqlpy/run script manually (these scripts are not used by CI), so should not be backported.

Closes scylladb/scylladb#25227

* https://github.com/scylladb/scylladb:
  test/cqlpy: fix fetch_scylla.py for .0 releases
  test/cqlpy: fix "run --release" when Scylla hasn't been built
2025-07-30 15:13:26 +02:00
Petr Gusev
dea41b1764 test_tablets_lwt.py: make tests rf_rack_valid
This is a refactoring commit. Remove the rf_rack_valid_keyspaces: False
flag because rf_rack_validity is going to become mandatory in
scylladb/scylladb#23526
2025-07-30 13:48:33 +02:00
Aleksandra Martyniuk
99ff08ae78 streaming: close sink when exception is thrown
If an exception is thrown in result_handling_cont in streaming,
then the sink does not get closed. This leads to a node crash.

Close sink in exception handler.

Fixes: https://github.com/scylladb/scylladb/issues/25165.

Closes scylladb/scylladb#25238
2025-07-30 14:26:14 +03:00
Petr Gusev
bd82a9d7e5 test_tablets_lwt: add test_lwt_coordinator_shard
Check that an LWT coordinator which is not a replica runs on the
same shard as a replica.
2025-07-30 13:08:56 +02:00
Andrei Chekun
d0e4045103 test.py: add repeats in pytest
The previous way of executing repeats was to launch pytest for each repeat.
That was resource-consuming, since pytest ran test discovery each time.
Now all repeats are done inside one pytest process.
2025-07-30 12:03:08 +02:00
Andrei Chekun
853bdec3ec test.py: add directories and filename to the log files
Currently, only the test function name is used for output and log files. For better
clarity, add the relative path (from the test directory) of the test file,
without its extension, to these file names.
Before:
test_aggregate_avg.1.log
test_aggregate_avg_stdout.1.log
After:
boost.aggregate_fcts_test.test_aggregate_avg.1.log
boost.aggregate_fcts_test.test_aggregate_avg_stdout.3.log
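
The naming scheme can be sketched as below. The helper log_file_name and the source file name "boost/aggregate_fcts_test.cc" are assumptions made for this illustration, not test.py's actual code.

```python
from pathlib import PurePosixPath


def log_file_name(test_file, function_name, run_id, suffix=""):
    # Relative path of the test file without its extension, with directory
    # separators turned into dots: "boost/aggregate_fcts_test.cc" becomes
    # "boost.aggregate_fcts_test".
    prefix = ".".join(PurePosixPath(test_file).with_suffix("").parts)
    return f"{prefix}.{function_name}{suffix}.{run_id}.log"


name = log_file_name("boost/aggregate_fcts_test.cc", "test_aggregate_avg", 1)
stdout_name = log_file_name(
    "boost/aggregate_fcts_test.cc", "test_aggregate_avg", 1, "_stdout"
)
```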
2025-07-30 12:03:08 +02:00
Andrei Chekun
557293995b test.py: rename log sink file for boost tests
The log sink is written in XML format, not as a simple text file. Rename it for better clarity.
2025-07-30 12:03:08 +02:00
Andrei Chekun
cc75197efd test.py: better error handling in boost facade
If a test was not executed for some reason (for example, an unknown parameter was passed to the test) but the boost framework was able to finish correctly, the log file will have data but it will be parsed to an empty list. This raises an exception in pytest execution rather than producing test output. This change handles this situation.
2025-07-30 12:03:08 +02:00
Andrei Chekun
4c33ff791b build: add pytest-timeout to the toolchain
Adding this plugin allows using a timeout for a test or a timeout for the whole
session. This can be useful for the Unit Test Custom task in the pipeline to avoid
running tests in batches, which would mess with the test names later in Jenkins.

Closes #25210

[avi: regenerate frozen toolchain with optimized clang from

  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]

Closes scylladb/scylladb#25243
2025-07-30 12:53:10 +03:00
Gleb Natapov
e496a89f80 cql3: drop unused function 2025-07-30 12:17:23 +03:00
Avi Kivity
5e150eafa4 keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit
Avoids thread_local guards on every access.
2025-07-29 23:55:19 +03:00
Avi Kivity
e2316a4a66 keys: make empty creation clustering_key_prefix constexpr
Short-circuit make_empty() to construct an empty managed_bytes.
Sprinkle constexpr specifiers as needed to make it work.
2025-07-29 23:54:03 +03:00
Avi Kivity
5c6c944797 managed_bytes: make empty managed_bytes constexpr friendly
Sprinkle constexpr where needed to make the default constructor,
move constructor, and destructor constexpr.

Add a test to verify.

This is needed to make a thread_local variable containing an
empty managed_bytes constinit, reducing thread-local guards.
2025-07-29 23:51:43 +03:00
Avi Kivity
3f6d0d832c keys: clustering_bounds_comparator: make _empty_prefix a prefix
_empty_prefix, as its name suggests, is a prefix, but its type
is not. Presumably it works due to implicit conversions.

There should not be a clustering_key::make_empty(), but we'll
suffer it for now.

Fix by making _empty_prefix a prefix.
2025-07-29 23:13:09 +03:00
Petr Gusev
e120ee6d32 storage_proxy.cc: get_cas_shard: fallback to the primary replica shard
Currently, get_cas_shard uses shard_for_reads to decide which
shard to use for LWT execution—both on replicas and the coordinator.

If the coordinator is not a replica, shard_for_reads returns a default
shard (shard 0). There are at least two problems with this:
* shard 0 can become overloaded, because all LWT
coordinators-but-not-replicas are served on it.
* mismatch with replicas: the default shard doesn't match what
shard_for_reads returns on replicas. This hinders the "same shard for
client and server" RPC level optimization.

In this commit we change get_cas_shard to use a primary replica
shard if the current node is not a replica. This guarantees that all
LWT coordinators for the same tablet will be served on the same shard.
This is important for LWT coordinator locks
(paxos::paxos_state::get_cas_lock). Also, if all tablet replicas on
different nodes live on the same shard, RPC
optimization will make sure that no additional smp::submit_to will
be needed on the server side.
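
A minimal sketch of the fallback rule, assuming the caller already knows the local replica shard (None when this node is not a replica) and the primary replica's shard. The function signature is hypothetical and does not match storage_proxy's real get_cas_shard.

```python
def get_cas_shard(local_replica_shard, primary_replica_shard):
    # local_replica_shard is None when this node hosts no replica of the tablet.
    if local_replica_shard is not None:
        return local_replica_shard
    # Not a replica: use the primary replica's shard, so that all
    # coordinators for the same tablet agree on one shard.
    return primary_replica_shard


# Two coordinators that are not replicas, and one replica living on shard 5.
choices = [get_cas_shard(None, 5), get_cas_shard(None, 5), get_cas_shard(5, 5)]
```

Note the explicit `is not None` test: shard 0 is a valid replica shard and must not be mistaken for "not a replica".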

Fixes scylladb/scylladb#20497
2025-07-29 17:07:04 +02:00
Botond Dénes
2985c343ed Merge 'repair: Avoid too many fragments in a single repair_row_on_wire' from Asias He
When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message.

This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message at once, without splitting it.

This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream.

Tests are added to make sure the message split works.
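
The idea of capping how much goes into one message can be sketched as below. The helper split_fragments and the count-based limit are assumptions; the message does not say whether the actual patch limits messages by fragment count or by size.

```python
def split_fragments(fragments, max_per_message):
    # Yield consecutive slices of at most max_per_message fragments, so that
    # no single message carries all fragments of a large partition.
    if max_per_message <= 0:
        raise ValueError("max_per_message must be positive")
    for start in range(0, len(fragments), max_per_message):
        yield fragments[start:start + max_per_message]


messages = list(split_fragments(list(range(10)), 4))
```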

Fixes #24808

Closes scylladb/scylladb#25002

* github.com:scylladb/scylladb:
  repair: Avoid too many fragments in a single repair_row_on_wire
  repair: Change partition_key_and_mutation_fragments to use chunked_vector
  utils: Allow chunked_vector::erase to work with non-default-constructible type
2025-07-29 17:45:57 +03:00
Patryk Jędrzejczak
8e43856ca7 Merge 'Pass more elaborated "reasons" to stop_ongoing_compactions()' from Pavel Emelyanov
When running compactions are aborted by the aforementioned helper, a line like the following appears in the logs:
"Compaction for ks/cf was stopped due to: user-triggered operation". This message could be better, since the same "user-triggered operation" text may stand for several distinct reasons.

With this PR, the message helps tell "truncate", "cleanup", "rewrite" and "split" apart.

Closes scylladb/scylladb#25136

* https://github.com/scylladb/scylladb:
  compaction: Pass "reason" to perform_task_on_all_files()
  compaction: Pass "reason" to run_with_compaction_disabled()
  compaction: Pass "reason" to stop_and_disable_compaction()
2025-07-29 16:06:17 +02:00
Pavel Emelyanov
286fad4da6 api: Simplify table_info::name extraction with std::views::transform
Instead of using lambda, pass pointer to struct member. The result is
the same, but the code is nicer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25123
2025-07-29 15:56:58 +02:00
Sergey Zolotukhin
4f63e1df58 test: Set request_timeout_on_shutdown_in_seconds to request_timeout_in_ms,
decrease request timeout.

In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.
2025-07-29 15:37:47 +02:00
Nadav Har'El
22f845b128 docs/alternator: mention missing ShardFilter support
Add in docs/alternator/compatibility.md a mention of the ShardFilter
option which we don't support in Alternator Streams. This option was
only introduced to DynamoDB a week ago, so it's not surprising we
don't yet support it :-)

Refs #25160

Closes scylladb/scylladb#25161
2025-07-29 14:37:24 +03:00
Andrei Chekun
a6a3d119e8 docs: update documentation with new way of running C++ tests
The documentation had outdated information on how to run C++ tests.
Additionally, some information was added about gathered test metrics.

Closes scylladb/scylladb#25180
2025-07-29 14:36:19 +03:00
Dawid Mędrek
408b45fa7e db/commitlog: Extend error messages for corrupted data
We're providing additional information in error messages when throwing
an exception related to data corruption: when a segment is truncated
and when its content is invalid. That might prove helpful when debugging.

Closes scylladb/scylladb#25190
2025-07-29 14:35:14 +03:00
Anna Stuchlik
b67bb641bc doc: add OS support for ScyllaDB 2025.3
This commit adds the information about support for platforms in ScyllaDB version 2025.3.

Fixes https://github.com/scylladb/scylladb/issues/24698

Closes scylladb/scylladb#25220
2025-07-29 14:33:12 +03:00
Anna Stuchlik
8365219d40 doc: add the upgrade guide from 2025.2 to 2025.3
This PR adds the upgrade guide from version 2025.2 to 2025.3.
Also, it removes the upgrade guide existing for the previous version
that is irrelevant in 2025.3 (upgrade from 2025.1 to 2025.2).

Note that the new guide does not include the "Enable Consistent Topology Updates" page and note,
as users upgrading to 2025.3 have consistent topology updates already enabled.

Fixes https://github.com/scylladb/scylladb/issues/24696

Closes scylladb/scylladb#25219
2025-07-29 14:32:31 +03:00
Avi Kivity
11ee58090c commitlog: replace std::enable_if with a constraint
std::enable_if is obsolete and was replaced with concepts
and constraints.

Replace the std::is_fundamental_v enable_if constraint with
std::integral. The latter is more accurate - std::ntoh()
is not defined for floats, for example. In any case, we only
read integrals in commitlog.

Closes scylladb/scylladb#25226
2025-07-29 12:51:24 +02:00
Michał Chojnowski
6d27065f99 cql3/result_set: set GLOBAL_TABLES_SPEC in metadata if appropriate
Unless the client uses the SKIP_METADATA flag,
Scylla attaches some metadata to query results returned to the CQL
client.
In particular, it attaches the spec (keyspace name, table
name, name, type) of the returned columns.

By default, the keyspace name and table name are present in each column
spec. However, since they are almost always the same for every column
(I can't think of any case when they aren't the same;
it would make sense if Cassandra supported joins, but it doesn't)
that's a waste.

So, as an optimization, the CQL protocol has the GLOBAL_TABLES_SPEC flag.
The flag can be set if all columns belong to the same table,
and if it is set, then the keyspace and table name are only written
in the first column spec, and skipped in other column specs.

Scylla sets this flag, if appropriate, in responses to PREPARE requests.
But it never sets the flag in responses to queries.

But it could. And this patch causes it to do that.
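
The space saving can be illustrated with a simplified model of the metadata. The function encode_metadata is hypothetical and returns a flat list of fields rather than protocol bytes; only the flag value 0x0001 comes from the CQL binary protocol.

```python
GLOBAL_TABLES_SPEC = 0x0001  # flag bit in the CQL result metadata


def encode_metadata(columns):
    # columns: list of (keyspace, table, name, type) tuples.
    # Returns (flags, fields), where fields is the flat sequence of values
    # that would be written; byte-level encoding is omitted.
    tables = {(ks, tbl) for ks, tbl, _, _ in columns}
    if len(tables) == 1:
        # All columns belong to one table: write keyspace and table once.
        ks, tbl = next(iter(tables))
        fields = [ks, tbl]
        for _, _, name, typ in columns:
            fields += [name, typ]
        return GLOBAL_TABLES_SPEC, fields
    # Otherwise every column spec carries its own keyspace and table.
    fields = []
    for ks, tbl, name, typ in columns:
        fields += [ks, tbl, name, typ]
    return 0, fields


flags, fields = encode_metadata([("ks", "t", "a", "int"), ("ks", "t", "b", "text")])
```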

Fixes #17788

Closes scylladb/scylladb#25205
2025-07-29 12:40:12 +03:00
Piotr Dulikowski
3a082d314c test: sl: verify that legacy auth is not queried in sl to raft upgrade
Adjust `test_service_levels_upgrade`: right before upgrade to topology
on raft, enable an error injection which triggers when the standard role
manager is about to query the legacy auth tables in the
system_auth keyspace. The preceding commit which fixes
scylladb/scylladb#24963 makes sure that the legacy tables are not
queried during upgrade to topology on raft, so the error injection does
not trigger and does not cause a problem; without that commit, the test
fails.
2025-07-29 11:39:17 +02:00
Piotr Dulikowski
2bb800c004 qos: don't populate effective service level cache until auth is migrated to raft
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.

Fixes: scylladb/scylladb#24963
2025-07-29 11:37:37 +02:00
Petr Gusev
801bf42ea2 sharder: add try_get_shard_for_reads method
Currently, we use storage_proxy/get_cas_shard ->
sharder.shard_for_reads to decide which shard to use for LWT code
execution on both replicas and the coordinator.

If the coordinator is not a replica, shard_for_reads returns 0 —
the 'default' shard. This behavior has at least two problems:
* Shard 0 may become overloaded, because all LWT coordinators that are
not replicas will be served on it.
* The zero shard does not match shard_for_reads on replicas, which
hinders the "same shard for client and server" RPC-level optimization.

To fix this, we need to know whether the current node hosts a replica
for the tablet corresponding to the given token. Currently, there is
no API we could use for this. For historical reasons,
sharder::shard_for_reads returns 0 when the node does not host the
shard, which leads to ambiguity.

This commit introduces try_get_shard_for_reads, which returns a
disengaged std::optional when the tablet is not present on
the local node.

We leave the shard_for_reads method in the base sharder class; it calls
try_get_shard_for_reads and returns zero by default. We need to rename
tablet_sharder private methods shard_for_reads and shard_for_writes
so that they don't conflict with the sharder::shard_for_reads.
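
The relationship between the two methods can be sketched as follows, with None standing in for a disengaged std::optional. The Sharder class and its token-to-shard dictionary are a hypothetical model, not the real sharder interface.

```python
from typing import Optional


class Sharder:
    """Hypothetical model of the base sharder class."""

    def __init__(self, local_shards):
        # token -> shard, only for tablets that have a replica on this node
        self._local_shards = local_shards

    def try_get_shard_for_reads(self, token) -> Optional[int]:
        # None plays the role of a disengaged std::optional.
        return self._local_shards.get(token)

    def shard_for_reads(self, token) -> int:
        # Legacy behaviour: return 0 when the tablet is not on this node.
        shard = self.try_get_shard_for_reads(token)
        return 0 if shard is None else shard


sharder = Sharder({10: 3, 20: 0})
```

With this model, shard_for_reads returns 0 both for token 20 (a real replica on shard 0) and for an unknown token, which is the ambiguity described above; try_get_shard_for_reads distinguishes the two cases.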
2025-07-29 11:35:54 +02:00
Nadav Har'El
f6a3e6fbf0 sstables: don't depend on fmt 11.1 to build
A recent commit a0c29055e5 added
some trace printouts which print an std::reference_wrapper<>.
Apparently a formatter for this type was only added to fmt
in version 11.1.0, and it doesn't exist on earlier versions,
such as fmt 11.0.2 on Fedora 41.

Let's avoid requiring shiny-new versions of fmt. The workaround
is easy: just unwrap the reference_wrapper - print pr.get()
instead of just pr, and Scylla returns to building correctly on
Fedora 41.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25228
2025-07-29 11:32:06 +02:00
Patryk Jędrzejczak
3299ffba51 Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).

However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.

This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.

The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.

Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.

Refs #25115

Fixes #24625

Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.

Closes scylladb/scylladb#25151

* https://github.com/scylladb/scylladb:
  raft_group0: split shutdown into abort_and_drain and destroy
  Revert "main.cc: fix group0 shutdown order"
2025-07-29 10:39:00 +02:00
Asias He
e28c75aa79 repair: Avoid too many fragments in a single repair_row_on_wire
When repairing a partition with many rows, we can store many fragments
in a repair_row_on_wire object which is sent as a rpc stream message.

This could cause reactor stalls when the rpc stream compression is
turned on, because the compressor processes the whole message in one
go, without splitting it.

This patch solves the problem at the higher level by reducing the
message size that is sent to the rpc stream.
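
As an illustration only, the idea of bounding the size of each message can be sketched in Python. The function name `split_into_messages` and the byte limit are hypothetical and are not taken from the patch; the real code works on `repair_row_on_wire` objects in C++.

```python
def split_into_messages(fragments, max_message_bytes):
    """Group fragments into batches whose total size stays within the limit.

    Hypothetical helper for illustration; a single fragment larger than
    the limit still gets a batch of its own.
    """
    batch, batch_bytes = [], 0
    for fragment in fragments:
        # Close the current batch if adding this fragment would exceed the limit.
        if batch and batch_bytes + len(fragment) > max_message_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(fragment)
        batch_bytes += len(fragment)
    if batch:
        yield batch
```

Each yielded batch would then be sent as its own stream message, so the compressor never sees more than roughly `max_message_bytes` at once.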

Tests are added to make sure the message split works.

Fixes #24808
2025-07-29 13:43:53 +08:00
Asias He
266a518e4c repair: Change partition_key_and_mutation_fragments to use chunked_vector
With the change in "repair: Avoid too many fragments in a single
repair_row_on_wire", the

std::list<frozen_mutation_fragment> _mfs;

in partition_key_and_mutation_fragments will not contain a large number of
fragments any more. Switch to using chunked_vector.
2025-07-29 13:43:17 +08:00
Asias He
4a4fbae8f7 utils: Allow chunked_vector::erase to work with non-default-constructible type
This is needed for chunked_vector<frozen_mutation_fragment> in repair.
2025-07-29 13:43:17 +08:00
Avi Kivity
d3cdb88fe7 tools: toolchain: dbuild: increase depth of nested podman configuration coverage
The initial support for nested containers (2d2a2ef277) worked on
my machine (tm) and even laptop, but does not work on fresh installs.
This is likely due to changes in where persistent configuration is
stored on the host between various podman versions; even though my
podman is fully updated, it uses configuration created long ago.

Make nested containers work on fresh installs by also configuring
/etc/containers/storage.conf. The important piece is to set graphroot
to the same location as the host.

Verified both on my machine and on a fresh install.

Closes scylladb/scylladb#25156
2025-07-29 08:23:41 +03:00
Botond Dénes
f3ed27bd9e Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov
Nowadays the way to configure an internal service is

1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value

The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.
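
The three-step pattern above can be sketched in Python as follows. All names here (`FeatureConfig`, `FeatureService`, the `disabled_features` key) are hypothetical stand-ins for illustration; the actual code is C++ and its struct layout may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureConfig:
    # Step 1: the service declares its own small config struct.
    disabled_features: set = field(default_factory=set)


class FeatureService:
    def __init__(self, config: FeatureConfig):
        # Step 3: the service only ever sees its own config, passed by value.
        self._config = config

    def is_enabled(self, name: str) -> bool:
        return name not in self._config.disabled_features


def feature_config_from_db_config(db_config: dict) -> FeatureConfig:
    # Step 2: the caller (main/test/tool) translates the system-wide
    # configuration into the service's config. This helper lives outside
    # the service, so the service has no dependency on the global config.
    return FeatureConfig(disabled_features=set(db_config.get("disabled_features", [])))
```

The point of the split is that `FeatureService` can be constructed in a test from a hand-written `FeatureConfig`, without building a full system configuration.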

For the reference: similar changes with other services: #23705 , #20174 , #19166

Closes scylladb/scylladb#25118

* github.com:scylladb/scylladb:
  gms,init: Move get_disabled_features_from_db_config() from gms
  code: Update callers generating feature service config
  gms: Make feature_config a simple struct
  gms: Split feature_config_from_db_config() into two
2025-07-29 08:17:49 +03:00
Anna Stuchlik
18b4d4a77c doc: add tablets support information to the Drivers table
This commit:

- Extends the Drivers support table with information on which driver supports tablets
  and since which version.
- Adds the driver support policy to the Drivers page.
- Reorganizes the Drivers page to accommodate the updates.

In addition:
- The CPP-over-Rust driver is added to the table.
- The information about Serverless (which we don't support) is removed
  and replaced with tablets to correctly describe the contents of the table.

Fixes https://github.com/scylladb/scylladb/issues/19471

Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69

Closes scylladb/scylladb#24635
2025-07-29 08:11:42 +03:00
Avi Kivity
f7324a44a2 compaction: demote normal compaction start/end log messages to debug level
Compaction is routine and the log messages pollute the log files,
hiding important information.

All the data is available via `nodetool compactionhistory`.

Reduce noise by demoting those log messages to debug level.

One test is adjusted to use debug level for compaction, since it
listens for those messages.

Closes scylladb/scylladb#24949
2025-07-29 08:02:22 +03:00
Nadav Har'El
e43828c10b test/cqlpy: fix fetch_scylla.py for .0 releases
The test/cqlpy/fetch_scylla.py script is used by test/cqlpy/run and
test/alternator/run to implement their "--release" option - which allows
you to run current tests against any official release of Scylla
downloaded from Scylla's S3 bucket.

When you ask to get release "2025.1", the idea is to fetch the latest
release available in the 2025.1 stream - currently it is 2025.1.5.
fetch_scylla.py does this by listing the available 2025.1 releases,
sorting them and fetching the last one.

We had a bug in the sort order - version 0 was sorted before version
0-rc1, which is incorrect (the version 2025.2.0 came after
2025.2.0~rc1).

For most releases this didn't cause any problem - 0~rc1 was sorted after
0, but 5 (for example) came after both, so 2025.1.5 got downloaded.
But when a release has **only** an rc and a .0 release, we incorrectly
used the rc instead of the .0.

This patch fixes the sort order by using the "/" character, which sorts
before "0", in rc version strings when sorting the release numbers.

Before this patch, we had this problem in "--release 2025.2" because
currently 2025.2 only has RC releases (rc0 and rc1) and a .0 release,
and we wrongly downloaded the rc1. After this patch, the .0 is chosen
as expected:

  $ test/cqlpy/run --release 2025.2
  Chosen download for ScyllaDB 2025.2: 2025.2.0
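
A minimal sketch of such a sort key, assuming version strings of the form `2025.2.0` and `2025.2.0~rc1`. The function `release_sort_key` is hypothetical and is not the code in fetch_scylla.py; it only demonstrates the trick of giving rc versions a marker starting with "/", which sorts before "0".

```python
def release_sort_key(version):
    """Sort key under which an rc sorts before its final release.

    Hypothetical helper: each dot-separated component becomes
    (number, marker), where the marker is "/rcN" for a pre-release and
    "0" for a final release. "/" (ASCII 47) sorts before "0" (ASCII 48).
    """
    key = []
    for component in version.split("."):
        number, _, prerelease = component.partition("~")
        marker = "/" + prerelease if prerelease else "0"
        key.append((int(number), marker))
    return key


releases = ["2025.2.0", "2025.2.0~rc1", "2025.2.0~rc0"]
latest = sorted(releases, key=release_sort_key)[-1]
```

Note that this sketch compares the rc suffix as a string, so it would misorder rc10 against rc2; that case is ignored here for brevity.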

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-28 22:02:15 +03:00
Nadav Har'El
72358ee9f4 test/cqlpy: fix "run --release" when Scylla hasn't been built
The "--release" option of test/cqlpy/run can be used to run current
cqlpy tests against any official release of Scylla, which is
automatically downloaded from Scylla's S3 bucket. You should be
able to run tests like that even without having compiled Scylla
from source. But we had a bug, where test/cqlpy/run looked for
the built Scylla executable *before* parsing the "--release"
option, and this bug is fixed in this patch.

The Alternator version of the run script, test/alternator/run,
doesn't need to be fixed because it already did things in the
right order.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-28 21:42:02 +03:00
Taras Veretilnyk
3bc9ee10d1 nodetool: add command for dropping quarantine sstables
- Add dropquarantinedsstables command to remove quarantined SSTables
- Support both flag-based (--keyspace, --table) and positional arguments
- Allow targeting all keyspaces, specific keyspace, or keyspace with specified tables

Fixes scylladb/scylladb#19061
2025-07-28 16:55:17 +02:00
Taras Veretilnyk
fa98239ed8 rest_api: add endpoint which drops all quarantined sstables
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
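
For illustration, a request to this endpoint could be built as below. The helper name, the base URL and the example keyspace and table names are assumptions; only the path and the `keyspace` and `tables` parameters come from the description above. The sketch builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def drop_quarantined_request(base_url, keyspace=None, tables=None):
    """Build (but do not send) a POST request for the endpoint.

    Hypothetical helper; `tables` is a list that is joined with commas.
    """
    params = {}
    if keyspace:
        params["keyspace"] = keyspace
    if tables:
        params["tables"] = ",".join(tables)
    url = base_url + "/storage_service/drop_quarantined_sstables"
    if params:
        url += "?" + urlencode(params)
    return Request(url, method="POST")


# The base URL is an assumption about where the REST API listens.
req = drop_quarantined_request("http://localhost:10000", "ks", ["t1", "t2"])
```

Passing the resulting object to `urllib.request.urlopen` would issue the call against a running node.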

Fixes scylladb/scylladb#19061
2025-07-28 16:55:17 +02:00
Dawid Mędrek
b41151ff1a test: Enable RF-rack-valid keyspaces in all Python suites
We're enabling the configuration option `rf_rack_valid_keyspaces`
in all Python test suites. All relevant tests have been adjusted
to work with it enabled.

That encompasses the following suites:

* alternator,
* broadcast_tables,
* cluster (already enabled in scylladb/scylladb@ee96f8dcfc),
* cql,
* cqlpy (already enabled in scylladb/scylladb@be0877ce69),
* nodetool,
* rest_api.

Two remaining suites that use tests written in Python, redis and scylla_gdb,
are not affected, at least not directly.

The redis suite requires creating an instance of Scylla manually, and the tests
don't do anything that could violate the restriction.

The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but
even then it reuses the `run` file from the cqlpy suite.

Fixes scylladb/scylladb#25126

Closes scylladb/scylladb#24617
2025-07-28 16:32:59 +02:00
Gleb Natapov
198cfc6fe7 migration manager: do not use group0 on non zero shard
Commit ddc3b6dcf5 added a check of group0 state in
get_schema_for_write(), but group0 client can only be used on shard 0,
and get_schema_for_write() can be called on any shard, so we cannot use
_group0_client there directly. Move the assert to a place where we already
use another group0 function and which is guaranteed to run on shard 0.

Closes scylladb/scylladb#25204
2025-07-28 14:10:01 +02:00
Nadav Har'El
b4fc3578fc Merge 'LWT: enable for tablet-based tables' from Petr Gusev
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.

Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
  * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
  * This breaks LWT correctness in certain scenarios.
  * Failing test cases demonstrating this:
      * test_lwt_state_is_preserved_on_tablet_migration
      * test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
  * Tablets may be placed on arbitrary shards by the tablet balancer.
  * Accessing Paxos state in system tables could require a shard jump, degrading performance.

We move Paxos state into a dedicated Paxos table, colocated with the base table:
  * Each base table gets its own Paxos state table.
  * This table is lazily created on the first LWT operation.
  * Its tablets are colocated with those of the base table, ensuring:
    * Co-migration during tablet movement
    * Co-rebuilding with the base table
    * Shard alignment for local access to Paxos state

Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].

This PR addresses two issues from the "Why doesn't it work for tablets" section  in [1]:
  * Tablet migration vs LWT correctness
  * Paxos table sharding

Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.

References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)

Backport: not needed because this is a new feature.

Closes scylladb/scylladb#24819

* github.com:scylladb/scylladb:
  create_keyspace: fix warning for tablets
  docs: fix lwt.rst
  docs: fix tablets.rst
  alternator: enable LWT
  random_failures: enable execute_lwt_transaction
  test_tablets_lwt: add test_paxos_state_table_permissions
  test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
  test_tablets_lwt: test timeout creating paxos state table
  test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
  test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
  test_tablets_lwt: migrate test_lwt_support_with_tablets
  test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
  test_tablets_lwt: add simple test for LWT
  check_internal_table_permissions: handle Paxos state tables
  client_state: extract check_internal_table_permissions
  paxos_store: handle base table removal
  database: get_base_table_for_tablet_colocation: handle paxos state table
  paxos_state: use node_local_only mode to access paxos state
  query_options: add node_local_only mode
  storage_proxy: handle node_local_only in query
  storage_proxy: handle node_local_only in mutate
  storage_proxy: introduce node_local_only flag
  abstract_replication_strategy: remove unused using
  storage_proxy: add coordinator_mutate_options
  storage_proxy: rename create_write_response_handler -> make_write_response_handler
  storage_proxy: simplify mutate_prepare
  paxos_state: lazily create paxos state table
  migration_manager: add timeout to start_group0_operation and announce
  paxos_store: use non-internal queries
  qp: make make_internal_options public
  paxos_store: conditional cf_id filter
  paxos_store: coroutinize
  feature_service: add LWT_WITH_TABLETS feature
  paxos_state: inline system_keyspace functions into paxos_store
  paxos_state: extract state access functions into paxos_store
2025-07-28 13:19:23 +03:00
Taras Veretilnyk
6b6622e07a docs: fix typo in command name enbleautocompaction -> enableautocompaction
Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'.

Fixes scylladb/scylladb#25172

Closes scylladb/scylladb#25175
2025-07-28 12:49:26 +03:00
Tomasz Grabiec
55116ee660 topology_coordinator: Trigger load stats refresh after replace
Otherwise, tablet rebuild will be delayed for up to 60s, as the tablet
scheduler needs load stats for the new (replacing) node to make
decisions.

Fixes #25163

Closes scylladb/scylladb#25181
2025-07-28 11:07:17 +02:00
Sergey Zolotukhin
ea311be12b generic_server: Two-step connection shutdown.
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed or a timeout occurs, the connections are fully closed.
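
The two steps can be modelled with a small asyncio sketch. The `Connection` class and `two_step_shutdown` function are hypothetical and only mimic the order of operations; the real implementation is C++ on top of Seastar.

```python
import asyncio


class Connection:
    """Toy connection model: two half-open flags plus in-flight request tasks."""

    def __init__(self):
        self.rx_open = True
        self.tx_open = True
        self.inflight = set()

    def shutdown_input(self):
        # Step 1: stop accepting new requests; responses can still be sent.
        self.rx_open = False

    def close(self):
        # Step 2: close both directions.
        self.rx_open = False
        self.tx_open = False


async def two_step_shutdown(connections, timeout):
    for conn in connections:
        conn.shutdown_input()
    pending = [task for conn in connections for task in conn.inflight]
    if pending:
        # Wait for ongoing requests, but no longer than `timeout` seconds.
        # asyncio.wait() does not cancel tasks that are still running.
        await asyncio.wait(pending, timeout=timeout)
    for conn in connections:
        conn.close()
```

Because the receive side is closed first, no new requests arrive while the in-flight ones drain, which is what bounds the wait.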

Fixes scylladb/scylladb#24481
2025-07-28 10:08:06 +02:00
Sergey Zolotukhin
7334bf36a4 transport: cosmetic change, remove extra blanks. 2025-07-28 10:08:06 +02:00
Sergey Zolotukhin
061089389c transport: Handle sleep aborted exception in sleep_until_timeout_passes
In PR #23156, a new function `sleep_until_timeout_passes` was introduced
to wait until a read request times out or completes. However, the function
did not handle cases where the sleep is aborted via _abort_source, which
could result in WARN messages like "Exceptional future is ignored" during
shutdown.

This change adds proper handling for that exception, eliminating the warning.
2025-07-28 10:08:05 +02:00
Sergey Zolotukhin
27b3d5b415 generic_server: replace empty destructor with = default
This change improves code readability by explicitly marking the destructor as defaulted.
2025-07-28 10:08:05 +02:00
Sergey Zolotukhin
3610cf0bfd generic_server: refactor connection::shutdown to use shutdown_input and shutdown_output
This change improves logging and modifies the behavior to attempt closing
the output side of a connection even if an error occurs while closing the input side.
2025-07-28 10:08:05 +02:00
Sergey Zolotukhin
3848d10a8d generic_server: add shutdown_input and shutdown_output functions to
`connection` class.

The functions are just wrappers for  _fd.shutdown_input() and  _fd.shutdown_output(), with added error reporting.
Needed by later changes.
2025-07-28 10:08:05 +02:00
Sergey Zolotukhin
122e940872 test: Add test for query execution during CQL server shutdown
This test simulates a scenario where a query is being executed while
the query coordinator begins shutting down the CQL server and client
connections. The shutdown process should wait until the query execution
is either completed or timed out.

Test for scylladb/scylladb#24481
2025-07-28 10:08:05 +02:00
Robert Bindar
d921a565de Add open-coredump script dependencies to install-dependencies.sh
Whilst the coredump script checks for prerequisites, the user
experience is not ideal because you either have to go in the
script and get the list of deps and install them or wait for
the script to complain about lacking dependencies one by one.
This commit completes the list of dependencies in the
install script (some of them were already there for Fedora),
so you already have them installed by the time you
get to run the coredump script.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

[avi:
 - remove trailing whitespace
 - regenerate frozen toolchain

Optimized clang binaries generated and stored in

  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]

Closes #22369

Closes scylladb/scylladb#25203
2025-07-28 06:45:01 +03:00
Avi Kivity
1930f3e67f Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.

However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.

For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)

Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.
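
A toy model of this behaviour, assuming string keys and an index that stores one distinguishing prefix per partition, with no stored prefix being a prefix of another. The function names are hypothetical, and the real index is a trie over encoded decorated keys rather than a list.

```python
def index_lower_bound(prefixes, query):
    """Conservative position of the first partition that may be >= query.

    An entry whose prefix is also a prefix of `query` cannot be ruled out,
    so the index has to include it.
    """
    for pos, prefix in enumerate(prefixes):
        if prefix >= query[:len(prefix)]:
            return pos
    return len(prefixes)


def read_from(keys, prefixes, query):
    """Data reader: start at the index position and drop superfluous keys."""
    start = index_lower_bound(prefixes, query)
    return [key for key in keys[start:] if key >= query]


prefixes = ["a", "b", "c"]
# The same index describes both sstables below, so for query "bb" it
# must start at the "b" entry in both cases.
with_ba = read_from(["a1", "ba", "c3"], prefixes, "bb")
with_bc = read_from(["a1", "bc", "c3"], prefixes, "bb")
```

In the first sstable the reader has to skip `ba`, which the index could not exclude; in the second, `bc` really is the first partition in range.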

Note: the patch is more complicated than it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays?) Simplification ideas are welcome.

Preparation for new functionality, no backporting needed.

Closes scylladb/scylladb#25093

* github.com:scylladb/scylladb:
  sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
  test/boost: add a test for inexact index lookups
  sstables/mx/reader: allow passing a custom index reader to the constructor
  sstables/index_reader: remove advance_to
  sstables/mx/reader: handle inexact lookups in `advance_context()`
  sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
  sstables/index_reader: make the return value of `get_partition_key` optional
  sstables/mx/reader: handle "backward jumps" in forward_to
  sstables/mx/reader: filter out partitions outside the queried range
  sstables/mx/reader: update _pr after `fast_forward_to`
2025-07-27 19:39:36 +03:00
Avi Kivity
8180cbcf48 Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy
As copies are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.

Add clone() and clone_gently() methods to
allow explicit copies.

* minor optimization, no backport needed

Closes scylladb/scylladb#24978

* github.com:scylladb/scylladb:
  tablets: prevent accidental copy of tablets_map
  locator: tablets: get rid of synchronous mutate_tablet_map
2025-07-27 16:48:27 +03:00
Lakshmi Narayanan Sreethar
0c5fa8e154 locator/token_metadata.cc: use chunked_vector to store _sorted_tokens
The `token_metadata_impl` stores the sorted tokens in an `std::vector`.
With a large number of nodes, the size of this vector can grow quickly,
and updating it might lead to oversized allocations.

This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such
issues. It also updates all related code to use `chunked_vector` instead
of `std::vector`.

Fixes #24876

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#25027
2025-07-27 11:29:22 +03:00
Tomasz Grabiec
a1d7722c6d Merge 'api: repair_async: refuse repairing tablet keyspaces' from Aleksandra Martyniuk
A tablet repair started with /storage_service/repair_async/ API
bypasses tablet repair scheduler and repairs only the tablets
that are owned by the requested node. Due to that, to safely repair
the whole keyspace, we need to first disable tablet migrations
and then start repair on all nodes.

With the new API - /storage_service/tablets/repair -
tailored to tablet repair requirements, we do not need additional
preparation before repair. We may request it on one node in
a cluster only and, thanks to tablet repair scheduler,
a whole keyspace will be safely repaired.

Both nodetool and Scylla Manager have already started using
the new API to repair tablets.

Refuse repairing tablet keyspaces with /storage_service/repair_async -
403 Forbidden is returned. repair_async should still be used to repair
vnode keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/23008.

Breaking change; no backport.

Closes scylladb/scylladb#24678

* github.com:scylladb/scylladb:
  repair: remove unused code
  api: repair_async: forbid repairing tablet keyspaces
2025-07-27 09:25:42 +02:00
Piotr Dulikowski
44de563d38 Merge 'db/hints: Improve logging' from Dawid Mędrek
We improve logging in critical functions in hinted handoff
to capture more information about the behavior of the module.
That should help us in debugging sessions.

The logs should only be printed during more important events
and so they should not clog the log files.

Backport: not necessary.

Closes scylladb/scylladb#25031

* github.com:scylladb/scylladb:
  db/hints/manager.cc: Add logs for changing host filter
  db/hints: Increase log level in critical functions
2025-07-27 09:25:42 +02:00
Michael Litvak
3ff388cd94 storage service: drain view builder before group0
The view builder uses group0 operations to coordinate view building, so
we should drain the view builder before stopping group0.

Fixes scylladb/scylladb#25096

Closes scylladb/scylladb#25101
2025-07-27 09:25:42 +02:00
Pavel Emelyanov
403a72918d sstables/types.hh: Remove duplicate version.hh inclusion
The latter header is included two times; one is enough.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25109
2025-07-27 09:25:42 +02:00
Pavel Emelyanov
1b9eb4cb9f init.hh: Remove unused forward declarations
The init.hh contains some bits that only main.cc needs. Some of its
forward declarations are needed by neither the header itself nor the
main.cc that includes it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25110
2025-07-27 09:25:42 +02:00
Petr Gusev
8b8b7adbe5 raft_group0: split shutdown into abort_and_drain and destroy
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).

However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.

This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.

The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.

Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().

Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().

The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
2025-07-25 17:16:14 +02:00
Michał Chojnowski
b1da5f2d0f sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
After making the sstable reader more permissive,
we can weaken the abstract_index_reader interface.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
be1f54c6d2 test/boost: add a test for inexact index lookups 2025-07-25 11:00:18 +02:00
Michał Chojnowski
810eb93ff0 sstables/mx/reader: allow passing a custom index reader to the constructor
For tests.
Will be used for testing how the data reader reacts to various
combinations of inexact index lookup results.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
fe8ee34024 sstables/index_reader: remove advance_to
`advance_to` is unused now, so remove it.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
03bf6347e2 sstables/mx/reader: handle inexact lookups in advance_context()
`advance_context()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`.

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_context()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.
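
To see why a lookup for a present key is exact, consider a toy prefix index. This is a sketch under the assumption that keys are strings and that no stored prefix is a prefix of another; the function names are hypothetical and do not correspond to the API introduced by the patch.

```python
def position_of_present_key(prefixes, present_key):
    """Exact index position of a key known to be in the sstable.

    Since the key is present, exactly one stored prefix is a prefix of it.
    """
    matches = [pos for pos, prefix in enumerate(prefixes)
               if present_key.startswith(prefix)]
    assert len(matches) == 1, "key must be present in the sstable"
    return matches[0]


def advance_past_present_key(prefixes, present_key):
    """Position of the partition immediately following a present key."""
    return position_of_present_key(prefixes, present_key) + 1


prefixes = ["a", "b", "c"]
# If "bb" is known to be present, it must be the partition stored under "b",
# so the next partition is the one stored under "c".
next_pos = advance_past_present_key(prefixes, "bb")
```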

This makes `advance_to` unused, so we remove it.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
11792850dd sstables/mx/reader: handle inexact lookups in advance_to_next_partition()
`advance_to_next_partition()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`.

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_to_next_partition()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
141895f9eb sstables/index_reader: make the return value of get_partition_key optional
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
a0c29055e5 sstables/mx/reader: handle "backward jumps" in forward_to
A bunch of code assumes that the Data.db stream can only go forward.
But with BTI indexes, if we perform an advance_to, the index can point to a position
which the data reader has already passed, since the index is inexact.

The logic of the data reader ensures that it has stopped
within the last partition range, or just immediately
after it, after reading the next partition key and
noticing that it doesn't belong to the range.

But forward_to can only be used with increasing ranges.
The start of the next range must be greater or equal to the
end of the previous range.

This means that the exact start of the next partition range
must be no earlier than:
1. The position right before the partition key just read by the data reader,
if the data reader is positioned immediately after a partition key.
2. The start of the first partition after the current data reader
position, if the data reader isn't positioned immediately after a
partition key.

So, if the index returns a position smaller than the current data
reader position, then:
1. If the reader is immediately after a partition key,
we have to reuse this partition key (since we can't go back
in the stream to read it again), and keep reading from
the current position.
2. Otherwise we can safely walk the index to the first partition
that lies no earlier than the current position.
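
The two cases can be summarized in a small decision function. This is a sketch with hypothetical names and a simplified model in which positions are integers and `partition_starts` is the sorted list of partition start offsets known to the index; it is not the reader's actual code.

```python
from bisect import bisect_left


def resolve_forward_to(index_pos, reader_pos, just_read_key, partition_starts):
    """Decide what to do when the index answers a forward_to() request.

    Returns an (action, position) pair.
    """
    if index_pos >= reader_pos:
        # Normal case: the index points ahead of (or at) the reader.
        return ("seek", index_pos)
    if just_read_key:
        # Case 1: the reader has consumed a partition key it cannot re-read,
        # so reuse that key and continue from the current position.
        return ("reuse_key", reader_pos)
    # Case 2: walk the index to the first partition that starts no earlier
    # than the current position.
    i = bisect_left(partition_starts, reader_pos)
    if i == len(partition_starts):
        return ("end_of_data", None)
    return ("seek", partition_starts[i])
```

For example, with partitions starting at 0, 100, 200 and 300, a reader at offset 130 that receives index position 100 either reuses its key (case 1) or seeks to 200 (case 2).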
2025-07-25 10:49:58 +02:00
Michał Chojnowski
218b2dffff sstables/mx/reader: filter out partitions outside the queried range
The current index format is exact: it always returns the position of the
first partition in the queried partition range.

But we are about to add an index format where that doesn't have to be the case.
In BTI indexes, the lookup can be off by one partition sometimes. This patch prepares
the reader for that, by skipping the partitions which were read by the
data reader but don't belong to the queried range.

Note: as of this patch, only the "normal path" is ever used.
We add tests exercising these code paths later.

Also note that, as of this patch, actually stepping outside
the queried range would cause the reader to end up in a
state where the underlying parser is positioned right after
partition key immediately following the queried range.
If the reader was forwarded to that key in this state,
it would trip an assert, because the parser can't handle backward
jumps. We will add logic to handle this case in the next patch.
2025-07-25 10:49:57 +02:00
Michał Chojnowski
2b81fdf09b sstables/mx/reader: update _pr after fast_forward_to
In later patches, we will prepare the reader for inexact index
implementations (ones which can return a Data file range that
includes some partitions before or after the queried range).

For that, we will need to filter out the partitions outside of the
range, and for that we need to remember the range. This is the
goal of this patch.

Note that we are storing a reference to an argument of
`fast_forward_to`. This is okay, because the contract
of `mutation_reader` specifies that the caller must
keep `pr` alive until the next `fast_forward_to`
or until the reader is destroyed.
2025-07-25 10:49:57 +02:00
Aleksandra Martyniuk
a7ee2bbbd8 tasks: do not use binary progress for task manager tasks
Currently, progress of a parent task depends on expected_total_workload,
expected_children_number, and children progresses. Basically, if total
workload is known or all children have already been created, progresses
of children are summed up. Otherwise binary progress is returned.

As a result, two tasks of the same type may return progress in different
units. If they are children of the same task and this parent gathers the
progress - it becomes meaningless.

Drop expected_children_number as we can't assume that children are able
to show their progresses.

Modify get_progress method - progress is calculated based on children
progresses. If expected_total_workload isn't specified, the total
progress of a task may grow. If expected_total_workload isn't specified
and no children are created, empty progress (0/0) is returned.
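
A minimal sketch of the calculation described above. The `progress` struct and `aggregate_progress` are hypothetical names, and using `expected_total_workload` as the total whenever it is specified is an assumption made for illustration.

```cpp
#include <optional>
#include <vector>

// Hypothetical progress pair: completed units out of total units.
struct progress {
    double completed = 0;
    double total = 0;
};

// Sum the children's progress. If the expected total workload is known, it is
// used as the total (assumption); otherwise the total is whatever the children
// report so far, so it may grow as more children are created.
progress aggregate_progress(const std::vector<progress>& children,
                            std::optional<double> expected_total_workload) {
    progress result;
    for (const progress& child : children) {
        result.completed += child.completed;
        result.total += child.total;
    }
    if (expected_total_workload) {
        result.total = *expected_total_workload;
    }
    return result;  // no children and no expected total: empty progress (0/0)
}
```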

Fixes: https://github.com/scylladb/scylladb/issues/24650.

Closes scylladb/scylladb#25113
2025-07-25 10:45:32 +03:00
Ran Regev
7c68ee06bf cleanup: remove partition_slice_builder from include
Refs: #22099 (issue)
Refs: #25079 (pr)

remove include for partition_slice_builder
that is not used. makes it clear that
group0_state_machine.cc does not depend on
partition_slice_builder

Closes scylladb/scylladb#25125
2025-07-25 10:45:32 +03:00
Ran Regev
db4f301f0c scylla.yaml: add recommended value for stream_io_throughput_mb_per_sec
Fixes: #24758

Updated scylla.yaml and the help for
scylla --help

Closes scylladb/scylladb#24793
2025-07-25 10:45:32 +03:00
Ferenc Szili
7ce96345bf test: remove test_tombstone_gc_disabled_on_pending_replica
The test test_tombstone_gc_disabled_on_pending_replica was added when
we fixed (#20788) the potential problem with data resurrection during
file based streaming. The issue was occurring only in Enterprise, but
we added the fix in OSS to limit code divergence. This test was added
together with the fix in OSS with the idea to guard this change in OSS.
The real reproducer and test for this fix was added later, after the
fix was ported into Enterprise.
It is in: test/cluster/test_resurrection.py

Since Enterprise has been merged into OSS, there is no more need to
keep the test test_tombstone_gc_disabled_on_pending_replica. Also,
it is flaky with very low probability of failure, making it difficult
to investigate the cause of failure.

Fixes: #22182

Closes scylladb/scylladb#25134
2025-07-25 10:45:32 +03:00
Botond Dénes
837424f7bb Merge 'Add Azure Key Provider for Encryption at Rest' from Nikos Dragazis
This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping").

In more detail, this patch series consists of:
* Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI).
* The Azure host - the Key Vault endpoint bridge.
* The Azure Key Provider - the interface for the Azure host.
* Unit tests using real Azure resources (credentials and Vault keys).
* Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens).

This is part of the overall effort to support Azure deployments.

Testing done:
* Unit tests.
* Manual test on an Azure VM with a Managed Identity.
* Manual test with credentials from Azure CLI.
* Manual test of `--azure-hosts` cmdline option.
* Manual test of log filtering.

Remaining items:
- [x] Create necessary Azure resources for CI.
- [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201).

Closes https://github.com/scylladb/scylla-enterprise/issues/1077.

New feature. No backport is needed.

Closes scylladb/scylladb#23920

* github.com:scylladb/scylladb:
  docs: Document the Azure Key Provider
  test: Add tests for Azure Key Provider
  pylib: Add mock server for Azure Key Vault
  encryption: Define and enable Azure Key Provider
  encryption: azure: Delegate hosts to shard 0
  encryption: Add Azure host cache
  encryption: Add config options for Azure hosts
  encryption: azure: Add override options
  encryption: azure: Add retries for transient errors
  encryption: azure: Implement init()
  encryption: azure: Implement get_key_by_id()
  encryption: azure: Add id-based key cache
  encryption: azure: Implement get_or_create_key()
  encryption: azure: Add credentials in Azure host
  encryption: azure: Add attribute-based key cache
  encryption: azure: Add skeleton for Azure host
  encryption: Templatize get_{kmip,kms,gcp}_host()
  encryption: gcp: Fix typo in docstring
  utils: azure: Get access token with default credentials
  utils: azure: Get access token from Azure CLI
  utils: azure: Get access token from IMDS
  utils: azure: Get access token with SP certificate
  utils: azure: Get access token with SP secret
  utils: rest: Add interface for request/response redaction logic
  utils: azure: Declare all Azure credential types
  utils: azure: Define interface for Azure credentials
  utils: Introduce base64url_{encode,decode}
2025-07-25 10:45:32 +03:00
Ernest Zaslavsky
d2c5765a6b treewide: Move keys related files to a new keys directory
As requested in #22102, #22103 and #22105, move the files and fix the affected includes and the build system.

Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh

Fixes: #22102
Fixes: #22103
Fixes: #22105

Closes scylladb/scylladb#25082
2025-07-25 10:45:32 +03:00
Calle Wilund
a86e8d73f2 encryption_at_rest_test: ensure proxy connection flushing
Refs #24551

Drops background flush for proxy output stream (because test), and
also ensures we do explicit flush + close on exception in write loop.

Ensures we don't hide actual exceptions with asserts.

Closes scylladb/scylladb#25146
2025-07-25 10:45:32 +03:00
Petr Gusev
aae5260147 create_keyspace: fix warning for tablets
Remove LWT from the list of unsupported features.
2025-07-24 20:04:43 +02:00
Petr Gusev
1f5d9ace93 docs: fix lwt.rst
Add a new section about Paxos state tables. Update all
references to system.paxos in the text to refer to this
section.
2025-07-24 20:04:43 +02:00
Petr Gusev
69017fb52a docs: fix tablets.rst
LWT and Alternator are now supported with tablets.
2025-07-24 20:04:43 +02:00
Petr Gusev
abab025d4f alternator: enable LWT 2025-07-24 20:04:43 +02:00
Petr Gusev
e4fba1adfe random_failures: enable execute_lwt_transaction
Fixes scylladb/scylladb#24502
2025-07-24 19:48:09 +02:00
Petr Gusev
84b74d6895 test_tablets_lwt: add test_paxos_state_table_permissions 2025-07-24 19:48:09 +02:00
Petr Gusev
c7cfba726d test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
This test checks that LWT for tablets requires raft-based
schema management.
2025-07-24 19:48:09 +02:00
Petr Gusev
529d2b949e test_tablets_lwt: test timeout creating paxos state table 2025-07-24 19:48:09 +02:00
Petr Gusev
a9ef221ae8 test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
The test checks that we correctly handle the case when the base table
is recreated during LWT execution.
2025-07-24 19:48:08 +02:00
Petr Gusev
e8e2419df6 test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
This test checks that the paxos state is preserved in case
of tablet rebuild. This happens e.g. when a node is lost
permanently and another node is started to replace it.
2025-07-24 19:48:08 +02:00
Petr Gusev
ff2c22ba6a test_tablets_lwt: migrate test_lwt_support_with_tablets
LWT is now supported for tablets, but this requires LWT_WITH_TABLETS
feature. We migrate the test so that it checks the error messages in
case the feature is not supported.
2025-07-24 19:48:08 +02:00
Petr Gusev
e0c4dc350c test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
This test verifies that Paxos state is correctly migrated when
the base table's tablet is migrated. This test fails if Paxos
state is stored in system.paxos, as the final Paxos read would
reflect conflicting outcomes from both prior LWT operations.
2025-07-24 19:48:08 +02:00
Petr Gusev
c11e1aef5c test_tablets_lwt: add simple test for LWT
We add/remove the base table several times to check that paxos state
table is properly recreated.
2025-07-24 19:48:08 +02:00
Petr Gusev
78aa36b257 check_internal_table_permissions: handle Paxos state tables
CDC and $paxos tables are managed internally by Scylla. Users are
already prohibited from running ALTER and DROP commands on CDC tables.
In this commit, we extend the same restrictions to $paxos tables to
prevent users from shooting themselves in the foot.

Other commands are generally allowed for CDC and $paxos tables. An
important distinction is that CDC tables are meant to be accessed
directly by users, so appropriate permissions must be set for
non-superusers. In contrast, $paxos tables are not intended for direct
access by users. Therefore, this commit explicitly disallows
non-superusers from accessing them. Superusers are still allowed
access for debugging and troubleshooting purposes.

Note that these restrictions apply even if explicit permissions have
been granted. For example, a non-superuser may be granted SELECT
permissions on a $paxos table, but the restriction above will
still take precedence. We don't try to restrict users
from giving permissions to $paxos tables for simplicity.
2025-07-24 19:48:08 +02:00
Petr Gusev
ec3c5f4cbc client_state: extract check_internal_table_permissions
This is a refactoring commit — it extracts the CDC permissions handling
logic into a separate function: check_internal_table_permissions.

This is a preparatory step for the next commit, where we'll handle
paxos state tables similarly to CDC tables.
2025-07-24 19:48:08 +02:00
Petr Gusev
bb4e7a669f paxos_store: handle base table removal
Subscribe to on_before_drop_column_family to drop the associated
Paxos state table when the corresponding user table is dropped.
2025-07-24 19:48:08 +02:00
Petr Gusev
1b70623908 database: get_base_table_for_tablet_colocation: handle paxos state table
We need to mark paxos state table as colocated with the user table, so
that the corresponding tablets are migrated/repaired together.
2025-07-24 19:48:08 +02:00
Petr Gusev
03aa2e4823 paxos_state: use node_local_only mode to access paxos state 2025-07-24 19:48:08 +02:00
Petr Gusev
ff1caa9798 query_options: add node_local_only mode
We want to access the paxos state table only on the local node and
shard (or shards in case of intranode_migration). In this commit we
add a node_local_only flag to query_options, which allows us to do that.
This flag can be set for a query via make_internal_options.

We handle this flag on the statements layer by forwarding it to
either coordinator_query_options or coordinator_mutate_options.
2025-07-24 19:48:08 +02:00
Petr Gusev
65c7e36b7c storage_proxy: handle node_local_only in query
In this commit we support node_local_only flag in read code path in
storage_proxy.
2025-07-24 19:48:08 +02:00
Petr Gusev
2d747d97b8 storage_proxy: handle node_local_only in mutate
We add the remove_non_local_host_ids() helper, which
will be used in the next commit to support the read
path. HostIdVector concept is introduced to be able
to handle both host_id_vector_replica_set and
host_id_vector_topology_change uniformly.

The storage_proxy_coordinator_mutate_options class
is declared outside of storage_proxy to avoid C++
compiler complaints about default field initializers.
In particular, some storage_proxy methods use this
class for optional parameters with default values,
which is not allowed when the class is defined inside
storage_proxy.
2025-07-24 19:48:08 +02:00
Petr Gusev
7eb198f2cc storage_proxy: introduce node_local_only flag
Add a per-request flag that restricts query execution
to the local node by filtering out all non-local replicas.
Standard consistency level (CL) rules still apply:
if the local node alone cannot satisfy the
requested CL, an exception is thrown.

This flag is required for Paxos state access, where
reads and writes must target only the local node.

As a side effect, this also enables the implementation
of scylladb/scylladb#16478, which proposes a CQL
extension to expose 'local mode' query execution to users.

Support for this flag in storage_proxy's read and write
code paths will be added in follow-up commits.
2025-07-24 19:48:08 +02:00
Petr Gusev
8e745137de abstract_replication_strategy: remove unused using 2025-07-24 19:48:08 +02:00
Petr Gusev
4c1aca3927 storage_proxy: add coordinator_mutate_options
In upcoming commits, we want to add a node_local_only flag to both read
and write paths in storage_proxy. This requires passing the flag from
query_processor to the part of storage_proxy where replica selection
decisions are made.

For reads, it's sufficient to add the flag to the existing
coordinator_query_options class. For writes, there is no such options
container, so we introduce coordinator_mutate_options in this commit.

In the future, we may move some of the many mutate() method arguments
into this container to simplify the code.
2025-07-24 19:48:08 +02:00
Petr Gusev
b6ccaffd45 storage_proxy: rename create_write_response_handler -> make_write_response_handler
Most of the create_write_response_handler overloads follow the same
signature pattern to satisfy the sp::mutate_prepare call. The one which
doesn't follow it is invoked by others and is responsible for creating
a concrete handler instance. In this refactoring commit we rename
it to make_write_response_handler to reduce confusion.
2025-07-24 19:48:08 +02:00
Petr Gusev
db946edd1d storage_proxy: simplify mutate_prepare
This is a refactoring commit. We remove extra lambda parameters from
mutate_prepare since the CreateWriteHandler lambda can simply
capture them.

We can't std::move(permit) in another mutate_prepare overload,
because each handler wants its own copy of this permit.
2025-07-24 19:48:08 +02:00
Petr Gusev
ac4bc3f816 paxos_state: lazily create paxos state table
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
2025-07-24 19:48:08 +02:00
Dawid Mędrek
b559c1f0b6 db/hints/manager.cc: Add logs for changing host filter
We add new logs when the host filter is undergoing a change. It should not
happen very often and so it shouldn't clog the log files. At the same
time, it provides us with useful information when debugging.
2025-07-24 17:45:34 +02:00
Dawid Mędrek
cb0cd44891 db/hints: Increase log level in critical functions
We increase the log level in more important functions to capture
more information about the behavior of hints. All of the promoted
logs are printed rarely, so they should not clog the log files, but
at the same time they provide more insight into what has already
happened and what has not.
2025-07-24 17:41:54 +02:00
Petr Gusev
3e0347c614 migration_manager: add timeout to start_group0_operation and announce
Pass a timeout parameter through to start_operation()
and add_entry(), respectively.

This is a preparatory change for the next commit, which
will use the timeout to properly handle timeouts during
lazy creation of Paxos state tables.
2025-07-24 16:39:50 +02:00
Petr Gusev
519f40a95e paxos_store: use non-internal queries
Switch paxos_store from using internal queries to regular prepared
queries, so that prepared statements are correctly updated when
the base table is recreated.

The do_execute_cql_with_timeout function is extracted to reduce
code bloat when execute_cql_with_timeout template function
is instantiated.

We change return type of execute_cql_with_timeout to untyped_result_set
since shared_ptr is not really needed here.
2025-07-24 16:39:50 +02:00
Petr Gusev
6caa1ae649 qp: make make_internal_options public
In upcoming commits, we will switch paxos_store from using internal
queries to regular prepared queries, so that prepared statements are
correctly updated when the base table is recreated. To support this,
we want to reuse the logic for converting parameters from
vector<data_value_or_unset> to raw_value_vector_with_unset.
This commit makes make_internal_options public to enable that reuse.
2025-07-24 16:39:50 +02:00
Petr Gusev
13f7266052 paxos_store: conditional cf_id filter
We want to reuse the same queries to access system.paxos and the
co-located table. A separate co-located table will be created for each
user table, so we won't need the cf_id filter for them. In this commit
we make the cf_id filter optional and apply it only if the state table
is actually system.paxos.
2025-07-24 16:39:50 +02:00
Petr Gusev
370f91adb7 paxos_store: coroutinize
This is another preparatory step. We want to add more logic to
paxos_store state access functions in the next commits, it's easier
to do with coroutines.

Pass ballot by value to delete_paxos_decision because
paxos_state::prune is not a coroutine and the ballot parameter
is destroyed when we return from it. The alternative
solution -- pass by const reference to paxos_state::prune -- doesn't
work because paxos_state::prune is called
from a lambda in paxos_response_handler::prune, this lambda is
not a coroutine and the 'ballot' field could be destroyed along
with the body of this lambda as soon as we return from
paxos_state::prune.
2025-07-24 16:39:50 +02:00
Petr Gusev
ab03badc15 feature_service: add LWT_WITH_TABLETS feature
We will need this feature to determine if it's safe to enable
LWTs for a tablet-based table.
2025-07-24 16:39:50 +02:00
Petr Gusev
8292ecf2e1 paxos_state: inline system_keyspace functions into paxos_store
Prepares for reusing the same functions to access either
system.paxos or a co-located table.
2025-07-24 16:39:50 +02:00
Petr Gusev
6e87a6cdb0 paxos_state: extract state access functions into paxos_store
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
2025-07-24 16:39:50 +02:00
Gleb Natapov
d5e023bbad topology coordinator: drop no longer needed token metadata barrier
Currently we do a token metadata barrier before accepting a replacing
node. It was needed for the "replace with the same IP" case to make sure
old requests will not contact the new node by mistake. But now, since we
address nodes by ID, this is no longer possible: old requests will
use the old ID and will be rejected.

Closes scylladb/scylladb#25047
2025-07-24 11:15:42 +02:00
Aleksandra Martyniuk
1767eb9529 repair: remove unused code 2025-07-24 11:11:12 +02:00
Aleksandra Martyniuk
a0031ad05e api: repair_async: forbid repairing tablet keyspaces
Return 403 Forbidden if a user tries to repair a tablet keyspace with the
/storage_service/repair_async/ API.
2025-07-24 11:11:09 +02:00
Tomasz Grabiec
c9bf010d6d Merge 'test.py: skip cleaning testlog' from Andrei Chekun
Skip removing any artifacts between test.py invocations when -s is provided.
Logs from the previous run will be overwritten if tests are executed one
more time. For example:
1. Execute tests A, B, C with parameter -s
2. All logs are present even if tests are passed
3. Execute test B with parameter -s
4. Logs for A and C are from the first run
5. Logs for B are from the most recent run

Backport is not needed, since it is a framework enhancement.

Closes scylladb/scylladb#24838

* github.com:scylladb/scylladb:
  test.py: skip cleaning artifacts when -s provided
  test.py: move deleting directory to prepare_dir
2025-07-24 09:46:42 +03:00
Gleb Natapov
ab6e328226 storage_proxy: preallocate write response handler hash table
Currently it grows dynamically and triggers oversized allocation
warning. Also it may be hard to find sufficient contiguous memory chunk
after the system runs for a while. This patch pre-allocates enough
memory for ~1M outstanding writes per shard.
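
The general technique can be illustrated with a standard container. This is a sketch only: the actual handler table in storage_proxy may use a different container, and `response_handler_table` with its string payload is a hypothetical stand-in.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical table of outstanding write handlers, keyed by response ID.
class response_handler_table {
    std::unordered_map<uint64_t, std::string> _handlers;
public:
    // Reserve buckets for the expected number of entries up front, so the
    // table does not rehash (and reallocate) while it fills up.
    explicit response_handler_table(std::size_t expected_entries) {
        _handlers.reserve(expected_entries);
    }

    void add(uint64_t id, std::string handler) {
        _handlers.emplace(id, std::move(handler));
    }

    std::size_t size() const { return _handlers.size(); }
    std::size_t bucket_count() const { return _handlers.bucket_count(); }
};
```

After constructing the table for N entries, inserting up to N entries leaves `bucket_count()` unchanged, which is the property that avoids a large allocation late in the process lifetime.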

Fixes #24660
Fixes #24217

Closes scylladb/scylladb#25098
2025-07-24 09:46:42 +03:00
Patryk Jędrzejczak
f89ffe491a Merge 'storage_service: cancel all write requests after stopping transports' from Sergey Zolotukhin
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3.

Closes scylladb/scylladb#24714

* https://github.com/scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-07-24 09:46:42 +03:00
Michał Chojnowski
0ca983ea91 utils/bit_cast: add object_representation()
A utility that casts a trivial object to the span of its bytes.
2025-07-23 17:03:05 +02:00
Patryk Jędrzejczak
f408d1fa4f docs: document the option to set recovery_leader later
In one of the previous commits, we made it possible to set
`recovery_leader` on each node just before restarting it. Here, we
update the corresponding documentation.
2025-07-23 15:36:57 +02:00
Patryk Jędrzejczak
9e45e1159b test: delay setting recovery_leader in the recovery procedure tests
In the previous commit, we made it possible to set `recovery_leader`
on each node just before restarting it. Here, we change all the
tests of the Raft-based recovery procedure to use and test this option.
2025-07-23 15:36:57 +02:00
Patryk Jędrzejczak
ba5b5c7d2f gossip: add recovery_leader to gossip_digest_syn
In the new Raft-based recovery procedure, live nodes join the new
group 0 one by one during a rolling restart. There is a time window when
some of them are in the old group 0, while others are in the new group
0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The
current solution for this problem is to ignore group 0 mismatches if
`recovery_leader` is set on the local node and to ask the administrator
to perform the rolling restart in the following way:
- set `recovery_leader` in `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- proceed with the rolling restart.

This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches
when exactly one of the two gossiping nodes has `recovery_leader` set.
We achieve this by adding `recovery_leader` to `gossip_digest_syn`.
This change makes setting `recovery_leader` earlier on all nodes and
reloading the config unnecessary. From now on, the administrator can
simply restart each node with `recovery_leader` set.

However, note that nodes that join group 0 must have `recovery_leader`
set until all nodes join the new group 0. For example, assume that we
are in the middle of the rolling restart and one of the nodes in the new
group 0 crashes. It must be restarted with `recovery_leader` set, or
else it would reject `gossip_digest_syn` messages from nodes in the old
group 0. To avoid problems in such cases, we will continue to recommend
setting `recovery_leader` in `scylla.yaml` instead of passing it as
a command line argument.
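
The acceptance rule introduced here can be sketched as a predicate. This covers only the group 0 check described in this message; the real `gossiper::handle_syn_msg` performs other checks as well. The string-based `uuid` alias and the function name `group0_check_passes` are hypothetical.

```cpp
#include <optional>
#include <string>

using uuid = std::string;  // stand-in for the real UUID type

// Accept a SYN if both nodes are in the same group 0, or if exactly one of
// the two gossiping nodes has recovery_leader set.
bool group0_check_passes(const uuid& local_group0_id, const uuid& remote_group0_id,
                         const std::optional<uuid>& local_recovery_leader,
                         const std::optional<uuid>& remote_recovery_leader) {
    if (local_group0_id == remote_group0_id) {
        return true;
    }
    return local_recovery_leader.has_value() != remote_recovery_leader.has_value();
}
```

This also shows why a restarted node in the new group 0 must keep `recovery_leader` set: without it, neither side has the option set, and a mismatching SYN from the old group 0 is rejected.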
2025-07-23 15:36:57 +02:00
Patryk Jędrzejczak
23f59483b6 db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
Currently, `peers_table_read_fixup` removes rows with no `host_id`, but
not with null `host_id`. Null host IDs are known to appear in system
tables, for example in `system.cluster_status` after a failed bootstrap.
We better make sure we handle them properly if they ever appear in
`system.peers`.

This commit guarantees that null UUID cannot belong to
`loaded_endpoints` in `storage_service::join_cluster`, which in
particular ensures that we throw a runtime error when a user sets
`recovery_leader` to null UUID during the recovery procedure. This is
handled by the code verifying that `recovery_leader` belongs to
`loaded_endpoints`.
2025-07-23 15:36:56 +02:00
Patryk Jędrzejczak
445a15ff45 db/config, gms/gossiper: change recovery_leader to UUID
We change the type of the `recovery_leader` config parameter and
`gossip_config::recovery_leader` from sstring to UUID. `recovery_leader`
is supposed to store host ID, so UUID is a natural choice.

After changing the type to UUID, if the user provides an incorrect UUID,
parsing `recovery_leader` will fail early, but the start-up will
continue. Outside the recovery procedure, `recovery_leader` will then be
ignored. In the recovery procedure, the start-up will fail on:

```
throw std::runtime_error(
        "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. "
        "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader.");
```
2025-07-23 15:36:56 +02:00
Patryk Jędrzejczak
ec69028907 db/config, utils: allow using UUID as a config option
We change the `recovery_leader` option to UUID in the following commit.
2025-07-23 15:36:45 +02:00
Gleb Natapov
ddc3b6dcf5 migration manager: assert that if schema pull is disabled the group0 is not in use_pre_raft_procedures state
If schema pulls are disabled, group0 is used to bring the schema up to date
by calling start_group0_operation() which executes raft read barrier
internally, but if the group0 is still in use_pre_raft_procedures
start_group0_operation() silently does nothing. Later the code that
assumes that schema is already up-to-date will fail and print warnings
into the log. But since getting queries in the state when a node is in
raft enabled mode but group0 is still not configured is illegal it is
better to make those errors more visible by asserting them during
testing.

Closes scylladb/scylladb#25112
2025-07-23 14:10:17 +02:00
Petr Gusev
41a67510bb Revert "main.cc: fix group0 shutdown order"
This reverts commit 6b85ab79d6.
2025-07-23 12:11:01 +02:00
Botond Dénes
b65a2e2303 Update seastar submodule
* seastar 26badcb1...60b2e7da (42):
  > Revert "Fix incorrect defaults for io queue iops/bandwidth"
  > fair_queue: Ditch queue-wide accumulator reset on overflow
  > addr2line, scripts/stall-analyser: change the default tool to llvm-addr2line
  > Fix incorrect defaults for io queue iops/bandwidth
  > core/reactor: add cxx_exceptions() getter
  > gate: make destructor virtual
  > scripts/seastar-addr2line: change the default addr2line utility to llvm-addr2line
  > coding-style: Align example return types
  > reactor: Remove min_vruntime() declaration
  > reactor: Move enable_timer() method to private section
  > smp: fix missing span include
  > core: Don't keep internal errors counter on reactor
  > pollable_fd: Untangle shutdown()
  > io_queue: Remove deprecated statistics getters
  > fair_queue: Remove queued/executing resource counters
  > reactor: Move set_current_task() from public reactor API
  > util: make SEASTAR_ASSERT() failure generate SIGABRT
  > core: fix high CPU use at idle on high core count machines
  > Merge 'Move output IO throttler to IO queue level' from Pavel Emelyanov
    fair_queue: Move io_throttler to io_queue.hh
    fair_queue: Move metrics from to io_queue::stream
    fair_queue: Remove io_throttler from tests
    fair_queue_test: Remove io-throttler from fair-queue
    fair_queue: Remove capacity getters
    fair_queue: Move grab_result into io_queue::stream too
    fair_queue: Move throtting code to io_queue.cc
    fair_queue: Move throttling code to io_queue::stream class
    fair_queue: Open-code dispatch_requests() into users
    fair_queue: Split dispatch_requests() into top() and pop_front()
    fair_queue: Swap class push back and dispatch
    fair_queue: Configure forgiving factor externally
    fair_queue: Move replenisher kick to dispatch caller
    io_queue: Introduce io_queue::stream
    fair_queue: Merge two grab_capacity overloads
    fair_queue: Detatch outcoming capacity grabbing from main dispatch loop
    fair_queue: Move available tokens update into if branch
    io_queue: Rename make_fair_group_config into configure_throttler
    io_queue: Rename get_fair_group into get_throttler
    fair_queue: Rename fair_group -> io_throttler
  > http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values
  > Merge 'Relax reactor coupling with file_data_source_impl' from Pavel Emelyanov
    reactor: Relax friendship with file_data_source_impl
    fstream: Use direct io_stats reference
  > thread_pool: Relax coupling with reactor
  > reactor: Mark some IO classes management methods private
  > http: Deprecate json_exception
  > io_tester: Collect and report disk queue length samples
  > test/perf: Add context-switch measurer
  > http/client: Zero-copy forward content-length body into the underlying stream
  > json2code: Genrate move constructor and move-assignment operator
  > Merge 'Semi-mixed mode for output_stream' from Pavel Emelyanov
    output_stream: Support semi-mixed mode writing
    output_stream: Complete write(temporary_buffer) piggy-back-ing write(packet)
    iostream: Add friends for iostream tests
    packet: Mark bool cast operator const
    iostream: Document output_stream::write() methods
  > io_tester: Show metrics about requests split
  > reactor: add counter for internal errors
  > iotune: Print correct throughput units
  > core: add label to io_threaded_fallbacks to categorize operations
  > slab: correct allocation logic and enforce memory limits
  > Merge 'Fix for non-json http function_handlers' from Travis Downs
    httpd_test: add test for non-JSON function handler
    function_handlers: avoid implicit conversions
    http: do not always treat plain text reply as json
  > Merge 'tls: add ALPN support' from Łukasz Kurowski
    tls: add server-side ALPN support
    tls: add client-side ALPN support
  > Merge 'coroutine: experimental: generator: implement move and swap' from Benny Halevy
    coroutine: experimental: generator: implement move and swap
    coroutine: experimental: generator: unconstify buffer capacity
  > future: downgrade asserts
  > output_stream: Remove unused bits
  > Merge 'Upstream a couple of minor reactor optimizations' from Travis Downs
    Match type for pure_check_for_work
    Do not use std::function for check_for_work()
  > Handle ENOENT in getgrnam

Includes scylla-gdb.py update by Pavel Emelyanov.

Closes scylladb/scylladb#25094
2025-07-22 18:19:58 +02:00
Pavel Emelyanov
2df1945f2a compaction: Pass "reason" to perform_task_on_all_files()
This distinguishes the "cleanup", "rewrite" and "split" reasons from each other

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-22 18:53:10 +03:00
Pavel Emelyanov
08c8c03a20 compaction: Pass "reason" to run_with_compaction_disabled()
This tells "cleanup" (done via try_perform_cleanup) and prepares the
ground for more callers (see next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-22 18:52:09 +03:00
Pavel Emelyanov
db46da45d2 compaction: Pass "reason" to stop_and_disable_compaction()
This tells "truncate" operation from other reasons

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-22 18:51:16 +03:00
Sergey Zolotukhin
e0dc73f52a storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.

Fixes scylladb/scylladb#23665
2025-07-22 15:03:30 +02:00
Sergey Zolotukhin
bc934827bc test: Add test for unfinished writes during shutdown and topology change
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.

When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.

After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.

During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.

This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.

Test for: scylladb/scylladb#23665
2025-07-22 15:03:13 +02:00
Ran Regev
3d82b9485e docs: update nodetool restore documentation for --sstables-file-list
Fixes: #25128
A leftover from #25077

Closes scylladb/scylladb#25129
2025-07-22 14:43:35 +02:00
Yaron Kaikov
4445c11c69 ./github/workflows/conflict_reminder: improve workflow with weekly notifications
- Change schedule from twice weekly (Mon/Thu) to once weekly (Mon only)
- Extend notification cooldown period from 3 days to 1 week
- Prevent notification spam while maintaining immediate conflict detection on pushes

Fixes: https://github.com/scylladb/scylladb/issues/25130

Closes scylladb/scylladb#25131
2025-07-22 15:21:12 +03:00
Benny Halevy
fce6c4b41d tablets: prevent accidental copy of tablets_map
As they are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.

Add clone() and clone_gently() methods to
allow explicit copies.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-22 15:07:26 +03:00
Benny Halevy
dee0d7ffbf locator: tablets: get rid of synchronous mutate_tablet_map
It is currently used only by tests that could very well
do with mutate_tablet_map_async.

This will simplify the following patch to prevent
accidental copy of the tablet_map, providing explicit
clone/clone_gently methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-22 15:03:02 +03:00
Avi Kivity
e4c4141d97 test.py: don't crash on early cleanup of ScyllaServer
If a test fails very early (still have to find why), test.py
crashes while flushing a non-existent log_file, as shown below.

To fix, initialize the property to None and check it during
cleanup.

```
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------

'ScyllaServer' object has no attribute 'log_file'
test_cluster_features Traceback (most recent call last):
  File "/home/avi/scylla-maint/./test.py", line 816, in <module>
    sys.exit(asyncio.run(main()))
             ~~~~~~~~~~~^^^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/home/avi/scylla-maint/./test.py", line 523, in main
    total_tests_pytest, failed_pytest_tests = await run_all_tests(signaled, options)
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/./test.py", line 452, in run_all_tests
    failed += await reap(done, pending, signaled)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/./test.py", line 418, in reap
    result = coro.result()
  File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 143, in run
    return await super().run(test, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/test/pylib/suite/base.py", line 216, in run
    await test.run(options)
  File "/home/avi/scylla-maint/test/pylib/suite/topology.py", line 48, in run
    async with get_cluster_manager(self.uname, self.suite.clusters, str(self.suite.log_dir)) as manager:
               ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/contextlib.py", line 221, in __aexit__
    await anext(self.gen)
  File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 2006, in get_cluster_manager
    await manager.stop()
  File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 1539, in stop
    await self.clusters.put(self.cluster, is_dirty=True)
  File "/home/avi/scylla-maint/test/pylib/pool.py", line 104, in put
    await self.destroy(obj)
  File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 65, in recycle_cluster
    srv.log_file.close()
    ^^^^^^^^^^^^
AttributeError: 'ScyllaServer' object has no attribute 'log_file'
```
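
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the actual test.py code: the `Server` class and its `start()`/`cleanup()` methods are hypothetical stand-ins for `ScyllaServer`.

```python
import os
import tempfile


class Server:
    """Hypothetical stand-in for ScyllaServer; all names are illustrative."""

    def __init__(self, log_path):
        self.log_path = log_path
        # Initialize up front so cleanup can run even if start() never did.
        self.log_file = None

    def start(self):
        self.log_file = open(self.log_path, "a")

    def cleanup(self):
        # Only close the log file if it was actually opened.
        if self.log_file is not None:
            self.log_file.close()
            self.log_file = None


early_failure = Server(os.path.join(tempfile.gettempdir(), "never-opened.log"))
early_failure.cleanup()  # no AttributeError although start() was never called
```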

Closes scylladb/scylladb#24885
2025-07-22 12:39:01 +02:00
Avi Kivity
2db2b42556 sstables: version: drop custom operator<=>
The default comparison for enums is equivalent and
sufficient.

Closes scylladb/scylladb#24888
2025-07-22 12:39:01 +02:00
Avi Kivity
e89f6c5586 config, main: make cpu scheduling mandatory
CPU scheduling has been with us since 641aaba12c
(2017), and no one ever disables it. Likely nothing really works without
it.

Make it mandatory and mark the option unused.

Closes scylladb/scylladb#24894
2025-07-22 12:39:01 +02:00
Avi Kivity
ee138217ba alternator: simplify std::views::transform calls that extract a member from a class
Rather than calling std::views::transform with a lambda that extracts
a member from a class, call std::views::transform with a pointer-to-member
to do the same thing. This results in more concise code.

Closes scylladb/scylladb#25012
2025-07-22 12:39:01 +02:00
Jakub Smolar
6e0a063ce3 gdb: handle zero-size reads in managed_bytes
Fixes: https://github.com/scylladb/scylladb/issues/25048

Closes scylladb/scylladb#25050
2025-07-22 12:39:01 +02:00
Nadav Har'El
298a0ec4de test/cqlpy: in README.md, remind users of run-cassandra to set NODETOOL
test/cqlpy/README.md explains how to run the cqlpy tests against
Cassandra, and mentions that if you don't have "nodetool" in your path
you need to set the NODETOOL variable. However, when giving a simple
example how to use the run-cassandra script, we forgot to remind the
user to set NODETOOL in addition to CASSANDRA, causing confusion for
users who didn't know why tests were failing.

So this patch fixes the section in test/cqlpy/README.md with the
run-cassandra example to also set the NODETOOL environment variable,
not just CASSANDRA.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25051
2025-07-22 12:39:00 +02:00
Aleksandra Martyniuk
b5026edf49 tasks: change _finished_children type
Parent task keeps a vector of statuses (task_essentials) of its finished
children. When the children number is large - for example because we
have many tables and a child task is created for each table - we may hit
oversize allocation while adding a new child essentials to the vector.

Keep task_essentials of children in a chunked_vector.

Fixes: #25040.

Closes scylladb/scylladb#25064
2025-07-22 12:39:00 +02:00
Pavel Emelyanov
d94be313c1 Merge 'test: audit: ignore cassandra user audit logs in AUTH tests' from Andrzej Jackowski
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.

This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.

Additionally, this PR:
 - Adds missing `nonlocal new_rows` statement that prevented some checks from being called
 - Adds a testcase for audit logs of `cassandra` user

Fixes: https://github.com/scylladb/scylladb/issues/25069

Better backport those test changes to 2025.3. 2025.2 and earlier don't have `./cluster/dtest/audit_test.py`.

Closes scylladb/scylladb#25111

* github.com:scylladb/scylladb:
  test: audit: add cassandra user test case
  test: audit: ignore cassandra user audit logs in AUTH tests
  test: audit: change names of `filter_out_noise` parameters
  test: audit: add missing `nonlocal new_rows` statement
2025-07-22 10:42:16 +03:00
Pavel Emelyanov
295165d8ea Merge 's3_client: Enhance s3_client error handling' from Ernest Zaslavsky
Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on Seastar's side, since that would break the integrity of the data, which may be downloaded more than once for the same range.

Fixes: https://github.com/scylladb/scylladb/issues/25043

Should be backported to 2025.3 since we have an intention to release native backup/restore feature

Closes scylladb/scylladb#24883

* github.com:scylladb/scylladb:
  s3_client: Disable Seastar-level retries in HTTP client creation
  s3_test: Validate handling of non-`aws_error` exceptions
  s3_client: Improve error handling in chunked_download_source
  aws_error: Add factory method for `aws_error` from exception
2025-07-22 10:40:39 +03:00
Ran Regev
dd67d22825 nodetool restore: sstable list from a file
Fixes: #25045

Added the ability to supply the list of files to
restore from a given file.
Mainly required for local testing.

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#25077
2025-07-22 09:11:02 +03:00
Pavel Emelyanov
52455f93b6 gms,init: Move get_disabled_features_from_db_config() from gms
Now that all callers are decoupled from gms config generating code, the
latter can be decoupled from the db::config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-21 19:20:17 +03:00
Pavel Emelyanov
8220974e76 code: Update callers generating feature service config
Instead of requesting it from gms code, create it "by hand" with the
help of get_disabled_features_from_db_config() method. This is how other
services are configured by main/tools/testing code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-21 19:19:09 +03:00
Pavel Emelyanov
0808e65b4e gms: Make feature_config a simple struct
All config-s out there are plain structures without private members and
methods used to simply carry the set of config values around. Make the
feature service config alike.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-21 19:17:59 +03:00
Pavel Emelyanov
f703fb9b2d gms: Split feature_config_from_db_config() into two
The helper in question generates the disabled features set and assigns
one on the config. This patch detaches the features set generation into
another function. The former will go away eventually and the latter
will be kept around main/test code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-21 19:16:40 +03:00
Ernest Zaslavsky
fc2c9dd290 s3_client: Disable Seastar-level retries in HTTP client creation
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
  and AWS-specific errors—via `http_retryable_client`
2025-07-21 17:03:23 +03:00
Ernest Zaslavsky
ba910b29ce s3_test: Validate handling of non-aws_error exceptions
Inject exceptions not wrapped in `aws_error` from request callback
lambda to verify they are properly caught and handled.
2025-07-21 16:52:43 +03:00
Ernest Zaslavsky
b7ae6507cd s3_client: Improve error handling in chunked_download_source
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.
2025-07-21 16:49:47 +03:00
Ernest Zaslavsky
d53095d72f aws_error: Add factory method for aws_error from exception
Move `aws_error` creation logic out of `retryable_http_client` and
into the `aws_error` class to support reuse across components.
2025-07-21 16:42:44 +03:00
Andrzej Jackowski
21aedeeafb test: audit: add cassandra user test case
Audit tests use the `filter_out_noise` function to remove noise from
audit logs generated by user authentication. As a result, none of the
existing tests covered audit logs for the default `cassandra` user.
This change adds a test case for that user.

Refs: scylladb/scylladb#25069
2025-07-21 14:54:20 +02:00
Andrzej Jackowski
aef6474537 test: audit: ignore cassandra user audit logs in AUTH tests
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.

This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.
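
A minimal sketch of such a filter, under stated assumptions: the `AuditRow` fields and the `filter_out_cassandra_auth()` helper are hypothetical and do not reproduce the real `filter_out_noise` signature, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRow:
    """Hypothetical shape of one audit log entry."""
    category: str
    username: str
    operation: str


def filter_out_cassandra_auth(rows, default_user="cassandra"):
    """Drop AUTH entries produced by the default user; keep everything else."""
    return [
        row for row in rows
        if not (row.category == "AUTH" and row.username == default_user)
    ]


rows = [
    AuditRow("AUTH", "cassandra", "LOGIN"),  # noise: may appear at any time
    AuditRow("AUTH", "test_user", "LOGIN"),  # what an AUTH test verifies
    AuditRow("DML", "cassandra", "INSERT"),  # not AUTH, so it is kept
]
print(filter_out_cassandra_auth(rows))
```

Filtering on both category and user keeps the non-AUTH entries of the default user visible to tests that rely on them.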

Fixes: scylladb/scylladb#25069
2025-07-21 14:54:20 +02:00
Andrzej Jackowski
daf1c58e21 test: audit: change names of filter_out_noise parameters
This is a refactoring commit that changes the names of the parameters
of the `filter_out_noise` function, as well as names of related
variables. The motivation for the change is the introduction of more
complex filtering logic in next commit of this patch series.

Refs: scylladb/scylladb#25069
2025-07-21 14:54:01 +02:00
Andrzej Jackowski
e634a2cb4f test: audit: add missing nonlocal new_rows statement
The variable `new_rows` was not updated by the inner function
`is_number_of_new_rows_correct` because the `nonlocal new_rows`
statement was missing. As a result, `sorted_new_rows` was empty and
certain checks were skipped.

This change:
 - Introduces the missing `nonlocal new_rows` declaration
 - Adds an assertion verifying that the number of new rows matches
   the expected count
 - Fixes the incorrect variable name in the lambda used for row sorting
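
The effect of the missing statement can be reproduced with a small, self-contained sketch. The function names below are illustrative and are not taken from the audit test itself.

```python
def collect_buggy(batches):
    new_rows = []

    def check(batch):
        # Without `nonlocal`, this assignment creates a fresh local variable;
        # the outer `new_rows` is never updated.
        new_rows = list(batch)
        return len(new_rows) > 0

    for batch in batches:
        check(batch)
    return new_rows


def collect_fixed(batches):
    new_rows = []

    def check(batch):
        nonlocal new_rows  # rebind the variable of the enclosing function
        new_rows = list(batch)
        return len(new_rows) > 0

    for batch in batches:
        check(batch)
    return new_rows


print(collect_buggy([[1, 2], [3]]))  # []
print(collect_fixed([[1, 2], [3]]))  # [3]
```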
2025-07-21 14:53:48 +02:00
Pavel Emelyanov
339f08b24a scripts: Enhance refresh_submodules.sh with nested summary
Currently when refreshing submodule, the script puts a plain list of
non-merge commits into commit message. The resulting summary contains
everything, but is hard to understand. E.g. if updating seastar today
the summary would start with

    * seastar 26badcb1...86c4893b (55):
      > util: make SEASTAR_ASSERT() failure generate SIGABRT
      > core: fix high CPU use at idle on high core count machines
      > http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values
      > reactor: Relax friendship with file_data_source_impl
      > fstream: Use direct io_stats reference
      > thread_pool: Relax coupling with reactor
      > reactor: Mark some IO classes management methods private
      > http: Deprecate json_exception
      > fair_queue: Move io_throttler to io_queue.hh
      > fair_queue: Move metrics from to io_queue::stream
      > fair_queue: Remove io_throttler from tests
      > fair_queue_test: Remove io-throttler from fair-queue
      > fair_queue: Remove capacity getters
      > fair_queue: Move grab_result into io_queue::stream too
      > fair_queue: Move throtting code to io_queue.cc
      > fair_queue: Move throttling code to io_queue::stream class
      > fair_queue: Open-code dispatch_requests() into users
      > fair_queue: Split dispatch_requests() into top() and pop_front()
      > fair_queue: Swap class push back and dispatch
      > fair_queue: Configure forgiving factor externally
      ...

That's not very informative, because the update includes several large
"merges" that have their summary which is missing here. This update
changes the way summary is generated to include merges and their
summaries and all merged commits are listed as sub-lines, like this

    * seastar 26badcb1...86c4893b (26):
      > util: make SEASTAR_ASSERT() failure generate SIGABRT
      > core: fix high CPU use at idle on high core count machines
      > Merge 'Move output IO throttler to IO queue level' from Pavel Emelyanov
        fair_queue: Move io_throttler to io_queue.hh
        fair_queue: Move metrics from to io_queue::stream
        fair_queue: Remove io_throttler from tests
        fair_queue_test: Remove io-throttler from fair-queue
        fair_queue: Remove capacity getters
        fair_queue: Move grab_result into io_queue::stream too
        fair_queue: Move throtting code to io_queue.cc
        fair_queue: Move throttling code to io_queue::stream class
        fair_queue: Open-code dispatch_requests() into users
        fair_queue: Split dispatch_requests() into top() and pop_front()
        fair_queue: Swap class push back and dispatch
        fair_queue: Configure forgiving factor externally
        fair_queue: Move replenisher kick to dispatch caller
        io_queue: Introduce io_queue::stream
        fair_queue: Merge two grab_capacity overloads
        fair_queue: Detatch outcoming capacity grabbing from main dispatch loop
        fair_queue: Move available tokens update into if branch
        io_queue: Rename make_fair_group_config into configure_throttler
        io_queue: Rename get_fair_group into get_throttler
        fair_queue: Rename fair_group -> io_throttler
      > http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values
      > Merge 'Relax reactor coupling with file_data_source_impl' from Pavel Emelyanov
        reactor: Relax friendship with file_data_source_impl
        fstream: Use direct io_stats reference
      > thread_pool: Relax coupling with reactor
      > reactor: Mark some IO classes management methods private
      ...

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24834
2025-07-21 14:48:30 +03:00
Ernest Zaslavsky
0053a4f24a encryption: remove default case from component_type switch
Do not use default, instead list all fall-through components
explicitly, so if we add a new one, the developer doing that
will be forced to consider what to do here.

Eliminate the `default` case from the switch in
`encryption_file_io_extension::wrap_sink`, and explicitly
handle all `component_type` values within the switch statement.

fixes: https://github.com/scylladb/scylladb/issues/23724

Closes scylladb/scylladb#24987
2025-07-21 14:43:12 +03:00
Ernest Zaslavsky
408aa289fe treewide: Move misc files to utils directory
As requested in #22114, moved the files and fixed other includes and build system.

Moved files:
- interval.hh
- Map_difference.hh

Fixes: #22114

This is a cleanup, no need to backport

Closes scylladb/scylladb#25095
2025-07-21 11:56:40 +03:00
Piotr Dulikowski
7fd97e6a93 Merge 'cdc: Forbid altering columns of CDC log tables directly' from Dawid Mędrek
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.

Closes scylladb/scylladb#25008

* github.com:scylladb/scylladb:
  cdc: Forbid altering columns of inactive CDC log table
  cdc: Forbid altering columns of CDC log tables directly
2025-07-21 09:31:00 +02:00
Ran Regev
bb95ac857e enable_set: fix separator formatting from space comma to comma space
For better log readability.
Fixes: #23883

Closes scylladb/scylladb#24647
2025-07-20 19:12:57 +03:00
Avi Kivity
3dfdcf7d7a Merge 'transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of the `test_protocol_version_mismatch` Python test is used, with 100'000 runs across 10 processes (not threads, to avoid the Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. The first test run was executed on the current code, which throws protocol exceptions. The second test run was executed on the new code, which returns protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```
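
The relative gains can be recomputed from the two average lines above. The sketch below only does the arithmetic on the reported averages; the helper names are illustrative. It yields reductions of roughly 8 to 10 percent, depending on the column.

```python
def to_seconds(stamp):
    """Convert an 'M:SS.ss' timing string to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def gain_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    b, a = to_seconds(before), to_seconds(after)
    return (b - a) / b * 100.0


# Averages copied from the two result blocks above: (throwing, returning).
averages = {
    "real": ("1:26.95", "1:18.47"),
    "user": ("9:56.85", "9:10.91"),
    "sys": ("2:34.11", "2:18.77"),
}

for name, (throwing, returning) in averages.items():
    print(f"{name}: {gain_percent(throwing, returning):.1f}% less time")
```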

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

Closes scylladb/scylladb#24738

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-07-20 17:42:30 +03:00
Dawid Mędrek
59800b1d66 cdc: Forbid altering columns of inactive CDC log table
When CDC becomes disabled on the base table, the CDC log table
still exists (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old log table will be *re-attached*.

Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.

Validation tests are provided.
2025-07-18 15:03:08 +02:00
Piotr Dulikowski
85e506dab5 Merge 'test.py: print warning when no tests found' from Andrei Chekun
Break out of the repeat loop if the test is under the pytest runner directory and its name is misspelled or the test is absent. This avoids going through discovery several times and stops execution early.
Print a warning at the end of the run when no tests were selected by provided name.

Fixes: scylladb/scylladb#24892

Closes scylladb/scylladb#24918

* github.com:scylladb/scylladb:
  test.py: print warning in case no tests were found
  test.py: break the loop when there is no tests for pytest
2025-07-18 10:26:44 +02:00
Piotr Dulikowski
fd6e14f3ab Merge 'cdc: throw error if column doesn't exist' from Michael Litvak
In the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such a thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and starts writing values to the column
  before the ALTER is complete, or
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952

backport needed - simple fix for a node crash

Closes scylladb/scylladb#24986

* github.com:scylladb/scylladb:
  test: cdc: add test_cdc_with_alter
  cdc: throw error if column doesn't exist
2025-07-18 09:40:56 +02:00
Dawid Mędrek
bea7c26d64 test/cqlpy/test_describe.py: Adjust test_create_role_with_hashed_password_authorization to work with Cassandra
We adjust test_create_role_with_hashed_password_authorization to work
with both Scylla and Cassandra. For some reason (probably a bug),
Cassandra requires that the `LOGIN` property of a role come before
the password.
2025-07-17 22:18:12 +02:00
Dawid Mędrek
55c22f864e test/cqlpy/test_describe.py: Adjust test_desc_restore to work with Cassandra
Cassandra doesn't use service levels, and it doesn't include auth
in the output of `DESCRIBE SCHEMA`. It doesn't support the form of the
statement `... WITH PASSWORDS`. UDFs in Cassandra don't support Lua.
That's why the test didn't work against Cassandra.

In this commit, we adjust it to work with both Scylla and Cassandra.
2025-07-17 22:17:15 +02:00
Dawid Mędrek
fca03ca915 test/cqlpy/test_describe.py: Mark Scylla-only tests as such
Tests verifying that auth and service levels are part of the output
of `DESCRIBE SCHEMA` were not marked as `scylla_only` when they were
written, but they're a feature only Scylla has. Because of that, let's
mark them with `scylla_only` so they're not run against Cassandra to
avoid unnecessary failures. We also provide a short explanation for
each test why it's marked that way.
2025-07-17 21:45:44 +02:00
Andrei Chekun
04b0fba88c test.py: print warning in case no tests were found
Print a warning at the end of the run when no tests were selected by provided
name.

Fixes: https://github.com/scylladb/scylladb/issues/24892
2025-07-17 19:51:22 +02:00
Michael Litvak
86dfa6324f test: cdc: add test_cdc_with_alter
Add a test that adds and drops a column on a table with CDC
enabled while writing to it.
2025-07-17 17:16:17 +02:00
Michael Litvak
b336f282ae cdc: throw error if column doesn't exist
In the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such a thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and starts writing values to the column
  before the ALTER is complete, or
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952
2025-07-17 17:16:17 +02:00
Dario Mirovic
4a6f71df68 test/cqlpy: add cpp exception metric test conditions
Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions`
metric is used. This is a global metric. To address potential test flakiness,
each test runs multiple times:
- `run_count = 100`
- `cpp_exception_threshold = 10`

If a change in the code introduced an exception, the expectation is that the number
of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs,
in which case the test fails.
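
The thresholding described above can be sketched roughly as follows. This is an
illustrative sketch only, not the test code from this commit: `read_counter`
stands in for whatever reads `scylla_reactor_cpp_exceptions`, and the helper
names `exception_delta` and `introduces_exceptions` are hypothetical.

```python
from typing import Callable


def exception_delta(read_counter: Callable[[], int],
                    scenario: Callable[[], None],
                    run_count: int = 100) -> int:
    """Run the scenario run_count times and return the growth of the counter."""
    before = read_counter()
    for _ in range(run_count):
        scenario()
    return read_counter() - before


def introduces_exceptions(delta: int, cpp_exception_threshold: int = 10) -> bool:
    """A global counter may pick up unrelated noise, so only a growth
    above the threshold is treated as a failure."""
    return delta > cpp_exception_threshold
```

A code path that throws once per run would add about `run_count` exceptions and
exceed the threshold, while a handful of unrelated exceptions from elsewhere in
the process would stay below it.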
2025-07-17 17:02:48 +02:00
Dario Mirovic
5390f92afc transport/server: replace protocol_exception throws with returns
Replace throwing protocol_exception with returning it as a result
or an exceptional future in the transport server module. This
improves performance, for example during connection storms and
server restarts, where protocol exceptions are more frequent.

In functions already returning a future, protocol exceptions are
propagated using an exceptional future. In functions not already
returning a future, result_with_exception is used.

A notable change is checking v.failed() before calling v.get() in the
process_request function, to avoid throwing in case of an
exceptional future.

Refs: #24567
2025-07-17 16:54:05 +02:00
Dario Mirovic
9f4344a435 utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
2025-07-17 16:40:02 +02:00
Dario Mirovic
30d424e0d3 transport/server: avoid exception-throw overhead in handle_error
Previously, connection::handle_error always called f.get() inside a try/catch,
forcing every failed future to throw and immediately catch an exception just to
classify it. This change eliminates that extra throw/catch cycle by first checking
f.failed(), getting the stored std::exception_ptr via f.get_exception(), and
then dispatching on its type via utils::try_catch<T>(eptr).

The error-response logic is not changed - cassandra_exception, std::exception,
and unknown exceptions are caught and processed, and any exceptions thrown by
write_response while handling those exceptions continue to escape handle_error.

Refs: #24567
2025-07-17 16:40:02 +02:00
Dario Mirovic
7aaeed012e test/cqlpy: add protocol_exception tests
Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter
from Scylla's metrics endpoint. These metrics are used to track protocol error
count before and after each test.

Add cql_with_protocol context manager utility for session creation with parameterized
protocol_version value. This is used for testing connection establishment with
different protocol versions, and proper disposal of successfully established sessions.

The tests cover two failure scenarios:
- Protocol version mismatch in test_protocol_version_mismatch, which tests both supported
and unsupported protocol versions
- Malformed frames via raw socket in _protocol_error_impl, used by several test functions,
and also test_no_protocol_exceptions test to assert that the error counters never decrease
during test execution, catching unintended metric resets

Refs: #24567
2025-07-17 16:39:54 +02:00
Petr Gusev
2027856847 Revert "paxos_state: read repair for intranode_migration"
This reverts commit 45f5efb9ba.

The load_and_repair_paxos_state function was introduced in
scylladb/scylladb#24478, but it has never been tested or proven useful.

One set of problems stems from its use of local data structures
from a remote shard. In particular, system_keyspace and schema_ptr
cannot be directly accessed from another shard — doing so is a bug.

More importantly, load_paxos_state on different shards can't ever
return different values. The actual shard from which data is read is
determined by sharder.shard_for_reads, and storage_proxy will jump
back to the appropriate shard if the current one doesn't match. This
means load_and_repair_paxos_state can't observe paxos state from
a write-but-not-read shard, and therefore will never be able to
repair anything.

We believe this explicit Paxos state read-repair is not needed at all.

Any paxos state read which drives some paxos round forward is already
accompanied by a paxos state write. Suppose we wrote the state to the
old shard but not to the new shard (because of some error) while
streaming is already finished. The RPC call (prepare or accept) will
return an error to the coordinator, so such a replica response won't affect
the current round. This write won't affect any subsequent paxos rounds
either, unless in those rounds the write actually succeeds on both
shards, effectively 'auto-repairing' paxos state.

Same if we managed to write to the new shard but not to the old shard.
Any subsequent reads will observe either the old state or the new
state (if the tablet already switched reads to the new shard). In any
case, we'll have to write the state to all relevant shards
from sharder.shard_for_writes (one or two) before sending the RPC
response, making this state visible for all subsequent reads.

Thus, the monotonicity property ("once observed, the state must always
be observed") appears to hold without requiring explicit read-repair
and load_and_repair_paxos_state is not needed.

Closes scylladb/scylladb#24926
2025-07-17 14:00:43 +02:00
Botond Dénes
20693edb27 Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).

In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_reader` interface.
Later, we will add BTI indexes which will also implement this interface.

This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.

Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.

No backports needed, this is a preparation for new functionality.

Closes scylladb/scylladb#25000

* github.com:scylladb/scylladb:
  sstables: add sstable::make_index_reader() and use where appropriate
  sstables/mx: in readers, use abstract_index_reader instead of index_reader
  sstables: in validate(), use abstract_index_reader instead of index_reader where possible
  test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
  sstables/index_reader: introduce abstract_index_reader
  sstables/index_reader: extract a prefetch_lower_bound() method
2025-07-17 14:32:08 +03:00
Nadav Har'El
04b263b51a Merge 'vector_index: do not create a view when creating a vector index' from Michał Hudobski
This PR adds a way for custom indexes to decide whether a view should be created for them, as for the vector_index the view is not needed, because we store it in the external service. To allow this, custom logic for describing indexes using custom classes was added (as it used to depend on the view corresponding to an index).

Fixes: VECTOR-10

Closes scylladb/scylladb#24438

* github.com:scylladb/scylladb:
  custom_index: do not create view when creating a custom index
  custom_index: refactor describe for custom indexes
  custom_index: remove unneeded duplicate of a static string
2025-07-17 13:48:49 +03:00
Michał Chojnowski
4e4a4b6622 sstables: add sstable::make_index_reader() and use where appropriate
If we add multiple index implementations, users of index readers won't
easily know which concrete index reader type is the right one to construct.

We also don't want pieces of code to depend on functionality specific to
certain concrete types, if that's not necessary.

So instead of constructing the readers by themselves, they can use a helper
function, which will return an abstract (virtual) index reader.
This patch adds such a function, as a method of `sstable`.
2025-07-17 10:32:57 +02:00
Michał Chojnowski
1c4065e7dd sstables/mx: in readers, use abstract_index_reader instead of index_reader
This makes clear which methods of index_reader are available for use
by sstable readers, and which aren't.
2025-07-17 10:32:57 +02:00
Michał Chojnowski
efcf3f5d66 sstables: in validate(), use abstract_index_reader instead of index_reader where possible
After we add a second index implementation, we will probably want to
adjust validate() to work with either implementation.

Some validations will be format-specific, but some will be common.
For now, let's use abstract_index_reader for the validations which
can be done through that interface, and let's have downcast-specific
codepaths for the others.

Note: we change a `get_data_file_position()` call to `data_file_positions().start`.
The call happens at the beginning of a partition, and at this point
these two expressions are supposed to be equivalent.
2025-07-17 10:32:57 +02:00
Michał Chojnowski
92219a5ef8 test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
We don't want tests to create the concrete `index_reader` directly. We
would like them to be able to test both sstables which use
`index_reader`, and those which will use the planned new index implementation.
So we will let the tests construct an abstract_index_reader and pass it
to the index_reader_assertions, which will be able to assert the requested
properties on various implementations as it wants.
2025-07-17 10:32:56 +02:00
Michał Chojnowski
c052ccd081 sstables/index_reader: introduce abstract_index_reader
We want to implement BTI indexes in Scylla.
After we do that, some sstables will use a BTI index reader,
while others will use the old BIG index reader.
To handle that, we can expose a common virtual "index reader"
interface to sstable readers. This is what this patch does.

This interface can't be quite fully implemented by a BTI index,
because some methods return keys which a BIG index stores,
but a BTI index doesn't. So it will be further restricted in future
patches. But for now, we only extract *all* methods currently
used by the readers to a virtual interface.
2025-07-17 10:32:56 +02:00
Botond Dénes
fd6877c654 Merge 'alternator: avoid oversized allocation in Query/Scan' from Nadav Har'El
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.

The first patch in the series is the main fix - the later patches are cleanups that were requested by reviewers but also involve other pre-existing code, so I did those cleanups as separate patches.

Alternator's Scan and Query operations return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1 MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e., a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.

Fixes #23535

The stalls caused by large allocations were seen by actual users, so it makes sense to backport this patch. On the other hand, the patch, while not big, is fairly intrusive (it modifies the normal Scan and Query path, and the later patches also clean up additional code), so there is some small risk involved in the backport.

Closes scylladb/scylladb#24480

* github.com:scylladb/scylladb:
  alternator: clean up by co-routinizing
  alternator: avoid spamming the log when failing to write response
  alternator: clean up and simplify request_return_type
  alternator: avoid oversized allocation in Query/Scan
2025-07-17 11:30:40 +03:00
Calle Wilund
5dd871861b tests::proc::process_fixture: Fix line handler adaptor buffering
Fixes #24998

Helper routine translating input_stream buffers to single lines
did not loop over current buffer state, leading to only the first
line being sent to end listener.

Rewrote to use range iteration instead. Nicer.
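
The buffering problem described above can be illustrated with a small sketch.
This is not the C++ fixture code from this commit; `LineAdaptor` and its methods
are hypothetical names, and the sketch only shows the general technique of
emitting every complete line in a buffer while keeping the trailing partial line
for the next buffer.

```python
from typing import Callable


class LineAdaptor:
    """Turns arbitrary stream chunks into single-line callbacks."""

    def __init__(self, on_line: Callable[[str], None]) -> None:
        self._on_line = on_line
        self._pending = ""  # partial line carried over between chunks

    def feed(self, chunk: str) -> None:
        self._pending += chunk
        # Everything before the last newline is complete; the remainder
        # (possibly empty) waits for more input.
        *complete, self._pending = self._pending.split("\n")
        for line in complete:  # loop over ALL lines, not just the first
            self._on_line(line)

    def close(self) -> None:
        # Flush an unterminated final line, if any.
        if self._pending:
            self._on_line(self._pending)
            self._pending = ""
```

Feeding `"a\nb\nc"` followed by `"d\n"` delivers `a`, `b`, and `cd` to the
listener; a handler that inspected each buffer only once would have delivered
only `a` from the first buffer.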

Closes scylladb/scylladb#24999
2025-07-17 10:58:03 +03:00
Ernest Zaslavsky
342e94261f s3_client: parse multipart response XML defensively
Ensure robust handling of XML responses when initiating multipart
uploads. Check for the existence of required nodes before access,
and throw an exception if the XML is empty or malformed.
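
As a rough illustration of the defensive approach (not the actual s3_client
code, which is C++), the sketch below parses an S3 multipart initiation
response and refuses to touch a node before checking that it exists. The
function name `parse_upload_id` is hypothetical; the `UploadId` element is the
one S3 documents for this response.

```python
import xml.etree.ElementTree as ET


def parse_upload_id(body: str) -> str:
    """Return the UploadId from a multipart initiation response,
    raising ValueError if the XML is empty or malformed."""
    if not body or not body.strip():
        raise ValueError("empty multipart upload response")
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        raise ValueError(f"malformed multipart upload response: {exc}") from exc
    for element in root.iter():
        # Compare local names so a default xmlns does not hide the node.
        if element.tag.rsplit("}", 1)[-1] == "UploadId":
            if element.text and element.text.strip():
                return element.text.strip()
            break
    raise ValueError("multipart upload response has no UploadId")
```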

Refs: https://github.com/scylladb/scylladb/issues/24676

Closes scylladb/scylladb#24990
2025-07-17 10:55:04 +03:00
Botond Dénes
054ea54565 Merge 'streaming: Avoid deadlock by running view checks in a separate scheduling group' from Tomasz Grabiec
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permit), the query may still time out due to starvation.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes #24807
Fixes #24925

Closes scylladb/scylladb#24929

* github.com:scylladb/scylladb:
  streaming: Avoid deadlock by running view checks in a separate scheduling group
  service: migration_manager: Run group0 barrier in gossip scheduling group
2025-07-17 10:24:41 +03:00
Botond Dénes
4c832d583e Merge 'repair: Speed up ranges calculation when small table optimization is on' from Asias He
repair: Speed up ranges calculation when small table optimization is on

Normally, during bootstrap, in repair_service::bootstrap_with_repair, we
need to calculate which range to sync data from carefully for the new
node. With small table optimization on, we pass a single full range and
all peer nodes to row level repair to sync data with. Now that we only
need to pass a single range and full peers, there is no need to calculate
the ranges and peers in repair_service::bootstrap_with_repair and drop
it later. The calculation takes time which slows down bootstrap, e.g.,

```
Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - bootstrap_with_repair: started with
keyspace=system_distributed_everywhere, nr_ranges=23809

Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]:
sync data for keyspace=system_distributed_everywhere, status=started,
reason=bootstrap, small_table_optimization=true
```

The range calculation took 15 seconds for system_distributed_everywhere
table.

To fix, the ranges calculation is skipped if small table optimization is
on for the keyspace.

Before:
cluster    dev   [ PASS ] cluster.test_boot_nodes.1 104.59s

After:
cluster    dev   [ PASS ] cluster.test_boot_nodes.1 89.23s

A 15% improvement in bootstrapping a 30-node cluster was observed.

Fixes #24817

Closes scylladb/scylladb#24901

* github.com:scylladb/scylladb:
  repair: Speed up ranges calculation when small table optimization is on
  test: Add test_boot_nodes.py
2025-07-17 10:23:45 +03:00
Nikos Dragazis
88554b7c7a docs: Document the Azure Key Provider
Extend the EaR ops guide to incorporate the new Azure Key Provider.
Document its options and provide instructions on how to configure it.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 23:06:11 +03:00
Nikos Dragazis
09dcdebca3 test: Add tests for Azure Key Provider
The tests cover a variety of scenarios, including:

* Authentication with client secrets, client certificates, and IMDS.
* Valid and invalid encryption options in the configuration and table
  schema.
* Common error conditions such as insufficient permissions, non-existent
  keys and network errors.

All tests run against a local mock server by default. A subset of the
tests can also run against real Azure services if properly configured. The
tests that support real Azure services were kept to a minimum to cover
only the most basic scenarios (success path and common error
conditions).

Running the tests with real resources requires parameterizing them with
env vars:
* ENABLE_AZURE_TEST - set to non-zero (1/true) to run Azure tests (enabled by default)
* ENABLE_AZURE_TEST_REAL - set to non-zero (1/true) to run against real Azure services
* AZURE_TENANT_ID - the tenant where the principals live
* AZURE_USER_1_CLIENT_ID - the client ID of user1
* AZURE_USER_1_CLIENT_SECRET - the secret of user1
* AZURE_USER_1_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user1
* AZURE_USER_2_CLIENT_ID - the client ID of user2
* AZURE_USER_2_CLIENT_SECRET - the secret of user2
* AZURE_USER_2_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user2
* AZURE_KEY_NAME - set to <vault_name>/<keyname>

User1 is assumed to have permissions to wrap/unwrap using the given key.
User2 is assumed to not have permissions for these operations.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 23:06:01 +03:00
Nikos Dragazis
083aabe0c6 pylib: Add mock server for Azure Key Vault
The Azure Key Provider depends on three Azure services:

- Azure Key Vault
- IMDS
- Entra STS

To enable local testing, introduce a mock server that offers all the
needed APIs from these services. The server also offers an error
injection endpoint to configure a particular service to respond with
some error code for a number of consecutive requests.

The server is integrated as a 3rd party service in test.py.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:09 +03:00
Nikos Dragazis
41b63469e1 encryption: Define and enable Azure Key Provider
Define the Azure Key Provider to connect the core EaR business logic
with the Azure-based Key Management implementation (Azure host).

Introduce "AzureKeyProviderFactory" as a new `key_provider` value in the
configuration.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:09 +03:00
Nikos Dragazis
f0927aac07 encryption: azure: Delegate hosts to shard 0
As in the AWS and GCP hosts, make all Azure hosts delegate their traffic
to shard 0 to avoid creating too many data encryption keys and API
calls to Key Vault.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:09 +03:00
Nikos Dragazis
339992539d encryption: Add Azure host cache
The encryption context maintains a cache per host type per thread.
Add a cache for the Azure host as well. Initialize the cache with Azure
hosts from the configuration, while registering the extensions for
encryption.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:09 +03:00
Nikos Dragazis
c98d3246b2 encryption: Add config options for Azure hosts
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:09 +03:00
Nikos Dragazis
a1aef456ac encryption: azure: Add override options
Extend `get_or_create_key()` to accept host options that override the
config options. This will be used to pass encryption options from the
table schema. Currently, only the master key can be overridden.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:09 +03:00
Nikos Dragazis
5ba6ca0992 encryption: azure: Add retries for transient errors
Inject a few fast retries to quickly recover from short-lived transient
errors. If a request is unauthorized, retry with no delay, since it may
be caused by expired tokens.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
d4dcdcd46c encryption: azure: Implement init()
Implement the `azure_host::init()` API that performs the async
initialization of the host.

Since the Azure host has no state that needs to be initialized, just
verify that we have access to the Vault key. This will cause the system
to fail earlier if not properly configured (e.g., the key does not
exist, the credentials have insufficient permissions, etc.).

Do not run any verification steps if no master key is configured in
`scylla.yaml`. The master key can be specified later or overridden
through the encryption options in table schema.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
1e519ba329 encryption: azure: Implement get_key_by_id()
Implement the `azure_host::get_key_by_id()` API, which retrieves a data
encryption key from a key ID.

Use a loading cache to reduce the API calls to Key Vault. When the cache
needs to refresh or reload a key, extract the ciphertext from the key ID
and unwrap it with the Vault key that is also encoded in the key ID.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
7938096142 encryption: azure: Add id-based key cache
Add a cache to store data encryption keys based on their IDs. This will
be plugged into `get_key_by_id()` in a later patch to avoid unwrapping
keys that have been encountered recently, thereby reducing the API calls
to Key Vault.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
470513b433 encryption: azure: Implement get_or_create_key()
Implement the `azure_host::get_or_create_key()` API, which returns a
data encryption key for a given algorithm descriptor (cipher algorithm
and key length).

Use a loading cache to reduce the API calls to Key Vault. When the cache
needs to refresh or reload a key, always create a new one and wrap it
with the Vault key.

For the REST API calls to Key Vault, use an ephemeral HTTP client and
configure it to not wait for the server's response when terminating a
TLS connection. Although the TLS protocol requires clients to wait on
the server's response to a close_notify alert, the Key Vault service
ignores this, causing the client to block for 10 seconds (hardcoded)
before timing out.

Use the following identifier for each key:
<vault name>/<key name>/<key version>:<base64 encoded ciphertext of data encryption key>

The key version is required to support Vault key rotations.
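
A minimal sketch of how an identifier in this format could be composed and
split back into its parts. It is illustrative only: the names `AzureKeyId`,
`format_key_id` and `parse_key_id` are hypothetical, and the standard base64
alphabet is an assumption (the format above only says "base64 encoded").

```python
import base64
from typing import NamedTuple


class AzureKeyId(NamedTuple):
    vault_name: str
    key_name: str
    key_version: str
    wrapped_key: bytes  # ciphertext of the data encryption key


def format_key_id(vault: str, key: str, version: str, wrapped: bytes) -> str:
    encoded = base64.b64encode(wrapped).decode("ascii")
    return f"{vault}/{key}/{version}:{encoded}"


def parse_key_id(key_id: str) -> AzureKeyId:
    # Split on the first ':' only; base64 text may contain '/', so the
    # Vault key path must be separated before splitting on '/'.
    path, sep, encoded = key_id.partition(":")
    if not sep:
        raise ValueError("key id has no ':' separator")
    parts = path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <vault name>/<key name>/<key version>")
    return AzureKeyId(parts[0], parts[1], parts[2],
                      base64.b64decode(encoded, validate=True))
```

Because the version is part of the identifier, a key wrapped before a Vault key
rotation can still be unwrapped with the exact version that wrapped it.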

Finally, define an exception for Vault errors.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
e76187fb6d encryption: azure: Add credentials in Azure host
The Azure host needs credentials to communicate with Key Vault.

First search for credentials in the host options, and then fall back to
default credentials if the former are non-existent or incomplete.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
457c90056d encryption: azure: Add attribute-based key cache
Add a cache to store data encryption keys based on their attributes
(cipher algorithm + key length). This will be plugged into
`get_or_create_key()` in a later patch to reuse the same keys in
multiple requests, thereby reducing the API calls to Key Vault.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
b39d1b195e encryption: azure: Add skeleton for Azure host
The Azure host manages cryptographic keys using Azure Key Vault.

This patch only defines the API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
e078abba57 encryption: Templatize get_{kmip,kms,gcp}_host()
For deduplication.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
b1e719c531 encryption: gcp: Fix typo in docstring
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
eec49c4d78 utils: azure: Get access token with default credentials
Attempt to detect credentials from the system.

Inspired by the `DefaultAzureCredential` in the Azure C++ SDK, this
credential type detects credentials from the following sources (in this
order):

* environment variables (SP credentials - same variables as in Azure C++ SDK)
* Azure CLI
* IMDS

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
937d6261c0 utils: azure: Get access token from Azure CLI
Implement token request with Azure CLI.

Inspired by the Azure C++ SDK's `AzureCliCredential`, this credential
type attempts to run the Azure CLI in a shell and parse the token from
its output. This is meant for development purposes, where a user has
already installed the Azure CLI and logged in with their user account.

Pass the following environment to the process:
* PATH
* HOME
* AZURE_CONFIG_DIR

Add a token factory to construct a token from the process output. Unlike
in Azure Entra and IMDS, the CLI's JSON output does not contain
'expires_in', and the token key is in camel case.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
52a4bd83d5 utils: azure: Get access token from IMDS
Implement token request from IMDS.

No credentials are required for that - just a plain HTTP request on the
IMDS token endpoint.

Since the IMDS endpoint is a raw IP, it's not possible to reliably
determine whether IMDS is accessible or not (i.e., whether the node is
an Azure VM). Azure provides no node-local indication either. For lack of
a better choice, attempt to connect and declare failure if the
connection is not established within 3 seconds. Use a raw TCP socket for
this check, as the HTTP client currently lacks timeout or cancellation
support. Perform the check only once, during the first token refresh.

For the time being, do not support nodes with multiple user-assigned
managed identities. Expect the token request to fail in this case (IMDS
requires the identifier of the desired Managed Identity).

Add a token factory to correctly parse the HTTP response. This addresses
a discrepancy between token requests on IMDS and Azure Entra - the
'expires_in' field is a string in the former and an integer in the
latter.
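
The discrepancy can be handled by a parser that accepts both representations.
The sketch below is illustrative and not the token factory from this patch:
`parse_token` is a hypothetical name, and `access_token` is assumed to be the
token field, following the usual OAuth 2.0 response shape.

```python
import json
import time
from typing import Optional, Tuple


def parse_token(body: str, now: Optional[float] = None) -> Tuple[str, float]:
    """Return (token, absolute expiry time in seconds since the epoch)."""
    doc = json.loads(body)
    now = time.time() if now is None else now
    # int() accepts both "3599" (string, as from IMDS) and 3599
    # (integer, as from Entra), so one code path covers both.
    expires_in = int(doc["expires_in"])
    return doc["access_token"], now + expires_in
```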

Finally, implement a fail-fast retry policy for short-lived transient
errors.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
919765fb7f utils: azure: Get access token with SP certificate
Implement token request for Service Principals with a certificate.

The request is the same as with a secret, except that the secret is
replaced with an assertion. The assertion is a JWT that is signed with
the certificate.

To be consistent with the Azure C++ SDK, expect the certificate and the
associated private key to be encoded in PEM format and be provided in a
single file.

The docs suggest using 'PS256' for the JWT's 'alg' claim. Since this is
not supported by our current JWT library (jwt-cpp), use 'RS256' instead.

The JWT also requires a unique identifier for the 'jti' claim. Use a
random UUID for that (it should suffice for our use cases).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
a671530af6 utils: azure: Get access token with SP secret
Implement token request for Service Principals with a secret.

The token request requires a TLS connection. When closing the
connection, do not wait for a response to the TLS `close_notify` alert.
Azure's OAuth server would ignore it and the Seastar `connected_socket`
would hang for 10 seconds.

Add log redaction logic to not expose sensitive data from the request
and response payloads.

Add a token factory to parse the HTTP response. This cannot be shared
with other credential types because the JSON format is not consistent.

Finally, implement a fail-fast retry policy for short-lived transient
errors.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
66c8ffa9bf utils: rest: Add interface for request/response redaction logic
The rest http client, currently used by the AWS and GCP key providers,
logs the HTTP requests and responses unaltered. This causes some
sensitive data to be exposed (plaintext data encryption keys,
credentials, access tokens).

Add an interface to optionally redact any sensitive data from HTTP
headers and payloads.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
0d0135dc4c utils: azure: Declare all Azure credential types
The goal is to mimic the Azure C++ SDK, which offers a variety of
credentials, depending on their type and source.

Declare the following credentials:
* Service Principal credentials
* Managed Identity credentials
* Azure CLI credentials
* Default credentials

Also, define a common exception for SP and MI credentials which are
network-based.

This patch only defines the API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
3c4face47b utils: azure: Define interface for Azure credentials
Azure authentication is token based - the client obtains an access token
with their credentials, and uses it as a bearer token to authorize
requests to Azure services.

Define a common API for all credential types. The API will consist of a
single `get_access_token()` function that will be returning a new or a
cached access token for some resource URI (defines token scope).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
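The "new or cached access token" behaviour described in 3c4face47b could look roughly like the Python sketch below. Only the name `get_access_token()` comes from the commit message; the `AccessToken` type, the per-resource cache, the injected `fetch` callable and the clock are assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class AccessToken:
    value: str
    expires_at: float  # same time base as the injected clock


class Credentials:
    """Hypothetical base class: caches one token per resource URI."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch   # callable(resource_uri) -> AccessToken
        self._clock = clock
        self._cache = {}

    def get_access_token(self, resource_uri):
        token = self._cache.get(resource_uri)
        # Fetch a new token when none is cached or the cached one expired.
        if token is None or token.expires_at <= self._clock():
            token = self._fetch(resource_uri)
            self._cache[resource_uri] = token
        return token
```

Injecting the clock keeps the expiry logic testable without sleeping.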
Nikos Dragazis
57bc51342e utils: Introduce base64url_{encode,decode}
Add helpers for base64url encoding.

base64url is a variant of base64 that uses a URL-safe alphabet. It can
be constructed from base64 by replacing the '+' and '/' characters with
'-' and '_' respectively. Many implementations also strip the padding,
although this is not required by the spec [1].

This will be used in upcoming patches for Azure Key Vault requests that
require base64url-encoded payloads.

[1] https://datatracker.ietf.org/doc/html/rfc4648#section-5

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
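The construction described in 57bc51342e (swap '+' and '/' for '-' and '_', optionally strip padding) can be sketched in Python as follows. This is an illustration of the encoding from RFC 4648 section 5, not the C++ helpers added by the commit; the `strip_padding` parameter is an assumption.

```python
import base64


def base64url_encode(data: bytes, strip_padding: bool = True) -> str:
    # Start from standard base64, then switch to the URL-safe alphabet.
    encoded = base64.b64encode(data).decode("ascii")
    encoded = encoded.replace("+", "-").replace("/", "_")
    return encoded.rstrip("=") if strip_padding else encoded


def base64url_decode(text: str) -> bytes:
    # Restore any stripped padding before mapping back to standard base64.
    padded = text + "=" * (-len(text) % 4)
    return base64.b64decode(padded.replace("-", "+").replace("_", "/"))
```

For example, the two bytes `fb ff` encode to `+/8=` in standard base64 and to `-_8` in unpadded base64url.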
Dawid Mędrek
20d0050f4e cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643
2025-07-16 15:35:48 +02:00
Patryk Jędrzejczak
a654101c40 Merge 'test.py: add missed parameters that should be passed from test.py to pytest' from Andrei Chekun
Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling of these parameters: `--random-seed` and `--x-log2-compaction-groups`.

Since this code is affected by this issue in 2025.3 and this is only a framework change, a backport to that version is needed.

Fixes: https://github.com/scylladb/scylladb/issues/24927

Closes scylladb/scylladb#24928

* https://github.com/scylladb/scylladb:
  test.py: add bypassing x_log2_compaction_groups to boost tests
  test.py: add bypassing random seed to boost tests
2025-07-16 15:29:17 +02:00
Avi Kivity
c762425ea7 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. In addition, the existing `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start
2025-07-16 13:15:54 +03:00
Asias He
6c49b7d0ce repair: Speed up ranges calculation when small table optimization is on
Normally, during bootstrap, in repair_service::bootstrap_with_repair, we
need to calculate which range to sync data from carefully for the new
node. With small table optimization on, we pass a single full range and
all peer nodes to row level repair to sync data with. Now that we only
need to pass a single range and full peers, there is no need to calculate
the ranges and peers in repair_service::bootstrap_with_repair and drop
it later. The calculation takes time which slows down bootstrap, e.g.,

```
Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - bootstrap_with_repair: started with
keyspace=system_distributed_everywhere, nr_ranges=23809

Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]:
sync data for keyspace=system_distributed_everywhere, status=started,
reason=bootstrap, small_table_optimization=true
```

The range calculation took 15 seconds for system_distributed_everywhere
table.

To fix, the ranges calculation is skipped if small table optimization is
on for the keyspace.

Before:
cluster    dev   [ PASS ] cluster.test_boot_nodes.1 104.59s

After:
cluster    dev   [ PASS ] cluster.test_boot_nodes.1 89.23s

A 15% improvement in bootstrapping a 30-node cluster was observed.

Fixes #24817
2025-07-16 15:33:15 +08:00
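The figures quoted in 6c49b7d0ce can be cross-checked with a few lines of Python. The timestamps and test durations are copied from the commit message; treating the gap between the two log lines as the range calculation time is the message's own reading.

```python
from datetime import datetime

# Timestamps of the two log lines quoted in the commit message.
started = datetime.strptime("22:01:41.927785", "%H:%M:%S.%f")
synced = datetime.strptime("22:01:57.883797", "%H:%M:%S.%f")
range_calc_seconds = (synced - started).total_seconds()

# Durations of cluster.test_boot_nodes.1 before and after the change.
before, after = 104.59, 89.23
improvement_pct = (before - after) / before * 100

print(f"gap between log lines: {range_calc_seconds:.2f}s")
print(f"test speedup: {improvement_pct:.1f}%")
```

The gap comes out just under 16 seconds and the speedup just under 15%, consistent with the rounded numbers in the message.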
Piotr Dulikowski
a14b7f71fe auth: fix crash when migration code runs parallel with raft upgrade
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).

The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, the handler attempts to create
the superuser, and if that succeeds the promise is _unconditionally_ set.

While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.

Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.

Fixes: scylladb/scylladb#24975

Closes scylladb/scylladb#24976
2025-07-16 10:22:48 +03:00
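The bug class fixed in a14b7f71fe (setting an already-set promise) and the guard that fixes it can be shown with a small Python analogy. `concurrent.futures.Future` stands in for Seastar's `shared_promise` here, and both function names are hypothetical; the real fix wraps `shared_promise::set_value` in an `if`.

```python
from concurrent.futures import Future, InvalidStateError


def set_unconditionally(promise: Future) -> None:
    # Mirrors the buggy path: fails if the promise was already set.
    promise.set_result(None)


def set_if_unset(promise: Future) -> None:
    # Mirrors the fix: only set the promise the first time.
    if not promise.done():
        promise.set_result(None)
```

Calling `set_if_unset` any number of times is safe, whereas a second `set_unconditionally` raises `InvalidStateError` (Python 3.8 and later), which plays the role of the Seastar assertion mentioned in the commit message.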
Michał Chojnowski
1e7a292ef4 sstables/index_reader: extract a prefetch_lower_bound() method
The sstable reader reaches directly for a `clustered_index_cursor`.
But a BTI index reader won't be able to implement
`clustered_index_cursor`, because a BTI index doesn't store
full clustering keys, only some trie-encoded prefixes.

So we want to weaken the dependency. Instead of reaching
for `clustered_index_cursor`, we add a method which expresses
our intent, and we let `index_reader` touch the cursor internally.
2025-07-16 00:13:20 +02:00
Andrzej Jackowski
77a9b5919b main: utils: add thread names to alien workers
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
2025-07-15 23:29:21 +02:00
Andrzej Jackowski
9574513ec1 auth: move passwords::check call to alien thread
Analysis of customer stalls showed that the `detail::hash_with_salt`
function, called from `passwords::check`, often blocks the reactor.
This function internally uses the `crypt_r` function from an external
library to compute password hashes, which is a CPU-intensive operation.

To prevent such reactor stalls, this commit moves the
`passwords::check` call to a dedicated alien thread. This thread is
created at system startup and is shared by all shards.

Within the alien thread, an `std::mutex` synchronizes access between
the thread and the shards. While this could theoretically cause
frequent lock contentions, in practice, even during connection storms,
the number of new connections per second per shard is limited
(typically hundreds per second). Additionally, the
`_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not
too many connections are processed at once.

Fixes scylladb/scylladb#24524
2025-07-15 23:29:13 +02:00
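The offloading pattern from 9574513ec1 can be sketched in Python with an event loop and a single named worker thread. This is an analogy only: `asyncio` stands in for the Seastar reactor, `ThreadPoolExecutor` for the alien thread, and `hashlib.pbkdf2_hmac` for the CPU-heavy `crypt_r` call. All function names are hypothetical.

```python
import asyncio
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

# One shared worker thread, created once and given a descriptive name.
_worker = ThreadPoolExecutor(max_workers=1, thread_name_prefix="pw-hash")


def _check_blocking(password: str, salt: bytes, expected: bytes) -> bool:
    # CPU-intensive work; must not run on the event loop thread.
    digest = hashlib.pbkdf2_hmac("sha512", password.encode(), salt, 10_000)
    return hmac.compare_digest(digest, expected)


async def check_password(password: str, salt: bytes, expected: bytes) -> bool:
    loop = asyncio.get_running_loop()
    # The event loop stays responsive while the worker computes the hash.
    return await loop.run_in_executor(
        _worker, _check_blocking, password, salt, expected
    )
```

With a single worker, requests are served one at a time, which is the serialization the commit message accepts because the per-shard rate of new connections is bounded.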
Andrzej Jackowski
4ac726a3ff test: wait for 3 clients with given username in test_service_level_api
test_service_level_api tests create a new session and wait for all
clients to authenticate. However, the check that all connections are
authenticated is done by verifying that there are no connections
with the username 'anonymous', which is insufficient if new connections
have not yet been listed.

To avoid test failures, this commit introduces an additional check that
verifies all expected clients are present in the system.clients table
before proceeding with the test.
2025-07-15 23:28:39 +02:00
Andrzej Jackowski
8d398fa076 auth: refactor password checking in password_authenticator
This commit splits an if statement into two ifs, to make it possible
to call the `passwords::check` function from another (alien) thread in
the next commit of this patch series.

Ref. scylladb/scylladb#24524
2025-07-15 23:28:39 +02:00
Andrzej Jackowski
b3c6af3923 auth: make SHA-512 the only password hashing scheme for new passwords
Before this change, the following hashing schemes were supported by
`identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512,
SHA-256, and MD5. The reason for this was that the `crypt_r` function
used for password hashing comes from an external library (currently
`libxcrypt`), and the supported hashing algorithms vary depending
on the library in use.

However:
 - The bcrypt algorithms do not work because their scheme
   prefix lacks the required round count (e.g., it is `$2y$` instead of
   `$2y$05$`). We suspect this never worked as intended. Moreover,
   bcrypt tends to be slower than SHA-512, so we do not want to fix the
   prefix and start using it.
 - SHA-256 and SHA-512 are both part of the SHA-2 family, and libraries
   that support one almost always support the other. It is not expected
   to find a library that supports only SHA-256 but not SHA-512.
 - MD5 is not considered secure for password hashing.

Therefore, this commit removes support for bcrypt_y, bcrypt_a, SHA-256,
and MD5 for hashing new passwords to ensure that the correct hashing
function (SHA-512) is used everywhere.

This commit does not change the behavior of `passwords::check`, so
it is still possible to use passwords hashed with the removed
algorithms.

Ref. scylladb/scylladb#24524
2025-07-15 23:28:33 +02:00
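The prefix issue mentioned in b3c6af3923 can be made concrete with a short Python sketch based on the usual crypt(3) conventions (`$1$` for MD5, `$5$` for SHA-256, `$6$` for SHA-512, `$2a$`/`$2y$` for bcrypt, which additionally needs a two-digit cost). The table and both helpers are illustrative and are not Scylla code.

```python
import re

# Scheme prefixes as used by crypt(3)-style hashes.
SCHEME_PREFIXES = {
    "$6$": "sha-512",
    "$5$": "sha-256",
    "$1$": "md5",
    "$2y$": "bcrypt_y",
    "$2a$": "bcrypt_a",
}

# A bcrypt setting needs a two-digit cost after the scheme, e.g. "$2y$05$".
BCRYPT_SETTING = re.compile(r"^\$2[ay]\$\d{2}\$")


def identify_scheme(hashed: str):
    """Return the scheme name for a stored hash, or None if unknown."""
    for prefix, name in SCHEME_PREFIXES.items():
        if hashed.startswith(prefix):
            return name
    return None


def is_valid_bcrypt_setting(setting: str) -> bool:
    return BCRYPT_SETTING.match(setting) is not None
```

Because a stored hash carries its own prefix, a checker can still recognize hashes produced with the removed schemes even when new passwords are always hashed with SHA-512.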
Andrzej Jackowski
62e976f9ba auth: whitespace change in identify_best_supported_scheme()
Remove tabs in `identify_best_supported_scheme()` to facilitate
reuse of those lines after the for loop is removed. This change is
motivated by the upcoming removal of support for obsolete password
hashing schemes and removal of `identify_best_supported_scheme()`
function.

Ref. scylladb/scylladb#24524
2025-07-15 20:26:39 +02:00
Andrzej Jackowski
b20aa7b5eb auth: require scheme as parameter for generate_salt
This is a refactoring commit that changes the `generate_salt` function
to require a password hashing scheme as a parameter. This change is
motivated by the upcoming removal of support for obsolete password
hashing schemes and removal of `identify_best_supported_scheme()`
function.

Ref. scylladb/scylladb#24524
2025-07-15 20:26:39 +02:00
Andrzej Jackowski
c4e6d9933d auth: check password hashing scheme support on authenticator start
This commit adds a check to the `password_authenticator` to ensure
that at least one of the available password hashing schemes is
supported by the current environment. It is better to fail at system
startup rather than on the first attempt to use the password
authenticator. This change is motivated by the upcoming removal
of support for obsolete password hashing schemes and removal of
`identify_best_supported_scheme()` function.

Ref. scylladb/scylladb#24524
2025-07-15 20:26:33 +02:00
Botond Dénes
a26b6a3865 Merge 'storage: add make_data_or_index_source to the storages' from Ernest Zaslavsky
Add `make_data_or_index_source` to the storages to utilize the new S3-based data source, which should improve restore performance.

* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface. Implement it for `filesystem_storage`, which just creates a `data_source` from a file, and for `s3_storage`, which creates a (possibly decrypting) source from S3's make_download_source. This change should improve performance when reading large objects from S3 and should not affect `filesystem_storage`.

No backport needed since it enhances functionality which has not been released yet

fixes: https://github.com/scylladb/scylladb/issues/22458

Closes scylladb/scylladb#23695

* github.com:scylladb/scylladb:
  sstables: Start using `make_data_or_index_source` in `sstable`
  sstables: refactor readers and sources to use coroutines
  sstables: coroutinize futurized readers
  sstables: add `make_data_or_index_source` to the `storage`
  encryption: refactor key retrieval
  encryption: add `encrypted_data_source` class
2025-07-15 13:32:13 +03:00
Andrei Chekun
a8fd38b92b test.py: skip discovery when combined_test binary absent
To discover which tests are included in combined_tests, pytest checks this at
the very beginning. If the combined_tests binary is missing, discovery fails
and the test does not run, even when it is not part of combined_tests. This PR
changes the behavior so that a missing combined_tests binary does not fail
discovery; it fails only when someone tries to run a test from
combined_tests.

Closes scylladb/scylladb#24761
2025-07-15 09:49:02 +02:00
Ernest Zaslavsky
8d49bb8af2 sstables: Start using make_data_or_index_source in sstable
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.

For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.
2025-07-15 10:10:23 +03:00
Ernest Zaslavsky
dff9a229a7 sstables: refactor readers and sources to use coroutines
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.
2025-07-15 10:10:23 +03:00
Pavel Emelyanov
4debe3af5d scylla-gdb: Don't show io_queue executing and queued resources
These counters are no longer accounted by io-queue code and are always
zero. Even more -- accounting removal happened years ago and we don't
have Scylla versions built with seastar older than that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24835
2025-07-15 07:41:20 +03:00
Botond Dénes
641a907b37 Merge 'test/alternator: clean up write isolation default and add more tests for the different modes' from Nadav Har'El
In #24442 it was noticed that accidentally, for a year now, test.py and CI were running the Alternator functional tests (test/alternator) using one write isolation mode (`only_rmw_uses_lwt`) while the manual test/alternator/run used a different write isolation mode (`always_use_lwt`). There is no good reason for this discrepancy, so in the second patch of this 2-patch series we change test/alternator/run to use the write isolation mode that we've had in CI for the last year.

But then, discussion on #24442 started: Instead of picking one mode or the other, don't we need to test both modes? In fact, all four modes?

The honest answer is that running **all tests** with **all combinations of options** is not practical - we'll find ourselves with an exponentially growing number of tests. What we really need to do is to run most tests that have nothing to do with write isolation modes on just one arbitrary write isolation mode like we're doing today. For example, numerous tests for the finer details of the ConditionExpression syntax will run on one mode. But then, have a separate test that verifies that one representative example of ConditionExpression (for example) works correctly on all four write isolation modes - rejected in forbid_rmw mode, allowed and behaves as expected on the other three. We had **some** tests like that in our test suite already, but the first patch in this series adds many more, making the test much more exhaustive and making it easier to review that we're really testing all four write isolation modes in every scenario that matters.

Fixes #24442

No need to backport this patch - it's just adding more tests and changing developer-only test behavior.

Closes scylladb/scylladb#24493

* github.com:scylladb/scylladb:
  test/alternator: make "run" script use only_rmw_uses_lwt
  test/alternator: improve tests for write isolation modes
2025-07-15 07:16:18 +03:00
Patryk Jędrzejczak
21edec1ace test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
  responding before it applied mutations from any of the writes.

We fix the test by not expecting reads with CL=ONE to return a row.

We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.

Fixes scylladb/scylladb#22967

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#23518
2025-07-15 07:14:09 +03:00
Botond Dénes
2d3965c76e Merge 'Reduce Alternator table name length limit to 192 and fix crash when adding stream to table with very long name' from Nadav Har'El
Before this series, it is possible to crash Scylla (due to an I/O error) by creating an Alternator table close to the maximum name length of 222, and then enabling Alternator Streams. This series fixes this bug in two ways:

1. On a pre-existing table whose name might be up to 222 characters, enabling Streams will check if the resulting name is too long, and if it is, fail with a clear error instead of crashing. This case will affect pre-existing tables whose name has between 207 and 222 characters (207 is `222 - strlen("_scylla_cdc_log")`) - for such tables enabling Streams will fail, but no longer crash.
2. For new tables, the table name length limit is lowered from 222 to 192. The new limit is still high enough, but ensures it will be possible to enable streams on any new table. It will also always be possible to add a GSI for such a table with name up to 29 characters (if the table name is shorter, the GSI name can be longer - the sum can be up to 221 characters).

No need to backport, Alternator Streams is still an experimental feature and this patch just improves the unlikely situation of extremely long table names.

Fixes #24598

Closes scylladb/scylladb#24717

* github.com:scylladb/scylladb:
  alternator: lower maximum table name length to 192
  alternator: don't crash when adding Streams to long table name
  alternator: split length limit for regular and auxiliary tables
  alternator: avoid needlessly validating table name
2025-07-15 06:57:04 +03:00
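The length arithmetic in 2d3965c76e can be checked with a few lines of Python. The constants 222, 192, 221 and the `_scylla_cdc_log` suffix are taken from the message above; the variable names are made up for this sketch.

```python
CDC_LOG_SUFFIX = "_scylla_cdc_log"
OLD_LIMIT = 222   # previous maximum table name length
NEW_LIMIT = 192   # new maximum for newly created tables
MAX_TABLE_PLUS_GSI = 221  # "the sum can be up to 221 characters"

# 222 - strlen("_scylla_cdc_log"), the threshold quoted in the message.
threshold_for_old_tables = OLD_LIMIT - len(CDC_LOG_SUFFIX)

# Longest CDC log table name that a new table can produce.
longest_new_log_name = NEW_LIMIT + len(CDC_LOG_SUFFIX)

# GSI name length that is always available for a new table.
guaranteed_gsi_length = MAX_TABLE_PLUS_GSI - NEW_LIMIT

print(threshold_for_old_tables, longest_new_log_name, guaranteed_gsi_length)
```

The suffix is 15 characters long, so a new 192-character table name yields a 207-character log table name, well under the old 222 limit, and leaves 29 characters for a GSI name.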
Botond Dénes
26f135a55a Merge 'Make KMIP host do nice TLS close on dropped connection + make PyKMIP test fixture not generate TLS noise + remove boost::process' from Calle Wilund
Fixes #24873

In the KMIP host, if we release a connection (socket) because our connection pool for the host is full, we currently don't close the connection properly and rely only on destructors.

This just makes sure `release` closes the connection if it neither retains nor caches it.

Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes Python's SSL generate errors, producing log noise that looks like actual errors.
Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs.

This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency.

Closes scylladb/scylladb#24874

* github.com:scylladb/scylladb:
  encryption_test: Make PyKMIP run under seastar::experimental::process
  test/lib: Add wrapper helper for test process fixtures
  kmip_host: Close connections properly if dropped by pool being full
  encryption_at_rest_test: Do port check using TLS
2025-07-15 06:55:34 +03:00
Botond Dénes
1f9f43d267 Merge 'kms_host: Support external temporary security credentials' from Nikos Dragazis
This PR extends the KMS host to support temporary AWS security credentials provided externally via the Scylla configuration file, environment variables, or the AWS credentials file.

The KMS host already supports:
* Temporary credentials obtained automatically from the EC2 instance metadata service or via IAM role assumption.
* Long-term credentials provided externally via configuration, environment, or the AWS credentials file.

This PR is about temporary credentials that are external, i.e., not generated by Scylla. Such credentials may be issued, for example, through identity federation (e.g., Okta + gimme-aws-creds).

External temporary credentials are useful for short-lived tasks like local development, debugging corrupted SSTables with `scylla-sstable`, or other local testing scenarios. These credentials are temporary and cannot be refreshed automatically, so this method is not intended for production use.

Documentation has been updated to mention these additional credential sources.

Fixes #22470.

New feature, no backport is needed.

Closes scylladb/scylladb#22465

* github.com:scylladb/scylladb:
  doc: Expose new `aws_session_token` option for KMS hosts
  kms_host: Support authn with temporary security credentials
  encryption_config: Mention environment in credential sources for KMS
2025-07-15 06:45:39 +03:00
Jenkins Promoter
41bc6a8e86 Update pgo profiles - x86_64 2025-07-15 04:54:17 +03:00
Jenkins Promoter
b86674a922 Update pgo profiles - aarch64 2025-07-15 04:49:45 +03:00
Nadav Har'El
a248336e66 alternator: clean up by co-routinizing
Reviewers of the previous patch complained about some ugly pre-existing
code in alternator/executor.cc, where returning from an asynchronous
(future) function requires lengthy verbose casts. So this patch cleans
up a few instances of these ugly casts by using co_return instead of
return.

For example, the long and verbose

    return make_ready_future<executor::request_return_type>(
        rjson::print(std::move(response)));

can be changed to the shorter and more readable

    co_return rjson::print(std::move(response));

This patch should not have any functional implications, and also not any
performance implications: I only coroutinized slow-path functions and
one function that was already "partially" coroutinized (and this was
especially ugly and deserved being fixed).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-14 18:41:35 +03:00
Nadav Har'El
13ec94107a alternator: avoid spamming the log when failing to write response
Both make_streamed() and the new make_streamed_with_extra_array() functions,
used when returning a long response in Alternator, would write an error-
level log message if they failed to write the response. This log message
is probably not helpful, and may spam the log if the application causes
repeated errors intentionally or accidentally.

So drop these log messages. The exception is still thrown as usual.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-14 18:41:34 +03:00
Nadav Har'El
d8fab2a01a alternator: clean up and simplify request_return_type
The previous patch introduced a function make_streamed_with_extra_array
which was a duplicate of the existing make_streamed. Reviewers
complained how baroque the new function is (just like the old function),
having to jump through hoops to return a copyable function working
on non-copyable objects, making strange-named copies and shared pointers
of everything.

We needed to return a copyable function (std::function) just because
Alternator used Seastar's json::json_return_type in the return type
from executor function (request_return_type). This json_return_type
contained either a sstring or an std::function, but neither was ever
really appropriate:

  1. We want to return noncopyable_function, not an std::function!
  2. We want to return an std::string (which rjson::print() returns),
     not an sstring!

So in this patch we stop using seastar::json::json_return_type
entirely in Alternator.

Alternator's request_return_type is now an std::variant of *three* types:
  1. std::string for short responses,
  2. noncopyable_function for long streamed responses,
  3. api_error for errors.

The ugliest parts of make_streamed() where we made copies and shared
pointers to allow for a copyable function are all gone. Even nicer, a
lot of other ugly relics of using seastar::json_return_type are gone:

1. We no longer need obscure classes and functions like make_jsonable()
   and json_string() to convert strings to response bodies - an operation
   can simply return a string directly - usually returning
   rjson::print(value) or a fixed string like "" and it just works.

2. There is no more usage of seastar::json in Alternator (except one
   minor use of seastar::json::formatter::to_json in streams.cc that
   can be removed later). Alternator uses RapidJSON for its JSON
   needs, we don't need to use random pieces from a different JSON
   library.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-14 18:41:34 +03:00
Nadav Har'El
2385fba4b6 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e.,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535
2025-07-14 18:41:34 +03:00
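The idea behind print_with_extra_array() in 2385fba4b6 (print the response object and a separately collected array without first inserting the array into the object) can be sketched in Python as a generator of output chunks. Only the function name comes from the commit message; the signature and the chunking below are assumptions, and `json` stands in for RapidJSON.

```python
import json


def print_with_extra_array(response: dict, array_name: str, items):
    """Yield the JSON text of `response` with an extra array member."""
    head = json.dumps(response)
    # Reopen the serialized object so the extra member can be appended.
    prefix = "{" if head == "{}" else head[:-1] + ","
    yield prefix + json.dumps(array_name) + ":["
    for index, item in enumerate(items):
        # Each item is serialized on its own, never as one big array.
        yield ("," if index else "") + json.dumps(item)
    yield "]}"
```

Joining the chunks gives ordinary JSON, for example `"".join(print_with_extra_array({"Count": 2}, "Items", [{"a": 1}, {"a": 2}]))`, while no single contiguous array holding all items is ever built.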
Patryk Jędrzejczak
145a38bc2e Merge 'raft: fix voter assignment of transitioning nodes' from Emil Maskovsky
Previously, nodes would become voters immediately after joining, ensuring voter status was established before bootstrap completion. With the limited voters feature, voter assignment became deferred, creating a timing gap where nodes could finish bootstrapping without becoming voters.

This timing issue could lead to quorum loss scenarios, particularly observed in tests but theoretically possible in production environments.

This commit reorders voter assignment to occur before the `update_topology_state()` call, ensuring nodes achieve voter status before bootstrap operations are marked complete. This prevents the problematic timing gap while maintaining compatibility with limited voters functionality.

If voter assignment succeeds but topology state update fails, the operation will raise an exception and be retried by the topology coordinator, maintaining system consistency.

This commit also fixes an issue where `update_nodes` ignored leaving voters, potentially exceeding the voter limit and leaving voters unaccounted for.

Fixes: scylladb/scylladb#24420

No backport: Fix of a theoretical bug + CI stability improvement (we can backport eventually later if we see hits in branches)

Closes scylladb/scylladb#24843

* https://github.com/scylladb/scylladb:
  raft: fix voter assignment of transitioning nodes
  raft: improve comments in group0 voter handler
2025-07-14 16:12:03 +02:00
Calle Wilund
722e2bce96 encryption_test: Make PyKMIP run under seastar::experimental::process
Removes the requirement of boost::process, and all its non-seastar-ness.
Hopefully also makes the IO and shutdown handling a bit more reliable.
2025-07-14 12:18:16 +00:00
Calle Wilund
253323bb64 test/lib: Add wrapper helper for test process fixtures
Adds a wrapper for seastar::experimental::process, to help
use external process fixtures in unit test. Mainly to share
concepts such as line reading of stdout/err etc, and sync
the shutdown of these. Also adds a small path searcher to
find what you want to run.
2025-07-14 12:18:16 +00:00
Yaron Kaikov
fdcaa9a7e7 dist/common/scripts/scylla_sysconfig_setup: fix SyntaxWarning: invalid escape sequence
There are invalid escape sequence warnings where raw strings should be used for the regex patterns

Fixes: https://github.com/scylladb/scylladb/issues/24915

Closes scylladb/scylladb#24916
2025-07-14 11:20:41 +02:00
Benny Halevy
692b79bb7d compaction: get_max_purgeable_timestamp: improve trace log messages
Print the keyspace.table names, issue trace log messages also
when returning early if tombstone_gc is disabled or
when gc_check_only_compacting_sstables is set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24914
2025-07-14 11:16:58 +02:00
Calle Wilund
514fae8ced kmip_host: Close connections properly if dropped by pool being full
Fixes #24873

Note: this happens like never. But if we, in KMIP host, do release
of a connection (socket) due to our connection pool for the host being
full, we currently don't close the connection properly, only rely on
destructors.

While not very serious, this would lead to possible TLS errors in the
KMIP host used, which should be avoided if possible.

Fix is simple, just make release close the connection if it neither retains
nor caches it.
2025-07-14 08:31:02 +00:00
Calle Wilund
0fe8836073 encryption_at_rest_test: Do port check using TLS
If we connect using just a socket, and don't terminate connection
nicely, we will get annoying errors in PyKMIP log. These distract
from real errors. So avoid them.
2025-07-14 08:31:02 +00:00
Yaron Kaikov
ed7c7784e4 auto-backport.py: Avoid bot push to existing backport branches
Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork.
If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open.
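
The rule can be sketched as follows, assuming a hypothetical helper that receives the set of branch names already present in the fork. The function and branch names are illustrative only and are not taken from auto-backport.py.

```python
def should_push_backport_branch(branch: str, remote_branches: set) -> bool:
    """Return True only when the bot may push: the branch is not in the fork yet."""
    return branch not in remote_branches


# Hypothetical branch names, for illustration only.
existing = {"backport/1234/to-2025.1"}

# A branch the fork has never seen: the bot pushes it.
first_attempt = should_push_backport_branch("backport/5678/to-2025.1", existing)

# A branch that already exists: the bot skips the push, so only users
# update (force-push) it after the backport PR is open.
repeat_attempt = should_push_backport_branch("backport/1234/to-2025.1", existing)
```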

Fixes: https://github.com/scylladb/scylladb/issues/24953

Closes scylladb/scylladb#24954
2025-07-14 11:20:23 +03:00
Avi Kivity
6fce817aa8 Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().
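
The call order described above can be sketched as below. This is only an illustration in Python, not the C++ interface: `RecordingApplier`, `apply_batch` and the `auth` subsystem entry are hypothetical names, and the lambda stands in for the single database::apply() call.

```python
class RecordingApplier:
    """Hypothetical stand-in for one subsystem's applier; records phase calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def _record(self, phase, shard=None):
        self.log.append((phase, self.name, shard))

    def prepare(self):
        self._record("prepare")

    def update(self):
        self._record("update")

    def commit(self, shard):
        # Must not yield, so a shard observes all subsystems switch together.
        self._record("commit", shard)

    def post_commit(self):
        self._record("post_commit")


def apply_batch(appliers, apply_mutations, shards):
    for applier in appliers:
        applier.prepare()
    apply_mutations()  # the single apply of the whole mutation batch
    for applier in appliers:
        applier.update()
    for shard in shards:
        for applier in appliers:  # no preemption between these calls
            applier.commit(shard)
    for applier in appliers:
        applier.post_commit()


log = []
appliers = [RecordingApplier("schema", log), RecordingApplier("auth", log)]
apply_batch(appliers, lambda: log.append(("apply", "database", None)), shards=[0, 1])
phases = [entry[0] for entry in log]
```

In this sketch atomicity is per shard only: the commits for shard 0 are recorded back to back, then those for shard 1.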

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531

Closes scylladb/scylladb#24886

[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]

* github.com:scylladb/scylladb:
  test: add type creation to test_snapshot
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-07-13 20:47:55 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Yaron Kaikov
66ff6ab6f9 packaging: add ps command to dependancies
ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator.

Fixes: #24827

Closes scylladb/scylladb#24830
2025-07-13 17:09:05 +03:00
Aleksandra Martyniuk
2ec54d4f1a replica: hold compaction group gate during flush
Destructor of database_sstable_write_monitor, which is created
in table::try_flush_memtable_to_sstable, tries to get the compaction
state of the processed compaction group. If at this point
the compaction group is already stopped (and the compaction state
is removed), e.g. due to concurrent tablet merge, an exception is
thrown and a node coredumps.

Add flush gate to compaction group to wait for flushes in
compaction_group::stop. Hold the gate in seal function in
table::make_memtable_list. seal function is turned into
a coroutine to ensure it won't throw.

Wait until async_gate is closed before flushing, to ensure that
all data is written into sstables. Stop ongoing compactions
beforehand.

Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber.
Stop method already flushes the compaction group.

Fixes: #23911.

Closes scylladb/scylladb#24582
2025-07-13 12:35:19 +03:00
Benny Halevy
0e455c0d45 utils: clear_gently: add support for sets
Since set and unordered_set do not allow modifying
their stored object in place, we need to first extract
each object, clear it gently, and only then destroy it.

To achieve that, introduce a new Extractable concept,
that extracts all items in a loop and calls clear_gently
on each extracted item, until the container is empty.

Add respective unit tests for set and unordered_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24608
2025-07-13 12:30:45 +03:00
Emil Maskovsky
f6bb5cb7a0 raft: fix voter assignment of transitioning nodes
Previously, nodes would become voters immediately after joining, ensuring
voter status was established before bootstrap completion. With the limited
voters feature, voter assignment became deferred, creating a timing gap
where nodes could finish bootstrapping without becoming voters.

This timing issue could lead to quorum loss scenarios, particularly
observed in tests but theoretically possible in production environments.

This commit reorders voter assignment to occur before the
`update_topology_state()` call, ensuring nodes achieve voter status
before bootstrap operations are marked complete. This prevents the
problematic timing gap while maintaining compatibility with limited
voters functionality.

If voter assignment succeeds but topology state update fails, the
operation will raise an exception and be retried by the topology
coordinator, maintaining system consistency.

This commit also fixes an issue where `update_nodes` ignored leaving
voters, potentially exceeding the voter limit and leaving voters
unaccounted for.

Fixes: scylladb/scylladb#24420
2025-07-11 17:59:12 +02:00
Tomasz Grabiec
dff2b01237 streaming: Avoid deadlock by running view checks in a separate scheduling group
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes: #24807
2025-07-11 16:30:46 +02:00
Tomasz Grabiec
ee2fa58bd6 service: migration_manager: Run group0 barrier in gossip scheduling group
Fixes two issues.

One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.

The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.

Fixes #24925
2025-07-11 16:29:31 +02:00
Andrei Chekun
f7c7877ba6 test.py: add bypassing x_log2_compaction_groups to boost tests
Bypassing argument to pytest->boost that was missing.
2025-07-11 12:30:09 +02:00
Andrei Chekun
71b875c932 test.py: add bypassing random seed to boost tests
Bypassing argument to pytest->boost that was missing.

Fixes: https://github.com/scylladb/scylladb/issues/24927
2025-07-11 12:30:08 +02:00
Gleb Natapov
89f2edf308 api: unregister raft_topology_get_cmd_status on shutdown
In c8ce9d1c60 we introduced
raft_topology_get_cmd_status REST api but the commit forgot to
unregister the handler during shutdown.

Fixes #24910

Closes scylladb/scylladb#24911
2025-07-10 17:16:44 +02:00
Andrei Chekun
64a095600b test.py: break the loop when there is no tests for pytest
Quit from the repeats if the test is under the pytest runner directory and has
some typos or is absent. This allows not going several times through the
discovery and stopping execution.
2025-07-10 15:09:28 +02:00
Piotr Dulikowski
d9aec89c4e Merge 'vector_store_client: implement vector_store_client service' from Pawel Pery
Vector Store service is an HTTP server which provides a vector search index and ANN (Approximate Nearest Neighbor) functionality. Vector Store retrieves metadata and data about indexes from Scylla using the CQL protocol and CDC functionality. Scylla will request ANN searches using the HTTP API.

Commits for the patch:
- implement initial `vector_store_client` service. It also adds a `vector_store_uri` parameter to Scylla.
- refactor sequential_producer as abortable
- implement ip addr retrieval from dns. The URI for Vector Store must contain a DNS name; this commit implements the ip addr refreshing functionality
- refactor primary_key as a top-level class. It is needed for the forward declaration of a primary_key
- implement ANN API. It implements a core ANN search request functionality, adds Vector Store HTTP API description in docs/protocols.md, and implements automatic boost tests with mocked http server for checking error conditions.

New feature, should not be backported.

Fixes: VECTOR-47
Fixes: VECTOR-45

-~-

Closes scylladb/scylladb#24331

* github.com:scylladb/scylladb:
  vector_store_client: implement ANN API
  cql3: refactor primary_key as a top-level class
  vector_store_client: implement ip addr retrieval from dns
  utils: refactor sequential_producer as abortable
  vector_store_client: implement initial vector_store_client service
2025-07-10 13:18:20 +02:00
Marcin Maliszkiewicz
ace7d53cf8 test: add type creation to test_snapshot
It covers the case when a new type and a new keyspace
are created together.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
15b4db47c7 storage_service: always wake up load balancer on update tablet metadata
Lack of wakeup is error-prone, as it relies on a wakeup occurring
elsewhere.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
c62a022b43 db: schema_applier: call destroy also when exception occurs
Otherwise objects may be destroyed on wrong shard, and assert
will trigger in ~sharded().
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
b103fee5b6 db: replica: simplify seeding ERM during shema change
We know that caller is running on shard 0 so we can avoid some extra boilerplate.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
44490ceb77 db: remove cleanup from add_column_family
Since we abort now on failure during schema commit
there is no need for cleanup as it only manages in-memory
state.

Explicit cf.stop was added to code paths outside of schema
merging to avoid unnecessary regressions.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
317da13e90 db: abort on exception during schema commit phase
As we have no way to recover from partial commit.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
81c3dabe06 db: make user defined types changes atomic
The same order of creation/destruction is preserved as in the
original code, looking from single shard point of view.

create_types() is called on each shard separately, while in theory
we should be able to reuse results similarly to diff_rows(). But we
don't introduce this optimization yet.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
e3f92328d3 replica: db: make keyspace schema changes atomic
Now all keyspace related schema changes are observable
on a given shard as if they were applied atomically.
This is achieved by commit_on_shard() function being
non-preemptive (no futures, no co_awaits).

In the future we'll extend this to the whole schema
and also other subsystems.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
b18cc8145f db: atomically apply changes to tables and views
In this commit we make use of the split functions introduced before.
Pattern is as follows:
- in merge_tables_and_views we call some preparatory functions
- in schema_applier::update we call non-yielding step
- in schema_applier::post_commit we call cleanups and other finalizing async
  functions

Additionally we introduce frozen_schema_diff because converting
schema_ptr to global_schema_ptr triggers schema registration and
with atomic changes we need to place registration only in commit
phase. Schema freezing is the same method global_schema_ptr uses
to transport schema across shards (via schema_registry cache).
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
19bc6ffcb0 replica: make truncate_table_on_all_shards get whole schema from table_shards
Previously, for views and indexes, it fetched the base schema from db (and
a couple of other properties). This becomes a problem once we introduce atomic
tables and views deletion (in the following commit),
because once we delete a table it can no longer be fetched from the db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.

Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
5ad1845bd6 service: split update_tablet_metadata into two phases
In following commits calls will be split in schema_applier.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
2f840e51d1 service: pull out update_tablet_metadata from migration_listener
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
fa157e7e46 db: service: add store_service dependency to schema_applier
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
847d7f4a3a service: simplify load_tablet_metadata and update_tablet_metadata
- remove load_tablet_metadata(), instead we add wake_up_load_balancer flag
to update_tablet_metadata(), it reduces number of public functions and
also serves as a comment (removed comment with very similar meaning)

- reimplement the code to not use mutate_token_metadata(), this way
it's more readable and it's also needed as we'll split
update_tablet_metadata() in following commits so that we can have
subroutine which doesn't yield (for ensuring atomicity)
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
e242ae7ee8 db: don't perform move on tablet_hint reference
This lambda is called several times so there should be no move.
Currently the bug likely doesn't manifest as code does work
only on shard 0.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
1c5ec877a7 replica: split add_column_family_and_make_directory into steps
This is similar work as for drop_table in previous commit.

add_column_family_and_make_directory() behaves exactly the same
as before but calls to it in schema_applier will be replaced by
calls directly to split steps. Other usages will remain intact as
they don't need atomicity (like creating system tables at startup).
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
c2cd02272a replica: db: split drop_table into steps
This is done so that actual dropping can be
an atomic step which could be composed with other
schema operations, and eventually all subsystems modified
via raft so that we could introduce atomic changes which
span across different subsystems.

We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()

prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.

We add legacy_drop_table_on_all_shards() which
behaves exactly like old drop_table_on_all_shards() to be
compatible with code which doesn't need to play with atomicity.

Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to split functions in the following
commits - that's the place we will take advantage of drop_table not
yielding (as it returns void now).
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
d00266ac49 db: don't move map references in merge_tables_and_views()
Since they are const it's not needed and misleading.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
fdaff143be db: introduce commit_on_shard function
This will be the place for all atomic schema switching
operations.

Note that atomicity is observed only from single shard
point of view. All shards may switch at slightly different times
as global locking for this is not feasible.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
2e69016c4f db: access types during schema merge via special storage
Once we create types atomically the code which is before commit
may depend on newly added types, so it has to access both old and
new types. New storage called in_progress_types_storage was added.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
71bd452075 replica: make non-preemptive keyspace create/update/delete functions public
As those operations will be managed by schema_applier class. This
will be implemented in following commit.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
dce0e65213 replica: split update keyspace into two phases
- first phase is preemptive (prepare_update_keyspace)
- second phase is non-preemptive (update_keyspace)

This is done so that schema change can be applied atomically.

Additionally, the create keyspace code was changed to share a common
part with the update keyspace flow.

This commit doesn't yet change the behaviour of the code,
as it doesn't guarantee atomicity, it will be done in following
commits.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
734f79e2ad replica: split creating keyspace into two functions
This is done so that in following commits insert_keyspace can be used
to atomically change schema (as it doesn't yield).
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
ec270b0b5e db: rename create_keyspace_from_schema_partition
It only creates keyspace metadata.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
9c856b5785 db: decouple functions and aggregates schema change notification from merging code 2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
32b2786728 db: store functions and aggregates change batch in schema_applier
To be used in following commit.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
bc2d028f77 db: decouple tables and views schema change notifications from merging code
As post_commit() can't be fully implemented at this stage,
it was moved to interim place to keep things working.
It will be moved back later.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
af5e0d7532 db: store tables and views schema diff in schema_applier
It will be used in subsequent commit for moving
notifications code.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
9c8f3216ab db: decouple user type schema change notifications from types merging code
Merging types code now returns generic affected_types structure which
is used both for notifications and dropping types. New static
function drop_types() replaces dropping lambda used before.

I think neither dropping nor notifications need to use per-shard
copies (as the code does both before and after this patch); they
could just use string parameters or something similar. But that
requires too many changes in other classes, so it's out of scope
here.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
ae81497995 service: unify keyspace notification functions arguments
Keyspace metadata is not used, only name is needed so
we can remove those extra find_keyspace() calls.

Moreover there is no need to copy the name.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
45c5c44c2d db: replica: decouple keyspace schema change notifications to a separate function
In following commits we want to separate updating code from committing
the schema change (making it visible). Since notifications should be issued
after the change is visible we need to separate them and call them after
committing.

In subsequent commits other notification types will be moved too.

Here we change the order of notification calls with regard to the rest
of the schema updating code. I.e. before, keyspace notifications were triggered
before tables were updated; after the change they will trigger once
everything is updated. There is no indication that notification
listeners depend on this behaviour.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
96332964b7 db: add class encapsulating schema merging
This commit doesn't yet change how schema merging
works but it prepares the ground for it.

We split merging code into several functions.
Main reasons for it are that:

- We want to generalize and create some interface
which each subsystem would use.

- We need to pull mutation's apply() out
of the code because raft will call it directly,
and it will contain a mix of mutations from more
than one subsystem. This is needed because we have
the need to update multiple subsystems atomically
(e.g. auth and schema during auto-grant when creating
a table).

In this commit do_merge_schema() code is split between
prepare(), update(), commit(), post_commit(). The idea
behind each of these phases is described in the comments.
The last 2 phases are not yet implemented as it requires more
code changes but adding schema_applier enclosing class
will help to create some copied state in the future and
implement commit() and post_commit() phases.
2025-07-10 10:40:42 +02:00
Asias He
ccce5f2472 test: Add test_boot_nodes.py
A simple add node test which can be used to test adding a large number of
nodes to a cluster.
2025-07-10 10:56:53 +08:00
Andrei Chekun
e34569bd92 test.py: handle max failures for pytest repeats
Pytest can handle max failures, but inside one run, and it was not affecting
the repeats. Repeats for pytest is just another execution of the process, so
there is no connection between them. With additional check, it will respect
max fails.
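
A rough sketch of the additional check, under stated assumptions: `run_repeats` and `run_once` are hypothetical names, each repeat is modelled as a call returning its number of failed tests, and a limit of 0 is assumed to mean "no limit". The actual test.py code may be structured differently.

```python
def run_repeats(run_once, repeats: int, max_failures: int) -> tuple:
    """Run up to `repeats` executions, stopping once failures reach the limit.

    Returns (number of executions performed, total failures seen).
    """
    total_failures = 0
    runs = 0
    for _ in range(repeats):
        # Each repeat is an independent process, so failures are summed here.
        total_failures += run_once()
        runs += 1
        if max_failures and total_failures >= max_failures:
            break
    return runs, total_failures


# Every repeat fails one test; with a limit of 2 only two repeats are executed.
limited = run_repeats(lambda: 1, repeats=5, max_failures=2)
unlimited = run_repeats(lambda: 1, repeats=5, max_failures=0)
```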

Closes scylladb/scylladb#24760
2025-07-09 19:57:58 +02:00
Michael Litvak
fa24fd7cc3 tablets: stop storage group on deallocation
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.

Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.

However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
   allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
   stopped.

To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.

Fixes scylladb/scylladb#24857
Fixes scylladb/scylladb#24828

Closes scylladb/scylladb#24896
2025-07-09 19:29:14 +03:00
Aleksandra Martyniuk
17272c2f3b repair: Reduce max row buf size when small table optimization is on
If small_table_optimization is on, a repair works on a whole table
simultaneously. It may be distributed across the whole cluster and
all nodes might participate in repair.

On a repair master, row buffer is copied for each repair peer.
This means that the memory scales with the number of peers.

In large clusters, repair with small_table_optimization leads to OOM.

Divide the max_row_buf_size by the number of repair peers if
small_table_optimization is on.

Use max_row_buf_size to calculate number of units taken from mem_sem.
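
The arithmetic can be illustrated as below. This is a sketch, not the repair code: the function name, the 32 MiB base size and the guard against zero peers are assumptions made for the example.

```python
def effective_max_row_buf_size(max_row_buf_size: int, nr_peers: int,
                               small_table_optimization: bool) -> int:
    """Return the row buffer limit after accounting for per-peer copies."""
    if small_table_optimization and nr_peers > 0:
        # One buffer copy per peer: dividing keeps the total roughly constant.
        return max_row_buf_size // nr_peers
    return max_row_buf_size


base = 32 * 1024 * 1024  # hypothetical base limit, in bytes
peers = 8

per_buffer = effective_max_row_buf_size(base, peers, small_table_optimization=True)
total_across_peers = per_buffer * peers
unchanged = effective_max_row_buf_size(base, peers, small_table_optimization=False)
```

With the division in place, the memory held for all peer copies stays at about the base limit instead of growing linearly with the number of peers.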

Fixes: https://github.com/scylladb/scylladb/issues/22244.

Closes scylladb/scylladb#24868
2025-07-09 16:55:38 +03:00
Avi Kivity
0138afa63b service: tablet_allocator: avoid large contiguous vector in make_repair_plan()
make_repair_plan() allocates a temporary vector which can grow larger
than our 128k basic allocation unit. Use a chunked vector to avoid
stalls due to large allocations.

Fixes #24713.

Closes scylladb/scylladb#24801
2025-07-09 12:50:02 +02:00
Pawel Pery
eadbf69d6f vector_store_client: implement ANN API
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It implements a functionality for ANN search request to a vector-store
service. It sends the request, receives the response and, after parsing it,
returns the list of primary keys.

It adds json parsing functionality specific for the HTTP ANN API.

It adds a hardcoded http request timeout for retrieving response from
the Vector Store service.

It also adds an automatic boost test of the ANN search interface, which
uses a mockup http server in a background to simulate vector-store
service.

It adds documentation for the HTTP API protocol used for ANN
functionality.

Fixes: VS-47
2025-07-09 11:54:51 +02:00
Pawel Pery
5bfce5290e cql3: refactor primary_key as a top-level class
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

There is a need for forward declaration of primary_key class. This patch
moves a nested definition of select_statement::primary_key (from a
cql3::statements namespace) into a standalone class in a
cql3::statements namespace.

Reference: VS-47
2025-07-09 11:54:51 +02:00
Pawel Pery
1f797e2fcd vector_store_client: implement ip addr retrieval from dns
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It implements functionality for refreshing ip address of the
vector-store service dns name and creating a new HTTP client with that
address. It also provides cleanup of unused http clients. There are
hardcoded intervals for dns refresh and old http clients cleanup, and
timeout for requesting new http client.

This patch introduces two background tasks: one for dns resolving
and one for cleaning up old http clients.

It adds unit tests for possible dns refreshing issues.

Reference: VS-47
Fixes: VS-45
2025-07-09 11:54:51 +02:00
Emil Maskovsky
df37c514d3 raft: improve comments in group0 voter handler
Enhance code documentation in the group0 voter handler implementation.
2025-07-09 10:40:59 +02:00
Pawel Pery
8d3c33f74a utils: refactor sequential_producer as abortable
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

There is a need for an abortable sequential_producer operator(). The
existing operator() is changed to allow timeout argument with default
time_point::max() (as current default usage) and the new operator() is
created with abort_source parameter.

Reference: VS-47
2025-07-08 16:29:55 +02:00
Pawel Pery
7bf53fc908 vector_store_client: implement initial vector_store_client service
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If parsing that parameter
fails, an exception is thrown during construction.
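
The fail-fast parsing described above could look roughly like this in Python. It is a sketch of the idea only: the function name is hypothetical, and treating the port as mandatory and `http` as the only accepted scheme are assumptions read off the documented format, not details confirmed by the patch.

```python
from urllib.parse import urlsplit


def parse_vector_store_uri(uri: str) -> tuple[str, int]:
    """Split an `http://host:port` URI into (host, port) or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme != "http" or not parts.hostname:
        raise ValueError(f"invalid vector_store_uri: {uri!r}")
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError as exc:
        raise ValueError(f"invalid port in vector_store_uri: {uri!r}") from exc
    if port is None:
        # Assumption: the documented format always carries an explicit port.
        raise ValueError(f"missing port in vector_store_uri: {uri!r}")
    return parts.hostname, port
```

Rejecting a malformed value when the service is constructed surfaces a configuration mistake at startup rather than on the first query.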

For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.

This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.

Reference: VS-47 VS-45
2025-07-08 16:29:55 +02:00
Yaniv Michael Kaul
82fba6b7c0 PowerPC: remove ppc stuff
We don't even compile-test it.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#24659
2025-07-08 10:38:23 +03:00
Piotr Dulikowski
6c65f72031 Merge 'batchlog_manager: abort replay of a failed batch on shutdown or node down' from Michael Litvak
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. Also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.

Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.

backport to relevant versions since the issue can cause node shutdown to hang

Fixes scylladb/scylladb#24599

Closes scylladb/scylladb#24595

* github.com:scylladb/scylladb:
  test: test_batchlog_manager: batchlog replay includes cdc
  test: test_batchlog_manager: test batch replay when a node is down
  batchlog_manager: set timeout on writes
  batchlog_manager: abort writes on shutdown
  batchlog_manager: create cancellable write response handler
  storage_proxy: add write type parameter to mutate_internal
2025-07-07 16:48:07 +02:00
Andrei Chekun
ae6dc46046 test.py: skip cleaning artifacts when -s provided
Skip removing any artifacts between test.py invocations when -s is provided.
Logs from the previous run will be overwritten if the tests are executed
again. For example:
1. Execute tests A, B, C with parameter -s
2. All logs are present even if tests are passed
3. Execute test B with parameter -s
4. Logs for A and C are from the first run
5. Logs for B are from the most recent run
2025-07-07 15:42:11 +02:00
Patryk Jędrzejczak
2a52834b7f Merge 'Make it easier to debug stuck raft topology operation.' from Gleb Natapov
The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations.

Backport since we want to have it in production as quickly as possible.

Fixes #24860

Closes scylladb/scylladb#24799

* https://github.com/scylladb/scylladb:
  topology coordinator: log a start and an end of topology coordinator command execution at info level
  topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
2025-07-07 15:40:44 +02:00
Michał Hudobski
919cca576f custom_index: do not create view when creating a custom index
Currently we create a view for every index, however
for currently supported custom index classes (vector_index)
that work is redundant, as we store the index in the external
service.

This patch adds a way for custom indexes to choose whether to
create a view when creating the index and makes it so that
for vector indexes the view is not created.
2025-07-07 13:47:07 +02:00
Michał Hudobski
d4002b61dd custom_index: refactor describe for custom indexes
Currently, to describe an index we look at
a corresponding view. However for custom indexes
the view may not exist (as we are removing the views
from vector indexes). This commit adds a way for a custom
index class to override the default describing logic
and provides such an override for the vector_index
class.
2025-07-07 13:47:07 +02:00
Michał Hudobski
5de3adb536 custom_index: remove unneeded duplicate of a static string
We have got a duplicate of the same static string and
the only usage of one of the copies can be easily replaced
2025-07-07 13:47:06 +02:00
Piotr Dulikowski
ea35302617 Merge 'test: audit: enable syslog audit tests' from Andrzej Jackowski
Several audit test issues caused test failures, and as a result, almost all of the audit syslog tests were marked with xfail.
This patch series enables the syslog audit tests, that should finally pass after the following fixes are introduced:
 - bring back commas to audit syslog (scylladb#24410 fix)
 - synchronize audit syslog server
 - fix parsing of syslog messages
 - generate unique uuid for each line in syslog audit
 - allow audit logging from multiple nodes

Fixes: scylladb/scylladb#24410

Test improvements, no backport required.

Closes scylladb/scylladb#24553

* github.com:scylladb/scylladb:
  test: audit: use automatic comparators in AuditEntry
  test: audit: enable syslog audit tests
  test: audit: sort new audit entries before comparing with expected ones
  test: audit: check audit logging from multiple nodes
  test: audit: generate unique uuid for each line in syslog audit
  test: audit: fix parsing of syslog messages
  test: audit: synchronize audit syslog server
  docs: audit: update syslog audit format to the current one
  audit: bring back commas to audit syslog
2025-07-07 12:45:44 +02:00
Pavel Emelyanov
84e1ac5248 sstables: Move versions static-assertion check to .cc file
This check validates that static values of supported versions are "in
sync" with each other. It's enough to do it once when compiling
sstable_version.cc, not every time the header is included.

refs: #1 (not that it helps noticeably, but technically it fits)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24839
2025-07-07 13:16:21 +03:00
Michael Litvak
d7af26a437 test: test_batchlog_manager: batchlog replay includes cdc
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.

This is done in order to verify that it works currently as expected and
doesn't break in the future.
2025-07-07 12:24:05 +03:00
Michael Litvak
a9b476e057 test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.
2025-07-07 12:23:06 +03:00
Michael Litvak
74a3fa9671 batchlog_manager: set timeout on writes
Set a timeout on writes of replayed batches by the batchlog manager.

We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.

The timeout is set to be high enough to allow any reasonable write to
complete.
2025-07-07 12:23:06 +03:00
Michael Litvak
7150632cf2 batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.
2025-07-07 12:23:06 +03:00
Michael Litvak
fc5ba4a1ea batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.
2025-07-07 12:23:06 +03:00
Michael Litvak
8d48b27062 storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that allows to indicate the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.
2025-07-07 12:23:06 +03:00
Nadav Har'El
18b6c4d3c5 alternator: lower maximum table name length to 192
Currently, Alternator allows creating a table with a name up to 222
(max_table_name_length) characters in length. But if you do create
a table with such a long name, you can have some difficulties later:
You will not be able to add Streams or GSI or LSI to that table,
because 222 is also the absolute maximum length Scylla tables can have
and the auxiliary tables we want to create (CDC log, materialized views)
will go over this absolute limit (max_auxiliary_table_name_length).

This is not nice. DynamoDB users assume that after successfully
creating a table, they can later - perhaps much later - decide to
add Streams or GSI to it, and today if they chose extremely long
names, they won't be able to do this.

So in this patch, we lower max_table_name_length from 222 to 192.
A user will not be able to create tables with longer names, but
the good news is that once successfully creating a table, it will
always be possible to enable Streams on it (the CDC log table has an
extra 15 bytes in its name, and 192 + 15 is less than 222), and it
will be possible to add GSIs with short enough names (if the GSI
name is 29 or less, 192 + 29 + 1 = 222).
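
The arithmetic behind these limits can be checked with a short sketch. The two limits, the 15-character CDC log suffix, and the `tablename:indexname` view naming come from this patch series; the Python constant and function names are made up for the example.

```python
MAX_TABLE_NAME_LENGTH = 192            # new limit for user-visible table names
MAX_AUXILIARY_TABLE_NAME_LENGTH = 222  # absolute limit for any Scylla table
CDC_LOG_SUFFIX = "_scylla_cdc_log"     # 15 characters


def cdc_log_name_length(table_name_length: int) -> int:
    """Length of the CDC log table name created when Streams are enabled."""
    return table_name_length + len(CDC_LOG_SUFFIX)


def max_gsi_name_length(table_name_length: int) -> int:
    """Longest GSI name that still fits; the view is named tablename:indexname."""
    return MAX_AUXILIARY_TABLE_NAME_LENGTH - table_name_length - 1


if __name__ == "__main__":
    # Even the longest legal table name leaves room for the CDC log table.
    print(cdc_log_name_length(MAX_TABLE_NAME_LENGTH))
    # And for a GSI whose name is reasonably short.
    print(max_gsi_name_length(MAX_TABLE_NAME_LENGTH))
```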

This patch is a trivial one-line code change, but also includes the
corrected documentation of the limits, and a fix for one test that
previously checked that a table name with length 222 was allowed -
and now needs to check 192 because 222 is no longer allowed.

Note that if a user has existing tables and upgrades Scylla, it
is possible that some pre-existing Alternator tables might have
lengths over 192 (up to 222). This is fine - in the previous patches
we made sure that even in this case, all operations will still work
correctly on these old tables (by not validating the name!), and
we also made sure that attempting to enable Streams may fail when
the name is too long (we do not remove those old checks in this patch,
and don't plan to remove them in the foreseeable future).

Note that the limit we chose - 192 characters - is identical to the
table name limit we recently chose in CQL. It's nicer that we don't
need to memorize two different limits for Alternator and CQL.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 11:58:21 +03:00
Nadav Har'El
3ed8e269f9 alternator: don't crash when adding Streams to long table name
Currently, in Alternator it is possible to create a table whose name has
222 characters, and then trying to add Streams to that table results in
an attempt to create a CDC log table with the same name plus a
15-character suffix "_scylla_cdc_log", which resulted (Ref #24598) in
an IO-error and a Scylla shutdown.

This patch adds code to the Stream-adding operations (both CreateTable
and UpdateTable) that validates that the table's name, plus that 15
character suffix, doesn't exceed max_auxiliary_table_name_length, i.e.,
222.
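
A minimal sketch of such a check, written in Python for brevity (Alternator itself is C++). The function name and the use of `ValueError` are hypothetical; the suffix, the limit, and the error text are the ones quoted in this commit message.

```python
CDC_LOG_SUFFIX = "_scylla_cdc_log"
MAX_AUXILIARY_TABLE_NAME_LENGTH = 222


def validate_table_name_for_streams(table_name: str) -> None:
    """Reject enabling Streams when the CDC log table name would be too long."""
    if len(table_name) + len(CDC_LOG_SUFFIX) > MAX_AUXILIARY_TABLE_NAME_LENGTH:
        limit = MAX_AUXILIARY_TABLE_NAME_LENGTH - len(CDC_LOG_SUFFIX)
        raise ValueError(
            "Streams cannot be added if the table name is longer than "
            f"{limit} characters."
        )
```

Failing the request up front turns what used to be an IO error during table creation into an ordinary validation error returned to the client.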

After this patch, if you have a table whose name is between 208 and 222
characters, attempting to enable Streams on it will fail with:

 "Streams cannot be added if the table name is longer than 207 characters."

Note that in the future, if we lower max_table_name_length to below 207,
e.g., to 192, then it will always be possible to add a stream to any
legal table, and the new checks we had here will be mostly redundant.
But only "mostly" - not entirely: Checking in UpdateTable is still
important because of the possibility that an upgrading user might have
a pre-existing table whose name is longer than the new limit, and might
try to enable Streams.

After this patch, the crash reported in #24598 can no longer happen, so
in this sense the bug is solved. However, we still want to lower
max_table_name_length from 222 to 192, so that it will always be
possible to enable streams on any table with a legal name length.
We'll do this in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 11:58:13 +03:00
Nadav Har'El
898665ca38 alternator: split length limit for regular and auxiliary tables
Alternator has a constant, max_table_name_length=222, which is currently
used for two different things:

1. Limiting the length of the name allowed for Alternator table.
2. Limiting the length of some auxiliary tables the user is not aware
   of, such as a materialized view (whose name is tablename:indexname)
   or (in the next patch) CDC log table.

In principle, there is no reason why these two limits need to be identical -
we could lower the table name limit to, say, 192, but still allow the
tablename:indexname to be even longer, up to 222 - i.e., allow creating
materialized views even on tables whose name has 192 characters.

So in this patch we split this variable into two, max_table_name_length
and max_auxiliary_table_name_length. At the moment, both are still set
to the same value - 222. In a following patch we plan to lower
max_table_name_length but leave max_auxiliary_table_name_length at 222.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 11:43:49 +03:00
Gleb Natapov
4e6369f35b topology coordinator: log a start and an end of topology coordinator command execution at info level
Those calls are relatively rare and the output may help to analyze issues
in production.
2025-07-07 10:46:22 +03:00
Gleb Natapov
c8ce9d1c60 topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
The topology coordinator executes several topology cmd rpc against some nodes
during a topology change. A topology operation will not proceed unless
rpc completes (successfully or not), but sometimes it appears that it
hangs and it is hard to tell on which nodes it did not complete yet.
Introduce new REST endpoint that can help with debugging such cases.
If executed on the topology coordinator it returns currently running
topology rpc (if any) and a list of nodes that did not reply yet.
2025-07-07 10:46:03 +03:00
Nadav Har'El
09aa062ab6 alternator: avoid needlessly validating table name
In commit d8c3b144cb we fixed #12538:
That issue noted that most requests which take a TableName don't need
to "validate" the table's name (check that it has allowed characters
and length) if the table is found in the schema. We only need to do
this validation on CreateTable, or when the table is *not* found
(because in that case, DynamoDB chose to print a validation error
instead of table-not-found error).

It turns out that the fix missed a couple of places where the name
validation was unnecessary, so this patch fixes those remaining places.

The original motivation for fixing #12538 was performance, so
it focused just on cheap, common requests. But now, we want to be sure
we fixed *all* requests, because of a new motivation:

We are considering, due to #24598, to lower the maximum allowed table
name length. However, when we'll do that, we'll want the new lower
length limit to not apply to already existing tables. For example,
it should be possible to delete a pre-existing table with DeleteTable,
if it exists, without the command complaining that the name of this table
is too long. So it's important to make sure that the table's name is
only validated in CreateTable or if the table does not exist.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-07 10:05:43 +03:00
Avi Kivity
d4efefbd9c Merge 'Improve background disposal of tablet_metadata' from Benny Halevy
As seen in #23284, when the tablet_metadata contains many tables, even empty ones,
we're seeing a long queue of seastar tasks coming from the individual destruction of
`tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`.

This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects
on their owner shard by sorting them into vectors, one per owner shard.

Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed
arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the
contained tablet_metadata would be cleared gently.

Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom
and verify that it is gone with this change.

Fixes #24814
Refs #23284

This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables.

* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards

Closes scylladb/scylladb#24618

* github.com:scylladb/scylladb:
  token_metadata_impl: clear_gently: release version tracker early
  test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
  token_metadata: clear_and_destroy_impl when destroyed
  token_metadata: keep a reference to shared_token_metadata
  token_metadata: move make_token_metadata_ptr into shared_token_metadata class
  replica: database: get and expose a mutable locator::shared_token_metadata
  locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
2025-07-06 19:43:50 +03:00
Benny Halevy
6e4803a750 token_metadata_impl: clear_gently: release version tracker early
No need to wait for all members to be cleared gently.
We can release the version earlier since the
held version may be awaited for in barriers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 15:07:31 +03:00
Benny Halevy
4a3d14a031 test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
Reproduces #23284

Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 15:07:31 +03:00
Benny Halevy
2c0bafb934 token_metadata: clear_and_destroy_impl when destroyed
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
Since it's a reference-counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.

This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.

Fixes #13381

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 15:07:31 +03:00
Benny Halevy
2b2cfaba6e token_metadata: keep a reference to shared_token_metadata
To be used by a following patch to gently clean and destroy
the token_metadata_impl in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 15:07:31 +03:00
Benny Halevy
e0a19b981a token_metadata: move make_token_metadata_ptr into shared_token_metadata class
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 14:22:20 +03:00
Benny Halevy
493a2303da replica: database: get and expose a mutable locator::shared_token_metadata
Prepare for the next patch, which will use this shared_token_metadata
to make mutable_token_metadata_ptr:s

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 14:22:20 +03:00
Benny Halevy
3acca0aa63 locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
Sort all tablet_map_ptr:s by shard_id
and then destroy them on each shard to prevent
long cross-shard task queues for foreign_ptr destructions.
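
The batching idea can be illustrated with a small Python sketch. It models only the grouping step, not Seastar's `foreign_ptr` machinery: the `(owner_shard, object)` pairs and the `destroy_on_shard` callback are stand-ins invented for the example.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_by_owner_shard(ptrs: Iterable[tuple[int, object]]) -> dict[int, list]:
    """Bucket (owner_shard, object) pairs into one list per owner shard."""
    by_shard: dict[int, list] = defaultdict(list)
    for shard, obj in ptrs:
        by_shard[shard].append(obj)
    return dict(sorted(by_shard.items()))


def clear_gently(ptrs: Iterable[tuple[int, object]],
                 destroy_on_shard: Callable[[int, list], None]) -> None:
    """Hand each shard its whole batch in one call instead of one per object."""
    for shard, batch in group_by_owner_shard(ptrs).items():
        destroy_on_shard(shard, batch)
```

With N objects spread over S shards this issues S cross-shard calls rather than N, which is the effect the patch is after.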

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 14:20:46 +03:00
Ernest Zaslavsky
8ac2978239 sstables: coroutinize futurized readers
Coroutinize futurized readers and sources to get ready for using `make_data_or_index_source` in `sstable`
2025-07-06 09:18:39 +03:00
Ernest Zaslavsky
0de61f56a2 sstables: add make_data_or_index_source to the storage
Add `make_data_or_index_source` to the `storage` interface, implement it
for `filesystem_storage` storage which just creates `data_source` from a
file and for the `s3_storage` create a (maybe) decrypting source from s3
make_download_source.

This change should improve performance when reading large objects
from S3 and should not affect anything for the `filesystem_storage`.
2025-07-06 09:18:39 +03:00
Ernest Zaslavsky
7e5e3c5569 encryption: refactor key retrieval
Get the encryption schema extension retrieval code out of
`wrap_file` method to make it reusable elsewhere
2025-07-06 09:18:39 +03:00
Ernest Zaslavsky
211daeaa40 encryption: add encrypted_data_source class
Introduce the `encrypted_data_source` class that wraps an existing data
source to read and decrypt data on the fly using block encryption. Also add
unit tests to verify correct decryption behavior.
NOTE: The wrapped source MUST read from offset 0, `encrypted_data_source` assumes it is

Co-authored-by: Calle Wilund <calle@scylladb.com>
2025-07-06 09:18:39 +03:00
Pavel Emelyanov
4d6385fc27 api: Remove unused get_json_return_type() templates
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24837
2025-07-05 18:42:02 +03:00
Avi Kivity
33225b730d Merge 'Do not reference db::config by transport::server' from Pavel Emelyanov
The db::config is top-level configuration class that includes options for pretty much everything in Scylla. Instead of messing with this large thing, individual services have their own smaller configs, that are initialized with values from db::config. This PR makes it for transport::server (transport::controller will be next) and its cql_server_config. One bad thing not to step on is that updateable_value is not shard-safe (#7316), but the code in controller that creates cql_server_config is already taking care.

Closes scylladb/scylladb#24841

* github.com:scylladb/scylladb:
  transport: Stop using db::config by transport::server
  transport: Keep uninitialized_connections_semaphore_cpu_concurrency on cql_server_config
  transport: Move cql_duplicate_bind_variable_names_refer_to_same_variable to cql_server_config
  transport: Move max_concurrent_requests to struct config
  transport: Use cql_server_config::max_request_size
2025-07-05 18:39:01 +03:00
Pavel Emelyanov
9b178df7dd transport: Stop using db::config by transport::server
Now the server is self-contained in the way it is being configured by
the controller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-04 15:40:20 +03:00
Pavel Emelyanov
e2c1484d8d transport: Keep uninitialized_connections_semaphore_cpu_concurrency on
cql_server_config

This also repeats previous patch for another updateable_value. The thing
here is that this config option is passed further to generic_server, but
not used by transport::server itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-04 15:40:20 +03:00
Pavel Emelyanov
64ffe67cbd transport: Move cql_duplicate_bind_variable_names_refer_to_same_variable
to cql_server_config

Similarly to previous patch -- move yet another updateable_value to let
transport::server eventually stop messing with db::config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-04 15:40:14 +03:00
Pavel Emelyanov
b6546ed5ff transport: Move max_concurrent_requests to struct config
This is updateable_value that's initialized from db::config named_value
to tackle its shard-unsafety. However, the cql_server_config is created
by controller using sharded_parameter() helper, so that is can be safely
passed to server.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-04 15:35:55 +03:00
Pavel Emelyanov
6075eca168 transport: Use cql_server_config::max_request_size
It's duplicated on config and the transport::server that aggregates the
config itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-04 15:34:53 +03:00
Andrei Chekun
d81820f529 test.py: move deleting directory to prepare_dir
Instead of explicitly removing the directory, move the removal into the
prepare_dir method. If the passed pattern is '*', the directory is deleted;
in other cases only the files matched by the pattern are removed.
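
The behavior described above might be sketched like this. It is an assumption-laden illustration, not the test.py implementation: the signature is hypothetical, and recreating the directory after deleting it is a guess about what callers expect.

```python
import shutil
from pathlib import Path


def prepare_dir(dirname: Path, pattern: str) -> None:
    """Clean `dirname`: wipe it for '*', otherwise drop only matching files."""
    if pattern == "*" and dirname.exists():
        # The whole directory goes away, including subdirectories.
        shutil.rmtree(dirname)
    # Assumption: callers want the directory to exist afterwards.
    dirname.mkdir(parents=True, exist_ok=True)
    if pattern != "*":
        for path in dirname.glob(pattern):
            if path.is_file():
                path.unlink()
```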
2025-07-04 13:39:42 +02:00
Andrzej Jackowski
55e542e52e test: audit: use automatic comparators in AuditEntry
Replace manual comparator implementations with generated comparators.
This simplifies future maintenance and ensures comparators
remain accurate when new fields are added.

Reorder fields in AuditEntry so the less-than comparator evaluates
the most significant fields first.
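
In Python, one common way to get generated comparators is `dataclasses.dataclass(order=True)`, which compares instances field by field in declaration order. Whether the patch uses exactly this mechanism is an assumption, and the class name and field set below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class AuditEntrySketch:
    # Hypothetical fields; declaration order doubles as comparison order,
    # so the most significant field comes first.
    category: str
    table: str
    statement: str
    user: str = "anonymous"


entries = [
    AuditEntrySketch("DML", "ks.t", "INSERT INTO ks.t ..."),
    AuditEntrySketch("DDL", "ks.t", "CREATE TABLE ks.t ..."),
]
ordered = sorted(entries)  # uses the generated __lt__
```

Because `__eq__` and `__lt__` are derived from the field list, adding a field later automatically extends both comparisons.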
2025-07-04 13:08:29 +02:00
Andrzej Jackowski
d7711a5f3a test: audit: enable syslog audit tests
Several audit test issues were resolved in numerous commits of this
patch series. This commit enables the syslog audit tests, that should
finally pass.
2025-07-04 12:40:57 +02:00
Andrzej Jackowski
3ebc693e70 test: audit: sort new audit entries before comparing with expected ones
In some corner cases, the order of audit entries can change. For
instance, ScyllaDB is allowed to apply BATCH statements in an order
different from the order in which they are listed in the statement.
To prevent test failures in such cases, this commit sorts new
audit entries.

Additionally, it is possible that some of the audit entries won't be
received by the SYSLOG server immediately. To prevent test failures
in this scenario, waiting for the expected number of new audit entries
is added.
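
The two measures above, waiting for the expected number of entries and then sorting them, can be sketched as a small polling helper. The helper name, the `fetch` callback, and the timeout values are hypothetical and not taken from the test code.

```python
import time
from typing import Callable


def wait_for_sorted_entries(fetch: Callable[[], list], expected_count: int,
                            timeout: float = 10.0,
                            interval: float = 0.1) -> list:
    """Poll `fetch` until it yields enough entries, then return them sorted."""
    deadline = time.monotonic() + timeout
    while True:
        entries = fetch()
        if len(entries) >= expected_count:
            # Sorting makes the comparison independent of arrival order.
            return sorted(entries)
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"got {len(entries)} audit entries, expected {expected_count}")
        time.sleep(interval)
```

A test would then compare the result against `sorted(expected)`, so a reordered BATCH or a slow syslog delivery no longer causes a spurious failure.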
2025-07-04 12:40:57 +02:00
Andrzej Jackowski
436e86d96a test: audit: check audit logging from multiple nodes
Before this change, the `assert_audit_row_eq` check assumed that
audit logs were always generated by the same (first) node. However,
this assumption is invalid in a multi-node setup.

This commit modifies the check to just verify that one of the nodes
in the cluster generated the audit log.
2025-07-04 12:40:57 +02:00
Andrzej Jackowski
2fefa29de7 test: audit: generate unique uuid for each line in syslog audit
Audit to TABLE uses a time UUID as a clustering key, while audit to
SYSLOG simply appends new lines. As a result, having such a detailed
time UUID is unnecessary for SYSLOG. However, TABLE tests expect each
line to be unique, and a similar check is performed (and fails)
in SYSLOG tests.

This commit updates the test framework to generate a unique UUID for
each line in SYSLOG audit. This ensures the tests remain consistent
for both TABLE and SYSLOG audit.
2025-07-04 12:40:57 +02:00
Andrzej Jackowski
f85e738b11 test: audit: fix parsing of syslog messages
Before this commit, there were the following issues with the parsing of
syslog messages in audit tests:
 - `line_to_row()` function was never called
 - `line_to_row()` was not prepared for changes introduced in
    scylladb#23099 (i.e. key=value pairs)
 - `line_to_row()` didn't handle newlines in queries
 - `line_to_row()` didn't handle "\\" escaping in queries

 Due to the aforementioned issues, the syslog audit tests were failing.
 This commit fixes all of those issues, by parsing each audit syslog
 message using a regexp.
2025-07-04 12:40:51 +02:00
Pavel Emelyanov
4d4406c5bc Merge 'test.py: dtest: port next_gating tests from auth_test.py' from Evgeniy Naydanov
Copy `auth_test.py` from scylla-dtest test suite, remove all non-next_gating tests from it, and make it work with `test.py`

As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `required_features("!consistent-topology-changes")` marker.

Remove `test_permissions_caching` test because it's too flaky when running using test.py

Also, make a few execution time optimizations:
  - remove redundant `time.sleep(10)`
  - use smaller timeouts for CQL sessions

Enable the test in `suite.yaml` (run in dev mode only.)

Additional modifications to test.py/dtest shim code:

- Modify ManagerClient.server_update_config() method to change multiple config options in one call in addition to one `key: value` pair.
- Implement the method using slightly modified `set_configuration_options()` method of `ScyllaCluster`.
- Copy generate_cluster_topology() function from tools/cluster_topology.py module.
- Add support for `bootstrap` parameter for `new_node()` function.
- Rework `wait_for_any_log()` function.

Closes scylladb/scylladb#24648

* github.com:scylladb/scylladb:
  test.py: dtest: make auth_test.py run using test.py
  test.py: dtest: rework wait_for_any_log()
  test.py: dtest: add support for bootstrap parameter for new_node
  test.py: dtest: add generate_cluster_topology() function
  test.py: dtest: add ScyllaNode.set_configuration_options() method
  test.py: pylib/manager_client: support batch config changes
  test.py: dtest: copy unmodified auth_test.py
  test.py: dtest: add missed markers to pytest.ini
2025-07-04 10:51:52 +03:00
Botond Dénes
258bf664ee scylla-gdb.py: sstable-summary: adjust for raw-tokens
01466be7b9 changed the summary entries, storing raw tokens in them,
instead of dht::token. Adjust the command so that it works with both
pre- and post- versions.
Also make it accept pointers to sstables as arguments, since this is what
the scylla sstables listing provides.

Closes scylladb/scylladb#24759
2025-07-04 10:44:25 +03:00
Patryk Jędrzejczak
8d925b5ab4 test: increase the default timeout of graceful shutdown
Multiple tests are currently flaky due to graceful shutdown
timing out when flushing tables takes more than a minute. We still
don't understand why flushing is sometimes so slow, but we suspect
it is an issue with new machines spider9 and spider11 that CI runs
on. All observed failures happened on these machines, and most of
them on spider9.

In this commit, we increase the timeout of graceful shutdown as
a temporary workaround to improve CI stability. When we get to
the bottom of the issue and fix it, we will revert this change.

Ref #12028

It's a temporary workaround to improve CI stability, so we don't
have to backport it.

Closes scylladb/scylladb#24802
2025-07-04 10:43:38 +03:00
Avi Kivity
60f407bff4 storage_proxy: avoid large allocation when storing batch in system.batchlog
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.

Fix by going through the more primitive, but less contiguous,
atomic_cell API.

Fixes #24809.

Closes scylladb/scylladb#24811
2025-07-04 10:43:05 +03:00
Avi Kivity
5cbeae7178 sstables: drop minimum_key(), maximum_key()
Not used.

Closes scylladb/scylladb#24825
2025-07-04 10:42:44 +03:00
Dawid Mędrek
a151944fa6 treewide: Replace __builtin_expect with (un)likely
C++20 introduced two new attributes--likely and unlikely--that
function as a built-in replacement for __builtin_expect implemented
in various compilers. Since it makes code easier to read and it's
an integral part of the language, there's no reason to not use it
instead.

Closes scylladb/scylladb#24786
2025-07-03 13:34:04 +03:00
dependabot[bot]
59cc496757 build(deps): bump sphinx-scylladb-theme from 1.8.6 to 1.8.7 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.6 to 1.8.7.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.6...1.8.7)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.8.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#24805
2025-07-03 12:04:24 +03:00
Gleb Natapov
ca7837550d topology coordinator: do not set request_type field for truncation command if topology_global_request_queue feature is not enabled yet
Old nodes do not expect global topology request names to be in
request_type field, so set it only if a cluster is fully upgraded
already.

Closes scylladb/scylladb#24731
2025-07-02 17:09:29 +02:00
Pavel Emelyanov
fa0077fb77 Merge 'S3 chunked download source bug fixes' from Ernest Zaslavsky
- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource is requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appearing more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle

No need to backport anything since this class is not used yet

Closes scylladb/scylladb#24657

* github.com:scylladb/scylladb:
  s3_test: Add s3_client test for non-retryable error handling
  s3_test: Add trace logging for default_retry_strategy
  s3_client: Fix edge case when the range is exhausted
  s3_client: Fix indentation in try..catch block
  s3_client: Stop retries in chunked download source
  s3_client: Enhance test coverage for retry logic
  s3_client: Add test for Content-Range fix
  s3_client: Fix missing negation
  s3_client: Refine logging
  s3_client: Improve logging placement for current_range output
2025-07-02 14:45:10 +03:00
Patryk Jędrzejczak
fa982f5579 docs: handling-node-failures: fix typo
Replacing "from" is incorrect. The typo comes from recently
merged #24583.

Fixes #24732

Requires backport to 2025.2 since #24583 has been backported to 2025.2.

Closes scylladb/scylladb#24733
2025-07-02 12:22:01 +03:00
Konstantin Osipov
37fc4edeb5 test.py: add a way to provide pytest arguments via test.py
Now that we use a single pytest.ini for all tests, different
developer preferences collide. There should be an easy way to override
pytest.ini defaults from the command line.

Fixes https://github.com/scylladb/scylladb/issues/21800

Closes scylladb/scylladb#24573
2025-07-02 12:20:43 +03:00
Nikos Dragazis
fbc9ead182 doc: Expose new aws_session_token option for KMS hosts
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-02 12:04:40 +03:00
Nikos Dragazis
4c66769e07 kms_host: Support authn with temporary security credentials
There are two types of AWS security credentials:
* long-term credentials (access key id + secret access key)
* temporary credentials (access key id + secret access key + session token)

The KMS host can obtain these credentials from multiple sources:
* IMDS (config option `aws_use_ec2_credentials`)
* STS, by assuming an IAM role (config option `aws_assume_role_arn`)
* Scylla config (options `aws_access_key_id`, `aws_secret_access_key`)
* Env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
* AWS credentials file (~/.aws/credentials)

The first two sources return temporary credentials. The rest return
long-term credentials.

Extend the KMS host to support temporary credentials from the other
three sources as well. Introduce the config option `aws_session_token`,
and parse the same-named env var and config option from the credentials
file. Also, support `aws_security_token` as an alias, for backwards
compatibility.

This patch facilitates local debugging of corrupted SSTables, as well as
testing, using temporary credentials obtained from STS through other
authentication means (e.g., Okta + gimme-aws-creds).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-02 12:04:40 +03:00
Nikos Dragazis
37894c243d encryption_config: Mention environment in credential sources for KMS
The help string for the `--kms-hosts` command-line option mentions only
the AWS credentials file as a fall-back search path, in case no explicit
credentials are given.

Extend the help string to mention the environment as well. Make it clear
that the environment has higher precedence than the credentials file.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-02 12:04:40 +03:00
Avi Kivity
dfaed80f55 Merge 'types: add byte-comparable format support for native cql3 types' from Lakshmi Narayanan Sreethar
This PR introduces a new `comparable_bytes` class to add byte-comparable format support for all the [native cql3 data types](https://opensource.docs.scylladb.com/stable/cql/types.html#native-types) except the `counter` type, as that is not comparable. The byte-comparable format is a prerequisite for implementing the trie-based index format for our sstables (https://github.com/scylladb/scylladb/issues/19191). This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md

Note that support for composite data types like lists, maps, and sets has not been implemented yet and will be made available in a separate PR.

Refs https://github.com/scylladb/scylladb/issues/19407

New feature - backport not required.

Closes scylladb/scylladb#23541

* github.com:scylladb/scylladb:
  types/comparable_bytes: add testcase to verify compatibility with cassandra
  types/comparable_bytes: support variable-length natively byte-ordered data types
  types/comparable_bytes: support decimal cql3 types
  types/comparable_bytes: introduce count_digits() method
  types/comparable_bytes: support uuid and timeuuid cql3 types
  types/comparable_bytes: support varint cql3 type
  types/comparable_bytes: support skipping sign byte write in decode_signed_long_type
  types/comparable_bytes: introduce encode/decode_varint_length
  types/comparable_bytes: support float and double cql3 types
  types/comparable_bytes: support date, time and timestamp cql3 types
  types/comparable_bytes: support bigint cql3 type
  types/comparable_bytes: support fixed length signed integers
  types/comparable_bytes: support boolean cql3 type
  types: introduce comparable_bytes class
  bytes_ostream: overload write() to support writing from FragmentedView
  docs: fix minor typo in docs/dev/cql3-type-mapping.md
2025-07-02 11:58:32 +03:00
Avi Kivity
1e0b015c8b Merge 'cql3: Represent create_statement using managed_bytes' from Dawid Mędrek
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by

```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```

in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.

However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.

In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.

A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.

Fixes scylladb/scylladb#24018

Backport: all of the supported versions are affected, so we want to backport the changes there.

Closes scylladb/scylladb#24151

* github.com:scylladb/scylladb:
  cql3/description: Serialize only rvalues of description
  cql3: Represent create_statement using managed_string
  cql3/statements/describe_statement.cc: Don't copy descriptions
  cql3: Use managed_bytes instead of bytes in DESCRIBE
  utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
2025-07-01 21:59:38 +03:00
Lakshmi Narayanan Sreethar
5f5a8cf54c types/comparable_bytes: add testcase to verify compatibility with cassandra 2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
6c1853a830 types/comparable_bytes: support variable-length natively byte-ordered data types
The following cql3 data types - ascii, blob, duration, inet, and text -
are natively byte-ordered in their serialized forms. To encode them into
a byte-comparable format, zeros are escaped, and since these types have
variable lengths, the encoded form is terminated in an escaped state to
mark its end.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
5c77d17834 types/comparable_bytes: support decimal cql3 types
The decimal cql3 type is internally stored as a scale and an unscaled
integer. To convert them into a byte comparable format, they are first
normalized into a base-100 exponent and a mantissa that lies in [0.01, 1)
and then encoded into a byte sequence that preserves the numerical order.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
832236d044 types/comparable_bytes: introduce count_digits() method
Implemented a method `count_digits()` to return the number of significant
digits in a given boost::multiprecision::cpp_int. This is required to
convert big_decimal to a byte comparable format.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
a00c5d3899 types/comparable_bytes: support uuid and timeuuid cql3 types
The uuid type values are composed of two fixed-length unsigned integers:
an msb and an lsb. The msb contains a version digit, which must be
pulled first in a byte-comparable representation. For version 1 uuids,
in addition to extracting the version digit first, the msb must be
rearranged to make it byte comparable. The lsb is written as is.

For the timeuuid type, the msb is handled similarly to the version 1 uuid
values. The lsb however is treated differently - the sign bits of all
bytes are inverted to preserve the legacy comparison order, which
compared individual bytes as signed values.
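
The lsb handling described above can be illustrated with a minimal Python sketch. It is not the code from this patch; the function names are hypothetical, and it only shows why inverting the sign bit of every byte makes a plain unsigned byte comparison agree with the legacy signed per-byte comparison.

```python
def encode_timeuuid_lsb(lsb: bytes) -> bytes:
    # Hypothetical helper: flip the sign bit (0x80) of every byte.
    return bytes(b ^ 0x80 for b in lsb)


def legacy_signed_key(lsb: bytes) -> list:
    # Legacy order: each byte is compared as a signed 8-bit value.
    return [b - 256 if b >= 128 else b for b in lsb]


samples = [
    bytes([0x7F, 0x00]),
    bytes([0x80, 0x00]),
    bytes([0x00, 0xFF]),
    bytes([0x00, 0x01]),
]

# Sorting by the encoded bytes (unsigned comparison) yields the same
# order as sorting by the legacy signed-byte comparison.
assert sorted(samples, key=encode_timeuuid_lsb) == sorted(samples, key=legacy_signed_key)
```

XOR with 0x80 maps the signed range [-128, 127] onto the unsigned range [0, 255] while keeping the relative order, which is why a byte-wise unsigned comparison is sufficient after encoding.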

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
4592b9764c types/comparable_bytes: support varint cql3 type
Any varint value less than 7 bytes is encoded using the signed long
encoding format and the remaining values are all encoded using the full form
encoding:

  <signbyte><length as unsigned integer - 7><7 or more bytes>,

where <signbyte> is 00 for negative numbers and FF for positive ones,
and the length's bytes are inverted if the number is negative (so that
longer length sorts smaller).

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
1b6b0a665d types/comparable_bytes: support skipping sign byte write in decode_signed_long_type
The decode_signed_long_type() method writes leading sign bytes when
decoding a byte-comparable encoded signed long value. The varint decoder
depends on this method to decode values up to a certain length and
expects the decoded form to include sign-only bytes only when necessary.
Update the decode_signed_long_type() code to allow skipping the write of
sign-only bytes based on the caller's request.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
ad45a19373 types/comparable_bytes: introduce encode/decode_varint_length
The length of a varint value is encoded separately as an unsigned
variable-length integer. For negative varint values, the encoded bytes
are flipped to ensure that longer lengths sort smaller. This patch
implements both encoding and decoding logic for varint lengths and will
be used by the subsequent patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
7af153c237 types/comparable_bytes: support float and double cql3 types
The sign bit is flipped for positive values to ensure that they are
ordered after negative values. For negative values, all the bytes are
inverted, allowing larger negative values to be ordered before smaller
ones.
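
A minimal Python sketch of the rule above, for the 8-byte double case. It is an illustration under the assumption of a big-endian IEEE 754 layout, not the code from this patch, and the function names are hypothetical.

```python
import struct


def encode_double(value: float) -> bytes:
    raw = struct.pack(">d", value)  # big-endian IEEE 754 double
    if raw[0] & 0x80:
        # Negative: invert all bytes so larger magnitudes sort first.
        return bytes(b ^ 0xFF for b in raw)
    # Positive: flip only the sign bit so it sorts after all negatives.
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def decode_double(encoded: bytes) -> float:
    if encoded[0] & 0x80:
        raw = bytes([encoded[0] ^ 0x80]) + encoded[1:]
    else:
        raw = bytes(b ^ 0xFF for b in encoded)
    return struct.unpack(">d", raw)[0]


values = [float("-inf"), -2.5, -1.0, 0.0, 1.0, 2.5, float("inf")]
encoded = [encode_double(v) for v in values]

# Byte-wise order of the encoded forms matches the numeric order.
assert encoded == sorted(encoded)
assert [decode_double(e) for e in encoded] == values
```

The same approach applies to the 4-byte float type with the `">f"` format.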

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
0145c1d705 types/comparable_bytes: support date, time and timestamp cql3 types
Both the date and time cql3 types are internally unsigned fixed length
integers. Their serialized form is already byte comparable, so the
encoder and decoder return the serialized bytes as they are.

The timestamp type is encoded using the fixed length signed integer
encoding.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
b6ff3f5304 types/comparable_bytes: support bigint cql3 type
The bigint type, internally implemented as a long data type, is encoded
using a variable-length encoding similar to UTF-8. This enables a
significant amount of space to be saved when smaller numbers are
frequently used, while still permitting large values to be efficiently
encoded.

The first bit of the encoding represents the inverted sign (i.e., 1 for
positive, 0 for negative), followed by length encoded as a sequence of
bits matching the inverted sign. This is then followed by a differing
bit (except for 9-byte encodings) and the bits of the number's two's
complement.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
c0d25060bd types/comparable_bytes: support fixed length signed integers
To encode fixed-length signed integers in a byte-comparable format, the
first bit of each value is inverted. This ensures that negative numbers
are ordered before positive ones during comparison. This patch adds
support for the data types: byte_type (tinyint), short_type (smallint),
and int32_type (int). Although long_type (bigint) is a fixed length
integer type, it has a different byte-comparable encoding and will be
handled separately in another patch.
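
The sign-bit inversion described above can be sketched in Python as follows. This is a minimal illustration assuming big-endian two's complement serialization, not the code from this patch; the function names are hypothetical.

```python
def encode_fixed_signed(value: int, width: int) -> bytes:
    # Big-endian two's complement with the first (sign) bit inverted.
    raw = value.to_bytes(width, "big", signed=True)
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def decode_fixed_signed(encoded: bytes) -> int:
    raw = bytes([encoded[0] ^ 0x80]) + encoded[1:]
    return int.from_bytes(raw, "big", signed=True)


values = [-2**31, -1, 0, 1, 2**31 - 1]
encoded = [encode_fixed_signed(v, 4) for v in values]

# Negative numbers now compare lower than positive ones byte by byte.
assert encoded == sorted(encoded)
assert [decode_fixed_signed(e) for e in encoded] == values
```

Passing `width` as 1, 2 or 4 covers tinyint, smallint and int respectively.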

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
8572afca2b types/comparable_bytes: support boolean cql3 type
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
74c556a33d types: introduce comparable_bytes class
This patch implements a new class, `comparable_bytes`, designed to
implement methods for converting data values to and from byte-comparable
formats. The class stores the comparable bytes as `managed_bytes` and
currently provides the structure for all required methods. The actual
logic for converting various data types will be implemented in subsequent
patches.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
e4c7cb7834 bytes_ostream: overload write() to support writing from FragmentedView
Overloaded write() method to support writing a FragmentedView into
bytes_ostream. Also added a testcase to verify the implementation.
The new helper will be used by the byte_comparable implementation
during the encode/decode process.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
068e74b457 docs: fix minor typo in docs/dev/cql3-type-mapping.md
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Ernest Zaslavsky
acf15eba8e s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
a5246bbe53 s3_test: Add trace logging for default_retry_strategy
Introduce trace-level logging for `default_retry_strategy` in
`s3_test` to improve visibility into retry logic during test
execution.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
49e8c14a86 s3_client: Fix edge case when the range is exhausted
Handle case where the download loop exits after consuming all data,
but before receiving an empty buffer signaling EOF. Without this, the
next request is sent with a non-zero offset and zero length, resulting
in "Range request cannot be satisfied" errors. Now, an empty buffer is
pushed to indicate completion and exit the fiber properly.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
e50f247bf1 s3_client: Fix indentation in try..catch block
Correct indentation in the `try..catch` block to improve code
readability and maintain consistent formatting.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
d2d69cbc8c s3_client: Stop retries in chunked download source
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.

This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
c75acd274c s3_client: Enhance test coverage for retry logic
Extend the S3 proxy to support error injection when the client
makes multiple requests to the same resource—useful for testing
retry behavior and failure handling.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
ec59fcd5e4 s3_client: Add test for Content-Range fix
Introduce a test that accurately verifies the Content-Range
behavior, ensuring the previous fix is properly validated.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
6d9cec558a s3_client: Fix missing negation
Restore a missing `not` in a conditional check that caused
incorrect behavior during S3 client execution.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
e73b83e039 s3_client: Refine logging
Fix typo in log message to improve clarity and accuracy during
S3 operations.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
f1d0690194 s3_client: Improve logging placement for current_range output
Relocated logging to occur after determining the `current_range`,
ensuring more relevant output during S3 client operations.
2025-07-01 18:45:17 +03:00
Tomasz Grabiec
97679002ee Merge 'Co-locate tablets of different tables' from Michael Litvak
Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index.

main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.
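
The group-level resize comparison mentioned in the list above can be sketched roughly as follows. This is a simplified Python illustration, not the load balancer's code: it ignores the per-table hints, and the function name, the `split_factor` and `merge_factor` parameters and the returned strings are all assumptions made for the example.

```python
from typing import Dict, List


def group_resize_decision(
    tablet_sizes: Dict[str, List[int]],  # table name -> per-tablet sizes
    target_tablet_size: int,
    split_factor: float,   # hypothetical threshold multiplier
    merge_factor: float,   # hypothetical threshold multiplier
) -> str:
    # Average over all tablets of all tables in the group.
    all_sizes = [size for sizes in tablet_sizes.values() for size in sizes]
    average = sum(all_sizes) / len(all_sizes)
    if average > target_tablet_size * split_factor:
        return "split"
    if average < target_tablet_size * merge_factor:
        return "merge"
    return "none"


# The base table alone would exceed the (assumed) split threshold...
assert group_resize_decision({"base": [12, 12]}, 5, 2.0, 0.5) == "split"
# ...but averaged together with a small co-located table it does not.
assert group_resize_decision({"base": [12, 12], "cdc": [2, 2]}, 5, 2.0, 0.5) == "none"
```

The point of the sketch is only that the decision is taken once per group, so all co-located tables keep the same tablet count.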

Fixes https://github.com/scylladb/scylladb/issues/17043

backport is not needed. this is preliminary work for support of MVs and CDC with tablets.

Closes scylladb/scylladb#22906

* github.com:scylladb/scylladb:
  tablets: validate no clustering row mutations on co-located tables
  raft_group0_client: extend validate_change to mixed_change type
  docs: topology-over-raft: document co-located tables
  tablet-mon.py: visual indication for co-located tablets
  tablet-mon.py: handle co-located tablets
  test/boost/view_schema_test.cc: fix race in wait_until_built
  boost/tablets_test: test load balancing and resize of co-located tablets
  test/tablets: test tablets colocation
  tablets: co-locate view tablets with base when the partition keys match
  test/pylib/tablets: common get_tablet_count api
  test_mv_tablets: use get_tablet_replicas from common tablets api
  test/pylib/tablets: fix test api to read tablet replicas from base table
  tablets: allocator: create co-located tables in a single operation
  alternator: prepare all new tables in a single announcement
  migration_manager: add notification for creating multiple tables
  tablets: read_tablet_transition_stage: read from base table
  storage service: allow repair request only on base tables
  tablets: keyspace_rf_change: apply on base table
  storage service: generate tablet migration updates on base tables
  tablets: replace all_tables method
  tablets: split when all co-located tablets are ready
  tablets: load balancer: sizing plan for table groups
  tablets: load balancer: handle co-located tablets
  tablets: allocate co-located tablets
  tablets: handle migration of co-located tablets
  storage service: add repair colocated tablets rpc
  tablets: save and read tablet metadata of co-located tables
  tablets: represent co-located tables in tablet metadata
  tablets: add base_table column to system.tablets
  docs: update system.tablets schema
2025-07-01 16:02:30 +02:00
Tomasz Grabiec
6290b70d53 Merge 'repair: postpone repair until topology is not busy ' from Aleksandra Martyniuk
Currently, repair_service::repair_tablets starts repair if there
are no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.

Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there are no tablet transitions) and holds the erm;
- resize finalization finishes;

then the repair sees a topology state different from the actual one:
it does not see that the storage groups were already split.
Repair code does not handle this case, which results in
on_internal_error.

Start repair only when the topology is not busy. The check isn't atomic,
as it's done on shard 0. Thus, we compare the topology versions
to ensure that the busyness check is valid.

Fixes: https://github.com/scylladb/scylladb/issues/24195.

Needs backport to all branches since they are affected

Closes scylladb/scylladb#24202

* github.com:scylladb/scylladb:
  test: add test for repair and resize finalization
  repair: postpone repair until topology is not busy
2025-07-01 16:02:22 +02:00
Botond Dénes
37ef9efb4e docs: cql/types.rst: remove reference to frozen-only UDTs
ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing
this limitation in the current docs. Replace the description of the
limitation with general description of frozen semantics for UDTs.

Fixes: #22929

Closes scylladb/scylladb#24763
2025-07-01 16:19:18 +03:00
Michał Chojnowski
a29724479a utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752
2025-07-01 15:13:04 +03:00
Łukasz Paszkowski
a22d1034af test.py: Fix test_compactionhistory_rows_merged_time_window_compaction_strategy
The test has the following problems:
1. Wrongly computed time windows. Data was not spread across two 1-minute
   windows, causing the test to generate as many as three sstables instead
   of two.
2. The timestamp was not propagated to the prepared CQL statements, so
   the current time was in fact used implicitly.
3. Because of the incorrect timestamp issue, the remaining tests
   testing purged tombstones were affected as well.

Fixes https://github.com/scylladb/scylladb/issues/24532

Closes scylladb/scylladb#24609
2025-07-01 15:01:21 +03:00
Avi Kivity
609cc20d22 build: update toolchain to Fedora 42 with clang 20.1 and libstdc++ 15
Rebase to Fedora 42 with clang 20.1 and libstdc++ 15.

JAVA8_HOME environment variable dropped since we no longer use it.

cassandra-stress package updated with version that doesn't depend
on no-longer-available Java 11.

Optimized clang binaries were generated and stored in

  https://devpkg.scylladb.com/clang/clang-20.1.7-Fedora-42-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-20.1.7-Fedora-42-x86_64.tar.gz

Closes scylladb/scylladb#23978
2025-07-01 14:39:47 +03:00
Dawid Mędrek
9d03dcd28e cql3/description: Serialize only rvalues of description
We discard instances of `cql3::description` right after serializing them,
so let's change the signature of the function to save some work.
2025-07-01 12:58:11 +02:00
Dawid Mędrek
ac9062644f cql3: Represent create_statement using managed_string
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by

```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```

in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.

However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.

In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `fragmented_ostringstream` and `managed_string`
to possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.

We provide a reproducer. It consistently passes with this commit,
while having about 50% chance of failure before it (based on my
own experiments). Playing with the parameters of the test
doesn't seem to improve that chance, so let's keep it as-is.

Fixes scylladb/scylladb#24018
2025-07-01 12:58:02 +02:00
Michael Litvak
a5feb80797 tablets: validate no clustering row mutations on co-located tables
When preparing a tablet metadata change, add another validation that no
clustering row mutations are written to the tablet map of a co-located
dependent table.

A co-located table should never have clustering rows in the
`system.tablets` table. It has only the static row with base_table
column set, pointing to the base table.
2025-07-01 13:20:20 +03:00
Michael Litvak
6619e798e7 raft_group0_client: extend validate_change to mixed_change type
The function validate_change in raft_group0_client is used currently to
validate tablet metadata changes, and therefore it applies only to
commands of type topology_change.

But the type mixed_change also allows topology change mutations and it's
in fact used for tablet metadata changes, for example in
keyspace_rf_change.

Therefore, extend validate_change to validate also changes of type
mixed_change, so we can catch issues there as well.
2025-07-01 13:20:19 +03:00
Michael Litvak
6fa5d2f7c8 docs: topology-over-raft: document co-located tables 2025-07-01 13:20:19 +03:00
Michael Litvak
cb9e03bd09 tablet-mon.py: visual indication for co-located tablets
Add a visual indication for groups of co-located tablets in
tablet-mon.py.

We order the tablets by groups, and draw a rectangle that connects
tablets that are co-located.
2025-07-01 13:20:19 +03:00
Michael Litvak
b35b7c4970 tablet-mon.py: handle co-located tablets
For co-located tablets we need to read the tablet information from the
tablet map referenced by base_table.

Fix tablet-mon.py to handle co-located tablets by checking if base_table
is set when reading the tablets of a table, and if so refer to the base
table map.
2025-07-01 13:20:19 +03:00
Michael Litvak
fb18b95b3c test/boost/view_schema_test.cc: fix race in wait_until_built
Create the view waiter before creating the view, otherwise if the waiter
is created after the view is built we may lose the notification.
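
The ordering problem can be shown with a small, generic Python sketch. It does not use the test's real classes; `BuildNotifier` and `run` are hypothetical stand-ins for a notification source that only reaches listeners registered at the time the event fires.

```python
class BuildNotifier:
    """Hypothetical stand-in for a 'view built' notification source."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def notify_built(self, view_name):
        # Only listeners registered at this moment receive the event.
        for callback in self._listeners:
            callback(view_name)


def run(subscribe_first: bool) -> list:
    notifier = BuildNotifier()
    seen = []
    if subscribe_first:
        notifier.subscribe(seen.append)
    notifier.notify_built("mv1")  # the view is created and built
    if not subscribe_first:
        notifier.subscribe(seen.append)  # too late, the event is gone
    return seen


assert run(subscribe_first=True) == ["mv1"]
assert run(subscribe_first=False) == []
```

Registering the waiter first removes the window in which the notification can be missed.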
2025-07-01 13:20:19 +03:00
Michael Litvak
3b4af89615 boost/tablets_test: test load balancing and resize of co-located tablets
Add unit tests of load balancing and resize with co-located tablets.
2025-07-01 13:20:19 +03:00
Michael Litvak
65ed0548d6 test/tablets: test tablets colocation
Add tests with co-located tablets, testing migration and other relevant
operations.
2025-07-01 13:20:19 +03:00
Michael Litvak
7211c0b490 tablets: co-locate view tablets with base when the partition keys match
For a view table that has the same partition key as its base table, the
view's tablets are co-located with the base tablets.

Fixes scylladb/scylladb#17043
2025-07-01 13:20:19 +03:00
Michael Litvak
e01aae7871 test/pylib/tablets: common get_tablet_count api
Introduce a common get_tablet_count test api instead of it being
duplicated in a few tests, and fix it to read the tablet count from the
base table.
2025-07-01 13:20:19 +03:00
Michael Litvak
e719da3739 test_mv_tablets: use get_tablet_replicas from common tablets api
Replace the duplicated get_tablet_replicas method in test_mv_tablets
with the common method from the tablets api, to reduce code duplication
and use the correct method that reads the tablet replicas from the base
table.
2025-07-01 13:20:19 +03:00
Michael Litvak
6bfb82844f test/pylib/tablets: fix test api to read tablet replicas from base table
When reading tablet replicas from system.tablets, we need to refer to
the base table partition, if any.

We fix and simplify the test api for reading tablet replicas to read
from the base table.
2025-07-01 13:20:19 +03:00
Michael Litvak
018b61f658 tablets: allocator: create co-located tables in a single operation
Co-located base and child tables may be created together in a single
operation.  The tablet allocator in this case needs to handle them
together and not each table independently, because we need to have the
base schema and tablet map when creating the child tablet map.

We do this by registering the tablet allocator to the migration
notification on_before_create_column_families that announces multiple
new tables, and there we allocate tablets for all the new base tables,
and for the new child tables we create their maps from the base tables,
which are either a new table or an existing one.
2025-07-01 13:20:19 +03:00
Michael Litvak
2d0ec1c20a alternator: prepare all new tables in a single announcement
When creating base and view tables in alternator, they are created in a
single operation, so use a single announcement for creating multiple
tables in a single operation instead of announcing each table
separately.

This is needed because when we create base tables and local indexes we
need to make them co-located, so we need to allocate tablets for them
together.
2025-07-01 13:20:18 +03:00
Michael Litvak
05ffcefd50 migration_manager: add notification for creating multiple tables
Add prepare_new_column_families_announcement for preparing multiple new
tables that are created in a single operation.

A listener can receive a notification when multiple tables are created.
This is useful if the listener needs to have all the new tables, and not
work on each new table independently. For example, if there are
dependencies between the new tables.
2025-07-01 13:20:18 +03:00
Michael Litvak
064ac25ff9 tablets: read_tablet_transition_stage: read from base table
When reading tablet information from system.tablets we need to read it
from the base table, if one exists.
2025-07-01 13:20:18 +03:00
Michael Litvak
ff9a3c9528 storage service: allow repair request only on base tables
Currently, tablet repair runs only on base tables, and not on derived
co-located tables.

If repair is requested for a non-base table, throw an error, since the
operation won't have the intended results.
2025-07-01 13:20:18 +03:00
Michael Litvak
aa990a09c1 tablets: keyspace_rf_change: apply on base table
Generate keyspace_rf_change transitions only on base tables, because in
a group of co-located tablets their tablet map is shared with the base
table.
2025-07-01 13:20:18 +03:00
Michael Litvak
602fa84907 storage service: generate tablet migration updates on base tables
When writing transition updates to a tablet map we must do so on a base
table. A table that is co-located with a base table doesn't have its
own tablet map in the tablets table, but it only points to the base
table map. By writing to the base table, the tablet migration will be
applied for the entire co-location group.

We add a small helper in storage_service that creates a tablet mutation
builder for the base table, and use it whenever we need to write tablet
mutations.
2025-07-01 13:20:18 +03:00
Michael Litvak
ddf02c9489 tablets: replace all_tables method
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.

Now that we have co-located tables we need to make the distinction on
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, regardless of whether they
are part of a co-located group and have a base table.

We replace all_tables with new methods that can be used for each of the
cases.
2025-07-01 13:20:18 +03:00
Michael Litvak
255ca569e3 tablets: split when all co-located tablets are ready
For a group of co-located tablets, they must be split together
atomically, so finalize tablet split only when all tablets in the group
are ready.
2025-07-01 13:20:18 +03:00
Michael Litvak
0dcb9f2ed6 tablets: load balancer: sizing plan for table groups
We update the sizing plan to work with table groups instead of single
tables, using the base table as a representative of a table group.

The resize decision is made based on the combined per-table tablet
hints, and considering the size of all tables in the group. We calculate
the average tablet size of all tablets in the group and compare it with
the target tablet size.

The target tablet size is changed to be some function of the group size,
because we may want to have a lower target tablet size when we have
multiple co-located tablets, in order to reduce the migration size.
2025-07-01 13:20:18 +03:00
Michael Litvak
ac5f4da905 tablets: load balancer: handle co-located tablets
Tablets of co-located tables are always co-located and migrated
together, so they are considered as an atomic unit for the tablets load
balancer.

We change the load balancer to work with table groups as migration
candidates instead of single tables, using the base table of a group as
a representative of the group.

For the purpose of load calculations, a group of co-located tablets is
treated as a single tablet, because their combined target tablet
size is the same as that of a single tablet.
2025-07-01 13:20:18 +03:00
Michael Litvak
3db8f6fd37 tablets: allocate co-located tablets
When allocating tablets for a new table, add the option to create a
co-located tablet map with an existing base table.

The co-located tablet map is created with the base_table value set.
2025-07-01 13:20:18 +03:00
Michael Litvak
6bed9d3cfe tablets: handle migration of co-located tablets
When handling tablet transition for a group of co-located tables,
maintain co-location by applying each transition operation (streaming,
cleanup, repair) on all tablets in the group in a synchronized way.

handle_tablet_migration is changed to work on groups of co-located
tablets instead of single tablets. Each transition step is handled by
applying its operation on all the tablets in the group.

The tablet map of co-located tablets is shared, so we need to read and
write only the tablet map of the base table.
2025-07-01 13:20:18 +03:00
Michael Litvak
11f045bb7c storage service: add repair colocated tablets rpc
Add a new RPC repair_colocated_tablets which is similar to the RPC
tablet_repair, but instead of repairing a single tablet it takes a set
of co-located tablets, repairs them and returns a shared repair_time
result.

This is useful because the way co-located tablets are represented
doesn't allow repairing tablets independently, but only as a group
operation, and the repair_time which is stored in the tablet map is
shared with the entire co-location group.

But when repairing a group of co-located tablets we may require a
different behavior, especially considering that co-located tablets are
derived tablets of a special type. For example, we may want to skip
running repair on CDC tablets when repairing the base table.

The new RPC and the storage service function repair_colocated_tablets
allow the flexibility to implement different strategies when repairing
co-located groups.

Currently the implementation is simply to repair each tablet and return
the minimum repair_time as the shared repair time.
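
A minimal Python sketch of that "minimum wins" rule. The function name `shared_repair_time` is hypothetical and the real implementation is C++; the sketch only shows that the group's recorded time cannot be later than that of its least recently repaired tablet:

```python
from datetime import datetime, timezone


def shared_repair_time(repair_times):
    """Return the repair time to record for a whole co-location group."""
    times = list(repair_times)
    if not times:
        raise ValueError("a co-location group has at least one tablet")
    # The group shares one repair_time, so take the oldest per-tablet result.
    return min(times)


base = datetime(2025, 7, 1, 10, 0, tzinfo=timezone.utc)
view = datetime(2025, 7, 1, 9, 30, tzinfo=timezone.utc)
group_time = shared_repair_time([base, view])
```

Here `group_time` equals `view`, the earlier of the two timestamps.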
2025-07-01 13:20:18 +03:00
Yaron Kaikov
fd0e044118 Update ScyllaDB version to: 2025.4.0-dev 2025-07-01 11:33:20 +03:00
Jenkins Promoter
94d7c22880 Update pgo profiles - aarch64 2025-07-01 11:33:20 +03:00
Jenkins Promoter
7531fc72a6 Update pgo profiles - x86_64 2025-07-01 11:33:20 +03:00
Nadav Har'El
e12ff4d3ab Merge 'LWT: use tablet_metadata_guard' from Petr Gusev
This PR is a step towards enabling LWT for tablet-based tables.

It pursues several goals:
* Make it explicit that the tablet can't migrate after the `cas_shard` check in `select_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on a correct shard -- the one which owns the partition key the LWT is running on. The reasons for that are explained in [this commit](f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)) message. The statements check the current shard and invoke `bounce_to_shard` if it's not the right one. However, the erm strong pointer is only captured in `storage_proxy::cas` and until that moment there is no explicit structure in the code which would prevent the ongoing migrations. In this PR we introduce such a structure -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`.
* Another goal of this PR is an optimization -- we don't want to hold erm for the duration of the entire LWT, unless it directly affects the current tablet. There is a `tablet_metadata_guard` class which is used for long running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores erm directly.
* Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399)

Backport: not needed (new feature).

Closes scylladb/scylladb#24495

* github.com:scylladb/scylladb:
  LWT: make cas_shard non-optional in sp::cas
  LWT: create cas_shard in select_statement
  LWT: create cas_shard in modification and batch statements
  LWT: create cas_shard in alternator
  LWT: use cas_shard in storage_proxy::cas
  do_query_with_paxos: remove redundant cas_shard check
  storage_proxy: add cas_shard class
  sp::cas_shard: rename to get_cas_shard
  token_metadata_guard: a topology guard for a token
  tablet_metadata_guard: mark as noncopyable and nonmoveable
2025-07-01 11:33:20 +03:00
Gleb Natapov
a221b2bfde gossiper: do not assume that id->ip mapping is available in failure_detector_loop_for_node
failure_detector_loop_for_node may be started on a shard before id->ip
mapping is available there. Currently the code treats missing mapping
as an internal error, but it uses its result for debug output only, so
let's relax the code to not assume the mapping is available.

Fixes #23407

Closes scylladb/scylladb#24614
2025-07-01 11:33:20 +03:00
Pavel Emelyanov
26c7f7d98b Merge 'encryption_at_rest_test: Fix some spurious errors' from Calle Wilund
Fixes #24574

* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so that even if an exception escapes the test framework, we properly stop the network proxy to avoid a use-after-free.

Closes scylladb/scylladb#24633

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Add exception handler to ensure proxy stop
  encryption: Ensure stopping timers in provider cache objects
2025-07-01 11:33:20 +03:00
Avi Kivity
6aa71205d8 repair: row_level: unstall to_repair_rows_on_wire() destroying its input
to_repair_rows_on_wire() moves the contents of its input std::list
and is careful to yield after each element, but the final destruction
of the input list still deals with all of the list elements without
yielding. This is expensive as not all contents of repair_row are moved
(_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>).

To fix, destroy each row element as we move along. This is safe as we
own the input and don't reference row_list other than for the iteration.
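
A rough Python analogy of the fix, assuming an asyncio event loop in place of the Seastar reactor and a hypothetical `to_wire` coroutine. Each element is removed from the input container as soon as it is converted, so no large container is left to tear down in one non-yielding step:

```python
import asyncio
from collections import deque


async def to_wire(rows: deque) -> list:
    """Convert rows to a wire form, consuming the input as we go."""
    wire = []
    while rows:
        row = rows.popleft()      # the container drops its reference right away
        wire.append(repr(row))    # stand-in for the real conversion
        del row                   # release the element before yielding
        await asyncio.sleep(0)    # yield to the event loop between elements
    return wire


rows = deque([1, 2, 3])
result = asyncio.run(to_wire(rows))
```

After the call the input deque is already empty, so destroying it costs nothing.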

Fixes #24725.

Closes scylladb/scylladb#24726
2025-07-01 11:33:19 +03:00
Pavel Emelyanov
6826856cf8 Merge 'test.py: Fix start 3rd party services' from Andrei Chekun
Move the startup of 3rd party services under a `try` clause to avoid a situation where the main process collapses without stopping the services.
Without this, if something goes wrong during startup, it will not trigger execution of the exit artifacts, so the process will stay around forever.

This functionality is in 2025.2 and can potentially affect jobs, so a backport is needed.

Closes scylladb/scylladb#24734

* github.com:scylladb/scylladb:
  test.py: use unique hostname for Minio
  test.py: Catch possible exceptions during 3rd party services start
2025-07-01 11:33:19 +03:00
Anna Stuchlik
9234e5a4b0 doc: add the SBOM page and the download link
This commit migrates the Software Bill Of Materials (SBOM) page
added to the Enterprise docs with https://github.com/scylladb/scylla-enterprise/pull/5067.

The only difference is the link to the SBOM files - it was Enterprise SBOM in the Enterprise docs,
while here is a link to the ScyllaDB SBOM.

It's a follow-up to the migration to Source Available and should be backported
to all Source Available versions - 2025.1 and later.

Fixes https://github.com/scylladb/scylladb/issues/24730

Closes scylladb/scylladb#24735
2025-07-01 11:33:19 +03:00
Michael Litvak
c74cbca7cb tablets: save and read tablet metadata of co-located tables
Update the tablet metadata save and read methods to work with tablet
maps of co-located tables.

The new function colocated_tablet_map_to_mutation is used to generate a
mutation of a co-located table to system.tablets. It creates a static
row with the base_table column set with the base table id. The function
save_tablet_metadata is updated to use this function for co-located
tables.

When reading tablet metadata from the table, we handle the new case of
reading a co-located table. We store the co-located tables relationships
in the tablet_metadata_builder's `colocated_tables` map, and process it
in on_end_of_stream. The reason we defer the processing is that we want
to set all normal tablet maps first, to ensure the base tablet map is
found when we process a co-located table.
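
The deferred processing can be sketched as a two-phase reader. This is a simplified Python illustration with a hypothetical `TabletMetadataBuilder` class and `on_row` method; only the `colocated_tables` map and the `on_end_of_stream` step are taken from the description above:

```python
class TabletMetadataBuilder:
    """Hypothetical sketch of a two-phase tablet metadata reader."""

    def __init__(self):
        self.tablet_maps = {}        # table id -> tablet map
        self.colocated_tables = {}   # co-located table id -> base table id

    def on_row(self, table, base_table=None, tablet_map=None):
        if base_table is not None:
            # Defer: the base table's map may not have been read yet.
            self.colocated_tables[table] = base_table
        else:
            self.tablet_maps[table] = tablet_map

    def on_end_of_stream(self):
        # All normal maps are set by now, so every base lookup succeeds.
        for table, base in self.colocated_tables.items():
            self.tablet_maps[table] = self.tablet_maps[base]
        return self.tablet_maps


builder = TabletMetadataBuilder()
builder.on_row("view", base_table="base")          # arrives before its base
builder.on_row("base", tablet_map={"tablet_count": 4})
maps = builder.on_end_of_stream()
```

Even though the co-located table is read first, it ends up pointing at the base table's map.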
2025-07-01 10:29:59 +03:00
Michael Litvak
ddfe5dfb6b tablets: represent co-located tables in tablet metadata
Modify tablet_metadata to be able to represent co-located tables.

The new method set_colocated_table adds to tablet_metadata a table which
is co-located with another table. A co-located table shares the tablet
map object with the base table, so we just create a copy of the shared
tablet map pointer and store it as the co-located table's tablet map.

Whenever a tablet map is modified we update the pointer for all the
co-located tables accordingly, so the tablet map remains shared.

We add some data structures to tablet_metadata to be able to work with
co-located table groups efficiently:
* `_table_groups` maps every base table to all tables in its
  co-location group. This is convenient for iterating over all table
  groups, or finding all tables in some group.
* `_base_table` maps a co-located table to its base table.
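
A simplified Python sketch of these structures. The field names `_table_groups` and `_base_table` and the method `set_colocated_table` come from the message above; everything else is an assumption, including `set_tablet_map` and the choice to list the base table inside its own group:

```python
class TabletMetadata:
    """Sketch only; the real class is C++ and holds shared pointers."""

    def __init__(self):
        self._tablets = {}       # table -> shared tablet map object
        self._table_groups = {}  # base table -> tables in its group (assumed to include the base)
        self._base_table = {}    # co-located table -> base table

    def set_tablet_map(self, table, tablet_map):
        # Re-point every table in the group so the map stays shared.
        for member in self._table_groups.setdefault(table, [table]):
            self._tablets[member] = tablet_map

    def set_colocated_table(self, table, base):
        self._tablets[table] = self._tablets[base]   # share, don't copy
        self._table_groups[base].append(table)
        self._base_table[table] = base

    def get_tablet_map(self, table):
        return self._tablets[table]


meta = TabletMetadata()
meta.set_tablet_map("base", {"tablet_count": 4})
meta.set_colocated_table("view", "base")
new_map = {"tablet_count": 8}
meta.set_tablet_map("base", new_map)   # e.g. after a resize
```

After the update, both `"base"` and `"view"` resolve to the same new map object.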
2025-07-01 10:29:59 +03:00
Michael Litvak
4777444024 tablets: add base_table column to system.tablets
Add a new column base_table to the system.tablets table.

It can be set to point to another table to indicate that the tablets of
this table are co-located with the tablets of the base table.

When it's set, we don't store other tablet information in system.tablets
and in the in-memory tablet map object for this table, and we need to
refer instead to the base table tablet information. The method
get_tablet_map always returns the base tablet map.
2025-07-01 10:29:59 +03:00
Michael Litvak
4e2742a30b docs: update system.tablets schema
The schema of system.tablets in the docs is outdated. Replace it with
the current schema.
2025-07-01 10:29:59 +03:00
Dawid Mędrek
d4315e4fae cql3/statements/describe_statement.cc: Don't copy descriptions
In an upcoming commit, `cql3::description` is going to become
a move-only type. These changes are a prerequisite for it:
we get rid of all places in the file where we copy its instances
and start moving them instead.
2025-06-30 19:12:14 +02:00
Dawid Mędrek
9472da3220 cql3: Use managed_bytes instead of bytes in DESCRIBE
This is a prerequisite for a following commit. We want
to move towards using non-contiguous memory chunks
to avoid making large allocations.

This commit does NOT change the behavior of Scylla
at all. The rows corresponding to the result of a DESCRIBE
statement are represented by an instance of `result_set`.
Before these changes, we encoded descriptions using `bytes`
and then passed them into a `result_set` using its method
`add_row`. What it does is turn the instances of `bytes`
into instances of `managed_bytes` and append them at the
end of its internal vector. In these changes, we do it
on our own and use another overload of the method.
2025-06-30 19:12:14 +02:00
Dawid Mędrek
9cc3d49233 utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
Currently, we use `managed_bytes` to represent fragmented sequences of bytes.
In some cases, the type corresponds to generic bytes, while in some other cases
-- to strings of actual text. Because of that, it's very easy to get confused about
what purpose `managed_bytes` serves in a specific piece of code. We should avoid it.

In this commit, we're introducing basic wrappers over `managed_bytes` and
`bytes_ostream` with a promise that they represent UTF-8-encoded strings.
The interfaces of those types are pretty basic, but they should be sufficient
for the most common use: filling a stream with characters and then extracting
a fragmented buffer from it.
2025-06-30 19:12:08 +02:00
Calle Wilund
8d37e5e24b encryption_at_rest_test: Add exception handler to ensure proxy stop
If a boost test is run such that an exception somehow escapes, even from a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.
2025-06-30 11:36:38 +00:00
Calle Wilund
ee98f5d361 encryption: Ensure stopping timers in provider cache objects
utils::loading cache has a timer that can, if we're unlucky, be running
while the encryption context/extensions referencing the various host
objects containing them are destroyed in the case of unit testing.

Add a stop phase in encryption context shutdown closing the caches.
2025-06-30 11:36:38 +00:00
Anna Stuchlik
b61641cf57 doc: remove support for Ubuntu 20.04
Fixes https://github.com/scylladb/scylladb/issues/24564

Closes scylladb/scylladb#24565
2025-06-30 12:33:29 +02:00
Evgeniy Naydanov
8c981354a7 test.py: dtest: make auth_test.py run using test.py
As a part of the porting process, remove unused imports and
markers, remove non-next_gating tests and tests marked with
`required_features("!consistent-topology-changes")` marker.

Remove `test_permissions_caching` test because it's too
flaky when running using test.py

Also, make a few execution-time optimizations:
  - remove redundant `time.sleep(10)`
  - use smaller timeouts for CQL sessions

Enable the test in suite.yaml (run in dev mode only)
2025-06-30 10:16:36 +00:00
Evgeniy Naydanov
e30e2345b7 test.py: dtest: rework wait_for_any_log()
Make the `wait_for_any_log()` function work closer to the original
dtest's version: use `ScyllaLogFile.grep()` method instead of
the usage of `ScyllaNode.wait_log_for()` with a small timeout to
have at least one try to find.

Also, add a `max_count` argument to the `.grep()` method as an
optimization.
2025-06-30 10:16:36 +00:00
Evgeniy Naydanov
b5d44c763d test.py: dtest: add support for bootstrap parameter for new_node
Technically, `new_node()`'s `bootstrap` parameter is used to mark a node
as a seed if it's False.  In test.py, the seeds parameter is passed on start of
a node, so save it as the `ScyllaNode.bootstrap` attribute for use in the
`ScyllaNode.start()` method.
2025-06-30 10:16:36 +00:00
Evgeniy Naydanov
d0d2171fa4 test.py: dtest: add generate_cluster_topology() function
Copy generate_cluster_topology() function from tools/cluster_topology.py
module.
2025-06-30 10:16:36 +00:00
Evgeniy Naydanov
28d9cdef1b test.py: dtest: add ScyllaNode.set_configuration_options() method
Implement the method using slightly modified `set_configuration_options()`
method of `ScyllaCluster`.
2025-06-30 10:16:36 +00:00
Evgeniy Naydanov
a1ce3aed44 test.py: pylib/manager_client: support batch config changes
Modify ManagerClient.server_update_config() method to change
multiple config options in one call in addition to one `key: value`
pair.  All internal machinery is converted to take a values dict as a
parameter.  Type hints are adjusted too.
2025-06-30 10:16:36 +00:00
Evgeniy Naydanov
ce9fc87648 test.py: dtest: copy unmodified auth_test.py 2025-06-30 10:06:32 +00:00
Evgeniy Naydanov
702409f7b2 test.py: dtest: add missed markers to pytest.ini
`exclude_errors` and `cluster_options` are used in `audit_test.py`
2025-06-30 10:06:32 +00:00
Andrei Chekun
c6c3e9f492 test.py: use unique hostname for Minio
To avoid the situation where the port is occupied on localhost, use a unique
hostname for Minio.
2025-06-30 12:03:06 +02:00
Andrei Chekun
0ca539e162 test.py: Catch possible exceptions during 3rd party services start
With this change, if something goes wrong while starting the services,
they will still be shut down in the finally clause. Without it, the process can
hang forever.
2025-06-30 12:00:23 +02:00
Petr Gusev
35aba76401 LWT: make cas_shard non-optional in sp::cas
We also make sp::cas_shard function local since it's now
not used directly by sp clients.
2025-06-30 10:37:33 +02:00
Petr Gusev
3d262d2be8 LWT: create cas_shard in select_statement
In this commit we create cas_shard in select_statement
and pass it to the sp::query_result function.
2025-06-30 10:37:33 +02:00
Petr Gusev
736fa05b17 LWT: create cas_shard in modification and batch statements
We create cas_shard before the shard check to protect
against concurrent tablet migrations.
2025-06-30 10:37:33 +02:00
Petr Gusev
7e64852bfd LWT: create cas_shard in alternator
We create cas_shard instance in shard_for_execute(). This implies that
the decision about the correct shard was made using the specific
token_metadata_guard, and it remains valid only as long as the guard
is held.

When forwarding a request to another shard, we keep the original
cas_shard alive. This ensures that the target shard
remains a valid owner for the given token.

Fixes scylladb/scylladb#17399
2025-06-30 10:37:33 +02:00
Petr Gusev
deb7afbc87 LWT: use cas_shard in storage_proxy::cas
Take cas_shard parameter in sp::cas and pass token_metadata_guard down to paxos_response_handler.

We make cas_shard parameter optional in storage_proxy methods
to make the refactoring easier. The sp::cas method constructs a new
token_metadata_guard if it's not set. All call sites pass null
in this commit, we will add the proper implementation in the next
commits.
2025-06-30 10:33:17 +02:00
Petr Gusev
94f0717a1e do_query_with_paxos: remove redundant cas_shard check
The same check is done in the sp::cas method.
2025-06-30 10:33:17 +02:00
Petr Gusev
43c4de8ad1 storage_proxy: add cas_shard class
The sp::cas method must be called on the correct shard,
as determined by sp::cas_shard. Additionally, there must
be no asynchronous yields between the shard check and
capturing the erm strong pointer in sp::cas. While
this condition currently holds, it's fragile and
easy to break.

To address this, future commits will move the capture of
token_metadata_guard to the call sites of sp::cas, before
performing the shard check.

As a first step, this commit introduces a cas_shard class
that wraps both the target shard and a token_metadata_guard
instance. This ensures the returned shard remains valid for
the given tablet as long as the guard is held.
In the next commits, we’ll pass a cas_shard instance
to sp::cas as a separate parameter.
2025-06-30 10:33:17 +02:00
Nadav Har'El
7db5e9a3e9 test/cqlpy: reproducer for decimal parsing with very high exponent
This patch adds tests reproducing issue #24581, where Scylla incorrectly
parsed "decimal"-type literals in CQL with very high exponents, near or
above the 32-bit limit.

For example, 1.1234e-2147483647 was incorrectly read as 1.1234E+2147483649,
while it should be (as we explain in comments in the test) an error.
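
The wrong value in this example is what 32-bit wraparound of the scale would produce, assuming a `decimal` is held as an unscaled integer plus a signed 32-bit scale (the layout used by Java's `BigDecimal`). The Python sketch below only reproduces that arithmetic; the helper names are hypothetical and this is not Scylla's parser:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def literal_scale(literal: str) -> int:
    """Scale of a literal like '1.1234e-5': fraction digits minus exponent."""
    mantissa, _, exponent = literal.lower().partition("e")
    _, _, fraction = mantissa.partition(".")
    return len(fraction) - int(exponent or 0)


def wrap_int32(n: int) -> int:
    """What an unchecked conversion to a signed 32-bit integer yields."""
    return (n + 2**31) % 2**32 - 2**31


def checked_scale(literal: str) -> int:
    """Reject literals whose scale cannot be represented in 32 bits."""
    scale = literal_scale(literal)
    if not INT32_MIN <= scale <= INT32_MAX:
        raise ValueError(f"scale {scale} of {literal!r} does not fit in 32 bits")
    return scale


scale = literal_scale("1.1234e-2147483647")   # 4 + 2147483647 = 2147483651
wrapped = wrap_int32(scale)                   # -2147483645
# Unscaled value 11234 with scale -2147483645 is 11234 * 10**2147483645,
# i.e. 1.1234E+2147483649, the wrong value quoted above.
```

With the range check in `checked_scale`, the same literal is rejected instead of being silently wrapped.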

The tests in this patch failed (in multiple checks) before #24581 was
fixed, and pass after it was fixed.

These tests all pass on Cassandra 3, confirming our understanding on the
limits of "decimal" to be correct. But they fail on Cassandra 4 and 5 due
to a regression https://issues.apache.org/jira/browse/CASSANDRA-20723
in Cassandra, that mistakenly limited "decimal" exponents to just 309.

Refs #24581

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24646
2025-06-30 10:37:13 +03:00
Anna Stuchlik
b7683d0eba doc: remove duplicated content
This commit removes the Non-Reserved CQL Keywords and Reserved CQL Keywords pages-keyword
as that content is already covered on the Appendices page.
Redirections are added to avoid 404s for the removed pages.

In addition, the Appendices page title is extended with "Reserved CQL Keywords and Types"
to help users understand what those appendices are about.

Fixes https://github.com/scylladb/scylladb/issues/24319

Closes scylladb/scylladb#24320
2025-06-30 10:30:13 +03:00
Andrzej Jackowski
c8ab5928a3 test: audit: synchronize audit syslog server
In audit tests, UnixDatagramServer is used to receive audit logs.
This commit introduces a synchronization between the logs receiver and
a function that reads already received logs. Without this, there was
a race condition that resulted in test failures (e.g., audit logs were
missing during assertion check).
2025-06-30 09:19:26 +02:00
Andrzej Jackowski
fcd88e1e54 docs: audit: update syslog audit format to the current one
The documentation of the syslog audit format was not updated when
scylladb#23099 and earlier audit log changes were introduced.
This commit includes the missing update.
2025-06-30 09:19:25 +02:00
Andrzej Jackowski
422b81018d audit: bring back commas to audit syslog
When the audit syslog format was changed in scylladb#23099, commas were
removed. This made the syslog format inconsistent, as LOGIN audit logs
contained commas while other audit logs did not. Additionally, the lack
of commas was not aligned with the audit documentation.

This commit brings back the use of commas in the audit syslog format
to ensure consistency across all types of audit logs.

Fixes: scylladb#24410
2025-06-30 09:19:25 +02:00
Botond Dénes
ee6d7c6ad9 test/boost/memtable_test: only inject error for test table
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors is of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.

Fixes: https://github.com/scylladb/scylladb/issues/24637

Closes scylladb/scylladb#24638
2025-06-30 10:08:49 +03:00
Avi Kivity
07c5edcc30 tools: add patchelf utility
We use patchelf to rewrite the dynamic loader (known as the interpreter)
of the binaries we ship, so we can point to our shipped dynamic loader,
which is compatible with our binaries, rather than rely on the distribution's
dynamic loader, which is likely to be incompatible.

Upstream patchelf is losing compatibility [1] with Linux 5.17 and below.
This change was also picked up by Fedora 42, so we cannot update the
toolchain to that distribution until we have an alternative.

Here we add a minimal patchelf alternative. It was mostly written by
Claude. It is minimal in that it only supports --set-interpreter and
--print-interpreter, and works well enough for our needs. We still use
the original patchelf for --remove-rpath; this reduces our maintenance
needs.

[1] 43b75fbc9f
[2] 4b015255d1

Closes scylladb/scylladb#24695
2025-06-30 07:24:05 +03:00
Avi Kivity
e2cda38b0f Merge 'alternator: improve, document and test table/index name lengths' from Nadav Har'El
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations:
 1. A table name must be up to 222 characters.
 2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters.
 3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters.
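
The three rules above can be written as simple length checks. This is a Python sketch with hypothetical function names; it covers only the length limits listed here and ignores any other name validation Alternator performs:

```python
MAX_NAME_LENGTH = 222  # the limit quoted in the list above


def table_name_fits(table: str) -> bool:
    return len(table) <= MAX_NAME_LENGTH


def gsi_name_fits(table: str, gsi: str) -> bool:
    # Table name + GSI name + 1 must stay within the limit.
    return len(table) + len(gsi) + 1 <= MAX_NAME_LENGTH


def lsi_name_fits(table: str, lsi: str) -> bool:
    # Table name + LSI name + 2 must stay within the limit.
    return len(table) + len(lsi) + 2 <= MAX_NAME_LENGTH
```

For a 200-character table name this leaves at most 21 characters for a GSI name and 20 for an LSI name.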

The first patch documents these existing limitations, improves their testing, and fixes a tiny bug found by one of the tests (where UpdateTable adding a GSI's limit testing is off by one).

The second patch unfortunately shows with a reproducer (issue #24598) this limit of 222 is problematic and we may need to lower it: If a user creates a table of length 222 and then enables Alternator streams, Scylla shuts down on an IO error. This will need to be fixed later, but at least this patch properly documents the existing behavior.

No need to backport this patch - it is a very minor improvement that it is unlikely users care about and there is no potential for harm.

Closes scylladb/scylladb#24597

* github.com:scylladb/scylladb:
  test/alternator: reproducer for streams bug with long table name
  alternator: improve, document and test table/index name lengths
2025-06-29 18:53:48 +03:00
Avi Kivity
b33dd2bd7d Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

Closes scylladb/scylladb#24492

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-06-29 18:18:36 +03:00
Avi Kivity
48d9f3d2e3 Merge 'mutation: check key of inserted rows' from Botond Dénes
Make sure the keys are full prefixes, as is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or deserializing mutations.

Fixes: https://github.com/scylladb/scylladb/issues/24506

Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.

Closes scylladb/scylladb#24497

* github.com:scylladb/scylladb:
  mutation: check key of inserted rows
  compound: optimize is_full() for single-component types
2025-06-29 18:10:17 +03:00
Pavel Emelyanov
ef396ecf7a api: Reserve resulting vector with schema versions
The get_schema_versions handler gets unordered_map from storage service,
then converts it to API returning type, which is a vector. This vector
can be reserved, the final number of elements is known in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24715
2025-06-29 14:37:45 +03:00
Nadav Har'El
e7257b1393 test/alternator: make "run" script use only_rmw_uses_lwt
Originally (since commit c3da9f2), Alternator's functional test suite
(test/alternator) ran "always_use_lwt" write isolation mode. The original
thinking was that we need to exercise this more difficult mode and it's
the most important mode. This mode was originally chosen in
test/alternator/run.

However, starting with commit 76a766c (a year ago), test.py no longer
runs test/alternator/run. Instead, it runs Scylla itself, and the options
for running Scylla appear in test/alternator/suite.yaml, and accidentally
the write isolation mode only_rmw_uses_lwt was chosen there.

The purpose of this patch is to reconcile this difference and use the
same mode in test.py (which CI is using) and test/alternator/run (which
is only used by some developers, during development).

I decided to have this patch change test/alternator/run to use
only_rmw_uses_lwt. As noted above, this is anyway how all Alternator
tests have been running in CI in the past year (through test.py).
Also, the mode only_rmw_uses_lwt makes running the Alternator test
suite slightly faster (52 seconds instead of 58 seconds, on my laptop)
which is always nice for developers.

This patch changes nothing for testing in CI - only manual runs through
test/alternator/run are affected.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-29 13:58:58 +03:00
Nadav Har'El
3fd2493bc9 test/alternator: improve tests for write isolation modes
Before this patch, we had in test_condition_expression.py and
test_update_expression.py some rudimentary tests that the different
write isolation modes behave as expected. Basically, we wanted to test
that read-modify-write (RMW) operations are recognized and forbidden
in forbid_rmw mode, but work correctly in the three other modes.
We only check non-concurrent writes, so the actual write isolation is
NOT checked, just the correctness of non-concurrent writes.

However, since these tests were split across several files, and many
of the tests just ran other existing tests in different write isolation
modes, it was hard to see what exactly was being tested, and what was
missed. And indeed we missed checking some RMW operations, such as
requests with ReturnValues, requests with the older Expected or
AttributeUpdates (only the newer ConditionExpression and UpdateExpression
were tested), and ADD and DELETE operations in UpdateExpression.

So this patch replaces the existing partial tests with a new test file
test_write_isolation.py dedicated to testing all kinds of RMW operations
in one place, and how they don't work in forbid_rmw and do work in
the other modes. Writing all these tests in one place made it easier
to create a really exhaustive test of all the different operations and
optional parameters, and conversely - make sure that we don't test
*unnecessary* things such as different ConditionExpression expressions
(we already have 1800 lines of tests for ConditionExpression, and the
actual content of the condition is unrelated to write isolation modes).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-29 13:58:38 +03:00
Nadav Har'El
50d370f06e test/alternator: reproducer for streams bug with long table name
The two tests in this patch reproduce issue #24598: When enabling
Alternator streams on an Alternator table with a very long name,
such as the maximum allowed name length 222, the result is an
I/O error and a Scylla shutdown.

The two tests are currently marked "skip", otherwise they would
crash the Scylla being tested.

Refs #24598

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-29 11:40:55 +03:00
Nadav Har'El
0ce0b2934f alternator: improve, document and test table/index name lengths
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255
characters each, Alternator currently has different (and lower)
limitations:
 1. A table name must be up to 222 characters.
 2. For a GSI, the sum of the table's and GSI's name length, plus 1,
    must be up to 222 characters.
 3. For an LSI, the sum of the table's and LSI's name length, plus 2,
    must be up to 222 characters.
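
The three rules above can be sketched as plain length checks. This is only an illustration of the arithmetic stated in this message, not Alternator's actual validation code; the function names and the `MAX_NAME_LEN` constant are hypothetical.

```python
# Illustrative sketch only: the helper names and MAX_NAME_LEN are hypothetical,
# the numbers (222, +1 for a GSI, +2 for an LSI) come from the rules above.
MAX_NAME_LEN = 222


def table_name_ok(table: str) -> bool:
    # Rule 1: a table name must be up to 222 characters.
    return len(table) <= MAX_NAME_LEN


def gsi_name_ok(table: str, gsi: str) -> bool:
    # Rule 2: table name length + GSI name length + 1 must be up to 222.
    return len(table) + len(gsi) + 1 <= MAX_NAME_LEN


def lsi_name_ok(table: str, lsi: str) -> bool:
    # Rule 3: table name length + LSI name length + 2 must be up to 222.
    return len(table) + len(lsi) + 2 <= MAX_NAME_LEN
```

Under these rules, a name combination that is one character too long for an LSI can still be valid for a GSI.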

These specific limitations were never documented, so in this patch we
add this information to docs/alternator/compatibility.md.

Moreover, these limitations were only partially tested, so in this patch
we add testing for more cases that we forgot to check - such as length
of LSI names (only GSI were checked before this patch), or adding a
GSI to an existing table. It is important to check all these corner
cases because there is a risk that if we attempt to create a table
without checking its length, we can end up with an I/O error that brings
down Scylla.

In one case - UpdateTable adding a GSI to an existing table - the new
test exposed a trivial bug: Because UpdateTable wants to verify the new
GSI doesn't have the same name as an existing LSI, it mistakenly applied
the LSI's name length limit instead of the GSI's name length limit,
which is one byte less than it should be. So this patch fixes this
trivial bug as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-29 11:40:55 +03:00
Emil Maskovsky
c6307aafd5 test.py: handle cancellation gracefully to avoid TypeError
Previously, if test execution was cancelled, `run_all_tests()` could
return `None`. This caused a `TypeError` when the result was
unconditionally unpacked into `total_tests_pytest, failed_pytest_tests`.

This commit updates the code to handle the cancellation appropriately,
preventing the confusing `TypeError` exception and ensuring clean
cancellation behavior.
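
A minimal sketch of the failure mode and the guard, assuming a simplified synchronous stand-in for `run_all_tests()`. The `cancelled` parameter, the returned numbers and the `summarize()` helper are hypothetical and are not taken from test.py.

```python
from typing import Optional, Tuple


def run_all_tests(cancelled: bool) -> Optional[Tuple[int, int]]:
    # Hypothetical stand-in: returns None on cancellation, as described above.
    if cancelled:
        return None
    return 10, 1  # made-up (total, failed) counts


def summarize(cancelled: bool) -> str:
    result = run_all_tests(cancelled)
    if result is None:
        # Without this guard, the unpacking below raises
        # "TypeError: cannot unpack non-iterable NoneType object".
        return "cancelled"
    total_tests_pytest, failed_pytest_tests = result
    return f"{failed_pytest_tests} of {total_tests_pytest} failed"
```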

Closes scylladb/scylladb#24624
2025-06-27 20:14:35 +03:00
Pavel Emelyanov
23d86ede72 Merge 'audit: introduce debug level logs on happy path' from Dario Mirovic
Audit component defines `audit` logger which it uses only for `error` and `info` logs,
regarding `audit` module initialization and errors during audit log writing.
This change introduces `debug` level logs on the happy path of audit log writes.

Fixes: https://github.com/scylladb/scylladb/issues/23773

No backport needed - this is a small quality-of-life improvement.

Closes scylladb/scylladb#24658

* github.com:scylladb/scylladb:
  audit: change audit test logger level to `debug`
  audit: introduce debug level logs on happy path
2025-06-27 20:10:54 +03:00
Anna Stuchlik
2367330513 doc: remove OSS mention from the SI notes
This commit removes a confusing reference to an Open Source version
from the Local Secondary Indexes page.

Fixes https://github.com/scylladb/scylladb/issues/24668

Closes scylladb/scylladb#24673
2025-06-27 20:07:51 +03:00
Anna Stuchlik
7537f5f260 doc: fix the headings in the Admin Guide
This commit fixes incorrect headings in the Admin Guide and the files
that are included in that guide.

The purpose is to properly organize the content and improve the search,
as well as prevent potential build problems caused by a poor heading organization.

Fixes https://github.com/scylladb/scylladb/issues/24441

Closes scylladb/scylladb#24700
2025-06-27 20:07:09 +03:00
Dario Mirovic
ec6249b581 audit: change audit test logger level to debug
Audit module tests should show the `debug` level messages.
This change makes audit_test.py `audit` module log level to `debug`.

Closes scylladb/scylladb#23773
2025-06-27 16:27:33 +02:00
Dario Mirovic
666364f651 audit: introduce debug level logs on happy path
Audit component defines `audit` logger which it uses only for `error` and `info` logs,
regarding `audit` module initialization and errors during audit log writing.
This change introduces `debug` level logs on the happy path of audit log writes.

Ref: scylladb/scylladb#23773
2025-06-27 16:27:27 +02:00
Botond Dénes
495f607e73 test/cluster/test_read_repair: write 100 rows in trace test
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).

Fixes: #24330

Closes scylladb/scylladb#24403
2025-06-27 16:23:08 +03:00
Pavel Emelyanov
4c0154f156 Merge 'test.py: enhance allure reporting' from Andrei Chekun
Add a run ID to the process output file name so that it is not overwritten in the following case: the first run fails and the second passes. Both runs use the same file name, so the second run overwrites and then deletes the file. This helps investigate C++ test failures.
Attach Scylla log files to the allure report when a test fails. This is an alternative to the link in the JUnit report that exists in CI, and it helps investigate cluster test failures. An example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/).

Backport is not needed, these are only framework enhancements

Closes scylladb/scylladb#24677

* github.com:scylladb/scylladb:
  test.py: Attach node logs in allure report in case of fail
  test.py: Add run id to the boost output file
2025-06-27 16:22:03 +03:00
Botond Dénes
e715a150b9 tools/scylla-nodetool: backup: add --move-files parameter
Allow opting in for backup to move the files instead of copying them.

Fixes: https://github.com/scylladb/scylladb/issues/24372

Closes scylladb/scylladb#24503
2025-06-27 16:21:39 +03:00
Piotr Dulikowski
9d70e7a067 Merge 'docs: document the new recovery procedure' from Patryk Jędrzejczak
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

The new recovery procedure requires the Raft-based topology to be
enabled, so to remove the old procedure from the documentation,
we must assume users have the Raft-based topology enabled.
We can do it in 2025.2 because the upgrade guides to 2025.1 state that
enabling the Raft-based topology is a mandatory step of the upgrade.
Another reminder is the upgrade guides to 2025.2.

Since we rely on the Raft-based topology being enabled, we remove the
obsolete parts of the documentation.

We will make the Raft-based topology mandatory in the code in the
future, hopefully in 2025.3. For this reason, we also don't touch the
dev docs in this PR.

Fixes scylladb/scylladb#24530

Requires backport to 2025.2 because 2025.2 contains the new recovery
procedure.

Closes scylladb/scylladb#24583

* github.com:scylladb/scylladb:
  docs: rely on the Raft-based topology being enabled
  docs: handling-node-failures: document the new recovery procedure
2025-06-26 17:07:37 +02:00
Gleb Natapov
5f953eb092 storage_proxy: retry paxos repair even if repair write succeeded
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless of whether the write back succeeded. This
is how the code worked originally but it was unintentionally changed
when co-routinized in 61b2e41a23.

Fixes #24630

Closes scylladb/scylladb#24651
2025-06-26 17:06:02 +02:00
Andrei Chekun
2c726c5074 test.py: Attach node logs in allure report in case of fail
Currently, the allure report has no node logs when a test fails. Attaching
them allows viewing the logs in one place without going anywhere else.
2025-06-26 15:37:33 +02:00
Piotr Dulikowski
2f7ed8b1d4 Merge 'Fix for cassandra role gets recreated after DROP ROLE' from Marcin Maliszkiewicz
This patchset fixes a regression introduced by 7e749cd848, when we started re-creating the default superuser role and password from the config even if a new custom superuser had been created by the user.

Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm
it with CL QUORUM and only then atomically create role or password.

If server is started without cluster quorum we'll skip creating role or password.

Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2

Closes scylladb/scylladb#24451

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for password reset procedure
  auth: cache roles table scan during startup
  test: auth_cluster: add test for replacing default superuser
  test: pylib: add ability to specify default authenticator during server_start
  test: pylib: allow rolling restart without waiting for cql
  auth: split auth-v2 logic for adding default superuser password
  auth: split auth-v2 logic for adding default superuser role
  auth: ldap: fix waiting for underlying role manager
  auth: wait for default role creation before starting authorizer and authenticator
2025-06-26 14:36:25 +02:00
Lakshmi Narayanan Sreethar
279253ffd0 utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.
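
The scale arithmetic described above can be sketched as follows. This is not the `big_decimal` implementation: `parse_unscaled_and_scale()` is a hypothetical helper, and Python's arbitrary-precision `int` stands in for the int64 intermediate value.

```python
from typing import Tuple

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def parse_unscaled_and_scale(text: str) -> Tuple[int, int]:
    # Hypothetical helper, not the real parser: split "1.23E-5" into the
    # mantissa "1.23" and the exponent "-5".
    mantissa, _, exp = text.upper().partition("E")
    exponent = int(exp) if exp else 0  # wide intermediate (stands in for int64)
    int_part, _, frac = mantissa.partition(".")
    unscaled = int(int_part + frac)
    # Dropping the decimal point moves len(frac) digits into the scale.
    scale = len(frac) - exponent
    # Range-check only the final adjusted scale, as the message describes.
    if not INT32_MIN <= scale <= INT32_MAX:
        raise ValueError(f"scale {scale} does not fit in int32")
    return unscaled, scale
```

With this sketch, `1.23E-2147483647` is rejected because its adjusted scale is 2147483649, while `0.01E2147483650` is accepted with an unscaled value of 1 and a scale of -2147483648.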

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640
2025-06-26 15:29:28 +03:00
Patryk Jędrzejczak
203ea5d8f9 docs: rely on the Raft-based topology being enabled
In 2025.2, we don't force enabling the Raft-based topology in the code,
but we stated in the upgrade guides that it's a mandatory step of the
upgrade to 2025.1. We also remind users to enable the Raft-based
topology in the upgrade guides to 2025.2. Hence, in the documentation
we can rely on the Raft-based topology being enabled. If it is
still disabled, we can just send the user to the upgrade guides. Hence:
- we remove all documentation related to enabling the Raft-based
  topology, enabling the Raft-based schema (enabled Raft-based topology
  implies enabled Raft-based schema), and the gossip-based topology,
- we can replace the documentation of the old manual recovery procedure
  with the documentation of the new manual recovery procedure (done in
  the previous commit).
2025-06-26 14:17:54 +02:00
Patryk Jędrzejczak
4e256182a0 docs: handling-node-failures: document the new recovery procedure
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

We can get rid of the old procedure from the documentation because
we requested users to enable the Raft-based topology during upgrades to
2025.1 and 2025.2.

We leave the note that enabling the Raft-based topology is required to
use the new recovery procedure just in case, since we didn't force
enabling the Raft-based topology in the code.
2025-06-26 14:17:50 +02:00
Andrei Chekun
156e7d2e7a test.py: Add run id to the boost output file
To avoid overwriting the test output, add the run id to its file name.
Previously, when the first repeat failed and the second passed, the
output of the first repeat was overwritten and then deleted, because both
repeats used the same output file name and the second repeat passed.
2025-06-26 12:51:15 +02:00
Marcin Maliszkiewicz
5e7ac34822 test: auth_cluster: add test for password reset procedure
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
0ffddce636 auth: cache roles table scan during startup
It may be particularly beneficial during connection
storms on startup. In such cases, it can happen that
none of the user's read requests succeed, preventing
the cache from being populated. This, in turn, makes
it more difficult for subsequent reads to
succeed, reducing resiliency against such storms.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
67a4bfc152 test: auth_cluster: add test for replacing default superuser
This test demonstrates the procedure from the custom superuser guide:
https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
a3bb679f49 test: pylib: add ability to specify default authenticator during server_start
Sometimes we may not want to use default cassandra role for
control connection, especially when we test dropping default role.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
d9ec746c6d test: pylib: allow rolling restart without waiting for cql
Waiting for CQL requires default superuser being present
in db. In some cases we may delete it and still want to do
rolling restart. Additionally if we need CQL we may want to
wait after restart is complete (once, and not for each node).
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
f85d73d405 auth: split auth-v2 logic for adding default superuser password
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
- old code may be less tested now so it's best to not change it
- new code path avoids quorum selects in a typical flow (passwords set)

There may be a case when user deletes a superuser or password
right before restarting a node; in such a case we may omit
updating a password but:
- this is a trade-off between quorum reads on startup
- it's far more important to not update password when it shouldn't be
- if needed password will be updated on next node restart

If there is no quorum on startup we'll skip creating password
because we can't perform any raft operation.

Additionally this fixes a problem when password is created despite
having non default superuser in auth-v2.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
2e2ba84e94 auth: split auth-v2 logic for adding default superuser role
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
  - old code may be less tested now so it's best to not change it
  - new code path avoids quorum selects in a typical flow (roles set)

This fixes a problem when superuser role is created despite
having non default superuser in auth-v2.

If there is no quorum on startup we'll skip creating role
because we can't perform any raft operation.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
c96c5bfef5 auth: ldap: fix waiting for underlying role manager
ldap_role_manager depends on standard_role_manager,
therefore it needs to wait for superuser initialization.
If this is missing, the password authenticator will start
checking the default password too early and may fail to
create the default password if there is no default
role yet.

Currently password authenticator will create password
together with the role in such case but in following
commits we want to separate those responsibilities correctly.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
68fc4c6d61 auth: wait for default role creation before starting authorizer and authenticator
There is a hidden dependency: the creation of the default superuser role
is split between the password authenticator and the role manager.
To work correctly, they must start in the right order: role manager first,
then password authenticator.
2025-06-26 12:28:08 +02:00
Piotr Dulikowski
62efe6616a Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski
The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is <host, shard> pair) processes two tablets at a time.

The new algorithm can be summarized as follows:
 1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space.
 2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each partition range (in such cases, the partition range is assigned to the appropriate tablet_replica).

In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host.

In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch.
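
The two steps can be sketched as below, under heavy simplification. This is not the `mapreduce_service` code: the per-replica loops run sequentially here instead of in parallel, calling the hypothetical `get_owner()` again for each range stands in for releasing and re-acquiring the ERM, and `process()` and `batch_size` are likewise hypothetical names.

```python
from collections import defaultdict, deque


def dispatch_to_tablets(ranges, get_owner, process, batch_size=2):
    # Step 1: build a replica -> partition ranges mapping covering all ranges.
    queues = defaultdict(deque)
    for r in ranges:
        queues[get_owner(r)].append(r)

    # Step 2: each replica handles up to batch_size ranges per iteration.
    # Sequential stand-in for the parallel per-replica loops.
    while any(queues.values()):
        for replica in list(queues):
            batch = []
            while queues[replica] and len(batch) < batch_size:
                r = queues[replica].popleft()
                owner = get_owner(r)  # re-resolved on every iteration
                if owner != replica:
                    # The destination changed: hand the range to its new owner.
                    queues[owner].append(r)
                else:
                    batch.append(r)
            if batch:
                process(replica, batch)
```

The sketch assumes ownership eventually stabilizes; a range whose owner keeps changing would be reassigned indefinitely.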

In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range.

Additionally, shard_id is added to the mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensures that the balance is preserved, even during events such as tablet splits.

This patch series:
 - Refactors the mapreduce service a bit, to facilitate having two algorithm versions (one for vnodes and one for tablets).
 - Implements tablet-aware dispatching algorithm.
 - Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard.
 - Adds test_long_query_timeout_erm to verify the new functionality.

Fixes: scylladb#21831

No backport, as it is a new feature rather than a bugfix.

Closes scylladb/scylladb#24383

* github.com:scylladb/scylladb:
  mapreduce: add missing comma and space in mapreduce_request operator<<
  mapreduce: add shard_id_hint to mapreduce request
  test: add test_long_query_timeout_erm
  mapreduce: add tablet-aware dispatching algorithm
  storage_proxy: make storage_proxy::is_alive public
  mapreduce: remove _shared_token_metadata from mapreduce_service
  mapreduce: move dispatching logic to dispatch_to_vnodes
  mapreduce: remove underscores from variable names
  mapreduce: move req_with_modified_pr handling to a new function
  mapreduce: change next_vnode lambda to get_next_partition_range function
2025-06-26 12:25:39 +02:00
Avi Kivity
947906e6fd Merge 'Make uuid sstable generations mandatory' from Benny Halevy
Before we can eradicate the numerical sstable generations,
this series completes https://github.com/scylladb/scylladb/issues/20337
by disabling the use of numerical sstable generations where we can
and making sure the feature is never disabled.

Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables.

Refs #24248

* Enhancement.  No backport required

Closes scylladb/scylladb#24554

* github.com:scylladb/scylladb:
  feature_service: never disable UUID_SSTABLE_IDENTIFIERS
  test: sstable_move_test: always use uuid sstable generation
  test: sstable_directory_test: always use uuid sstable generation
  sstables: sstable_generation_generator: set last_generation=0 by default
  test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
  test: lib: test_env: always use uuid sstable generation
  test: sstable_test: always use uuid sstable generation
  test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
  test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
  test: sstable_compaction_test: always use uuid sstable generation
2025-06-26 12:25:38 +02:00
Szymon Malewski
f28bab741d utils/exceptions.cc: Added check for exceptions::request_timeout_exception in is_timeout_exception function.
It solves the issue where, in some cases, timeout exceptions in CAS operations are logged incorrectly as a general failure.

Fixes #24591

Closes scylladb/scylladb#24619
2025-06-26 12:25:38 +02:00
Pavel Emelyanov
0f5b358c47 test: Use test sched groups, not database ones
Some tests want to switch between sched groups. For that there's
cql-test-env facility to create and use them. However, there's a test
that uses replica::database as sched groups provider, which is not nice.
Fix it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24615
2025-06-26 12:25:38 +02:00
Avi Kivity
ff508ce82c Merge 'sstables: purge SCYLLA_ASSERT from the sstable read/parse paths' from Botond Dénes
Introduce `sstables::parse_assert()`, to replace `SCYLLA_ASSERT()` on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database; the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced `parse_assert()` uses `on_internal_error()` under the hood, which prints a backtrace and optionally allows for aborting on the error, to generate a coredump.

Fixes https://github.com/scylladb/scylladb/issues/20845

We just hit another case of `SCYLLA_ASSERT()` triggering due to corrupt sstables and bringing down nodes in the field. This should be backported to all releases, so we don't hit this in the future.

Closes scylladb/scylladb#24534

* github.com:scylladb/scylladb:
  sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
  sstables/exceptions: introduce parse_assert()
2025-06-26 12:25:38 +02:00
Ferenc Szili
96267960f8 logging: Add row count to large partition warning message
When writing large partitions, that is: partitions with size or row count
above a configurable threshold, ScyllaDB outputs a warning to the log:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

This warning contains the information about the size of the partition,
but it does not contain the number of rows written. This can lead to
confusion because in cases where the warning was written because of the
row count being larger than the threshold, but the partition size is below
the threshold, the warning will only contain the partition size in bytes,
leading the user to believe the warning was output because of the
partition size, when in reality it was the row count that triggered the
warning. See #20125

This change adds a size_desc argument to cql_table_large_data_handler::try_record(),
which will contain the description of the size of the object written.
This method is used to output warnings for large partitions, row counts,
row sizes and cell sizes. This change does not modify the warning message
for row and cell sizes, only for partition size and row count.

The warning for large partitions and row counts will now look like this:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

Closes scylladb/scylladb#22010
2025-06-26 12:25:38 +02:00
Yaniv Michael Kaul
198ecd8039 Do not perform blkdiscard by default on the disks during RAID setup.
This is not needed on clean disks, which is often the case with cloud instances, but can be useful on bare metal servers with disks that were used before.
Therefore, the default is to skip blkdiscard operation, which makes overall installation faster.
If the user wishes to run it anyway, use the newly introduced --blkdiscard option of scylla_raid_setup to perform it.

Note: since we either perform online discard or schedule fstrim, the (previously used) space will gradually get trimmed, this way or another.

Fixes: https://github.com/scylladb/scylladb/issues/24470
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#24579
2025-06-26 12:25:38 +02:00
Piotr Dulikowski
23f0d275c8 Merge 'generic_server: fix connections semaphore config observer' from Marcin Maliszkiewicz
In ed3e4f33fd we introduced a new connection throttling feature, which is controlled by the uninitialized_connections_semaphore_cpu_concurrency config. But live updating of it was broken; this patch fixes it.

When the temporary value from observer() is destroyed, it disconnects from updateable_value, so observation stops right away. We need to retain the observer.

Backport: to 2025.2 where this feature was added
Fixes: https://github.com/scylladb/scylladb/issues/24557

Closes scylladb/scylladb#24484

* github.com:scylladb/scylladb:
  test: add test for live updates of generic server config
  utils: don't allow do discard updateable_value observer
  generic_server: fix connections semaphore config observer
2025-06-26 12:25:38 +02:00
Andrzej Jackowski
ba6ed45d7f mapreduce: add missing comma and space in mapreduce_request operator<<
This change is introduced to fix the broken formatting of
mapreduce_request `operator<<`. Due to lack of ", " before "cmd"
the output was `reductions=[...]cmd=read_command{...}` instead of
`reductions=[...], cmd=read_command{...}`.
2025-06-25 19:23:07 +02:00
Andrzej Jackowski
26403df9ea mapreduce: add shard_id_hint to mapreduce request
If a partition range is not present locally,
`partition_ranges_owned_by_this_shard` assigns it to shard 0, which can
overload shard 0. To address this, this commit adds a `shard_id_hint`
to the mapreduce request. When `shard_id_hint` is set, the entire
partition range in the request is handled by the specified shard.

The `shard_id_hint` is set by the new tablet-aware mapreduce algorithm,
introduced in `dispatch_to_tablets`. This algorithm balances the
workload across shards, so the changes in this commit ensure that
load balancing is preserved, even during events such as tablet splits.

Fixes: scylladb#21831
2025-06-25 19:23:07 +02:00
Andrzej Jackowski
5f31011111 test: add test_long_query_timeout_erm
This test verifies the effectiveness of the mechanism for releasing ERM
introduced in this patch series. In the test scenario, during processing of
a query in the mapreduce service, reads are intentionally blocked by
an injected error. However, when the table uses tablets, the ERM is now often
released by the mapreduce service, so the topology is not blocked until the
end of the request. As a result, it is possible to add a new node
before the query finishes.

Refs. scylladb#21831
2025-06-25 19:22:48 +02:00
Robert Bindar
6e7cab5b45 Add repository layout dev documentation
This change adds an md file which gives a high
level overview of the scylladb repository, the
components each path contains and a basic description
for each one of them. This is mainly intended for
onboarding engineers to help get a mental picture when
starting ramping up on Scylla concepts.

Refs #22908

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23010
2025-06-25 13:58:05 +03:00
Patryk Jędrzejczak
cc8c618356 Merge 'LWT for tablets: fix paxos state for intranode migration' from Petr Gusev
This PR fixes the "intra-node tablet migration" issue from the [LWT over tablets spec](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.uk3mizf7gvs1). We make `get_replica_lock` acquire locks on both shards to avoid races. We also implement read_repair for paxos state -- if `load_paxos_state` returns different states on two shards, we 'repair' it by choosing the values with the maximum timestamp and writing the 'repaired' state to both shards.
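
The read-repair rule described above can be sketched as follows. This is only an illustration in Python, not the actual C++ code: the `Stamped` and `PaxosState` types, their field names, and the use of a plain integer timestamp are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stamped:
    """A value tagged with a timestamp (hypothetical stand-in for a ballot)."""
    timestamp: int
    value: Optional[str] = None


@dataclass(frozen=True)
class PaxosState:
    """Hypothetical per-shard paxos state with three timestamped parts."""
    promised: Stamped
    accepted: Stamped
    committed: Stamped


def newest(a: Stamped, b: Stamped) -> Stamped:
    # Pick the entry with the larger timestamp; ties keep the first one.
    return a if a.timestamp >= b.timestamp else b


def repair(s1: PaxosState, s2: PaxosState) -> PaxosState:
    # Merge field by field, keeping the newest value of each part.
    # The caller would then write the result back to both shards.
    return PaxosState(
        promised=newest(s1.promised, s2.promised),
        accepted=newest(s1.accepted, s2.accepted),
        committed=newest(s1.committed, s2.committed),
    )
```

If both shards return the same state, `repair` returns an equal state, so writing it back is harmless.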

LWT for tablets is not enabled yet. It requires migrating paxos state to colocated tablets, which is blocked on  [this PR](https://github.com/scylladb/scylladb/pull/22906).

Regarding testing:
* We could possibly arrange a test case for the locking commit through some error injection magic. We'll return to this when LWT for tablets is enabled.
* We can't think of a clear test case for the read_repair commit. Any suggestions are welcome (@gleb-cloudius).

Backport: no need, since it's a new feature.

Closes scylladb/scylladb#24478

* https://github.com/scylladb/scylladb:
  paxos_state: read repair for intranode_migration
  paxos_state: fix get_replica_lock for intranode_migration
2025-06-25 11:08:39 +02:00
Sergey Zolotukhin
0d7de90523 Fix regexp in check_node_log_for_failed_mutations
The regexp that was added in https://github.com/scylladb/scylladb/pull/23658 does not work as expected:
`TRACE`, `INFO` and `DEBUG` level messages are not ignored.
This patch corrects the pattern to ensure those log levels are excluded.
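
As an illustration of the intended behaviour, here is a minimal Python sketch of a pattern that skips `TRACE`, `INFO` and `DEBUG` lines. The log line layout (level at the start of the line) and the message text are assumptions for the example; the pattern actually used in `check_node_log_for_failed_mutations` is not shown in this message.

```python
import re

# Negative lookahead: reject lines whose first word is one of the ignored levels.
# The "Failed to apply mutation" text is a hypothetical message used for the sketch.
FAILED_MUTATION = re.compile(
    r"^(?!(?:TRACE|INFO|DEBUG)\b).*Failed to apply mutation"
)


def failed_mutation_lines(log_lines):
    """Return only the lines that report a failed mutation at a relevant level."""
    return [line for line in log_lines if FAILED_MUTATION.search(line)]
```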

Fixes scylladb/scylladb#23688

Closes scylladb/scylladb#23889
2025-06-25 12:00:16 +03:00
Anna Stuchlik
592d45a156 doc: remove references to Open Source from README
This commit removes the references to ScyllaDB Open Source from the README file for documentation.
In addition, it updates the link where the documentation is currently published.

We've removed Open Source from all the documentation, but the README was missed.
This commit fixes that.

Closes scylladb/scylladb#24477
2025-06-25 11:38:46 +03:00
Michał Chojnowski
cace55aaaf test_sstable_compression_dictionaries_basic.py: fix a flaky check
test_dict_memory_limit trains new dictionaries and checks (via metrics)
that the old dictionaries are appropriately cleaned up.
The problem is that the cleanup is asynchronous (because the lifetimes
are handled by foreign_ptr, which sends the destructor call
to the owner shard asynchronously), so the metrics might be
checked a few milliseconds before the old dictionary is cleaned up.

The dict lifetimes are lazy on purpose, the right thing to do is
to just let the test retry the check.
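
A retry of this kind could look like the sketch below. It is a generic helper written for illustration; the name `wait_for`, its parameters, and the synchronous style are assumptions, and the real test may use a different (possibly async) utility.

```python
import time


def wait_for(check, timeout=5.0, period=0.05):
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.2fs" % timeout)
        time.sleep(period)
```

The metrics check is then passed as `check`, so a cleanup that lands a few milliseconds late no longer fails the test.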

Fixes scylladb/scylladb#24516

Closes scylladb/scylladb#24526
2025-06-25 11:30:28 +03:00
Amnon Heiman
51cf2c2730 api/failure_detector.cc: stream endpoints
Previously, get_all_endpoint_states accumulated all results in memory,
which could lead to large allocations when dealing with many endpoints.

This change uses the stream_range_as_array helper to stream the results.
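
The difference between accumulating and streaming can be sketched in Python as below. This only illustrates the idea; it does not reproduce the signature or behaviour of the C++ `stream_range_as_array` helper, and `stream_json_array` is a name invented for the example.

```python
import json


def stream_json_array(items):
    """Yield a JSON array piece by piece instead of building it in memory."""
    yield "["
    first = True
    for item in items:
        if not first:
            yield ","
        first = False
        # Only one element is serialized and held at a time.
        yield json.dumps(item)
    yield "]"
```

A caller writes each yielded chunk to the response as it arrives, so peak memory depends on the largest element rather than on the number of endpoints.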

Fixes #24386

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#24405
2025-06-25 11:28:37 +03:00
Guy Shtub
71ba1f8bc9 docs: update third party driver list with Exandra Elixir driver
Closes scylladb/scylladb#24260
2025-06-25 11:27:03 +03:00
Kefu Chai
e212b1af0c build: add p11-kit's cflags to user_cflags instead of args.user_cflags
Fix an issue introduced in commit 083f7353 where p11-kit's compiler flags were
incorrectly added to `args.user_cflags` instead of `user_cflags`. This created
the following problem:

When using CMake generation mode, these flags were added to `CMAKE_CXX_FLAGS`,
causing them to be passed to all compiler invocations including linking stages
where they were irrelevant.

This change moves p11-kit's cflags to `user_cflags`, which ensures the flags are
correctly included in compilation commands but not in linking commands. This
maintains the proper behavior in the ninja build system while fixing the issue in
the CMake build system.

`args.user_cflags` is preserved for its intended purpose of storing user-specified
compiler flags passed via command line options.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23988
2025-06-25 11:24:09 +03:00
Andrzej Jackowski
ea2bdae45a mapreduce: add tablet-aware dispatching algorithm
The primary goal of this change is to reduce the time during which the
Effective Replication Map (ERM) is retained by the mapreduce service.
This ensures that long aggregate queries do not block topology
operations. As ScyllaDB transitions towards tablets, which simplify
work dispatching, the new algorithm is designed specifically for
tablets.

The algorithm divides work so that each `tablet_replica` (a <host,
shard> pair) processes two tablets at a time. After processing of each
`tablet_replica`, the ERM is released and re-acquired.

The new algorithm can be summarized as follows:
1. Prepare a set of exclusive `partition_ranges`, where each range
   represents one tablet. This set is called `ranges_left`, because it
   contains ranges that still need processing.
2. Loop until `ranges_left` is empty:
   I.  Create `tablet_replica` -> `ranges` mapping for the current ERM
       and `ranges_left`. Store this mapping and the number
       representing current ERM version as `ranges_per_replica`.
   II. In parallel, for each `tablet_replica`, iterate through its
       ranges in `ranges_per_replica`. Independently select up to two
       ranges that still exist in `ranges_left`. Remove each range
       selected for processing from `ranges_left`. Before each iteration,
       verify that the ERM version has not changed. If it has,
       return to Step I.

Steps I and II are exclusive to simplify maintaining `ranges_left` and
`ranges_per_replica`:
 - Step I iterates through `ranges_left` and creates
   `ranges_per_replica`
 - Step II iterates through `ranges_per_replica` and removes processed
   ranges from `ranges_left`

To maintain the exclusivity, the algorithm uses `parallel_for_each` in
Step II, requiring all ongoing `tablet_replica` processing to finish
before returning to Step I.
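
The two-step loop above can be sketched as a small Python simulation. This is not the C++ implementation: replicas are processed sequentially instead of with `parallel_for_each`, tablets are modelled as plain integers, and `get_erm`, `process` and `batch_size` are hypothetical names introduced for the example.

```python
def dispatch_to_tablets(ranges, get_erm, process, batch_size=2):
    """Process every range once, re-reading the ERM when its version changes.

    get_erm() returns (version, owner_of), where owner_of(range) gives the
    tablet_replica that owns the range under that ERM.
    """
    ranges_left = set(ranges)
    while ranges_left:
        # Step I: build the replica -> ranges mapping for the current ERM.
        version, owner_of = get_erm()
        ranges_per_replica = {}
        for r in sorted(ranges_left):
            ranges_per_replica.setdefault(owner_of(r), []).append(r)

        # Step II: each replica handles up to batch_size ranges per iteration.
        for replica, owned in ranges_per_replica.items():
            pending = list(owned)
            while pending:
                if get_erm()[0] != version:
                    break  # ERM changed: stop and go back to Step I
                batch = [r for r in pending[:batch_size] if r in ranges_left]
                pending = pending[batch_size:]
                for r in batch:
                    ranges_left.discard(r)
                if batch:
                    process(replica, batch)
```

Because Step II only removes ranges from `ranges_left` and Step I only reads it, a restart after an ERM change simply rebuilds the mapping for whatever is still unprocessed.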

Currently, each node can handle any partition range, even if the
mapreduce supercoordinator does not retain the ERM and the range is
absent locally. This is because `execute_on_this_shard` creates a new
pager to coordinate the partition range read, including obtaining its
own ERM. However, absent ranges are handled by shard 0, so proper
routing is necessary to avoid overloading shard 0. Thus, in Step II,
the ERM is retained during each `tablet_replica` processing.

The tablet split scenario is not well-handled in this implementation.
After a split, the entire pre-split range is sent to a node hosting
the `tablet_replica` containing the range's `end_token`. The node
will typically not have other tablets in the range, and as mentioned
above, absent ranges are handled by shard 0. As a result,
in such a scenario, shard 0 handles a significant portion of the range.
This issue is addressed later in this patch series by introducing
`shard_id` in `mapreduce_request`.

Ref. scylladb#21831
2025-06-25 10:18:02 +02:00
Kefu Chai
7d4dc12741 build: cmake: Use LINKER: prefix for consistent linker option handling
Previously, we passed dynamic linker options like "-dynamic-linker=..."
directly to the compiler driver with padded paths. This approach created
inconsistency with the build commands generated by `configure.py`.

This change implements a more consistent approach by:
- Using the CMake "LINKER:" prefix to mark options that should be passed
  directly to the linker
- Ensuring Clang properly receives these options via the `-Xlinker` flag

The result is improved consistency between CMake-generated build commands
and those created by `configure.py`, making the build system more
maintainable and predictable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23987
2025-06-25 11:17:15 +03:00
Nadav Har'El
16c1365332 test,alternator: test server-side load balancing with zero-token node
In issue #6527 it was suggested that a zero-token node (a.k.a coordinator-
only node, or data-less node) could serve as a topology-aware Alternator
load balancer - requests could be sent to it and they will be forwarded to
the right node.

This feature was implemented, but we never tested that it actually works
for Alternator requests. So this patch tests this by starting a 5-node
cluster with 4 regular nodes and one zero-token node, and testing that
requests to the zero-token node work as expected.

It is important to know that this feature does indeed work as expected,
and also to have a regression test for it so the feature doesn't break
in the future.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23114
2025-06-25 11:13:15 +03:00
Pablo Idiaquez
8137f34424 docs: troubleshooting/report-scylla-problem.rst: fix upload URL
The URL/hostname was wrong: it pointed to a deprecated S3 bucket
(we now use a GCP bucket for uploads).

Fixes scylladb/scylladb#24639
Closes scylladb/scylladb#23533
2025-06-25 10:32:37 +03:00
Andrzej Jackowski
6d358cd7b2 storage_proxy: make storage_proxy::is_alive public
The motivation is to allow other components (specifically mapreduce
service) to use the method, just as storage_proxy::get_live_endpoints.
2025-06-25 08:59:04 +02:00
Andrzej Jackowski
9dbb1468b4 mapreduce: remove _shared_token_metadata from mapreduce_service
Before this change, `mapreduce_service` used `_shared_token_metadata`
to get the topology. However, the token was used in a part of the code
that already had its own ERM with its own metadata token. Moreover,
as mapreduce_service's token and ERM's token are not guaranteed to be
the same, inconsistencies could occur.

Therefore, this commit removes `_shared_token_metadata` and its usage.
2025-06-25 08:42:16 +02:00
Andrzej Jackowski
94ce5a0ed6 mapreduce: move dispatching logic to dispatch_to_vnodes
This commit moves the current dispatching logic of the mapreduce service
to a new dispatch_to_vnodes function. The moved code was written before
tablets were introduced, and although it works with tablets,
the variable naming still refers to vnodes (e.g., vnodes_per_addr,
vnodes_generator).

The motivation for this change is that later in this patch series,
a new algorithm for tablets is introduced, and both algorithms
need to coexist.

Ref. scylladb#21831
2025-06-25 08:42:03 +02:00
Andrzej Jackowski
48aced87f5 mapreduce: remove underscores from variable names
This commit removes unnecessary underscores from tr_state_ and
dispatcher_ variable names, that were left after moving code to
a separate function in the previous commit.
2025-06-25 08:41:21 +02:00
Andrzej Jackowski
d238a2f73e mapreduce: move req_with_modified_pr handling to a new function
The motivation for this change is to enable code reuse when
a new implementation of the mapreduce algorithm for tablets
is introduced later in this patch series.

Ref. scylladb#21831
2025-06-25 08:40:02 +02:00
Aleksandra Martyniuk
0deb9209a0 test: rest_api: fix test_repair_task_progress
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.

Wait until at least one child of a root repair task is created.

Fixes: #24556.

Closes scylladb/scylladb#24560
2025-06-25 09:08:06 +03:00
Botond Dénes
edc2906892 test/boost/sstable_datafile_test: add test for corrupt data
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data
2025-06-25 08:41:29 +03:00
Botond Dénes
592ca789e2 sstables/mx/writer: handler rows with empty keys
Although valid for compact tables, non-full (or empty) clustering key
prefixes are not handled for row keys when writing sstables. Only the
present components are written, consequently if the key is empty, it is
omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full
prefix. This mis-match results in parsing failures, as the parser parses
part of the row content as a key resulting in a garbage key and
subsequent mis-parsing of the row content and maybe even subsequent
partitions.

Use the recently introduced corrupt_data_handler to handle rows with
such corrupt keys. This way, we avoid corrupting the sstables beyond
parsing and the rows are also kept around in system.corrupt_data for
later inspection and possible recovery.
2025-06-25 08:41:29 +03:00
Botond Dénes
aae212a87c test/lib/cql_assertions: introduce columns_assertions
To enable targeted and optionally typed assertions against individual
columns in a row.
2025-06-25 08:41:29 +03:00
Botond Dénes
ebd9420687 sstables: add corrupt_data_handler to sstables::sstables
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
2025-06-25 08:41:26 +03:00
Botond Dénes
46ff7f9c12 tools/scylla-sstable: make large_data_handler a local
No reason for it to be a global, not even convenience.
2025-06-25 08:35:19 +03:00
Andrei Chekun
d81e0d0754 test.py: pytest c++ facades should respect saving logs on success
BoostFacade and UnitFacade saved logs only when a test failed,
ignoring the -s parameter that should allow saving logs on success. This
PR adds a check for this parameter.

Closes scylladb/scylladb#24596
2025-06-24 20:53:32 +03:00
Botond Dénes
3e1c50e9a7 db: introduce corrupt_data_handler
Similar to large_data_handler, this interface allows sstable writers to
delegate the handling of corrupt data.
Two implementations are provided:
* system_table_corrupt_data_handler - saves corrupt data in
  system.corrupt_data, with a TTL=10days (non-configurable for now)
* nop_corrupt_data_handler - drops corrupt data
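
The shape of this interface can be sketched in Python as below. The class and method names are hypothetical and simplified; the real handlers are C++ classes that write to `system.corrupt_data`, which the sketch replaces with an in-memory list.

```python
from abc import ABC, abstractmethod


class CorruptDataHandler(ABC):
    """Interface an sstable writer delegates corrupt data to (sketch)."""

    @abstractmethod
    def record(self, table, fragment):
        """Handle one corrupt fragment belonging to the given table."""


class NopCorruptDataHandler(CorruptDataHandler):
    def record(self, table, fragment):
        # Drop the corrupt data without storing it anywhere.
        return None


class SystemTableCorruptDataHandler(CorruptDataHandler):
    TTL_SECONDS = 10 * 24 * 3600  # 10 days, as stated in the commit message

    def __init__(self):
        self.rows = []  # stand-in for the system.corrupt_data table

    def record(self, table, fragment):
        # Keep the fragment for later inspection, tagged with its TTL.
        self.rows.append((table, fragment, self.TTL_SECONDS))
```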
2025-06-24 14:57:00 +03:00
Botond Dénes
b931145a26 mutation: introduce frozen_mutation_fragment_v2
Mirrors frozen_mutation_fragment and shares most of the underlying
serialization code, the only exception is replacing range_tombstone with
range_tombstone_change in the mutation fragment variant.
2025-06-24 11:05:31 +03:00
Botond Dénes
64f8500367 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
Instead of mutation_fragment, let caller convert into mutation_fragment.
Allows reuse in future callers which will want to convert to
mutation_fragment_v2.
2025-06-24 11:05:31 +03:00
Botond Dénes
678deece88 mutation/mutation_partition_view: extract de-ser of {clustering,static} row
From the visitor in frozen_mutation_fragment::unfreeze(). We will want
to re-use it in the future frozen_mutation_fragment_v2::unfreeze().

Code-movement only, the code is not changed.
2025-06-24 11:05:31 +03:00
Botond Dénes
093d4f8d69 idl-compiler.py: generate skip() definition for enums serializers
Currently they only have the declaration and so far they got away with
it, looks like no users exists, but this is about to change so generate
the definition too.
2025-06-24 11:05:31 +03:00
Botond Dénes
b0d5462440 idl: extract full_position.idl from position_in_partition.idl
A future user of position_in_partition.idl doesn't need full_position
and so doesn't want to include full_position.hh to fix compile errors
when including position_in_partition.idl.hh.
Extract it to a separate idl file: it has a single user in a
storage_proxy VERB.
2025-06-24 11:05:30 +03:00
Botond Dénes
0753643606 db/system_keyspace: add apply_mutation()
Allow applying writes in the form of mutations directly to the keyspace.
Allows lower-level mutation API to build writes. Advantageous if writes
can contain large cells that would otherwise possibly cause large
allocation warnings if used via the internal CQL API.
2025-06-24 11:05:30 +03:00
Botond Dénes
92b5fe8983 db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and possibly restored if possible.
2025-06-24 11:05:30 +03:00
Abhinav Jha
5ff693eff6 group0: modify start_operation logic to account for synchronize phase race condition
Currently, a bootstrapping node goes through the synchronize phase after
initializing group0, then enters the post_raft phase and becomes fully ready for
group0 operations. The topology coordinator is unaware of this and issues the stream
ranges command as soon as the node successfully completes `join_group0`. For
a node booting into an already upgraded cluster, the time the node spends
in the synchronize phase is negligible, but this race condition still causes trouble in a
small percentage of cases: the stream ranges operation fails and the node fails to bootstrap.

This commit addresses the issue by updating the error-throwing logic to account for this
edge case: the node now waits (with timeouts) for the synchronize phase to finish instead of
throwing an error.

A regression test is also added to confirm that this change works. The test adds a
wait in the synchronize phase for the newly joining node and releases it only after execution
reaches the synchronize case in the `start_operation` function. This shows that in the
updated code, `start_operation` waits for the node to finish the
synchronize phase instead of throwing an error.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23536

Closes scylladb/scylladb#23829
2025-06-24 10:04:39 +02:00
Botond Dénes
bce89c0f5e sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in progress.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.
2025-06-24 09:16:28 +03:00
Botond Dénes
27e26ed93f sstables/exceptions: introduce parse_assert()
To replace SCYLLA_ASSERT on the read/parse path. SSTables can get
corrupt for various reasons, some outside of the database's control. A
bad SSTable should not bring down the database, the parsing should
simply be aborted, with as much information printed as possible for the
investigation of the nature of the corruption.
The newly introduced parse_assert() uses on_internal_error() under the
hood, which prints a backtrace and optionally allows aborting
on the error to generate a coredump.
2025-06-24 09:15:29 +03:00
Jenkins Promoter
b0a7fcf21b Update pgo profiles - aarch64 2025-06-23 19:20:50 +03:00
Jenkins Promoter
e15e5a6081 Update pgo profiles - x86_64 2025-06-23 19:20:50 +03:00
Marcin Maliszkiewicz
68ead01397 test: add test for live updates of generic server config
Affected config: uninitialized_connections_semaphore_cpu_concurrency
2025-06-23 17:56:26 +02:00
Marcin Maliszkiewicz
45392ac29e utils: don't allow do discard updateable_value observer
If the object returned from observe() is destroyed,
it stops observing, potentially causing subtle bugs.
Typically, the observer object is retained as a class member.
2025-06-23 17:54:01 +02:00
Marcin Maliszkiewicz
c6a25b9140 generic_server: fix connections semaphore config observer
When the temporary value returned by observe() is destructed, it
disconnects from updateable_value, so the code immediately stops
observing.

To fix it we need to retain the observer in the class object.
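
The lifetime issue can be imitated in Python with weak references, as in the sketch below. This is an analogy only: the classes are hypothetical stand-ins for the C++ `updateable_value` and its observer, and the immediate disconnect relies on CPython reference counting.

```python
import weakref


class Observer:
    """Handle whose lifetime controls the subscription (sketch)."""

    def __init__(self, callback):
        self.callback = callback


class UpdateableValue:
    def __init__(self, value):
        self._value = value
        # Weak references: a discarded Observer disappears from the set.
        self._observers = weakref.WeakSet()

    def observe(self, callback):
        obs = Observer(callback)
        self._observers.add(obs)
        return obs  # the caller must keep this object alive

    def set(self, value):
        self._value = value
        for obs in list(self._observers):
            obs.callback(value)
```

Calling `value.observe(cb)` without storing the result loses the subscription, while `self._observer = value.observe(cb)` keeps it for the lifetime of the owning object, which is what retaining the observer in the class achieves.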
2025-06-23 17:54:01 +02:00
Patryk Jędrzejczak
6489308ebc Merge 'Introduce a queue of global topology requests.' from Gleb Natapov
Currently only one global topology request (such as truncate, cdc repair, cleanup and alter table) can be pending. If one is already pending others will be rejected with an error. This is not very user friendly, so this series introduces a queue of global requests which allows queuing many global topology requests simultaneously.

Fixes: #16822

No need to backport since this is a new feature.

Closes scylladb/scylladb#24293

* https://github.com/scylladb/scylladb:
  topology coordinator: simplify truncate handling in case request queue feature is disable
  topology coordinator: fix indentation after the previous patch
  topology coordinator: allow running multiple global commands in parallel
  topology coordinator: Implement global topology request queue
  topology coordinator: Do not cancel global requests in cancel_all_requests
  topology coordinator: store request type for each global command
  topology request: make it possible to hold global request types in request_type field
  topology coordinator: move alter table global request parameters into topology_request table
  topology coordinator: move cleanup global command to report completion through topology_request table
  topology coordinator: no need to create updates vector explicitly
  topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it
  topology coordinator: handle error during new_cdc_generation command processing
  topology coordinator: remove unneeded semicolon
  topology coordinator: fix indentation after the last commit
  topology coordinator: move new_cdc_generation topology request to use topology_request table for completion
  gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag
2025-06-23 16:08:09 +03:00
Aleksandra Martyniuk
9c3fd2a9df nodetool: repair: repair only vnode keyspaces
nodetool repair command repairs only vnode keyspaces. If a user tries
to repair a tablet keyspace, an exception is thrown.

Closes scylladb/scylladb#23660
2025-06-23 16:08:09 +03:00
Avi Kivity
52f11e140f tools: optimized_clang: make it work in the presence of a scylladb profile
optimized_clang.sh trains the compiler using profile-guided optimization
(pgo). However, while doing that, it builds scylladb using its own profile
stored in pgo/profiles and decompressed into build/profile.profdata. Due
to the funky directory structure used for training the compiler, that
path is invalid during the training and the build fails.

The workaround was to build on a cloud machine instead of a workstation -
this worked because the cloud machine didn't have git-lfs installed, and
therefore did not see the stored profile, and the whole mess was averted.

To make this work on a machine that does have access to stored profiles,
disable use of the stored profile even if it exists.

Fixes #22713

Closes scylladb/scylladb#24571
2025-06-23 16:08:09 +03:00
Botond Dénes
ab96c703ff mutation: check key of inserted rows
Make sure the keys are full prefixes as it is expected to be the case
for rows. On several occasions we have seen empty row keys make their
way into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to places where
mutations are built: when building or unserializing mutations.

The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full changes used by this test are rejected by the
new checks.

Fixes: https://github.com/scylladb/scylladb/issues/24506
2025-06-23 09:38:45 +03:00
Botond Dénes
8b756ea837 compound: optimize is_full() for single-component types
For such compounds, unserializing the key is not necessary to determine
whether the key is full or not.
2025-06-23 09:38:45 +03:00
Nadav Har'El
85c19d21bb Merge 'cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Karol Nowacki
cql, schema: Extend name length limit from 48 to 192 bytes

    This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
    The previous 48-byte limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
    and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
    This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

    The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
    When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
    For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
    The directory name for this log table becomes the longest possible representation.
    Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
    To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
      255 bytes (common filesystem limit for a path component)
    -  32 bytes (for the 32-character UUID string)
    -   1 byte  (for the '-' separator)
    -  15 bytes (for the '_scylla_cdc_log' suffix)
    -  15 bytes (reserved for future use)
    ----------
    = 192 bytes (Maximum allowed name length)
    This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).
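
The arithmetic above can be checked with a few lines of Python. The constant and function names are chosen for this sketch and are not taken from the ScyllaDB sources.

```python
FILESYSTEM_NAME_MAX = 255                      # common limit for one path component
UUID_LEN = 32                                  # 32-character UUID string
SEPARATOR_LEN = len("-")                       # dash between name and UUID
CDC_LOG_SUFFIX_LEN = len("_scylla_cdc_log")    # suffix of the CDC log table
RESERVED = 15                                  # kept free for future use

MAX_NAME_LEN = (FILESYSTEM_NAME_MAX - UUID_LEN - SEPARATOR_LEN
                - CDC_LOG_SUFFIX_LEN - RESERVED)


def cdc_log_directory_name_length(table_name):
    """Length of the longest directory name derived from a table name."""
    return len(table_name) + CDC_LOG_SUFFIX_LEN + SEPARATOR_LEN + UUID_LEN
```

With a name of exactly `MAX_NAME_LEN` bytes, the CDC log directory name is 240 bytes long, which leaves the 15 reserved bytes below the 255-byte limit.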

    This patch also updates/adds all associated tests to validate the new 192-byte limit.
    The documentation has been updated accordingly.

Fixes #4480

Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release.

Closes scylladb/scylladb#24500

* github.com:scylladb/scylladb:
  cql, schema: Extend name length limit from 48 to 192 bytes
  replica: Remove unused keyspace::init_storage()
2025-06-22 17:41:10 +03:00
Avi Kivity
770b91447b Merge 'memtable: ensure _flushed_memory doesn't grow above total_memory' from Michał Chojnowski
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus an upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should not be greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emits rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
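
The accounting rule in the last paragraph can be written down as a short Python sketch. It assumes one reading of the text, namely that only a net decrease of `total_memory` since the start of the flush is subtracted; the function and parameter names are invented for the example.

```python
def flushed_memory(reader_estimate, total_at_flush_start, total_now):
    """Reader's estimate, lowered by how much total_memory has shrunk."""
    shrink = max(0, total_at_flush_start - total_now)
    return reader_estimate - shrink


def spooled_memory(flushed):
    """Amount reported to the dirty memory manager; never negative."""
    return max(0, flushed)
```

For example, if the reader has accounted 80 units and `total_memory` dropped from 100 to 70 during the flush, the flushed estimate becomes 50, which stays below the current total of 70.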

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

Closes scylladb/scylladb#21638

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-06-22 11:19:25 +03:00
Michał Chojnowski
975e7e405a memtable: ensure _flushed_memory doesn't grow above total memory usage
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus an upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in ~flush_memory_accounter which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emits rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
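
The accounting rule described above can be sketched roughly as follows. This is
a simplified model, not the actual memtable code: the class and method names are
hypothetical, and "if it was positive" is read here as "only a net decrease of
`total_memory` since the beginning of the flush is subtracted".

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical, simplified model of one memtable's flush accounting.
// None of these names are taken from the ScyllaDB sources.
class flush_accounting_sketch {
    std::int64_t _total_at_flush_start;
    std::int64_t _total_memory;
    std::int64_t _reader_estimate = 0;
public:
    explicit flush_accounting_sketch(std::int64_t total_memory)
        : _total_at_flush_start(total_memory), _total_memory(total_memory) {}

    // Called as data passes through the flush reader.
    void on_flushed(std::int64_t bytes) { _reader_estimate += bytes; }

    // Called for every change of the region's total memory,
    // e.g. with a negative delta when a segment is deallocated.
    void on_total_memory_change(std::int64_t delta) { _total_memory += delta; }

    std::int64_t total_memory() const { return _total_memory; }

    // Reader's estimate, decreased by the net shrinkage of total memory
    // since the beginning of the flush (growth is ignored).
    std::int64_t flushed_memory() const {
        std::int64_t decrease =
            std::max<std::int64_t>(0, _total_at_flush_start - _total_memory);
        return _reader_estimate - decrease;
    }

    // The amount reported as "spooled" is clamped so it is never negative.
    std::int64_t spooled_memory() const {
        return std::max<std::int64_t>(0, flushed_memory());
    }
};
```

In this model, as long as the reader's estimate does not exceed the total at the
start of the flush, the flushed amount stays at or below the current total
regardless of when the memory is released.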
2025-06-20 11:42:30 +02:00
Michał Chojnowski
7d551f99be replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want only the memtable to listen
for the notifications and pass them, already adjusted accordingly,
to the manager, so that there are no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
2025-06-20 11:42:30 +02:00
Łukasz Paszkowski
a9a53d9178 compaction_manager: cancel submission timer on drain
The `drain` method cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method must be called.

This, however, triggers an assertion failure, because the submission timer
is not cancelled and re-enabling the manager tries to arm the armed timer.

Thus, cancel the timer when the `drain` method is called to disable
the compaction manager.
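
A toy model of the failure mode and the fix, under the assumption that arming an
already armed timer trips an assertion. The classes below are hypothetical
stand-ins, not the real timer or compaction manager:

```cpp
#include <cassert>

// Hypothetical timer: arming it twice without a cancel in between asserts.
class timer_sketch {
    bool _armed = false;
public:
    void arm() {
        assert(!_armed && "timer is already armed");
        _armed = true;
    }
    void cancel() { _armed = false; }
    bool armed() const { return _armed; }
};

// Hypothetical manager with an enable/drain life cycle.
class compaction_manager_sketch {
    timer_sketch _submission_timer;
    bool _enabled = false;
public:
    void enable() {
        _submission_timer.arm();
        _enabled = true;
    }
    void drain() {
        // The fix: cancel the timer so that a later enable() can arm it again.
        _submission_timer.cancel();
        _enabled = false;
    }
    bool enabled() const { return _enabled; }
    bool timer_armed() const { return _submission_timer.armed(); }
};
```

Without the `cancel()` call in `drain()`, the sequence enable, drain, enable
would hit the assertion in `arm()`.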

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected. So it's a good candidate for a backport.

Closes scylladb/scylladb#24505
2025-06-20 11:33:49 +03:00
Nadav Har'El
70f5a6a4d6 test/cqlpy: fix run-cassandra script to ignore CASSANDRA_HOME
As test/cqlpy/README.md explains, the way to tell the run-cassandra
script which version of Cassandra should be run is through the
"CASSANDRA" variable, for example:

    CASSANDRA=$HOME/apache-cassandra-4.1.6/bin/cassandra \
    test/cqlpy/run-cassandra test_file.py::test_function

But all the Cassandra scripts, of all versions, have one strange
feature: If you set CASSANDRA_HOME, then instead of running the
actual Cassandra script you tried to run (in this case, 4.1.6), the
Cassandra script goes to run the other Cassandra from CASSANDRA_HOME!
This means that if a user happens to have, for some reason, set
CASSANDRA_HOME, then the documented "CASSANDRA" variable doesn't work.

The simple fix is to clear CASSANDRA_HOME in the environment that
run-cassandra passes to Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24546
2025-06-20 11:31:02 +03:00
Anna Stuchlik
17eabbe712 doc: improve the tablets limitations section
This PR improves the Limitations and Unsupported Features section
for tablets, as it has been confusing to the customers.

Refs https://github.com/scylladb/scylla-enterprise/issues/5465

Fixes https://github.com/scylladb/scylladb/issues/24562

Closes scylladb/scylladb#24563
2025-06-20 11:28:38 +03:00
Gleb Natapov
e364995e28 api: return error from get_host_id_map if gossiper is not enabled yet.
Token metadata api is initialized before gossiper is started.
get_host_id_map REST endpoint cannot function without the fully
initialized gossiper though. The gossiper is started deep in
the join_cluster call chain, but if we move token_metadata api
initialization after the call it means that no api will be available
during bootstrap. This is not what we want.

Make a simple fix by returning an error from the api if the gossiper is
not initialized yet.

Fixes: #24479

Closes scylladb/scylladb#24575
2025-06-20 11:27:28 +03:00
Andrei Chekun
392a7fc171 test.py: Fix the boost output file name
The output file name for boost tests does not include run_id, so each subsequent
run overwrites the logs from the previous one. If the first repeat fails and the
second passes, the failed log is overwritten. This PR preserves the
failed one.

Closes scylladb/scylladb#24580
2025-06-20 11:26:16 +03:00
Asias He
c5a136c3b5 storage_service: Use utils::chunked_vector to avoid big allocation
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

Closes scylladb/scylladb#24561
2025-06-19 16:51:01 +03:00
Andrei Chekun
fcc2ad8ff5 test.py: Fix test result are overwritten
Currently, CI uses several nodes to execute the different modes to
reduce the overall execution time. When the results are copied from the
nodes to the main job, test reports are overwritten, since they use
the same directory and the same name. This patch makes it possible to
distinguish these results so that they are not overwritten.

Closes scylladb/scylladb#24559
2025-06-19 16:51:01 +03:00
Pavel Emelyanov
dc166be663 s3: Mark claimed_buffer constructor noexcept
It just std::move-s a buffer and a semaphore_units object. Both moves
are noexcept, so the constructor is too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24552
2025-06-18 20:36:45 +03:00
Avi Kivity
c89ab90554 Merge 'main: don't start maintenance auth service if not enabled' from Marcin Maliszkiewicz
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all not eol versions since 6.0 and 2025.1

Closes scylladb/scylladb#24527

* github.com:scylladb/scylladb:
  test: add test for live updates of permissions cache config
  main: don't start maintenance auth service if not enabled
2025-06-18 20:28:53 +03:00
Karol Nowacki
4577c66a04 cql, schema: Extend name length limit from 48 to 192 bytes
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-byte limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
  255 bytes (common filesystem limit for a path component)
-  32 bytes (for the 32-character UUID string)
-   1 byte  (for the '-' separator)
-  15 bytes (for the '_scylla_cdc_log' suffix)
-  15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).
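
The arithmetic above can be written as a compile-time check. This is only an
illustrative sketch: the constant and function names are hypothetical and are
not taken from the patch.

```cpp
#include <cstddef>
#include <string_view>

// All names below are illustrative; only the numbers come from the text above.
constexpr std::size_t filesystem_name_limit = 255;  // common path component limit
constexpr std::size_t uuid_length = 32;             // 32-character UUID string
constexpr std::size_t separator_length = 1;         // the '-' separator
constexpr std::string_view cdc_log_suffix = "_scylla_cdc_log";
constexpr std::size_t reserved_length = 15;         // reserved for future use

constexpr std::size_t max_name_length =
    filesystem_name_limit - uuid_length - separator_length
    - cdc_log_suffix.size() - reserved_length;

static_assert(cdc_log_suffix.size() == 15);
static_assert(max_name_length == 192);

// True if a name of the given length is within the limit.
constexpr bool fits_name_limit(std::size_t name_length) {
    return name_length <= max_name_length;
}

// Length of the directory name for the CDC log table of a table whose
// name has the given length: name + suffix + '-' + UUID.
constexpr std::size_t cdc_log_directory_name_length(std::size_t name_length) {
    return name_length + cdc_log_suffix.size() + separator_length + uuid_length;
}

// Even the longest allowed name leaves the reserved bytes unused.
static_assert(cdc_log_directory_name_length(max_name_length) + reserved_length
              == filesystem_name_limit);
```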

This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.
2025-06-18 14:08:38 +02:00
Karol Nowacki
a41c12cd85 replica: Remove unused keyspace::init_storage()
This function was declared but had no implementation or callers. It is being removed as minor code cleanup.
2025-06-18 14:08:38 +02:00
Petr Gusev
45f5efb9ba paxos_state: read repair for intranode_migration
A replica is not marked as 'pending' during intranode_migration.
The sp::get_paxos_participants returns the same set of endpoints
as before or after migration. No 'double quorum' means the replica
should behave as a single paxos acceptor. This is done by making
sure that the state on both shards is the same
when reading and repairing it before continuing if it is not.
2025-06-18 12:11:32 +02:00
Petr Gusev
583fb0e402 paxos_state: fix get_replica_lock for intranode_migration
Suppose a replica gets two requests at roughly the same time for
the same key. The requests are coming from two different LWT
coordinators, one is holding tablet_transition_stage::streaming erm,
another - tablet_transition_stage::write_both_read_new erm. The read
shard is different for these requests, so they don't wait for each other in
get_replica_lock. The first request reads the state, the second request
does the whole RMW for paxos state and responds to its coordinator, then
the first request blindly overwrites the state -- the effects of the
second request are lost.

In this commit we fix this problem by taking the lock on both shards,
starting from the smaller shard ID to the larger one, to avoid
deadlocks.
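
The ordering rule can be illustrated with a thread-based analogy. This sketch
uses `std::mutex` and hypothetical names; the real code works with per-shard
locks inside an asynchronous framework, not with OS mutexes.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Acquisition order for two shards: always the smaller shard ID first.
inline std::pair<unsigned, unsigned> lock_order(unsigned shard_a, unsigned shard_b) {
    return {std::min(shard_a, shard_b), std::max(shard_a, shard_b)};
}

// Hypothetical guard holding the locks of both shards (or one lock if the
// two shards are the same). Locks are released when the guard is destroyed.
class two_shard_guard {
    std::unique_lock<std::mutex> _first;
    std::unique_lock<std::mutex> _second;
public:
    two_shard_guard(std::vector<std::mutex>& locks, unsigned shard_a, unsigned shard_b) {
        auto [lo, hi] = lock_order(shard_a, shard_b);
        _first = std::unique_lock<std::mutex>(locks[lo]);
        if (hi != lo) {
            _second = std::unique_lock<std::mutex>(locks[hi]);
        }
    }
    // Number of locks currently held by this guard.
    unsigned held() const {
        return (_first.owns_lock() ? 1u : 0u) + (_second.owns_lock() ? 1u : 0u);
    }
};
```

Because every request acquires the locks in the same global order, two requests
that need the same pair of shards cannot each hold one lock while waiting for
the other.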
2025-06-18 12:11:32 +02:00
Petr Gusev
aa970bf2e4 sp::cas_shard: rename to get_cas_shard
We intend to introduce a separate cas_shard
class in the next commits. We rename the existing
function here to avoid conflicts.
2025-06-18 11:51:48 +02:00
Petr Gusev
85eac7c34e token_metadata_guard: a topology guard for a token
Data-plane requests typically hold a strong pointer to the
effective_replication_map (ERM) to protect against tablet
migrations and other topology operations. This works because
major steps in the topology coordinator use global barriers.
These barriers install a new token_metadata version on
each shard and wait for all references to the old one to
be dropped. Since the ERM holds a strong pointer to
token_metadata, it effectively blocks these operations
until it's released.

For LWT, we usually deal with a single token within a
single tablet. In such cases, it's enough to block
topology changes for just that one tablet. The existing
tablet_metadata_guard class already supports this: it tracks
tablet-specific changes and updates the ERM pointer
automatically, unless the change affects the guarded
tablet. However, this only works for tablet-aware tables.

To support LWT with vnodes (i.e., non-tablet-aware tables),
this commit introduces a new token_metadata_guard class.
It wraps tablet_metadata_guard when the table uses tablets,
and falls back to holding a plain strong ERM pointer otherwise.

In the next commits, we’ll migrate LWT to use token_metadata_guard
in paxos_response_handler instead of erm.
2025-06-18 11:51:48 +02:00
Petr Gusev
73221aa7b1 tablet_metadata_guard: mark as noncopyable and nonmoveable
tablet_metadata_guard passes a raw pointer to get_validity_abort_source,
so it can't be easily copied or moved. In this commit we make this
explicit.

We define the destructor in the cpp file -- the autogenerated one complains about
lw_shared_ptr<replica::table> as replica::table is only
forward-declared in the headers.
2025-06-18 11:50:46 +02:00
Marcin Maliszkiewicz
dd01852341 test: add test for live updates of permissions cache config 2025-06-18 11:27:08 +02:00
Marcin Maliszkiewicz
97c60b8153 main: don't start maintenance auth service if not enabled
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.
2025-06-18 11:27:08 +02:00
Botond Dénes
da1a3dd640 Merge 'test: introduce upgrade tests to test.py, add a SSTable dict compression upgrade test' from Michał Chojnowski
This PR adds an upgrade test for SSTable compression with shared dictionaries, and adds some bits to pylib and test.py to support that.

In the series, we:
1. Mount `$XDG_CACHE_DIR` into dbuild.
2. Add a pylib function which downloads and installs a released ScyllaDB package into a subdirectory of `$XDG_CACHE_DIR/scylladb/test.py`, and returns the path to `bin/scylla`.
3. Add new methods and params to the cluster manager, which let the test start nodes with historical Scylla executables, and switch executables during the test.
4. Add a test which uses the above to run an upgrade test between the released package and the current build.
5. Add `--run-internet-dependent-tests` to `test.py` which lets the user of `test.py` skip this test (and potentially other internet-dependent tests in the future).

(The patch modifying `wait_for_cql_and_get_hosts` is a part of the new test — the new test needs it to test how particular nodes in a mixed-version cluster react to some CQL queries.)

This is a follow-up to #23025, split into a separate PR because the potential addition of upgrade tests to `test.py` deserved a separate thread.

Needs backport to 2025.2, because that's where the tested feature is introduced.

Fixes #24110

Closes scylladb/scylladb#23538

* github.com:scylladb/scylladb:
  test: add test_sstable_compression_dictionaries_upgrade.py
  test.py: add --run-internet-dependent-tests
  pylib/manager_client: add server_switch_executable
  test/pylib: in add_server, give a way to specify the executable and version-specific config
  pylib: pass scylla_env environment variables to the topology suite
  test/pylib: add get_scylla_2025_1_executable()
  pylib/scylla_cluster: give a way to pass executable-specific options to nodes
  dbuild: mount "$XDG_CACHE_HOME/scylladb"
2025-06-18 12:21:21 +03:00
Benny Halevy
7c867b308f feature_service: never disable UUID_SSTABLE_IDENTIFIERS
The config option is unused since 6da758d74c

Refs #10459
Refs #20337

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
ecc7272a07 test: sstable_move_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
49ca442e7c test: sstable_directory_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
15bee9f232 sstables: sstable_generation_generator: set last_generation=0 by default
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
079c5fe5e3 test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
f0f7c83705 test: lib: test_env: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
0310a03de6 test: sstable_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
b00b805da6 test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
Which is `true` by default anyhow.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
f644c5896f test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
bfa0bb78f9 test: sstable_compaction_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Avi Kivity
2177ec8dc1 gdb: adjust unordered container accessors for libstdc++15
In libstdc++15, the internal structure of an unordered container
hashtable node changed from _M_storage._M_storage.__data to just
_M_storage._M_storage (though the layout is the same). Adjust
the code to work with both variants.

Closes scylladb/scylladb#24549
2025-06-18 09:15:03 +03:00
Michał Chojnowski
27f66fb110 test/boost/mutation_reader_test: fix a use-after-free in test_fast_forwarding_combined_reader_is_consistent_with_slicing
The contract in mutation_reader.hh says:

```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
    future<> fast_forward_to(const dht::partition_range& pr) {
```

`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.

Fix that.
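
A minimal sketch of the lifetime contract, using hypothetical stand-in types
rather than the real `mutation_reader` and `dht::partition_range`. The deleted
rvalue overload is an extra compile-time guard added for illustration; it is
not claimed to be part of the actual fix.

```cpp
// Hypothetical stand-in for a partition range: [start, end).
struct range_sketch {
    int start;
    int end;
};

// Hypothetical reader that, like the contract above, keeps referring to the
// range passed to fast_forward_to() instead of copying it.
class reader_sketch {
    const range_sketch* _pr = nullptr;
public:
    // pr needs to be valid until the reader is destroyed or
    // fast_forward_to() is called again.
    void fast_forward_to(const range_sketch& pr) { _pr = &pr; }

    // Illustration only: reject temporaries at compile time.
    void fast_forward_to(range_sketch&&) = delete;

    bool in_range(int key) const {
        return _pr && key >= _pr->start && key < _pr->end;
    }
};
```

The caller keeps the range in a named variable that outlives the reader's use
of it, for example `range_sketch pr{10, 20}; reader.fast_forward_to(pr);`,
instead of passing `range_sketch{10, 20}` directly.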

Fixes scylladb/scylladb#24542

Closes scylladb/scylladb#24543
2025-06-17 19:30:50 +03:00
Anna Stuchlik
648d8caf27 doc: add support for z3 GCP
This commit adds support for z3-highmem-highlssd instance types to
Cloud Instance Recommendations for GCP.

Fixes https://github.com/scylladb/scylladb/issues/24511

Closes scylladb/scylladb#24533
2025-06-17 13:50:46 +03:00
Robert Bindar
1dd37ba47a Add dev documentation for manipulating s3 data manually
This patch intends to give an overview of where, when and how we store
data in S3 and provide a quick set of commands
which help gain local access to the data in case there is a need for
manual intervention.

The patch also collects in the same place links/descriptions for all
formats we use in S3.

Fixes #22438

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#24323
2025-06-17 13:21:30 +03:00
Pavel Emelyanov
b0766d1e73 Merge 's3_client: Refactor range class for state validation' from Ernest Zaslavsky
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333

No backport needed since this problem is related only to this change https://github.com/scylladb/scylladb/pull/23880

Closes scylladb/scylladb#24312

* github.com:scylladb/scylladb:
  s3_client: headers cleanup
  s3_client: Refactor `range` class for state validation
2025-06-17 10:34:55 +03:00
Ernest Zaslavsky
e398576795 s3_client: Fix hang in get() on EOF by signaling condition variable
* Ensure _get_cv.signal() is called when an empty buffer is received
* Prevents `get()` from stalling indefinitely while waiting on EOF
* Found when testing https://github.com/scylladb/scylladb/pull/23695

Closes scylladb/scylladb#24490
2025-06-17 10:33:19 +03:00
Calle Wilund
4a98c258f6 http: Add missing thread_local specifier for static
Refs #24447

Patch adding this somehow managed to leave out the thread_local
specifier. While gnutls cert object can be shared across shards
just fine, the actual shared_ptr here cannot, thus we could
cause memory errors.

Closes scylladb/scylladb#24514
2025-06-17 10:23:52 +03:00
Avi Kivity
cd79a8fc25 Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz"
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).

Fixes #24513
2025-06-16 22:38:12 +03:00
Ernest Zaslavsky
1b20e0be4a s3_client: headers cleanup 2025-06-16 16:02:30 +03:00
Ernest Zaslavsky
9ad7a456fe s3_client: Refactor range class for state validation
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.
2025-06-16 16:02:24 +03:00
Pavel Emelyanov
5c2e5890a6 Merge 'test.py: Integrate pytest c++ test execution to test.py' from Andrei Chekun
With the current changes, pytest executes boost tests. Metrics gathering is added to the pytest BoostFacade and UnitFacade so that metrics remain available for C++ tests as before.
Since the boost, raft, unit, and ldap directories aren't executed by test.py itself, their suite.yaml files are renamed to test_config.yaml, which preserves the old way of configuring tests while removing them from execution by test.py.
Pytest executes all modes by itself, so there is a single JUnit report for the C++ tests per run. That means the reports cannot be written to different folders in testlog, so the testlog/report directory is used to store all kinds of reports generated during tests. JUnit reports should be in testlog/report/junit, and Allure reports should be in testlog/report/allure.

**Breaking changes:**
1. Terminal output changed. test.py will run pytest for the following directories: `test/boost`, `test/ldap`, `test/raft`, `test/unit`. `test.py` will pass the pytest output through to the terminal unchanged. When all these tests are finished, `test.py` will continue to show the previous output for the rest of the tests.
2. The format for specifying tests in the C++ test directories mentioned above has changed. It is now a simple path to the file, with extension. For example, instead of `boost/aggregate_fcts_test` you now need to use `test/boost/aggregate_fcts_test.cc`.
3. This PR creates a spike in the number of tests. The previous logic consolidated the boost results from different runs and different modes into one report, so three repeats in three modes (nine test results) were shown in CI as one result. Now nine results are shown, differentiated by mode and run.

**Note:**
Pytest uses the pytest-xdist module to run tests in parallel. The frozen toolchain has this dependency installed; for local use, please install it manually.

Changes for CI: https://github.com/scylladb/scylla-pkg/pull/4949. It will be merged after the current PR is in master. A short disruption is expected until the PR in scylla-pkg is merged.

Fixes: https://github.com/scylladb/qa-tasks/issues/1777

Closes scylladb/scylladb#22894

* github.com:scylladb/scylladb:
  test.py: clean code that isn't used anymore
  test.py: switch off C++ tests from test.py discovery
  test.py: Integrate pytest c++ test execution to test.py
2025-06-16 16:01:37 +03:00
Pavel Emelyanov
0b6532a895 api: Shorten get_simple_states() handler
The handler collects a map<ip, state>, then converts it to a jsonable vector of
helper objects with key and value members. This patch removes the
intermediate map and creates the vector directly. With that change the
handler performs fewer data manipulations and behaves like the
get_all_endpoint_states one.

A very similar change was made in 12420dc644 to the get_host_to_id_map
handler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24456
2025-06-16 15:21:27 +03:00
Tomasz Grabiec
cdb1499898 Merge 'interval: reduce memory footprint' from Avi Kivity
The interval class's memory footprint isn't important for single objects,
but intervals are frequently held in moderately sized collections. In #3335 this
caused a stall. Therefore, reduce interval's memory footprint to lower
allocation pressure.

This series does this by consolidating badly-padded booleans in the object tree
spanned by interval into 5 booleans that are consecutive in memory. This
reduces the space required by these booleans from 40 bytes to 8 bytes.
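
The effect of badly padded booleans can be reproduced with a small sketch. The
struct names and the 64-bit value type are assumptions chosen for illustration;
the real `interval` is a template with a different internal layout.

```cpp
#include <cstdint>

// Nested layout: each boolean sits next to a more strictly aligned member,
// so with 8-byte alignment every boolean costs a full 8 bytes.
struct bound_sketch {
    std::int64_t value;
    bool inclusive;
};
struct optional_bound_sketch {
    bound_sketch bound;
    bool engaged;
};
struct nested_interval_sketch {
    optional_bound_sketch start;
    optional_bound_sketch end;
    bool singular;
};

// Flat layout: the same five booleans placed consecutively in memory.
struct flat_interval_sketch {
    std::int64_t start_value;
    std::int64_t end_value;
    bool start_engaged;
    bool start_inclusive;
    bool end_engaged;
    bool end_inclusive;
    bool singular;
};

// Bytes used beyond the two 64-bit values themselves.
constexpr unsigned long nested_overhead =
    sizeof(nested_interval_sketch) - 2 * sizeof(std::int64_t);
constexpr unsigned long flat_overhead =
    sizeof(flat_interval_sketch) - 2 * sizeof(std::int64_t);

static_assert(sizeof(flat_interval_sketch) < sizeof(nested_interval_sketch));
```

On an ABI where `std::int64_t` is 8-byte aligned, `nested_overhead` is 40 and
`flat_overhead` is 8, which matches the 40-byte and 8-byte figures quoted above.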

perf-simple-query report (with refresh-pgo-profiles.sh for each measurement):

before:

252127.60 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37128 insns/op,   18147 cycles/op,        0 errors)
INFO  2025-06-07 21:00:34,010 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to   1749319231 (min id=4dbed2f4-43c9-11f0-cbc6-87d1a08b4ca4)
246492.37 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37153 insns/op,   18411 cycles/op,        0 errors)
253633.11 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37127 insns/op,   17941 cycles/op,        0 errors)
254029.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37155 insns/op,   17951 cycles/op,        0 errors)
254465.76 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37123 insns/op,   17906 cycles/op,        0 errors)
throughput:
	mean=   252149.75 standard-deviation=3282.75
	median= 253633.11 median-absolute-deviation=1880.17
	maximum=254465.76 minimum=246492.37
instructions_per_op:
	mean=   37137.24 standard-deviation=15.71
	median= 37127.54 median-absolute-deviation=14.45
	maximum=37155.24 minimum=37122.79
cpu_cycles_per_op:
	mean=   18071.19 standard-deviation=212.25
	median= 17950.62 median-absolute-deviation=130.10
	maximum=18411.50 minimum=17906.13

after:

252561.26 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37039 insns/op,   18075 cycles/op,        0 errors)
256876.44 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37022 insns/op,   17785 cycles/op,        0 errors)
257084.38 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37030 insns/op,   17840 cycles/op,        0 errors)
257305.35 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37042 insns/op,   17804 cycles/op,        0 errors)
258088.53 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37028 insns/op,   17778 cycles/op,        0 errors)
throughput:
	mean=   256383.19 standard-deviation=2185.22
	median= 257084.38 median-absolute-deviation=922.16
	maximum=258088.53 minimum=252561.26
instructions_per_op:
	mean=   37032.17 standard-deviation=8.06
	median= 37030.46 median-absolute-deviation=6.44
	maximum=37041.83 minimum=37021.93
cpu_cycles_per_op:
	mean=   17856.60 standard-deviation=124.70
	median= 17804.16 median-absolute-deviation=71.24
	maximum=18075.50 minimum=17777.95

A small improvement is observed in instructions_per_op. It could be random fluctuations in the compiler performance, or maybe the default constructor/destructor of interval are meaningful even in this simple test.
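
The summary figures above can be re-derived from the five throughput samples of each run. The following is a minimal sketch, assuming the reported standard deviation is the sample standard deviation; the helper names `summarize` and `relative_change` are illustrative and not part of perf-simple-query.

```python
import statistics

# Throughput samples (tps) copied from the perf-simple-query output above.
before_tps = [252127.60, 246492.37, 253633.11, 254029.93, 254465.76]
after_tps = [252561.26, 256876.44, 257084.38, 257305.35, 258088.53]


def summarize(samples):
    """Return (mean, median, sample standard deviation) of the samples."""
    return (
        statistics.mean(samples),
        statistics.median(samples),
        statistics.stdev(samples),
    )


def relative_change(before, after):
    """Change from before to after, expressed as a fraction of before."""
    return (after - before) / before


before_mean, before_median, before_sd = summarize(before_tps)
after_mean, after_median, after_sd = summarize(after_tps)

# Mean instructions_per_op values as reported in the two summaries above.
insns_change = relative_change(37137.24, 37032.17)
tps_change = relative_change(before_mean, after_mean)

print(f"throughput mean: {before_mean:.2f} -> {after_mean:.2f} ({tps_change:+.2%})")
print(f"instructions_per_op mean change: {insns_change:+.2%}")
```

Comparing relative changes puts the throughput and instructions_per_op differences on the same scale, which makes it easier to judge them against the run-to-run spread.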

Small performance improvement, so not a backport candidate.

Closes scylladb/scylladb#24232

* github.com:scylladb/scylladb:
  interval: reduce sizeof
  interval: change start()/end() not to return references to data members
  interval: rename start_ref() back to start() (and end_ref() etc).
  interval: rename start() to start_ref() (and end() etc).
  test: wrapping_interval_test: add more tests for intervals
2025-06-16 09:23:56 +02:00
Botond Dénes
898ce98500 db/batchlog_manager: remove unused member _total_batches_replayed
And its getter. There are no users for either.

Closes scylladb/scylladb#24416
2025-06-16 09:37:00 +03:00
Nadav Har'El
847d9c0911 alternator: update documentation that ttl with tablets does work
Our documentation docs/alternator/new-apis.md claims that Alternator TTL
does not work with tablets, due to issue #16567. However, we fixed that
issue in commit de96c28625. So let's drop
the outdated statement that it doesn't work.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24427
2025-06-16 09:36:11 +03:00
Ernest Zaslavsky
2b300c8eb9 s3_client: Improve reporting of S3 client statistics
Revise how we report statistics for `chunked_download_source`. Ensure
metrics for downloaded but unconsumed data are visible, as they do not
contribute to read amplification, which is tracked separately.

Closes scylladb/scylladb#24491
2025-06-16 09:33:57 +03:00
Pavel Emelyanov
9aaa33c15a Merge 'main.cc: fix group0 shutdown order' from Petr Gusev
Applier fiber needs local storage, so before shutting down local storage we need to make sure that group0 is stopped.

We also improve the logs for the case when `gate_closed_exception` is thrown while a mutation is being written.

Fixes [scylladb/scylladb#24401](https://github.com/scylladb/scylladb/issues/24401)

Backport: no backport -- not safe and the problem is minor.

Closes scylladb/scylladb#24418

* github.com:scylladb/scylladb:
  storage_service: test_group0_apply_while_node_is_being_shutdown
  main.cc: fix group0 shutdown order
  storage_proxy: log gate_closed_exception
2025-06-16 09:32:34 +03:00
Amnon Heiman
55b21b01ee alternator/stats.cc, metrics-config.yml: docs fix per-table metrics
This patch updates alternator/stats.cc and the get_description.py
configuration (metrics-config.yml) to restore compatibility with
per-table alternator metrics in the documentation generation process.

Previously, the group name for metrics was selected using an inline
expression like (has_table)? "alternator_table" : "alternator", which
made it difficult to maintain a straightforward mapping in the
configuration file.  With this change, the group name is now assigned to
a variable in alternator/stats.cc, allowing metrics-config.yml to map
group names directly. This makes the configuration easier to maintain
and enables get_description.py to document both global and per-table
metrics correctly.

This is a minimal, targeted fix to get the documentation working again
with the new per-table metrics format.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#24509
2025-06-15 18:06:36 +03:00
Jenkins Promoter
1b5eee6a12 Update pgo profiles - aarch64 2025-06-15 04:57:59 +03:00
Jenkins Promoter
e0c2d591c7 Update pgo profiles - x86_64 2025-06-15 04:44:13 +03:00
Avi Kivity
42d7ae1082 interval: reduce sizeof
An interval object stores five booleans: start()->is_inclusive(),
a boolean since start() itself is an std::optional, two more for
end(), and is_singular(). Due to bad packing, these five booleans
occupy 8 bytes each, for a total of 40 bytes.

Re-pack the interval class by storing those booleans explicitly
close by. Since we lose std::optional's ability to store
a maybe-constructed object, we re-implement it using anonymous
unions and therefore have to implement the 5 special methods.

This helps save space when vectors of intervals are used, as
seen in #3335 for example.
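
The padding effect described above can be approximated outside C++ with `ctypes` structures, which follow the platform's C layout rules. This is only a sketch under assumptions: `c_int64` stands in for a `T` with 8-byte alignment, and the struct names (`Bound`, `OptionalBound`, `LooseInterval`, `PackedInterval`) are illustrative models, not the real class layout.

```python
import ctypes


class Bound(ctypes.Structure):
    # Model of interval_bound<T>: a value plus an "inclusive" flag.
    _fields_ = [("value", ctypes.c_int64), ("inclusive", ctypes.c_bool)]


class OptionalBound(ctypes.Structure):
    # Model of std::optional<interval_bound<T>>: payload plus "engaged" flag.
    _fields_ = [("bound", Bound), ("engaged", ctypes.c_bool)]


class LooseInterval(ctypes.Structure):
    # Booleans are scattered, so each one drags alignment padding with it.
    _fields_ = [
        ("start", OptionalBound),
        ("end", OptionalBound),
        ("singular", ctypes.c_bool),
    ]


class PackedInterval(ctypes.Structure):
    # The same two values and five booleans, with the booleans adjacent.
    _fields_ = [
        ("start_value", ctypes.c_int64),
        ("end_value", ctypes.c_int64),
        ("start_engaged", ctypes.c_bool),
        ("start_inclusive", ctypes.c_bool),
        ("end_engaged", ctypes.c_bool),
        ("end_inclusive", ctypes.c_bool),
        ("singular", ctypes.c_bool),
    ]


# Bytes spent on anything other than the two payload values.
payload = 2 * ctypes.sizeof(ctypes.c_int64)
loose_overhead = ctypes.sizeof(LooseInterval) - payload
packed_overhead = ctypes.sizeof(PackedInterval) - payload

print("loose overhead:", loose_overhead, "packed overhead:", packed_overhead)
```

The exact numbers depend on the platform ABI; the point of the sketch is that the overhead shrinks once the flags sit next to each other.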
2025-06-14 21:29:43 +03:00
Avi Kivity
f3dccc2215 interval: change start()/end() not to return references to data members
We'd like to change the data layout of `interval` to save space.
As a result, start() and end() which return references to data
members must return objects (not references). Since we'd like
to maintain zero-copy for these functions, we change them to
return objects containing references (rather than references
to objects), avoiding copying of potentially expensive objects.

We repurpose the interval_bound class to hold references (by
instantiating it with `const T&` instead of `T`) and provide
converting constructors. To make transform_bounds() retain
zero-copy, we add start() and end() that take *this by
rvalue reference.
2025-06-14 21:26:17 +03:00
Avi Kivity
16fb68bb5e interval: rename start_ref() back to start() (and end_ref() etc).
To reduce noise, rename start_ref() back to its original name start(),
after it was changed in the previous patch to force an audit of all calls.
2025-06-14 21:26:16 +03:00
Avi Kivity
3363bc41e2 interval: rename start() to start_ref() (and end() etc).
We are about to change start() to return a proxy object rather
than a `const interval_bound<T>&`. This is generally transparent,
except in one case: `auto x = i.start()`. With the current implementation,
we'll copy the object referred to and assign it to x. With the planned
implementation, the proxy object will be assigned to `x`, but it
will keep referring to `i`.

To prevent such problems, rename start() to start_ref() and end()
to end_ref(). This forces us to audit all calls, and redirect calls
that will break to new start_copy() and end_copy() methods.
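
The difference between holding a proxy and holding a copy can be shown with a rough Python analogy. Python has no references to data members, so this sketch models the proxy as a small read-through object; `BoundView`, `set_start()` and the integer bounds are hypothetical and exist only for illustration.

```python
class BoundView:
    """Hypothetical proxy: reads through to the interval it was taken from."""

    def __init__(self, owner):
        self._owner = owner

    def value(self):
        return self._owner._start


class Interval:
    def __init__(self, start, end):
        self._start = start
        self._end = end

    def set_start(self, value):
        self._start = value

    def start_ref(self):
        # Returns a proxy that keeps referring to this interval.
        return BoundView(self)

    def start_copy(self):
        # Returns an independent snapshot of the bound.
        return self._start


i = Interval(1, 10)
proxy = i.start_ref()
snapshot = i.start_copy()

# Changing the interval afterwards is visible through the proxy only.
i.set_start(5)
print(proxy.value(), snapshot)
```

Code that stored the result and expected it to stay independent of `i` is the kind of call site that the rename is meant to surface.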
2025-06-14 21:26:16 +03:00
Avi Kivity
674118fd2e test: wrapping_interval_test: add more tests for intervals
In this series, we will make interval manage its memory directly,
specifically it will directly construct and destroy T values that
it contains rather than let std::optional<T> manage those values
itself.

Add tests that expose bugs encountered during development (actually,
review) of this series. The tests pass before the series, fail
with series as it was before fixing, and pass with the series as
it is now.

The tests use a class maybe_throwing_interval_payload that can
be set to throw at strategic locations and exercise all the interesting
interval shapes.
2025-06-14 21:26:14 +03:00
Patryk Jędrzejczak
c4cf95aeb3 Merge 'raft: simplify voter handler code to not pass node references around' from Emil Maskovsky
Refactor the voter handler logic to only pass around node IDs (`raft::server_id`), instead of pairs of IDs and node descriptor references. Node descriptors can always be efficiently retrieved from the original nodes map, which remains valid throughout the calculation.

This change reduces unnecessary reference passing and simplifies the code. All node detail lookups are now performed via the central nodes map as needed.

Additional cleanup has been done:
* removing redundant comments (that just repeat what the code does)
* use explicit comparators for the datacenter and rack information priorities (instead of the comparison operator) to be more explicit about the prioritization

Fixes: scylladb/scylladb#24035

No backport: This change does not fix any bug and doesn't change the behavior, just cleans up the code in master, therefore no backport is needed.

Closes scylladb/scylladb#24452

* https://github.com/scylladb/scylladb:
  raft: simplify voter handler code to not pass node references around
  raft: reformat voter handler for consistent indentation
  raft: use explicit priority comparators for datacenters and racks
  raft: clean up voter handler by removing redundant comments
2025-06-13 19:02:07 +02:00
Anna Stuchlik
e2b7302183 doc: extend 2025.2 upgrade with a note about consistent topology updates
This commit adds a note that the user should enable consistent topology updates before upgrading
to 2025.2 if they didn't do it (for some reason) when previously upgrading to version 2025.1.

Fixes https://github.com/scylladb/scylladb/issues/24467

Closes scylladb/scylladb#24468
2025-06-13 13:54:59 +03:00
Piotr Dulikowski
238fc24800 Merge 'test: dtest: move audit_test.py to test.py' from Andrzej Jackowski
Copied the entire audit_test.py from scylladb/scylla-dtest, so that the file can be removed from scylla-dtest after this patch series is merged. The motivation is to move all audit testing out of dtests, to make it easier to maintain and more reliable.

After audit_test.py was moved from dtests to test.py, some issues that require fixing arose due to differences between the frameworks.

No backport, moving audit_test.py to test.py is a new testing effort.

Closes scylladb/scylladb#24231

* github.com:scylladb/scylladb:
  test: audit: filter out LOGIN and USE audit logs
  test: audit: remove require mark
  test: audit: wait until raft state is applied in test_permissions
  test: audit: fix problems in audit_test.py
  test: dtest: add dict support to populate in scylla_cluster.py
  test: dtest: copied get_node_ip from dtests to scylla_cluster.py
  test: dtest: copy run_rest_api from dtests to cluster.py
  test: dtest: copy run_in_parallel from dtests to data.py
  test: audit: copy unmodified audit_test.py from dtests
2025-06-12 09:03:45 +02:00
Andrei Chekun
570aaa2ecb test.py: clean code that isn't used anymore
Clean code that is not used anymore
2025-06-11 18:29:26 +02:00
Andrei Chekun
9dca7719b1 test.py: switch off C++ tests from test.py discovery
Switch off C++ tests from test.py discovery. With this change, test.py loses
the ability to directly see and run the C++ tests. Instead, it delegates all of
that to pytest.
Since the boost, raft, unit, and ldap directories are no longer executed by
test.py, their suite.yaml files are renamed to test_config.yaml. This preserves
the old way of configuring the tests while removing them from execution by
test.py.
Before this patch, boost tests were visible to both test.py and pytest. So if
test.py was invoked without a test name, it executed the boost tests twice:
once with the test.py executor and once with the pytest executor. When a test
name was given, its form selected the executor. For example,
test/boost/aggregate_fcts_test.cc was executed by pytest, while
boost/aggregate_fcts_test was executed by the test.py executor.
2025-06-11 18:29:26 +02:00
Andrei Chekun
42d9dbe66a test.py: Integrate pytest c++ test execution to test.py
With the current changes, pytest executes the boost tests. Metrics gathering is added to the pytest BoostFacade and UnitFacade
so that metrics are still available for C++ tests, as before.
Since pytest executes all modes by itself, there is a single JUnit report for the C++ tests per run. That means
it is not possible to output the reports into different folders in testlog. So the testlog/report directory is used to store all kinds
of reports generated during tests. JUnit reports should be in testlog/report/junit, Allure reports should be in
testlog/report/allure.
**Breaking changes:**
1. Terminal output changed. test.py will run pytest for the following directories: test/boost, test/ldap, test/raft, test/unit.
test.py will pass the pytest output through to the terminal unchanged. Then, when all these tests are finished, test.py
will continue to show its previous output for the rest of the tests.
2. The way tests in the C++ test directories mentioned above are specified has changed. It is now a plain path to the
file, with extension. For example, instead of boost/aggregate_fcts_test you now need to use test/boost/aggregate_fcts_test.cc
3. This PR creates a spike in the number of test results. The previous logic was to consolidate the boost results from different runs
and different modes into one report. So for three repeats and three modes (nine test results), CI showed one result.
Now it shows nine results, differentiated by mode and run.

Note:
Pytest uses the pytest-xdist module to run tests in parallel. The frozen toolchain has this dependency installed; for local
use, please install it manually.
2025-06-11 18:29:23 +02:00
Tomasz Grabiec
eabc1fa6ff Merge 'tablets: deallocate storage state on end_migration' from Michael Litvak
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
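
The allocation rule described above can be summarised as a small decision function. This is a simplified sketch rather than the actual implementation: the `role` strings and the function name are hypothetical, and the pending-replica case is inferred from the statement that state is deallocated on `revert_migration`.

```python
def needs_storage_group(role, stage):
    """Decide whether a replica should hold a storage group for a tablet.

    role:  "leaving", "pending" or "other" (hypothetical labels).
    stage: tablet transition stage name, or None when not migrating.
    """
    # The leaving replica drops its state once end_migration is reached.
    if role == "leaving" and stage == "end_migration":
        return False
    # The pending replica of an aborted migration drops its state on
    # revert_migration, the stage following cleanup_target.
    if role == "pending" and stage == "revert_migration":
        return False
    # In every other case the tablet may not be fully cleaned up yet.
    return True


# A leaving replica restarted during cleanup still allocates the group,
# and loses it on the transition to end_migration.
print(needs_storage_group("leaving", "cleanup"))
print(needs_storage_group("leaving", "end_migration"))
```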

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfully
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes https://github.com/scylladb/scylladb/issues/23481

backport to all relevant releases since it's a bug that results in a crash

Closes scylladb/scylladb#24393

* github.com:scylladb/scylladb:
  test/cluster/test_tablets: test restart during tablet cleanup
  test: tablets: add get_tablet_info helper
  tablets: deallocate storage state on end_migration
2025-06-11 17:37:02 +02:00
Aleksandra Martyniuk
83c9af9670 test: add test for repair and resize finalization
Add a test that checks that repair does not start while a resize
finalization is ongoing.
2025-06-11 16:17:39 +02:00
Aleksandra Martyniuk
df152d9824 repair: postpone repair until topology is not busy
Currently, repair_service::repair_tablets starts repair if there
are no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization. This may cause
a data race and unexpected behavior.

Start repair when topology is not busy.
2025-06-11 15:38:43 +02:00
Gleb Natapov
c00a0554e0 topology coordinator: simplify truncate handling in case request queue feature is disable
Now that multiple commands can run in parallel, the code that
handles multiple truncates to the same table can be simplified, since
it is now executed only if the request queue feature is disabled and so does
not need to handle the case where a request may be in the queue.
2025-06-11 11:29:33 +03:00
Gleb Natapov
01dd4b7f30 topology coordinator: fix indentation after the previous patch 2025-06-11 11:29:33 +03:00
Gleb Natapov
a9e99d1d3c topology coordinator: allow running multiple global commands in parallel
Now that we have a global request queue, do not check whether there is
already a global request before adding another one. Amend the truncation test that
expects this check explicitly, and add another one that checks that two truncates
can be submitted in parallel.
2025-06-11 11:29:33 +03:00
Gleb Natapov
a0a3a034e0 topology coordinator: Implement global topology request queue
Requests, together with their parameters, are added to the
topology_request tables and the queue of active global requests is
kept in topology state. They are processed one by one by the topology
state machine.
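
A toy model of the queueing scheme described above, under stated assumptions: the class, its fields and the integer request ids are hypothetical, and "processing" a request here only marks it as done.

```python
from collections import deque


class TopologyState:
    """Hypothetical model: a request table plus a FIFO of active global requests."""

    def __init__(self):
        self.topology_requests = {}          # request id -> request record
        self.global_request_queue = deque()  # ids of queued global requests
        self._next_id = 0

    def submit(self, request_type, params):
        # Store the request with its parameters, then queue its id.
        request_id = self._next_id
        self._next_id += 1
        self.topology_requests[request_id] = {
            "type": request_type,
            "params": params,
            "done": False,
        }
        self.global_request_queue.append(request_id)
        return request_id

    def process_next(self):
        # Take the oldest queued request; return None when the queue is empty.
        if not self.global_request_queue:
            return None
        request_id = self.global_request_queue.popleft()
        self.topology_requests[request_id]["done"] = True
        return request_id


state = TopologyState()
first = state.submit("truncate_table", {"table": "ks.t1"})
second = state.submit("truncate_table", {"table": "ks.t2"})
processed = [state.process_next(), state.process_next(), state.process_next()]
print(processed)
```

Keeping the parameters in the request record, rather than in a single shared slot, is what lets several requests wait in the queue at once.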

Fixes: #16822
2025-06-11 11:29:33 +03:00
Andrzej Jackowski
e23d79cb62 test: audit: filter out LOGIN and USE audit logs
LOGIN entries can appear at many points during testing, for example,
when a driver creates a new session. Similarly, `USE ks` statements
can appear unexpectedly, especially when the python-driver calls
`set_keyspace_async` for new connections.

To avoid test check failures, this commit filters out LOGIN and USE
entries in tests that are not intended to verify these two types of
audit logs.
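
The filtering idea can be sketched as follows. The entry shape (a dict with an `operation` string) and the helper names are assumptions made for illustration; they are not taken from audit_test.py.

```python
def is_session_noise(entry):
    """True for LOGIN entries and USE statements."""
    operation = entry["operation"].strip().upper()
    return operation == "LOGIN" or operation.startswith("USE ")


def filter_audit_entries(entries):
    """Drop entries that the driver may generate at unpredictable points."""
    return [entry for entry in entries if not is_session_noise(entry)]


entries = [
    {"operation": "LOGIN"},
    {"operation": "USE ks"},
    {"operation": "INSERT INTO ks.t (pk) VALUES (1)"},
]
filtered = filter_audit_entries(entries)
print(filtered)
```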
2025-06-11 09:43:51 +02:00
Andrzej Jackowski
876eaf459b test: audit: remove require mark
After moving the audit tests from dtests into this repository, require
marks are no longer needed because the tests and the code are in the
same repository.
2025-06-11 09:43:51 +02:00
Marcin Maliszkiewicz
111cccf8ba test: audit: wait until raft state is applied in test_permissions
Otherwise the test is flaky, expecting permissions to be
enforced before they get applied.
2025-06-11 09:43:51 +02:00
Andrzej Jackowski
6c6234979c test: audit: fix problems in audit_test.py
After audit_test.py was moved from dtests to test.py, the
following issues arose due to differences between the frameworks:
  - Some imports were unnecessary or broken
  - The @pytest.mark.dtest_full decorator was no longer needed
  - The `issue_open` attribute in `xmark` is not supported
  - Support for sending SIGHUP is encapsulated
    by `server_update_config` in test.py
  - A workaround for scylladb#24473 was required

Moreover, suite.yaml was changed to start running audit_test.py
in dev mode.

Ref. scylladb#24473

Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
2025-06-11 09:43:44 +02:00
Michał Chojnowski
0ade15df33 transport/server: silence the oversized allocation warning in snappy_compress
It has been observed to generate ~200 kiB allocations.

Since we have already been made aware of that, we can silence the warning
to clean up the logs.

Closes scylladb/scylladb#24360
2025-06-10 19:13:26 +03:00
Petr Gusev
b1050944a3 storage_service: test_group0_apply_while_node_is_being_shutdown 2025-06-10 17:25:03 +02:00
Petr Gusev
6b85ab79d6 main.cc: fix group0 shutdown order
group0 persistence relies on local storage, so before
shutting down local storage we need to make sure that
group0 is stopped.

Fixes scylladb/scylladb#24401
2025-06-10 16:06:22 +02:00
Wojciech Mitros
5eb4466789 Return correct creation date time in describe table
Add a system:table_creation_time tag whose value is the table creation timestamp in milliseconds.
If the tag is present, it is used to fill the creation timestamp value (when CreateTable or DescribeTable is called).
If the tag is missing, the value 0 is substituted for the timestamp (in other words, the table appears to have been created on 1 January 1970).
Update the test to change how we make sure the timestamp is actually used: we create two tables one after another and make sure their creation timestamps are in the correct order.
Update tests that work with tags to filter system tags out.
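
A minimal sketch of the tag handling described above. The function names are illustrative, the conversion to seconds is an assumption about how the value is reported, and treating every tag with a `system:` prefix as a system tag is likewise an assumption.

```python
from datetime import datetime, timezone

CREATION_TIME_TAG = "system:table_creation_time"


def creation_time_seconds(tags):
    """Creation time in seconds since the epoch; 0 when the tag is missing."""
    millis = int(tags.get(CREATION_TIME_TAG, 0))
    return millis / 1000


def user_visible_tags(tags):
    """Filter out system tags (assumed here to share the "system:" prefix)."""
    return {k: v for k, v in tags.items() if not k.startswith("system:")}


tags = {CREATION_TIME_TAG: "1749555957000", "owner": "team-a"}
created = datetime.fromtimestamp(creation_time_seconds(tags), tz=timezone.utc)
missing = datetime.fromtimestamp(creation_time_seconds({}), tz=timezone.utc)

print(created.isoformat(), missing.isoformat(), user_visible_tags(tags))
```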

Fixes #5013

Closes scylladb/scylladb#24007
2025-06-10 15:25:57 +03:00
Nadav Har'El
ed3a0a81d6 test/cqlpy: add some more tests of secondary index system tables
This patch adds a couple of basic tests for system tables related to
secondary indexes - system."IndexInfo" and system_schema.indexes.

I wanted to understand these system tables better when writing
documentation for them - so wrote these tests. These tests can also
serve as regression tests that verify that we don't accidentally lose
support for these system tables. I checked that these tests also pass
in Cassandra 3, 4 and 5.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24137
2025-06-10 15:00:51 +03:00
Tomasz Grabiec
0b516da95b Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change prepares the ground for unifying state updates across raft-bound subsystems. It introduces schema_applier, which in the future will become a generic interface for applying mutations in raft.

Pulling `database::apply()` out of the schema merging code will make it possible to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then a single `database::apply()`, then `update()` on all implementations; then, on each shard, it will call `commit()` for all implementations without preemption, so that the change is observed as atomic across all subsystems, and finally `post_commit()`.
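
The phase ordering described above can be sketched as follows. This is an illustration under assumptions, not the planned interface: `RecordingApplier` and `apply_change` are hypothetical names, shards are modelled as a plain list, and `post_commit()` is assumed to run once after all shards have committed.

```python
class RecordingApplier:
    """Hypothetical subsystem applier that records which phases ran."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def prepare(self):
        self.log.append(f"{self.name}.prepare")

    def update(self):
        self.log.append(f"{self.name}.update")

    def commit(self, shard):
        self.log.append(f"{self.name}.commit@{shard}")

    def post_commit(self):
        self.log.append(f"{self.name}.post_commit")


def apply_change(appliers, database_apply, shards):
    for applier in appliers:
        applier.prepare()
    # One write of all mutations, shared by every subsystem.
    database_apply()
    for applier in appliers:
        applier.update()
    for shard in shards:
        # No yield point inside this inner loop, so a shard never observes
        # one subsystem committed and another one not.
        for applier in appliers:
            applier.commit(shard)
    for applier in appliers:
        applier.post_commit()


log = []
appliers = [RecordingApplier("schema", log), RecordingApplier("tablets", log)]
apply_change(appliers, lambda: log.append("database.apply"), shards=[0, 1])
print(log)
```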

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649

Closes scylladb/scylladb#20853

* github.com:scylladb/scylladb:
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-06-10 13:45:32 +02:00
Ernest Zaslavsky
30199552ac s3_client: Mitigate connection exhaustion in download_source
The existing `download_source` implementation optimizes performance
by keeping the connection to S3 open and draining data directly from
the socket. While this eliminates the overhead (60-100ms) of repeatedly
establishing new connections, it leads to rapid exhaustion of client-
side connections.

On a single shard, two `mx_readers` for load and stream are enough to
trigger this issue. Since each client typically holds two connections,
readers keeping index and data sources open can cause deadlocks where
processes stall due to unavailable connections.

Introduce `chunked_download_source`, a new S3 download method built on
`download_source`, to dynamically manage connections:

- Buffers data in 5MiB chunks using a producer-consumer model
- Closes connections once buffers reach capacity, returning them to
  the pool for other clients
- Uses a filling fiber that resumes fetching once buffers are
  consumed from the queue

Performance remains comparable to `download_source`, achieving
95MiB/s for sequential 1GiB downloads from S3. However, preloading
large chunks may cause read amplification.
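
The producer-consumer scheme described above can be modelled with a bounded `asyncio.Queue`. This is a toy sketch, not the `chunked_download_source` implementation: `FakeObject`, the tiny chunk size and the `None` end-of-stream marker are assumptions made for illustration.

```python
import asyncio

CHUNK_SIZE = 4     # bytes per chunk in this toy model (the commit uses 5MiB)
MAX_BUFFERED = 2   # number of chunks that may wait in the queue


class FakeObject:
    """Hypothetical stand-in for an S3 object; counts opened connections."""

    def __init__(self, data):
        self.data = data
        self.connections_opened = 0

    def open(self):
        self.connections_opened += 1


async def fill(obj, queue):
    """Filling task: fetch while there is room, then release the connection."""
    offset = 0
    while offset < len(obj.data):
        obj.open()  # take a connection for this burst
        while offset < len(obj.data) and not queue.full():
            chunk = obj.data[offset:offset + CHUNK_SIZE]
            queue.put_nowait(chunk)
            offset += len(chunk)
        # The connection is considered returned to the pool at this point.
        if offset < len(obj.data):
            await queue.join()  # resume once buffered chunks are consumed
    await queue.put(None)  # end-of-stream marker


async def consume(queue):
    received = []
    while True:
        chunk = await queue.get()
        queue.task_done()
        if chunk is None:
            break
        received.append(chunk)
    return b"".join(received)


async def download(obj):
    queue = asyncio.Queue(maxsize=MAX_BUFFERED)
    filler = asyncio.create_task(fill(obj, queue))
    data = await consume(queue)
    await filler
    return data


obj = FakeObject(b"0123456789")
result = asyncio.run(download(obj))
print(result, obj.connections_opened)
```

The trade-off is visible even in the toy: the connection is held only during a fetch burst, at the price of reopening it for the next one and of fetching chunks that the consumer may never read.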

Fixes: https://github.com/scylladb/scylladb/issues/23785

Closes scylladb/scylladb#23880
2025-06-10 12:58:24 +03:00
Anna Stuchlik
b0ced64c88 doc: remove the limitation for disabling CDC
This commit removes the instruction to stop all writes before disabling CDC with ALTER.

Fixes https://github.com/scylladb/scylla-docs/issues/4020

Closes scylladb/scylladb#24406
2025-06-10 12:53:09 +03:00
Robert Bindar
ca1a9c8d01 Add support for nodetool refresh --skip-reshape
This patch adds the new option in nodetool, patches the
load_new_ss_tables REST request with a new parameter and
skips the reshape step in refresh if this flag is passed.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#24409
Fixes: #24365
2025-06-10 12:52:13 +03:00
David Garcia
62fdebfe78 chore: exclude OS and ENT from google
Closes scylladb/scylladb#24417
2025-06-10 12:50:37 +03:00
Emil Maskovsky
b7e0a01fcc raft: simplify voter handler code to not pass node references around
Refactor the voter handler logic to only pass around node IDs
(`raft::server_id`), instead of pairs of IDs and node descriptor
references. Node descriptors can always be efficiently retrieved from
the original nodes map, which remains valid throughout the calculation.

This change reduces unnecessary reference passing and simplifies
the code. All node detail lookups are now performed via the central
nodes map as needed.

Fixes: scylladb/scylladb#24035
2025-06-10 11:04:56 +02:00
Emil Maskovsky
839e0bf40d raft: reformat voter handler for consistent indentation
Reformatted the voter handler implementation to comply with clang-format
automatic formatting rules. No functional changes.
2025-06-10 11:04:56 +02:00
Emil Maskovsky
05392e6ef3 raft: use explicit priority comparators for datacenters and racks
Refactor the voter handler to use explicit priority comparator classes
for datacenter and rack selection. This makes the prioritization logic
more transparent and robust, and reduces the risk of subtle bugs that
could arise from relying on implicit comparison operators.
2025-06-10 11:04:54 +02:00
Emil Maskovsky
e93bf3f05a raft: clean up voter handler by removing redundant comments
Remove comments from the group0 voter handler that simply restate
the code or do not provide meaningful clarification. This improves
code readability and maintainability by reducing noise and focusing
on essential documentation.
2025-06-10 11:03:20 +02:00
Calle Wilund
80feb8b676 utils::http::dns_connection_factory: Use a shared certificate_credentials
Fixes #24447

This factory type, which is really more a data holder/connection producer
per connection instance, creates, if using https, a new certificate_credentials
in every instance. When used by the S3 client, there is one instance per client
and scheduling group.

This eventually means that we will do a set_system_trust + "cold" handshake
for every TLS connection created this way.

This causes both IO and cold/expensive certificate checking, and thus possible
stalls/wasted CPU. Since the credentials object in question is literally a
"just trust system" one, it could very well be shared across the shard.

This PR adds a thread local static cached credentials object and uses this
instead. Could consider moving this to seastar, but maybe this is too much.
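
As a loose analogy in Python, where a thread plays the role of a shard, a per-thread cache of a "trust the system store" TLS context looks like the sketch below. The function name is hypothetical; `ssl.create_default_context()` is the standard-library call that loads the system trust store.

```python
import ssl
import threading

_per_thread = threading.local()


def shared_system_trust_context():
    """Return this thread's cached TLS context, creating it on first use."""
    context = getattr(_per_thread, "context", None)
    if context is None:
        # The expensive step (loading system CA certificates) runs once per thread.
        context = ssl.create_default_context()
        _per_thread.context = context
    return context


first = shared_system_trust_context()
second = shared_system_trust_context()

# Another thread gets its own cached object.
from_other_thread = []
worker = threading.Thread(
    target=lambda: from_other_thread.append(shared_system_trust_context())
)
worker.start()
worker.join()

print(first is second, from_other_thread[0] is first)
```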

Closes scylladb/scylladb#24448
2025-06-10 11:20:21 +03:00
Petr Gusev
e456d2d507 storage_proxy: log gate_closed_exception
gate_closed_exception likely signals that we have shutdown order
issues. If we just swallow it, we lose the information about which
exact component was shut down prematurely.

For example, we stopped local storage before group0 during shutdown
in main.cc. If a group0 command arrives, topology_state_load might
try to write something and get mutation_write_failure_exception,
which results in 'applier fiber stopped because of the error'.
There is no other information in the logs in this case, other
than 'mutation_write_failure_exception'. It's not clear what the
original problem is and what component is triggering it.

In this commit we add a warning to the logs when gate_closed_exception
is thrown from lmutate or rmutate.

Another option is to just remove the try_catch_nested line and allow
gate_closed_exception to be logged as an error below. However,
this might break some tests which check ERROR lines in the logs.
2025-06-10 10:04:04 +02:00
Andrzej Jackowski
c4e8a2c44e mapreduce: change next_vnode lambda to get_next_partition_range function
The motivation of this code reorganization is to shorten
the time when ERM is being kept, done later in this patch series.

Ref. scylladb#21831
2025-06-10 09:06:17 +02:00
Michael Litvak
bd88ca92c8 test/cluster/test_tablets: test restart during tablet cleanup
Add a test that reproduces issue scylladb/scylladb#23481.

The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.

This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.
2025-06-09 17:27:45 +03:00
Michael Litvak
fb18fc0505 test: tablets: add get_tablet_info helper
Add a helper for tests to get the tablet info from system.tablets for a
tablet owning a given token.
2025-06-09 16:59:07 +03:00
Michael Litvak
34f15ca871 tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfully
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes scylladb/scylladb#23481
2025-06-09 16:58:38 +03:00
Michael Litvak
8aeb404893 test_cdc_generation_clearing: wait for generations to propagate
In test_cdc_generation_clearing we trigger events that update CDC
generations, verify the generations are updated as expected, and verify
the system topology and CDC generations are consistent on all nodes.

Before checking that all nodes are consistent and have the same CDC
generations, we need to consider that the changes are propagated through
raft and take some time to propagate to all nodes.

Currently, we wait for the change to be applied only on the first server
which runs the CDC generation publisher fiber and read the CDC
generations from this single node. The consistency check that follows
could fail if the change was not propagated to some other node yet.

To fix that, before checking consistency with all nodes, we execute a
read barrier on all nodes so they all see the same state as the leader.

Fixes scylladb/scylladb#24407

Closes scylladb/scylladb#24433
2025-06-09 12:59:04 +02:00
Gleb Natapov
bb29591daf topology coordinator: Do not cancel global requests in cancel_all_requests
This was mistakenly added by fbd75c5c06.
The function is called after checking that no topology request can
proceed, so it cancels them, but this has nothing to do with global
request. Also, for some reason, the cancellation was added in the loop
over topology requests.
2025-06-09 13:38:49 +03:00
Gleb Natapov
be0b328b19 topology coordinator: store request type for each global command 2025-06-09 13:38:49 +03:00
Gleb Natapov
00fd427be0 topology request: make it possible to hold global request types in request_type field
topology_request table has a field to hold a request type, but
currently it can hold only per node requests. This patch makes it
possible to store global request types there as well.
2025-06-09 13:38:49 +03:00
Gleb Natapov
3a496067c6 topology coordinator: move alter table global request parameters into topology_request table
Currently parameters to alter table global topology command are stored
in static column in the topology table, but this way there can be only one
outstanding alter table request. This patch moves the parameters to
the topology_request table where parameters are stored per request.
2025-06-09 13:38:49 +03:00
Gleb Natapov
a9244bf037 topology coordinator: move cleanup global command to report completion through topology_request table
We want to unify all commands to report completion through the
topology_requests table.
2025-06-09 13:38:49 +03:00
Gleb Natapov
6a52ba2251 topology coordinator: no need to create updates vector explicitly 2025-06-09 13:38:49 +03:00
Gleb Natapov
69dacb5894 topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it 2025-06-09 13:38:49 +03:00
Gleb Natapov
7257391c8f topology coordinator: handle error during new_cdc_generation command processing
Currently if there is an error during new_cdc_generation command it is
retried in a loop.  Since the status of the command executing is now
reported through the topology request table we can fail the command
instead.
2025-06-09 13:38:48 +03:00
Gleb Natapov
389f0f6280 topology coordinator: remove unneeded semicolon 2025-06-09 13:38:48 +03:00
Gleb Natapov
ba371c09fc topology coordinator: fix indentation after the last commit 2025-06-09 13:38:48 +03:00
Gleb Natapov
b8c11f330a topology coordinator: move new_cdc_generation topology request to use topology_request table for completion
Currently it checks the completion by waiting for new generation to
appear, but we want to unify all commands to check for completion in
topology_request table.
2025-06-09 13:38:48 +03:00
Gleb Natapov
6d09c76a12 gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag
Will be needed to coordinate between old and new nodes during upgrade.
2025-06-09 13:38:48 +03:00
Anna Stuchlik
93a7146250 doc: add redirections to fix 404
This commit adds redirections for pages on the master branch
that were unexpectedly indexed by Google.
Those pages no longer exist and return 404.

Fixes https://github.com/scylladb/scylladb/issues/24397

Closes scylladb/scylladb#24422
2025-06-09 12:38:10 +02:00
Pavel Emelyanov
46557b3927 table: Touch and sync snapshot directory only once
The table::take_snapshot() touches the snapshot directory, which is
good. It happens on all shards, which is not that good, because all
shards just step on each other's toes when doing it, the directory is not
sharded. Same for post-snapshot directory sync -- it can happen once,
after all shards finish creating snapshot links.

Move both, touching and syncing up one level. There's only one caller of
the method, so only one caller to update.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24154
2025-06-09 13:36:49 +03:00
Michał Chojnowski
7d26d3c7cb db/config: add an option that disables dict-aware sstable compressors in DDL statements
For reasons, we want to be able to disallow dictionary-aware compressors
in chosen deployments.

This patch adds a knob for that. When the knob is disabled,
dictionary-aware compressors will be rejected in the validation
stage of CREATE and ALTER statements.

Closes scylladb/scylladb#24355
2025-06-09 13:30:40 +03:00
Raphael S. Carvalho
2d716f3ffe replica: Fix truncate assert failure
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.

1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.

So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second to be filtered out. It's not perfect but
extremely unlikely a new write lands and gets flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.
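
The granularity effect can be illustrated with a small Python model. It is
a sketch only: `discarded_by_truncate()` is a hypothetical function, and
the "created at or before truncated_at" comparison is an assumption about
how the filter behaves, not a copy of the real code.

```python
def discarded_by_truncate(sstable_created_ms, truncated_at_ms, granularity_ms):
    """Return True if truncate would discard the sstable.

    Both timestamps are rounded down to the clock granularity before the
    comparison, mimicking a coarse clock. Illustrative model only.
    """
    created = sstable_created_ms // granularity_ms
    truncated = truncated_at_ms // granularity_ms
    return created <= truncated


truncated_at = 10_200   # truncate time recorded at 10.200 s
flushed_later = 10_700  # sstable flushed 500 ms later, within the same second

coarse = discarded_by_truncate(flushed_later, truncated_at, 1000)  # second clock
fine = discarded_by_truncate(flushed_later, truncated_at, 1)       # millisecond clock
```

Here `coarse` is True (the later sstable looks as old as the truncation
point) while `fine` is False, which is the distinction the switch to a
millisecond clock relies on.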

Fixes #23771.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24426
2025-06-08 15:59:15 +03:00
Nadav Har'El
a714079a62 Merge 'Add Support for Per-Table Metrics in Alternator' from Amnon Heiman
This series introduces per-table metrics support for Alternator. It includes the following commits:

Add optional per-table metrics for Alternator
Introduces a shared_ptr-based mechanism that allows Alternator to register per-table metrics. These metrics follow the table's lifecycle, similar to how CQL metrics are handled. The use of shared_ptr ensures no direct dependency between table stats and Alternator.

Enable registration of stats objects per table
Adds support for registering a stats object using a keyspace and table name. Per-table metrics are prefixed with alternator_table to differentiate them from per-shard metrics. Metrics are reported once per node, and those not meaningful at the table level (e.g. create/delete) are excluded. All metrics use the skip_when_empty flag.

Update per-table metrics handling
Adds a helper function to retrieve the stats object from a table schema. Updates both per-shard and per-table metrics, resulting in some code duplication.

Add tests for per-table metrics
Extends existing tests to also validate the per-table metrics. These tests ensure that the new metrics are correctly registered and updated.

This series improves observability in Alternator by enabling fine-grained per-table metrics without disrupting existing per-shard metrics.
**No need to backport**

Fixes #19824

Closes scylladb/scylladb#24046

* github.com:scylladb/scylladb:
  alternator/test_metrics.py: Test the per-table metrics
  alternator/executor.cc: Update per-table metrics
  alternator/stats: Add per-table metrics
  replica/database.hh: Add alternator per-table metrics
  alternator/stats.hh: Introduce a per-table stats container
2025-06-08 10:42:05 +03:00
Botond Dénes
8498bd6376 Merge 'Replace container_to_vec with std::ranges' from Pavel Emelyanov
The helper in question converts an iterable collection to a vector of fmt::to_string()-s of the collection elements.
Patch the caller to use standard library and remove the helper.

Closes scylladb/scylladb#24357

* github.com:scylladb/scylladb:
  api: Drop no longer used container_to_vec helper
  api: Use std::ranges to stringify collections
  api: Use std::ranges to convert std::set<sstring> to std::vector<string>
  api: Use db::config::data_file_directories()' vector directly
  api: Coroutinize get_live_endpoint()
2025-06-06 10:57:06 +03:00
Pavel Emelyanov
12420dc644 api: Shorten get_host_to_id_map() handler
The handler does

- gets host IDs from local token metadata
- for each ID gets the host IP and generates IP:ID std::pair
- converts the sequence of generated pairs into std::unordered_map
- converts the unordered map into vector of jsonable key:value objects

This patch removes the 3rd step and makes the needed jsonable object in
step 2 directly, thus eliminating the interposing unordered_map
creation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24354
2025-06-06 10:54:23 +03:00
Pavel Emelyanov
428edd41f5 api: Make use of database::get_all_keyspaces()
There are two places in the API that want to get the list of keyspace
names. For that they call database::get_keyspaces() and then extract
keys from the returned name to class keyspace map.

There's a database::get_all_keyspaces() method that does exactly that.

Remove the map_keys helper from the api/api.hh that becomes unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24353
2025-06-06 10:53:09 +03:00
Marcin Maliszkiewicz
2090e44283 storage_service: always wake up load balancer on update tablet metadata
Lack of wakeup is error-prone, as it relies on a wakeup occurring
elsewhere.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
ddc0656eb5 db: schema_applier: call destroy also when exception occurs
Otherwise objects may be destroyed on wrong shard, and assert
will trigger in ~sharded().
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
547bb1f663 db: replica: simplify seeding ERM during schema change
We know that caller is running on shard 0 so we can avoid some extra boilerplate.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
97cdb72d4d db: remove cleanup from add_column_family
Since we abort now on failure during schema commit
there is no need for cleanup as it only manages in-memory
state.

Explicit cf.stop was added to code paths outside of schema
merging to avoid unnecessary regressions.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
d5075c70ef db: abort on exception during schema commit phase
As we have no way to recover from partial commit.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
858db822dc db: make user defined types changes atomic
The same order of creation/destruction is preserved as in the
original code, looking from single shard point of view.

create_types() is called on each shard separately, while in theory
we should be able to reuse results similarly as diff_rows(). But we
don't introduce this optimization yet.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
5b2e4140cc replica: db: make keyspace schema changes atomic
Now all keyspace related schema changes are observable
on given shard as they would be applied atomically.
This is achieved by commit_on_shard() function being
non-preemptive (no futures, no co_awaits).

In the future we'll extend this to the whole schema
and also other subsystems.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
556e89bc9d db: atomically apply changes to tables and views
In this commit we make use of the split functions introduced before.
Pattern is as follows:
- in merge_tables_and_views we call some preparatory functions
- in schema_applier::update we call non-yielding step
- in schema_applier::post_commit we call cleanups and other finalizing async
  functions
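
The prepare / non-yielding commit / post-commit pattern listed above can
be sketched with asyncio, where "non-yielding" simply means "contains no
await". All names here (`SchemaApplier`, `apply_change`) are made up for
the illustration and do not mirror the C++ interfaces.

```python
import asyncio


class SchemaApplier:
    """Toy three-phase applier; all names here are illustrative."""

    def __init__(self, live_tables):
        self.live_tables = live_tables  # state visible to readers on this shard
        self.pending = None
        self.dropped = []

    async def prepare(self, to_add, to_drop):
        # May yield: load data, take locks, build new objects off to the side.
        await asyncio.sleep(0)
        self.pending = (dict(to_add), list(to_drop))

    def commit(self):
        # No awaits here, so no other task on this event loop can observe
        # a half-applied change.
        to_add, to_drop = self.pending
        for name in to_drop:
            self.dropped.append(self.live_tables.pop(name))
        self.live_tables.update(to_add)
        self.pending = None

    async def post_commit(self):
        # May yield again: release resources of the dropped tables.
        await asyncio.sleep(0)
        self.dropped.clear()


async def apply_change(applier, to_add, to_drop):
    await applier.prepare(to_add, to_drop)
    applier.commit()
    await applier.post_commit()


tables = {"old": object()}
asyncio.run(apply_change(SchemaApplier(tables), {"new": object()}, ["old"]))
```

After the run, `tables` holds only `"new"`: the drop and the add became
visible in a single step from the point of view of this event loop.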

Additionally we introduce frozen_schema_diff because converting
schema_ptr to global_schema_ptr triggers schema registration and
with atomic changes we need to place registration only in commit
phase. Schema freezing is the same method global_schema_ptr uses
to transport schema across shards (via schema_registry cache).
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
a27776b4ff replica: make truncate_table_on_all_shards get whole schema from table_shards
Before, for views and indexes it was fetching base schema from db (and a
couple of other properties). This is a problem once we introduce atomic
tables and views deletion (in the following commit).
Because once we delete table it can no longer be fetched from db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.

Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
ac254e9722 service: split update_tablet_metadata into two phases
In following commits calls will be split in schema_applier.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
21a5a3c01f service: pull out update_tablet_metadata from migration_listener
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
92e3d69f79 db: service: add store_service dependency to schema_applier
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
1c8fd3a65d service: simplify load_tablet_metadata and update_tablet_metadata
- remove load_tablet_metadata(), instead we add wake_up_load_balancer flag
to update_tablet_metadata(), it reduces number of public functions and
also serves as a comment (removed comment with very similar meaning)

- reimplement the code to not use mutate_token_metadata(), this way
it's more readable and it's also needed as we'll split
update_tablet_metadata() in following commits so that we can have
subroutine which doesn't yield (for ensuring atomicity)
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
3119a02edd db: don't perform move on tablet_hint reference
This lambda is called several times so there should be no move.
Currently the bug likely doesn't manifest as code does work
only on shard 0.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
1ad14f02f1 replica: split add_column_family_and_make_directory into steps
This is similar work as for drop_table in previous commit.

add_column_family_and_make_directory() behaves exactly the same
as before but calls to it in schema_applier will be replaced by
calls directly to split steps. Other usages will remain intact as
they don't need atomicity (like creating system tables at startup).
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
141a5643e5 replica: db: split drop_table into steps
This is done so that actual dropping can be
an atomic step which could be composed with other
schema operations, and eventually all subsystems modified
via raft so that we could introduce atomic changes which
span across different subsystems.

We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()

prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.

We add legacy_drop_table_on_all_shards() which
behaves exactly like old drop_table_on_all_shards() to be
compatible with code which doesn't need to play with atomicity.

Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to split functions in the following
commits - that's the place we will take advantage of drop_table not
yielding (as it returns void now).
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
2bae38e252 db: don't move map references in merge_tables_and_views()
Since they are const it's not needed and misleading.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
85f19e165a db: introduce commit_on_shard function
This will be the place for all atomic schema switching
operations.

Note that atomicity is observed only from single shard
point of view. All shards may switch at slightly different times
as global locking for this is not feasible.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
b3730282c3 db: access types during schema merge via special storage
Once we create types atomically the code which is before commit
may depend on newly added types, so it has to access both old and
new types. New storage called in_progress_types_storage was added.
2025-06-06 08:50:33 +02:00
Pavel Emelyanov
f5743c6afc Merge 'test/alternator: make tests runnable on DynamoDB Local' from Nadav Har'El
The Alternator tests should pass on Alternator (of course), and almost always also on DynamoDB to verify that the tests themselves are correct and don't just enshrine Alternator's incorrect behavior. Although much less important, it is sometimes useful to be able to check if the tests also pass on other DynamoDB clones, especially "DynamoDB Local" - Amazon's DynamoDB mock written in Java.

In issue https://github.com/scylladb/scylladb/issues/7775 we noted that some of our tests don't actually pass on DynamoDB Local, for different reasons, but at the time that issue was created most of the tests did work. However, checking now on a newer version of DynamoDB Local (2.6.1), I notice that _all_ tests failed because of some silly reasons that are easy to fix - and this is what the two patches in this series fix. After these fixes, most of the Alternator tests pass on DynamoDB Local. But not all of them - #7775 is still open.

No backport needed - these are just test framework improvements for developers.

Closes scylladb/scylladb#24361

* github.com:scylladb/scylladb:
  test/alternator: any response from healthcheck means server is alive
  test/alternator: fall back to legal-looking access key id
2025-06-06 08:50:58 +03:00
Nadav Har'El
b0f98f7d4b mv: test that view's SELECT automatically includes primary key
Both ScyllaDB's and Datastax's documentation suggest that when creating a
view with CREATE MATERIALIZED VIEW, its SELECT clause doesn't need to list
the view's primary key columns because those are selected automatically.
For example, our documentation has an example in
https://docs.scylladb.com/manual/stable/features/materialized-views.html

```
CREATE MATERIALIZED VIEW building_by_city2 AS
        SELECT meters FROM buildings
        WHERE city IS NOT NULL
        PRIMARY KEY(city, name);
```

Note how the primary key columns - city and name - are not explicitly
SELECTed.

I just discovered that while this behavior was indeed true in Cassandra
3 (and still true in ScyllaDB), it actually got broken in Cassandra 4 and 5.
I reported this apparent regression to Cassandra (CASSANDRA-20701), and am
proposing the regression test in this patch to ensure that Scylla can't
suffer a similar regression in the future.

The new test passes on ScyllaDB and Cassandra 3, but fails on Cassandra
4 and 5 (and therefore tagged with "cassandra_bug").

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24399
2025-06-05 16:52:49 +02:00
Piotr Szymaniak
de96c28625 alternator: Add support for TTL when using tablets
Support for TTL-based data removal when using tablets.
The essence of this commit is a separate code path for finding token
ranges owned by the current shard for the cases when tablets are used
and not vnodes. At the same time, the vnodes-case is not touched not to
cause any regressions.

The TTL-caused data removal is normally performed by the primary
replica (both when using vnodes and tablets). For the tablets case,
the already-existing method tablet_map::get_primary_replica(tablet_id)
is used to know if a shard executing the TTL-related data removal is
the primary replica for each tablet.

A new method tablet_map::get_secondary_replica(tablet_id) has been
added. It is needed by the data invalidation procedure to remove data
when the primary replica node is down - the data is then removed by the
secondary replica node. The mechanism is the same as in the vnodes case.

Since alternator now supports TTL, the test
`test_ttl_enable_error_with_tablets` has been removed.
Also, tests in the test_ttl.py have been made to run twice, once with
vnodes and once with tablets. When run with tablets, due to the lack of
support for LWT with tablets (#18068), tests use
'system:write_isolation' of 'unsafe_rmw'. This approach allows early
regression testing with tablets and is meant only as a tentative
solution.

Fixes scylladb/scylladb#16567

Closes scylladb/scylladb#23662
2025-06-05 17:39:29 +03:00
Amnon Heiman
760c8c3333 alternator/test_metrics.py: Test the per-table metrics
This patch adds tests for the newly added per-table metrics. It mainly
redoes existing tests, but verifies that the per-table metrics are
updated correctly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-06-05 15:12:19 +03:00
Amnon Heiman
3ad7a24eee alternator/executor.cc: Update per-table metrics
This patch adds support for updating per-table metrics. It introduces a
helper function that retrieves the stats object from a table schema.
The code uses a lw_shared_ptr for the stats object to ensure safe updates
even if the table holding it has been deleted.

There is some duplication in the updated code, as both per-shard and
per-table metrics are updated.

The rmw_operation::execute function now accepts two stats objects: one
for the global metrics and one for the per-table metrics.  The use of
execute was also modified—rather than modifying the WCU directly, a
parameter is used so both global and per-table stats can be updated.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-06-05 15:12:13 +03:00
Amnon Heiman
d6afd42342 alternator/stats: Add per-table metrics
This patch allows registering a stats object per table. The per-table
stats object needs its metrics registry to be part of the table's
lifecycle, but there could be a scenario in which a table is already
deleted while some Alternator operations are still in progress.  To
handle this, the patch separates the registry from the metrics holder.
It is safe to modify a parameter that is not registered.

Metrics registration is performed via functions instead of the
constructor.

The registration accepts a keyspace and table name as parameters.

The per-table metrics use an alternator_table prefix to distinguish them
from their per-shard equivalents.

The metrics are aggregated and reported once per node.  Metrics that do
not make sense to report per table (such as create and delete) are not
registered. All metrics are marked with skip_when_empty.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-06-05 14:44:03 +03:00
Amnon Heiman
005df0c5c4 replica/database.hh: Add alternator per-table metrics
This patch adds optional per-table metrics for Alternator.

Like CQL, some of Alternator's statistics should be per-table. The
shared_ptr allows Alternator to register such metrics in a way that
makes them part of the table's lifecycle.

Using a shared_ptr does not create dependencies between the table_stats
and Alternator.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-06-05 14:38:14 +03:00
Amnon Heiman
af262317b5 alternator/stats.hh: Introduce a per-table stats container
A per-table stats container will be used to safely hold alternator
per-table stats.

It is built in a way that even if the metrics it holds are no longer
registered, it is still safe to use.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-06-05 14:38:14 +03:00
Andrzej Jackowski
e6eb741e95 test: dtest: add dict support to populate in scylla_cluster.py
Co-authored-by: Evgeniy Naydanov <evgeniy.naydanov@scylladb.com>
2025-06-05 08:20:09 +02:00
Andrzej Jackowski
e3f052d6fb test: dtest: copied get_node_ip from dtests to scylla_cluster.py
Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
2025-06-05 08:20:09 +02:00
Andrzej Jackowski
40e71ad1e6 test: dtest: copy run_rest_api from dtests to cluster.py
Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
2025-06-05 08:20:09 +02:00
Andrzej Jackowski
3da86f04a5 test: dtest: copy run_in_parallel from dtests to data.py
Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
2025-06-05 08:19:54 +02:00
Andrzej Jackowski
a1b1d810f9 test: audit: copy unmodified audit_test.py from dtests
Copied the entire audit_test.py from scylladb/scylla-dtest, to remove
the entire file from scylla-dtest after this patch series is merged.
The motivation is to move the entire audit testing from dtests,
to make it easier to maintain and more reliable.

Changed suite.yaml, to prevent audit_test.py from running because
audit_test.py needs improvement before it starts passing.

Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
2025-06-05 08:19:44 +02:00
Ernest Zaslavsky
a39b773d36 encryption_test: Catch exact exception
Apparently `test_kms_network_error` will succeed under any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, and for whatever reason it throws, the test will be marked as passed.

Start catching the exact exception that we expect to be thrown.
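
The difference between the two styles of check can be shown in Python.
This is an analogy rather than the C++ test itself: `NetworkError` and
`flaky_call()` are hypothetical stand-ins.

```python
class NetworkError(Exception):
    """Hypothetical stand-in for the specific error the test expects."""


def flaky_call():
    raise KeyError("unrelated bug")  # not the failure the test is about


def passes_with_broad_catch():
    try:
        flaky_call()
    except Exception:
        return True   # any failure at all counts as success
    return False


def passes_with_exact_catch():
    try:
        flaky_call()
    except NetworkError:
        return True
    except Exception:
        return False  # unexpected error: the test should fail
    return False
```

The broad version reports success even though the call failed for an
unrelated reason; the exact version does not.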

Maybe somewhat related to https://github.com/scylladb/scylladb/issues/22628

Fixes: https://github.com/scylladb/scylladb/issues/24145

reapplies reverted: https://github.com/scylladb/scylladb/pull/24065

Should be backported to 2025.2.

Closes scylladb/scylladb#24242
2025-06-05 08:32:51 +03:00
Benny Halevy
8b387109fc disk_space_monitor: add space_source_registration
Register the current space_source_fn in an RAII
object that resets monitor._space_source to the
previous function when the RAII object is destroyed.

Use space_source_registration in database_test::
mutation_dump_generated_schema_deterministic_id_version
to prevent use-after-stack-return in the test.
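
The closest Python analogue of such an RAII registration is a context
manager that restores the previous callback on exit. The sketch below is
illustrative only; `Monitor` and its `space_source` attribute are
simplified stand-ins, not the real disk_space_monitor interface.

```python
from contextlib import contextmanager


class Monitor:
    """Stand-in for a disk space monitor with a pluggable space source."""

    def __init__(self):
        self.space_source = lambda: "real"


@contextmanager
def space_source_registration(monitor, source):
    previous = monitor.space_source
    monitor.space_source = source
    try:
        yield
    finally:
        # Restore on every exit path, including exceptions.
        monitor.space_source = previous


monitor = Monitor()
with space_source_registration(monitor, lambda: "fake"):
    inside = monitor.space_source()
after = monitor.space_source()
```

Because the override cannot outlive the `with` block, a callback that
captures test-local state is never left installed after the test returns.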

Fixes #24314

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24342
2025-06-04 16:25:24 +03:00
Ernest Zaslavsky
1446f57635 minio: update CLI usage, remove deprecated mc options
Replace phased-out `mc` command options with supported alternatives.
Ensures compatibility with the latest MinIO version.

Closes scylladb/scylladb#24363
2025-06-04 16:22:48 +03:00
Anna Stuchlik
8b989d7fb1 doc: add the upgrade guide from 2025.1 to 2025.2
This commit adds the upgrade guide from version 2025.1 to 2025.2.
Also, it removes the upgrade guides existing for the previous version
that are irrelevant in 2025.2 (upgrade from OSS 6.2 and Enterprise 2024.x).

Note that the new guide does not include the "Enable Consistent Topology Updates" page,
as users upgrading to 2025.2 have consistent topology updates already enabled.

Fixes https://github.com/scylladb/scylladb/issues/24133

Fixes https://github.com/scylladb/scylladb/issues/24265

Closes scylladb/scylladb#24266
2025-06-04 14:00:05 +03:00
Szymon Malewski
5969809607 mapreduce_service: Prevent race condition
In parallelized aggregation functions super-coordinator (node performing final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually responses are spread over time and actual merging is atomic.
However, sometimes partial results are received at a similar time and if an aggregate function (e.g. lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores merging results in its own context and overwrites accumulator atomically, only after it was fully merged.
Compared to the previous implementation, the order of operands in the merging function is swapped, but the order of aggregation is not guaranteed anyway.
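
The lost-update pattern and one possible way to avoid it can be reproduced
with asyncio. This is a sketch under assumptions, not the code of this
patch: `merge()` stands in for a merge step that may yield, and the
take-and-merge loop in `fixed()` is just one way to keep each coroutine's
work in its own local value and publish it without an intervening await.

```python
import asyncio


async def merge(acc, partial):
    # Stand-in for an aggregate's merge step that may yield (e.g. a Lua UDA).
    await asyncio.sleep(0)
    return acc + partial


async def racy(partials):
    state = {"acc": 0}

    async def handle(p):
        # Reads the accumulator, yields inside merge, then overwrites it:
        # concurrent handlers all start from the same stale value.
        state["acc"] = await merge(state["acc"], p)

    await asyncio.gather(*(handle(p) for p in partials))
    return state["acc"]


async def fixed(partials):
    state = {"acc": None}

    async def handle(p):
        local = p
        # Keep merging into a task-local value until the shared slot is
        # claimed without any await between the check and the write.
        while state["acc"] is not None:
            taken, state["acc"] = state["acc"], None
            local = await merge(local, taken)
        state["acc"] = local

    await asyncio.gather(*(handle(p) for p in partials))
    return state["acc"]
```

Running `asyncio.run(racy([1, 2, 3]))` loses some of the partial results,
while `asyncio.run(fixed([1, 2, 3]))` returns their full sum.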

Fixes #20662

Closes scylladb/scylladb#24106
2025-06-04 13:47:11 +03:00
Nadav Har'El
6cbcabd100 alternator: hide internal tags from users
The "tags" mechanism in Alternator is a convenient way to attach metadata
to Alternator tables. Recently we have started using it more and more for
internal metadata storage:

  * UpdateTimeToLive stores the attribute in a tag system:ttl_attribute
  * CreateTable stores provisioned throughput in tags
    system:provisioned_rcu and system:provisioned_wcu
  * CreateTable stores the table's creation time in a tag called
    system:table_creation_time.

We do not want any of these internal tags to be visible to a
ListTagsOfResource request, because if they are visible (as before this
patch), systems such as Terraform can get confused when they suddenly
see a tag which they didn't set - and may even attempt to delete it
(as reported in issue #24098).

Moreover, we don't want any of these internal tags to be writable
with TagResource or UntagResource: If a user wants to change the TTL
setting they should do it via UpdateTimeToLive - not by writing
directly to tags.

So in this patch we forbid read or write to *any* tag that begins
with the "system:" prefix, except one: "system:write_isolation".
That tag is deliberately intended to be writable by the user, as
a configuration mechanism, and is never created internally by
Scylla. We should have perhaps chosen a different prefix for
configurable vs. internal tags, or chosen more unique prefixes -
but let's not change these historic names now.
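
The rule described above fits in a few lines of Python. The function
names are hypothetical and the error type is a placeholder; only the
"system:" prefix and the "system:write_isolation" exception come from the
text.

```python
INTERNAL_PREFIX = "system:"
USER_WRITABLE = {"system:write_isolation"}


def is_internal_tag(key):
    return key.startswith(INTERNAL_PREFIX) and key not in USER_WRITABLE


def visible_tags(tags):
    """Tags a ListTagsOfResource-style call would return."""
    return {k: v for k, v in tags.items() if not is_internal_tag(k)}


def check_writable(keys):
    """Reject TagResource/UntagResource-style writes to internal tags."""
    for key in keys:
        if is_internal_tag(key):
            raise ValueError(f"tag {key!r} is reserved for internal use")
```

For example, `visible_tags()` applied to a table carrying
`system:ttl_attribute`, `system:write_isolation` and a user tag keeps
only the last two.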

This patch also adds regression tests for the internal tags features,
failing before this patch and passing after:
1. internal tags, specifically system:ttl_attribute, are not visible
   in ListTagsOfResource, and cannot be modified by TagResource or
   UntagResource.
2. system:write_isolation is not internal, and can be written by either
   TagResource or UntagResource, and read with ListTagsOfResource.

This patch also fixes a bug in the test where we added more checks
for system:write_isolation - test_tag_resource_write_isolation_values.
This test forgot to remove the system:write_isolation tags from
test_table when it ended, which would lead to other tests that run
later to run with a non-default write isolation - something which we
never intended.

Fixes #24098.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24299
2025-06-03 20:40:50 +03:00
Pavel Emelyanov
37e6ff1a3c Merge 'test.py: cql: run tests using bare pytest command' from Evgeniy Naydanov
Create a custom pytest test collector for .cql files and move CQL test execution logic from `CQLApprovalTest` class and `pylib/cql_repl/cql_repl.py` file to `CqlTest.runtest()` method.

As a result, the only difference between CQLApproval and Python suite types is suffixes of test files.

Also there is a separate commit to remove dead code:

There is a `write_junit_failure_report()` method in the Test class which was used to generate a JUnitXML report.  But it became dead code after the removal of the `write_junit_report()` function in 1e1d213592 to avoid duplication of error reporting in Jenkins (see https://github.com/scylladb/scylladb/issues/23220.)  This commit removes this method and all its implementations in subclasses.

Closes scylladb/scylladb#24301

* github.com:scylladb/scylladb:
  test.py: cql: don't exit from pytest session on failed CQL
  test.py: cql: run tests using bare pytest command
  test.py: python: set test.id according to --run_id argument
  test.py: python: pass --tmpdir from test.py to all Python tests
  test.py: remove dead code after removing of write_junit_report()
2025-06-03 19:32:06 +03:00
Pavel Emelyanov
24f430c6d2 Merge 'test.py: dtest: port next_gating tests from auth_roles_test.py' from Evgeniy Naydanov
Copy `auth_roles_test.py` from the scylla-dtest test suite, remove all non-next_gating tests from it, and make it work with `test.py`.

As a part of the porting process, copy missing utility functions from scylla-dtest, and remove unused imports and markers.

Enable the test in `suite.yaml` (run in dev mode only.)

Closes scylladb/scylladb#24343

* github.com:scylladb/scylladb:
  test.py: dtest: make auth_roles_test.py run using test.py
  test.py: dtest: add wait_for_any_log() to tools/log_utils.py
  test.py: dtest: add part of tools/assertions.py
  test.py: dtest: pickup latest code for retrying.py from dtest
  test.py: dtest: copy unmodified auth_roles_test.py
2025-06-03 18:54:47 +03:00
Patryk Jędrzejczak
8756c233e0 test: test_raft_recovery_user_data: disable hinted handoff
The test is currently flaky, writes can fail with "Too many in flight
hints: 10485936". See scylladb/scylladb#23565 for more details.

We suspect that scylladb/scylladb#23565 is caused by an infrastructure
issue - slow disks on some machines we run CI jobs on.

Since the test fails often and investigation doesn't seem to be easy,
we first deflake the test in this patch by disabling hinted handoff.

For replacing nodes, we provide `cfg` because there should have been
`cfg` in the first place. The test was correct anyway because:
- `tablets_mode_for_new_keyspaces` is set to `true` by default in
  test/cluster/suite.yaml,
- `endpoint_snitch` is set to `GossipingPropertyFileSnitch` by default
  if the property file is provided in `ScyllaServer.__init__`.

Ref scylladb/scylladb#23565

We should backport this patch to 2025.2 because this test is also flaky
on CI jobs using 2025.2. Older branches don't have this test.

Closes scylladb/scylladb#24364
2025-06-03 17:48:42 +02:00
Nadav Har'El
ac70e34de9 test/alternator: verify that DeleteItem returns an empty object
A user on StackOverflow (https://stackoverflow.com/questions/79650278)
reported that DeleteItem returns the appropriate response (an empty
object) on DynamoDB, but doesn't on "DynamoDB Local" (Amazon's local
mock of DynamoDB). I wrote the test in this patch to make sure that
Alternator doesn't have this bug, and indeed it doesn't: When DeleteItem
is used without any option that asks for additional output, its response
is, as expected, an empty object.

As usual, the new test passes on both Alternator and AWS DynamoDB.
(I didn't actually test on DynamoDB Local, I have some problems with
running that, but it doesn't matter, we have no intention of testing
DynamoDB Local).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24359
2025-06-03 18:47:34 +03:00
Avi Kivity
744015cf26 test.py: allow cmake configuration and ./configure.py configuration to coexist
Cmake emits its build.ninja into build/, while configure.py emits
build.ninja into ./. test.py uses this difference to choose the directory
structure to test.

The problem is that vscode will randomly call cmake to understand the
directory structure, so we end up with both build.ninja set up.

Invert the logic to look for ./build.ninja to determine the mode (instead
of build/build.ninja which can exist even if the user uses traditional
configuration).

It can still happen that a stray ./build.ninja exists (for example due
to switching branches), but that is rarer than having vscode auto-create
it.
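
The inverted check described above can be sketched as follows. This is
only an illustration: the helper name `detect_build_system` and the
returned strings are hypothetical, and the real logic in test.py may be
structured differently.

```python
from pathlib import Path


def detect_build_system(source_root: Path) -> str:
    """Guess which build configuration is in use (hypothetical helper).

    configure.py writes build.ninja into the source root, so its presence
    there selects the traditional layout. build/build.ninja is deliberately
    not consulted, because an IDE may have created it on its own.
    """
    if (source_root / "build.ninja").is_file():
        return "configure.py"
    return "cmake"
```

With this ordering, a tree that has both ./build.ninja and
build/build.ninja is treated as a configure.py tree.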

Closes scylladb/scylladb#24269
2025-06-03 16:46:41 +03:00
Piotr Dulikowski
f6669422e1 Merge 'test.py: refactor test facades for better error handling' from Andrei Chekun
Switching to f-string formatting to simplify the code and to unify it with a general approach for formatting strings.
If the log file is absent or empty, the test fails with an error about a missing boost log file. That error is not helpful, since it is not the root cause of the failure. Add logic to log this issue as a warning in pytest's log file and continue providing results to pytest itself.

Closes scylladb/scylladb#24307

* github.com:scylladb/scylladb:
  test.py: enhance boost_facade missing log file handling
  test.py: switch using f-string instead format in facades
2025-06-03 14:03:07 +02:00
Pavel Emelyanov
96029c7c93 Update seastar submodule
* seastar d7ff58f2...26badcb1 (22):
  > http/client: Skip HEAD reply body processing
  > httpd: Remove unused connection::_req member
  > httpd: Don't write body for HEAD replies
  > http: Move trailing chunk write into reply.cc
  > http_client: Add ECONNRESET to retryable errors
  > stall_detector: no backtrace if exception
  > http: Add test for "aborted" client
  > http: in the client, fix malforming of requests with zero-sized bodies
  > http: Track bytes read from a response
  > http: Add test for improper client handling of aborted requests
  > aio_storage_context: Rename iocb_pool::_iocb_pool to _all_iocbs
  > resource: Add some debug-level logging to memory allocation
  > resource: Rework sysconf memory fallback
  > resource: Indentation fix after previous patch
  > resource: Calculate available memory from NUMA nodes
  > resource: Move NUMA nodes vector evaluation up
  > reactor: Drop _reuseport boolean
  > reactor: Simplify network stack creation and initialization
  > reactor: Remove write-only _thread_id
  > reactor: Keep task-queues in std::array instead of static_vector
  > reactor: Mark _id and task_queue::_id const
  > memory: Report oversized alloc count as metric

scylla-gdb update included:

The reactor::_task_queues can be std::array or unique ptrs. Also check
the tq_ptr for being nullptr, as array doesn't have "size" only
"capacity" and can have non-registered groups.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24294
2025-06-03 13:47:05 +03:00
Nadav Har'El
e32559758a test/alternator: any response from healthcheck means server is alive
In the Alternator tests we check (in dynamodb_test_connect()) after
every test that the server is still alive, so we can blame the test
that just ran if it crashes the server. We check the server's health
using a simple GET response, which works on both DynamoDB and
Alternator, e.g.,
```
$ curl http://dynamodb.us-east-2.amazonaws.com/
healthy: dynamodb.us-east-2.amazonaws.com
```

However, it turns out that new versions of DynamoDB Local - Amazon's
local mock of DynamoDB, for some reason insists that all requests -
including this health check - must be signed, so our unsigned health
request is rejected with error 400, saying the request must be signed.
So the current code, which insists that the response have error code
200, fails and the test incorrectly thinks that DynamoDB Local crashed
during the test.

The fix is trivial: Just don't check that the error code is 200.
Any HTTP response from the server means it is still alive! If the
server is not alive, we will get an exception, not any HTTP response,
and this will lead the code to the "server has crashed" case.
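
A minimal sketch of this idea, using only the Python standard library
rather than whatever HTTP client the tests actually use. The function
name `server_is_alive` is hypothetical.

```python
import urllib.error
import urllib.request


def server_is_alive(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server sends back any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True  # successful response
    except urllib.error.HTTPError:
        # An error status (e.g. 400 "request must be signed") still means
        # the server answered, so it is alive.
        return True
    except OSError:
        # Connection refused, timeout, etc. URLError is an OSError subclass,
        # and HTTPError was already handled above.
        return False
```

Only a failure to get any response at all is treated as "server has
crashed"; the status code is never inspected.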

Refs #7775

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-03 12:25:51 +03:00
Nadav Har'El
9732545958 test/alternator: fall back to legal-looking access key id
When the Alternator tests run against Scylla, they figure out (using
CQL) the correct username and password needed to connect. When it can't,
we fell back to some silly pair 'unknown_user', 'unknown_secret',
assuming that the server won't check it anyway.

It turns out that if we want to run tests against new versions of
DynamoDB Local (Amazon's local mock of DynamoDB), it indeed doesn't check
authentication, but starting in DynamoDB Local 2.0, it does check that
the access key ID (the username) itself is valid, and considers
"unknown_user" to be invalid because it contains an underscore -
AWS_ACCESS_KEY_ID must only contain letters and numbers.
See https://repost.aws/articles/ARc4hEkF9CRgOrw8kSMe6CwQ/ for Amazon's
explanation for this change in DynamoDB Local 2.

The trivial fix is to remove the underscore from the silly username.
After this patch, Alternator tests can connect to DynamoDB Local.
They still can't complete correctly - this will be fixed in the next
patch.
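
The rule quoted above (letters and numbers only) can be expressed as a
one-line check. The helper name is hypothetical and the check reflects
only the rule as stated in this message, not DynamoDB Local's actual
validation code.

```python
def looks_like_valid_access_key_id(key_id: str) -> bool:
    """Accept only non-empty strings of ASCII letters and digits."""
    return key_id.isascii() and key_id.isalnum()
```

Under this check "unknown_user" is rejected because of the underscore,
while the same name without the underscore is accepted.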

Refs #7775

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-03 12:25:51 +03:00
Evgeniy Naydanov
f0d283afd7 test.py: cql: don't exit from pytest session on failed CQL
There is a fixture in `test/cql/conftest.py` which checks the
CQL connection after each test and exits from the pytest session if
the connection has failed.  For CQL tests it makes no difference
whether pytest.exit() or pytest.fail() is used, because the
tests are executed one by one in separate pytest sessions.

Change it to pytest.fail() for future integration into a single
pytest session.
2025-06-03 07:54:51 +00:00
Evgeniy Naydanov
cdc4b520da test.py: cql: run tests using bare pytest command
Create a custom pytest test collector for .cql files and
move CQL test execution logic from `CQLApprovalTest` class
and `pylib/cql_repl/cql_repl.py` file to `CqlTest.runtest()`
method.

As a result, the only difference between the CQLApproval and Python
suite types is the suffix of the test files.
2025-06-03 07:54:51 +00:00
Evgeniy Naydanov
0fba0df4f6 test.py: python: set test.id according to --run_id argument
test.py uses `Test.id` attribute to distinguish repeated tests
in one run and pass it as `--run_id` CLI argument to pytest.
Use this argument to set the test's `id` attribute inside the pytest
session to fix a problem with paths to some test artifacts.
2025-06-03 07:54:51 +00:00
Michał Chojnowski
ea4d251ad2 compress: fix a use-after-free in dictionary_holder::get_recommended_dict()
The function calls copy() on a foreign_ptr
(stored in a map) which can be destroyed
(erased from the map) before the copy() completes.
This is illegal.

One way to fix this would be to apply an rwlock
to the map. Another way is to wrap the `foreign_ptr`
in a `lw_shared_ptr` and extend its lifetime over
the `copy()` call. This patch does the latter.

Fixes scylladb/scylladb#24165
Fixes scylladb/scylladb#24174

Closes scylladb/scylladb#24175
2025-06-03 10:42:38 +03:00
Piotr Dulikowski
f5b18d275b Merge 'test/boost: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek
This PR adjusts existing Boost tests so they respect the invariant
introduced by enabling `rf_rack_valid_keyspaces` configuration option.
We disable it explicitly in more problematic tests. After that, we
enable the option by default in the whole test suite.

Fixes scylladb/scylladb#23958

Backport: backporting to 2025.1 and 2025.2 to be able to test the implementation there too.

Closes scylladb/scylladb#23802

* github.com:scylladb/scylladb:
  test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
  test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
  test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load
  test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
2025-06-03 08:43:34 +02:00
Evgeniy Naydanov
ac972231fa test.py: python: pass --tmpdir from test.py to all Python tests
`--tmpdir` CLI argument is used to point to the directory with logs
and other test artifacts.  It has default values both in test.py
and pytest (`test/conftest.py`). These values are the same.  But for
non-default values it's required to pass it from test.py to pytest
explicitly.  This is done for Topology tests, but not for all Python test
suites.  The commit fixes the problem by adding the argument in
`_prepare_pytest_command()` method of the base `PythonTest` class.
2025-06-03 05:45:05 +00:00
Evgeniy Naydanov
17401aaf31 test.py: remove dead code after removing of write_junit_report()
The `write_junit_failure_report()` method in the Test class was used
to generate a JUnitXML report, but it became dead code after the
`write_junit_report()` function was removed in
1e1d213592 to avoid duplication of
error reporting in Jenkins (see #23220).  This commit removes this
method and all its implementations in subclasses.
2025-06-03 02:28:41 +00:00
Pavel Emelyanov
eb5160cb4d api: Drop no longer used container_to_vec helper
All callers are patched to use std::ranges.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-06-02 20:09:58 +03:00
Pavel Emelyanov
f6afc02951 api: Use std::ranges to stringify collections
There are several endpoints that have collection of objects at hand and
want a vector of corresponding strings. Use std::ranges library for
conversion.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-06-02 20:09:56 +03:00
Pavel Emelyanov
b943902ff7 api: Use std::ranges to convert std::set<sstring> to std::vector<string>
The column_family/get_sstables_for_key endpoint collects a set of
sstable names and converts it to vector of strings using homebrew
helper. The std::ranges converter works just as nicely.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-06-02 20:09:28 +03:00
Pavel Emelyanov
6809ab5198 api: Use db::config::data_file_directories()' vector directly
The return value is std::vector<sstring>, there's no need to
additionally convert it to std::vector<sstring>.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-06-02 20:06:33 +03:00
Pavel Emelyanov
06ee60c238 api: Coroutinize get_live_endpoint()
To be symmetrical with its get_down_endpoint() peer and to make further
patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-06-02 19:52:55 +03:00
Michał Chojnowski
dd878505ca test: add test_sstable_compression_dictionaries_upgrade.py 2025-06-02 15:49:29 +02:00
Michał Chojnowski
d3cb873532 test.py: add --run-internet-dependent-tests
Later, we will add upgrade tests, which need to download the previous
release of Scylla from the internet.

Internet access is a major dependency, so we want to make those tests
opt-in for now.
2025-06-02 15:49:29 +02:00
Michał Chojnowski
5da19ff6a6 pylib/manager_client: add server_switch_executable
Add a util for switching the Scylla executable during the test.
Will be used for upgrade tests.
2025-06-02 15:03:08 +02:00
Michał Chojnowski
1ff7e09edc test/pylib: in add_server, give a way to specify the executable and version-specific config
This will be used for upgrade tests.
The cluster will be started with an older executable and without configs
specific to newer versions.
2025-06-02 15:03:08 +02:00
Michał Chojnowski
2ef0db0a6b pylib: pass scylla_env environment variables to the topology suite
I want to add an upgrade test under the topology suite.
To work, it will have to know the path to the tested Scylla
executable, so that it can switch the nodes to it.

The path could be passed by various means and I'm not sure
which method is appropriate.

In some other places (e.g. the cql suite) we pass the path via
the `SCYLLA` environment variable and this patch follows that example.

`PythonTestSuite` (parent class of `TopologySuite`) already has that
variable set in `self.scylla_env`, and passes it around.
However, `TopologySuite` uses its own `run()`, and so it implicitly
overrides the decision to pass `self.scylla_env` down. This patch
changes that, and after the patch we apply the `self.scylla_env` to the
environment for topology tests.

This might have some unforeseen side effects for coverage measurement,
because AFAICS the (only) other variable in `self.scylla_env` is
`LLVM_PROFILE_FILE`.
But topology tests don't run Scylla executables themselves
(they only send commands to the cluster manager started externally),
so I figure there should be no change.
2025-06-02 15:03:08 +02:00
Michał Chojnowski
34098fbd1f test/pylib: add get_scylla_2025_1_executable()
Adds a function which downloads and installs (in `~/.cache`)
Scylla 2025.1, for upgrade tests.

Note: this introduces an internet dependency into pylib,
AFAIK the first one.

We already have some other code for downloading existing Scylla
releases, written for different purposes, in `cqlpy/fetch_scylla.py`.
I made zero effort to reuse that in any way.

Note: hardcoding the package version might be uncool,
but if we want "better" version selection (e.g. the newest patch version
in the given branch), we should have a separate library (or web service)
for that, and share it with CCM/SCT.
If we add a separate automatic version selection mechanism here,
we are going to end up with yet another half-broken Scylla version
selector, with yet different syntax and semantics than the other ones.

We never clear the downloaded and unpacked files.
This could become a problem in the future.
(At which point we can add some mechanism that deletes cached archives
downloaded more than a week ago.)
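
For illustration only, such a cleanup mechanism could look roughly like
the sketch below. Nothing like it is part of this patch; the function
name, the flat directory layout, and the use of file mtime as the
"download time" are all assumptions.

```python
import time
from pathlib import Path


def prune_old_downloads(cache_dir: Path, max_age_days: float = 7.0) -> list[Path]:
    """Delete regular files in cache_dir last modified more than
    max_age_days ago and return the removed paths.

    mtime is used as a stand-in for the download time (an assumption).
    Subdirectories are left untouched.
    """
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    removed = []
    for path in cache_dir.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```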
2025-06-02 15:03:08 +02:00
Michał Chojnowski
cc7432888e pylib/scylla_cluster: give a way to pass executable-specific options to nodes
I'm trying to adapt pylib to multi-version tests.
(Where the Scylla cluster is upgraded to a newer Scylla version
during the test).

Before this patch, the initial config (where "config" == yaml file + CLI args)
of the nodes is hardcoded in scylla_cluster.py.
The problem is that this config might not apply to past versions,
so we need some way to give them a different config.
(For example, with the config as it is before the patch,
a Scylla 2025.1 executable would not boot up because it does not
know the `group0_voter_handler` logger).

In this patch, we create a way to attach version-specific
config to the executable passed to ScyllaServer.
2025-06-02 15:03:08 +02:00
Michał Chojnowski
63218bb094 dbuild: mount "$XDG_CACHE_HOME/scylladb"
We will use it to keep a cache of artifact downloads for upgrade tests,
across dbuild invocations.
2025-06-02 15:03:08 +02:00
Andrei Chekun
738cbc07b5 test.py: enhance boost_facade missing log file handling
If the log file is absent or empty, the test fails with an error about a missing boost log file. That error is not helpful, since it is not the root cause of the failure. Add logic to log this issue as a warning in pytest's log file and continue providing results to pytest itself.
2025-06-02 12:17:10 +02:00
Andrei Chekun
5f6740c1fa test.py: switch using f-string instead format in facades
Switching to f-string formatting to simplify the code and to unify it with a general approach for formatting strings.
2025-06-02 12:16:47 +02:00
Pavel Emelyanov
7fef2c4f61 Merge 'test.py: fix metrics gathering' from Andrei Chekun
The move of run_process done in https://github.com/scylladb/scylladb/pull/24091 was not fully correct. The method run_process was not overridden in the class ResourceGatherOn, so no metrics were collected at all.
Additionally, fix the metrics DB location a second time.

Closes scylladb/scylladb#24306

* github.com:scylladb/scylladb:
  test.py: fix metrics DB location
  test.py: fix the possibility to gather resource metrics for test
2025-06-02 13:12:42 +03:00
Botond Dénes
e82b0dff3e Merge 'Move mutation_fragment_v2::kind into mutation_fragment_v2::data, mutation_fragment::kind into mutation_fragment::data' from Radosław Cybulski
Move mutation_fragment_v2::kind field into mutation_fragment_v2::data.
Move mutation_fragment::kind field into mutation_fragment::data.

In both cases the move reduces size of the object by half (to 8 bytes).

On top of the test suite, this patch was tested manually. First, patched scylla was run. A keyspace and a table were created, with columns TEXT, INT, DOUBLE, BOOLEAN and TIMESTAMP. One row was inserted, and `select *` was executed to make sure it's there. Then scylla was terminated and non-patched scylla was run, another row was inserted and `select *` was run to verify both rows exist. After this, patched scylla was started again, a third row was inserted and a final `select *` was done to verify all three rows are there.

This is a partial fix for the https://github.com/scylladb/scylla-enterprise/issues/5288 issue.

Closes scylladb/scylladb#23452

* github.com:scylladb/scylladb:
  Move mutation_fragment::kind into data object
  Make mutation_fragment::kind enum 1 byte size
  Move mutation_fragment_v2::kind into data object
  Make mutation_fragment_v2::kind enum 1 byte size
2025-06-02 10:57:17 +03:00
Evgeniy Naydanov
e780164a67 test.py: dtest: make auth_roles_test.py run using test.py
As a part of the porting process, remove unused imports and
markers, remove non-next_gating tests, and code for old
ScyllaDB versions.

Enable the test in suite.yaml (run in dev mode only)
2025-06-02 05:14:41 +00:00
Evgeniy Naydanov
145c2fed97 test.py: dtest: add wait_for_any_log() to tools/log_utils.py
Copy wait_for_any_log() function from dtest tools/log_utils.py
with a few modifications:

 - Add type hints;
 - Change timeout for node.watch_log_for() calls from 0 to 0.1
   because dtest shim's implementation uses asyncio.timeout()
   and 0 means not "one time" but "never run";
 - Use set() instead of list() for `ret` variable;
 - Remove redundant `found` variable.
 - Remove `remaining` variable and use shallow copies to make
   the code more correct.  As a side effect this makes the
   TimeoutError message more correct too;
 - Use f-string formatting for TimeoutError message;
2025-06-02 05:14:41 +00:00
Evgeniy Naydanov
ff2aea7e5b test.py: dtest: add part of tools/assertions.py
Copy a few assertion functions from dtest tools/assertions.py:

 - assertion_exception()
 - assertion_invalid()
 - assertion_one()
 - assertion_all()
2025-06-02 05:14:41 +00:00
Evgeniy Naydanov
9d70b6307b test.py: dtest: pickup latest code for retrying.py from dtest
Sync retrying.py with dtest.
2025-06-02 05:14:41 +00:00
Evgeniy Naydanov
40464faef3 test.py: dtest: copy unmodified auth_roles_test.py
The test is disabled in suite.yaml
2025-06-02 05:14:41 +00:00
Jenkins Promoter
7d562c24b1 Update pgo profiles - aarch64 2025-06-01 04:45:06 +03:00
Jenkins Promoter
75cf16afa2 Update pgo profiles - x86_64 2025-06-01 04:31:56 +03:00
Botond Dénes
c52aec3d2f Merge 'tablets: fix missing data after tablet merge ' from Raphael Raph Carvalho
Consider the following scenario:

1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)

We have two possible outcomes:

With cache bypass:

1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but correctly selects all sstables intersecting with range [1, 10]

With cache:

1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, the sstable set is aligned with tablet ranges and no premature EOS is signalled. A premature EOS would otherwise prevent the fast forward from happening and prevent all data from being properly captured in the read.

This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.

Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.

Fixes: https://github.com/scylladb/scylladb/issues/23313

This change needs to be backported to all supported versions which implement tablet merge.

Closes scylladb/scylladb#24287

* github.com:scylladb/scylladb:
  replica: Fix range reads spanning sibling tablets
  test: add reproducer and test for mutation source refresh after merge
  tablets: trigger mutation source refresh on tablet count change
2025-05-30 15:37:29 +03:00
Anna Stuchlik
28cb5a1e02 doc: add OS support for ScyllaDB 2025.2
This commit adds the information about support for platforms
in ScyllaDB version 2025.2.

Fixes https://github.com/scylladb/scylladb/issues/24180

Closes scylladb/scylladb#24263
2025-05-30 12:23:59 +03:00
Calle Wilund
942477ecd9 encryption/utils: Move encryption httpclient to "general" REST client
Fixed #24296

While the HTTP client used for REST calls in AWS/GCP KMS integration (EAR)
is not general enough to be called a HTTP client as such, it is general
enough to be called a REST client (limited to stateless, single-op REST
calls).

Other code, like general auth integrations (hello Azure) and similar
could reuse this to lessen code duplication.

This patch simply moves the httpclient class from encryption to "rest"
namespace, and explicitly "limits" it to such usage. Making an alias
in encryption to avoid touching more files than needed.

Closes scylladb/scylladb#24297
2025-05-30 12:21:51 +03:00
Pavel Emelyanov
a65ffdd0df test/result_utils: Do not assume map_reduce reducing order
When map_reduce is called on a collection, one shouldn't expect that it
processes the elements of the collection in any specific order.

Current test of map-reduce over boost outcome assumes that if reduce
function is the string concatenation, then it would concatenate the
given vector of strings in the order they are listed. That requirement
should be relaxed, and the result may have reversed concatenation.
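
The relaxed check can be sketched in Python as follows (the real test is
a C++ boost test; the helper name here is hypothetical):

```python
def assert_concatenation_in_either_order(result: str, parts: list[str]) -> None:
    """Accept the parts concatenated in listed order or in reverse order."""
    forward = "".join(parts)
    backward = "".join(reversed(parts))
    assert result in (forward, backward), f"unexpected result: {result!r}"
```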

Fixes scylladb/scylladb#24321

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24325
2025-05-30 09:38:59 +02:00
Michael Litvak
3a1be33143 test_cdc_generation_publishing: fix to read monotonically
The test test_multiple_unpublished_cdc_generations reads the CDC
generation timestamps to verify they are published in the correct order.
To do so it issues reads in a loop with a short sleep period and checks
the differences between consecutive reads, assuming they are monotonic.

However the assumption that the reads are monotonic is not valid,
because the reads are issued with consistency_level=ONE, thus we may read
timestamps {A,B} from some node, then read timestamps {A} from another
node that didn't apply the write of the new timestamp B yet. This will
trigger the assert in the test and fail.

To ensure the reads are monotonic we change the test to use consistency
level ALL for the reads.
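
A toy model of the problem, assuming each replica is just the set of
timestamps it has applied so far. Real reads reconcile replica answers
rather than taking a plain union, so this only illustrates the
visibility argument; all names are hypothetical.

```python
def read_one(replicas: list[set[str]], index: int) -> set[str]:
    """CL=ONE: the answer comes from a single replica."""
    return set(replicas[index])


def read_all(replicas: list[set[str]]) -> set[str]:
    """CL=ALL: every replica answers, so the merged result contains every
    write that has reached at least one replica."""
    merged: set[str] = set()
    for replica in replicas:
        merged |= replica
    return merged


# Replica 0 has applied the write of timestamp B, replica 1 has not yet.
replicas = [{"A", "B"}, {"A"}]

first = read_one(replicas, 0)       # sees A and B
second = read_one(replicas, 1)      # sees only A: B "disappears"
print(second >= first)              # False, the reads are not monotonic

print(read_all(replicas) >= first)  # True
```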

Fixes scylladb/scylladb#24262

Closes scylladb/scylladb#24272
2025-05-30 08:35:56 +02:00
Pavel Emelyanov
086777e5de Merge 'test.py: python: run tests using bare pytest command' from Evgeniy Naydanov
The main change is splitting the logic of the `PythonTest.run()` method into the `PythonTest.run_ctx()` context manager and the `PythonTest.run()` method itself, and adding the `host` fixture, which uses the `PythonTest.run_ctx()` context manager to set up and tear down a ScyllaDB node if the `--test-py-init` argument is used. Otherwise, this fixture returns the value of the `--host` CLI argument. Use the dynamic scope provided by the `testpy_test_fixture_scope()` function instead of `session` to maintain compatibility with the `test.py` and `./run` scripts.

Other related changes:

* Add utility `get_testpy_test()` function to `pylib.suite.base` which combines all required steps to create an instance of `Test` class and rework `testpy_test` fixture to use it.

* Switch to use dynamic fixture scope controlled by `--test-py-init` CLI argument to improve compatibility with test.py.  And because in test.py mode the scope is `session`, also change default event loop scope to `session`.

* Convert `get_valid_alternator_role()` to fixture to have more control on the scope of the cache used. Additionally, function `new_dynamodb_session()` was also converted to a fixture, because it uses `get_valid_alternator_role()`.

* Replace dups of `cql` and `this_dc` fixtures in `rest_api` and `pylib/cql_repl` with imports from `cqlpy`.

* Change `build_mode` fixture to return "unknown" if no --mode arguments provided (this is mainly for alternator and cqlpy tests)

* Create a parent directory for a test log file just before opening this file in `run_test()` function instead of having this as a side effect in `Test.__init__()`.

And changes that remove pytest CLI argument duplicates to be able to run tests from different test suites in one pytest session:

* Add 3 supplementary functions to `test.pylib.suite.python`: `add_host_option()` (which adds `--host` options to pytest session), `add_cql_connection_options()` (which adds `--port`, and `--ssl`), and `--add-s3-options` (which adds options related to S3 connection.) Each function is decorated with the `@cache` decorator so that it is executed once per pytest session, which avoids CLI option duplication for runs which execute `alternator`, `cqlpy`, `rest_api`, or `broadcast_tables` in one pytest session.

* Move `--auth_username` and `--auth_password` options from `cluster/conftest.py` to add_scylla_cql_connection_options() and slightly rework `cql` fixture to support these options.

* Remove `--input`, `--output`, and `--keep-tmp` pytest CLI options from `cluster/object_store/conftest.py` because they are not used in this suite.

* Remove `--omit-scylla-output` CLI option from pytest argparser. Instead, remove it from `sys.argv` in `cqlpy/run.py`.  Also, no need to check this option in `alternator/run`.

Closes scylladb/scylladb#23849

* github.com:scylladb/scylladb:
  test.py: python: run tests using bare pytest command
  test.py: rework testpy_test fixture
  test.py: alternator: convert get_valid_alternator_role() to fixture
  test.py: python: split logic of PythonTest.run()
  test.py: add credentials options to add_cql_connection_options()
  test.py: python: remove dups of cql and this_dc fixtures
  test.py: remove duplication of pytest CLI options
  test.py: remove unused CLI options
  test.py: remove `--omit-scylla-output` from pytest argparser
  test.py: set build_mode to "unknown" if no --mode argument
  test.py: create directory for test log in run_test()
2025-05-30 08:48:43 +03:00
Botond Dénes
7db956965e mutation/mutation_compactor: cache regular/shadowable max-purgable in separate members
Max purgeable has two possible values for each partition: one for
regular tombstones and one for shadowable ones. Yet currently a single
member is used to cache the max-purgeable value for the partition, so
whichever kind of tombstone is checked first, its max-purgeable will
become sticky and apply to the other kind of tombstones too. E.g. if the
first can_gc() check is for a regular tombstone, its max-purgeable will
apply to shadowable tombstones in the partition too, meaning they might
not be purged, even though they are purgeable, as the shadowable
max-purgeable is expected to be more lenient. The other way around is
worse, as it will result in regular tombstone being incorrectly purged,
permitted by the more lenient shadowable tombstone max-purgeable.
Fix this by caching the two possible values in two separate members.
A reproducer unit test is also added.
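
A simplified Python model of the fix, with one cache slot per tombstone
kind. The class, its `compute` callback, and the "timestamp lower than
max-purgeable" comparison are assumptions made for illustration; the
real code is C++ in mutation/mutation_compactor.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MaxPurgeableCache:
    """Per-partition cache with a separate slot per tombstone kind."""
    compute: Callable[[bool], int]  # argument: is_shadowable
    _regular: Optional[int] = None
    _shadowable: Optional[int] = None

    def get(self, is_shadowable: bool) -> int:
        # Each kind is computed once and cached in its own member, so the
        # first kind checked can no longer leak into the other one.
        if is_shadowable:
            if self._shadowable is None:
                self._shadowable = self.compute(True)
            return self._shadowable
        if self._regular is None:
            self._regular = self.compute(False)
        return self._regular

    def can_gc(self, tombstone_timestamp: int, is_shadowable: bool) -> bool:
        return tombstone_timestamp < self.get(is_shadowable)
```

With a single shared slot, whichever kind was queried first would decide
the answer for both; with two slots the order of queries does not matter.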

Fixes: scylladb/scylladb#23272

Closes scylladb/scylladb#24171
2025-05-29 22:52:08 +03:00
Avi Kivity
f0ec9dd8f2 Merge 'utils/logalloc: enforce the max contiguous allocation size limit' from Michał Chojnowski
This series fixes the only known violation of logalloc's allocation size limits (in `chunked_managed_vector`), and then it makes those limits hard.

Before the series, LSA handles overly-large allocations by forwarding them to the standard allocator. After the series, an attempt to do an overly large allocation via LSA will trigger an `on_internal_error` instead.

We do this because the allocator fallback logic turned out to have subtle and problematic accounting bugs.
We could fix them, or we can remove the mechanism altogether.
It's hard to say which choice is better. This PR arbitrarily makes the choice to remove the mechanism.
This makes the logic simpler, at the risk of escalating some allocation size bugs to crashes.

See the descriptions of individual commits for more details.

Fixes scylladb/scylladb#23850
Fixes scylladb/scylladb#23851
Fixes scylladb/scylladb#23854

I'm not sure if any of this should be backported or not.

The `chunked_managed_vector` fix could be backported, because it's a bugfix. It's an old bug, though, and we have never observed problems related to it.

The changes to `logalloc` aren't supposed to be fixing any observable problem, so a backport probably has more risk than benefit in this case.

Closes scylladb/scylladb#23944

* github.com:scylladb/scylladb:
  utils/logalloc: enforce LSA allocation size limits
  utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
2025-05-29 22:11:41 +03:00
Szymon Malewski
18d237a393 alternator/executor: Added checks in batch_write_item
This patch adds checks validating 'BatchWriteItem' requests, mostly to avoid an ugly fallback message.
It changes the request's behaviour in case of an empty array of WriteRequests: previously such an array was ignored and the whole request might succeed; now it raises ValidationException, following the documentation and behaviour of DynamoDB.
Patch includes tests in test_manual_requests (`test_batch_write_item_invalid_payload`, `test_batch_write_item_empty_request_list`) testing with several offending cases.

Fixes #23233

Closes scylladb/scylladb#23878
2025-05-29 20:33:57 +03:00
Patryk Jędrzejczak
c21692f3a6 Merge 'token_range_vector: fragment' from Avi Kivity
token_range_vector is a sequence of intervals of tokens. It is used
to describe vnodes or token ranges owned by shards.

Since tokens are bloated (16 bytes instead of 8), and intervals are bloated
(40 bytes of overhead instead of 8), and since we have plenty of token ranges,
such vectors can exceed our allocation unit of 128 kB and cause allocation stalls.
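
A rough back-of-the-envelope calculation shows how quickly the limit is reached. The sketch below assumes that one range occupies two token bounds plus the quoted interval overhead; that layout is an assumption made for illustration, not something the commit message states, so the resulting count is only indicative.

```python
TOKEN_BYTES = 16                    # from the text: 16 bytes per token
INTERVAL_OVERHEAD_BYTES = 40        # from the text: 40 bytes of overhead
ALLOCATION_UNIT_BYTES = 128 * 1024  # from the text: 128 kB allocation unit

# Assumption: one range = two token bounds + interval overhead.
RANGE_BYTES = 2 * TOKEN_BYTES + INTERVAL_OVERHEAD_BYTES

# Largest number of ranges that still fits in one contiguous allocation.
MAX_CONTIGUOUS_RANGES = ALLOCATION_UNIT_BYTES // RANGE_BYTES
```

Under these assumptions a contiguous vector outgrows the allocation unit after fewer than two thousand ranges.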

This series fixes that by first generalizing some helpers and then changing
token_range_vector to use chunked_vector.

Although this touches IDL, there is no compatibility problem since the encodings
for vector and chunked_vector are identical.

There is no performance concern since token_range_vector is never used on
any hot path (hot paths always contain a partition key).

Fixes #3335.
Fixes #24115.

No backport: minor performance fix that isn't a regression.

Closes scylladb/scylladb#24205

* https://github.com/scylladb/scylladb:
  dht: fragment token_range_vector
  partition_range_compat: generalize wrap/unwrap helpers
2025-05-29 18:45:13 +02:00
Robert Bindar
c570941692 Add nodetool refresh --scope option
This change adds the --scope option to nodetool refresh.
Like in the case of nodetool restore, you can pass either of:
* node - On the local node.
* rack - On the local rack.
* dc - In the datacenter (DC) where the local node lives.
* all (default) - Everywhere across the cluster.
as scope.

The feature is based on the existing load_and_stream paths, so it
requires passing --load-and-stream to the refresh command.
Also, it is not compatible with the --primary-replica-only option.
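
The option rules above can be summarized in a small validation sketch. The function `check_refresh_options` and its error messages are hypothetical and are not nodetool's real code; only the scope names, the default, and the two constraints are taken from the commit message.

```python
from typing import Optional

VALID_SCOPES = ("node", "rack", "dc", "all")


def check_refresh_options(scope: Optional[str] = None,
                          load_and_stream: bool = False,
                          primary_replica_only: bool = False) -> str:
    """Return the effective scope or raise ValueError (illustrative only)."""
    if scope is None:
        return "all"  # default: everywhere across the cluster
    if scope not in VALID_SCOPES:
        raise ValueError(f"invalid scope: {scope}")
    if not load_and_stream:
        raise ValueError("--scope requires --load-and-stream")
    if primary_replica_only:
        raise ValueError("--scope is not compatible with --primary-replica-only")
    return scope
```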

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23861
2025-05-29 16:12:09 +03:00
Evgeniy Naydanov
0ee0e3f14d test.py: python: run tests using bare pytest command
Add the `host` fixture which uses `PythonTest.run_ctx()` context manager
to set up and tear down a ScyllaDB node if the `--test-py-init` argument is used.
Otherwise, this fixture returns the value of the `--host` CLI argument.

Use dynamic scope provided by `testpy_test_fixture_scope()` function
instead of `session` to maintain compatibility with test.py and ./run
scripts.
2025-05-29 12:33:41 +00:00
Evgeniy Naydanov
b67048f3ee test.py: rework testpy_test fixture
Add utility `get_testpy_test()` function to `pylib.suite.base` which
combines all required steps to create an instance of `Test` class.

Remove redundant `testpy_testsuite` fixture.

Switch to use dynamic fixture scope controlled by `--test-py-init` CLI
argument to improve compatibility with test.py.  And because in test.py
mode the scope is `session`, also change default event loop scope to
`session`.

The fixture is None for test.py mode.

test.py runs tests file-by-file as separate pytest sessions, so, `session`
scope is effectively close to being the same as `module` (there can be a difference
in the order.)  In case of running tests with bare pytest command, we need
to use `module` scope to maintain same behavior as test.py, since we run
all tests in one pytest session.
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
b65cb517b8 test.py: alternator: convert get_valid_alternator_role() to fixture
Convert `get_valid_alternator_role()` to fixture to have more control
on the scope of the cache used.

Additionally, function `new_dynamodb_session()` was also converted to
a fixture, because it uses `get_valid_alternator_role()`.
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
1f94a9c052 test.py: python: split logic of PythonTest.run()
Split logic of `PythonTest.run()` method into `PythonTest.run_ctx()`
context manager and `PythonTest.run()` method itself.

This is done to reuse setup/teardown code with bare pytest command runs.
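
The split can be pictured as follows. This is a simplified, synchronous sketch: the class `PythonTestSketch`, the event list, and the host address are hypothetical, and the real `PythonTest` methods may well be asynchronous.

```python
from contextlib import contextmanager


class PythonTestSketch:
    """Illustrative stand-in for PythonTest; not the real class."""

    def __init__(self):
        self.events = []

    @contextmanager
    def run_ctx(self):
        # Setup and teardown live here so other callers can reuse them.
        self.events.append("setup")
        try:
            yield "127.0.0.1"  # hypothetical host of the started node
        finally:
            self.events.append("teardown")

    def run(self):
        # run() itself only wraps the test execution in the shared context.
        with self.run_ctx() as host:
            self.events.append(f"run tests against {host}")


test = PythonTestSketch()
test.run()
```

A fixture can then enter `run_ctx()` directly, without going through `run()`.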
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
27cbfc77fb test.py: add credentials options to add_cql_connection_options()
Move `--auth_username` and `--auth_password` options from
`cluster/conftest.py` to add_cql_connection_options() and slightly
rework `cql` fixture to support these options.
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
2bba4acdea test.py: python: remove dups of cql and this_dc fixtures
Replace dups of `cql` and `this_dc` fixtures in `rest_api` and
`pylib/cql_repl` with imports from `cqlpy`.
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
6780461df8 test.py: remove duplication of pytest CLI options
Add 3 supplementary functions to `test.pylib.suite.python`:
`add_host_option()` (which adds `--host` options to pytest session),
`add_cql_connection_options()` (which adds `--port`, and `--ssl`),
and `add_s3_options()` (which adds options related to S3 connection.)
Each function is decorated with the `@cache` decorator so that it is executed
once per pytest session, avoiding CLI option duplication for runs which
execute `alternator`, `cqlpy`, `rest_api`, or `broadcast_tables`
in one pytest session.
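
The effect of `@cache` here can be demonstrated with the standard library alone. The sketch uses `argparse` instead of pytest's option parser and a hypothetical default value; registering the same option twice would raise an error without the decorator.

```python
import argparse
from functools import cache


@cache
def add_host_option(parser: argparse.ArgumentParser) -> None:
    # The default value is an assumption made for this example.
    parser.add_argument("--host", default="localhost")


parser = argparse.ArgumentParser()
add_host_option(parser)  # registers --host
add_host_option(parser)  # cached: the body does not run again
args = parser.parse_args([])
```

The cache is keyed by the parser object, so a different parser still gets its own registration.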
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
056c5db829 test.py: remove unused CLI options
Remove `--input`, `--output`, and `--keep-tmp` pytest CLI options
from `cluster/object_store/conftest.py` because they are not used
in this suite.
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
b7b68355ef test.py: remove --omit-scylla-output from pytest argparser
Remove `--omit-scylla-output` CLI option from pytest argparser.
Instead, remove it from `sys.argv` in `cqlpy/run.py`.  Also, no need
to check this option in `alternator/run`.
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
f262d4c323 test.py: set build_mode to "unknown" if no --mode argument
Change the `build_mode` fixture to return "unknown" if no --mode argument
is provided (this is mainly for alternator and cqlpy tests).
2025-05-29 12:15:28 +00:00
Evgeniy Naydanov
30d542b8f1 test.py: create directory for test log in run_test()
Create a parent directory for a test log file just before opening this
file in `run_test()` function instead of having this as a side effect
in `Test.__init__()`.
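
A minimal sketch of the idea, assuming a hypothetical helper `open_test_log` and an invented path layout: the directory is created at the point of use instead of as a constructor side effect.

```python
import tempfile
from pathlib import Path


def open_test_log(log_path: Path):
    # Create the parent directory right before opening the file.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    return log_path.open("w")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dev" / "suite" / "test_example.log"
    with open_test_log(path) as log_file:
        log_file.write("started\n")
    log_was_created = path.is_file()
```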
2025-05-29 12:15:28 +00:00
Piotr Dulikowski
c8d52a4318 Merge 'test.py: dtest: port bypass_cache_test.py' from Evgeniy Naydanov
Copy bypass_cache_test.py from scylla-dtest test suite and make it work with test.py

As a part of the porting process, copy missed utility functions from scylla-dtest, remove unused imports and markers, and add missed `single_node` marker description to pytest.ini

Enable the test in suite.yaml (run in dev mode only.)

Also add missed `ScyllaCluster.nodetool()` method in dtest shim code.

Closes scylladb/scylladb#24230

* github.com:scylladb/scylladb:
  test.py: dtest: make bypass_cache_test.py run using test.py
  test.py: dtest: add missed ScyllaCluster.nodetool()
  test.py: dtest: copy unmodified bypass_cache_test.py
2025-05-29 13:48:10 +02:00
Michał Chojnowski
cb02d47b10 utils/logalloc: enforce LSA allocation size limits
In order to guarantee a decent upper limit on fragmentation,
LSA only handles allocations smaller than 0.1 of a segment.

Allocations larger than this limit are permitted, but they are
not placed in LSA segments. Instead, they are forwarded to
the standard allocator.

We don't really have any use case for this "fallback".
As far as I can tell, it only exists for "historical"
reasons, from times where there were some data structures
which weren't fully adapted to LSA yet.

We don't want the fallback to be used.
Long-lived standard allocations are undesirable.
They have higher internal fragmentation than LSA
allocations, and they can cause external fragmentation
in the standard allocator. So we want to eliminate them all.

The only reason to keep the fallback is to soften the impact
if some bug results in limit-exceeding LSA allocations happening
in production. In principle, the fallback turns a crash
(or something similarly drastic) into just a performance problem.

However, it turns out that the fallback is buggy.
Recently we had a bug which caused limit-exceeding LSA allocations
to happen.
And then it turned out that LSA reclaim doesn't deal fully correctly
with evictable non-LSA allocations, and the dirty_memory_manager
accounting for non-LSA allocations is completely wrong.
This resulted in subtle, serious, and hard to understand stability
problems in production.

Arguably the biggest problem is that the "fallback" allocations
weren't reported in any way. They were happening in some tests,
but they were silently permitted, so nobody noticed that they
should be eliminated. If we just had a rate-limited error log
that reports fallback allocations, they would have never got
into a release.

So maybe we could fix the fallback, add more tests for it,
add a warning for when it's used, and keep it.

But this PR instead opts for removing the fallback mechanism
altogether and failing fast. After the patch, if a non-conforming
allocation happens, it will trigger an `on_internal_error`.
With this, we risk a greater impact if some non-conforming allocations
happen in production, but we make the system simpler.
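
The behavioural change can be modelled in a few lines. The sketch is in Python rather than the allocator's C++, and the segment size, the exception type, and whether the boundary is inclusive are all assumptions; only the "0.1 of a segment" limit and the fail-fast behaviour come from the text.

```python
SEGMENT_SIZE = 128 * 1024              # assumed segment size
MAX_LSA_ALLOCATION = SEGMENT_SIZE // 10  # "0.1 of a segment"


class InternalError(Exception):
    """Stand-in for what `on_internal_error` would report."""


def lsa_alloc(size: int) -> bytearray:
    if size > MAX_LSA_ALLOCATION:
        # Before the patch: silently forwarded to the standard allocator.
        # After the patch: fail fast.
        raise InternalError(f"allocation of {size} bytes exceeds LSA limit")
    return bytearray(size)
```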

It's hard to say if it's a good tradeoff.
2025-05-29 13:05:08 +02:00
Piotr Dulikowski
555925c66b Merge 'generic_server: transport: improve stats counting and shedding' from Marcin Maliszkiewicz
The patch removes connection advertising functions and moves the logic to constructors and destructors, providing a more robust way of counting connections. This change was also necessary to allow skipping the connection process function during shedding, as the active connections counter needs to be decremented.

The patch doesn't fix any active bug, just improves the flow.

Backport: none, it's a cosmetic change

Closes scylladb/scylladb#23890

* github.com:scylladb/scylladb:
  generic_server: make shutdown() return void
  generic_server: skip connection processing logic after shedding the connection
  transport: generic_server: remove no longer used connection advertising code
  transport: move new connection trace logs into connection class ctor/dtor
  transport: move cql connections counting into connection class ctor/dtor
2025-05-29 12:49:58 +02:00
Avi Kivity
c00824c7df Merge 'transport: Implement SCYLLA_USE_METADATA_ID support' from Andrzej Jackowski
Metadata id was introduced in CQLv5 to keep the metadata of prepared
statements consistent between driver and database.
This commit introduces a protocol extension that allows using the same
mechanism in CQLv4. As CQLv5 is currently unsupported in ScyllaDB (as well
as in some of the drivers), the motivation is to allow fixing https://github.com/scylladb/scylladb/issues/20860.

This change:
     - Implement metadata::calculate_metadata_id()
     - Implement SCYLLA_USE_METADATA_ID protocol extension for CQLv4
     - Added description of SCYLLA_USE_METADATA_ID in documentation
     - Add boost tests to confirm correctness of the function
     - Add python tests for table metadata change corner-cases
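
The general idea of a metadata id is a digest over the result-set column metadata, so that a schema change yields a different id. The sketch below illustrates that idea only: the function `sketch_metadata_id`, the choice of SHA-256, the 16-byte truncation, and the fields being hashed are assumptions and do not describe what `metadata::calculate_metadata_id()` actually does.

```python
import hashlib


def sketch_metadata_id(columns):
    """Hash (name, type) pairs into a short id; illustrative only."""
    digest = hashlib.sha256()
    for name, type_name in columns:
        for part in (name, type_name):
            data = part.encode("utf-8")
            # Length-prefix each field so adjacent fields cannot be confused.
            digest.update(len(data).to_bytes(4, "big"))
            digest.update(data)
    return digest.digest()[:16]


before = sketch_metadata_id([("pk", "int"), ("v", "text")])
after = sketch_metadata_id([("pk", "int"), ("v", "blob")])
```

A driver holding the `before` id can tell from a differing id in the response that its cached metadata is stale.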

Fixes scylladb/scylladb#20860

Also see related https://scylladb.atlassian.net/wiki/spaces/RND/pages/42238631/MetadataId+extension+in+CQLv4+Requirement+Document

No backport needed (unless specifically requested by a customer), because there are existing workarounds for the issue

Closes scylladb/scylladb#23292

* github.com:scylladb/scylladb:
  test: add tests for prepared statement metadata consistency corner cases
  transport: implement SCYLLA_USE_METADATA_ID support
  cql3: implement metadata::calculate_metadata_id()
2025-05-29 12:27:31 +03:00
Andrei Chekun
0c5676ffb4 test.py: fix metrics DB location
This was already fixed, but unintentionally during rebases it was
reverted and merged to master in the same PR.
2025-05-28 20:13:38 +02:00
Andrei Chekun
6e92791538 test.py: fix the possibility to gather resource metrics for test
Move of the run_process done in #24091 was not fully correct. The method
run_process was not overridden in the class ResourceGatherOn, so no
metrics are collected at all.
2025-05-28 20:13:31 +02:00
Ran Regev
37854acc92 changed the string literals into the correct ones
Fixes: #23970

use correct string literals:
KMIP_TAG_CRYPTOGRAPHIC_LENGTH_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_LENGTH
KMIP_TAG_CRYPTOGRAPHIC_USAGE_MASK_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_USAGE_MASK

From https://github.com/scylladb/scylladb/issues/23970 description of the
problem (emphasis mine):

When transparent data encryption at rest is enabled with KMIP as a key
provider, the observation is that before creating a new key, Scylla tries
to locate an existing key with provided specifications (key algorithm &
length), with the intention to re-use existing key, **but the attributes
sent in the request have minor spelling mistakes** which are rejected by
the KMIP server key provider, and hence scylla assumes that a key with
these specifications doesn't exist, and creates a new key in the KMIP
server. The issue here is that for every new table, ScyllaDB will create
a key in the KMIP server, which could clutter the KMS, and make key
lifecycle management difficult for DBAs.

Closes scylladb/scylladb#24057
2025-05-28 13:52:30 +03:00
Pavel Emelyanov
2eed2e94ea sstables_loader: Extend logging with recently added skip-cleanup
When starting, the loader prints all its arguments into logs. Recently
added skip-cleanup one is not included, but it's good to have one too.

refs: #24139

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24206
2025-05-28 11:20:27 +03:00
David Garcia
9542bfd2b1 docs: enable ai chatbot
docs: enable ai chatbot

Closes scylladb/scylladb#24286
2025-05-28 11:04:25 +03:00
Yaron Kaikov
0831931fec .github/workflows/conflict_reminder: reduce the amount of conflict reminder for every push event
To avoid spamming PR authors about conflicts, add logic for push
events: if the PR is already in draft mode, check when the last
notification was sent, and skip the reminder if that was less than
3 days ago.
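
The rule can be sketched as a small predicate. The function name `should_notify` and its parameters are hypothetical; the workflow itself is a GitHub Actions file and is not reproduced here.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_notify(is_draft: bool,
                  last_notified: Optional[datetime],
                  now: datetime,
                  min_gap: timedelta = timedelta(days=3)) -> bool:
    # PRs that are not drafts, or were never notified, always get a reminder.
    if not is_draft or last_notified is None:
        return True
    # Draft PRs are reminded again only after the minimum gap has passed.
    return now - last_notified >= min_gap


now = datetime(2025, 5, 28, 11, 0)
recent = should_notify(True, now - timedelta(days=1), now)
stale = should_notify(True, now - timedelta(days=4), now)
```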

Closes scylladb/scylladb#24289
2025-05-28 11:01:44 +03:00
Nadav Har'El
61581d458e Merge 'vector_index: add custom index class' from Michał Hudobski
This PR adds a class that allows for validation (and in the future creating and querying) of custom indexes and implements it for vector indexes. Currently custom vector_index creation runs a usual index creation process. This PR does not change that, however it adds validation of the parameters that need to have certain values for the actual creation of the vector index in the future. The only thing left for the vector_index feature to work as intended should be the integration with the Vector Store service.

This is a continuation of https://github.com/scylladb/scylladb/pull/23720
Refs: [VS-55](https://scylladb.atlassian.net/browse/VS-55) (Support setting index parameters and similarity function in CREATE INDEX)
Fixes: [VS-13](https://scylladb.atlassian.net/browse/VS-13) (Validate that the base type is numeric when creating the vector index)

[VS-13]: https://scylladb.atlassian.net/browse/VS-13?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#24212

* github.com:scylladb/scylladb:
  test/cqlpy: remove xfail and add more vector tests
  vector_index: allow options when custom class is provided
  vector_index: add custom index and vector index classes
2025-05-28 10:42:29 +03:00
Raphael S. Carvalho
53df911145 replica: Fix range reads spanning sibling tablets
We don't guarantee that coordinators will only emit range reads that
span only one tablet.

Consider this scenario:

1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.

We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, neither of the sibling tablets is going anywhere since it
runs with the state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-05-27 22:39:40 -03:00
Michał Hudobski
195e6a82de test/cqlpy: remove xfail and add more vector tests
We have added validation for options and type of column for vector
indexes. This commit adds tests for that validation.
2025-05-27 21:04:50 +02:00
Michał Hudobski
7a2b0179e8 vector_index: allow options when custom class is provided
We have changed the validation for the custom index to not
require the CUSTOM keyword when creating the index,
only the custom class. Now we change the validation for options
so that they match.
2025-05-27 21:04:50 +02:00
Michał Hudobski
3ab643a5de vector_index: add custom index and vector index classes
In this patch we add an abstract class, "custom_index", with a validate() method.
Each CUSTOM INDEX class needs to implement a concrete subclass of custom_index
which is used to validate if this type of custom index class may be used,
and whether the optional parameters passed to it are valid.

We change the existing CUSTOM INDEX validation code to use this new mechanism.

Finally this patch implements one concrete subclass for vector index.
Before this patch, the custom index type "vector_index" was allowed,
but after this patch it gains more validation of its optional parameters
(we support 4 specific parameters, with some rules on their values).
Of course, the vector index isn't actually implemented in this patch,
we are just improving the validation of the index creation statement.
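
The structure described above maps onto an abstract base class with one concrete subclass. The sketch is in Python rather than C++, and the option name, the allowed values, and the set of numeric types are invented for illustration; the commit message does not list the four supported parameters.

```python
from abc import ABC, abstractmethod


class CustomIndex(ABC):
    """Sketch of the abstract `custom_index` class with a validate() method."""

    @abstractmethod
    def validate(self, element_type: str, options: dict) -> None:
        ...


class VectorIndex(CustomIndex):
    # Hypothetical option table; the real parameter names are not given here.
    ALLOWED_OPTIONS = {"similarity_function": {"cosine", "euclidean"}}
    NUMERIC_TYPES = {"float", "double", "int"}

    def validate(self, element_type: str, options: dict) -> None:
        if element_type not in self.NUMERIC_TYPES:
            raise ValueError(f"vector element type must be numeric: {element_type}")
        for key, value in options.items():
            if key not in self.ALLOWED_OPTIONS:
                raise ValueError(f"unsupported option: {key}")
            if value not in self.ALLOWED_OPTIONS[key]:
                raise ValueError(f"invalid value for {key}: {value}")
```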
2025-05-27 21:04:50 +02:00
Marcin Maliszkiewicz
7f057af1f2 replica: make non-preemptive keyspace create/update/delete functions public
As those operations will be managed by schema_applier class. This
will be implemented in following commit.
2025-05-27 20:01:35 +02:00
Marcin Maliszkiewicz
2daa630938 replica: split update keyspace into two phases
- first phase is preemptive (prepare_update_keyspace)
- second phase is non-preemptive (update_keyspace)

This is done so that schema change can be applied atomically.

Additionally, create keyspace code was changed to share common
part with update keyspace flow.

This commit doesn't yet change the behaviour of the code,
as it doesn't guarantee atomicity, it will be done in following
commits.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
fe0f4033ca replica: split creating keyspace into two functions
This is done so that in following commits insert_keyspace can be used
to atomically change schema (as it doesn't yield).
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
aceb1f9659 db: rename create_keyspace_from_schema_partition
It only creates keyspace metadata.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
f8fe51640a db: decouple functions and aggregates schema change notification from merging code 2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
52069d954f db: store functions and aggregates change batch in schema_applier
To be used in following commit.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
5fff3097a5 db: decouple tables and views schema change notifications from merging code
As post_commit() can't be fully implemented at this stage,
it was moved to interim place to keep things working.
It will be moved back later.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
6f8579e242 db: store tables and views schema diff in schema_applier
It will be used in subsequent commit for moving
notifications code.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
b74c1e9ae4 db: decouple user type schema change notifications from types merging code
Merging types code now returns generic affected_types structure which
is used both for notifications and dropping types. New static
function drop_types() replaces dropping lambda used before.

I don't think dropping or notifications need per-shard copies
(which are used both before and after this patch); they could
just use string parameters or something similar. But that
requires too many changes in other classes, so it's out of scope
here.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
3a95edd0d7 service: unify keyspace notification functions arguments
Keyspace metadata is not used, only name is needed so
we can remove those extra find_keyspace() calls.

Moreover there is no need to copy the name.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
d7202586ca db: replica: decouple keyspace schema change notifications to a separate function
In following commits we want to separate updating code from committing
schema change (making it visible). Since notifications should be issued
after change is visible we need to separate them and call after
committing.

In subsequent commits other notification types will be moved too.

Here we change the order of notification calls with regard to the rest
of the schema updating code. Before, keyspace notifications were triggered
before tables were updated; after the change they will trigger once
everything is updated. There is no indication that notification
listeners depend on this behaviour.
2025-05-27 19:59:47 +02:00
Marcin Maliszkiewicz
ddf9f7ae05 db: add class encapsulating schema merging
This commit doesn't yet change how schema merging
works but it prepares the ground for it.

We split merging code into several functions.
Main reasons for it are that:

- We want to generalize and create some interface
which each subsystem would use.

- We need to pull mutation's apply() out
of the code because raft will call it directly,
and it will contain a mix of mutations from more
than one subsystem. This is needed because we have
the need to update multiple subsystems atomically
(e.g. auth and schema during auto-grant when creating
a table).

In this commit do_merge_schema() code is split between
prepare(), update(), commit(), post_commit(). The idea
behind each of these phases is described in the comments.
The last 2 phases are not yet implemented as it requires more
code changes but adding schema_applier enclosing class
will help to create some copied state in the future and
implement commit() and post_commit() phases.
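
One way to read the four phases is sketched below. Only the phase names come from the commit message; the class `SchemaApplierSketch`, the dictionary-based schema, and what each phase does internally are assumptions, since the authoritative description lives in the code comments the message refers to.

```python
class SchemaApplierSketch:
    """Illustrative model of prepare/update/commit/post_commit."""

    def __init__(self, live_schema: dict):
        self.live = live_schema
        self.notifications = []

    def prepare(self, changes: dict) -> None:
        # Gather inputs and take a working copy (may be slow or preemptive).
        self.changes = changes
        self.pending = dict(self.live)

    def update(self) -> None:
        # Apply changes to the copy only; nothing is visible yet.
        self.pending.update(self.changes)

    def commit(self) -> None:
        # Make everything visible in one non-yielding step.
        self.live.clear()
        self.live.update(self.pending)

    def post_commit(self) -> None:
        # Notify listeners only after the change is visible.
        self.notifications.extend(sorted(self.changes))


schema = {"ks1": "v1"}
applier = SchemaApplierSketch(schema)
applier.prepare({"ks1": "v2", "ks2": "v1"})
applier.update()
visible_before_commit = dict(schema)
applier.commit()
applier.post_commit()
```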
2025-05-27 19:33:02 +02:00
Marcin Maliszkiewicz
1eb580973c generic_server: make shutdown() return void
It's always immediately ready so no need to
return future<>.
2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz
d76d1766ad generic_server: skip connection processing logic after shedding the connection
Since input and output descriptors are already closed
at this point there is no need to call connection::process.
This should make shedding use slightly less resources.
2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz
f7e5adaca3 transport: generic_server: remove no longer used connection advertising code 2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz
81f0e79dc0 transport: move new connection trace logs into connection class ctor/dtor
This is a step towards replacing advertise_new_connection/unadvertise_connection
by RAII which is less error prone. Advertising will be removed in subsequent commit.
2025-05-27 19:30:56 +02:00
Marcin Maliszkiewicz
371b959539 transport: move cql connections counting into connection class ctor/dtor
This is a step towards replacing advertise_new_connection/unadvertise_connection
by RAII which is less error prone. Advertising will be removed in subsequent commit.
2025-05-27 19:30:39 +02:00
Dawid Mędrek
c60035cbf6 test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
We've adjusted all of the Boost tests so they respect the invariant
enforced by the `rf_rack_valid_keyspaces` configuration option, or
explicitly disabled the option in those that turned out to be more
problematic and will require more attention. Thanks to that, we can
now enable it by default in the test suite.
2025-05-27 18:53:39 +02:00
Dawid Mędrek
237638f4d3 test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the file verify more subtle parts of the behavior
of tablets and rely on topology layouts or using keyspaces that violate
the invariant the `rf_rack_valid_keyspaces` configuration option is
trying to enforce. Because of that, we explicitly disable the option
to be able to enable it by default in the rest of the test suite in
the following commit.
2025-05-27 18:53:36 +02:00
Anna Stuchlik
efce03ef43 doc: clarify RF increase issues for tablets vs. vnodes
This commit updates the guidelines for increasing the Replication Factor
depending on whether tablets are enabled or disabled.

To present it in a clear way, I've reorganized the page.

Fixes https://github.com/scylladb/scylladb/issues/23667

Closes scylladb/scylladb#24221
2025-05-27 17:47:50 +02:00
Dawid Mędrek
22d6c7e702 test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load 2025-05-27 16:01:14 +02:00
Dawid Mędrek
fa62f68a57 test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
We make sure that the keyspaces created in the test are always RF-rack-valid.
To achieve that, we change how the test is performed.

Before this commit, we first created a cluster and then ran the actual test
logic multiple times. Each of those test cases created a keyspace with a random
replication factor.

That cannot work with `rf_rack_valid_keyspaces` set to true. We cannot modify
the property file of a node (see commit: eb5b52f598),
so once we set up the cluster, we cannot adjust its layout to work with another
replication factor.

To solve that issue, we also recreate the cluster in each test case. Now we choose
the replication factor at random, create a cluster distributing nodes across as many
racks as RF, and perform the rest of the logic. We perform it multiple times in
a loop so that the test behaves as before these changes.
2025-05-27 15:52:38 +02:00
Dawid Mędrek
cd615c3ef7 test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
We distribute the nodes used in the test across two racks so we can
run the test with `rf_rack_valid_keyspaces` set to true.

We want to avoid cross-rack migrations and keep the test as realistic
as possible. Since host3 is supposed to function as a new node in the
cluster, we change the layout of it: now, host1 has 2 shards and resides
in a separate rack. Most of the remaining test logic is preserved and behaves
as before this commit.

There is a slight difference in the tablet migrations. Before the commit,
we were migrating a tablet between nodes of different shard counts. Now
it's impossible because it would force us to migrate tablets between racks.
However, since the test wants to simply verify that an ongoing migration
doesn't interfere with load balancing and still leads to a perfect balance,
that still happens: we explicitly migrate ONLY 1 tablet from host2 to host3,
so to achieve the goal, one more tablet needs to be migrated, and we test
that.
2025-05-27 15:41:27 +02:00
Ferenc Szili
1f9f724441 test: add reproducer and test for mutation source refresh after merge
This change adds a reproducer and test for the fix where the local mutation
source is not always refreshed after a tablet merge.
2025-05-27 15:18:36 +02:00
Ferenc Szili
d0329ca370 tablets: trigger mutation source refresh on tablet count change
Consider the following scenario:

- let's assume tablet 0 has range [1, 5] (pre merge)
- tablet merge happens, tablet 0 has now range [1, 10]
- tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet
  0 still has range [1, 5]
- during a full scan, forward service will intersect the full range with
  tablet ranges and consume one tablet at a time
- replica service is asked to consume range [1, 10] of tablet 0 (post merge)

We have two possible outcomes:

With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but correctly selects
   all sstables intersecting with range [1, 10]

With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to
   range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable,
     EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.

So with the set refreshed, the sstable set is aligned with tablet ranges, and no
premature EOS is signalled. A premature EOS would otherwise prevent the fast
forward from happening and all data from being properly captured in the read.

This change fixes the bug and triggers a mutation source refresh whenever
the number of tablets for the table has changed, not only when we have
incoming tablets.

Fixes: #23313
2025-05-27 15:15:43 +02:00
Wojciech Mitros
5074daf1b7 test: actually wait for tablets to distribute across nodes
In test_tablet_mv_replica_pairing_during_replace, after we create
the tables, we want to wait for their tablets to distribute evenly
across nodes and we have a wait_for for that.
But we don't await this wait_for, so it's a no-op. This patch fixes
it by adding the missing await.
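
The bug class is easy to reproduce in isolation: calling a coroutine function only creates a coroutine object, and its body does not run until it is awaited. In the sketch below `wait_for` is a simplified stand-in, not the test suite's real helper.

```python
import asyncio


async def wait_for(predicate, attempts: int = 5):
    # Simplified stand-in: poll the predicate a few times.
    for _ in range(attempts):
        if predicate():
            return True
        await asyncio.sleep(0)
    raise TimeoutError("condition not met")


async def demo():
    calls = []

    def predicate():
        calls.append(1)
        return True

    coro = wait_for(predicate)        # missing await: nothing has run yet
    calls_without_await = len(calls)
    await coro                        # the fix: actually run the wait
    return calls_without_await, len(calls)


result = asyncio.run(demo())
```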

Refs scylladb/scylladb#23982
Refs scylladb/scylladb#23997

Closes scylladb/scylladb#24250
2025-05-27 15:12:25 +02:00
Avi Kivity
844a49ed6e dht: fragment token_range_vector
token_range_vector is a linear vector containing intervals
of tokens. It can grow quite large in certain places
and so cause stalls.

Convert it to utils::chunked_vector, which prevents allocation
stalls.

It is not used in any hot path, as it usually describes
vnodes or similar things.

Fixes #3335.
2025-05-27 14:47:24 +03:00
Avi Kivity
83c2a2e169 partition_range_compat: generalize wrap/unwrap helpers
These helpers convert vectors of wrapped intervals to
vectors of unwrapped intervals and vice versa.

Generalize them to work on any sequence type. This is in
preparation of moving from vectors to chunked_vectors.
2025-05-27 14:47:21 +03:00
Botond Dénes
542b2ed0de Merge 'Remove req_params facility from API' from Pavel Emelyanov
The class was introduced to facilitate path and query parameters parsing from requests, but in fact it's mostly dead code.

First, the class introduces the concept of "mandatory" parameters, which are seastar path params. If one is missing, the parameter validation throws, but in all cases where this option is used in scylla it's impossible to get an empty path param -- if the parameter is missing, seastar returns 404 (not found) before calling the handler.

Second, req_params::get<T>() doesn't work for anything but a string argument (or types such that optional<T> can be implicitly cast to optional<sstring>). In fact, it's only used to get sstrings, so it compiles and works so far.

The remaining ability to parse bool from string is partially duplicated by the validate_bool() method. Using a plain function to parse a string to bool takes less code than req_params introduces.

One (arguably) useful thing req_params does is validate that the incoming request does _not_ contain unknown query parameters. However, quite a few endpoints use this; most of them just cherry-pick the parameters they want and ignore the others. There's already a comprehensive description of accepted parameters for each endpoint in api-doc/, and req_params duplicates it. Good validation code should rely on api-doc/, not on its partial copy.

Having said that, this PR introduces the validate_bool_x() helper to do req_params-like parsing of strings to bools, patches existing handlers to use existing parameter-parsing facilities (such as validate_keyspace() and parse_table_infos()), and drops req_params.

Closes scylladb/scylladb#24159

* github.com:scylladb/scylladb:
  api: Drop class req_params
  api: Stop using req_params in parse_scrub_options
  api: Stop using req_params in tasks::force_keyspace_compaction_async
  api: Stop using req_params in ss::force_keyspace_compaction
  api: Stop using req_params in ss::force_compaction
  api: Stop using req_params in cf::force_major_compaction
  api: Add validate_bool_x() helper
2025-05-27 14:29:05 +03:00
Ernest Zaslavsky
7d0d3ec1c8 load_and_stream: Add abortion flow to mutation streaming
* The new abort command explicitly represents the abortion flow in
mutation streaming, clearly identifying operations that are
intentionally aborted. This reduces ambiguity around failures in
streaming operations.
* In the error-handling section, aborted operations are now
explicitly marked as the cause of the streaming failure. This allows
us to differentiate them from genuine errors and appropriately adjust
log severity to reduce unnecessary alarm caused by aborted streaming
failures.
* To avoid alarming users with excessive error logs, log severity for
streaming failures caused by aborted operations has been downgraded.
This helps keep logs cleaner and prevents unnecessary concerns.
* A new feature has been added to ensure mixed clusters during updates
do not receive unsupported RPC messages, improving compatibility and
stability.

fixes: https://github.com/scylladb/scylladb/issues/23076

Closes scylladb/scylladb#23214
2025-05-27 14:21:58 +03:00
Dawid Mędrek
1199c68bac test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
We assign the nodes created by the test to separate racks. It has no impact
on the test since the keyspace used in the test uses RF=2, so the tablet
replicas will still be the same.
2025-05-27 13:18:11 +02:00
Dawid Mędrek
e4e3b9c3a1 test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
We distribute the nodes used in the test between two racks. Although
that may affect how tablets behave in general, this change will not
have any real impact on the test. The test verifies that load balancing
eventually balances tablets in the cluster, which will still happen.
Because of that, the changes in this commit are safe to apply.
2025-05-27 13:18:09 +02:00
Dawid Mędrek
6e2fb79152 test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
We distribute the nodes used in the test between two racks. Although that
may have an impact on how tablets behave, it's orthogonal to what the test
verifies -- whether the topology coordinator is continuously in the tablet
migration track. Because of that, it's safe to make this change without
influencing the test.
2025-05-27 13:18:07 +02:00
Botond Dénes
485df63fd5 Merge 'Extend compaction_history table with additional compaction statistics' from Łukasz Paszkowski
Currently, the `system.compaction_history` table is missing information such as the type of compaction (cleanup, major, resharding, etc.), the sstable generations involved (in and out), the ID of the shard the compaction was triggered on, and statistics on tombstones purged during compaction.

The series extends the table with the following columns:

-  "compaction_type" (text)
- "shard_id" (int)
- "sstables_in" (list<sstableinfo_type>)
- "sstables_out" (list<sstableinfo_type>)
- "total_tombstone_purge_attempt" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long)

along with a user-defined type `sstableinfo_type` that holds information about an sstable file:

- generation (uuid)
- origin (text)
- size (long)

Additional statistics stored in the compaction_history have been incorporated in the API  `/compaction_manager/compaction_history` and the `nodetool compactionhistory` command.

No backport is required. It extends the existing compaction history output.

Fixes https://github.com/scylladb/scylladb/issues/3791

Closes scylladb/scylladb#21288

* github.com:scylladb/scylladb:
  nodetool: Refactor of compactionhistory_operation
  nodetool: Add more stats into compactionhistory output
  api/compaction_manager: Extend compaction_history api
  compaction: Collect tombstone purge stats during compaction
  compacting_reader: Extend to accept tombstone purge statistics
  mutation_compactor: Collect tombstone purge attempts
  compaction_garbage_collector: Extend return type of max_purgeable_fn
  compaction: Extend compaction_result to collect more information
  system_keyspace: Upgrade compaction_history table
  system_keyspace: Create UDT: sstableinfo_type
  system_keyspace: Extract compaction_history struct
  system_keyspace: Squeeze update_compaction_history parameters
  compaction/compaction_manager: update_history accepts compaction_result as rvalue
2025-05-27 14:12:13 +03:00
Anna Stuchlik
b197d1a617 doc: update migration tools overview
This commit updates the migration overview page:

- It removes the info about migration from SSTable to CQL.
- It updates the link to the migrator docs.

Fixes https://github.com/scylladb/scylladb/issues/24247

Refs https://github.com/scylladb/scylladb/pull/21775

Closes scylladb/scylladb#24258
2025-05-27 14:07:35 +03:00
Michał Chojnowski
185a032044 utils/stream_compressor: allocate memory for zstd compressors externally
The default and recommended way to use zstd compressors is to let
zstd allocate and free memory for compressors on its own.

That's what we did for zstd compressors used in RPC compression.
But it turns out that it generates allocation patterns we dislike.

We expected zstd not to generate allocations after the context object
is initialized, but it turns out that it tries to downsize the context
sometimes (by reallocation). We don't want that because the allocations
generated by zstd are large (1 MiB with the parameters we use),
so repeating them periodically stresses the reclaimer.

We can avoid this by using the "static context" API of zstd,
in which the memory for context is allocated manually by the user
of the library. In this mode, zstd doesn't allocate anything
on its own.

The implementation details of this patch add a consideration for
forward compatibility: later versions of Scylla can't use a
window size greater than the one we hardcoded in this patch
when talking to the old version of the decompressor.

(This is not a problem, since those compressors are only used
for RPC compression at the moment, where cross-version communication
can be prevented by bumping COMPRESSOR_NAME. But it's something
that the developer who changes the window size must _remember_ to do).

Fixes #24160
Fixes #24183

Closes scylladb/scylladb#24161
2025-05-27 12:43:11 +03:00
Jenkins Promoter
76dddb758e Update pgo profiles - x86_64 2025-05-27 12:02:49 +03:00
Pavel Emelyanov
bd3bd089e1 sstables_loader: Fix load-and-stream vs skip-cleanup check
The intention was to fail the REST API call in case --skip-cleanup is
requested for --load-and-stream loading. The corresponding if expression
checks something else :( although the log message is correct.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24208
2025-05-27 12:01:01 +03:00
Jenkins Promoter
de9d9c9ece Update pgo profiles - aarch64 2025-05-27 11:59:56 +03:00
Andrzej Jackowski
555d897a15 test: wait for normal state propagation in test_auth_v2_migration
By default, cluster tests have skip_wait_for_gossip_to_settle=0 and
ring_delay_ms=0. In tests with gossip topology, it may lead to a race,
where nodes see different state of each other.

In case of test_auth_v2_migration, there are three nodes. If the first
node already knows that the third node is NORMAL, and the second node
does not, the system_auth tables can return incomplete results.

To avoid such a race, this commit adds a check that all nodes see other
nodes as NORMAL before any writes are done.

Refs: #24163

Closes scylladb/scylladb#24185
2025-05-27 11:41:09 +03:00
Nikos Dragazis
eaa2ce1bb5 sstables: Fix race when loading checksum component
`read_checksum()` loads the checksum component from disk and stores a
non-owning reference in the shareable components. To avoid loading the
same component twice, the function has an early return statement.
However, this does not guarantee atomicity - two fibers or threads may
load the component and update the shareable components concurrently.
This can lead to use-after-free situations when accessing the component
through the shareable components, since the reference stored there is
non-owning. This can happen when multiple compaction tasks run on the
same SSTable (e.g., regular compaction and scrub-validate).

Fix this by not updating the reference in shareable components, if a
reference is already in place. Instead, create an owning reference to
the existing component for the current fiber. This is less efficient
than using a mutex, since the component may be loaded multiple times
from disk before noticing the race, but no locks are used for any other
SSTable component either. Also, this affects uncompressed SSTables,
which are not that common.

Fixes #23728.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#23872
2025-05-27 11:26:35 +03:00
Botond Dénes
2739eb49fd Merge 'docs: remove API reference redirect' from David Garcia
Fix for https://github.com/scylladb/scylladb/pull/24097

The stable branch does not contain the split API reference yet. This change fixes the 404 error raised when accessing the API reference on the stable branch due to the redirect.

Closes scylladb/scylladb#24259

* github.com:scylladb/scylladb:
  docs: fix typo
  docs: remove API reference redirect
2025-05-27 11:24:27 +03:00
Nadav Har'El
8487d81c6e Merge 'test: mark difference in handling IFs in LWT as scylla_only' from Andrzej Jackowski
There is a difference in how ScyllaDB and Cassandra handle conditional
batches with different IF statements (such as "IF EXISTS" and "IF NOT
EXISTS"). Cassandra tries to detect condition conflicts, and prints
an error instead of silently failing the batch, but in ScyllaDB
we considered this check to be inconsistent and unhelpful, and
decided not to implement it.

In this series, we document the ScyllaDB behaviour more explicitly
and improve the relevant LWT tests.

Fixes: https://github.com/scylladb/scylladb/issues/13011

Backport not needed, only docs and minor test changes.

Closes scylladb/scylladb#24086

* github.com:scylladb/scylladb:
  test: mark difference in handling IFs in LWT as scylla_only
  docs: cql: add explicit explanation how mixing IFs works in LWT
  docs: lwt: add two missing spaces
2025-05-27 09:35:41 +03:00
Evgeniy Naydanov
efdb2abdc6 test.py: dtest: make bypass_cache_test.py run using test.py
As a part of the porting process, copy missing utility functions from scylla-dtest,
remove unused imports and markers, and add a single_node marker description to pytest.ini.

Enable the test in suite.yaml (run in dev mode only)
2025-05-27 05:48:26 +00:00
Evgeniy Naydanov
3a2410324c test.py: dtest: add missed ScyllaCluster.nodetool()
The method executes nodetool command on each running node in a cluster.
2025-05-27 05:48:26 +00:00
Evgeniy Naydanov
6105bb9530 test.py: dtest: copy unmodified bypass_cache_test.py
Test is disabled in suite.yaml
2025-05-27 05:48:26 +00:00
Andrzej Jackowski
7dc0c4cf4f test: close logfile/socket_dir for stopped servers in recycle_cluster
PythonTestSuite::recycle_cluster is a function that releases resources
of an old, dirty cluster to make it reusable. It closes log_file and
maintenance_socket_dir for running nodes in a dirty cluster, however it
doesn't do the same for stopped nodes. It leads to leakage of file
descriptors of stopped nodes, which in turn can lead to hitting ulimit
of open files (that is often 1024) if the leaking test is repeated with
`./test.py --repeat ...`. The problem was detected when tests from
`test/cluster/dtest/` directory were executed with high `repeat` value.

This commit extends `recycle_cluster` to close and clean up the logfile and
`socket_dir` for nodes that are stopped (because self.servers in
ScyllaCluster is a ChainMap of self.running and self.stopped).
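
The effect of iterating over the combined mapping can be illustrated with a small sketch. `FakeServer` and its `close_log()` method are hypothetical stand-ins for this example, not the test.py classes; only `collections.ChainMap` is the real standard-library type:

```python
from collections import ChainMap


class FakeServer:
    """Hypothetical stand-in for a server holding an open log file."""

    def __init__(self, name):
        self.name = name
        self.log_closed = False

    def close_log(self):
        self.log_closed = True


running = {1: FakeServer("running-node")}
stopped = {2: FakeServer("stopped-node")}
servers = ChainMap(running, stopped)  # a view over both dictionaries

# Old behaviour: only running servers are cleaned up, so the stopped one leaks.
for server in running.values():
    server.close_log()
leaked = [s.name for s in servers.values() if not s.log_closed]

# New behaviour: iterate over the ChainMap to reach stopped servers too.
for server in servers.values():
    server.close_log()
remaining = [s.name for s in servers.values() if not s.log_closed]
```

After the first loop `leaked` still names the stopped node; after the second loop nothing is left open.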

Closes scylladb/scylladb#24243
2025-05-27 08:37:43 +03:00
David Garcia
d99d1c315c docs: remove [erno X] prefix from metrics logger
Closes scylladb/scylladb#24246
2025-05-27 08:37:11 +03:00
David Garcia
3e331cfbbe docs: fix typo 2025-05-26 21:34:23 +02:00
David Garcia
eefc9c33e8 docs: remove API reference redirect
The stable branch does not contain the split API reference yet.
This change fixes the 404 error raised when accessing the API reference on the stable branch.
2025-05-26 21:32:07 +02:00
Andrzej Jackowski
ea6ef5d0aa test: mark difference in handling IFs in LWT as scylla_only
There is a difference in how ScyllaDB and Cassandra handle conditional
batches with different IF statements (such as "IF EXISTS" and "IF NOT
EXISTS"). Cassandra tries to detect condition conflicts, and prints
an error instead of silently failing the batch, but in ScyllaDB
we considered this check to be inconsistent and unhelpful, and
decided not to implement it.

This commit:
 - Makes test_lwt_with_batch_conflict_1 scylla_only instead of xfail, and
   changes the scenario to pass with the current implementation.
 - Adds test_lwt_with_batch_conflict_3, which shows how Cassandra fails a
   batch statement with different conditions, even when the conditions
   are not contradictory.
 - Adds test_lwt_with_batch_conflict_4/5, which show how static rows
   are handled in conditional batches.

Fixes: #13011
2025-05-26 15:47:11 +02:00
Andrzej Jackowski
2d4acb623e docs: cql: add explicit explanation how mixing IFs works in LWT
There is a difference in how ScyllaDB and Cassandra handle conditional
batches with different IF statements (such as "IF EXISTS" and "IF NOT
EXISTS").

This commit explicitly documents the differences in the behavior.

Refs: #13011
2025-05-26 15:13:01 +02:00
Piotr Dulikowski
4508823294 Merge 'test.py: dtest: few fixes missed in the initial implementation' from Evgeniy Naydanov
A few problems were found in the dtest shim code after scylladb/scylladb#21580 was merged:

- The call of the `init_default_config()` method was missed in scylladb/scylladb#21580. It is required to handle dtest options and markers.
- The implementation of the dtest shim uses `server_id` to format the name of a node in a cluster. This differs from dtest's behavior. Some dtests use code like `cluster.nodes()["node1"]` to get access to a node object.
- A default timeout was missing in the `ScyllaNode.wait_until_stopped()` method. Set it to 600 for debug mode or to 127 otherwise.

Closes scylladb/scylladb#24225

* github.com:scylladb/scylladb:
  test.py: dtest: set default wait_seconds based on build mode
  test.py: dtest: name nodes in cluster using index starting from 1
  test.py: dtest: initialize default config in dtest setup fixture
2025-05-26 13:37:12 +02:00
Yaron Kaikov
89ace09c18 [workflow]: add conflict_reminder to PRs based against master
Today we send a reminder to the PR's author when a backport PR has conflicts.
Often, PR authors wait for their PR to be reviewed/merged, but the merge does not happen because the PR now conflicts with master, so maintainers won't merge it. This can lead to a stall, where maintainers wait for the author to rebase and authors wait for the merge.

In this PR we add the ability to notify the PR author as soon as the base
branch has moved forward and a rebase is required.

Fixes: https://github.com/scylladb/scylla-pkg/issues/4955

Closes scylladb/scylladb#24209
2025-05-26 14:30:06 +03:00
David Garcia
6f722e8bc0 docs: split api reference in smaller files
Closes scylladb/scylladb#24097
2025-05-26 12:06:59 +03:00
Radosław Cybulski
90ebea5ebb Move mutation_fragment::kind into data object
Move `mutation_fragment::kind` enum into data object, reducing size
of the object from 16 to 8 bytes on current machines.
2025-05-26 11:06:54 +02:00
Radosław Cybulski
ef51bb9bd3 Make mutation_fragment::kind enum 1 byte size
Add std::uint8_t as the base of the `mutation_fragment::kind` enum, making it
one byte in size.
2025-05-26 11:06:54 +02:00
Radosław Cybulski
003e79ac9e Move mutation_fragment_v2::kind into data object
Move `mutation_fragment_v2::kind` enum into data object, reducing size
of the object from 16 to 8 bytes on current machines.
2025-05-26 11:06:53 +02:00
Radosław Cybulski
d211119e49 Make mutation_fragment_v2::kind enum 1 byte size
Add std::uint8_t as base to `mutation_fragment_v2::kind` enum,
which will resize it to 1 byte.
2025-05-26 11:06:53 +02:00
David Garcia
bf9534e2b5 docs: fix \t (tab) is not rendered correctly
Closes scylladb/scylladb#24096
2025-05-26 12:06:03 +03:00
Avi Kivity
29932a5af1 pgo: drop Java configuration
Since 5e1cf90a51
("build: replace tools/java submodule with packaged cassandra-stress")
we run pre-packaged cassandra-stress. As such, we don't need to look for
a Java runtime (which is missing on the frozen toolchain) and can
rely on the cassandra-stress package finding its own Java runtime.

Fix by just dropping all the Java-finding stuff.

Note: Java 11 is in fact present on the frozen toolchain, just
not in a way that pgo.py can find it.

Fixes #24176.

Closes scylladb/scylladb#24178
2025-05-26 10:16:03 +02:00
Avi Kivity
f195c05b0d untyped_result_set: mark get_blob() as returning unfragmented data
Blobs can be large, and unfragmented blobs can easily exceed 128k
(as seen in #23903). Rename get_blob() to get_blob_unfragmented()
to warn users.

Note that most uses are fine as the blobs are really short strings.

Closes scylladb/scylladb#24102
2025-05-26 09:40:34 +02:00
Michał Chojnowski
ff8a119f26 test/boost/sstable_compressor_factory_test: define a test suite name
It seems that tests in test/boost/combined_tests have to define a test
suite name, otherwise they aren't picked up by test.py.

Fixes #24199

Closes scylladb/scylladb#24200
2025-05-26 09:35:30 +02:00
Anna Stuchlik
d303edbc39 doc: remove copyright from Cassandra Stress
This commit removes the Apache copyright note from the Cassandra Stress page.

It's a follow up to https://github.com/scylladb/scylladb/pull/21723, which missed
that update (see https://github.com/scylladb/scylladb/pull/21723#discussion_r1944357143).

Cassandra Stress is a separate tool with separate repo with the docs, so the copyright
information on the page is incorrect.

Fixes https://github.com/scylladb/scylladb/issues/23240

Closes scylladb/scylladb#24219
2025-05-26 09:35:30 +02:00
Pavel Emelyanov
2a253ace5e Merge 'test.py: add coverage for boost with pytest execution' from Andrei Chekun
This PR adds the possibility to gather coverage for the boost tests when they're executed with pytest. Since the pytest will be used as the main runner for boost tests as well, we need this before switching the runners.

Closes scylladb/scylladb#24236

* github.com:scylladb/scylladb:
  test.py: add support for coverage for boost test
  test.py: get the temp dir from facade
2025-05-26 10:18:53 +03:00
Andrei Chekun
537054bfad test.py: add support for coverage for boost test
This PR adds the possibility to gather coverage for the boost tests when they're executed with pytest. Since the pytest will be used as the main runner for boost tests as well, we need this before switching the runners.
2025-05-23 12:54:54 +02:00
Andrei Chekun
c5a7f3415c test.py: get the temp dir from facade
No need to get the temp dir from the options when facade has this information already.
2025-05-23 12:54:48 +02:00
Nadav Har'El
d2844055ad Merge 'index: implement schema management layer for vector search indexes' from null
This pull request adds support for creating custom indexes (at a metadata level) as long as a supported custom class is provided (currently only vector search).

The patch contains:

- a change in CREATE INDEX statement that allows for the USING keyword to be present as long as one of the supported classes is used
- support for describing custom indexes in the DESCRIBE statement
- unit tests

Co-authored by: @Balwancia

Closes scylladb/scylladb#23720

* github.com:scylladb/scylladb:
  test/cqlpy: add custom index tests
  index: support storing metadata for custom indices
2025-05-22 12:19:36 +03:00
Pavel Emelyanov
a0d2e63303 Merge 'test.py: add the possibility to gather resource metrics for C++ tests' from Andrei Chekun
Move the run_process method to the resource gather instance, since we need to start a monitor to check memory consumption in the cgroup. Pytest has a concept of a test, but it is completely different from test.py's. The resource gather instance takes a test instance to save and extract information about the test, so an additional method emulating the test.py test instance is added to avoid rewriting the resource gather instance. Together, these changes make it possible to get metrics for a test in both runners: test.py and pytest.

Closes scylladb/scylladb#24091

* github.com:scylladb/scylladb:
  test.py: add missing parameter for boost tests for pytest runner
  test.py: add support for boost_data_test_case in combined tests
  test.py: clean log files after a successful run
  test.py: attach output of the boost test to the report
  test.py: fix metrics DB location
  test.py: move run_process to resource_gather.py
  test.py: unify using constant for finding repo root directory
  test.py: refactor run_process in facade.py
  test.py: add the possibility to create a test alike object
2025-05-22 10:34:34 +03:00
Evgeniy Naydanov
8dc5413f54 test.py: dtest: set default wait_seconds based on build mode
Default timeout was missed in `ScyllaNode.wait_until_stopped()` method.
Set it to 600 for debug mode or to 127 otherwise.
2025-05-22 06:39:03 +00:00
Evgeniy Naydanov
eca5d52f1d test.py: dtest: name nodes in cluster using index starting from 1
The current implementation of the dtest shim uses `server_id` to format the
name of a node in a cluster. This differs from dtest's behavior.
Some of dtests use code like `cluster.nodes()["node1"]` to get access
to a node object.  This commit changes it to be more consistent with
dtest.
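
A minimal sketch of the naming scheme, assuming a hypothetical `name_nodes()` helper (the shim's real code is not shown in this log). Names come from the node's position in the cluster, starting at 1, not from its `server_id`:

```python
def name_nodes(server_ids):
    """Hypothetical helper: map dtest-style names to server ids.

    Nodes are named by their 1-based position in the cluster, so the first
    node is always "node1" regardless of the id the manager assigned to it.
    """
    return {f"node{index}": server_id
            for index, server_id in enumerate(server_ids, start=1)}


# Server ids are arbitrary; the names stay stable and dtest-compatible.
nodes = name_nodes([17, 42, 108])
first_node_id = nodes["node1"]
```

With this scheme a lookup such as `nodes["node1"]` works no matter which ids the servers were given.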
2025-05-22 06:34:03 +00:00
Evgeniy Naydanov
91e29a302a test.py: dtest: initialize default config in dtest setup fixture
The call of `init_default_config()` method was missed in #21580.
It is required to handle dtest options and markers.
2025-05-22 06:22:04 +00:00
Andrei Chekun
8812b14078 test.py: add missing parameter for boost tests for pytest runner
Since we are running tests with a pytest, we don't need a report at the end of the run.
2025-05-21 19:41:41 +02:00
Andrei Chekun
66b014621e test.py: add support for boost_data_test_case in combined tests
Change the parsing logic of combined tests to support the case where boost_data_test_case is used, which produces additional lines in the output.
2025-05-21 19:41:41 +02:00
Andrei Chekun
88d24d8ad5 test.py: clean log files after a successful run
Clean up the various output files from the boost and unit tests.
Move logs for boost tests to the testlog directory instead of having an additional pytest directory.
2025-05-21 19:41:41 +02:00
Andrei Chekun
a956dd8770 test.py: attach output of the boost test to the report
Attach the output of the test to the Allure report when the test fails.
2025-05-21 19:41:39 +02:00
Andrei Chekun
ac86cc9f6d test.py: fix metrics DB location
Fix the issue introduced with scylladb/scylladb#22960. Suite log dir was changed, and the path for metrics DB was relying on it. As a result, DB is now located in the mode directory instead of the root of the testlog.
2025-05-21 15:37:15 +02:00
Andrei Chekun
b5b69710bd test.py: move run_process to resource_gather.py
Move the run_process method to the resource gather instance, since we need to start a monitor to check memory consumption in the cgroup. Since resource_gather needs a test.py test object, and pytest has no clue about it, add a simple namespace object to emulate such a test object. It is needed only to gather some information about the test to be able to add records to the DB.
Since we have two facades that can share the same run process procedure, add a common method to handle this and avoid code duplication.
2025-05-21 15:34:34 +02:00
Andrei Chekun
3bcd6db718 test.py: unify using constant for finding repo root directory
Finding the repo root directory dynamically, relative to the temp dir (which in most cases is inside the repo), fails if a non-default temp dir parameter is used. Switch to the constant instead, which also gives a single source of truth for the repo root directory.
2025-05-21 15:34:34 +02:00
Andrei Chekun
4e18444831 test.py: refactor run_process in facade.py
Add injection of environment variables into the process.
Switch from print to a proper logger.
Set the buffer size to 1 to avoid losing any data from the boost test if the test collapses.
Currently, run process logs and returns stdout and stderr, but boost tests use stderr only, so stderr is redirected to stdout. This helps with Jenkins as well, since it reduces the number of files to store.
2025-05-21 15:34:34 +02:00
Andrei Chekun
38310975c5 test.py: add the possibility to create a test alike object
resource_gather.py needs a test.py test object to work: it needs some information about the test to be able to write it to the DB with metrics. When running with pytest, there's no such test object, which is why make_test_object is added to mimic test.py's test object.
Switch to getting the mode for constructing the cgroup path from the test
instead of the suite. They are the same, but this means there is less to
emulate in the make_test_object method.
2025-05-21 15:34:34 +02:00
Pavel Emelyanov
dac7589cef Revert "encryption_test: Catch exact exception"
This reverts commit 2d5c0f0cfd.

KMS tests became flaky after it: #24218
Need to revisit.
2025-05-20 13:52:14 +03:00
Petr Gusev
0443081b0d build: fix merge-compdb.py for CMake 'output' attributes
compile_commands.json is used by LSPs (e.g. `clangd` in VS Code) for
code navigation. `merge-compdb.py`, called by `configure.py`, merges
these files from Scylla, Seastar, and Abseil. The script filters
entries by checking the output attribute against a given prefix. This
is needed because Scylla’s compile_commands.json is generated by Ninja
and includes all build modes, in case the user specified multiple
ones in the call to configure.py. Seastar and Abseil databases,
generated by CMake, used to omit the output attribute, so filtering
did not apply. Starting with `CMake 3.20+`, output attributes are now
included and do not match the expected prefix. For example, they
could be of the form
`absl/synchronization/CMakeFiles/synchronization.dir/internal/futex_waiter.cc.o`.
This causes relevant entries from Seastar and Abseil to be filtered out.

This patch refactors `merge-compdb.py` to allow specifying an
optional prefix per input file, preserving the intent of applying
the output filtering logic only to the Ninja-generated
Scylla compdb file.
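
The per-file prefix idea can be sketched as follows. This is an illustration under stated assumptions, not the actual `merge-compdb.py`: the `merge_compdbs()` function, its argument shape, and the sample entries are hypothetical, and entries without an `output` attribute are assumed to be kept, as the description above implies:

```python
def merge_compdbs(inputs):
    """Merge compilation databases, filtering each by its own prefix.

    `inputs` is a list of (entries, prefix) pairs; a prefix of None means
    "do not filter this database at all".
    """
    merged = []
    for entries, prefix in inputs:
        for entry in entries:
            output = entry.get("output")
            if prefix is None or output is None or output.startswith(prefix):
                merged.append(entry)
    return merged


# Ninja-generated database: contains every configured build mode.
scylla = [
    {"file": "main.cc", "output": "build/dev/main.o"},
    {"file": "main.cc", "output": "build/debug/main.o"},
]
# CMake-generated database: newer CMake emits an unrelated "output" path.
abseil = [
    {"file": "futex_waiter.cc",
     "output": "absl/synchronization/CMakeFiles/synchronization.dir/"
               "internal/futex_waiter.cc.o"},
]

# One global prefix drops the Abseil entry (the bug described above).
with_global_prefix = merge_compdbs([(scylla, "build/dev/"),
                                    (abseil, "build/dev/")])
# A per-file prefix filters only the Scylla database.
with_per_file_prefix = merge_compdbs([(scylla, "build/dev/"), (abseil, None)])
```

The second call keeps the `dev` Scylla entry and the Abseil entry, while the first keeps only the Scylla entry.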

Closes scylladb/scylladb#24211
2025-05-20 08:43:09 +03:00
Piotr Dulikowski
c15cf54e3d Merge 'test.py: migrate alternator_tests.py from dtest suite' from Evgeniy Naydanov
We have a significant number of tests in the scylla-dtest repository, and I believe most of them can simply be copied to the test.py framework with a relatively small amount of shim code. In this PR I did that for 2 tests: [alternator_tests.py](https://github.com/scylladb/scylla-dtest/blob/next/alternator_tests.py) and [error_example_test.py](https://github.com/scylladb/scylla-dtest/blob/next/error_example_test.py)

One problem is the async nature of the test.py framework versus the synchronous nature of scylla-dtest. It is resolved by using the universalasync third-party library. Another problem is ccmlib, which is resolved by adding shim code (`test/dtest/ccmlib`).

ccmlib has a lot of dead code, and not all of its features are used by scylla-dtest. In this PR I added checks to ensure that we don't accidentally use some of them or miss something. Once the migration is done, we can easily remove all unused parameters and these checks.

`error_example_test.py` is copied as is (only a license preamble is added); `alternator_tests.py` has small changes:

1. License preamble
2. Remove unused imports
3. Remove unneeded `skip_if` marker (I think it can be backported to dtest, or we can remove the test from dtest after merging this PR)

```diff
--- ../../../scylla-dtest/alternator_tests.py
+++ alternator_tests.py
@@ -1,17 +1,20 @@
+#
+# Copyright (C) 2025-present ScyllaDB
+#
+# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
+#
+
 import logging
 import operator
 import os
 import random
-import shutil
 import string
-import subprocess
 import tempfile
 import time
 from ast import literal_eval
 from concurrent.futures.thread import ThreadPoolExecutor
 from copy import deepcopy
 from decimal import Decimal
-from pathlib import Path
 from pprint import pformat

 import boto3.dynamodb.types
@@ -46,7 +49,6 @@
 )
 from dtest_class import get_ip_from_node, wait_for
 from tools.cluster import new_node
-from tools.marks import issue_open, with_feature
 from tools.misc import set_trace_probability
 from tools.retrying import retrying

@@ -168,7 +170,6 @@
         read_and_delete_set_elements_thread.join()

     @pytest.mark.next_gating
-    @pytest.mark.skip_if(with_feature("tablets") & issue_open("#18002"))
     def test_decommission_during_dynamo_load(self):
         self.prepare_dynamodb_cluster(num_of_nodes=3)
         node1, node2, node3 = self.cluster.nodelist()
```

Because all tests in this repo are considered to be "gating", I removed all non-next_gating tests and all dtest suite markers in a separate commit.

To reduce test execution time, the tests run in dev mode only and some sleeps were made shorter.

As a result, 23 tests are added in total (22 in `test_alternator.py` and 1 in `test_error_example`). The added tests will increase CI time by ~2x4 = 8 minutes.

Closes scylladb/scylladb#21580

* github.com:scylladb/scylladb:
  test.py: dtest/alternator_tests.py: make sleep intervals smaller
  test.py: dtest/alternator_tests.py: remove not next_gating tests
  test.py: migrate alternator_tests.py from dtest
  test.py: initial implementation of dtest/ccm shim
  test.py: manager: add server_get_returncode() method
  test.py: manager: change CLI and env options on a node start
  test.py: REST API: add set_trace_probability() method
  test.py: REST API: add get_tokens() method
  test.py: rework log_browsing for dtest migration
2025-05-20 00:13:16 +02:00
Evgeniy Naydanov
e456f0ed7b test.py: dtest/alternator_tests.py: make sleep intervals smaller 2025-05-19 12:27:32 +00:00
Evgeniy Naydanov
8dd86818a0 test.py: dtest/alternator_tests.py: remove not next_gating tests
Remove all non-next_gating tests and remove any dtest suite markers
because all tests in this repo are considered to be "gating".
2025-05-19 12:27:32 +00:00
Evgeniy Naydanov
57c1035146 test.py: migrate alternator_tests.py from dtest
The test is almost unmodified, except for the removal of an unneeded skip_if mark
and unused imports.
2025-05-19 12:27:32 +00:00
Evgeniy Naydanov
ac1551892b test.py: initial implementation of dtest/ccm shim
Use the universalasync library to make test.py's async code compatible
with the synchronous code of dtest/ccm.

Also, copied unmodified error_example_test.py from dtest as an example.

Run the test in `dev` mode only.
2025-05-19 12:27:31 +00:00
Evgeniy Naydanov
2cb640f95c test.py: manager: add server_get_returncode() method
The method returns None if the Scylla process is still running, or its return code otherwise.
If no Scylla process has been launched, it raises a NoSuchProcess exception.
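
The return-value contract described above can be sketched in Python as follows. This is a minimal illustration built on `subprocess.Popen`, not the test.py manager code: the function name `get_returncode` and the local `NoSuchProcess` class are stand-ins chosen for this example.

```python
import subprocess
import sys


class NoSuchProcess(Exception):
    """Hypothetical stand-in for the exception named in the commit message."""


def get_returncode(proc):
    """Return None while the process runs, or its exit code once it has finished.

    `proc` is a subprocess.Popen object, or None if nothing was ever launched.
    """
    if proc is None:
        raise NoSuchProcess("no process was launched")
    # Popen.poll() returns None while the child is alive, else the exit code.
    return proc.poll()
```

A caller can therefore distinguish "still running" (None) from "exited cleanly" (0), which a plain truthiness check would conflate.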
2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
d874beb17f test.py: manager: change CLI and env options on a node start
Add parameters to server_start() method to provide ability to
change Scylla's CLI and env options on a node start.

Also, add `expected_server_up_state` parameter as we have for
server_add() method.
2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
5d3b54aa9b test.py: REST API: add set_trace_probability() method 2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
a16a4b6171 test.py: REST API: add get_tokens() method
Get a list of the tokens for the specified node.
Optional `endpoint` parameter can be provided.
2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
f6e3fdd778 test.py: rework log_browsing for dtest migration
Rework `ScyllaLogFile.wait_for()` method to make it easier
to add required methods to ScyllaNode class of ccm-like shim.

Also, added `ScyllaLogFile.grep_for_errors()` method and
reworked `ScyllaLogFile.grep()`
2025-05-19 11:50:55 +00:00
Łukasz Paszkowski
0a2f0c6852 nodetool: Refactor of compactionhistory_operation
Simplify code by using std::apply that unpacks std::array into
separate items to pass further to a callable. This simplifies
the code that looks like:

fmt::print(std::cout, fmt::runtime(header_row_format.c_str()),
           header_row[0], header_row[1], header_row[2], header_row[3],
           header_row[4], header_row[5], header_row[6], header_row[7],
           header_row[8], header_row[9], header_row[10], header_row[11],
           header_row[12], header_row[13]);

into something like:
std::apply(fh, header_row);
2025-05-16 20:00:00 +02:00
Łukasz Paszkowski
edb666f461 nodetool: Add more stats into compactionhistory output
Incorporate additional statistics stored in the compaction_history
system table. Depending on the requested format type, the output has
different form.

Remove unnecessary duplicated history_entry struct and instead use
extracted db::compaction_history_entry structure.

Running the cql command: select * from system.compaction_history;
prints sstable's generation type as UUID (e.g. 5a5cf800-b617-11ef-a97d-8438c36f0e31),
see generation_type::data_value() which is different than its fmt
format (e.g. 3glx_0srx_1pasg2ksepk902v8dt). Therefore, to unify
the outputs, generation_type is converted to data_value before
it is printed.
2025-05-16 20:00:00 +02:00
Łukasz Paszkowski
583cc675ce api/compaction_manager: Extend compaction_history api
Extend api of /compaction_manager/compaction_history to include
newly added columns to the compaction history table from the previous
patches.
2025-05-16 20:00:00 +02:00
Łukasz Paszkowski
2793369288 compaction: Collect tombstone purge stats during compaction
Collect tombstone purge statistics like
+ total number of purge attempts
+ number of purge failures due to data overlapping with memtables
+ number of purge failures due to data overlapping with non-compacting
  sstables

and expose them in the compaction_stats structure.
2025-05-16 20:00:00 +02:00
Łukasz Paszkowski
6b729fabc9 compacting_reader: Extend to accept tombstone purge statistics
Extends the make_compacting_reader function and the constructor of
the compacting_reader, in order to accept an optional pointer to
the tombstone purge statistics structure that is later passed
further down to compact_mutation_state.
2025-05-16 20:00:00 +02:00
Łukasz Paszkowski
546b2c191f mutation_compactor: Collect tombstone purge attempts
Let compact_mutation_state collect all tombstone purge attempts
and failures. For this purpose a new statistic structure is created
(tombstone_purge_stats) and the relative stats are collected in
the can_purge_tombstone method.

The statistics are collected only for sstable compaction.

An optional statistics structure can be passed in via compact_mutation_state
constructor.
2025-05-16 20:00:00 +02:00
Łukasz Paszkowski
503d4f014c compaction_garbage_collector: Extend return type of max_purgeable_fn
Currently, when a max purgeable timestamp is computed, there is no
information where it comes from and how the value was obtained.
Take compaction: if there are memtables or other uncompacting sstables
possibly shadowing data, the timestamp is decreased to ensure a
tombstone is not purged, but the caller does not know why the
timestamp has its value.

In this patch, we extend the return type of max_purgeable_fn to
contain not only a timestamp but also information on how it was
computed. This information will be required to collect statistics
on tombstone purge failures due to overlapping memtables/uncompacting
sstables that come later in the series.
2025-05-16 19:59:54 +02:00
Anna Stuchlik
2d7db0867c doc: fix the product name for version 2025.1
Starting with 2025.1, ScyllaDB versions are no longer called "Enterprise",
but the OS support page still uses that label.
This commit fixes that by replacing "Enterprise" with "ScyllaDB".

This update is required since we've removed "Enterprise" from everywhere else,
including the commands, so having it here is confusing.

Fixes https://github.com/scylladb/scylladb/issues/24179

Closes scylladb/scylladb#24181
2025-05-16 12:16:00 +02:00
Avi Kivity
37f9cf6de6 dist: rpm: override %_sbindir for Fedora 42
Fedora 42 merged /usr/sbin into /usr/bin [1]. As part of that change
the rpm macro %_sbindir was redefined from /usr/sbin to /usr/bin. As
a result RPM build on Fedora 42 fails: install.sh places some files
into /usr/sbin, while rpmbuild looks for them in /usr/bin.

We could resolve this either by following the change and moving
the files to /usr/bin as well, or fixing the spec to place the files
in /usr/sbin. The former is more difficult:
 - what about Debian/Ubuntu?
 - what about older RPM-based distributions (like all RHEL distributions)?
 - what about scripts that hard-code /usr/sbin/<scylla utility>?

So we pick the latter, and redefine %_sbindir to /usr/sbin. Since that
directory still exists (as a symlink), installation on systems with
merged /usr/bin and /usr/sbin will work.

We'll have to address the problem later (likely by installing to either
/usr/bin or /usr/sbin depending on context), but for now, this is a simple
solution that works everywhere.

[1] https://fedoraproject.org/wiki/Changes/Unify_bin_and_sbin

Closes scylladb/scylladb#24101
2025-05-16 12:05:29 +02:00
Aleksandra Martyniuk
9c03255fd2 cql_test_env: main: move stream_manager initialization
Currently, stream_manager is initialized after storage_service and
so it is stopped before the storage_service is. In its stop method
storage_service accesses stream_manager which is uninitialized
at that time.

Move stream_manager initialization before the storage_service initialization.

Fixes: #23207.

Closes scylladb/scylladb#24008
2025-05-15 17:17:35 +03:00
Avi Kivity
4f87362abb compaction_manager: drop gratuitous conversion from interval to wrapped_interval
The conversion is unnecessary and likely dates back from before the
split between interval and wrapped_interval. It gets in the way
of making the conversion explicit.

Closes scylladb/scylladb#24164
2025-05-15 16:15:55 +03:00
Nadav Har'El
27ad772a66 test/cqlpy: fix "run --release 2025.1"
This patch fixes "test/cqlpy/run --release 2025.1" which fails as
follows on all tests with indexes or views:

        Secondary indexes are not supported on base tables with tablets

test/cqlpy/run can run cqlpy (and alternator) tests on various official
releases of Scylla which it knows how to download. When running old
versions of Scylla, we need to change the configuration options to those
that were needed on specific versions.

On new versions of Scylla we need to pass
        --experimental-features=views-with-tablets
to be able to test materialized views, but in older versions we need to
remove that parameter because it didn't exist. We incorrectly removed it
for any versions 2025.1 or earlier, but that's incorrect - it just needs
to be removed for versions strictly earlier than 2025.1 - it is needed
for 2025.1 (I tested it is indeed needed even in the earlier RCs).
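
A minimal sketch of the corrected version check, assuming releases are given as dotted strings such as `2025.1` or `6.2`. The helper names are hypothetical, and the real `test/cqlpy/run` script may parse releases differently:

```python
import re


def parse_release(release):
    """Extract (major, minor) from a release string such as '2025.1' or '2025.1.0-rc2'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    return int(match.group(1)), int(match.group(2))


def needs_views_with_tablets(release):
    """True if --experimental-features=views-with-tablets should be kept.

    The flag is dropped only for releases strictly earlier than 2025.1;
    2025.1 itself still needs it.
    """
    return parse_release(release) >= (2025, 1)
```

The bug described above corresponds to using `>` instead of `>=` in the final comparison.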

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24144
2025-05-15 16:13:01 +03:00
Pavel Emelyanov
2f5b452c7c api: Drop class req_params
It's now unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-15 11:08:52 +03:00
Pavel Emelyanov
9628c3a4a5 api: Stop using req_params in parse_scrub_options
The "keyspace" and "cf" pair of options are now parsed similarly to how
recently changed ss::force_keyspace_compaction handler does.

The "scrub_mode" query param is saved directly into sstring variable and
its presence is checked by .empty() call. If the parameter is missing,
the request::get_query_param() would return empty string, so the change
is correct.

The "skip_corrupted" is boolean option, other options are already parsed
by hand, without the help of req_params facilities.

There's a test that validates the work of req_params::process() of scrub
endpoint -- it passes "invalid" options. This test is temporarily
removed according to the PR description.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-15 11:07:57 +03:00
Pavel Emelyanov
fd0128849e api: Stop using req_params in tasks::force_keyspace_compaction_async
This handler in fact duplicates the cf::force_major_compaction in how
it parses its options, so the change is the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-15 11:07:53 +03:00
Pavel Emelyanov
09c9a5baa7 api: Stop using req_params in ss::force_keyspace_compaction
The "keyspace" mandatory param and "cf" query one are used,
respectively, to get and validate keyspace and to parse table infos.
Both actions can be used with the corresponding parse_table_infos()
overload.

Other parameters are boolean query ones and can be parsed directly.
By and large this change repeats the change in
cf::force_major_compaction done previously.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-15 11:07:52 +03:00
Pavel Emelyanov
f7e8d6ba09 api: Stop using req_params in ss::force_compaction
This handler only has two query parameters that can be parsed using the
validate_bool_x helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-15 11:07:52 +03:00
Pavel Emelyanov
a320550bd1 api: Stop using req_params in cf::force_major_compaction
The mandatory "name" parameter can be picked directly from request path
params, as described in the PR description.

The "split_output" is placeholder and is just checked for being there at
all, without any parsing.

Other parameters are query ones too, and are parsed with the help of
recently introduced validate_bool_x helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-15 11:07:52 +03:00
Pavel Emelyanov
253c82f03a api: Add validate_bool_x() helper
There's validate_bool() one that converts "true" to true and "false" to
false. This helper mimics the req_params' parser of bool and renders
true from "true", "yes" or "1" and false from "false", "no" or "0" (all
case insensitively). Unlike its prototype, which renders disengaged
optional bool in case the parameter is empty, this helper returns the
passed default value.
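
The parsing rules described above can be modeled in a few lines of Python. This is a sketch of the semantics only, not the C++ helper itself; in particular, raising `ValueError` for unrecognized input is an assumption, since the commit message does not say how invalid values are reported.

```python
_TRUE_VALUES = {"true", "yes", "1"}
_FALSE_VALUES = {"false", "no", "0"}


def validate_bool_x(param, default):
    """Parse a boolean query parameter, case-insensitively.

    An empty (missing) parameter yields `default` rather than "no value".
    """
    if param == "":
        return default
    lowered = param.lower()
    if lowered in _TRUE_VALUES:
        return True
    if lowered in _FALSE_VALUES:
        return False
    # Assumption: invalid input is rejected with an error.
    raise ValueError(f"invalid boolean parameter value: {param!r}")
```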

Will replace the req_params eventually.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-15 11:07:52 +03:00
Botond Dénes
697945820b Merge 'utils: chunked_vector: add some modifiers' from Avi Kivity
chunked_vector is a replacement for std::vector that avoids large contiguous
allocations.

In this series, we add some missing modifiers and improve quality-of-life for
chunked_vector users (the static_assert patch).

Those modifiers were generally unused since they have O(n) complexity
and therefore not useful for hot paths, but they are used in some
control plane code on vectors which we'd like to replace with chunked_vectors.

A candidate for such a replacement is token_range_vector (see #3335).

This is a prerequisite for fixing some minor stalls; I don't expect we'll backport
fixes to those stalls.

Closes scylladb/scylladb#24162

* github.com:scylladb/scylladb:
  utils: chunked_vector: add swap() method
  utils: chunked_vector: add range insert() overloads
  utils: chunked_vector: relax static_assert
  utils: chunked_vector: implement erase() for single elements and ranges
  utils: chunked_vector: implement insert() for single-element inserts
2025-05-15 09:42:14 +03:00
Yaron Kaikov
f124b073b1 toolchain: set scylla-driver release based on tools/cqlsh
In `install-dependencies.sh` we use a hardcoded `scylla-driver` release.
This version should be identical to the `tools/cqlsh/requirements.txt`
value.
It's better to have one source for the `scylla-driver` version. Updating
`install-dependencies.sh` to use the release from `tools/cqlsh` directly.

Removing `geomet` hardcoded version

Also removing the support for `s390x` arch as we never use it

Frozen toolchain regenerated.

Optimized clang from
* https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
* https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Closes scylladb/scylladb#23841
2025-05-15 06:08:14 +03:00
Pavel Emelyanov
2e83b0367f api: Use structured bindings in get_built_indexes() code
Shorter this way

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24155
2025-05-14 19:03:13 +03:00
Wojciech Mitros
5920647617 mv: remove queue length limit from the view update read concurrency semaphore
Each view update is correlated to a write that generates it (aside from view
building which is throttled separately). These writes are limited by a throttling
mechanism, which effectively works by performing the writes with CL=ALL if
ongoing writes exceed some memory usage limit.

When writes generate view updates, they usually also need to perform a read. This read
goes through a read concurrency semaphore where it can get delayed or killed. The
semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue.
If the number of queued reads exceeds a specific limit, the view update will fail on
the replica, causing inconsistencies.

This limit is not necessary. When a read gets queued on the semaphore, the write that's
causing the view update is paused, so the write takes part in the regular write throttling.
If too many writes get stuck on view update reads, they will get throttled, so their
number is limited and the number of queued reads is also limited to the same amount.

In this patch we remove the specified queue length limit for the view update read concurrency
semaphore. Instead of this limit, the queue will be now limited indirectly, by the base write
throttling mechanism. This may allow the queue to grow longer than with the previous limit, but
it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the
remaining ones that get queued use a tiny amount of memory, less than the writes that generated
them and which are getting limited directly.

Fixes https://github.com/scylladb/scylladb/issues/23319

Closes scylladb/scylladb#24112
2025-05-14 18:29:30 +03:00
Botond Dénes
700a5f86ed tools/scylla-nodetool: status: handle negative load sizes
Negative load sizes don't make sense, but we've seen a case in
production, where a negative number was returned by ScyllaDB REST API,
so be prepared to handle these too.

Fixes: scylladb/scylladb#24134

Closes scylladb/scylladb#24135
2025-05-14 18:28:29 +03:00
Avi Kivity
70be73d036 Merge 'Refactor out code from test_restore_with_streaming_scopes' from Robert Bindar
Lots of code from this test can be reused in PR #23861. I'm splitting it now in this change so we can merge it cleanly as a separate patch.

Refs #23564

Closes scylladb/scylladb#24105

* github.com:scylladb/scylladb:
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
2025-05-14 18:10:53 +03:00
Botond Dénes
9f8de9adc8 Merge 'Add ability to skip SSTables cleanup when loading them' from Pavel Emelyanov
The non-streaming loading of sstables performs cleanup since recently [1]. For vnodes, unfortunately, cleanup is almost unavoidable, because of the nature of vnodes sharding, even if sstable is already clean. This leads to waste of IO and CPU for nothing. Skipping the cleanup in a smart way is possible, but requires too many changes in the code and in the on-disk data. However, the effort will not help existing SSTables and it's going to be obsoleted by tablets some time soon.

That said, the easiest way to skip cleanup is the explicit --skip-cleanup option for nodetool and the respective skip_cleanup parameter for the API handler.

New feature, no backport

fixes #24136
refs #12422 [1]

Closes scylladb/scylladb#24139

* github.com:scylladb/scylladb:
  nodetool: Add refresh --skip-cleanup option
  api: Introduce skip_cleanup query parameter
  distributed_loader: Don't create owned ranges if skip-cleanup is true
  code: Push bool skip_cleanup flag around
2025-05-14 16:47:34 +03:00
Avi Kivity
13a75ff835 utils: chunked_vector: add swap() method
Following std::vector(), we implement swap(). It's a simple matter
of swapping all the contents.

A unit test is added.
2025-05-14 16:19:40 +03:00
Avi Kivity
24e0d17def utils: chunked_vector: add range insert() overloads
Inserts an iterator range at some position.

Again we insert the range at the end and use std::rotate() to
move the newly inserted elements into place, forgoing possible
optimizations.

Unit tests are added.
2025-05-14 16:19:40 +03:00
Avi Kivity
9425a3c242 utils: chunked_vector: relax static_assert
chunked_vector is only implemented for types with a
non-throwing move constructor; this greatly simplifies
the implementation.

We have a static_assert to enforce it (should really
be a constraint, but chunked_vector predates C++ concepts).

This static_assert prevents forward declarations from compiling:

    class forward_declared;
    using a = utils::chunked_vector<forward_declared>;

`a` won't compile since the static_assert will be instantiated
and will fail since forward_declared is an incomplete type. Using
a constraint has the same problem.

Fix by moving the static_assert to the destructor. The destructor
won't be instantiated by the forward declaration, so it won't
trigger. It will trigger when someone destroys the vector; at this
point the types are no longer forward declared.
2025-05-14 16:19:40 +03:00
Avi Kivity
d6eefce145 utils: chunked_vector: implement erase() for single elements and ranges
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.
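
The rotate-then-resize technique can be modeled on a plain Python list. This is an illustrative sketch, not the chunked_vector code: `rotate` mimics the semantics of `std::rotate(first, middle, last)`, and both function names are chosen for this example.

```python
def rotate(seq, first, middle, last):
    """Model of std::rotate: the element at `middle` becomes the new element at `first`."""
    seq[first:last] = seq[middle:last] + seq[first:middle]


def erase(seq, first, last):
    """Erase seq[first:last] by rotating it to the end and shrinking the list."""
    count = last - first
    # Move the doomed elements behind everything that follows them.
    rotate(seq, first, last, len(seq))
    # "Resize" the list so the rotated-out tail disappears.
    del seq[len(seq) - count:]
```

Erasing a single element is the special case `erase(seq, pos, pos + 1)`.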

Again we defer optimization for trivially copyable types.

Unit tests are added.

Needed for range_streamer with token_ranges using chunked_vector.
2025-05-14 16:19:37 +03:00
Botond Dénes
b491ae1039 Merge 'raft_sys_table_storage: avoid temp buffer when deserializing log_entry' from Petr Gusev
The get_blob method linearizes data by copying it into a single buffer, which can cause 'oversized allocation' warnings.

In this commit we avoid copying by creating an input stream on top of the original fragmented managed bytes, returned by untyped_result_set_row::get_view.

fixes scylladb/scylladb#23903

backport: no need, not a critical issue.

Closes scylladb/scylladb#24123

* github.com:scylladb/scylladb:
  raft_sys_table_storage: avoid temporary buffer when deserializing log_entry
  serializer_impl.hh:  add as_input_stream(managed_bytes_view) overload
2025-05-14 15:10:47 +03:00
Avi Kivity
5301f3d0b5 utils: chunked_vector: implement insert() for single-element inserts
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).

Implement using push_back() and std::rotate().

emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).

The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster than std::rotate(),
but this complex optimization is left for later.
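
The push_back() plus std::rotate() approach can be illustrated on a Python list. This is a model of the algorithm only, with `rotate` standing in for `std::rotate(first, middle, last)`; the names are chosen for this sketch and are not the chunked_vector API.

```python
def rotate(seq, first, middle, last):
    """Model of std::rotate: the element at `middle` becomes the new element at `first`."""
    seq[first:last] = seq[middle:last] + seq[first:middle]


def insert_at(seq, pos, value):
    """Insert `value` before index `pos` by appending and rotating it into place."""
    seq.append(value)  # corresponds to push_back()
    # Rotate the tail [pos, end) so the freshly appended element lands at `pos`.
    rotate(seq, pos, len(seq) - 1, len(seq))
```

Every element after `pos` is moved once, which is why the operation is O(n) and suited to control-plane code rather than hot paths.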

Unit tests are added.
2025-05-14 14:54:59 +03:00
Robert Bindar
548a1ec20a Refactor out code from test_restore_with_streaming_scopes
part 5: check_data_is_back

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-05-14 11:39:01 +03:00
Robert Bindar
29309ae533 Refactor out code from test_restore_with_streaming_scopes
part 4: compute_scope

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-05-14 11:39:01 +03:00
Robert Bindar
a0f0580a9c Refactor out code from test_restore_with_streaming_scopes
part 3: create_dataset

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-05-14 11:38:59 +03:00
Robert Bindar
5171ca385a Refactor out code from test_restore_with_streaming_scopes
part 2: take_snapshot

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-05-14 11:31:19 +03:00
Robert Bindar
f09bb20ac4 Refactor out code from test_restore_with_streaming_scopes
part 1: create_cluster

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-05-14 11:30:40 +03:00
Andrzej Jackowski
8b660f0af7 test: add tests for prepared statement metadata consistency corner cases
Implement corner-cases of prepared statement metadata, as described in
scylladb#20860.

Although the purpose of the test was to verify the newly implemented
SCYLLA_USE_METADATA_ID protocol extension, the test also passes with
scylla-driver 3.29.3 that doesn't implement the support for this
extension. That is because the driver doesn't implement support for
skip_metadata flag, so fresh metadata are included in every prepared
statement response, regardless of the metadata_id.

This change:
 - Add test_changed_prepared_statement_metadata_columns to verify
   a scenario when a number of columns changes in a table used by a
   prepared statement
 - Add test_changed_prepared_statement_metadata_types to verify
   a scenario when a type of a column changes in a table used by a
   prepared statement
 - Add test_changed_prepared_statement_metadata_udt to verify
   a scenario when a UDT changes in a table used by a prepared statement

I tested the code with a modified Python driver
(ref. scylladb/python-driver#457):
 - If SKIP_METADATA is enabled (scylladb/python-driver@c1809c1)
   but no other changes are introduced, all three test cases fail.
 - If SKIP_METADATA is disabled (no scylladb/python-driver@c1809c1) all
   test cases pass because fresh metadata are included in each reply.
 - If SKIP_METADATA is enabled (scylladb/python-driver@c1809c1)
   and SCYLLA_USE_METADATA_ID extension is included
   (scylladb/python-driver@8aba164) all test cases pass and verify
   the correctness of the implementation.
2025-05-14 09:59:19 +02:00
Andrzej Jackowski
086df24555 transport: implement SCYLLA_USE_METADATA_ID support
Metadata id was introduced in CQLv5 to make metadata of prepared
statement consistent between driver and database. This commit introduces
a protocol extension that allows using the same mechanism in CQLv4.

This change:
 - Introduce SCYLLA_USE_METADATA_ID protocol extension for CQLv4
 - Introduce METADATA_CHANGED flag in RESULT. The flag comes directly
   from CQLv5 binary protocol. In CQLv4, the bit was never used, so we
   assume it is safe to reuse it.
 - Implement handling of metadata_id and METADATA_CHANGED in RESULT rows
 - Implement returning metadata_id in RESULT prepared
 - Implement reading metadata_id from EXECUTE
 - Added description of SCYLLA_USE_METADATA_ID in documentation

Metadata_id is wrapped in cql_metadata_id_wrapper because we need to
distinguish the following situations:
 - Metadata_id is not supported by the protocol (e.g. CQLv4 without the
   extension is used)
 - Metadata_id is supported by the protocol but not set - e.g. PREPARE
   query is being handled: it doesn't contain metadata_id in the
   request but the reply (RESULT prepared) must contain metadata_id
 - Metadata_id is supported by the protocol and set, any number of
   bytes >= 0 is allowed, according to the CQLv5 protocol specification
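
The three situations listed above can be modeled as follows. This is a hypothetical Python sketch of the state space, not the C++ `cql_metadata_id_wrapper`; the class and attribute names are invented for illustration. The point it shows is that an empty byte string is a valid "set" value, so "unset" needs its own representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataIdState:
    """Hypothetical model: protocol support and value presence tracked separately."""

    supported: bool
    value: Optional[bytes] = None  # None means "not set"; b"" is a legal set value

    def __post_init__(self):
        if not self.supported and self.value is not None:
            raise ValueError("an unsupported metadata_id cannot carry a value")

    @property
    def is_set(self):
        return self.supported and self.value is not None
```

With this model, `MetadataIdState(False)` covers plain CQLv4, `MetadataIdState(True)` covers a PREPARE request, and `MetadataIdState(True, b"")` is a set, zero-length id.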

Fixes scylladb/scylladb#20860
2025-05-14 09:59:16 +02:00
Andrzej Jackowski
c32aba93b4 cql3: implement metadata::calculate_metadata_id()
CQLv5 introduced metadata_id, which is a checksum computed from column
names and types, to track schema changes in prepared statements. This
commit introduces calculate_metadata_id to compute such id for given
metadata.

Please note that calculate_metadata_id() produces different hashes
than Cassandra's computeResultMetadataId(). We use SHA256 truncated to
128 bits instead of MD5. There are also two smaller technical
differences: calculate_metadata_id() doesn't add unneeded zeros and it
adds a length of a string when an sstring is being fed to the hasher.
The difference is intentional because MD5 has known vulnerabilities,
moreover we don't want to introduce any dependency between our
metadata_id and Cassandra's.
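
A rough Python sketch of the idea, using `hashlib`. The exact bytes fed to the hasher here (a 4-byte big-endian length followed by the UTF-8 string, for each column name and type name) are an assumption made for illustration, so the resulting ids will not match those produced by ScyllaDB or Cassandra.

```python
import hashlib
import struct


def calculate_metadata_id(columns):
    """Return a 16-byte id for a list of (column_name, type_name) pairs.

    SHA-256 is computed over the length-prefixed strings and truncated to 128 bits.
    """
    hasher = hashlib.sha256()
    for name, type_name in columns:
        for text in (name, type_name):
            data = text.encode("utf-8")
            # Feeding the length first keeps ("ab", "c") distinct from ("a", "bc").
            hasher.update(struct.pack(">I", len(data)))
            hasher.update(data)
    return hasher.digest()[:16]  # truncate 256 bits to 128 bits
```

Any change in column count, names, or types yields a different id, which is what lets a driver notice that its cached metadata is stale.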

This change:
 - Add cql_metadata_id_type
 - Implement metadata::calculate_metadata_id()
 - Add boost tests to confirm correctness of the function
2025-05-14 09:33:16 +02:00
Michał Hudobski
8ea862f1e8 test/cqlpy: add custom index tests
Unit tests checking the behavior
of the added support for create
custom index statement
2025-05-14 09:32:01 +02:00
Michał Hudobski
05daa8dded index: support storing metadata for custom indices
Added function returning custom index class name.
Added printing custom index class name when using DESCRIBE.
Changed validation to reflect current support of indices.
2025-05-14 09:32:00 +02:00
Łukasz Paszkowski
0327964d57 compaction: Extend compaction_result to collect more information
The compaction_result struct has been extended with the following
properties:
+ id of the shard the compaction took place on
+ type of the compaction
+ time when the compaction started
+ list of sstable files to be compacted
+ list of sstable files generated by compaction
2025-05-14 08:32:07 +02:00
Łukasz Paszkowski
0490068982 system_keyspace: Upgrade compaction_history table
Currently, the system.compaction_history table misses precious
information like the type of compaction (cleanup, major, resharding,
etc) or the sstable generations involved (in and out) used countless
times to diagnose issues.

Thus, the commit extends the current definition of the table by adding
the following columns:
+ "compaction_type" (text)
+ "started_at" (int)
+ "shard_id" (int)
+ "sstables_in" (list<sstableinfo_type>)
+ "sstables_out" (list<sstableinfo_type>)
+ "total_tombstone_purge_attempt" (long)
+ "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long)
+ "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long)

Furthermore, the commit introduces a new feature flag in order to
prevent nodes from writing data to new columns when a cluster is
not fully upgraded.
2025-05-14 08:32:05 +02:00
Łukasz Paszkowski
28d0c98dab system_keyspace: Create UDT: sstableinfo_type
The new user defined type holds the following information on sstable:
+ generation uuid;
+ origin text;
+ size long;

and will be used by the system.compaction_history table to keep
track of compacted files and the files being the result of this
compaction.
2025-05-14 08:31:40 +02:00
Łukasz Paszkowski
dc6f8881b8 system_keyspace: Extract compaction_history struct
Move the compaction_history_entry struct to a separate file. The intent
of this change is to later re-use it in scylla-nodetool as it currently
defines its own structure that is very similar.
2025-05-14 08:31:40 +02:00
Łukasz Paszkowski
4c93b5292d system_keyspace: Squeeze update_compaction_history parameters
Since the number of statistics inserted into compaction_history
table grows in time, the number of parameters in the method
update_compaction_history grows as well.

So instead, let's re-use the already existing compaction_history_entry
structure to populate data from the compaction_manager to the
system table.
2025-05-14 08:31:40 +02:00
Łukasz Paszkowski
342e9a3f5c compaction/compaction_manager: update_history accepts compaction_result as rvalue
The compaction_result struct holding compaction's results and statistics
is obtained immediately before the update_history is called. Move
it instead of passing a const reference.
2025-05-14 08:31:40 +02:00
Andrzej Jackowski
f8f710c95e test: simplify pytest params in test_long_query_timeout_erm
One of pytest parameters in test_long_query_timeout_erm.py was
a CQL query containing spaces and special chars such as '*', '(', ')',
'{', '}'. After upgrading to Fedora 42, the test started to
fail with the error "test.pylib.rest_client.HTTPError: HTTP error 404"
with uri=`http://...[SELECT * FROM {}-True-False].dev.1`.

To prevent such errors, this commit changes the parameter to
a string without spaces and such special characters.

Fixes: scylladb/scylladb#24124

Closes scylladb/scylladb#24130
2025-05-13 21:44:15 +03:00
Benny Halevy
2ceecc9d2a generic_server: server: do_accepts: prevent gate_closed_exception
do_accepts might be called after `_gate` was closed.
In this case it should just return early rather
than throw gate_closed_exception, similar to how it breaks
from the infinite for loop when the _gate is closed.

With this change, do_accepts (and consequently, _listeners_stopped),
should never fail as it catches and ignores all exceptions
in the loop.

Fixes #23775

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23818
2025-05-13 20:00:04 +03:00
Pavel Emelyanov
c0796244bb nodetool: Add refresh --skip-cleanup option
The option "conflicts" with load-and-stream. Tests and doc included.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-13 19:07:38 +03:00
Pavel Emelyanov
1b1f653699 api: Introduce skip_cleanup query parameter
Just copy the load_and_stream and primary_replica_only logic, this new
option is the same in this sense.

Throw if it's specified with the load_and_stream one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-13 17:06:28 +03:00
Pavel Emelyanov
ed3ce0f6af distributed_loader: Don't create owned ranges if skip-cleanup is true
In order to make reshard compaction task run cleanup, the owner-ranges
pointer is passed to it. If it's nullptr, the cleanup is not performed.
So to do the skip-cleanup, the easiest (but not the most apparent) way
is not to initialize the pointer and keep it nullptr.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-13 16:52:15 +03:00
Pavel Emelyanov
4ab049ac8d code: Push bool skip_cleanup flag around
Just put the boolean into the callstack between API and distributed
loader to reduce the churn in the next patches. No functional changes,
flag is false and unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-13 16:51:21 +03:00
Dawid Mędrek
9ebd6df43a locator/production_snitch_base: Reduce log level when property file incomplete
We're reducing the log level in case the provided property file is incomplete.
The rationale behind this change is related to how CCM interacts with Scylla:

* The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties`
  configuration every 60 seconds.
* When a new node is added to the cluster, CCM recreates the
  `cassandra-rackdc.properties` file for EVERY node.

If those two processes start happening at about the same time, it may lead
to Scylla trying to read a not-completely-recreated file, and an error will
be produced.

Although we would normally fix this issue and try to avoid the race, that
behavior will no longer be relevant as we're making the rack and DC values
immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem
in the older versions of Scylla could bring a more serious regression. Having
that in mind, this commit is a compromise between making CI less flaky and
having minimal impact when backported.

We do the same for when the format of the file is invalid: the rationale
is the same.

We also do that for when there is a double declaration. Although it seems
impossible that this can stem from the same scenario the other two errors
can (since if the format of the file is valid, the error is justified;
if the format is invalid, it should be detected sooner than a doubled
declaration), let's stay consistent with the logging level.

Fixes scylladb/scylladb#20092

Closes scylladb/scylladb#23956
2025-05-13 13:59:39 +03:00
Andrei Chekun
c33c0d62e1 test.py: change pattern for cleaning .log files in testlog directory
Currently, test.py recursively deletes all .log files under the
testlog directory instead of cleaning only the testlog directory itself.
With this change, it will not descend into subdirectories to delete log
files. We still have a method for cleaning the log files in the mode
directories.
The downside of this solution is that we will need to explicitly list all
directories that we want to clean.

Fixes: https://github.com/scylladb/scylladb/issues/24001

Closes scylladb/scylladb#24004
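
The difference between the recursive and the top-level-only cleanup can be sketched as follows. This is a minimal illustration, not the code from test.py: the function name `clean_top_level_logs` and the directory layout in `demo` are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def clean_top_level_logs(log_dir: Path) -> list[str]:
    """Delete *.log files directly inside log_dir; leave subdirectories alone.

    Hypothetical helper, not the actual test.py function.
    """
    removed = []
    # glob("*.log") matches only direct children of log_dir.
    # rglob("*.log") would also match files in nested directories.
    for path in sorted(log_dir.glob("*.log")):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed


def demo() -> tuple[list[str], bool]:
    """Build a throwaway tree and report what was removed and what survived."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "dev").mkdir()
        (root / "test.py.log").write_text("top-level log")
        (root / "dev" / "node.log").write_text("nested log")
        removed = clean_top_level_logs(root)
        nested_survived = (root / "dev" / "node.log").exists()
        return removed, nested_survived
```

With a non-recursive pattern, every directory that should be cleaned has to be passed to the helper explicitly, which is the downside mentioned in the message.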
2025-05-13 13:58:36 +03:00
Anna Stuchlik
eed8373b77 doc: remove the redundant pages
This commit removes two redundant pages and adds the related redirections.

- The Tutorials page is a duplicate and is not maintained anymore.
  Having it in the docs hurts the SEO of the up-to-date Tutorials page.
- The Contributing page is not helpful. Contributions-related information
  should be maintained in the project README file.

Fixes https://github.com/scylladb/scylladb/issues/17279
Fixes https://github.com/scylladb/scylladb/issues/24060

Closes scylladb/scylladb#24090
2025-05-13 13:29:04 +03:00
Andrei Chekun
747f2b1301 docs: add more steps in installation of test.py
Documentation for the --gather-metric parameter was missing. This functionality can
break the regular flow of using test.py because of possible misconfiguration of
the cgroup on the local machine. Added an explanation of how to deal with potential
issues in the metrics-gathering functionality and how to switch it off.

Fixes: https://github.com/scylladb/scylladb/issues/20763

Closes scylladb/scylladb#24095
2025-05-13 13:08:18 +03:00
Ernest Zaslavsky
2d5c0f0cfd encryption_test: Catch exact exception
Apparently `test_kms_network_error` will succeed under any circumstances, since most of our exceptions derive from `std::exception`: whatever happens in the test, and for whatever reason it throws, the test will be marked as passed.

Start catching the exact exception that we expect to be thrown.

Closes scylladb/scylladb#24065
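
The pitfall described above can be sketched in Python. The real test is C++; `NetworkError`, `check_throws`, and the two failing functions below are hypothetical stand-ins used only to show why expecting a base exception type makes the check meaningless.

```python
class NetworkError(Exception):
    """Hypothetical stand-in for the specific error the test expects."""


def check_throws(func, expected: type) -> bool:
    """Return True only if func raises an instance of `expected`."""
    try:
        func()
    except expected:
        return True
    except Exception:
        # Some other failure: the expectation was not met.
        return False
    return False


def network_failure():
    raise NetworkError("connection refused")


def unrelated_bug():
    # Any unrelated failure, for example a mistake in the test setup.
    raise KeyError("missing configuration key")
```

Expecting the base class accepts the unrelated failure as a pass, while expecting the exact class rejects it.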
2025-05-13 12:55:19 +03:00
Ernest Zaslavsky
4a7c847cba database_test: Wait for the index to be created
Just call `wait_until_built` for the index in question

fix: https://github.com/scylladb/scylladb/issues/24059

Closes scylladb/scylladb#24117
2025-05-13 11:40:55 +03:00
Petr Gusev
f245b05022 raft_sys_table_storage: avoid temporary buffer when deserializing log_entry
The get_blob() method linearizes data by copying it into a
single buffer, which can trigger "oversized allocation" warnings.
This commit avoids that extra copy by creating an input stream
directly over the original fragmented managed bytes returned by
untyped_result_set_row::get_view().

Fixes scylladb/scylladb#23903
2025-05-13 10:33:57 +02:00
Petr Gusev
6496ae6573 serializer_impl.hh: add as_input_stream(managed_bytes_view) overload
It's useful to have it here so that people can find it easily.
2025-05-13 10:32:32 +02:00
Wojciech Mitros
bceb64fb5a test_mv_tablets_replace: wait for tablet replicas to balance before working on them
In the test test_tablet_mv_replica_pairing_during_replace we stop 2 out of 4 servers while using RF=2.
Even though in the test we use exactly 4 tablets (1 for each replica of a base table and view), initially,
the tablets may not be split evenly between all nodes. Because of this, even when we chose a server that
hosts the view and a different server that hosts the base table, we sometimes stopped all replicas of the
base or the view table because the node with the base table replica may also be a view replica.

After some time, the tablets should be distributed across all nodes. When that happens, there will be
no common nodes with a base and view replica, so the test scenario will continue as planned.

In this patch, we add this waiting period after creating the base and view, and continue the test only
when all 4 tablets are on distinct nodes.

Fixes https://github.com/scylladb/scylladb/issues/23982
Fixes https://github.com/scylladb/scylladb/issues/23997

Closes scylladb/scylladb#24111
2025-05-12 16:17:48 +02:00
Nadav Har'El
248688473d build: when compiling without -g, don't leave debugging information
If Scylla is compiled without "-g" (this is, for example, the default
in dev build mode), any static library that we link with it and that contains
any debugging information will cause the resulting executable to
incorrectly look (e.g., to file(1) or to gdb) like it has debugging
information.

For more than three years now (see #10863 for historical context),
the wasmtime.a library, which has debugging symbols, has caused this
to happen.

In this patch, if a certain build is compiled WITHOUT "-g", we add the
"--strip-debug" option to the linker to remove the partial debugging
information from the executable. Note that --strip-debug is not added
in build modes which do use "-g", or if the user explicitly asked to
add -g (e.g., "configure.py --cflags=-g").

Before this patch:
$ file build/dev/scylla
build/dev/scylla: ELF 64-bit LSB executable ... , with debug_info, not stripped

After this patch:
$ file build/dev/scylla
build/dev/scylla: ELF 64-bit LSB executable ... , not stripped

Fixes #23832.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23840
2025-05-12 15:42:17 +03:00
Ujjawal Kumar
35cd200789 ent/encryption/kms_host.cc: Change regex pattern to include hyphens in AWS profile names.
Fixes #22430

Closes scylladb/scylladb#23805
2025-05-12 15:41:00 +03:00
Botond Dénes
746382257c Merge 'compress: fix an internal error when a specific debug log is enabled' from Michał Chojnowski
compress: fix an internal error when a specific debug log is enabled
While iterating over the recent 69684e16d8 series,
I shot myself in the foot by defining `algorithm_to_name(algorithm::none)`
to be an internal error, and later calling that anyway in a debug log.

(Tests didn't catch it because there's no test which simultaneously
enables the debug log and configures some table to have no compression).

This proves that `algorithm_to_name` is too much of a footgun.
Fix it so that calling `algorithm_to_name(algorithm::none)` is legal.
In hindsight, I should have done that immediately.

Fixes #23624

Fix for recently-added code, no backporting needed.

Closes scylladb/scylladb#23625

* github.com:scylladb/scylladb:
  test_sstable_compression_dictionaries: reproduce an internal error in debug logging
  compress: fix an internal error when a specific debug log is enabled
2025-05-12 15:40:12 +03:00
Calle Wilund
b28413890b encryption_at_rest_test: Add test cases for bad KMIP config on reboot
Refs scylladb/scylla-enterprise#5321

Adds two small test cases, for slight variations on KMIP host config
being missing when rebooting a node, and table/sstable resolution
failing due to this.
Mainly to verify that we fail as expected, without crashing.

Closes scylladb/scylladb#23544
2025-05-12 15:39:05 +03:00
Nadav Har'El
7c24e09b0d test/alternator: add some Alternator-over-HTTPS tests
This patch adds a few tests for Alternator over HTTPS (encrypted HTTP,
a.k.a. TLS or SSL). The tests are skipped unless run with "--https", so
they will not be run in CI. Nevertheless, they are useful to improve
our understanding of how DynamoDB works over HTTPS and can be a basis
for adding more tests for HTTPS support. The included tests pass on both
Alternator and AWS DynamoDB.

One test checks that both TLS 1.2 and TLS 1.3 are properly supported,
and if chosen by the client, are actually honored. The same test also
checks that TLS 1.1 is not supported, and results in a proper error
if attempted. Both AWS DynamoDB and Alternator support the same protocols.

Another test verifies that HTTP (unencrypted) requests cannot be sent
over an HTTPS port. This is important for security - an installation
that chooses to allow only HTTPS wants users to only use encrypted
connections, and would not want users to continue sending unencrypted
requests to the HTTPS port.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23493
2025-05-12 15:38:33 +03:00
Kefu Chai
8320d703cd scripts/open-coredump.sh: Add substitute-path hint in prompt message
Add a substitute-path rule hint in the greeting message displayed before
launching dbuild. This helps developers debug coredumps by correctly mapping
source files.

Background:
- Scylla's Jenkins builds typically occur in /jenkins/workspace/scylla-${branch}/next
- When debugging locally, source paths need remapping to match the build environment
- The substitute-path rule allows GDB to locate source files correctly

This change improves developer experience by providing the appropriate path
substitution command directly in the prompt.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23038
2025-05-12 15:37:59 +03:00
Kefu Chai
46f7ff6cfc docs: nodetool: reference "nodetool task" page
* Rewrite the documentation for the "nodetool restore" command.
* Clarify the relationship between the `--nowait` flag and asynchronous operation.
* Reference the "nodetool task" page for managing background tasks.

Fixes scylladb#21888

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22023
2025-05-12 15:37:22 +03:00
Botond Dénes
dff7e2fc2f Merge 'gossiper: failure_detector_loop_for_node: abort send_gossip_echo using abort_source' from Benny Halevy
Currently send_gossip_echo has a 22-second timeout
during which _abort_source is ignored.

Use a function-local abort_source to abort
send_gossip_echo either on timeout or if
_abort_source requested abort, and co_return in
the latter case.

Closes scylladb/scylladb#12296

* github.com:scylladb/scylladb:
  gossiper: make send_gossip_echo cancellable
  gossiper: add send_echo helper
  idl, message: make with_timeout and cancellable verb attributes composable
  gossiper: failure_detector_loop_for_node: ignore abort_requested_exception
  gossiper: failure_detector_loop_for_node: check if abort_requested in loop condition
2025-05-12 15:35:30 +03:00
Pavel Emelyanov
5bd3df507e sstables: Lazily access statistics for trace-level logging
There's a trace-level message in the sstable::get_gc_before_for_fully_expire()
method, and one of its arguments looks up a value in the sstable
statistics. Finding the value is not cheap (it makes a lookup in
std::unordered_map), and for mostly-off trace messages it is just a waste of
cycles.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23910
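
The idea of deferring an expensive lookup until the message is actually emitted can be sketched in Python. This is an analogy, not the C++ change itself: `LazyValue`, `ListHandler`, and `run` are hypothetical names, and DEBUG stands in for the trace level, which Python's logging module does not have.

```python
import logging


class LazyValue:
    """Wrap a callable; evaluate it only when the value is formatted."""

    def __init__(self, compute):
        self._compute = compute

    def __str__(self):
        return str(self._compute())


class ListHandler(logging.Handler):
    """Collect formatted messages so the effect can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() applies the arguments, which triggers LazyValue.__str__.
        self.messages.append(record.getMessage())


def run(level: int) -> tuple[int, list[str]]:
    """Log one debug message at the given logger level; count the lookups."""
    lookups = []

    def expensive_lookup():
        lookups.append(1)  # record that the lookup really happened
        return 42

    logger = logging.getLogger(f"lazy_demo_{level}")
    logger.setLevel(level)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    logger.debug("gc_before=%s", LazyValue(expensive_lookup))
    logger.removeHandler(handler)
    return len(lookups), handler.messages
```

When the logger level filters the message out, the lookup is never performed; when the level lets it through, the lookup runs exactly once at formatting time.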
2025-05-12 11:22:31 +03:00
Patryk Jędrzejczak
4d0538eecb Merge 'test/cluster: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek
In this PR, we're adjusting most of the cluster tests so that they pass
with the `rf_rack_valid_keyspaces` configuration option enabled. In most
cases, the changes are straightforward and require little to no additional
insight into what the tests are doing or verifying. In some, however, doing
that does require a deeper understanding of the tests we're modifying.
The justification for those changes and their correctness is included in
the commit messages corresponding to them.

Note that this PR does not cover all of the cluster tests. There are a few
remaining ones, but they require a bit more effort, so we delegate that
work to a separate PR.

I tested all of the modified tests locally with `rf_rack_valid_keyspaces`
set to true, and they all passed.

Fixes scylladb/scylladb#23959

Backport: we want to backport these changes to 2025.1 since that's the version where we introduced RF-rack-valid keyspaces. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future, and we'll want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint.

Closes scylladb/scylladb#23661

* https://github.com/scylladb/scylladb:
  test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite
  test/cluster: Disable rf_rack_valid_keyspaces in problematic tests
  test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
  test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
  test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
  test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
  test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
  test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
  test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
  test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity
  test/cluster/test_multidc.py: Adjust to RF-rack-validity
  test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity
  test/cluster: Adjust simple tests to RF-rack-validity
2025-05-12 09:41:07 +02:00
Aleksandra Martyniuk
2dcea5a27d streaming: use host_id in file streaming
Use host ids instead of ips in file-streaming.

Fixes: #22421.

Closes scylladb/scylladb#24055
2025-05-12 09:36:48 +03:00
Łukasz Paszkowski
113647550f tools/scylla-nodetool: fix crash when rows_merged cells contain null
Any empty object of the json::json_list type has its internal
_set variable assigned to false which results in such objects
being skipped by the json::json_builder.

Hence, the json returned by the api GET /compaction_manager/compaction_history
does not contain the field `rows_merged` if a cell in the
system.compaction_history table is null or an empty list.

In such cases, executing the command `nodetool compactionhistory`
will result in a crash with the following error message:
`error running operation: rjson::error (JSON assert failed on condition 'false'`

The patch fixes it by checking if the json object contains the
`rows_merged` element before processing. If the element does
not exist, the nodetool will now produce an empty list.

Fixes https://github.com/scylladb/scylladb/issues/23540

Closes scylladb/scylladb#23514
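
The defensive check described above can be sketched in Python. The real fix is in the C++ nodetool; `rows_merged_of` and the sample entries are hypothetical, and the shape of the list elements is deliberately left opaque.

```python
import json


def rows_merged_of(entry: dict) -> list:
    """Return the `rows_merged` list, or an empty list if the field is absent."""
    # The API omits the field when the cell is null or an empty list,
    # so indexing entry["rows_merged"] directly would fail for such rows.
    return entry.get("rows_merged", [])


# Hypothetical response: the second entry has no `rows_merged` field at all.
SAMPLE_HISTORY = json.loads(
    '[{"id": "a", "rows_merged": [1, 2]}, {"id": "b"}]'
)


def summarize(history: list[dict]) -> dict:
    """Map each entry id to its (possibly empty) rows_merged list."""
    return {entry["id"]: rows_merged_of(entry) for entry in history}
```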
2025-05-12 09:00:48 +03:00
Avi Kivity
5e764d1de2 Merge 'Drop v2 and flat from reader and related names' from Botond Dénes
Following a number of similar code cleanup PRs, this one aims to be the last one, definitely dropping flat from all reader and related names.
Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant).
The changes in this PR are entirely mechanical, mostly just search-and-replace.

Code cleanup, no backport required.

Closes scylladb/scylladb#24087

* github.com:scylladb/scylladb:
  test/boost/mutation_reader_another_test: drop v2 from reader and related names
  test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/
  test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/
  test/boost/mutation_test: s/consumer_v2/consumer/
  test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/
  readers/mutation_readers: s/generating_reader_v2/generating_reader/
  readers/mutation_readers: s/delegating_reader_v2/delegating_reader/
  readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/
  readers/mutation_source: s/make_reader_v2/make_mutation_reader/
  readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/
  readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/
  mutation/mutation_compactor: drop v2 from compactor and related names
  replica/table: s/make_reader_v2/make_mutation_reader/
  mutation_writer: s/bucket_writer_v2/bucket_writer/
  readers/queue: drop v2 from reader and related names
  readers/multishard: drop v2 from reader and related names
  readers/evictable: drop v2 from reader and related names
  readers/multi_range: remove flat from name
2025-05-11 22:22:35 +03:00
Botond Dénes
3ba5dd79e6 tools/scylla-nodetool: document exit codes in --help
Closes scylladb/scylladb#24054
2025-05-11 22:18:29 +03:00
Dawid Mędrek
ee96f8dcfc test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite
Almost all of the tests have been adjusted to be able to be run with
the `rf_rack_valid_keyspaces` configuration option enabled, while
the rest, a minority, create nodes with it disabled. Thanks to that,
we can enable it by default, so let's do that.
2025-05-10 16:30:51 +02:00
Dawid Mędrek
c4b32c38a3 test/cluster: Disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.
2025-05-10 16:30:49 +02:00
Dawid Mędrek
c8c28dae92 test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.
2025-05-10 16:30:46 +02:00
Dawid Mędrek
04567c28a3 test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.
2025-05-10 16:30:43 +02:00
Dawid Mędrek
d3c0cd6d9d test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
The change boils down to matching the number of created racks to the number
of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`.
This way, we ensure that the created keyspace will be RF-rack-valid and so
we can run the test file even with the `rf_rack_valid_keyspaces` configuration
option enabled.

The change has no impact on the tests that use the function; the distribution
of nodes across racks does not affect how repair is performed or what the
tests do and verify. Because of that, the change is correct.
2025-05-10 16:30:40 +02:00
Dawid Mędrek
5d1bb8ebc5 test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
We assign the newly created nodes to multiple racks. If RF <= 3,
we create as many racks as the provided RF. We disallow the case
of RF > 3 to avoid trying to create an RF-rack-invalid keyspace;
note that no existing test calls `create_table_insert_data_for_repair`
providing a higher RF. The rationale for doing this is we want to ensure
that the tests calling the function can be run with the
`rf_rack_valid_keyspaces` configuration option enabled.
2025-05-10 16:30:37 +02:00
Dawid Mędrek
92f7d5bf10 test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
We assign the nodes to the same DC, but multiple racks to ensure that
the created keyspace is RF-rack-valid and we can run the test with
the `rf_rack_valid_keyspaces` configuration option enabled. The changes
do not affect what the test does and verifies.
2025-05-10 16:30:34 +02:00
Dawid Mędrek
4c46551c6b test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
We simply assign the nodes used in the test to separate racks to
ensure that the created keyspace is RF-rack-valid to be able
to run the test with the `rf_rack_valid_keyspaces` configuration
option set to true. The change does not affect what the test
does and verifies -- it only depends on the type of nodes,
whether they are normal token owners or not -- and so the changes
are correct in that sense.
2025-05-10 16:30:31 +02:00
Dawid Mędrek
2882b7e48a test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
We parameterize the test so it's run with and without enforced
RF-rack-valid keyspaces. In the test itself, we introduce a branch
to make sure that we won't run into a situation where we're
attempting to create an RF-rack-invalid keyspace.

Since the `rf_rack_valid_keyspaces` option is not commonly used yet
and because its semantics will most likely change in the future, we
decide to parameterize the test rather than try to get rid of some
of the test cases that are problematic with the option enabled.
2025-05-10 16:30:29 +02:00
Dawid Mędrek
73b22d4f6b test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity
We simply assign DC/rack properties to every node used in the test.
We put all of them in the same DC to make sure that the cluster behaves
as closely to how it would before these changes. However, we distribute
them over multiple racks to ensure that the keyspace used in the test
is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces`
configuration option set to true. The distribution of nodes between racks
has no effect on what the test does and verifies, so the changes are
correct in that sense.
2025-05-10 16:30:26 +02:00
Dawid Mędrek
5b83304b38 test/cluster/test_multidc.py: Adjust to RF-rack-validity
Instead of putting all of the nodes in a DC in the same rack
in `test_putget_2dc_with_rf`, we assign them to different racks.
The distribution of nodes in racks is orthogonal to what the test
is doing and verifying, so the change is correct in that sense.
At the same time, it ensures that the test never violates the
invariant of RF-rack-valid keyspaces, so we can also run it
with `rf_rack_valid_keyspaces` set to true.
2025-05-10 16:30:23 +02:00
Dawid Mędrek
9281bff0e3 test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity
We modify the parameters of `test_restore_with_streaming_scopes`
so that it now represents a pair of values: topology layout and
the value `rf_rack_valid_keyspaces` should be set to.

Two of the already existing parameters violate RF-rack-validity
and so the test would fail when run with `rf_rack_valid_keyspaces: true`.
However, since the option isn't commonly used yet and since the
semantics of RF-rack-valid keyspaces will most likely change in
the future, let's keep those cases and just run them with the
option disabled. This way, we still test everything we can
without running into undesired failures that don't indicate anything.
2025-05-10 16:30:20 +02:00
Dawid Mędrek
dbb8835fdf test/cluster: Adjust simple tests to RF-rack-validity
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:

* Using `pytest.mark.prepare_3_racks_cluster` instead of
  `pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
  `ManagerClient::servers_add()`.

In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.
2025-05-10 16:30:18 +02:00
Botond Dénes
911aa64043 test/boost/mutation_reader_another_test: drop v2 from reader and related names
For the test case
test_mutation_reader_from_mutations_as_mutation_source, the v1/v2
distinction was hiding two identical test cases. One was removed.
2025-05-09 07:53:30 -04:00
Botond Dénes
466a8a2b64 test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/ 2025-05-09 07:53:30 -04:00
Botond Dénes
30625a6ef7 test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/ 2025-05-09 07:53:30 -04:00
Botond Dénes
1169ac6ac8 test/boost/mutation_test: s/consumer_v2/consumer/ 2025-05-09 07:53:30 -04:00
Botond Dénes
17b667b116 test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ 2025-05-09 07:53:30 -04:00
Botond Dénes
5dd546ea2b readers/mutation_readers: s/generating_reader_v2/generating_reader/ 2025-05-09 07:53:30 -04:00
Botond Dénes
75fddbc078 readers/mutation_readers: s/delegating_reader_v2/delegating_reader/ 2025-05-09 07:53:30 -04:00
Botond Dénes
2fc3e52b2b readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/ 2025-05-09 07:53:29 -04:00
Botond Dénes
674d41e3e6 readers/mutation_source: s/make_reader_v2/make_mutation_reader/ 2025-05-09 07:53:29 -04:00
Botond Dénes
327867aa8a readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/ 2025-05-09 07:53:29 -04:00
Botond Dénes
efc48caea5 readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ 2025-05-09 07:53:29 -04:00
Botond Dénes
7af0690762 mutation/mutation_compactor: drop v2 from compactor and related names 2025-05-09 07:53:29 -04:00
Botond Dénes
b5170e27d0 replica/table: s/make_reader_v2/make_mutation_reader/ 2025-05-09 07:53:29 -04:00
Botond Dénes
cc95dc8756 mutation_writer: s/bucket_writer_v2/bucket_writer/ 2025-05-09 07:53:29 -04:00
Botond Dénes
3d2651e07c readers/queue: drop v2 from reader and related names 2025-05-09 07:53:29 -04:00
Botond Dénes
ca7f557e86 readers/multishard: drop v2 from reader and related names 2025-05-09 07:53:29 -04:00
Botond Dénes
4d92bc8b2f readers/evictable: drop v2 from reader and related names 2025-05-09 07:53:28 -04:00
Botond Dénes
7ba3c3fec3 readers/multi_range: remove flat from name 2025-05-09 07:53:25 -04:00
Avi Kivity
092a88c9b9 dist: drop the scylla-env package
scylla-env was used to glue together support for older
distributions. It hasn't been used for many years. Remove
it.

Closes scylladb/scylladb#23985
2025-05-09 14:10:00 +03:00
Raphael S. Carvalho
28056344ba replica: Fix take_storage_snapshot() running concurrently to merge completion
Some background:
When merge happens, a background fiber wakes up to merge compaction
groups of sibling tablets into main one. It cannot happen when
rebuilding the storage group list, since token metadata update is
not preemptable. So a storage group, post merge, has the main
compaction group and two other groups to be merged into the main.
When the merge happens, those two groups are empty and will be
freed.

Consider this scenario:
1) merge happens, from 2 to 1 tablet
2) produces a single storage group, containing main and two
other compaction groups to be merged into main.
3) take_storage_snapshot(), triggered by migration post merge,
gets a list of pointers to all compaction groups.
4) t__s__s() iterates first on main group, yields.
5) background fiber wakes up, moves the data into main
and frees the two groups
6) t__s__s() advances to other groups that are now freed,
since step 5.
7) segmentation fault

In addition to memory corruption, there's also a potential for
data to escape the iteration in take_storage_snapshot(), since
data can be moved across compaction groups in background, all
belonging to the same storage group. That could result in
data loss.

Readers should all operate on storage group level since it can
provide a view on all the data owned by a tablet replica.
The movement of sstable from group A to B is atomic, but
iteration first on A, then later on B, might miss data that
was moved from B to A, before the iteration reached B.
By switching to storage group in the interface that retrieves
groups by token range, we guarantee that all data of a given
replica can be found regardless of which compaction group they
sit on.

Fixes #23162.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24058
2025-05-09 14:07:06 +03:00
Gleb Natapov
c6e1758457 topology coordinator: make decommissioning node non voter before completing the operation
A decommissioned node is removed from a raft config after operation is
marked as completed. This is required since otherwise the decommissioned
node will not see that decommission has completed (the status is
propagated through raft). But right after the decommission is marked as
completed a decommissioned node may terminate, so in case of a two node
cluster, the configuration change that removes it from the raft will fail,
because there will be no quorum.

The solution is to mark the decommissioning node as non voter before
reporting the operation as completed.

Fixes: #24026

Backport to 2025.2 because it fixes a potential hang. Don't backport to
branches older than 2025.2 because they don't have
8b186ab0ff, which caused this issue.

Closes scylladb/scylladb#24027
2025-05-09 12:43:31 +02:00
Tomasz Grabiec
be2c3ad6fd Merge 'logalloc_test: don't test performance in test background_reclaim' from Michał Chojnowski
The test is failing in CI sometimes due to performance reasons.

There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
   doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
   reclaim, the `background_reclaim` scheduling group isn't actually
   guaranteed to get any CPU, regardless of shares. If the process is
   switched out inside the `background_reclaim` group, it might
   accumulate so much vruntime that it won't get any more CPU again
   for a long time.

We have seen both.

This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.

So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test). After the patch,
we only check that the background reclaim happens *eventually*.

Fixes https://github.com/scylladb/scylladb/issues/15677

Backporting this is optional. The test is flaky even in stable branches, but the failure is rare.

Closes scylladb/scylladb#24030

* github.com:scylladb/scylladb:
  logalloc_test: don't test performance in test `background_reclaim`
  logalloc: make background_reclaimer::free_memory_threshold publicly visible
2025-05-09 11:35:02 +02:00
Patryk Jędrzejczak
be4532bcec Merge 'Correctly skip updating node's own ip address due to outdated gossiper data ' from Gleb Natapov
Use host id to check if the update is for the node itself. Using IP is unreliable since if a node is restarted with a different IP, a gossiper message with the previous IP can be misinterpreted as belonging to a different node.

Fixes: #22777

Backport to 2025.1 since this fixes a crash. Older versions do not have the code.

Closes scylladb/scylladb#24000

* https://github.com/scylladb/scylladb:
  test: add reproducer for #22777
  storage_service: Do not remove gossiper entry on address change
  storage_service: use id to check for local node
2025-05-09 11:28:21 +02:00
Andrzej Jackowski
f53d733e89 docs: lwt: add two missing spaces
Due to lack of spaces, two example queries were not displayed in the
rendered version of the document.

As a result, the `SELECT * FROM movies.nowshowing;` query in step 6
returned 6 rows instead of the expected 8 rows.
2025-05-09 08:42:15 +02:00
Piotr Smaron
f740f9f0e1 cql: fix CREATE tablets KS warning msg
Materialized Views and Secondary Indexes are two more features that
keyspaces with tablets do not support, but they were not listed in the
warning message returned to the user on a CREATE KEYSPACE statement. This
commit adds the 2 missing features.

Fixes: #24006

Closes scylladb/scylladb#23902
2025-05-08 17:18:43 +02:00
Tomasz Grabiec
fadfbe8459 Merge 'transport: storage_proxy: release ERM when waiting for query timeout' from Andrzej Jackowski
Before this change, if a read executor had just enough targets to
achieve query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in the meantime. Such behavior doesn't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.

This change implements a mechanism to throw a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by CQL server which conducts the waiting, after
ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as mapreduce
service) should handle the exception just as a regular read_failure.
However, when CQL server catches the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).

This change:
- Rewrite cql_server::connection::process_request_one to use
  seastar::futurize_invoke and try_catch<> instead of utils::result_try
- Add new read_failure_exception_with_timeout and throw it in storage_proxy
- Add sleep in CQL server when the new exception is caught
- Catch local exceptions in Mapreduce Service and convert them
   to std::runtime_error.
- Add get_cql_exclusive to manager_client.py
- Add test_long_query_timeout_erm

No backport needed - minor issue fix.

Closes scylladb/scylladb#23156

* github.com:scylladb/scylladb:
  test: add test_long_query_timeout_erm
  test: add get_cql_exclusive to manager_client.py
  mapreduce: catch local read_failure_exception_with_timeout
  transport: storage_proxy: release ERM when waiting for query timeout
  transport: remove redundant references in process_request_one
  transport: fix the indentation in process_request_one
  transport: add futures in CQL server exception handling
2025-05-08 12:45:49 +02:00
Avi Kivity
2d2a2ef277 tools: toolchain: dbuild: support nested containers
Pass through the local containers directory (it cannot
be bind-mounted to /var/lib/containers since podman checks
the path hasn't changed) with overrides to the paths. This
allows containers to be created inside the dbuild container,
so we can enlist pre-packaged software (such as opensearch)
in test.py. If the container images are already downloaded
in the host, they won't be downloaded again.

It turns out that the container ecosystem doesn't support
nested network namespaces well, so we configure the outer
container to use host networking for the inner containers.
It's useful anyway.

The frozen toolchain now installs podman and buildah so
there's something to actually drive those nested containers.
We disable weak dnf dependencies to avoid installing qemu.

The frozen toolchain is regenerated with optimized clang from

  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Closes scylladb/scylladb#24020
2025-05-08 13:00:16 +03:00
Botond Dénes
4a802baccb Merge 'compress: make sstable compression dictionaries NUMA-aware ' from Michał Chojnowski
compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would put unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.

Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.

There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.

Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.

There is another issue with putting all dicts on shard 0: it eats
an asymmetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).

It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.

If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.

So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
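
The arithmetic behind the 1% / 100% remark can be sketched as follows. This is
only an illustration under simplifying assumptions that are not in the patch:
a single NUMA node, equal memory per shard, and the hypothetical helper
`dict_memory_fraction`.

```python
def dict_memory_fraction(total_dict_bytes: int,
                         shard_memory_bytes: int,
                         shard_count: int) -> float:
    """Fraction of one shard's memory taken by dictionaries when their
    ownership is spread evenly over `shard_count` shards (assumption)."""
    per_shard_dict_bytes = total_dict_bytes / shard_count
    return per_shard_dict_bytes / shard_memory_bytes


GiB = 1024 ** 3
shard_memory = 1 * GiB  # assumed: every shard has the same amount of memory
dicts = 1 * GiB         # assumed: total size of all dictionaries in the cluster

# The same set of dictionaries on a 100-shard node and on a 1-shard node.
big_node = dict_memory_fraction(dicts, shard_memory, shard_count=100)
small_node = dict_memory_fraction(dicts, shard_memory, shard_count=1)

print(f"100-shard node: {big_node:.0%} of each shard")
print(f"1-shard node:   {small_node:.0%} of the shard")
```

The per-shard share grows as the shard count shrinks, which is why a limit that
depends on the number of shards is floated above.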

New functionality, added to a feature which isn't in any stable branch yet. No backporting.

Closes scylladb/scylladb#23590

* github.com:scylladb/scylladb:
  test: add test/boost/sstable_compressor_factory_test
  compress: add some test-only APIs
  compress: rename sstable_compressor_factory_impl to dictionary_holder
  compress: fix indentation
  compress: remove sstable_compressor_factory_impl::_owner_shard
  compress: distribute compression dictionaries over shards
  test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
  test: remove sstables::test_env::do_with()
2025-05-08 09:52:46 +03:00
Botond Dénes
e5d944f986 Merge 'replica: Fix use-after-free with concurrent schema change and sstable set update' from Raphael Raph Carvalho
When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update.

Example:

1) A: sstable set is being updated on compaction completion
2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A.
3) when A resumes, system will likely crash since the set is freed already.

ASAN screams about it:
SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ...

Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set.

Fixes #22040.

Closes scylladb/scylladb#23680

* github.com:scylladb/scylladb:
  replica: Fix use-after-free with concurrent schema change and sstable set update
  sstables: Implement sstable_set_impl::all_sstable_runs()
2025-05-08 06:56:16 +03:00
Petr Gusev
e6c3f954f6 main: check if current process group controls stdin tty
test.py doesn't override stdin when starting Scylla, so when
tests are run from a terminal, isatty() returns true and
parsed command line output is not printed, which is inconvenient.

In this commit we add a check if the current process group
controls the stdin terminal. This serves two purposes:
* improves the "interactive mode" check from #scylladb/scylladb#18309,
as only the controlling process group can interact with the terminal.
* solves the test.py problem above, because test.py runs scylla in a new
session/process group (it calls setsid after fork), and is now
correctly not considered interactive.
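
The check described above can be illustrated with the POSIX calls it relies on.
This is a Python sketch, not the code from the patch (which lives in Scylla's
C++ `main`); the function name `stdin_is_interactive` is made up for the
example.

```python
import os


def stdin_is_interactive(fd: int = 0) -> bool:
    """Return True only if `fd` is a terminal AND our process group is the
    terminal's foreground (controlling) process group."""
    if not os.isatty(fd):
        return False
    try:
        # tcgetpgrp() returns the foreground process group of the terminal;
        # a process started in a new session (setsid) will not match it.
        return os.tcgetpgrp(fd) == os.getpgrp()
    except OSError:
        # E.g. ENOTTY: the terminal is not our controlling terminal.
        return False


if __name__ == "__main__":
    print("interactive" if stdin_is_interactive() else "not interactive")
```

A process that called `setsid()` after `fork()` still sees `isatty()` return
true for an inherited terminal, but fails the process-group comparison.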

Closes scylladb/scylladb#24047
2025-05-08 06:52:48 +03:00
Michał Chojnowski
746ec1d4e4 test/boost/mvcc_test: fix an overly-strong assertion in test_snapshot_cursor_is_consistent_with_merging
The test checks that merging the partition versions on-the-fly using the
cursor gives the same results as merging them destructively with apply_monotonically.

In particular, it tests that the continuity of both results is equal.
However, there's a subtlety which makes this not true.
The cursor puts empty dummy rows (i.e. dummies shadowed by the partition
tombstone) in the output.
But the destructive merge is allowed (as an exception to the general
rule, for optimization reasons), to remove those dummies and thus reduce
the continuity.

So after this patch we instead check that the output of the cursor
has continuity equal to the merged continuities of versions.
(Rather than to the continuity of merged versions, which can be
smaller as described above).

Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did
the same in a different test.
Fixes https://github.com/scylladb/scylladb/issues/13642

Closes scylladb/scylladb#24044
2025-05-08 00:41:01 +02:00
Pavel Emelyanov
0a9675de01 sstable: Use fmt::to_string(sstable::filename()) to get component file path
The stream sink abort() method wants to remove component file by its
path. For that the path is calculated from storage prefix and component
basename, but there's a filename() method for it already.

SStable filenames shouldn't be considered as on-disk paths (see #23194),
but places that want it should be explicit and format the filename to
string by hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24039
2025-05-07 22:25:58 +03:00
Pavel Emelyanov
36baeaeb57 sstable: Move update_info_for_opened_data() method to private: block
The method is internally called by sstable itself to refresh its state
after opening or assigning (from foreign info) data and index files.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24041
2025-05-07 20:58:34 +03:00
Pavel Emelyanov
c2ecc45db8 sstable: Remove validate argument from sstable::load_metadata()
There are only two callers of the method and the one that wants
validation (the sstable::load()) can do it on its own. This makes the
other caller (schema loader) simpler and shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24038
2025-05-07 20:57:37 +03:00
Michał Chojnowski
f075674ebe test: add test/boost/sstable_compressor_factory_test
Add a basic test for NUMA awareness of `default_sstable_compressor_factory`.
2025-05-07 14:43:20 +02:00
Michał Chojnowski
518f04f1c4 compress: add some test-only APIs
Will be needed by the test added in the next patch.
2025-05-07 14:43:20 +02:00
Michał Chojnowski
66a454f61d compress: rename sstable_compressor_factory_impl to dictionary_holder
Since sstable_compressor_factory_impl no longer
implements sstable_compressor_factory, the name can be
misleading. Rename it to something closer to its new role.
2025-05-07 14:43:20 +02:00
Michał Chojnowski
e952992560 compress: fix indentation
Purely cosmetic.
2025-05-07 14:43:20 +02:00
Michał Chojnowski
6b831aaf1b compress: remove sstable_compressor_factory_impl::_owner_shard
Before the series, sstable_compressor_factory_impl was directly
accessed by multiple shards. Now, it's a part of a `sharded`
data structure and is never directly accessed from other shards,
so there's no need to check for that. Remove the leftover logic.
2025-05-07 14:43:20 +02:00
Michał Chojnowski
1bcf77951c compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would put unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.

Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.

There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.

Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.

There is another issue with putting all dicts on shard 0: it eats
an asymmetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).
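
A minimal sketch of hash-based owner selection, for illustration only. The
hash function (SHA-256), the helper name `dict_owner_shard` and the shard
lists are assumptions; the patch may pick the owner differently.

```python
import hashlib


def dict_owner_shard(dict_contents: bytes, numa_group_shards: list[int]) -> int:
    """Pick the shard (within one NUMA group) that owns a dictionary.

    The owner depends only on the dictionary contents, so every shard in
    the group computes the same answer without coordination.
    """
    digest = hashlib.sha256(dict_contents).digest()
    index = int.from_bytes(digest[:8], "little") % len(numa_group_shards)
    return numa_group_shards[index]


# Hypothetical layout: shards 0-3 on NUMA node 0, shards 4-7 on NUMA node 1.
numa_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
blob = b"example dictionary contents"
owners = [dict_owner_shard(blob, group) for group in numa_groups]
print(owners)  # one owner per NUMA node, i.e. one copy per node
```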

It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.

If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.

So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
2025-05-07 14:43:18 +02:00
Michał Chojnowski
8649adafa8 test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
In next patches, make_sstable_compressor_factory() will have to
disappear.
In preparation for that, we switch to a seastar::thread-dependent
replacement.
2025-05-07 14:43:04 +02:00
Aleksandra Martyniuk
2549f5e16b test_tablet_repair_hosts_filter: change injected error
test_tablet_repair_hosts_filter checks whether the host filter
specified for tablet repair is correctly persisted. To check this,
we need to ensure that the repair is still ongoing and its data
is kept. The test achieves that by failing the repair on replica
side - as the failed repair is going to be retried.

However, if the filter does not contain any host (included_host_count = 0),
the repair is started on no replica, so the request succeeds
and its data is deleted. The test fails if it checks the filter
after repair request data is removed.

Fail repair on topology coordinator side, so the request is ongoing
regardless of the specified hosts.

Fixes: #23986.

Closes scylladb/scylladb#24003
2025-05-07 15:30:05 +03:00
Michał Chojnowski
0e4d0ded8d test: remove sstables::test_env::do_with()
`sstable_manager` depends on `sstable_compressor_factory&`.
Currently, `test_env` obtains an implementation of this
interface with the synchronous `make_sstable_compressor_factory()`.

But after this patch, the only implementation of that interface
`sstable_compressor_factory&` will use `sharded<...>`,
so its construction will become asynchronous,
and the synchronous `make_sstable_compressor_factory()` must disappear.

There are several possible ways to deal with this, but I think the
easiest one is to write an asynchronous replacement for
`make_sstable_compressor_factory()`
that will keep the same signature but will be only usable
in a `seastar::thread`.

All other uses of `make_sstable_compressor_factory()` outside of
`test_env::do_with()` already are in seastar threads,
so if we just get rid of `test_env::do_with()`, then we will
be able to use that thread-dependent replacement. This is the
purpose of this commit.

We shouldn't be losing much.
2025-05-07 13:19:21 +02:00
Nadav Har'El
7ccf77b84f test/alternator: another test for UpdateExpression's SET
I found on StackOverflow an interesting discussion about the fact that
DynamoDB's UpdateExpression documentation "recommends" to use SET
instead of ADD, and the rather convoluted expression that is actually
needed to emulate ADD using SET:
```
SET #count = if_not_exists(#count, :zero) + :one
```

https://stackoverflow.com/questions/14077414/dynamodb-increment-a-key-value

Although we do have separate tests for the different pieces of that
idiom - a SET with missing attribute or item, the if_not_exists()
function, etc. - I thought it would be nice to have a dedicated test
that verifies that this idiom actually works, and moreover that the more
naive "SET #count = #count + :one" does NOT work if the item or the
attribute are missing.
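
For reference, this is roughly how the idiom looks from a client. It is a
sketch, not the test's code: `table` is assumed to be a boto3 DynamoDB `Table`
resource (or anything with a compatible `update_item`), and the key and
attribute names are placeholders.

```python
def increment_counter(table, key: dict, attr: str = "count") -> dict:
    """Emulate `ADD attr 1` using SET, as discussed above.

    `#count` is an expression attribute name (placeholder) for `attr`;
    if_not_exists() supplies :zero when the item or attribute is missing.
    """
    return table.update_item(
        Key=key,
        UpdateExpression="SET #count = if_not_exists(#count, :zero) + :one",
        ExpressionAttributeNames={"#count": attr},
        ExpressionAttributeValues={":zero": 0, ":one": 1},
        ReturnValues="UPDATED_NEW",
    )
```

Called as `increment_counter(table, {"p": "some-key"})`, with `p` standing in
for whatever the table's key attribute is.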

Unsurprisingly, the new test passes on both Alternator and DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23963
2025-05-07 13:57:50 +03:00
Nadav Har'El
b4a9fe9928 test/alternator: another test for expression with a lot of ORs
We already have a test, test_limits.py::test_deeply_nested_expression_2,
which checks that in the long condition expression

        a<b or (a<b or (a<b or (a<b or (....))))

with more than MAX_DEPTH (=400) repeats is rejected by Alternator,
as part of commit 04e5082d52 which
restricted the depth of the recursive parser to prevent crashing Scylla.

However, I got curious what will happen without the parentheses:

        a<b or a<b or a<b or a<b or ...

It turns out that our parser actually parses this syntax without
recursion - it's just a loop (a "*" in the Antlr alternator/expressions.g
allows reading more and more ORs in a loop). So Alternator doesn't limit
the length of this expression more than the length limit of 4096 bytes
which we also have. We can fit 584 repeats in the above expression in
4096 bytes, and it will not be rejected even though 584 > 400.
This test confirms that this is indeed the case.

The test is Scylla-only because on DynamoDB, this expression is rejected
because it has more than 300 "OR" operators. Scylla doesn't have this
specific limit - we believe the other limitations (on total expression
length, and on depth) are better for protecting Scylla. Remember that
in an expression like "(((((((((((((" there is a very high recursion
depth of the parser but zero operators, so counting the operators does
nothing to protect Scylla.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23973
2025-05-07 13:57:18 +03:00
Piotr Dulikowski
156ff8798b topology_coordinator: silence ERROR messages on abort
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.

Make sure that all the try..catch blocks in the topology coordinator
which:

- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that prints an ERROR-level message

...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.
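
The shape of the fix, transposed to Python for illustration. `RequestAborted`
stands in for `raft::request_aborted`, and `run_step` is a made-up wrapper;
whether the real generic handlers rethrow is not stated here, so re-raising
is a choice of this sketch.

```python
import logging

logger = logging.getLogger("topology_coordinator_sketch")


class RequestAborted(Exception):
    """Stand-in for raft::request_aborted (raised on shutdown)."""


def run_step(step) -> None:
    try:
        step()
    except RequestAborted:
        # Pass-through handler: an abort during shutdown is expected,
        # so propagate it without logging anything.
        raise
    except Exception:
        # Generic handler: everything else is still reported as an ERROR.
        logger.error("topology coordinator step failed", exc_info=True)
        raise
```

The specific handler has to come before the generic one, otherwise the abort
would be caught and logged by the catch-all.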

Fixes: scylladb/scylladb#22649

Closes scylladb/scylladb#23962
2025-05-07 13:51:41 +03:00
Aleksandra Martyniuk
20c2d6210e streaming: skip dropped tables
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.

Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.

Fixes: #15257.

Closes scylladb/scylladb#23915
2025-05-07 11:51:56 +03:00
Anna Mikhlin
73b4c35601 Update ScyllaDB version to: 2025.3.0-dev 2025-05-07 11:43:11 +03:00
Pavel Emelyanov
6389099dfb Merge 'test/cluster/test_read_repair.py: improve trace logging test (again)' from Botond Dénes
The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it is supposed to test, use tracing to assert that read repair did indeed happen.

Fixes: scylladb/scylladb#23968

Needs backport to 2025.1 and 6.2, both have the flaky test

Closes scylladb/scylladb#23989

* github.com:scylladb/scylladb:
  test/cluster/test_read_repair.py: improve trace logging test (again)
  test/cluster: extract execute_with_tracing() into pylib/util.py
2025-05-07 10:32:45 +03:00
Botond Dénes
0a9ca52cfd replica/database: memtable_list: save ref to memtable_table_shared_data
This is passed by reference to the constructor, but a copy is saved into
the _table_shared_data member. A reference to this member is passed down
to all memtable readers. Because of the copy, the memtable readers save
a reference to the memtable_list's member, which goes away together with
the memtable_list when the storage_group is destroyed.
This causes use-after-free when a storage group is destroyed while a
memtable read is still ongoing. The memtable reader keeps the memtable
alive, but its reference to the memtable_table_shared_data becomes
stale.
Fix by saving a reference in the memtable_list too, so memtable readers
receive a reference pointing to the original replica::table member,
which is stable across tablet migrations and merges.
The copy was introduced by 2a76065e3d.
There was a copy even before this commit, but in the previous vnode-only
world this was fine -- there was one memtable_list per table and it was
around until the table itself was. In the tablet world, this is no
longer given, but the above commit didn't account for this.

A test is included, which reproduces the use-after-free on memtable
migration. The test is somewhat artificial in that the use-after-free
would be prevented by holding on to an ERM, but this is done
intentionally to keep the test simple. Migration -- unlike merge where
this use-after-free was originally observed -- is easy to trigger from
unit tests.

Fixes: #23762

Closes scylladb/scylladb#23984
2025-05-06 22:13:17 +03:00
Michał Chojnowski
1c1741cfbc logalloc_test: don't test performance in test background_reclaim
The test is failing in CI sometimes due to performance reasons.

There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
   doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
   reclaim, the `background_reclaim` scheduling group isn't actually
   guaranteed to get any CPU, regardless of shares. If the process is
   switched out inside the `background_reclaim` group, it might
   accumulate so much vruntime that it won't get any more CPU again
   for a long time.

We have seen both.

This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.

So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test).
2025-05-06 18:59:18 +02:00
Michał Chojnowski
c47f438db3 logalloc: make background_reclaimer::free_memory_threshold publicly visible
Wanted by the change to the background_reclaim test in the next patch.
2025-05-06 18:59:18 +02:00
David Garcia
b1ee0e2a6a docs: fix AttributeError with 'myst_enable_extensions' in publication workflow
Rolled back some dependencies in `poetry.lock` to previous versions while we investigate how to make the extension `sphinx_scylladb_markdown` compatible with the latest versions.

This should fix the error in https://github.com/scylladb/scylladb/actions/runs/14708656912/job/41275115239, which currently prevents publishing new versions of https://opensource.docs.scylladb.com/

Closes scylladb/scylladb#23969
2025-05-06 16:33:00 +03:00
Pavel Emelyanov
1b5bbc2433 Merge 'test.py: split boost pytest integration' from Andrei Chekun
This PR contains changes that add no new functionality, only small refactorings of the existing code.
The most significant change is the refactoring of resource gathering, so it will not create another cgroup to put itself in. So there will be no nested redundant 'initial' groups, e.g. `/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/initial/initial/initial.../initial`
This is part two of splitting the original PR.

This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278.

Closes scylladb/scylladb#23882

* github.com:scylladb/scylladb:
  test.py: add awareness of extra_scylla_cmdline_options
  test.py: increase timeout for C++ tests in pytest
  test.py: switch method of finding the root repo directory
  test.py: move get_combined_tests to the correct facade
  test.py: add common directory for reports
  test.py: add the possibility to provide additional env vars
  test.py: move setup cgroups to the generic method
  test.py: refactor resource_gather.py
2025-05-06 16:22:49 +03:00
Raphael S. Carvalho
434c2c4649 replica: Fix use-after-free with concurrent schema change and sstable set update
When schema is changed, sstable set is updated according to the compaction
strategy of the new schema (no changes to set are actually made, just
the underlying set type is updated), but the problem is that it happens
without a lock, causing a use-after-free when running concurrently to
another set update.

Example:

1) A: sstable set is being updated on compaction completion
2) B: schema change updates the set (it's non deferring, so it
happens in one go) and frees the set used by A.
3) when A resumes, system will likely crash since the set is freed
already.

ASAN screams about it:
SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ...

Fix is about deferring update of the set on schema change to compaction,
which is triggered after new schema is set. Only strategy state and
backlog tracker are updated immediately, which is fine since strategy
doesn't depend on any particular implementation of sstable set, since
patch "sstables: Implement sstable_set_impl::all_sstable_runs()".

Fixes #22040.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-05-06 10:06:55 -03:00
Raphael S. Carvalho
628bec4dbd sstables: Implement sstable_set_impl::all_sstable_runs()
With upcoming change where table::set_compaction_strategy() might delay
update of sstable set, ICS might temporarily work with sstable set
implementations other than partitioned_sstable_set. ICS relies on
all_sstable_runs() during regular compaction, and today it triggers
bad_function_call exception if not overridden by set implementation.
To remove this strong dependency between compaction strategy and
a particular set implementation, let's provide a default implementation
of all_sstable_runs(), such that ICS will still work until the set
is updated eventually through a process that adds or removes an
sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-05-06 10:06:06 -03:00
Botond Dénes
3c3f6ca233 tools/scylla-sstable: scrub: use UUID sstable identifiers
Much easier to avoid sstable collisions. Makes it possible to scrub
multiple sstables, with multiple calls to scylla-sstable, reusing the
same output directory. Previously, each new call to scylla-sstable
scrub, would start from generation 0, guaranteeing collision.
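
Why the old scheme collided can be shown with a toy model. The two helpers are
hypothetical, and `uuid.uuid4()` is only a stand-in for whatever UUID flavor
scylla-sstable actually generates.

```python
import uuid


def sequential_generations(count: int) -> list[str]:
    """Old scheme: every invocation numbers its outputs starting from 0."""
    return [str(generation) for generation in range(count)]


def uuid_generations(count: int) -> list[str]:
    """New scheme: every output gets a fresh UUID identifier."""
    return [str(uuid.uuid4()) for _ in range(count)]


# Two separate "invocations" writing into the same output directory.
old_clash = set(sequential_generations(2)) & set(sequential_generations(2))
new_clash = set(uuid_generations(2)) & set(uuid_generations(2))

print("sequential collisions:", sorted(old_clash))
print("uuid collisions:", sorted(new_clash))
```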

Remove the unit test for generation clash -- with UUID generations, this
is no longer possible to reproduce in practice.

Refs: #21387

Closes scylladb/scylladb#23990
2025-05-06 15:09:53 +03:00
Patryk Jędrzejczak
7f843e0a5c Merge 'raft: make sure to retain the existing voters including the current leader (topology coordinator)' from Emil Maskovsky
Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack.

Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election.
To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal.

This change ensures a more stable voter distribution and reduces unnecessary voter reassignments.

The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations.
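
A much-simplified sketch of priority-queue-based selection. It ignores data
centers and racks entirely and only shows the two retention rules described
above (keep the current leader, then keep existing voters); `Node`,
`select_voters` and the priority formula are invented for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str               # assumed unique, also used as a tie-breaker
    is_voter: bool = False  # already a voter before recalculation
    is_leader: bool = False # current topology coordinator (Raft leader)


def select_voters(nodes: list[Node], max_voters: int) -> list[Node]:
    # heapq is a min-heap, so negate the priority: the leader pops first,
    # then existing voters, then everyone else.
    heap = [(-(2 * n.is_leader + n.is_voter), n.name, n) for n in nodes]
    heapq.heapify(heap)
    chosen = []
    while heap and len(chosen) < max_voters:
        chosen.append(heapq.heappop(heap)[2])
    return chosen


nodes = [
    Node("a", is_voter=True),
    Node("b", is_voter=True, is_leader=True),
    Node("c"),
    Node("d"),
]
print([n.name for n in select_voters(nodes, max_voters=2)])
```

Extending the priority tuple is where more complex criteria would go, which is
the extensibility argument made above.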

Fixes: scylladb/scylladb#23950
Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786

No backport: The limited voters feature is currently only present in master.

Closes scylladb/scylladb#23888

* https://github.com/scylladb/scylladb:
  raft: ensure topology coordinator retains votership
  raft: retain existing voters across data centers and racks
  raft: refactor limited voters calculator to prioritize nodes
  raft: replace pointer with reference for non-null output parameter
  raft: reduce code duplication in group0 voter handler
  raft: unify and optimize datacenter and rack info creation
2025-05-06 13:49:55 +02:00
Nadav Har'El
252c5b5c9d Merge 'Alternator batch_write_item wcu' from Amnon Heiman
This series adds support for WCU tracking in batch_write_item and tests it.

The patches include:

Switch the metrics (RCU and WCU) to count whole units instead of the half-units they counted before, to make the metrics clearer for users.

Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object.

Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested.

The return handling was refactored to be coroutine-like for easier management of the consumed capacity array.

Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly.

Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently.
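
The per-item accounting can be sketched like this. The 1 KB block size follows
DynamoDB's documented write capacity unit and is an assumption about
Alternator's exact rounding; `put_wcu` and `batch_wcu` are hypothetical names,
not functions from the patch.

```python
import math

WCU_BLOCK_BYTES = 1024  # assumed: 1 WCU per started 1 KB of item size
DELETE_WCU = 1          # fixed cost of a delete, as stated above


def put_wcu(item_size_bytes: int) -> int:
    """WCU for a single put, rounded up per item (at least 1)."""
    return max(1, math.ceil(item_size_bytes / WCU_BLOCK_BYTES))


def batch_wcu(put_sizes: list[int], delete_count: int) -> int:
    """Total WCU of a batch: each item is rounded up independently."""
    return sum(put_wcu(size) for size in put_sizes) + delete_count * DELETE_WCU


# Three 400-byte puts cost 3 WCU, not ceil(1200 / 1024) = 2.
print(batch_wcu([400, 400, 400], delete_count=0))
# Mixed batch: 1 + 1 + 2 for the puts, plus 2 deletes.
print(batch_wcu([100, 100, 1500], delete_count=2))
```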

**Need backport, WCU is partially supported, and is missing from batch_write_item**

Fixes #23940

Closes scylladb/scylladb#23941

* github.com:scylladb/scylladb:
  alternator/test_metrics.py: batch_write validate WCU
  alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
  alternator/executor: add WCU for batch_write_items
  alternator/consumed_capacity: make wcu get_units public
  Alternator: Change the WCU/RCU to use units
2025-05-06 13:31:53 +03:00
Gleb Natapov
7403de241c test: add reproducer for #22777
Add sleep before starting gossiper to increase the chance of getting an old
gossiper entry about yourself before updating local gossiper info with
the new IP address.
2025-05-06 11:21:17 +03:00
Botond Dénes
29eedaa0e5 test/cluster/test_read_repair.py: improve trace logging test (again)
The test test_read_repair_with_trace_logging wants to test read repair
with trace logging. Turns out that node restart + trace-level logging
+ debug mode is too much and even with 1 minute timeout, the read repair
times out sometimes.
Refactor the test to use injection point instead of restart. To make
sure the test still tests what it is supposed to test, use tracing to
assert that read repair did indeed happen.
2025-05-06 01:35:17 -04:00
Avi Kivity
fc2204cea0 Merge ' test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits' from Botond Dénes
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail
* duplicate check of drops == 0 (just cosmetic)

Fix all three problems, the second is especially important because it made the test flaky.
Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them.

Fixes: #16794

Test fix, not a backport candidate normally, we can backport to 2025.1 if the test becomes too unstable there

Closes scylladb/scylladb#23783

* github.com:scylladb/scylladb:
  test/boost/multishard_mutation_query_test: ensure test runs with vnodes
  test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
2025-05-05 20:49:03 +03:00
Emil Maskovsky
24dfd2034b raft: ensure topology coordinator retains votership
The limited voters feature did not account for the existing topology
coordinator (Raft leader) when selecting voters to be removed.
As a result, the limited voters calculator could inadvertently remove
the votership of the current topology coordinator, triggering
an unnecessary Raft leader re-election.

This change ensures that the existing topology coordinator's votership
status is preserved unless absolutely necessary. When choosing between
otherwise equivalent voters, the node other than the topology coordinator
is prioritized for removal. This helps maintain stability in the cluster
by avoiding unnecessary leader re-elections.

Additionally, only the alive leader node is considered relevant for this
logic. A dead existing leader (topology coordinator) is excluded from
consideration, as it is already in the process of losing leadership.
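
The ordering described above can be sketched as a priority key. This is only an illustration under stated assumptions: `Node`, `priority` and `select_voters` are hypothetical names, not the actual group0 voter handler code, and the real calculator also weighs data centers and racks.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical node descriptor; names are assumed to be unique.
    name: str
    alive: bool
    is_voter: bool
    is_leader: bool = False


def priority(node):
    # Higher is better: alive nodes first, then existing voters,
    # then the current leader, but only if that leader is alive.
    return (node.alive, node.is_voter, node.alive and node.is_leader)


def select_voters(nodes, max_voters):
    # heapq is a min-heap, so negate each priority component.
    heap = [(tuple(-int(p) for p in priority(n)), n.name) for n in nodes]
    heapq.heapify(heap)
    chosen = []
    while heap and len(chosen) < max_voters:
        _, name = heapq.heappop(heap)
        chosen.append(name)
    return chosen


nodes = [
    Node("a", alive=True, is_voter=True, is_leader=True),
    Node("b", alive=True, is_voter=True),
    Node("c", alive=True, is_voter=False),
    Node("d", alive=False, is_voter=True, is_leader=True),
]
print(select_voters(nodes, 2))
```

With one voter slot, the alive leader "a" is kept ahead of the otherwise equivalent voter "b", while the dead leader "d" gets no leader bonus.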

Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786
2025-05-05 16:58:34 +02:00
Emil Maskovsky
2ae59e8a87 raft: retain existing voters across data centers and racks
Fix an issue in the voter calculator where existing voters were not
retained across data centers and racks in certain scenarios. This
occurred when voters were distributed across more data centers and racks
than the maximum allowed number of voters.

Previously, the prioritization logic for data centers and racks did not
consider the number of existing assigned voters. It only prioritized
nodes within a single data center or rack, which could result in
unnecessary reassignment of voters.

Improved the prioritization logic to account for the number of existing
voters in each data center and rack.

This change ensures a more stable voter distribution and reduces
unnecessary voter reassignments.

Fixes: scylladb/scylladb#23950
2025-05-05 16:51:48 +02:00
Emil Maskovsky
018fb63305 raft: refactor limited voters calculator to prioritize nodes
Refactor the limited voters calculator to use a priority queue for
sorting nodes by their priorities. This change simplifies the voter
selection logic and makes it more extensible for future enhancements,
such as supporting more complex priority calculations.

The priority value is determined based on the node's existing status,
including whether it is alive, a voter, or any further criteria.
2025-05-05 16:36:17 +02:00
Emil Maskovsky
26fdc7b8f8 raft: replace pointer with reference for non-null output parameter
The output parameter cannot be `null`. Previously, a pointer was used to
make it explicit that the parameter is an output parameter being
modified. However, this is unnecessary, as references are more
appropriate for parameters that cannot be `null`.

Switching to a reference improves code readability and ensures the
parameter's non-null constraint is enforced at the type level.
2025-05-05 16:12:00 +02:00
Emil Maskovsky
f0468860a3 raft: reduce code duplication in group0 voter handler
Refactor the group0 voter handler by introducing a helper lambda to
handle the common logic for adding a node. This eliminates unnecessary
code duplication.

This refactor does not introduce any functional changes but prepares
the codebase for easier future modifications.
2025-05-05 16:09:53 +02:00
Botond Dénes
855411caad test/boost/multishard_mutation_query_test: ensure test runs with vnodes
All tests in this suite use the default "ks" keyspace from cql_test_env.
This keyspace has tablet support and at any time we might decide to make
it use tablets by default. This would make all these tests use the
tablet path in multishard_mutation_query.cc. These tests were created to
test the vastly more complex vnodes code path in said file. The tablet
path is much simpler and it is only used by SELECT * FROM
MUTATION_FRAGMENTS(), which has its own correctness tests.
So explicitly create a vnodes keyspace and use it in all the tests to
restore the test functionality.
2025-05-05 09:22:54 -04:00
Botond Dénes
1175e1ed49 test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from
  2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario,
  this throws off per-page increment checks, when the previous scenario
  moved these metrics and they don't start from 0; this causes the test
  to sometimes fail
* duplicate check of drops == 0 (just cosmetic)

Fix all three problems, the second is especially important because it
made the test flaky.
2025-05-05 09:22:53 -04:00
Emil Maskovsky
2ef654149f raft: unify and optimize datacenter and rack info creation
Refactor the code to use a consistent pattern for creating the
datacenter info list and the rack info list.

Both now use a map of vectors, which improves efficiency by reducing
temporary conversions to maps/sets during node list processing.

Also ensure the node descriptor is passed by reference instead of by
copy, leveraging the guaranteed lifetime of the descriptors.
2025-05-05 15:15:17 +02:00
Pavel Emelyanov
cf1ffd6086 Merge 'sstables_loader: fix the racing between get_progress() and release_resources()' from Kefu Chai
This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them.

Problem:
- Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory.

Solution:
- Implemented shared/exclusive locking to protect access to both state and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent progress reporting
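
The locking scheme above can be sketched with asyncio. This is a simplified model under stated assumptions, not the Seastar code: `SharedExclusiveLock` and `Loader` are hypothetical classes, and `asyncio.sleep(0)` stands in for the preemption point between the state check and the access to the per-shard data.

```python
import asyncio


class SharedExclusiveLock:
    # Minimal shared/exclusive lock; it does not guard against writer starvation.
    def __init__(self):
        self._cond = asyncio.Condition()
        self._readers = 0
        self._writer = False

    async def acquire_shared(self):
        async with self._cond:
            await self._cond.wait_for(lambda: not self._writer)
            self._readers += 1

    async def release_shared(self):
        async with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    async def acquire_exclusive(self):
        async with self._cond:
            await self._cond.wait_for(
                lambda: not self._writer and self._readers == 0)
            self._writer = True

    async def release_exclusive(self):
        async with self._cond:
            self._writer = False
            self._cond.notify_all()


class Loader:
    def __init__(self):
        self._lock = SharedExclusiveLock()
        self._progress_per_shard = [10, 20]  # becomes None once released

    async def get_progress(self):
        await self._lock.acquire_shared()
        try:
            if self._progress_per_shard is None:
                return None
            await asyncio.sleep(0)  # preemption point after the state check
            return sum(self._progress_per_shard)
        finally:
            await self._lock.release_shared()

    async def release_resources(self):
        await self._lock.acquire_exclusive()
        try:
            self._progress_per_shard = None
        finally:
            await self._lock.release_exclusive()


async def demo():
    loader = Loader()
    return await asyncio.gather(
        loader.get_progress(), loader.release_resources(), loader.get_progress())
```

Because `release_resources()` waits until no reader holds the lock, a reader that passed the state check always finishes before the data is dropped.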

Fixes #23801

---

This change addresses a race in tracking the restore progress from S3 using Scylla's native API, which is not used in production yet, hence no need to backport.

Closes scylladb/scylladb#23808

* github.com:scylladb/scylladb:
  sstables_loader: fix the indent
  sstables_loader: fix the racing between get_progress() and release_resources()
2025-05-05 15:45:15 +03:00
Avi Kivity
e688e89430 tools: toolchain: clear .cache and .cargo directories
The .cache and .cargo directories are used during pip and rust builds
when preparing the toolchain, but aren't useful afterwards. Remove them
to save a bit of space.

Closes scylladb/scylladb#23955
2025-05-05 14:43:14 +03:00
Avi Kivity
4c1f4c419c tools: toolchain: dbuild: run as root in container under podman
Running as root enables nested containers under podman without
trouble from uid remapping. Unlike docker, under podman uid 0 in
the container is remapped to the host uid for bind mounts, so writes
to the build directory do not end up owned by root on the host.

Nested containers will allow us to consume opensearch, cassandra-stress,
and minio as containers rather than embedding them into the frozen
toolchain.

Closes scylladb/scylladb#23954
2025-05-05 14:40:43 +03:00
Amnon Heiman
2ab99d7a07 alternator/test_metrics.py: batch_write validate WCU
This patch adds a test that verifies the WCU metrics are updated
correctly during a batch_write_item operation.
It ensures that the WCU of each item is calculated independently.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:24 +03:00
Amnon Heiman
14570f1bb5 alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
This patch adds two tests:
A test that validates WCU calculation for batch put requests on a single table.

A test that validates WCU calculation for batch requests across multiple
tables, including ensuring that delete operations are counted as 1 WCU.

Both tests verify that the consumed capacity is reported correctly
according to the WCU rules.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:23 +03:00
Amnon Heiman
68db77643f alternator/executor: add WCU for batch_write_items
This patch adds consumed capacity unit support to batch_write_item.

It calculates the WCU based on an item's length (for put) or a static 1
WCU (for delete), for each item on each table.

The WCU metrics are always updated. If the user requests consumed
capacity, a vector of consumed capacity is returned with an entry for
each of the tables.
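
As a rough sketch of the per-table accounting described above: the 1 KB block size follows DynamoDB's documented write capacity rule and is an assumption here, and `put_wcu` and `batch_write_wcu` are hypothetical helpers, not the executor's functions.

```python
import math

WCU_BLOCK_BYTES = 1024  # assumed: one WCU per started 1 KB of item size
DELETE_WCU = 1          # deletes cost a fixed 1 WCU


def put_wcu(item_size_bytes):
    # Each put is rounded up on its own, with a minimum of 1 WCU.
    return max(1, math.ceil(item_size_bytes / WCU_BLOCK_BYTES))


def batch_write_wcu(requests):
    # requests: iterable of (table_name, "put" or "delete", item_size_bytes).
    # Returns a dict with one consumed-capacity entry per table.
    per_table = {}
    for table, op, size in requests:
        units = put_wcu(size) if op == "put" else DELETE_WCU
        per_table[table] = per_table.get(table, 0) + units
    return per_table


batch = [("t1", "put", 1500), ("t1", "put", 100), ("t2", "delete", 0)]
print(batch_write_wcu(batch))
```

The two puts to `t1` total 1600 bytes, which would be 2 WCU if sizes were summed first; counting each item independently gives 2 + 1 = 3.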

For code simplicity, the return part of batch_write_item was updated to
be coroutine-like; this makes it easier to manage the life cycle of the
returned consumed_capacity array.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:14 +03:00
Amnon Heiman
f2ade71f4f alternator/consumed_capacity: make wcu get_units public
This patch adds a public static get_units function to
wcu_consumed_capacity_counter.  It will be used by the batch write item
implementation, which cannot use the wcu_consumed_capacity_counter
directly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

consume_capacity need merge
2025-05-05 13:19:04 +03:00
Amnon Heiman
5ae11746fa Alternator: Change the WCU/RCU to use units
This patch changes the RCU/WCU Alternator metrics to use whole units
instead of half units. The change includes the following:

Change the metrics documentation. Keep the RCU counter internally in
half units, but return the actual (whole unit) value.
Change the RCU name to be rcu_half_units_total to indicate that it
counts half units.
Change the WCU to count in whole units instead of half units.
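
A small sketch of why a half-unit counter is convenient for RCU: under DynamoDB's documented rule (assumed here), an eventually consistent read costs half an RCU per 4 KB block, so an integer counter of half units stays exact. `RcuCounter` is a hypothetical class, not the Alternator implementation.

```python
import math

RCU_BLOCK_BYTES = 4096  # assumed: reads are billed per started 4 KB


class RcuCounter:
    def __init__(self):
        self.half_units_total = 0  # integer count of half RCUs

    def add_read(self, size_bytes, consistent):
        blocks = max(1, math.ceil(size_bytes / RCU_BLOCK_BYTES))
        # A strongly consistent read costs two half units per block,
        # an eventually consistent read costs one.
        self.half_units_total += blocks * (2 if consistent else 1)

    def units(self):
        # Value reported to users, in whole RCU units.
        return self.half_units_total / 2


counter = RcuCounter()
counter.add_read(100, consistent=False)
counter.add_read(5000, consistent=True)
print(counter.half_units_total, counter.units())
```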

Update the tests accordingly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:18:09 +03:00
Anna Stuchlik
851a433663 doc: add a link to the previous Enterprise documentation
This commit adds a link to the docs for previous Enterprise versions
at https://enterprise.docs.scylladb.com/ to the left menu.

As we still support versions 2024.1 and 2024.2, we need to ensure
easier access to those docs sets.

Fixes https://github.com/scylladb/scylladb/issues/23870

Closes scylladb/scylladb#23945
2025-05-05 12:16:47 +03:00
Avi Kivity
04fb2c026d config: decrease default large allocation warning threshold to 128k
Back in 2017 (5a2439e702), we introduced a check for large
allocations as they can stall the memory allocator. The warning
threshold was set at 1 MB. Since then many fixes for large allocations
went in and it is now time to reduce the threshold further.

We reduce it here to 128 kB, the natural allocation size for the
system. A quick run showed no warnings.

Closes scylladb/scylladb#23975
2025-05-05 12:13:48 +03:00
Pavel Emelyanov
b56d6fbb84 Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.

A trigger we observed was a memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range.

Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).
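
The split can be sketched as follows. This is a toy model under stated assumptions: token ranges are half-open integer pairs, the 0.9 threshold is invented for illustration, and `overlap_ratio` and `split_sstables` here are hypothetical stand-ins rather than the dht or sstable set code.

```python
def overlap_ratio(group_range, sstable_range):
    # Fraction of the group's token range covered by the sstable's range.
    g_start, g_end = group_range
    s_start, s_end = sstable_range
    overlap = max(0, min(g_end, s_end) - max(g_start, s_start))
    return overlap / (g_end - g_start)


def split_sstables(group_range, sstables, threshold=0.9):
    # Narrow sstables go to the interval map (leveled); sstables that span
    # most of the range go to a plain vector (unleveled), so each of them
    # is stored once instead of once per overlapped interval.
    leveled, unleveled = [], []
    for name, token_range in sstables:
        if overlap_ratio(group_range, token_range) >= threshold:
            unleveled.append(name)
        else:
            leveled.append(name)
    return leveled, unleveled


group = (0, 1000)
sstables = [("a", (0, 100)), ("b", (0, 1000)), ("c", (50, 980))]
print(split_sstables(group, sstables))
```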

There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span a wide range and would have been selected in the combined reader if the range of scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.

Fixes #23634.

We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.

Closes scylladb/scylladb#23806

* github.com:scylladb/scylladb:
  test: Verify partitioned set store split and unsplit correctly
  sstables: Fix quadratic space complexity in partitioned_sstable_set
  compaction: Wire table_state into make_sstable_set()
  compaction: Introduce token_range() to table_state
  dht: Add overlap_ratio() for token range
2025-05-05 11:28:38 +03:00
David Garcia
4ba7182515 docs: fix md redirections for multiversion support
This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page.

The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files.

This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases.

Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment.

Closes scylladb/scylladb#23957
2025-05-05 10:39:39 +03:00
Pavel Emelyanov
7b786d9398 topology_coordinator: Use this->_feature_service directly
This dependency is already there, topology coordinator doesn't need
to use database reference to get to the features.

Previous patch of the same kind: b79137eaa4

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23777
2025-05-05 09:37:29 +02:00
Piotr Dulikowski
05c797795f Merge 'Simplify test/sstable_assertions class API' from Pavel Emelyanov
It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697), now it can be put on some more strict diet.

Closes scylladb/scylladb#23815

* github.com:scylladb/scylladb:
  test: Remove sstable_assertions::get_stats_metadata()
  test: Add sstable_assertions::operator->()
2025-05-05 09:33:45 +02:00
Nadav Har'El
834107ae97 test/cqlpy,alternator: fix reporting of Scylla crash during test
The cqlpy and alternator test frameworks use a single Scylla node started
once for all tests to run on. In the distant past, we had a problem where
if one test caused Scylla to crash, the result was a confusing report of
hundreds of failed tests - all tests after the crash "failed" and it wasn't
easy to find which test really caused the crash.

Our old solution to this problem was to have an autouse fixture (called
cql_test_connection or dynamodb_test_connection) which tested the
connection at the end of each test, and if it detected Scylla has
crashed - it used pytest.exit() to report the error and have pytest
exit and therefore stop running any further tests (which would have
led to all of them failing).

This approach had two problems:

1. The pytest.exit() caused the entire cqlpy suite to report a failure,
   but not the individual test - the individual test might have
   failed as well, but that isn't guaranteed and in any case this test's
   output is missing the informative message that Scylla crashed during
   the test. This was fine when for each cqlpy failure we had two separate
   error logs in Jenkins - the specific failed function, and the failed
   file - but when we recently got rid of the duplication by removing the
   second one, we no longer see the "Scylla crashed" messages any more.

2. Exiting pytest will be the wrong thing to do if the same pytest
   run could run tests from different test suites. We don't do this
   today, but we plan to support this approach soon.

This patch fixes both problems by replacing the pytest.exit() call by
setting a "scylla_crashed" flag and using pytest.fail(). The pytest.fail()
causes the current test - the one which caused Scylla to crash - to be
reported as an "ERROR" and the "Scylla crashed" message will correctly
appear in this test's log. The flag will cause all other tests in the
same test suite to be skip()ed. But other tests in other directories,
depending on different fixtures, might continue to run normally.

Fixes #23287

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23307
2025-05-05 10:15:56 +03:00
Nadav Har'El
3ce7e250cc alternator: fix schema "concurrent modification" errors
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.
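
The read, prepare, commit-if-unchanged loop can be sketched like this. It is a toy model: `SchemaStore`, `ConcurrentModification` and `create_table` are hypothetical names, and the `before_commit` hook exists only to simulate another node winning the race.

```python
class ConcurrentModification(Exception):
    pass


class SchemaStore:
    def __init__(self):
        self.version = 0
        self.tables = set()

    def read(self):
        return self.version, set(self.tables)

    def commit(self, expected_version, new_tables):
        # Apply the change only if nobody committed since our read.
        if expected_version != self.version:
            raise ConcurrentModification()
        self.tables = set(new_tables)
        self.version += 1


def create_table(store, name, max_attempts=10, before_commit=None):
    # Returns the number of attempts that were needed.
    for attempt in range(1, max_attempts + 1):
        version, tables = store.read()
        tables.add(name)
        if before_commit is not None:
            before_commit()
        try:
            store.commit(version, tables)
            return attempt
        except ConcurrentModification:
            continue  # schema changed under us: re-read and retry
    raise ConcurrentModification()


store = SchemaStore()
raced = []


def interfere():
    # Simulate a concurrent schema change, but only on the first attempt.
    if not raced:
        raced.append(True)
        store.commit(store.version, store.tables | {"other"})


attempts = create_table(store, "t", before_commit=interfere)
print(attempts, sorted(store.tables))
```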

In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had FIXME in all places. So all these operations,
when facing contention of concurrent schema modifications on different
nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.

This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.

The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduce the bug separately for each of the
six operations, so if we forget to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.

The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827
2025-05-05 09:59:08 +03:00
Pavel Emelyanov
d40d6801b0 sstable_directory: Print ks.cf when moving unshared remote sstables
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:

    sstable_directory - Moving 1 unshared SSTables to shard 1

Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
above is not very informative, as it tells nothing about sstables that
are being moved.

Equip the message with ks:cf pair to make it more informative.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23912
2025-05-05 09:45:44 +03:00
Pavel Emelyanov
e0f30a30a7 sstable_directory: Print unshared remote sstable when sorting
When collecting sstables, the sstable_directory may sort the collected
descriptors into one of three buckets -- unshared local and remote, and
shared ones. Unshared local and shared sstables' paths are logged (with
trace level) while unshared remote is silently collected for further
processing. Add log message for that case too, there's enough data to
print the sstable path as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23913
2025-05-05 09:33:06 +03:00
Gleb Natapov
ecd14753c0 storage_service: Do not remove gossiper entry on address change
When the gossiper indexed entries by IP, an old entry had to be removed on an
address change, but the index is now id-based, so even if the IP has changed
the entry should stay. The gossiper simply updates the IP address there.
2025-05-04 17:59:07 +03:00
Gleb Natapov
a2178b7c31 storage_service: use id to check for local node
IP may change and an old gossiper message with previous IP may be
processed when it shouldn't.

Fixes: #22777
2025-05-04 17:59:07 +03:00
Botond Dénes
51025de755 test/cluster: extract execute_with_tracing() into pylib/util.py
To allow reuse in other tests.
2025-05-02 01:53:35 -04:00
Piotr Dulikowski
8ffe4b0308 utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.
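
A minimal sketch of the fixed callback shape, assuming a gate with a non-throwing entry check. `Gate.try_enter` and `on_timer` are hypothetical names used only for illustration; they are not the Seastar gate API or the loading_cache code.

```python
class Gate:
    def __init__(self):
        self._closed = False
        self._holders = 0

    def try_enter(self):
        # Non-throwing entry: report failure instead of raising.
        if self._closed:
            return False
        self._holders += 1
        return True

    def leave(self):
        self._holders -= 1

    def close(self):
        self._closed = True


def on_timer(gate, reload_entries):
    # Enter the gate first; if the cache is stopping, skip all timer logic.
    if not gate.try_enter():
        return False
    try:
        reload_entries()
        return True
    finally:
        gate.leave()


gate = Gate()
runs = []
first = on_timer(gate, lambda: runs.append("reload"))
gate.close()
second = on_timer(gate, lambda: runs.append("reload"))
print(first, second, runs)
```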

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952
2025-04-30 16:43:22 +03:00
Benny Halevy
4bd0845fce gossiper: make send_gossip_echo cancellable
Currently send_gossip_echo has a 22 seconds timeout
during which _abort_source is ignored.

Mark the verb as cancellable so it can be canceled
on shutdown / abort.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-30 11:46:10 +03:00
Benny Halevy
fa1c3e86a9 gossiper: add send_echo helper
Call send_gossip_echo using a centralized helper.
A following patch will make it abortable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-30 11:45:51 +03:00
Benny Halevy
0b97806771 idl, message: make with_timeout and cancellable verb attributes composable
And define `send_message_timeout_cancellable` in rpc_protocol_impl.hh
using the newly introduced rpc_handler entry point
in seastar that accepts both timeout and cancellable params.

Note that the interface to the user still uses abort_source
while internally the function allocates a seastar::rpc::cancellable
object.  It is possible to provide an interface that will accept
a rpc::cancellable& from the caller, but the existing messaging api
uses abort_source.  Changing it may be considered in the future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-30 11:45:51 +03:00
Benny Halevy
e06d226d08 gossiper: failure_detector_loop_for_node: ignore abort_requested_exception
Aborting the failure detector happens normally
when the node shuts down.

There's no need to log anything about it,
as long as we abort the function cleanly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-30 11:05:24 +03:00
Benny Halevy
83c69642f7 gossiper: failure_detector_loop_for_node: check if abort_requested in loop condition
The same as the loop condition in the direct_failure_detector.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-30 11:05:24 +03:00
Aleksandra Martyniuk
1f4edd8683 test_tablet_tasks: use injection to revoke resize
Currently, test_tablet_resize_revoked tries to trigger split revoke
by deleting some rows. This method isn't deterministic, so the test
is flaky.

Use error injection to trigger resize revoke.

Fixes: #22570.

Closes scylladb/scylladb#23966
2025-04-30 07:04:57 +03:00
Michał Chojnowski
9e2343ecb0 test_sstable_compression_dictionaries_autotrain: raise the timeout
There were CI runs in which the training happened as planned,
but it was too slow to fit within the timeout.

Raise the timeout to pacify the CI.

Fixes scylladb/scylladb#23964

Closes scylladb/scylladb#23965
2025-04-29 22:09:14 +03:00
Raphael S. Carvalho
d5bee4c814 test: Verify partitioned set store split and unsplit correctly
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
c77f710a0c sstables: Fix quadratic space complexity in partitioned_sstable_set
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.

A trigger we observed was a memtable flush storm, which creates many
small "L0" sstables that span roughly the entire token range.

Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span a wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.

Fixes #23634.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
59dad2121f compaction: Introduce token_range() to table_state
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
494ed6b887 dht: Add overlap_ratio() for token range
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Patryk Jędrzejczak
0cdcf82cd0 Merge 'topology coordinator: do not proceed further on invalid boostrap tokens' from Piotr Dulikowski
When dht::boot_strapper::get_boostrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897

From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.

Closes scylladb/scylladb#23914

* https://github.com/scylladb/scylladb:
  test: cluster: add test_bad_initial_token
  topology coordinator: do not proceed further on invalid boostrap tokens
  cdc: add sanity check for generating an empty generation
2025-04-28 12:45:33 +02:00
Michał Chojnowski
7f9152babc utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.

Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.

This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.

Fix this by accounting for the metadata.
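
The arithmetic can be illustrated as follows. The sizes are made-up example values (a 128 KiB limit, an 8-byte backreference, 8-byte elements), and both functions are hypothetical helpers, not the real `max_chunk_capacity()`.

```python
def naive_chunk_capacity(max_alloc_bytes, element_bytes):
    # Old calculation: ignores the metadata stored inside the allocation.
    return max_alloc_bytes // element_bytes


def fixed_chunk_capacity(max_alloc_bytes, metadata_bytes, element_bytes):
    # Reserve room for the metadata before dividing by the element size.
    return (max_alloc_bytes - metadata_bytes) // element_bytes


def allocation_size(capacity, metadata_bytes, element_bytes):
    return metadata_bytes + capacity * element_bytes


LIMIT, META, ELEM = 128 * 1024, 8, 8  # illustrative values only

naive = naive_chunk_capacity(LIMIT, ELEM)
fixed = fixed_chunk_capacity(LIMIT, META, ELEM)
print(naive, allocation_size(naive, META, ELEM))
print(fixed, allocation_size(fixed, META, ELEM))
```

With these example numbers the naive capacity yields an allocation 8 bytes over the limit, while the fixed one lands exactly on it.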

Also, before the patch, `chunked_managed_vector::max_contiguous_allocation`
repeats the definition of `logalloc::max_managed_object_size`.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.
2025-04-28 12:30:13 +02:00
Botond Dénes
d582c436e5 Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

Fixes: https://github.com/scylladb/scylladb/issues/22514.

Needs backport to 2025.1 and 6.2 as they contain the bug.

Closes scylladb/scylladb#23787

* github.com:scylladb/scylladb:
  test: add test for getting tasks children
  tasks: check whether a node is alive before rpc
2025-04-28 09:32:45 +03:00
Nadav Har'El
262530f27c Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as well as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). To achieve this, in this series we remove
all base_info members that can change due to a base schema update,
and we calculate the remaining values during view update generation,
using the most up-to-date base schema version.

To calculate the values that depend on the base schema version, we
need to iterate over the view primary key and find the corresponding
columns, which adds extra overhead for each batch of view updates.
However, this overhead should be relatively small, as when creating
a view update, we need to prepare each of its columns anyway. And
if we need to read the old value of the base row, the relative
overhead is even lower.

After this change, the base info in view schemas stays the same
for all base schema updates, so we'll no longer get issues with
base_info being incompatible with a base schema version. Additionally,
it's a step towards making the schema objects immutable, which
we sometimes incorrectly assumed in the past (they're still not
completely immutable yet, as some other fields in view_info other
than base_info are initialized lazily and may depend on the base
schema version).

Fixes https://github.com/scylladb/scylladb/issues/9059
Fixes https://github.com/scylladb/scylladb/issues/21292
Fixes https://github.com/scylladb/scylladb/issues/22194
Fixes https://github.com/scylladb/scylladb/issues/22410

Closes scylladb/scylladb#23337

* github.com:scylladb/scylladb:
  test: remove flakiness from test_schema_is_recovered_after_dying
  mv: add a test for dropping an index while it's building
  base_info: remove the lw_shared_ptr variant
  view_info: don't re-set base_info after construction
  base_info: remove base_info snapshot semantics
  base_info: remove base schema from the base_info
  schema_registry: store base info instead of base schema for view entries
  base_info: make members non-const
  view_info: move the base info to a separate header
  view_info: move computation of view pk columns not in base pk to view_updates
  view_info: move base-dependent variables into base_info
  view_info: set base info on construction
2025-04-27 19:12:12 +03:00
David Garcia
cf7d846b9e docs: update dependencies
This is a mandatory dependency update to resolve a critical Dependabot alert. For more details, see the [Dependabot alerts](https://docs.github.com/en/code-security/dependabot/dependabot-alerts/viewing-and-updating-dependabot-alerts).

Closes scylladb/scylladb#23918

Fixes #23935
2025-04-27 18:45:11 +03:00
Piotr Szymaniak
e588c8667f alternator: Limit attribute name lengths
Attribute names are now checked against DynamoDB-compatible length
limits. When exceeded, Alternator emits an exception identical or similar
to the DDB one. It might be worth noting that DDB emits more than a
single kind of an exception string for some exceptions. The tests'
catch clauses handle all the observed kinds of messages from DynamoDB.
The validation differentiates between key and non-key attributes and
applies the limit accordingly.

AWS DDB raises exceptions with somewhat different contents when the
get request contains ProjectionExpression, so this case needed separate
treatment to emit the corresponding exception string. The
length-validating function was declared and defined in
expressions.hh/.cc respectively, because that's where the relevant
parsing happens.

** Tests

The following tests were validated when handling this issue:
test_limit_attribute_length_nonkey_good,
test_limit_attribute_length_nonkey_bad,
test_limit_attribute_length_key_good,
test_limit_attribute_length_key_bad,
test_limit_attribute_length_gsi_lsi_good,
test_limit_attribute_length_gsi_lsi_bad,
test_limit_attribute_length_gsi_lsi_projection_bad.

Some of the tests were expanded into being more granular. Namely, there
is a new test function
`test_limit_attribute_length_key_bad_incoherent_names`
which groups tests with too long attribute names in the case of
incorrect (incoherent) user requests.
Similarly, there is a new test function
`test_limit_attribute_length_gsi_lsi_bad_incoherent_names`.
The tests now cover each combination of the key/keys being too long.
Both new functions contain tests that verify that ScyllaDB throws
length-related exceptions (instead of the coherency-related ones), similar
to what DynamoDB does.

The new test test_limit_gsiu_key_len_bad covers the case of too long
attribute name inside GlobalSecondaryIndexUpdates.
The new test test_limit_gsiu_key_len_bad_incoherent_names covers the
case of incorrect (incoherent) user requests containing too long
attribute names and GlobalSecondaryIndexUpdates.

test_limit_attribute_length_key_bad was found to have contained an
illegal KeySchema structure.

Some of the tests had their match clause corrected.

All the tests are stripped of the xfail flag except
test_limit_attribute_length_key_bad, which has it changed since it
still fails due to Projection in GSI and LSI not being implemented in Alternator.
The xfail now points to #5036.

Fixes scylladb/scylladb#9169

Closes scylladb/scylladb#23097
2025-04-27 18:39:20 +03:00
Piotr Dulikowski
82e1678fbe test: mv: skip test_mv_tablets_empty_ip in debug mode
This test shuts down a node and then replaces it with another one while
continuously writing to the cluster. The test has been observed to take
a lot of time in debug mode and time out on the replace operation.
Replace takes very long because rebuilding tablets on the new node is
very slow, and the slowest part is memtable flush which happens at the
beginning of streaming. The slowness seems to be specific to the debug
mode.

Turn off the test in debug mode to deflake the CI. As a follow-up, the
test is planned to be reworked into a quicker error injection test so
that the code path tested by this test will again be exercised in debug
unit tests (scylladb/scylladb#23898).

Fixes: scylladb/scylladb#20316

Closes scylladb/scylladb#23900
2025-04-27 18:06:08 +03:00
Piotr Dulikowski
670a69007e test: cluster: add test_bad_initial_token
Adds a test which checks that rollback works properly in case when a bad
value of the initial_token function is provided.
2025-04-25 12:25:15 +02:00
Piotr Dulikowski
845cedea7f topology coordinator: do not proceed further on invalid boostrap tokens
In case dht::boot_strapper::get_bootstrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving bootstrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897
2025-04-25 11:30:01 +02:00
Piotr Dulikowski
66acaa1bf8 cdc: add sanity check for generating an empty generation
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).

The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially since this crash could happen on the topology coordinator.
2025-04-25 11:25:07 +02:00
Aleksandra Martyniuk
76cd707b18 test: test_tablets: wait for cql
Wait for cql after rolling restart in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
to prevent failing queries.

Fixes: #23620.

Closes scylladb/scylladb#23796
2025-04-24 21:25:29 +03:00
Patryk Jędrzejczak
2a8bb47cfb test: test_zero_token_nodes_topology_ops: use host IDs for ignored nodes
Providing IP of an ignored node during removenode made the test flaky.
It could happen that the address map contained mappings of two
nodes with the same IP:
1. the node being ignored,
2. the node that expectedly failed replacing earlier in the test.

So, `address_map::find_by_addr()` called in `find_raft_nodes_from_hoeps`
could return the host ID of the second node instead of the first node
and cause removenode to fail.

We fix flakiness in this patch by providing the host ID of the ignored
node instead of its IP. We would have to do it anyway sooner or later
because providing IP is deprecated.

The bug in `find_raft_nodes_from_hoeps` is tracked by
scylladb/scylladb#23846.

The test became flaky because of f0af3f261e.
That patch is not present in 2025.1, so the test isn't flaky outside
master, and hence there is no reason to backport this patch.

Fixes scylladb/scylladb#23499

Closes scylladb/scylladb#23863
2025-04-24 20:17:19 +03:00
Pavel Emelyanov
68a178eba9 Merge 'replica: skip flush of dropped table' from Aleksandra Martyniuk
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of a dropped table instead.

Fixes: #16095.

Needs backport to 2025.1 and 6.2 as they contain the bug

Closes scylladb/scylladb#23876

* github.com:scylladb/scylladb:
  test: test table drop during flush
  replica: skip flush of dropped table
2025-04-24 20:02:59 +03:00
Andrei Chekun
22ef09489d test.py: add awareness of extra_scylla_cmdline_options
test_config.yaml can have an extra_scylla_cmdline_options field, which
previously was not added to the command line used to start Scylla. Now any
extra options are added to that command line.
2025-04-24 14:05:50 +02:00
Andrei Chekun
2758c4a08e test.py: increase timeout for C++ tests in pytest
The current timeouts are not enough. Tests failed randomly by hitting the
timeout. This change allows tests to finish normally. As a downside, if a
process hangs, we will wait longer. These values will be adjusted again
once we have metrics on how long tests take to pass in
each mode.
2025-04-24 14:05:50 +02:00
Andrei Chekun
f5c88e1107 test.py: switch method of finding the root repo directory
Switch to using the constant defined in the __init__ file instead of getting
the root directory from pytest's config. This gives us only
one source of truth for the root directory of the project and
avoids cases where the root directory is defined incorrectly. This change also
simplifies potential future changes.
2025-04-24 14:05:50 +02:00
Andrei Chekun
06eca04370 test.py: move get_combined_tests to the correct facade
Since the get_combined_tests method is used only for boost tests and not for all C++ tests, move it to the correct place.
2025-04-24 14:05:49 +02:00
Andrei Chekun
8cc9c0a53a test.py: add common directory for reports
When test.py executes Python tests, it does so per mode and per file,
so it knows which mode directory a report belongs in. With the new approach,
pytest will execute the tests for all modes within a single run, and we can
have only one report per pytest invocation. That's why we need a common
directory for reports that is not under the mode directory. It can later be
used for simplification, so that every report ends up there.
2025-04-24 14:05:49 +02:00
Andrei Chekun
b791af1f16 test.py: add the possibility to provide additional env vars
This allows injecting any environment variable into the test, because
previously tests received only the environment variables of the process.
Also inject the ASAN and UBSAN variables into the tests.
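
The idea of layering extra variables over the inherited environment can be sketched as follows. The helper name `build_test_env` and the sanitizer value are assumptions made for illustration; they are not taken from test.py.

```python
import os


def build_test_env(extra_env=None):
    """Return the environment for a test process: inherited vars plus extras."""
    env = dict(os.environ)        # start from the current process environment
    env.update(extra_env or {})   # explicitly provided variables take precedence
    return env


# Hypothetical sanitizer setting, for illustration only.
env = build_test_env({"ASAN_OPTIONS": "abort_on_error=1"})
```

The resulting dictionary can be passed as the `env` argument when spawning the test process.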
2025-04-24 14:05:49 +02:00
Andrei Chekun
3cb5838619 test.py: move setup cgroups to the generic method
This change is needed for the later integration of pytest executing the C++
tests, so that it can gather resource metrics.
2025-04-24 14:05:49 +02:00
Andrei Chekun
ca615af407 test.py: refactor resource_gather.py
Refactor resource_gather.py to not create the initial cgroup when the process is already in it. This avoids nesting deeper by creating the same cgroup again and again with each test.py execution when the terminal isn't closed.
Add creation of its own event loop in case one doesn't exist. This is needed to be able to work both with
test.py, which creates a loop, and with pytest, which does not.
2025-04-24 14:05:49 +02:00
Wojciech Mitros
ee5883770a test: remove flakiness from test_schema_is_recovered_after_dying
Due to the changes in creating schemas with base info the
test_schema_is_recovered_after_dying seems to be flaky when checking
that the schema is actually lost after 'grace_period'. We don't
actually guarantee that the schema will be lost at that exact
moment so there's no reason to test this. To remove the flakiness,
we remove the check and the related sleep, which should also slightly
improve the speed of this test.
2025-04-24 01:09:35 +02:00
Wojciech Mitros
bf7bba9634 mv: add a test for dropping an index while it's building
Dropping an index is a schema change of its base table and
a schema drop of the index's materialized view. This combination
of schema changes used to cause issues during view building, because
when a view schema was dropped, it wasn't getting updated with the
new version of the base schema, and while the view building was
in progress, we would update the base schema for the base table
mutation reader and try generating updates with a view schema that
wasn't compatible with the base schema, failing on an `on_internal_error`.

In this patch we add a test for this scenario. We create an index,
halt its view building process using an injection, and drop it.
If no errors are thrown, the test succeeds.

The test was failing before https://github.com/scylladb/scylladb/pull/23337
and is passing afterwards.
2025-04-24 01:09:32 +02:00
Wojciech Mitros
d77f11d436 base_info: remove the lw_shared_ptr variant
The base_dependent_view_info is no longer needed to be shared or
modified in the view_info, so we no longer need to keep it as
a shared pointer.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
d7bd86591e view_info: don't re-set base_info after construction
In the previous commits we made sure that the base info is not dependent
on the base schema version, and the info dependent on the base schema
version is calculated when it's needed. In this patch we remove the
unnecessary re-setting of the base_info.

The set_base_info method isn't removed completely, because it also has
a secondary function - zeroing the view_info fields other than base_info.
Because of this, in this patch we rename it accordingly and limit its
use to the updates caused by a base schema change.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
ea462efa3d base_info: remove base_info snapshot semantics
The base info in view schemas no longer changes on base schema
updates, so saving the base info with a view schema from a specific
point in time doesn't provide any additional benefits.
In this patch we remove the code using the base_and_view snapshots
as it's no longer useful.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
ad55935411 base_info: remove base schema from the base_info
The base info now only contains values which are not reliant on the
base schema version. We remove the base schema from the base info
to make it immutable regardless of base schema version; at the point
of this patch it's also not needed anywhere - the new base info can
replace the base schema in most places, and in the few (view_updates)
where we need it, we pull the most recent base schema version from
the database.

After this change, the base info no longer changes in a view schema
after creation, so we'll no longer get errors when we try generating
view updates with a base_info that's incompatible with a specific
base schema version.

Fixes #9059
Fixes #21292
Fixes #22410
2025-04-24 01:08:39 +02:00
Wojciech Mitros
05fce91945 schema_registry: store base info instead of base schema for view entries
In the following patch we plan to remove the base schema from the base_info
to make the base_info immutable. To do that, we first prepare the schema
registry for the change; we need to be able to create view schemas from
frozen schemas there and frozen schemas have no information about the base
table. Unless we do this change, after base schemas are removed from the
base info, we'll no longer be able to load a view schema to the schema registry
without looking up the base schema in the database.

This change also required some updates to schema building:
* we add a method for unfreezing a view schema with base info instead of
a base schema
* we make it possible to use schema_builder with a base info instead of
a base schema
* we add a method for creating a view schema from mutations with a base info
instead of a base schema
* we add a view_info constructor with a base info instead of a base schema
* we update the naming in schema_registry to reflect the usage of base info
instead of base schema
2025-04-24 01:08:39 +02:00
Wojciech Mitros
6e539c2b4d base_info: make members non-const
In the following patches we'll add the base info instead of the
base schema to various places (schema building, schema registry).
There, we'll sometimes need to update the base_info fields, which
we can't do with const members. There's also a place (global_schema_ptr)
where we won't be able to use the base_info_ptr (a shared pointer to the
base_info), so we can't just use the base_info_ptr everywhere instead.

In this patch we unmark these members as const.
In the following patches we'll remove the methods for changing the
base_info in the view schema, so it will remain effectively const.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
32258d8f9a view_info: move the base info to a separate header
In the following commits the base_dependent_view_info will be needed
in many more places. To avoid including the whole db/view/view.hh
or forward declaring (where possible) the base info, we move it to
a separate header which can be included anywhere at almost no cost.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
a3d2cd6b5e view_info: move computation of view pk columns not in base pk to view_updates
In preparation of making the base_info immutable, we want to get rid of
any base_dependent_view_info fields that can change when base schema
is updated.
The _base_regular_columns_in_view_pk and _base_static_columns_in_view_pk
fields hold the base column_ids of the corresponding base columns, and they can change
(decrease) when an earlier column is dropped in the base table.
view_updates is the only location where these values are used and calculating
them is not expensive when comparing to the overall work done while performing
a view update - we iterate over all view primary key columns and look them up
in the base table.
With this in mind, we can just calculate them when creating a view_updates
object, instead of keeping them in the base_info. We do that in this patch.
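
Why these ids cannot be cached safely can be shown with a toy model. The sketch below assumes that a column id is simply the column's position among the base table's regular columns; the function name and the schemas are hypothetical and only mirror the idea of recomputing the ids on demand.

```python
def column_ids_in_view_pk(base_regular_columns, view_pk_columns):
    """Map each base regular column used in the view's primary key
    to its current position (id) in the base schema."""
    position = {name: idx for idx, name in enumerate(base_regular_columns)}
    return {name: position[name] for name in view_pk_columns if name in position}


# Hypothetical schemas: regular base columns 'a', 'b', 'v'; the view key uses 'v'.
before = column_ids_in_view_pk(["a", "b", "v"], ["v", "pk"])
# Dropping the earlier column 'a' shifts the id of 'v' down by one.
after = column_ids_in_view_pk(["b", "v"], ["v", "pk"])
```

A value cached from `before` would point at the wrong column after the drop, whereas recomputing against the latest base schema always yields the current id.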
2025-04-24 01:08:39 +02:00
Wojciech Mitros
a33963daef view_info: move base-dependent variables into base_info
The has_computed_column_depending_on_base_non_primary_key
and is_partition_key_permutation_of_base_partition_key variables
in the view_info depend on the base table so they should be in the
base_dependent_view_info instead of view_info.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
900687c818 view_info: set base info on construction
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as well as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). The first step towards that is making
sure that all newly created schemas have the base info set.
We achieve that by requiring a base schema when constructing a view
schema. Unfortunately, this adds complexity each time we're making
a view schema - we need to get the base schema as well.
In most cases, the base schema is already available. The most
problematic scenario is when we create a schema from mutations:
- when parsing system tables we can get the schema from the
database, as regular tables are parsed before views
- when loading a view schema using the schema loader tool, we need
to load the base additionally to the view schema, effectively
doubling the work
- when pulling the schema from another node - in this case we can
only get the current version of the base schema from the local
database

Additionally, we need to consider the base schema version - when
we generate view updates the version of the base schema used for
reads should match the version of the base schema in view's base
info.
This is achieved by selecting the correct (old or new) schema in
`db::schema_tables::merge_tables_and_views` and using the stored
base schema in the schema_registry.
2025-04-24 01:08:39 +02:00
Benny Halevy
f279625f59 test_tablets_cql: test_alter_dropped_tablets_keyspace: extend expected error
The query may fail also on a no_such_keyspace
exception, which generates the following cql error:
```
Error from server: code=2200 [Invalid query] message="Can\'t find a keyspace test_1745198244144_qoohq"
```
Extend the pytest.raises match expression to include
this error as well.
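
A match expression that accepts either error could look like the sketch below. The first alternative is a hypothetical stand-in for the message the test already expected (it is not quoted in this commit); only the second alternative comes from the error shown above.

```python
import re

# First alternative: hypothetical placeholder for the previously expected error.
# Second alternative: the no_such_keyspace message quoted above.
EXPECTED_ERROR = r"(does not exist|Can't find a keyspace)"

no_such_keyspace_msg = (
    "Error from server: code=2200 [Invalid query] "
    'message="Can\'t find a keyspace test_1745198244144_qoohq"'
)

matched = re.search(EXPECTED_ERROR, no_such_keyspace_msg) is not None
```

`pytest.raises(..., match=EXPECTED_ERROR)` applies `re.search` to the string form of the raised exception, so the same pattern can be passed there directly.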

Fixes #23812

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23875
2025-04-23 18:54:22 +03:00
Benny Halevy
2bbdaeba1c Update seastar submodule
* seastar e44af9b0...d7ff58f2 (2):
  > rpc: client: support timeout and cancellation
  > doc/io-properties-file.md: correct a typo

Closes scylladb/scylladb#23865
2025-04-23 16:10:51 +03:00
Aleksandra Martyniuk
c1618c7de5 test: test table drop during flush 2025-04-23 14:29:28 +02:00
Aleksandra Martyniuk
91b57e79f3 replica: skip flush of dropped table 2025-04-23 14:29:28 +02:00
Kefu Chai
0d7752b010 build: cmake: generalize update_cxx_flags()
Refactor our CMake flag handling to make it more flexible and reduce
repetition:

- Rename update_cxx_flags() to update_build_flags() to better reflect
  its expanded purpose
- Generate CMake variable names internally based on configuration type
  instead of requiring callers to specify full variable names
- Follow CMake's standard naming conventions for configuration-specific
  flags, see
  https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_FLAGS.html#variable:CMAKE_%3CLANG%3E_FLAGS
- Prepare groundwork for handling linker flags in addition to compiler
  flags in future changes

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23842
2025-04-23 12:06:04 +03:00
Nadav Har'El
64a5eee6b9 test/cqlpy: insert test names into Scylla logs
Both test.py and test/cqlpy/run run many test functions against the same
Scylla process. In the resulting log file, it is hard to understand which
log messages are related to which test. In this patch, we log a message
(using the "/system/log" REST API) every time a test is started or ends.

The messages look like this:

    INFO  2025-04-22 15:10:44,625 [shard 1:strm] api - /system/log:
    test/cqlpy: Starting test_lwt.py::test_lwt_missing_row_with_static
    ...
    INFO  2025-04-22 15:10:44,631 [shard 0:strm] api - /system/log:
    test/cqlpy: Ended test_lwt.py::test_lwt_missing_row_with_static

We already had a similar feature in test/alternator, added three years
ago in commit b0371b6bf8. The implementation
is similar but not identical due to different available utility functions,
and in any case it's very simple.

While at it, this patch also fixes has_rest_api() to time out after
one second. Without this, if the REST API is blocked in a way that
a connection attempt just hangs, the tests can hang. With the new
timeout, the test will hang for a second, realize the REST API is
not available, and remember this decision (the next tests will not
wait one second again). We had the same bug in Alternator, and fixed
it in 758f8f01d7. This one second "pause"
will only happen if the REST API port is blocked - in the more typical
case the REST API port is just not listening but not blocked, and the
failure will be noticed immediately and won't wait a whole second.
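
The "remember this decision" behavior can be sketched as a probe that runs at most once. This is not the actual `has_rest_api()` implementation; `make_cached_probe` and `slow_probe` are hypothetical names, and the probe is a stand-in for a connection attempt that gives up after a timeout.

```python
def make_cached_probe(probe):
    """Wrap probe() so it runs at most once and its result is remembered."""
    result = None

    def cached():
        nonlocal result
        if result is None:
            result = bool(probe())
        return result

    return cached


calls = []


def slow_probe():
    # Stand-in for a REST API connection attempt that times out.
    calls.append(1)
    return False


has_rest_api = make_cached_probe(slow_probe)
first, second = has_rest_api(), has_rest_api()
```

Only the first call pays the cost of the probe; later calls reuse the stored answer.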

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23857
2025-04-23 12:04:14 +03:00
Piotr Dulikowski
3d73c79a72 test: mv: skip test_view_building_scheduling_group in debug
The test populates a table with 50k rows, creates a view on that table
and then compares the time spent in streaming vs. gossip scheduling
groups. It only takes 10s in dev mode on my machine, but is much slower
in debug mode in CI - building the view doesn't finish within 2 minutes.

The bigger the view to build, the more accurate the measurement;
moreover, the test scenario isn't interesting enough to be worth running
it in debug mode as this should be covered by other tests. Therefore,
just skip this test in debug mode.

Fixes: scylladb/scylladb#23862

Closes scylladb/scylladb#23866
2025-04-23 11:29:35 +03:00
Pavel Emelyanov
a6ba535c3c Merge 'test.py: refactoring before boost pytest integration' from Andrei Chekun
This PR contains changes that do not add new functionality and consist of small refactorings of the existing code.
The most significant change though is switching the SQLite writer from a singleton to a thread locking mechanism that will be needed later on.

This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer [request](https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278).

Closes scylladb/scylladb#23867

* github.com:scylladb/scylladb:
  test.py: move the readme file for LDAP tests to the correct location
  test.py: eliminate deprecation warning for xml.etree.ElementTree.Element
  test.py: align the behavior of max-failures parameter with pytest maxfail
  test.py: fix typo in toxiproxy name parameter
  test.py: add locking to the sqlite writer for resource gather
  test.py: add sqlite datetime adapter for resource gather
  test.py: change the parameter for get_modes_to_run()
2025-04-23 11:10:56 +03:00
Andrzej Jackowski
3c69340b8c test: add test_long_query_timeout_erm
This commit adds a test to verify that a query with long timeout
doesn't block ERM on failure. The motivation for the test is
fixing scylladb#21831.

This commit:
 - add test_long_query_timeout_erm
2025-04-23 09:29:47 +02:00
Andrzej Jackowski
1f1e4f09cd test: add get_cql_exclusive to manager_client.py
This commit adds to ManagerClient a get_cql_exclusive function that
allows creating a cql connection with WhiteListRoundRobinPolicy for
a single server. Such a connection is useful in tests that kill nodes to
make sure that the live node handles the queries. Before this commit,
some tests used cluster_con from test/cluster/conftest.py, and after
this commit tests can start to use a method from ManagerClient.

This change:
 - Extend ManagerClient con_gen type to allow LoadBalancingPolicy arg
 - Implement get_cql_exclusive()
2025-04-23 09:29:47 +02:00
Andrzej Jackowski
9d53063a7e mapreduce: catch local read_failure_exception_with_timeout
Mapreduce Service exception handling differs for local and remote RPC
calls of dispatch_to_shards. Whereas local exceptions are handled
normally, the remote exceptions are converted to rpc::remote_verb_error
by the framework. This is a substantial difference when
read_failure_exception_with_timeout is thrown during mapreduce query
execution - CQL server waits for the exception from the local call but
not from the remote one.

As we don't want to wait for the timeout in CQL server in either of
the cases, this commit catches the local exception (especially
read_failure_exception_with_timeout) and converts it to
std::runtime_error (the one from which rpc::remote_verb_error inherits).

Ideally, Mapreduce Service should execute dispatch_to_shards through RPC
for both local and remote calls. However, such change negatively affects
tens of Unit Tests that rely on the possibility to run local mapreduce
service without any RPC.

This change:
 - Catch local exceptions in Mapreduce Service and convert them
   to std::runtime_error.
2025-04-23 09:29:47 +02:00
Andrzej Jackowski
1fca994c7b transport: storage_proxy: release ERM when waiting for query timeout
Before this change, if a read executor had just enough targets to
achieve query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in the meantime. Such behavior doesn't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.

This change implements a mechanism to throw a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by CQL server which conducts the waiting, after
ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as mapreduce
service) should handle the exception just as a regular read_failure.
However, when CQL server catches the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).

This change:
 - Add new read_failure_exception_with_timeout exception
 - Add throw of read_failure_exception_with_timeout in storage_proxy
 - Add abort_source to CQL server, as well as to_stop() method for
   the correct abort handling
 - Add sleep in CQL server when the new exception is caught

Refs #21831
2025-04-23 09:29:47 +02:00
Andrzej Jackowski
9b1f062827 transport: remove redundant references in process_request_one
The references were added and used in previous commits to
limit the number of line changes for the reviewer's convenience.

This commit removes the redundant references to make the code
more clear and concise.
2025-04-23 09:29:47 +02:00
Andrzej Jackowski
9c0f369cf8 transport: fix the indentation in process_request_one
Fix the indentation after the previous commit that intentionally had
a wrong indent to limit the number of changed lines
2025-04-23 09:29:47 +02:00
Andrzej Jackowski
8a7454cf3e transport: add futures in CQL server exception handling
Prepare for the next commit that will introduce a
seastar::sleep in handling of selected exception.

This commit:
 - Rewrite cql_server::connection::process_request_one to use
   seastar::futurize_invoke and try_catch<> instead of
   utils::result_try.
 - The indentation is intentionally incorrect to reduce the
   number of changed lines. Next commits fix it.
2025-04-23 09:29:05 +02:00
Andrei Chekun
57b66e6b2e test.py: move the readme file for LDAP tests to the correct location
The README file was created in an incorrect location; it is now moved to the
directory with the source files, where it was intended to be.
2025-04-22 19:03:28 +02:00
Andrei Chekun
cf4747c151 test.py: eliminate deprecation warning for xml.etree.ElementTree.Element
Testing the truth value of an Element emits a DeprecationWarning. The check is now done correctly.
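
A minimal sketch of the explicit checks that avoid the warning is shown below; the XML snippet and variable names are made up for illustration and are not taken from test.py.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<testsuite><testcase name='t1'/></testsuite>")

# find() returns None when nothing matches, so compare against None
# explicitly instead of relying on the element's truth value.
testcase = root.find("testcase")
missing = root.find("failure")

found_testcase = testcase is not None
found_failure = missing is not None

# To ask whether an element has children, use len() explicitly.
testcase_has_children = len(testcase) > 0
```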
2025-04-22 19:03:21 +02:00
Andrei Chekun
bc49cd5214 test.py: align the behavior of max-failures parameter with pytest maxfail
This allows transferring the existing max-failures values to
pytest without any modification. As a downside, test.py's logic for handling
this parameter changes slightly.
2025-04-22 19:03:08 +02:00
Andrei Chekun
5c3501e4bf test.py: fix typo in toxiproxy name parameter
Fix a typo in the toxiproxy name parameter. No functional changes, just
a cosmetic fix.
2025-04-22 19:02:12 +02:00
Andrei Chekun
2c37a793d1 test.py: add locking to the sqlite writer for resource gather
SQLite locks the DB during writes, so it's not possible to write from
several threads at once. To be able to gather metrics in several threads, we need a
locking mechanism for threads during writes, so that a thread will not try to
write metrics while another thread is performing writes.
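
One way to serialize writes is sketched below, assuming a single shared connection guarded by a `threading.Lock`. The class name `LockedWriter` and the table layout are hypothetical; the real writer in resource_gather.py may be structured differently.

```python
import sqlite3
import threading


class LockedWriter:
    """Serialize writes to one SQLite connection shared between threads."""

    def __init__(self, path):
        # check_same_thread=False lets other threads use this connection;
        # the lock below is then responsible for serializing access.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS metrics (name TEXT, value REAL)"
            )
            self._conn.commit()

    def write(self, name, value):
        with self._lock:  # only one thread writes at a time
            self._conn.execute("INSERT INTO metrics VALUES (?, ?)", (name, value))
            self._conn.commit()

    def count(self):
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]


writer = LockedWriter(":memory:")
threads = [
    threading.Thread(target=writer.write, args=(f"m{i}", float(i)))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```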
2025-04-22 19:01:30 +02:00
Andrei Chekun
800710dc2c test.py: add sqlite datetime adapter for resource gather
Add an sqlite datetime adapter for resource gather, since the default adapters are deprecated as of Python 3.12.
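
Registering an explicit adapter could look like the sketch below, which stores datetimes as ISO 8601 text. The function name and the table are illustrative assumptions, not the code added by this commit.

```python
import sqlite3
from datetime import datetime


def adapt_datetime_iso(value):
    """Store datetimes as ISO 8601 text."""
    return value.isoformat()


# An explicitly registered adapter replaces reliance on the deprecated default.
sqlite3.register_adapter(datetime, adapt_datetime_iso)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (started TEXT)")
started = datetime(2025, 4, 22, 18, 59, 49)
conn.execute("INSERT INTO runs VALUES (?)", (started,))
stored = conn.execute("SELECT started FROM runs").fetchone()[0]
```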
2025-04-22 18:59:49 +02:00
Andrei Chekun
bf2a9e267e test.py: change the parameter for get_modes_to_run()
Change the parameter for get_modes_to_run() from session to config to
narrow the scope, and prepare it for later use in methods that do not have
access to the session but do have access to the config object.
2025-04-22 18:58:33 +02:00
Kefu Chai
7254c0c515 db/config.cc: correct a typo in option's description
s/incomming/incoming/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23826
2025-04-22 16:55:04 +03:00
Pavel Emelyanov
65efd2b2f6 Merge 'Refactor and enhance s3_tests' from Ernest Zaslavsky
This PR introduces a cleanup mechanism in s3_tests to remove uploaded objects after the test completes, ensuring a clean testing environment. Additionally, the recently added test has been refactored and split into smaller, more maintainable parts, improving readability and extending its coverage to include the "proxied" case.

As these changes primarily improve code aesthetics and maintainability, backporting is not necessary.

Refs: https://github.com/scylladb/scylladb/issues/23830

Closes scylladb/scylladb#23828

* github.com:scylladb/scylladb:
  s3_tests: Improve and extend copy object test coverage
  s3_tests: Implement post-test cleanup for uploaded objects
2025-04-22 16:40:37 +03:00
Nadav Har'El
5fd2eabd48 Merge 'Generalize the diversity of parse_table_infos() callers in API' from Pavel Emelyanov
The helper in question is used in several different ways -- by handlers directly (most of the callers), as a part of wrap_ks_cf() helper and by one of its overloads that unpack the "cf" query parameter from request. This PR generalizes most of the described callers, thus reducing the number of different-looking ways API handlers parse "keyspace" and "cf" request parameters.

Continuation of #22742

Closes scylladb/scylladb#23368

* github.com:scylladb/scylladb:
  api: Squash two parse_table_infos into one
  api: Generalize keyspaces:tables parsing a little bit more
  api: Provide general pair<keyspace, vector<table>> parsing
  api: Remove ks_cf_func and related code
2025-04-22 15:40:06 +03:00
Nadav Har'El
8d1a413357 test/scylla_gdb: better error message when running on dev build mode
The test/scylla_gdb suite needs Scylla to have been built with debug
symbols - which is NOT the case for the dev build. So the script
test/scylla_gdb/run attempts to recognize when a developer runs it
on an executable with the debug symbols missing - and prints a clear error.

Unfortunately, as we noticed in #10863, and again in #23832, because
wasmtime is compiled with debug symbols and linked with Scylla,
build/dev/scylla "pretends" to have debug symbols, foiling the check
in test/scylla_gdb/run. Reviewers rejected two solutions to this problem
(pull requests #10865 and #10923), so in pull request #10937 I added
a cosmetic solution just for test/scylla_gdb: in test/scylla_gdb/conftest.py
we check that there are **really** debug symbols that interest us,
and if not, exit immediately instead of failing each test separately.

For some reason, the sys.exit() we used is no longer effective - it
no longer exits pytest, so in this patch we use pytest.exit() instead.

Fixes #23832 (sort of, we leave build/dev/scylla with the fake claim
that it has debug symbols, but test/scylla_gdb will handle this
situation more gracefully).

Closes scylladb/scylladb#23834
2025-04-22 15:02:06 +03:00
Michael Litvak
5c1d24f983 test: test_mv_topology_change: increase timeout for remove_node
The test `test_mv_write_to_dead_node` currently uses a timeout of 60
seconds for remove_node, after it was increased from 30 seconds to fix
scylladb/scylladb#22953. Apparently it is still too low, and it was
observed to fail in debug mode.

Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000
seconds, but the test requires a timeout which is shorter than 5
minutes, because it is a regression test for an issue where MV updates
hold topology changes for more than 5 minutes, and we want to verify in
the test that the topology change completes in less than 5 minutes.

To resolve the issue, we set the test to skip in debug mode, because the
remove node operation is unpredictably slow, and we increase the timeout
to 180 seconds which is hopefully enough time for remove_node in
non-debug modes, and still sufficient to satisfy the test requirements.

Fixes scylladb/scylladb#22530

Closes scylladb/scylladb#23833
2025-04-22 10:51:19 +02:00
Kefu Chai
a2b46cbf45 sstables_loader: fix the indent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-04-22 12:05:55 +08:00
Kefu Chai
6b3ecad467 sstables_loader: fix the racing between get_progress() and release_resources()
This change addresses a critical race condition in the sstables_loader
where `get_progress()` could access invalid `progress_holder` instances
after `release_resources()` destroyed them.

Problem:
- Progress tracking uses two components: `_progress_state` (tracks state)
  and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then
  accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted
  after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the
  `progress_holder` instances in `_progress_per_shard`, causing
  `get_progress()` to access invalid memory.

Solution:
- Implemented shared/exclusive locking to protect access to both state
  and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent
  progress reporting
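
The locking scheme above can be illustrated with a small Python `asyncio` analogy. This is not the Seastar/C++ code from the patch: the `SharedExclusiveLock` and `Loader` classes, the per-shard numbers and the `demo()` helper are all hypothetical, and `asyncio.sleep(0)` merely stands in for a coroutine preemption point.

```python
import asyncio


class SharedExclusiveLock:
    """Hypothetical lock: many shared holders or one exclusive holder."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._shared = 0
        self._exclusive = False

    async def acquire_shared(self):
        async with self._cond:
            await self._cond.wait_for(lambda: not self._exclusive)
            self._shared += 1

    async def release_shared(self):
        async with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    async def acquire_exclusive(self):
        async with self._cond:
            await self._cond.wait_for(
                lambda: not self._exclusive and self._shared == 0
            )
            self._exclusive = True

    async def release_exclusive(self):
        async with self._cond:
            self._exclusive = False
            self._cond.notify_all()


class Loader:
    """Stand-in for a loader with per-shard progress data."""

    def __init__(self):
        self._lock = SharedExclusiveLock()
        self._per_shard = [10, 20, 30]

    async def get_progress(self):
        await self._lock.acquire_shared()
        try:
            if self._per_shard is None:
                return None
            await asyncio.sleep(0)  # preemption point after the state check
            return sum(self._per_shard)  # still valid: no exclusive holder
        finally:
            await self._lock.release_shared()

    async def release_resources(self):
        await self._lock.acquire_exclusive()
        try:
            self._per_shard = None  # waits until all readers are done
        finally:
            await self._lock.release_exclusive()


async def demo():
    loader = Loader()
    first, _, second = await asyncio.gather(
        loader.get_progress(),
        loader.release_resources(),
        loader.get_progress(),
    )
    return first, second
```

Without the lock, `release_resources()` could run during the `sleep(0)` and the subsequent `sum()` would operate on released data; with it, a reader either sees the full data or sees that nothing is left.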

Fixes #23801

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-04-22 12:05:54 +08:00
Ernest Zaslavsky
edaa3f4bdd s3_tests: Improve and extend copy object test coverage
Refactored the copy object test to enhance readability and maintainability.
The test was simplified and split into smaller, more focused parts.
Additionally, a "proxied" variant of the test was introduced to expand
coverage.
2025-04-21 20:54:14 +03:00
Ernest Zaslavsky
252a0a14af s3_tests: Implement post-test cleanup for uploaded objects
Ensure cleanup after tests by deleting objects uploaded to MinIO.
This improves resource management and maintains a clean test environment.
2025-04-21 20:54:14 +03:00
Avi Kivity
2dcd2b21ae Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631

Backport to 2025.1 because load imbalance is a serious problem in production.

Closes scylladb/scylladb#23708

* github.com:scylladb/scylladb:
  tablets: Equalize per-table balance when allocating tablets for a new table
  load_sketch: Tolerate missing tablet_map when selecting for a given table
  tests: tablets: Simplify tests by moving common code to topology_builder
2025-04-21 17:06:30 +03:00
Pavel Emelyanov
eb5b52f598 Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
Fixes: scylladb/scylladb#22869

Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga

Closes scylladb/scylladb#23800

* github.com:scylladb/scylladb:
  doc: changing topology when changing snitches is no longer supported
  test: cluster: introduce test_no_dc_rack_change
  storage_service: don't update DC/rack in update_topology_with_local_metadata
  main: make dc and rack immutable after bootstrap
  test: cluster: remove test_snitch_change
2025-04-21 15:52:55 +03:00
Yaniv Michael Kaul
b374f94b15 pip installation: use --no-cache-dir
There are three reasons we may want NOT to use caching of pip deps:
1. When building a container, unless we specifically clean it up, it'll remain, even when we squash the image layers later.
2. When building a container, that cache is not useful, as we squash our containers later (so that layer is not cached really). And our CI cleans up the layers repo anyway.
3. Caching sometimes isn't great, and doesn't ensure we pick up the exact version (or latest) that we wish to...

This PR changes two locations in Scylla, both of which (also) build containers, so certainly relevant for 1, 2 above and possibly 3.
No real need to backport.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#23822
2025-04-21 13:46:57 +03:00
Avi Kivity
0ba3ce1741 test: gdb: avoid using file(1) to determine if debug information is present
The scylla_gdb tests verify, as a sanity check, that the executable
was built with debug information. They do so via file(1).

In Fedora 42, file(1) crashes on ELF files that have interpreter pathnames
larger than 128 characters[1]. This was later fixed[2], but the fix is not
in any release.

Work around the problem by using objdump instead of file.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2354970
[2] b3384a1fbf

Closes scylladb/scylladb#23823
2025-04-21 13:29:27 +03:00
Andrei Chekun
441cee8d9c test.py: fix gathering logs in case of fail
Currently log files have information about run_id twice:
cluster.object_store_test_backup.10.test_abort_restore_with_rpc_error.dev.10_cluster.log
However, sometimes the first run_id can be incorrect:
cluster.object_store_test_backup.1.test_abort_restore_with_rpc_error.dev.10_cluster.log
Remove the first run_id from the name to avoid this issue, and because
it's actually redundant.
Remove the creation of an empty file for the scylla manager log, since it is redundant
and was based on an incorrect assumption about the root cause of the failure.
Add an extension to the stacktrace file, so it will be opened in the
browser in Jenkins in a new tab instead of being downloaded.

Fixes: https://github.com/scylladb/scylladb/issues/23731

Closes scylladb/scylladb#23797
2025-04-21 13:12:35 +03:00
Pavel Emelyanov
09caad6147 test: Remove sstable_assertions::get_stats_metadata()
It mirrors the sstable method of the same name, which is public. With ->
operator, it's just as convenient to call it directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-04-18 18:53:41 +03:00
Pavel Emelyanov
294e56207d test: Add sstable_assertions::operator->()
... and replace get_sstable() with it. It's more natural (despite having
only one user) to consider the class to be yet another "pointer" to an
sstable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-04-18 18:52:39 +03:00
Sergey Zolotukhin
2314feeae2 test: Ignore DEBUG,TRACE,INFO level messages when checking for failed mutations.
Update the regular expression in `check_node_log_for_failed_mutations` to avoid
false test failures when DEBUG-level logging is enabled.
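
A sketch of what such a filter could look like. The pattern, the message text "Failed to apply mutation" and the sample log lines are assumptions made for illustration; they are not the actual regular expression or log format used by `check_node_log_for_failed_mutations`.

```python
import re

# Hypothetical pattern: report a line only if it mentions a failed mutation
# AND its leading log level is not DEBUG, TRACE or INFO.
FAILED_MUTATION = re.compile(
    r"^(?!DEBUG|TRACE|INFO)\w+\s.*Failed to apply mutation"
)


def failed_mutation_lines(lines):
    """Return the log lines that should count as failed mutations."""
    return [line for line in lines if FAILED_MUTATION.search(line)]


# Made-up log lines in an assumed "<LEVEL> <rest>" layout.
SAMPLE_LOG = [
    "WARN  [shard 0] storage_proxy - Failed to apply mutation from 127.0.0.1",
    "DEBUG [shard 0] storage_proxy - Failed to apply mutation from 127.0.0.2",
    "INFO  [shard 0] compaction - finished",
]
```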

Fixes scylladb/scylladb#23688

Closes scylladb/scylladb#23658
2025-04-18 16:17:41 +03:00
Calle Wilund
4a44651fce encryption_at_rest_test: Make fake_proxy read/write loop noexcept
Fixes #23774

The test code falls into the same when_all issue as the http client did.
Avoid passing exceptions through this, and instead catch and
report in worker lambda.

Closes scylladb/scylladb#23778
2025-04-18 16:17:41 +03:00
Pavel Emelyanov
324daac156 Merge 'Add CopyObject API implementation to S3 client' from Ernest Zaslavsky
Implement the CopyObject API to directly copy an S3 object from one location to another. This implementation incurs no networking overhead on the client side, since the object is copied internally by the S3 machinery.

Usage example: Backup of tiered SSTables - you already have SSTables on S3, CopyObject is the ideal way to go

No need to backport since we are adding new functionality for a future use

Closes scylladb/scylladb#23779

* github.com:scylladb/scylladb:
  s3_client: implement S3 copy object
  s3_client: improve exception message
  s3_client: reposition local function for future use
2025-04-18 16:17:41 +03:00
Pavel Emelyanov
cc919b08c2 Merge 'backup: Optimize S3 throughput with shard-based upload' from Ernest Zaslavsky
This PR enhances S3 throughput by leveraging every available shard to upload backup files concurrently. By distributing the load across multiple shards, we significantly improve the upload performance. Each shard retrieves an SSTable and processes its files sequentially, ensuring efficient, file-by-file uploads.

To prevent uncontrolled fiber creation and potential resource exhaustion, the backup task employs a directory semaphore from the sstables_manager. This mechanism helps regulate concurrency at the directory level, ensuring stable and predictable performance during large-scale backup operations.

Refs #22460
fixes: #22520

```
===========================================
 Release build, master, smp-16, mem-32GiB
 Bytes: 2342880184, backup time: 9.51 s
===========================================
 Release build, this PR, smp-16, mem-32GiB
 Bytes: 2342891015, backup time: 1.23 s
===========================================
```
Looks like it is at least 7.7x faster

No backport needed since it (native backup) is still unused functionality

Closes scylladb/scylladb#23727

* github.com:scylladb/scylladb:
  backup: Add test for invalid endpoint
  backup_task: upload on all shards
  backup_task: integrate sharded storage manager for upload
2025-04-18 16:17:41 +03:00
Avi Kivity
6b415cfd4b Merge 'managed_bytes: in the copy constructor, respect the target preferred allocation size' from Michał Chojnowski
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).

In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.

2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

This is a regression fix, should be backported to all affected releases.

Closes scylladb/scylladb#23782

* github.com:scylladb/scylladb:
  managed_bytes_test: add a reproducer for #23781
  managed_bytes: in the copy constructor, respect the target preferred allocation size
2025-04-17 21:14:10 +03:00
Pavel Emelyanov
ca2cc5e826 Merge 'test/cluster/test_read_repair: make incremental test work with tablets' from Botond Dénes
There are two tests which test incremental read repair: one with row tombstones, the other with partition tombstones. The tests currently force vnodes, by creating the test keyspace with {'enabled': false}. Even so, the tests were found to be flaky so one of them is marked for skip. This commit does the following changes:
* Make the tests use tablets by creating the test keyspace with tablets.
* Change the way the tests write data so it works with tablets: currently the tests use scylla-sstable write + upload but this won't work with tablets since upload with tablets implies --load-and-stream which means data is streamed to all replicas (no difference created between nodes). Switch to the classic stop-node + write to other replica with CL=ONE.
* Remove the skip added to the partition-tombstone test variant.

Fixes: #21179

Test improvement, no backport required.

Closes scylladb/scylladb#23167

* github.com:scylladb/scylladb:
  wip
  test/cluster/test_read_repair: make incremental test work with tablets
2025-04-17 18:54:00 +03:00
Piotr Dulikowski
325a89638c doc: changing topology when changing snitches is no longer supported
Update the "How to Switch Snitches" document to indicate that changing
topology (i.e. changing node's DC or rack) while changing the snitch is
no longer supported.

Remove a note which said that switching snitches is not supported with
tablets. It was introduced because of the concern that switching a
snitch might change DC or rack of the node, for which our current tablet
load balancer is completely unprepared. Now that changing DC/rack is
forbidden, there doesn't seem to be anything related to snitches which
could cause trouble for tablets.
2025-04-17 16:22:58 +02:00
Piotr Dulikowski
796c8d1601 test: cluster: introduce test_no_dc_rack_change
The test makes sure that changing the DC or rack in the snitch's
configuration fails with an expected error.
2025-04-17 16:22:58 +02:00
Piotr Dulikowski
1791ae3581 storage_service: don't update DC/rack in update_topology_with_local_metadata
The DC/rack are now immutable and cannot be changed after restart, so
there is no need to update the node's system.topology entry with this
information on restart.
2025-04-17 16:22:58 +02:00
Piotr Dulikowski
ce2fab7cce main: make dc and rack immutable after bootstrap
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
2025-04-17 16:22:26 +02:00
Tomasz Grabiec
1e407ab4d2 tablets: Equalize per-table balance when allocating tablets for a new table
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631
2025-04-17 16:01:23 +02:00
Tomasz Grabiec
2597a7e980 load_sketch: Tolerate missing tablet_map when selecting for a given table
To simplify future usage in
network_topology_strategy::add_tablets_in_dc() which invokes
populate() for a given table, which may be both new and preexisting.
2025-04-17 16:01:16 +02:00
Ernest Zaslavsky
b79ca5a1aa backup: Add test for invalid endpoint
* During the development phase, the backup functionality broke because we lacked a test that runs backup with an invalid endpoint. This commit adds a test to cover that scenario.
* Add a check that the expected error is propagated from a failing/aborted backup
2025-04-17 16:31:43 +03:00
Benny Halevy
b7212620f9 backup_task: upload on all shards
Use all shards to upload snapshot files to S3,
by using the sharded sstables_manager_for_table
infrastructure.

Refs #22460

Quick perf comparison
===========================================
 Release build, master, smp-16, mem-32GiB
 Bytes: 2342880184, backup time: 9.51 s
===========================================
 Release build, this PR, smp-16, mem-32GiB
 Bytes: 2342891015, backup time: 1.23 s
===========================================

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
2025-04-17 16:31:42 +03:00
Piotr Dulikowski
dd2e507ece test: cluster: remove test_snitch_change
This test checked that it is possible to change DC/rack of a node during
restart. This will become explicitly forbidden, so remove the test.
2025-04-17 13:51:22 +02:00
Aleksandra Martyniuk
e178bd7847 test: add test for getting tasks children
Add test that checks whether the children of a virtual task will be
properly gathered if a node is down.
2025-04-17 13:48:44 +02:00
Aleksandra Martyniuk
53e0f79947 tasks: check whether a node is alive before rpc
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.
2025-04-17 12:51:22 +02:00
Michał Chojnowski
6c1889f65c managed_bytes_test: add a reproducer for #23781
2025-04-17 12:51:01 +02:00
Botond Dénes
8ac7c54d8b Merge 'topology_coordinator: stop: await all background_action_holder:s' from Benny Halevy
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.

Fixes #23755

* The issue exists since 6.2

Closes scylladb/scylladb#17712

* github.com:scylladb/scylladb:
  topology_coordinator: stop: await all background_action_holder:s
  topology_coordinator: stop: improve error messages
  topology_coordinator: stop: define stop_background_action helper
2025-04-17 12:10:29 +03:00
Kefu Chai
b0cbe86780 s3/client: define a constant for security credential resource
Instead of repeating it, let's define a constant and reuse it.
Less repetition this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23713
2025-04-17 11:51:15 +03:00
Kefu Chai
a33651b03e db, service: do not include unused header
these unused headers were flagged by clang-include-cleaner.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23735
2025-04-17 11:49:59 +03:00
Botond Dénes
33e383c557 scripts/pull_github_pr.sh: add argument parsing
Instead of hardcoding PR_NUM=$1 and FORCE=$2. This current setup is
not very flexible and one gets no feedback if the arguments are
incorrect or not recognized.

Add proper position-independent argument parsing using a classic while
case loop.

Closes scylladb/scylladb#23623
2025-04-17 11:49:15 +03:00
Nadav Har'El
84d4af1f0e Merge 'Alternator batch rcu' from Amnon Heiman
This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator.
It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes.

Needs backporting to 2025.1, as RCU and WCU are not fully supported

Fixes #23690

Closes scylladb/scylladb#23691

* github.com:scylladb/scylladb:
  test_returnconsumedcapacity.py: test RCU for batch get item
  alternator/executor: Add RCU support for batch get items
  alternator/consumed_capacity: make functionality public
2025-04-17 10:08:16 +03:00
Botond Dénes
22a28ca1db wip
2025-04-17 03:01:17 -04:00
Ernest Zaslavsky
a369dda049 s3_client: implement S3 copy object
Add support for the CopyObject API to enable direct copying of S3
objects between locations. This approach eliminates networking
overhead on the client side, as the operation is handled internally
by S3.
2025-04-17 09:47:47 +03:00
Botond Dénes
19b4f10598 test/cluster/test_read_repair: make incremental test work with tablets
There are two tests which test incremental read repair: one with row
tombstones, the other with partition tombstones. The tests currently force vnodes, by
creating the test keyspace with {'enabled': false}. Even so, the tests
were found to be flaky so one of them is marked for skip.
This commit does the following changes:
* Make the tests use tablets by creating the test keyspace with tablets.
* Change the way the tests write data so it works with tablets:
  currently the tests use scylla-sstable write + upload but this won't
  work with tablets since upload with tablets implies --load-and-stream
  which means data is streamed to all replicas (no difference created
  between nodes). Switch to the classic stop-node + write to other
  replica with CL=ONE.
* Remove the skip added to the partition-tombstone test variant.

Also add tracing to the read-repair query, to make debugging the test
easier if it fails.

Fixes: #21179
2025-04-17 02:01:17 -04:00
Michał Chojnowski
4e2f62143b managed_bytes: in the copy constructor, respect the target preferred allocation size
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).

In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest
   power of 2.

2. With a pathological-enough allocation pattern
   (for example, one which somehow ends up placing a single 16 kiB
   memtable-owned allocation in every aligned 128 kiB span),
   memtable flushing could theoretically deadlock,
   because the allocator might be too fragmented to let the memtable
   grow by another 128 kiB segment, while keeping the sum of all
   allocations small enough to avoid triggering a flush.
   (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious
   allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether
   reclamation succeeded by checking that the number of free LSA
   segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong
   because the reclaim might have met its target by evicting the non-LSA
   allocations, in which case memory is returned directly to the
   standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing`
   to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
2025-04-16 22:06:06 +02:00
Nadav Har'El
6db666a1c1 replica: fix 10-second pause during shutdown
As noticed in issue #23687, if we shut down Scylla while a paged read is
in progress - or even a paged read that the client had no intention of
ever resuming - the shutdown pauses for 10 seconds.

The problem was the stop() order - we must stop the "querier cache"
before we can close sstables - the "querier cache" is what holds paged
readers alive waiting for clients to resume those reads, and while a
reader is alive it holds on to sstables so they can't be closed. The
querier cache's querier_cache::default_entry_ttl is set to 10 seconds,
which is why the shutdown was un-paused after 10 seconds.

The fix in this patch is obvious: We need to stop the querier cache
(and have it release all the readers it was holding) before we close
the sstables.

Fixes #23687

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23770
2025-04-16 20:35:44 +03:00
Avi Kivity
0206da5232 Merge 'readers: strip "flat" and "v2" from names' from Botond Dénes
Continue the effort of normalizing reader names, stripping legacy qualifying terms like "flat" and "v2".
Flat and v2 readers are the default now, we only need to add qualifying terms to readers which are different than the normal.
One such reader remains: `make_generating_reader_v1()`.

This PR contains mostly mechanical changes, done with a sed script. Commits which only contain such mechanical renames are marked as such in the commitlog.

Code cleanup, no backport needed.

Closes scylladb/scylladb#23767

* github.com:scylladb/scylladb:
  readers: mv reversing_v2.hh reversing.hh
  readers: mv generating_v2.hh generating.hh
  tree: s/make_generating_reader_v2/make_generating_reader/
  readers: mv from_mutations_v2.hh from_mutations.hh
  tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s
  readers: mv from_fragments_v2.hh from_fragments.hh
  readers: mv forwardable_v2.hh forwardable.hh
  readers: mv empty_v2.hh empty.hh
  tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/
  readers/empty_v2.hh: replace forward declarations with include of fwd header
  readers/mutation_reader_fwd.hh: forward declare reader_permit
  readers: mv delegating_v2.hh delegating.hh
  readers/delegating_v2.hh: move reader definition to _impl.hh file
2025-04-16 20:21:51 +03:00
Ernest Zaslavsky
8929cb324e s3_client: improve exception message
Clarify that the multipart upload was aborted due to a failure in
parsing ETags.
2025-04-16 18:58:22 +03:00
Ernest Zaslavsky
993953016f s3_client: reposition local function for future use
The local function has been relocated higher in the code
to prepare for its usage in upcoming implementations.
2025-04-16 18:46:31 +03:00
Ernest Zaslavsky
428f673ca2 backup_task: integrate sharded storage manager for upload
Introduce the sharded storage manager and use it to instantiate upload
clients. Full functionality will be implemented in subsequent changes.
2025-04-16 18:18:58 +03:00
Amnon Heiman
3acde5f904 test_returnconsumedcapacity.py: test RCU for batch get item
This patch adds tests for consumed capacity in batch get item.  It tests
both the simple case and the multi-item, multi-table case that combines
consistent and non-consistent reads.
2025-04-16 17:05:32 +03:00
Pavel Emelyanov
8b2cababb6 generic_server: Don't mess with db::config
The db::config is top-level configuration of scylla, we generally try to
avoid using it even in scylla components: each uses its own config
initialized by the service creator out of the db::config itself. The
generic_server is not an exception, all the more so, it already has its
own config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23705
2025-04-16 17:02:30 +03:00
Amnon Heiman
88095919d0 alternator/executor: Add RCU support for batch get items
This patch adds RCU support for batch get items.  With batch requests,
multiple objects are read from multiple tables. While the criterion for
adding the units is per the batch request, the units are calculated per
table—and so is the read consistency.
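
A rough sketch of per-table accounting follows. It is an assumption,
not the Alternator implementation: the 4 KB unit and the half-price
non-consistent read follow DynamoDB's documented rules, but whether
sizes are rounded per item or per table total is not stated above, and
this sketch rounds the per-table total. `table_rcu` and `batch_rcu`
are hypothetical names.

```python
import math

RCU_BLOCK_BYTES = 4096  # DynamoDB's documented 4 KB read unit


def table_rcu(item_sizes, consistent_read):
    """Units for one table, sized on the table's total bytes (assumption)."""
    units = math.ceil(sum(item_sizes) / RCU_BLOCK_BYTES)
    # A non-consistent read costs half as much as a consistent one.
    return float(units) if consistent_read else units / 2


def batch_rcu(request):
    """Map each table name to its units; consistency is chosen per table."""
    return {
        table: table_rcu(spec["item_sizes"], spec["consistent_read"])
        for table, spec in request.items()
    }


request = {
    "orders": {"item_sizes": [1000, 5000], "consistent_read": True},
    "users": {"item_sizes": [300], "consistent_read": False},
}
per_table = batch_rcu(request)
```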
2025-04-16 16:53:22 +03:00
Amnon Heiman
0eabf8b388 alternator/consumed_capacity: make functionality public
The consumed_capacity_counter is not completely applicable for batch
operations.  This patch makes some of its functionality public so that
batch get item can use the components to decide if it needs to send
consumed capacity in the reply, to get the half units used by the
metrics and returned result, and to allow an empty constructor for the
RCU counter.
2025-04-16 16:49:40 +03:00
Benny Halevy
7a0f5e0a54 topology_coordinator: stop: await all background_action_holder:s
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.
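
The pattern of awaiting every background action on stop and turning
failures into warnings can be sketched as follows. This is a Python
asyncio analogy, not the Seastar code; `stop_background_action` here
is a hypothetical helper and the action names are only examples.

```python
import asyncio


async def stop_background_action(name, task, warnings):
    """Await one background task; record a warning instead of raising."""
    if task is None:
        return  # the action was never started
    try:
        await task
    except Exception as exc:
        warnings.append(f"error during {name}: {exc}")


async def failing():
    raise RuntimeError("boom")


async def succeeding():
    return None


async def stop_all():
    warnings = []
    actions = {
        "cleanup": asyncio.create_task(succeeding()),
        "repair": asyncio.create_task(failing()),
        "rebuild_repair": None,
    }
    # Every action is awaited, so no failure is left unobserved.
    for name, task in actions.items():
        await stop_background_action(name, task, warnings)
    return warnings


collected = asyncio.run(stop_all())
```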

Fixes #23755

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-16 15:23:02 +03:00
Benny Halevy
6de79d0dd3 topology_coordinator: stop: improve error messages
"when cleanup" is ill-formed. Change "when XYZ"
to "during XYZ" instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-16 15:20:58 +03:00
Benny Halevy
d624795fda topology_coordinator: stop: define stop_background_action helper
Refactor the code to use a helper to await background_action_holder
and handle any errors by printing a warning.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-16 15:20:39 +03:00
Botond Dénes
6172ff501f readers: mv reversing_v2.hh reversing.hh
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
c8563b9604 readers: mv generating_v2.hh generating.hh
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
dfd7f03463 tree: s/make_generating_reader_v2/make_generating_reader/
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
c29c696780 readers: mv from_mutations_v2.hh from_mutations.hh
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
b104862702 tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s
Completely mechanical change.
2025-04-16 04:46:07 -04:00
Anna Stuchlik
0b4740f3d7 doc: add info about Scylla Doctor Automation to the docs
Fixes https://github.com/scylladb/scylladb/issues/23642

Closes scylladb/scylladb#23745
2025-04-16 11:44:35 +03:00
Botond Dénes
7547d0c6a9 readers: mv from_fragments_v2.hh from_fragments.hh
Completely mechanical change.
2025-04-16 04:35:00 -04:00
Botond Dénes
f1bd2553ed readers: mv forwardable_v2.hh forwardable.hh
Completely mechanical change.
2025-04-16 04:33:50 -04:00
Botond Dénes
a9d75c4f9d readers: mv empty_v2.hh empty.hh
Completely mechanical change.
2025-04-16 04:32:56 -04:00
Botond Dénes
05829f98f3 tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/
Completely mechanical change.
2025-04-16 04:32:56 -04:00
Botond Dénes
0e33f0d09e readers/empty_v2.hh: replace forward declarations with include of fwd header 2025-04-16 04:12:08 -04:00
Botond Dénes
d75936d989 readers/mutation_reader_fwd.hh: forward declare reader_permit
It is commonly used as parameter to reader factory methods.
2025-04-16 04:12:08 -04:00
Botond Dénes
7d9b91a00e readers: mv delegating_v2.hh delegating.hh
Completely mechanical change.
2025-04-16 04:11:55 -04:00
Botond Dénes
c7f68a2649 readers/delegating_v2.hh: move reader definition to _impl.hh file
The idea behind readers/ is that each reader has its minimal header with
just a factory method declaration. The delegating reader is defined in
the factory header because it has a derived class in row_cache_test.cc.
Move the definition to delegating_impl.hh so users not interested in
deriving from it don't pay the price in header include cost.
2025-04-16 03:47:57 -04:00
Pavel Emelyanov
70ac5828a8 Update seastar submodule
* seastar 099cf616...e44af9b0 (19):
  > Add assertion to `get_local_service`
  > http_client: Improve handling of server response parsing errors
  > util: include used header
  > core: Fix module linkage by using `inline constexpr` for shared constants
  > build: fix P2582R1 detection for GCC compiler compatibility
  > app-template: remove production warning
  > ioinfo: Extend printed data a bit more
  > reactor: Fix indentation after previous patch
  > reactor: Configure multiple mountpoints per disk
  > io_queue, resource, reactor: Rename dev_t -> unsigned
  > resource: Rename mountpoint to disk in resources
  > reactor: Keep queues as shared_ptr-s
  > io_queue: Drop device ID
  > io_intent: Use unsigned queue id as a key
  > io_queue: Keep unsigned queue id on an io_queue
  > file: Keep device_id on posix file impl
  > io_queue: Print mountpoint in latency goal bump message
  > io_intent: Rename qid to cid
  > reactor: Move engine()._num_io_groups assignment and check

Changes in io-queue call for scylla-gdb update as well -- now the
reactor map of device to io-queue uses seastar::shared_ptr, not
std::unique_ptr.

Closes scylladb/scylladb#23733
2025-04-16 09:44:37 +03:00
Botond Dénes
f5125ffa18 Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.

For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.
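
The effect of selecting the semaphore by scheduling group can be
sketched with a toy model. The names below (`ToySemaphore`, the
"statement" group, the capacities) are hypothetical and only
illustrate why a Raft operation admitted through the system semaphore
does not queue behind user operations.

```python
class ToySemaphore:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.in_use = 0

    def try_admit(self):
        """Admit one operation if capacity remains; otherwise it would queue."""
        if self.in_use < self.capacity:
            self.in_use += 1
            return True
        return False


user = ToySemaphore("user", capacity=2)
system = ToySemaphore("system", capacity=2)

# The semaphore is chosen from the scheduling group of the caller.
by_group = {"statement": user, "gossip": system}


def admit(scheduling_group):
    return by_group[scheduling_group].try_admit()


# Saturate the user semaphore with user operations.
admit("statement")
admit("statement")

raft_in_statement_group = admit("statement")  # would queue behind user work
raft_in_gossip_group = admit("gossip")        # admitted by the system semaphore
```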

Fixes scylladb/scylladb#21637

Backport: 6.2 and 6.1

Closes scylladb/scylladb#22779

* github.com:scylladb/scylladb:
  Ensure raft group0 RPCs use the gossip scheduling group
  Move RAFT operations verbs to GOSSIP group.
2025-04-16 09:11:29 +03:00
Lakshmipathi
42ed6a87bf test: Test truncate during topology change
Add a new node and, during the topology change, issue a truncate call.
Verify that all nodes hold no data after tablet migration.

Fixes: https://github.com/scylladb/scylla-dtest/issues/5317

Signed-off-by: Lakshmipathi Ganapathi <lakshmipathi.ganapathi@scylladb.com>

Closes scylladb/scylladb#22595
2025-04-16 09:10:22 +03:00
Tomasz Grabiec
001d3b2415 Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.

Unit test:

Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.

Fixes https://github.com/scylladb/scylladb/issues/20073.

Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.

Closes scylladb/scylladb#23751

* github.com:scylladb/scylladb:
  storage_service: add unit test for mid-decommission transit_tablet()
  storage_service: preserve state of busy topology when transiting tablet
2025-04-16 00:19:24 +02:00
Pavel Emelyanov
b79137eaa4 storage_service: Use this->_features directly
This dependency is already there, storage service doesn't need to go
rounds via database reference to get to the features.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23739
2025-04-15 21:11:12 +03:00
Tomasz Grabiec
d493a8d736 tests: tablets: Simplify tests by moving common code to topology_builder
Reduces code duplication.
2025-04-15 16:05:41 +02:00
Laszlo Ersek
841ca652a0 storage_service: add unit test for mid-decommission transit_tablet()
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2025-04-15 15:15:25 +02:00
Michał Chojnowski
b3d951517d test/scylla_gdb: generate a coredump when coro_task fails
This test fails sometimes, but rarely and unreliably.
We want to get a coredump from it the next time it fails.
Sending a SIGSEGV should induce that.

Refs https://github.com/scylladb/scylladb/issues/22501

Closes scylladb/scylladb#23256
2025-04-15 15:16:38 +03:00
Calle Wilund
abd2d8a58b test_tools: Manual merge of local key gen tool test from enterprise
Fixes scylladb/scylla-enterprise#5358

Transposed tool test for local file generator, originally java test.
Then enterprise test. Now here.

Closes scylladb/scylladb#23726
2025-04-15 15:14:08 +03:00
Laszlo Ersek
e1186f0ae6 storage_service: preserve state of busy topology when transiting tablet
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.
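
The restricted transition can be sketched as a small state function.
This is a Python illustration only; the `Transition` enum and
`state_after_transit_tablet` are hypothetical names, and modelling
"idle" as an explicit state is an assumption.

```python
from enum import Enum


class Transition(Enum):
    IDLE = "idle"
    TABLET_DRAINING = "tablet_draining"
    TABLET_MIGRATION = "tablet_migration"


def state_after_transit_tablet(current):
    """Only an idle topology moves to tablet_migration; busy states are kept."""
    if current is Transition.IDLE:
        return Transition.TABLET_MIGRATION
    return current


# tablet_draining must survive a manual tablet move.
kept = state_after_transit_tablet(Transition.TABLET_DRAINING)
```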

In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2025-04-15 13:44:45 +02:00
Piotr Dulikowski
22e3b8eccd Merge 'test/cqlpy: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek
In this PR, we adjust tests in the cqlpy test suite so they
only use RF-rack-valid keyspaces. After that, we enable
the configuration option `rf_rack_valid_keyspaces` in the
suite by default.

Refs scylladb/scylladb#23428

Backport: backporting to 2025.1 so we can test the option there too.

Closes scylladb/scylladb#23489

* github.com:scylladb/scylladb:
  test/cqlpy: Enable rf_rack_valid_keyspaces by default
  test: Move test_alter_tablet_keyspace_rf to cluster suite
  test/cqlpy: Adjust tests to RF-rack-valid keyspaces
  test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces
2025-04-15 12:43:11 +02:00
Avi Kivity
b4d4e48381 scylla-gdb: small-objects: fix for very small objects
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.

The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.

While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.
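
The idea of collecting every sister pool can be sketched like this.
It is not the scylla-gdb.py code: `Pool`, `pools_for_size` and the
pool layout are hypothetical, chosen to mirror the "4 pools for size
32" example above.

```python
from collections import namedtuple

Pool = namedtuple("Pool", ["index", "object_size"])


def pools_for_size(pools, size):
    """Return every pool whose object size is the smallest one fitting `size`."""
    fitting = [p.object_size for p in pools if p.object_size >= size]
    if not fitting:
        return []
    target = min(fitting)
    # Several pools may share the same object size; keep all of them.
    return [p for p in pools if p.object_size == target]


pools = [Pool(0, 32), Pool(1, 32), Pool(2, 32), Pool(3, 32), Pool(4, 48)]
matches = pools_for_size(pools, 29)
matched_indexes = [p.index for p in matches]
```

Iterating over all of `matches` covers objects that were allocated
from any of the same-sized pools, not only from the first one.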

Fixes #23603

Closes scylladb/scylladb#23604
2025-04-15 11:16:52 +03:00
Emil Maskovsky
3930ee8e3c raft: fix data center remaining nodes initialization
The `_remaining_nodes` attribute of the data center information was not
initialized correctly. The parameter was passed by value to the
initialization function instead of by reference or pointer.

As a result, `_remaining_nodes` was left initialized to zero, causing an
underflow when decrementing its value.
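
The failure mode can be illustrated with a Python analogy. Rebinding
a parameter inside a function does not change the caller's value,
which mirrors passing by value; the unsigned wrap-around is emulated
with a mask, assuming a 64-bit unsigned counter. All names are
hypothetical.

```python
UINT64_MASK = (1 << 64) - 1


class DatacenterInfo:
    def __init__(self):
        self.remaining_nodes = 0


def init_by_value(remaining_nodes, node_count):
    # Only the local name is rebound; the caller's attribute is untouched.
    remaining_nodes = node_count


def init_by_reference(dc, node_count):
    # The object itself is updated, so the caller sees the new value.
    dc.remaining_nodes = node_count


def decrement_u64(value):
    """Emulate decrementing an unsigned 64-bit counter."""
    return (value - 1) & UINT64_MASK


buggy = DatacenterInfo()
init_by_value(buggy.remaining_nodes, 3)
buggy_after_decrement = decrement_u64(buggy.remaining_nodes)

fixed = DatacenterInfo()
init_by_reference(fixed, 3)
fixed_after_decrement = decrement_u64(fixed.remaining_nodes)
```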

This bug did not significantly impact behavior because other safeguards,
such as capping the maximum voters per data center by the total number
of nodes, masked the issue. However, it could lead to inefficiencies, as
the remaining nodes check would not trigger correctly.

Fixes: scylladb/scylladb#23702

No backport: The bug is only present in the master branch, so no backport
is required.

Closes scylladb/scylladb#23704
2025-04-15 09:58:32 +02:00
Nadav Har'El
fbcf77d134 raft: make group0 Raft operation timeout configurable
A recent commit 370707b111 (re)introduced
a timeout for every group0 Raft operation. This timeout was set to 60
seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody".

However, one of the things we do as a group0 operation is schema
changes, and we already noticed a few years ago, see commit
0b2cf21932, that in some extremely
overloaded test machines where tests run hundreds of times (!) slower
than usual, a single big schema operation - such as Alternator's
DeleteTable deleting a table and multiple of its CDC or view tables -
sometimes takes more than 60 seconds. The above fix changed the
client's timeout to wait for 300 seconds instead of 60 seconds,
but now we also need to increase our Raft timeout, or the server can
time out. We've seen this happening recently making some tests flaky
in CI (issue #23543).

So let's make this timeout configurable, as a new configuration option
group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e.,
60 seconds), the same as the existing default. The test framework
overrides this default with a higher 300 second timeout, matching
the client-side timeout.

Before this patch, this timeout was already configurable in a strange
way, using injections. But this was a misstep: We already have more
than a dozen timeouts configurable through the normal configuration,
and this one should have been configured in the same way. There is
nothing "holy" about the default of 60 seconds we chose, and who
knows maybe in the future we might need to tweak it in the field,
just like we made the other timeouts tweakable. Injections cannot
be used in release mode, but configuration options can.

Fixes #23543

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23717
2025-04-15 10:57:39 +03:00
Kefu Chai
3e3f583b84 docs/dev/tombstone.md: fix a typo
s/alwas/always/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23734
2025-04-15 10:54:42 +03:00
Avi Kivity
5e1cf90a51 build: replace tools/java submodule with packaged cassandra-stress
We no longer use tools/java (scylladb/scylla-tools-java.git) for
nodetool or cqlsh; only cassandra-stress. Since that is available
in package form install that and excise the tools/java submodule
from the source tree.

pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh
submodule).

A few jmx references are dropped as well.

Frozen toolchain regenerated.

Optimized clang from

  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Closes scylladb/scylladb#23698
2025-04-15 10:11:28 +03:00
Jenkins Promoter
9699c3ded4 Update pgo profiles - aarch64 2025-04-15 04:45:34 +03:00
Jenkins Promoter
8472aa9e53 Update pgo profiles - x86_64 2025-04-15 04:29:24 +03:00
Pavel Emelyanov
b25cb5af0c Merge 'Use named gates' from Benny Halevy
Name the gates and phased barriers we use
to make it easy to debug gate_closed_exception
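
Why a name helps can be sketched with a toy gate. This is a Python
illustration, not the Seastar API: `NamedGate`, `GateClosedError` and
the message format are hypothetical.

```python
class GateClosedError(Exception):
    """Raised when entering a gate that has already been closed."""


class NamedGate:
    def __init__(self, name):
        self.name = name
        self._closed = False
        self._count = 0

    def enter(self):
        if self._closed:
            # The name tells the reader which gate rejected the operation.
            raise GateClosedError(f"gate closed: {self.name}")
        self._count += 1

    def leave(self):
        self._count -= 1

    def close(self):
        self._closed = True


gate = NamedGate("querier_cache")
gate.close()
try:
    gate.enter()
    message = None
except GateClosedError as exc:
    message = str(exc)
```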

Refs https://github.com/scylladb/seastar/pull/2688

* Enhancement only, no backport needed

Closes scylladb/scylladb#23329

* github.com:scylladb/scylladb:
  utils: loading_cache: use named_gate
  utils: flush_queue: use named_gate
  sstables_manager: use named gate
  sstables_loader: use named gate
  utils: phased_barrier, pluggable: use named gate
  utils: s3::client::multipart_upload: use named gate
  utils: s3::client: use named_gate
  transport: controller: use named gate
  tracing: trace_keyspace_helper: use named gate
  task_manager: module: use named gate
  topology_coordinator: use named gate
  storage_service: use named gate
  storage_proxy: wait_for_hint_sync_point: use named gate
  storage_proxy: remote: use named gate
  service: session: use named gate
  service: raft: raft_rpc: use named gate
  service: raft: raft_group0: use named gate
  service: raft: persistent_discovery: use named gate
  service: raft: group0_state_machine: use named gate
  service: migration_manager: use named gate
  replica: table: use named gate
  replica: compaction_group, storage_group: use named gate
  redis: query_processor: use named gate
  repair: repair_meta: use named gate
  reader_concurrency_semaphore: use named gate
  raft: server_impl: use named gate
  querier_cache: use named gate
  gms: gossiper: use named gate
  generic_server: use named gate
  db: sstables_format_listener: use named gate
  db: snapshot: backup_task: use named gate
  db: snapshot_ctl: use named gate
  hints: hints_sender: use named gate
  hints: manager: use named gate
  hints: hint_endpoint_manager: use named gate
  commitlog: segment_manager: use named gate
  db: batchlog_manager: use named gate
  query_processor: remote: use named gate
  compaction: compaction_state: use named gate
  alternator/server: use named_gate
2025-04-14 20:56:32 +03:00
Sergey Zolotukhin
e05c082002 Ensure raft group0 RPCs use the gossip scheduling group
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.
2025-04-14 17:10:46 +02:00
Sergey Zolotukhin
60f1053087 Move RAFT operations verbs to GOSSIP group.
In order for RAFT operations to use the gossip system semaphore, move the
RAFT verbs to the gossip group in `do_get_rpc_client_idx` in messaging_service.

Fixes scylladb/scylladb#21637
2025-04-14 17:09:49 +02:00
Pavel Emelyanov
1bd991a111 test: Inherit sstable_assertions from sstables::test
The latter class is invented to let tests access private fields of an
sstable (mostly methods). The former is in fact an extended version of
that also does some checks. However, they don't inherit from each
other, and the sstable_assertions partially duplicates some functionality
of the test one.

Add the inheritance, remove the duplicated methods from the child class,
update the callers (the test class returns future<>s, the assertions one
"knows" it runs in seastar thread) and mark sstable::read_toc() private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23697
2025-04-14 13:45:14 +03:00
Kefu Chai
b3f709bed7 s3: remove an extraneous space
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23714
2025-04-14 13:02:58 +03:00
Michał Chojnowski
6e2795a843 Update seastar submodule
* seastar ed8952fb...099cf616 (10):
  > reactor: Disable hot polling if wakeup granularity is too high
  > smp: add shard_to_numa_node_mapping()
  > tests/unit/httpd_test: fix the handling of NUL bytes in the parser
  > fstream: skip allocation in no write_behinds case
  > `http`: add `xml` support to `http::mime_types::mappings`
  > Print incrementally in sigsegv handler
  > reactor: use 0x for hex addresses
  > tls: Make session resume key shared across credentials builders creds
  > build: fix CMAKE_REQUIRED_FLAGS format for sanitizer detection
  > reactor: Remove sched_debug() related code

Closes scylladb/scylladb#23703
2025-04-14 12:54:19 +03:00
Andrei Chekun
8e33d7ab81 test.py: Make the testpy log files in pytest follow the same format
Fix the mismatch in log file names between conftest and scylla_manager.
This regression was introduced in #22960.

Currently, scylla manager writes its logs to a file named with the
following pattern:
suite_name.path_to_the_test_file_with_subfolders.run_id.function_name.mode.run_id_cluster.log
At the same time, pytest looks for this log under the following name:
suite_name.file_name_without_subfolders_path.py.run_id.function_name.mode.run_id_cluster.log

Because of this inconsistency, when a test fails, the scylla manager
log file is not copied to the failed_test directory and the test
raises an exception on teardown.
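
The mismatch between the two patterns can be sketched as follows.
This is an illustration, not the test.py code: the function names and
sample values are hypothetical, and the sketch assumes that subfolders
are joined with dots and that "run_id_cluster" means the run id
followed by "_cluster".

```python
from pathlib import PurePosixPath


def manager_log_name(suite, test_path, run_id, function, mode):
    # Assumption: the subfolder path is kept and the ".py" suffix is dropped.
    stem = ".".join(PurePosixPath(test_path).with_suffix("").parts)
    return f"{suite}.{stem}.{run_id}.{function}.{mode}.{run_id}_cluster.log"


def pytest_log_name(suite, test_path, run_id, function, mode):
    # Only the file name, including ".py", is used; subfolders are lost.
    file_name = PurePosixPath(test_path).name
    return f"{suite}.{file_name}.{run_id}.{function}.{mode}.{run_id}_cluster.log"


written = manager_log_name("cluster", "mv/tablets/test_foo.py", 1, "test_bar", "dev")
expected = pytest_log_name("cluster", "mv/tablets/test_foo.py", 1, "test_bar", "dev")
names_match = written == expected
```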

Closes scylladb/scylladb#23596
2025-04-14 12:52:48 +03:00
Evgeniy Naydanov
d6b64642c5 test.py: print out path to Scylla log for Python test suites
Test suites with `type: Python` are using single Scylla node
created by test.py, but it's handy to print a path to a log
file in pytest log too to make it easier to find the file
on failures.

Closes scylladb/scylladb#23683
2025-04-14 11:15:37 +03:00
Kefu Chai
69de816b1b scylla-gdb.py: fix a typo in gdb command description
replace "runnign" with "running".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23716
2025-04-14 10:59:21 +03:00
Benny Halevy
8d7e4d6c36 utils: loading_cache: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:09 +03:00
Benny Halevy
46f2a24772 utils: flush_queue: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:02 +03:00
Benny Halevy
d665bb4f8b sstables_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
7969293dcf sstables_loader: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
e1fe82ed33 utils: phased_barrier, pluggable: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
d3f498ae59 utils: s3::client::multipart_upload: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
eea83464c7 utils: s3::client: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:46:51 +03:00
Benny Halevy
79e967e2f5 transport: controller: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:48 +03:00
Benny Halevy
3d87b67d0e tracing: trace_keyspace_helper: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:48 +03:00
Benny Halevy
bfdd8a98ca task_manager: module: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:48 +03:00
Benny Halevy
5e864b6277 topology_coordinator: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:46 +03:00
Benny Halevy
a67ed59399 storage_service: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
39f1175451 storage_proxy: wait_for_hint_sync_point: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
e228a112fe storage_proxy: remote: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
0a1e7de6ea service: session: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
747446cb25 service: raft: raft_rpc: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
01bb3980fc service: raft: raft_group0: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
6118150d44 service: raft: persistent_discovery: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
e430df6332 service: raft: group0_state_machine: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
5f8b5724e6 service: migration_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
7342a57cbb replica: table: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
52e1ce7f0d replica: compaction_group, storage_group: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
aff6017e83 redis: query_processor: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
80b5089d0c repair: repair_meta: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
679e73053f reader_concurrency_semaphore: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
9724d87e86 raft: server_impl: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
5780599eec querier_cache: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
cecfb6dfd7 gms: gossiper: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
bc69bc3de7 generic_server: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
5a71763d75 db: sstables_format_listener: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
da492231df db: snapshot: backup_task: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
edf497c170 db: snapshot_ctl: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
c5d7272393 hints: hints_sender: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
1c1adb3d60 hints: manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
4c475a1905 hints: hint_endpoint_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
bdd5a61139 commitlog: segment_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
0672c9da5c db: batchlog_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
f8d5835cab query_processor: remote: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
747ae5e1c4 compaction: compaction_state: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
879811e0d2 alternator/server: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Dawid Mędrek
be0877ce69 test/cqlpy: Enable rf_rack_valid_keyspaces by default
All of the tests in the suite have been adjusted so they only
use RF-rack-valid keyspaces, so let's start enabling the option
by default.
2025-04-11 14:55:13 +02:00
Dawid Mędrek
a59842257a test: Move test_alter_tablet_keyspace_rf to cluster suite
We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the
cluster test suite. The reason behind the change is that the test cannot
be run with `rf_rack_valid_keyspaces` turned on in the configuration.
During the test, we make the keyspace RF-rack-invalid multiple times.
Since RF-rack-validity is a very strong constraint, adjusting the test
otherwise is impossible.

By moving it to the cluster test suite, we're able to change the
configuration of the node used in the test, and so the test can work
again.
2025-04-11 14:55:11 +02:00
Dawid Mędrek
958eaec056 test/cqlpy: Adjust tests to RF-rack-valid keyspaces 2025-04-11 14:55:04 +02:00
Dawid Mędrek
6bde01bb59 test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces
We adjust three existing Cassandra tests so that they don't create
RF-rack-invalid keyspaces. We modify the replication factor used
in the problematic tests. The changes don't affect the tests as
the value of the RF is unrelated to what they verify. Thanks to
that, we can run them now even with enforced RF-rack-valid keyspaces.

The drawback is that the modified ALTER statements do not modify
the RF at all. However, since the tests seem to verify that the code
responsible for VALIDATING a request works as intended, that should
have little to no impact on them.
2025-04-11 14:20:14 +02:00
Dawid Mędrek
10589e966f test/cluster/mv: Adjust test to RF-rack-valid keyspaces
We adjust the test in the directory so that all of the used
keyspaces are RF-rack-valid throughout their execution.

Refs scylladb/scylladb#23428

Closes scylladb/scylladb#23490
2025-04-11 14:03:21 +02:00
Karol Baryła
df64985a4e Docs: Describe driver issue with tablet RF increase
The current protocol extension that sends tablet info to drivers only does
that if the driver selects a non-replica coordinator for a routable
request. It works well if some node on the replica list is replaced by
another node, or if some replicas are removed from the list. The driver will
at some point send a request to a stale replica, and receive the new list in
response.

The issue is with extending the list with new replicas. In that case the old
replicas are all still correct, so the driver will not select any wrong
replica, and will not receive the new list. As far as I know, the only
scenario where this could happen is an RF increase.

It could be worked around to some degree in the drivers, but it would
add significant complexity (definitely more than any other invalidations
we introduced) while still not being an ideal solution. This scenario
should be rare enough, and the consequences of not handling it minor
enough (new replicas not being used as coordinators) that it does not
warrant a driver-side solution. Instead, this commit adds info about this
to the documentation, advising users to restart applications after replica
lists are extended.

It is worth noting that if the new tablet feedback protocol extension is
implemented, this problem goes away. See issue #21664.

Closes scylladb/scylladb#23447
2025-04-11 13:48:40 +02:00
David Garcia
cf11d5eb69 fix: openapi not rendering in docs.scylladb.com/manual
Closes scylladb/scylladb#23686
2025-04-10 17:47:58 +03:00
Patryk Jędrzejczak
07a7a75b98 Merge 'raft: implement the limited voters feature' from Emil Maskovsky
Currently, if raft is enabled, all nodes are voters in group0. However, it is not necessary for all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes each alongside 1 DC with 10 nodes will lose the majority if the large DC is isolated).

The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked).
* When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution)

Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution.

At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR.

The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures.
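The quorum arithmetic behind the numbers above can be checked with a short sketch. The helper names `majority` and `tolerated_failures` are made up for this illustration and are not part of the Scylla code base; the sketch only assumes the usual raft rule that a majority is more than half of the voters.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority still remains."""
    return voters - majority(voters)


# With fewer than 5 voters, losing 2 of them already loses the majority.
for n in (3, 4, 5):
    print(n, "voters: majority", majority(n), "tolerates", tolerated_failures(n))

# The asymmetric deployment mentioned earlier, with every node a voter:
# 2 DCs with 5 nodes each plus 1 DC with 10 nodes.
dc_sizes = [5, 5, 10]
total = sum(dc_sizes)
survivors = total - max(dc_sizes)  # voters left when the large DC is isolated
print("majority kept without the large DC:", survivors >= majority(total))
```

With 20 voters the majority is 11, so the 10 voters left after the large DC is isolated are not enough, which is the problem a capped and balanced voter set is meant to avoid.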

Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested).

Tests added:
* boost/group0_voter_registry_test.cc: run time on CI: ~3.5s
* topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total

Fixes: scylladb/scylladb#18793

No backport: This is a new feature that will not be backported.

Closes scylladb/scylladb#21969

* https://github.com/scylladb/scylladb:
  raft: distribute voters by rack inside DC
  raft/test: fix lint warnings in `test_raft_no_quorum`
  raft/test: add the upgrade test for limited voters feature
  raft topology: handle on_up/on_down to add/remove node from voters
  raft: fix the indentation after the limited voters changes
  raft: implement the limited voters feature
  raft: drop the voter removal from the decommission
  raft/test: disable the `stop_before_becoming_raft_voter` test
  raft/test: stop the server less gracefully in the voters test
2025-04-10 15:29:15 +02:00
Avi Kivity
9559e53f55 Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec
After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph.

To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer.

Closes scylladb/scylladb#23584

* github.com:scylladb/scylladb:
  tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
  tablet-mon.py: Center tablet id text properly in the vertical axis
  tablet-mon.py: Show migration stage tag in table mode only when migrating
  virtual-tables: Introduce system.load_per_node
  virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
  docs: virtual-tables: Fix instructions
  service: tablets: Keep load_stats inside tablet_allocator
2025-04-10 14:59:08 +03:00
Avi Kivity
885838fc46 Merge 'scylla-gdb.py: improve scylla repairs command' from Botond Dénes
Make output more readable by:
* group follower/master repair instances separately
* split repair details into one line for repair summary, then one line for each host info
* add indentation to make the output easier to follow

Also add `-m|--memory` option to calculate memory usage of repair buffers.

Example output:

    (gdb) scylla repairs -m
    Repairs for which this node is leader:
      (repair_meta*) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
      (repair_meta*) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished
      (repair_meta*) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished
      (repair_meta*) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
    Repairs for which this node is follower:

Closes scylladb/scylladb#23075

* github.com:scylladb/scylladb:
  scylla-gdb.py: improve scylla repairs commadn
  scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__
  scylla-gdb.py: introduce managed_bytes
2025-04-10 14:52:43 +03:00
Dani Tweig
e92740cc2b .github: update bug_report.yml
Perform a yaml "face lift" on the old bug report md template, making bug reporting more efficient.

- Add dedicated textarea fields for problem description and expected behavior
- Include pre-filled placeholders to guide issue reporting
- Add formatted log output section with shell syntax highlighting

Closes: #21532
2025-04-10 14:26:00 +03:00
Pavel Emelyanov
88318d3b50 topology_coordinator: Use shorter fault-injection overloads
There are a few places that want to pause until a message is received from
the test. There's a convenient one-line sugar to do it.

One test needs to update its expectations about the log message that appears
when scylla steps on it and actually starts waiting.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23390
2025-04-10 14:05:46 +03:00
Botond Dénes
d67202972a mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.

This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to a query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.

This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.
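The skip sequence described above can be modelled with a small sketch. It is not the adaptor's actual code: `FirstRowPerPartition`, `drive` and the `respect_end_of_partition` flag are hypothetical names, and the "buggy" branch is only one plausible way to model a stop request from a clustering element leaking into the decision made at the end of the partition.

```python
STOP, CONTINUE = True, False  # stand-ins for stop_iteration::yes / ::no


class FirstRowPerPartition:
    """Hypothetical consumer: takes one row per partition, skips the rest."""

    def __init__(self):
        self.seen = []

    def consume_row(self, pk, row):
        self.seen.append((pk, row))
        return STOP  # skip the remaining rows of this partition

    def consume_end_of_partition(self):
        return CONTINUE  # but keep going with the next partition


def drive(partitions, consumer, respect_end_of_partition):
    """Feed partitions to the consumer, modelling the fixed or buggy driver."""
    for pk, rows in partitions:
        row_stop = False
        for row in rows:
            if consumer.consume_row(pk, row):
                row_stop = True
                break
        end_stop = consumer.consume_end_of_partition()
        if respect_end_of_partition:
            stop_all = end_stop  # fixed: only the end-of-partition answer counts
        else:
            stop_all = row_stop or end_stop  # buggy model: the row-level stop leaks
        if stop_all:
            break
    return consumer.seen


data = [("a", [1, 2]), ("b", [3, 4])]
print(drive(data, FirstRowPerPartition(), respect_end_of_partition=False))
print(drive(data, FirstRowPerPartition(), respect_end_of_partition=True))
```

In the buggy model only partition "a" contributes a row, which mirrors the missing partitions in the query result; with the end-of-partition answer respected, both partitions contribute their first row.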

A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.

Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
  underlying consumer's on_end_of_stream(), so when consuming multiple
  frozen mutations, the underlying's on_end_of_stream() is called for
  each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().

Fixes: #20084

Closes scylladb/scylladb#23657
2025-04-10 13:19:57 +03:00
Pavel Emelyanov
4de48a9d24 encryption: Mark parts of encrypted_data_sink private
Currently the whole class is public, although it does not need to be.
Also remove the private _flush_pos member, which turns out to be unused,
to please the compiler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23677
2025-04-10 12:42:57 +03:00
Dawid Mędrek
0ed21d9cc1 test/cluster/test_tablets.py: Fix test errorneous indentation
Some of the statements in the test are not indented properly
and, as a result, are never run. It's most likely a small mistake,
so let's fix it.

Closes scylladb/scylladb#23659
2025-04-10 11:06:01 +03:00
Nadav Har'El
258213f73b Merge 'Alternator batch count histograms' from Amnon Heiman
This series adds a histogram for get and write batch sizes.
It uses the estimated_histogram implementation, which starts from 1 with a 1.2 exponential factor. This gives
very tight buckets up to 20 while still covering the range all the way to 100.

Histograms will be reported per node.

**Backport to 2025.1 so we'll have information about user batch size limitation**

Closes scylladb/scylladb#23379

* github.com:scylladb/scylladb:
  alternator: Add tests for the batch items histograms
  alternator: Add histogram for batch item count
2025-04-09 22:41:14 +03:00
Tomasz Grabiec
b5211cca85 Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent out of space during streaming

Closes scylladb/scylladb#23187

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-09 21:35:37 +02:00
Avi Kivity
ed3e4f33fd Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz
Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency.

It should help to reduce cpu usage by limiting cpu concurrency for new connections.  As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed.

New connections_shed and connections_blocked metrics are added for tracking.

Testing:
 - manually via a simple program creating a high number of connections and constantly re-connecting
 - added benchmark

Following are benchmark results:

Before:
```
> build/release/test/perf/perf_generic_server --smp=1
170101.41 tps ( 13.1 allocs/op,   0.0 logallocs/op,   7.0 tasks/op,    4695 insns/op,    3178 cycles/op,        0 errors)
[...]
throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54
instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96
  cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15
```

After:
```
> build/release/test/perf/perf_generic_server --smp=1
167373.19 tps ( 13.1 allocs/op,   0.0 logallocs/op,   7.0 tasks/op,    4821 insns/op,    3371 cycles/op,        0 errors)
[...]
throughput:
  mean=   171199.55 standard-deviation=2484.58
  median= 171667.06 median-absolute-deviation=2087.63
  maximum=173689.11 minimum=167904.76
instructions_per_op:
  mean=   4801.90 standard-deviation=16.54
  median= 4796.78 median-absolute-deviation=9.32
  maximum=4830.71 minimum=4789.81
cpu_cycles_per_op:
  mean=   3245.26 standard-deviation=32.28
  median= 3230.44 median-absolute-deviation=16.52
  maximum=3297.39 minimum=3215.62
```

The patch adds around 67 insns/op so its effect on performance should be negligible.

Fixes: https://github.com/scylladb/scylladb/issues/22844

Closes scylladb/scylladb#22828

* github.com:scylladb/scylladb:
  transport: move on_connection_close into connection destructor
  test: perf: make aggregated_perf_results formatting more human readable
  transport: add blocked and shed connection metrics
  generic_server: throttle and shed incoming connections according to semaphore limit
  generic_server: add data source and sink wrappers bookkeeping network IO
  generic_server: coroutinize part of server::do_accepts
  test: add benchmark for generic_server
  test: perf: add option to count multiple ops per time_parallel iteration
  generic_server: add semaphore for limiting new connections concurrency
  generic_server: add config to the constructor
  generic_server: add on_connection_ready handler
2025-04-09 21:41:38 +03:00
Tomasz Grabiec
5b5ada1743 tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
Per-node capacity is queried from system.load_per_node

Tablet height in each node is scaled so that equal height = equal node
utilization.

The nominal height is assigned to the node which has the smallest
capacity, so nodes with higher capacity will have smaller tablets than
normal.
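One way to read the scaling rule above, as a rough sketch only: the function name `tablet_heights`, the nominal height of 20 and the capacity figures are assumptions for illustration and are not taken from tablet-mon.py.

```python
def tablet_heights(capacity_by_node, nominal_height=20.0):
    """Scale the drawn tablet height inversely to node capacity.

    The node with the smallest capacity gets the nominal height; a node
    with twice that capacity draws equally sized tablets at half the
    height, so equal drawn height means equal node utilization.
    """
    smallest = min(capacity_by_node.values())
    return {
        node: nominal_height * smallest / capacity
        for node, capacity in capacity_by_node.items()
    }


# Hypothetical capacities, e.g. as they might be read from system.load_per_node.
print(tablet_heights({"n1": 1000, "n2": 2000}))
```

Under this reading, a full column on either node reaches the same total height even though the larger node holds twice as many tablets.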
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
217184f16b tablet-mon.py: Center tablet id text properly in the vertical axis
Was too low due to not subtracting frame size from height
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
20cac72056 tablet-mon.py: Show migration stage tag in table mode only when migrating
It's the gray bar at the top of the tablet. It's not showing useful
information when tablet is not migrating.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
0b9a75d7b6 virtual-tables: Introduce system.load_per_node
Can be used to query per-node stats about load as seen by the load
balancer.

In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
668094dc58 virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
So that population can access read's timeout and mark the permit as awaiting.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
34beaa30b5 docs: virtual-tables: Fix instructions 2025-04-09 20:21:51 +02:00
Tomasz Grabiec
76bc11c78c service: tablets: Keep load_stats inside tablet_allocator
So that virtual tables can pick them up.

It's a better place to keep them than in topology_coordinator.
2025-04-09 20:21:51 +02:00
Pavel Emelyanov
d9853efa7c Merge '[Out-of-space prevention] db: backup: prioritize sstables that were deleted from the table' from Benny Halevy
The motivation behind this change is to free up disk space as early as possible.
The reason is that a snapshot locks the space of all SSTables in the snapshot,
and deleting them from the table, for example, by compaction or tablet migration,
won't free up their capacity until they are uploaded to object storage and deleted from the snapshot.

This series adds prioritization of deleted sstables in two cases:
First, after the snapshot dir is processed, the list of SSTable generations is cross-referenced with the
list of SSTables presently in the table and any generation that is not in the table is prioritized to
be uploaded earlier.
In addition, a subscription mechanism was added to sstables_manager
and it is used in backup to prioritize SSTables that get deleted from the table directory
during backup.
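The first case, cross-referencing snapshot generations with the table, can be sketched as below. `upload_order` and the plain integer generations are illustrative assumptions; the real backup_task works on sstable generation types and component files.

```python
def upload_order(snapshot_generations, table_generations):
    """Put generations that are no longer in the table at the front.

    sorted() is stable and False sorts before True, so generations missing
    from the table come first while the original order is otherwise kept.
    """
    in_table = set(table_generations)
    return sorted(snapshot_generations, key=lambda gen: gen in in_table)


# Generations 1 and 3 were already deleted from the table (e.g. compacted away).
print(upload_order([1, 2, 3, 4], table_generations=[2, 4]))
```

Uploading generations 1 and 3 first lets their disk space be reclaimed as soon as they are removed from the snapshot.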

This is particularly important when backup happens during high disk utilization (e.g. 90%).
Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes
to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the
snapshot taken for backup.

* Enhancement, no backport needed

Closes scylladb/scylladb#23241

* github.com:scylladb/scylladb:
  db: snapshot: backup_task: prioritize sstables deleted during upload
  sstables_manager: add subscriptions
  db: snapshot: backup_task: limit concurrency
  sstables: directory_semaphore: expose get_units
  db: snapshot: backup_task: add sharded sstables_manager
  database: expose get_sstables_manager(schema)
  db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table
  db: snapshot-ctl: pass table_id to backup_task
  db: snapshot-ctl: expose sharded db() getter
  db: snapshot: backup_task: do_backup: organize components by sstable generation
  db: snapshot: coroutinize backup_task
  db: snapshot: backup_task: refactor backup_file out of uploads_worker
  db: snapshot: backup_task: refactor uploads_worker out of do_backup
  db: snapshot: backup_task: process_snapshot_dir: initialize total progress
  utils/s3: upload_progress: init members to 0
  db: snapshot: backup_task: do_backup: refactor process_snapshot_dir
  db: snapshot: backup_task: keep expection as member
2025-04-09 15:32:11 +03:00
Marcin Maliszkiewicz
ce18909688 transport: move on_connection_close into connection destructor
To make the code more robust by ensuring closing code is always executed.
2025-04-09 13:50:19 +02:00
Pavel Emelyanov
35dfc8c782 Merge 'audit: add semaphore to audit_syslog_storage_helper' from Andrzej Jackowski
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in packets overwriting its state.
This, in turn, causes a corruption of audit logs, as well as assertion
failures.

To work around the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.
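The idea of serializing sends with a count=1 semaphore can be illustrated with an asyncio analogy. This is not Seastar code: `FakeChannel` and `SerializedSender` are hypothetical stand-ins, and the `asyncio.sleep(0)` merely models a point where another fiber could interleave with an in-progress send.

```python
import asyncio


class FakeChannel:
    """Hypothetical channel whose send() must not be interleaved."""

    def __init__(self):
        self.in_flight = 0
        self.max_in_flight = 0

    async def send(self, msg):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0)  # yield point: other senders may run here
        self.in_flight -= 1


class SerializedSender:
    """Wraps a channel so that at most one send() is in progress."""

    def __init__(self, channel):
        self._channel = channel
        self._sem = asyncio.Semaphore(1)

    async def send(self, msg):
        async with self._sem:
            await self._channel.send(msg)


async def run(n, serialized):
    channel = FakeChannel()
    sender = SerializedSender(channel) if serialized else channel
    await asyncio.gather(*(sender.send(f"msg-{i}") for i in range(n)))
    return channel.max_in_flight


print("unserialized, max concurrent sends:", asyncio.run(run(5, serialized=False)))
print("serialized, max concurrent sends:", asyncio.run(run(5, serialized=True)))
```

With the semaphore in place only one send is ever in flight per sender, which corresponds to one datagram_channel.send() at a time on each shard.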

This change:
 - Move syslog_send_helper to audit_syslog_storage_helper
 - Coroutinize audit_syslog_storage_helper
 - Introduce semaphore with count=1 in audit_syslog_storage_helper.

See https://github.com/scylladb/scylla-dtest/pull/5749 for related dtest
Fixes: scylladb#22973

Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1.

Closes scylladb/scylladb#23464

* github.com:scylladb/scylladb:
  audit: add semaphore to audit_syslog_storage_helper
  audit: corutinize audit_syslog_storage_helper
  audit: moved syslog_send_helper to audit_syslog_storage_helper
2025-04-09 12:39:06 +03:00
Marcin Maliszkiewicz
619944555f test: perf: make aggregated_perf_results formatting more human readable
Before:
throughput: mean=170728.58 standard-deviation=1921.76 median=171084.16 median-absolute-deviation=1501.58 maximum=172913.36 minimum=167288.97
instructions_per_op: mean=4685.89 standard-deviation=12.46 median=4683.92 median-absolute-deviation=9.68 maximum=4706.53 minimum=4666.70
cpu_cycles_per_op: mean=3090.94 standard-deviation=52.69 median=3103.43 median-absolute-deviation=24.55 maximum=3192.99 minimum=3003.00

After:
throughput:
	mean=   168224.81 standard-deviation=854.48
	median= 168829.02 median-absolute-deviation=604.21
	maximum=168829.02 minimum=167620.60
instructions_per_op:
	mean=   4837.02 standard-deviation=20.89
	median= 4851.79 median-absolute-deviation=14.77
	maximum=4851.79 minimum=4822.24
cpu_cycles_per_op:
	mean=   3271.42 standard-deviation=46.29
	median= 3304.16 median-absolute-deviation=32.73
	maximum=3304.16 minimum=3238.69
2025-04-09 10:49:20 +02:00
Marcin Maliszkiewicz
599f4d312b transport: add blocked and shed connection metrics
This adds some visibility into connection storm mitigations
added in following commits.
2025-04-09 10:49:18 +02:00
Marcin Maliszkiewicz
26518704ab generic_server: throttle and shed incoming connections according to semaphore limit
If we have uninitialized_connections_semaphore_cpu_concurrency (default
2) connections being processed, we start delaying the acceptance of new
connections.

Connections which are in network IO state are not counted towards this
limit and they can go to cpu phase without blocking. So it can happen
that we process more concurrent new connections but that's a necessary
tradeoff to make progress during a storm without implementing more advanced
machinery (i.e. priority queue).
2025-04-09 10:48:51 +02:00
Marcin Maliszkiewicz
9f5de2c256 generic_server: add data source and sink wrappers bookkeeping network IO
They release semaphore units when we start network IO and acquire them
when we enter the cpu intensive phase. We use consume() so it doesn't block
because we don't want connections we started processing to compete with
new incoming connections. Otherwise during a connection storm we wouldn't
make much progress.

There will be a simplification here as we'll treat disk IO (if there is any)
as cpu work.
2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz
c56116372e generic_server: coroutinize part of server::do_accepts 2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz
719d04d501 test: add benchmark for generic_server
Changes in configure.py are needed because we don't want to embed
this benchmark in the scylla binary like perf_simple_query or perf_alternator;
it doesn't directly translate to Scylla performance, but we want to use
aggregated_perf_results for precise cpu measurements so we need
different dependencies.
2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz
b957cedace test: perf: add option to count multiple ops per time_parallel iteration 2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
ed82bede39 generic_server: add semaphore for limiting new connections concurrency
It will be used in following commits.
2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
33122d3f93 generic_server: add config to the constructor 2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
474e84199c generic_server: add on_connection_ready handler
This patch cleans up the code a bit so that the ready state is set in a single place.
It also adds a handler which allows adding logic when a connection is made
ready; that logic will be added in the following commits.
2025-04-09 10:30:58 +02:00
Benny Halevy
1ab3ec061b db: snapshot: backup_task: prioritize sstables deleted during upload
Subscribe to each shard's sstables_manager to get
callback notifications and keep the generation numbers
of deleted sstables in a vector so they can be prioritized
first to free up their disk space as soon as possible.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
d8b0c661e4 sstables_manager: add subscriptions
Allow other submodules to subscribe for added/deleted
notifications.  This will be used in a later
patch to prioritize unlinked sstables for backup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
d3b4874ec3 db: snapshot: backup_task: limit concurrency
Otherwise, once all the background tasks are created
we have no way to reorder the queue.

Fixes #23239

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
e60fcc58b7 sstables: directory_semaphore: expose get_units
To be used by a following patch for
backup concurrency control.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
b7807ec165 db: snapshot: backup_task: add sharded sstables_manager
Get a reference to the table's sstables_manager
on each shard.  This will be used by later patches
to limit concurrency and to subscribe for notifications.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
b270d552fb database: expose get_sstables_manager(schema)
Return either the system or the user sstables manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
9a4b4afade db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table
Detect SSTables that are already deleted from the table
in process_snapshot_dir when their number_of_links is equal to 1.

Note that the SSTable may be hard-linked by more than one snapshot,
so even after it is deleted from the table, its number of links
would be greater than one.  In that case, however, uploading it
earlier won't help to free-up its capacity since it is still held
by other snapshots.
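The link-count check described above can be sketched with plain POSIX hard links. This is only an illustration, not the backup_task implementation: `only_in_snapshot`, `demo` and the file names are made up, and the sketch assumes a filesystem that supports hard links.

```python
import os
import tempfile


def only_in_snapshot(path):
    """True when the file has a single hard link, i.e. no other directory
    (table or other snapshot) still references the same data."""
    return os.stat(path).st_nlink == 1


def demo():
    with tempfile.TemporaryDirectory() as root:
        table_file = os.path.join(root, "table-Data.db")
        snapshot_file = os.path.join(root, "snapshot-Data.db")
        with open(table_file, "wb") as f:
            f.write(b"sstable bytes")
        os.link(table_file, snapshot_file)  # taking a snapshot adds a hard link
        before = only_in_snapshot(snapshot_file)
        os.remove(table_file)  # e.g. compaction deletes it from the table
        after = only_in_snapshot(snapshot_file)
        return before, after


print(demo())
```

While the table still links the file the count is 2; once the table copy is removed the snapshot holds the last link, so uploading and deleting it frees the space.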

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
4b8699e278 db: snapshot-ctl: pass table_id to backup_task
To be used by the following patches to get
to the table's sstables_manager for concurrency
control and for notifications (TBD).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
d646603bfd db: snapshot-ctl: expose sharded db() getter
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
63bc1d4626 db: snapshot: backup_task: do_backup: organize components by sstable generation
Do not rely on the snapshot directory listing order.
This will become useful for prioritizing unlinked
sstables in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:06 +03:00
Benny Halevy
a731c1b33d db: snapshot: coroutinize backup_task
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:53 +03:00
Benny Halevy
189075b885 db: snapshot: backup_task: refactor backup_file out of uploads_worker
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:53 +03:00
Benny Halevy
e3ba425c2b db: snapshot: backup_task: refactor uploads_worker out of do_backup
Let do_backup deal only with the high level coordination.
A future patch will follow this structure to run
uploads_worker on each shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:53 +03:00
Benny Halevy
ff25b4c97f db: snapshot: backup_task: process_snapshot_dir: initialize total progress
Now we can calculate in advance how much data we intend to upload
before we start uploading it.

This will be used also later when uploading in parallel
on all shards, so we can collect the progress from all
shards in get_progress().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:51 +03:00
Benny Halevy
6da215e8af utils/s3: upload_progress: init members to 0
For default construction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:44:52 +03:00
Benny Halevy
70307e8120 db: snapshot: backup_task: do_backup: refactor process_snapshot_dir
Do preliminary listing of the snapshot dir.

While at it, simplify the loop as follows:
The optional directory_entry returned by snapshot_dir_lister.get()
can be checked as part of the loop condition expression,
and with that, error handling can be simplified and moved
out of the loop body.
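
The loop shape described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual C++ code: `DirLister` and `list_snapshot_dir` are hypothetical names, and `None` stands in for the empty optional returned by `get()`.

```python
from typing import Iterator, List, Optional


class DirLister:
    """Hypothetical lister: get() returns the next entry, or None when done."""

    def __init__(self, names: List[str]):
        self._it: Iterator[str] = iter(names)

    def get(self) -> Optional[str]:
        return next(self._it, None)


def list_snapshot_dir(lister: DirLister) -> List[str]:
    entries = []
    # The optional result is tested in the loop condition itself, so the
    # body needs no separate "end of listing" branch, and any error
    # handling can wrap the loop as a whole.
    while (entry := lister.get()) is not None:
        entries.append(entry)
    return entries
```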

A followup patch will organize the component files
by their sstable generation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

db: snapshot: backup_task: process_snapshot_dir: simplify loop

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:44:52 +03:00
Benny Halevy
8a4b6b9614 db: snapshot: backup_task: keep exception as member
As part of refactoring do_backup().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:44:52 +03:00
Botond Dénes
b65a76ab6f Merge 'nodetool: cluster repair: add a command to repair tablet keyspaces' from Aleksandra Martyniuk
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.

The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.

The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/22409.

Needs backport to 2025.1 that introduces the new tablet repair API

Closes scylladb/scylladb#22905

* github.com:scylladb/scylladb:
  docs: nodetool: update repair and add tablet-repair docs
  test: nodetool: add tests for cluster repair command
  nodetool: add cluster repair command
  nodetool: repair: extract getting hosts and dcs to functions
  nodetool: repair: warn about repairing tablet keyspaces
  nodetool: repair: move keyspace_uses_tablets function
2025-04-09 08:20:34 +03:00
Botond Dénes
5f697d373f test/cqlpy/test_tools.py: use AIO backend in scylla-sstable query tests
These tests seem to be hitting the io-uring bug in the kernel from
time-to-time, making CI flaky. Force the use of the AIO backend in these
tests, as a workaround until fixed kernels (>=6.8.13) are available.

Fixes: #23517
Fixes: #23546

Closes scylladb/scylladb#23648
2025-04-08 20:29:58 +03:00
Benny Halevy
dfdca2d84e locator: topology: drop unused calculate_datacenters
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23647
2025-04-08 19:04:56 +03:00
Tomasz Grabiec
06b49bdf69 Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;

In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which is not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the fix is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252

The fix will need backport to all live releases.

Closes scylladb/scylladb#23255

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add memtable overlap check tests
  replica/table: add error injection to memtable post-flush phase
  utils/error_injection: add a way to set parameters from error injection points
  test/cluster: add test_data_resurrection_in_memtable.py
  test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
  replica/mutation_dump: don't assume cells are live
  replica/database: do_apply() add error injection point
  replica: improve memtable overlap checks for the cache
  replica/memtable: add is_merging_to_cache()
  db/row_cache: add overlap-check for cache tombstone garbage collection
  mutation/mutation_compactor: copy key passed-in to consume_new_partition()
2025-04-08 17:26:58 +02:00
Andrzej Jackowski
c12f976389 audit: add semaphore to audit_syslog_storage_helper
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in packets overwriting its state.
This, in turn, causes a corruption of audit logs, as well as assertion
failures.

To work around the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.

This change:
 - Introduce a semaphore with count=1 in audit_syslog_storage_helper.
 - Add a 1 hour timeout to the semaphore, so semaphore stalls
   fail just like all other syslog auditing failures.
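
The serialization idea can be sketched with Python's asyncio. This is an analogy under stated assumptions, not the Seastar implementation: `SyslogSender` and `_raw_send` are hypothetical stand-ins for audit_syslog_storage_helper and datagram_channel.send().

```python
import asyncio


class SyslogSender:
    """Hypothetical stand-in for audit_syslog_storage_helper."""

    def __init__(self, timeout_s: float = 3600.0):
        self._sem = asyncio.Semaphore(1)  # count=1: one send in flight
        self._timeout_s = timeout_s       # 1 hour, as in the commit message
        self.in_flight = 0
        self.max_in_flight = 0

    async def _raw_send(self, msg: str) -> None:
        # Stand-in for a send() that must not be entered concurrently.
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0)  # yield to other tasks, as a real send would
        self.in_flight -= 1

    async def send(self, msg: str) -> None:
        # A wait longer than the timeout raises a timeout error, so a
        # stalled semaphore surfaces as an ordinary send failure.
        await asyncio.wait_for(self._sem.acquire(), self._timeout_s)
        try:
            await self._raw_send(msg)
        finally:
            self._sem.release()


async def demo() -> int:
    sender = SyslogSender()
    await asyncio.gather(*(sender.send(f"m{i}") for i in range(10)))
    return sender.max_in_flight
```

Ten concurrent callers of `send()` never overlap inside `_raw_send()`, which is the property the semaphore is there to provide.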

Fixes: scylladb#22973
2025-04-08 16:24:42 +02:00
Andrzej Jackowski
889fd5bc9f audit: coroutinize audit_syslog_storage_helper
This change:
 - Coroutinize audit_syslog_storage_helper::syslog_send_helper
 - Coroutinize audit_syslog_storage_helper::start
 - Coroutinize audit_syslog_storage_helper::write
2025-04-08 16:24:42 +02:00
Andrzej Jackowski
dbd2acd2be audit: moved syslog_send_helper to audit_syslog_storage_helper
This change:
 - Make syslog_send_helper() a method of audit_syslog_storage_helper, so
   syslog_send_helper() can access private members of
   audit_syslog_storage_helper in the next commits.
 - Remove unneeded syslog_send_helper() arguments that now are class
   members.
2025-04-08 16:24:42 +02:00
Benny Halevy
f702adf6a5 main: fix typo in tablet allocator checkpoint message
Introduced in b6705ad48b

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23211
2025-04-08 17:19:41 +03:00
Botond Dénes
583a813d17 docs/dev/tombstone.md: fix link to ddl.html
Closes scylladb/scylladb#23622
2025-04-08 16:18:50 +03:00
Anna Stuchlik
93a7b3ac1d doc: add enabling consistent topology updates to the 2025.1 upgrade guide-from-2024
This commit adds the procedure to enable consistent topology updates for upgrades
from 2024.1 to 2025.1 (or from 2024.2 to 2025.1 if the feature wasn't enabled
after upgrading from 2024.1 to 2024.2).

Fixes https://github.com/scylladb/scylladb/issues/23650

Closes scylladb/scylladb#23651
2025-04-08 15:38:00 +03:00
Robert Bindar
4e3eb2fdac Move direct_failure_detector from root to service/
direct_failure_detector used to be used by gms/ as well,
but that's not the case anymore, so raft/ is the only user.

Fixes #23133

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23248
2025-04-08 13:03:24 +03:00
Aleksandra Martyniuk
372b562f5e test: add test for rebuild with repair 2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
acd32b24d3 locator: service: move to rebuild_v2 transition if cluster is upgraded
If cluster is upgraded to version containing rebuild_v2 transition
kind, move to this transition kind instead of rebuild.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
eb17af6143 locator: service: add transition to rebuild_repair stage for rebuild_v2
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
4a847df55c locator: service: add rebuild_repair tablet transition stage
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streaming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.

The repair bypasses the tablet repair scheduler and it does not update
the repair_time.

A transition to the rebuild_repair stage will be added in the following
patches.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
5d6041617b locator: add maybe_get_primary_replica
Add maybe_get_primary_replica to choose a primary replica out of
custom replica set.
2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
ed7b8bb787 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transitions and stages for rebuild_v2 transition kind will be
added in the following patches.
2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
b80e957a40 gms: add REPAIR_BASED_TABLET_REBUILD cluster feature 2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
9769d7a564 docs: nodetool: update repair and add tablet-repair docs 2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
02fb71da42 test: nodetool: add tests for cluster repair command 2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
8bbc5e8923 nodetool: add cluster repair command
Add a new nodetool cluster repair command that repairs tablet keyspaces.

Users may specify keyspace and tables that they want to repair.
If the keyspace and tables are not specified, all tablet keyspaces
are repaired.

The command calls the new tablet repair API /storage_service/tablets/repair.
2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
aa3973c850 nodetool: repair: extract getting hosts and dcs to functions 2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
b81c81c7f4 nodetool: repair: warn about repairing tablet keyspaces
Warn about an attempt to repair a tablet keyspace with nodetool repair.

A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.
2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
cbde835792 nodetool: repair: move keyspace_uses_tablets function 2025-04-08 09:13:14 +02:00
Yaron Kaikov
2dc7ea366b .github: Make "make-pr-ready-for-review" workflow run in base repo
In 57683c1a50b1ba05736fda2e815b018858e86579 we fixed the `token` error,
but removed the checkout part, which now causes the following error:
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Adding the repo checkout stage to avoid this error.

Fixes: https://github.com/scylladb/scylladb/issues/22765

Closes scylladb/scylladb#23641
2025-04-08 09:30:18 +03:00
Raphael S. Carvalho
0f59deffaa replica: Fix truncate and drop table after tablet migration happens
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.

Status quo (around the assert in truncate procedure):

1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.

Note: truncated_at is likely above low_mark_at due to steps 2 and 3.

The interesting part of the assert is:
    (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)

Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().

If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.

truncated_at and low_mark_at use millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.

Reproducer:

To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.

If a shard has 2 or more tablets, the table's highest RP refers to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).

So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.

Proposed solution:

The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.

We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).

But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.
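
The two assert shapes discussed above can be compared in a small Python sketch. It is a simplification under stated assumptions: replay positions and timestamps are modeled as plain integers, and the function names are hypothetical.

```python
def old_assert_ok(truncated_at: int, low_mark_at: int, rp: int, low_mark: int) -> bool:
    # The condition quoted in the commit message:
    # (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
    if truncated_at <= low_mark_at:
        return rp <= low_mark
    return low_mark <= rp


def proposed_assert_ok(rp: int, low_mark: int) -> bool:
    # The invariant the message argues for: low_mark >= RP. It is only
    # meaningful if low_mark is sampled late enough to cover all data
    # that may have been flushed.
    return rp <= low_mark


# Reproducer shape: the tablet holding the table-wide highest RP was
# migrated away, so the sstables left on the shard have a lower RP.
low_mark, rp = 10, 5
truncated_at, low_mark_at = 2, 1  # truncated_at > low_mark_at
```

With these values the old condition evaluates to false (the crash described above), while `low_mark >= rp` holds.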

Fixes #18059.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23560
2025-04-08 07:32:58 +03:00
Botond Dénes
0d39091df2 test/boost/row_cache_test: add memtable overlap check tests
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.
2025-04-08 00:11:36 -04:00
Botond Dénes
6c1f6427b3 replica/table: add error injection to memtable post-flush phase
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only active for the table specified in
the "table_name" injection parameter.
2025-04-08 00:11:36 -04:00
Botond Dénes
f7938e3f8b utils/error_injection: add a way to set parameters from error injection points
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceeding.
2025-04-08 00:11:36 -04:00
Botond Dénes
34b18d7ef4 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.
2025-04-08 00:11:36 -04:00
Botond Dénes
e5afd9b5fb test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the return hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
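
A minimal sketch of the ordering guarantee, assuming hosts can be matched to servers by address. `Server`, `Host` and `order_hosts_like_servers` are hypothetical names for illustration and are not the actual test/pylib code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Server:  # hypothetical: an entry of the passed-in nodes list
    ip: str


@dataclass(frozen=True)
class Host:  # hypothetical: a driver-side host object
    address: str


def order_hosts_like_servers(servers: List[Server], hosts: Iterable[Host]) -> List[Host]:
    # Index hosts by address, then emit them in the order of `servers`,
    # so that result[i] refers to the same instance as servers[i].
    by_address = {h.address: h for h in hosts}
    return [by_address[s.ip] for s in servers]
```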
2025-04-08 00:11:36 -04:00
Botond Dénes
df09b3f970 replica/mutation_dump: don't assume cells are live
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.
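
The shape of the fix, sketched in Python with a hypothetical `Cell` type (the real code operates on atomic cells in C++):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:  # hypothetical model of an atomic cell
    live: bool
    value: Optional[bytes] = None

    def is_live(self) -> bool:
        return self.live

    def get_value(self) -> Optional[bytes]:
        if not self.live:
            # Models the marshalling error hit when reading a dead cell.
            raise ValueError("dead cell has no value")
        return self.value


def dump_cell(cell: Cell) -> dict:
    out = {"is_live": cell.is_live()}
    # Only live cells carry a value; check before extracting it.
    if cell.is_live():
        out["value"] = cell.get_value()
    return out
```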
2025-04-08 00:11:36 -04:00
Botond Dénes
cb76cafb60 replica/database: do_apply() add error injection point
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.
2025-04-08 00:11:35 -04:00
Botond Dénes
d126ea09ba replica: improve memtable overlap checks for the cache
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable, which still has active reads against it.

To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.
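
A simplified model of the extended overlap check, assuming each memtable is summarized by the minimum write timestamp it contains and that a tombstone may be purged only if it is older than every such minimum. The function names are hypothetical.

```python
from typing import Iterable


def max_purgeable(memtable_min_timestamps: Iterable[int]) -> float:
    # With no memtables at all, nothing restricts garbage collection.
    return min(memtable_min_timestamps, default=float("inf"))


def can_gc(tombstone_ts: int,
           active: Iterable[int],
           flushing: Iterable[int],
           flushed_with_readers: Iterable[int]) -> bool:
    # Consult every memtable that may still hold covered data: the active
    # one, those being flushed, and flushed ones kept alive by readers.
    limit = max_purgeable([*active, *flushing, *flushed_with_readers])
    return tombstone_ts < limit
```

Checking only the active memtable would allow collecting a tombstone with timestamp 100 when a flushing memtable still holds data written at timestamp 50; the extended check refuses.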
2025-04-08 00:11:35 -04:00
Botond Dénes
7e600a0747 replica/memtable: add is_merging_to_cache()
And set it when the memtable is merged to cache.
2025-04-08 00:11:35 -04:00
Botond Dénes
6b5b563ef7 db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstones which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.
2025-04-08 00:11:35 -04:00
Botond Dénes
c2518cdf1a mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
additional copy added.

This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start, this makes the key reference passed to the compactor
corrupt (refer to moved-from value). This in turn means that subsequent
GC checks done by the compactor will be done with a corrupt key and
therefore can result in tombstones being garbage-collected while they
still cover data elsewhere (data resurrection).

The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.
2025-04-08 00:11:35 -04:00
Avi Kivity
8d2a41db82 Merge "Fixes for gossiper conversion to host id" from Gleb
"
The series contains fixes to gossiper conversion to host id. There are
two fixes where we could erroneously send outdated entry in a gossiper
message and a fix for force_remove_endpoint which was not converted to
work on host id and this caused it to not delete the entry in some cases
(in replace with the same ip case).
"

* 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev:
  gossiper: send newest entry in a digest message
  gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter
  gossiper: move force_remove_endpoint to work on host id
  gossiper: do not send outdated endpoint in gossiper round
2025-04-07 17:04:28 +03:00
Michał Chojnowski
827d774241 test_sstable_compression_dictionaries: reproduce an internal error in debug logging
Extend one of the tests so that it reproduces #23624,
by creating a situation where no-compression SSTables are
handled with debug logging enabled.
2025-04-07 13:05:04 +02:00
Michał Chojnowski
056da4b326 compress: fix an internal error when a specific debug log is enabled
While iterating over the recent 69684e16d8 series,
I shot myself in the foot by defining `algorithm_to_name(algorithm::none)`
to be an internal error, and later calling that anyway in a debug log.

(Tests didn't catch it because there's no test which simultaneously
enables the debug log and configures some table to have no compression).

This proves that `algorithm_to_name` is too much of a footgun.
Fix it so that calling `algorithm_to_name(algorithm::none)` is legal.
In hindsight, I should have done that immediately.
2025-04-07 13:05:03 +02:00
dependabot[bot]
a899cae158 build(deps): bump sphinx-scylladb-theme from 1.8.5 to 1.8.6 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.5 to 1.8.6.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.5...1.8.6)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.8.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#23537
2025-04-07 13:42:19 +03:00
Emil Maskovsky
76ceaf129b raft: distribute voters by rack inside DC
Distribute the voters evenly across racks in the datacenters.

When distributing the voters across datacenters, the datacenters with
more racks will be preferred in case of a tie. Also, in case of
asymmetric voter distribution (2 DCs), the DC with more racks will have
more voters (if the node counts allow it).

In case of a single datacenter, the voters will be distributed across
racks evenly (in a similar manner as done for the whole datacenters).

The intention is that similar to losing a datacenter, we want to avoid
losing the majority if a rack goes down - so if there are multiple racks,
we want to distribute the voters across them in such a way that losing
the whole rack will not cause the majority loss (if possible).
2025-04-07 12:31:37 +02:00
Emil Maskovsky
831fae4bff raft/test: fix lint warnings in test_raft_no_quorum
Code cleanup - fixed lint warnings in `test_raft_no_quorum` test.
2025-04-07 12:31:37 +02:00
Emil Maskovsky
92f6662cd1 raft/test: add the upgrade test for limited voters feature
We test the upgrade scenario of the limited voters feature - first we
start the cluster with the limited voters feature disabled ("old code"),
then we upgrade the cluster to the version with the limited voters
feature enabled ("new code").

The nodes are being upgraded one by one and we test that the cluster
still works (doesn't e.g. lose the majority).
2025-04-07 12:31:37 +02:00
Emil Maskovsky
a740623fa1 raft topology: handle on_up/on_down to add/remove node from voters
Adding and removing the voters based on the node up/down events.

This improves the availability of the system by automatically
adjusting the number of voters in the system to use the alive nodes in
precedence.

We can then also drop the voter removal from the `write_both_read_old`
to further simplify the code - the node will be removed from the voters
when it goes down. However we only can do that in case the feature is
enabled.
2025-04-07 12:31:37 +02:00
Emil Maskovsky
dc6afd47b7 raft: fix the indentation after the limited voters changes
Fix the indentation that needs to be changed because of the added condition.

This is done separately to make it easier to review the main commit with
the functional changes.
2025-04-07 12:31:37 +02:00
Emil Maskovsky
1d06ea3a5a raft: implement the limited voters feature
Currently if raft is enabled all nodes are voters in group0. However it
is not necessary to have all nodes to be voters - it only slows down
the raft group operation (since the quorum is large) and makes
deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along
1 DC with 10 nodes will lose the majority if the large DC is isolated).

The topology coordinator will now maintain a state where there are only
limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and
rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the
  current distribution of voters - either if there are still some voter
  "slots" available, or if the new node is a better candidate than some
  existing voter (in which case the existing node voter status might be
  revoked).
* When a voter node is removed or stopped (shut down), its voter status
  is revoked and another node might become a voter instead (this can also
  depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of datacenters
  (DCs) or racks, the rebalance action might become wider (as there are
  some special rules applying to 1 vs 2 vs more DCs, also changing the
  number of racks might cause similar effects in the voters distribution)

Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible),
  meaning that we can tolerate a loss of the DC with the smaller number
  of voters (if both would have the same number of voters we'd lose the
  majority if any of the DCs is lost).
  For example, if we have 2 DCs with 2 nodes each, one of them will only
  have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has
  more racks than the other and the node count allows it, the DC with
  more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be so that every
  DC has strictly less than half of the total voters (so a loss of any
  of the DCs cannot lead to the majority loss). Again, DCs with more
  racks are being preferred in the voter distribution.
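
The majority arithmetic behind these rules can be checked with a small Python sketch. It does not implement the voter allocation itself; `tolerated_dc_losses` is a hypothetical helper that, given a voter count per DC, lists the DCs whose loss still leaves a majority.

```python
from typing import Dict, List


def tolerated_dc_losses(voters_per_dc: Dict[str, int]) -> List[str]:
    total = sum(voters_per_dc.values())
    majority = total // 2 + 1
    # Losing a DC is tolerated if the remaining voters still form a majority.
    return sorted(dc for dc, v in voters_per_dc.items() if total - v >= majority)
```

For example, with 2 DCs holding 1 and 2 voters only the loss of the smaller DC is tolerated, while with 2 and 2 voters no DC loss is tolerated. With all nodes voting in the 5 + 5 + 10 node deployment mentioned earlier, losing the 10-node DC leaves 10 of 20 voters, which is not a majority.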

At the moment we will be handling the zero-token nodes in the same way
as the regular nodes (i.e. the zero-token nodes will not take any
priority in the voter distribution). Technically it doesn't make much
sense to have a zero-token node that is not a voter (when there are
regular nodes in the same DC being voters), but currently the intended
purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs,
creating a third DC with zero-token nodes only), so for that intended
purpose no special handling is needed and will work out of the box.
If a preference of zero token nodes will eventually be needed/requested,
it will be added separately from this PR.

Currently the voter limits will not be configurable (we might introduce
configurable limits later if that would be needed/requested).

The feature is enabled by the `group0_limited_voters` feature flag
to avoid issues with cluster upgrade (the feature will be only enabled
once all nodes in the cluster are upgraded to the version supporting
the feature).

Fixes: scylladb/scylladb#18793
2025-04-07 12:31:18 +02:00
Lakshmi Narayanan Sreethar
750f4baf44 replica/table::do_apply : do not check for async gate's closure
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.

Fixes #23348

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#23579
2025-04-07 13:27:22 +03:00
Emil Maskovsky
8b186ab0ff raft: drop the voter removal from the decommission
In the particular case of node decommission, this code doesn't really
matter in production and only confuses us. Losing majority is
an extremely rare event, and for this code to help one would have
to lose majority in a very specific way (exactly half of the nodes die
in a short time window during decommission), which is unrealistic.

In addition, this code will be completely irrelevant (and would never be
executed) once we implement #23266.

Refs: scylladb/scylladb#23266
2025-04-07 12:23:25 +02:00
Emil Maskovsky
00794af94d raft/test: disable the stop_before_becoming_raft_voter test
The workflow of becoming a voter changes with the "limited voters"
feature, as the node will no longer become a voter on its own, but the
votership is being managed by the topology coordinator. This therefore
breaks the `stop_before_becoming_raft_voter` test, as that injection
relies on the old behavior.

We will disable the test for this particular case for now and address
either fixing or complete removal of the test in a follow-up task.

Refs: scylladb/scylladb#23418
2025-04-07 12:23:25 +02:00
Emil Maskovsky
57df5d013e raft/test: stop the server less gracefully in the voters test
Stopping the test gracefully might hide some issues, therefore we want
to stop it forcefully to make sure that the code can handle it.

Added a parameter to stop gracefully or less gracefully (so that we test
both cases).
2025-04-07 12:22:19 +02:00
Pavel Emelyanov
10376b5b85 db: Re-use database::snapshot_table_on_all_shards()
There are two snapshot-on-all-shards methods on the database -- the one
that snapshots a keyspace and the one that snapshots a vector of tables.
The latter snapshots a single table with a neat helper, while the former
has the helper open-coded.

Re-using the helper in keyspace snapshot is worth it, but needs to patch
the helper to work on uuid, rather than ks:cf pair of strings.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23532
2025-04-07 11:55:43 +02:00
Nadav Har'El
84fd52315f alternator: in GetRecords, enforce Limit to be <= 1000
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentation says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.

The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.

Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).

Fixes #23534

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23547
2025-04-07 12:52:03 +03:00
Kefu Chai
55777812d4 s3/client: Optimize file streaming with zero-copy multipart uploads
When streaming files using multipart upload, switch from using
`output_stream::write(const char*, size_t)` to passing buffer objects
directly to `output_stream::write()`. This eliminates unnecessary memory
copying that occurred when the original implementation had to
defensively copy data before sending.

The buffer objects can now be safely reused by the output stream instead
of creating deep copies, which should improve performance by reducing
memory operations during S3 file uploads.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23567
2025-04-07 12:50:06 +03:00
Avi Kivity
ac3d25eb44 sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables
The incremental reader selector maintains an unordered_set of
sstables that are already engaged, and uses std::views::filter
to filter those out. It adds the sstable under consideration to the
set, and if addition failed (because it's already in) then it
filters it out.

This breaks if the filter view is executed twice - the first pass
will add every sstable to the set, and the second will consider
every sstable already filtered. This is what happens with
libstdc++ 15 (due to the addition of vector(from_range_t) constructor),
which uses the first pass to calculate the vector size
and the second pass to insert the elements into a correctly-sized
vector.

Fix by open-coding the loop.
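
The pitfall can be reproduced outside C++. The sketch below is a Python analogue, not the sstable_set code: `make_engaged_filter`, `to_list_two_pass` and `select_open_coded` are hypothetical names, and the two-pass helper only imitates a container constructor that counts first and copies second.

```python
def make_engaged_filter(engaged):
    """Build a predicate with a side effect: it records what it accepts."""
    def not_yet_engaged(sst):
        if sst in engaged:
            return False
        engaged.add(sst)
        return True
    return not_yet_engaged


def to_list_two_pass(candidates, pred):
    """Imitate a constructor that sizes the result first, then fills it."""
    count = sum(1 for sst in candidates if pred(sst))  # pass 1: engages everything
    items = [sst for sst in candidates if pred(sst)]   # pass 2: sees all as engaged
    return count, items


def select_open_coded(candidates, engaged):
    """Open-coded loop: each candidate is examined exactly once."""
    selected = []
    for sst in candidates:
        if sst not in engaged:
            engaged.add(sst)
            selected.append(sst)
    return selected
```

With three fresh candidates, the two-pass helper counts three elements but collects none, while the open-coded loop returns all three.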

Closes scylladb/scylladb#23597
2025-04-07 12:49:04 +03:00
Gleb Natapov
a982db326e gossiper: send newest entry in a digest message
In cases where two entries have the same ip address, send information
only for the newest one. Currently we send both, which makes the receiver
use one of them at random, and it may be the outdated one (though that
should only cause more data than needed to be requested).
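
A rough sketch of the selection, in Python rather than the gossiper's C++. The `Entry` type and `newest_per_ip` are hypothetical, and "newest" is assumed here to mean the highest generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    host_id: str     # key of the host-id-based map
    ip: str          # address the digest is keyed by on the wire
    generation: int  # assumption: a larger generation means a newer entry


def newest_per_ip(entries):
    """Keep a single entry per ip address, preferring the newest one."""
    newest = {}
    for entry in entries:
        current = newest.get(entry.ip)
        if current is None or entry.generation > current.generation:
            newest[entry.ip] = entry
    return list(newest.values())
```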
2025-04-06 18:39:24 +03:00
Gleb Natapov
8d534ee68e gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter 2025-04-06 18:39:24 +03:00
Gleb Natapov
6f53611337 gossiper: move force_remove_endpoint to work on host id
Since the gossiper works on host ids now, it is incorrect to leave this
function working on ip. It makes it impossible to delete an outdated entry,
since the "gossiper.get_host_id(endpoint) != id" check will always be
false for such entries (get_host_id() always returns the most up-to-date
mapping).
2025-04-06 18:39:24 +03:00
Amnon Heiman
b55f24c14d alternator: Add tests for the batch items histograms
This patch adds a test for the batch‑items histogram for both get and
write operations.

It updates the check_increases_metric_exact helper function so that it
takes a list of expected values and labels (labels can be None).
This makes it easy to test multiple buckets in a histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-04-06 18:22:23 +03:00
Amnon Heiman
c060c0b867 alternator: Add histogram for batch item count
This patch adds an estimated_histogram for alternator batch item count.
estimated_histogram can be used with values starting from 1 with an
exponential factor of 1.2, which nicely covers values up to 20, but with
only 22 buckets it can reach all the way to 100 (plus infinity).
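
To make the bucket arithmetic concrete, here is a small Python sketch. It assumes the usual estimated-histogram rule in which each bound is the previous one times 1.2, rounded, and at least one larger than its predecessor. Whether the implementation rounds exactly this way is an assumption, and `bucket_bounds` is a hypothetical helper.

```python
def bucket_bounds(count, factor=1.2, first=1):
    """Return `count` bucket bounds growing by `factor`, at least +1 per step."""
    bounds = [first]
    while len(bounds) < count:
        last = bounds[-1]
        # Small values would round back to themselves, so force progress.
        bounds.append(max(last + 1, round(last * factor)))
    return bounds


print(bucket_bounds(22))
# [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 17, 20, 24, 29, 35, 42, 50, 60, 72, 86, 103]
```

Under this rule the bounds are dense up to about 20 and the 22nd bound is just above 100, matching the description above.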

Aside from the new histograms for get and write batches, a helper
function was added to return the histogram in the metric format without
changing its resolution (which is the metric’s default behaviour).

The histogram will be reported once per node rather than once per shard.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-04-06 18:22:13 +03:00
Marcin Maliszkiewicz
b94acfb37b test: remove alternator code from perf-simple-query
This kind of benchmark was superseded by perf-alternator
which has more options, workflows and most importantly
measures overhead of http server layer (including json parsing).

There is no need to maintain additional code in perf-simple-query.

Closes scylladb/scylladb#23474
2025-04-06 18:15:16 +03:00
Pavel Emelyanov
d4f3a3ee4f cql: Remove unused "initial_tablets" mention from guardrails
All tablets configuration was moved into its own "with tablets" section,
so this option name can no longer appear among replication factors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23555
2025-04-06 16:52:07 +03:00
Gleb Natapov
df6cd87bcc gossiper: do not send outdated endpoint in gossiper round
Now that the gossiper map is id based there can be a situation where two
entries have the same ip. The shadow round should send the newest one in
this case. The patch makes it so.

Fixes: #23553
2025-04-06 15:08:03 +03:00
Nadav Har'El
431de48df9 test/alternator: test for item with many attributes
A user complained that he couldn't read or write an item with more than
16 attributes (!) in Alternator. This isn't true, but I realized that we
don't have a simple test for this case - all tests use just a few attributes.
So let's add such a test, doing PutItem, UpdateItem and GetItem with 400
attributes. Unsurprisingly, the test passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23568
2025-04-03 22:35:49 +03:00
Nadav Har'El
a9a6f9eecc test/alternator: increase timeout in Alternator RBAC test
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.

We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests: RBAC is configured via CQL, so these Alternator tests
unusually use CQL.

So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.

Fixes #23569.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23578
2025-04-03 22:31:08 +03:00
Benny Halevy
cdf9fe9e50 Update seastar submodule
* seastar 2f13c461...ed8952fb (24):
  > file: explain dsync check in flush method
  > gate: add named_gate
  > tests: unit: add gate_test
  > reactor: Remove global task_quota extern declaration
  > future: Move report_failed_future to internal namespace
  > update boost cooking URL
  > smp: prefault: clear memory map after threads join
  > change format to seastar::format
  > Prevent move / copy constructor / assignment on backtrace_buffer
  > Remove unnecessary flush calls from backtrace_buffer usage points
  > Make backtrace_buffer flush on destruction
  > Add `backtrace_buffer&` param to maybe_report_kernel_trace function
  > Prevent empty kernel callstack messages
  > Make cpu_stall_detector_linux_perf_event::maybe_report_kernel_trace function protected.
  > iotune: Add cli flag to force io depth
  > smp: prefault: decouple _stop_request from join_threads
  > reactor: more info, robustness on segfault
  > net/udp: fix ipv4_udp::next_port calculation
  > map_reduce: prevent mapper or reducer exception from poisoning state
  > build: Re-enable ASan's verify_asan_link_order check
  > tests: enable/disable internet-dependent tests at runtime
  > test: tls_test: rename test_simple_x509_client variants to avoid naming conflicts
  > tests: extend test.py to accept arbitrary ctest parameters from positional args
  > tests: add a handle for building tests in "offline" mode

Closes scylladb/scylladb#23566
2025-04-03 19:45:37 +03:00
Botond Dénes
1198213000 Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378

Closes scylladb/scylladb#23478

* github.com:scylladb/scylladb:
  tablets: Make tablet allocation equalize per-shard load
  tablets: load_balancer: Fix reporting of total load per node
2025-04-03 16:32:53 +03:00
Botond Dénes
fcdae20fd1 Merge 'Add tablet enforcing option' from Benny Halevy
This series adds a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing
`enable_tablets` option. It can be set to the following values:
    disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
    enabled:  New keyspaces use tablets by default, unless disabled by the tablets={'enabled':false} option
    enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option

`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either case, tablets can be opted-in or out using the `tablets={'enabled':...}`
keyspace option, when the keyspace is created.

`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`.
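
The three modes can be summarised as a small decision function. This is an illustrative Python sketch of the rules described above, not the server code. `keyspace_uses_tablets` is a hypothetical name, and `enabled_option` stands for the value of `tablets={'enabled':...}` in CREATE KEYSPACE (None when the option is absent).

```python
def keyspace_uses_tablets(mode, enabled_option=None):
    """Decide whether a new keyspace uses tablets under the given mode."""
    if mode not in ("disabled", "enabled", "enforced"):
        raise ValueError(f"unknown tablets_mode_for_new_keyspaces: {mode!r}")
    if mode == "enforced":
        if enabled_option is False:
            # Opting out is rejected when tablets are enforced.
            raise ValueError("tablets cannot be disabled in enforced mode")
        return True
    if enabled_option is not None:
        # An explicit per-keyspace choice overrides the default.
        return enabled_option
    return mode == "enabled"
```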

Refs scylladb/scylla-enterprise#4355

* Requires backport to 2025.1

Closes scylladb/scylladb#22273

* github.com:scylladb/scylladb:
  boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
  tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
  db/config: add tablets_mode_for_new_keyspaces option
2025-04-03 16:32:19 +03:00
Kefu Chai
3760a1c85e cql3: Remove unnecessary 'virtual' specifiers from final class methods
Remove 'virtual' specifiers from member functions in final classes where
they can never be overridden. This addresses Clang errors like:

```
/home/kefu/dev/scylladb/cql3/column_identifier.hh:85:21: error: virtual method 'to_string' is inside a 'final' class and can never be overridden [-Werror,-Wunnecessary-virtual-specifier]
   85 |     virtual sstring to_string() const;
      |                     ^
1 error generated.
```

This change improves code clarity and maintainability by eliminating
redundant modifiers that could cause confusion.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23570
2025-04-03 13:51:42 +03:00
Tomasz Grabiec
fe8187e594 Merge 'repair: release erm in repair_writer_impl::create_writer when possible' from Aleksandra Martyniuk
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table can be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.

Fixes: #23453.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#23455

* github.com:scylladb/scylladb:
  test: add test to check concurrent migration and repair of two different tablets
  repair: release erm in repair_writer_impl::create_writer when possible
2025-04-03 11:15:08 +02:00
Botond Dénes
7bbfa5293f test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode, and the test read times out.
Increase timeout to 1 minute to avoid this.

Fixes: #23513

Closes scylladb/scylladb#23558
2025-04-03 10:42:11 +03:00
Botond Dénes
07510c07a0 readers/mutation_readers: queue_reader_handle_v2::push_end_of_stream() raise _ex if set
Instead of raising std::runtime_error("Dangling queue_reader_handle_v2")
unconditionally. push() already raises _ex if set, best to be
consistent.
Unconditionally raising std::runtime_error can cause an error to be
logged, when aborting an operation involving a queue reader.
Although the original exception passed to
queue_reader_handle_v2::abort() is most likely handled by higher level
code (not logged), the generic std::runtime_error raised is not and
therefore is logged.

Fixes: #23550

Closes scylladb/scylladb#23554
2025-04-03 10:39:56 +03:00
Pavel Emelyanov
3bf4768205 Merge 'Unify http transport in EAR to use seastar http client' from Calle Wilund
Fixes #22925
Refs #22885

Some providers in EAR were written before seastar got its own native http connector (as it is). Thus hand-made connectivity is used there.

This PR unifies the code paths, and also extract some abstraction between providers where possible.
One big reason for this is the handling of abrupt disconnects and retries; Seastar has some handling of things like EPIPE and ECONNRESET situations, that can be safely ignored in a REST call iff data was in fact transferred etc.

This PR mainly takes the usage of seastar httpclient from gcp connector, makes a wrapper matching most of the usage of local client in kms connector, ensures common functionality and then replaces the code in the individual connectors.

Closes scylladb/scylladb#22926

* github.com:scylladb/scylladb:
  encryption::gcp: Use seastar http client wrapper
  encryption::kms: Drop local http client and use seastar wrapper
  encryption: Break out a "httpclient" wrapper for seastar httpclient
2025-04-03 10:35:14 +03:00
Kefu Chai
0cd6cf1dc5 main: Remove unused member variable _sys_ks
Fixes a Clang error by removing the unused private field
`sstable_dict_deleter::_sys_ks` that was flagged with:
[-Werror,-Wunused-private-field]
```
/home/kefu/.local/bin/clang++ -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -I/usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -flto=thin -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/= -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -ffile-prefix-map=/home/kefu/dev/scylladb/build/=build -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -ffat-lto-objects -std=gnu++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -MD -MT CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -MF CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o.d -o CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -c /home/kefu/dev/scylladb/main.cc
/home/kefu/dev/scylladb/main.cc:1660:38: error: private field '_sys_ks' is not used [-Werror,-Wunused-private-field]
 1660 |                 db::system_keyspace& _sys_ks;
      |                                      ^
```

The member variable is not referenced anywhere in the code,
so removing it improves maintainability without affecting
functionality.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23545
2025-04-02 20:07:39 +03:00
Evgeniy Naydanov
84a5037056 test.py: cluster/suite.yaml: update test filters
After switching to subfolders, the filter `run_in_debug` for the
random failures test was just copied as is, but it needs to include
the subfolder.

Also, `test_old_ip_notification_repro` was deleted, so, we
don't need it in the `skip_in_debug` list.

Closes scylladb/scylladb#23492
2025-04-02 19:29:27 +03:00
Kefu Chai
a09ec9d60d .github: add delay before checking for required PR labels
Improve the GitHub workflow to prevent premature email notifications
about missing labels. Previously, contributors without write permissions
to the scylladb repo would receive immediate notification emails about
missing required backport labels, even if they were in the process of
adding them.

This change introduces a 1-minute grace period before checking for
required labels, giving contributors sufficient time to add necessary
labels (like backport labels) to their pull requests before any warning
notifications are sent.

The delay makes the experience more user-friendly for non-maintainer
contributors while maintaining the labeling requirements.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23539
2025-04-02 19:28:15 +03:00
Aleksandra Martyniuk
bae6711809 test: add test to check concurrent migration and repair of two different tablets 2025-04-02 15:30:17 +02:00
Radosław Cybulski
c36614e16d alternator: add size check to BatchItemWrite
Add a size check for the BatchWriteItem command - if the item count
exceeds the configuration value `alternator_maximum_batch_write_size`,
an error will be raised and no modification will happen.

This is done to synchronize with DynamoDB, where the maximum size of
BatchWriteItem is 25. To avoid complaints from clients who rely on
our BatchWriteItem being limitless, we set the default value
to 100.

Fixes #5057

Closes scylladb/scylladb#23232
2025-04-02 14:48:00 +03:00
Avi Kivity
882f405eed Merge "Convert gossiper's endpoint state map to be host id based" from Gleb
"
The series makes the endpoint state map in the gossiper addressable by host
id instead of ips. The transition has implications outside of the
gossiper as well. Gossiper based topology operations are affected by
this change since they assume that the mapping is ip based.

The on-wire protocol is not affected by the change, as maps that are sent by
the gossiper protocol remain ip based. If an old node sends two different
entries for the same host id, the one with the newer generation is applied.
If a new node has two ids that are mapped to the same ip, the newer one is
added to the outgoing map.

Interoperability was verified manually by running mixed cluster.

The series concludes the conversion of the system to be host id based.
"

* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
  gossiper: make examine_gossiper private
  gossiper: rename get_nodes_with_host_id to get_node_ip
  treewide: drop id parameter from gossiper::for_each_endpoint_state
  treewide: move gossiper to index nodes by host id
  gossiper: drop ip from replicate function parameters
  gossiper: drop ip from apply_new_states parameters
  gossiper: drop address from handle_major_state_change parameter list
  gossiper: pass rpc::client_info to gossiper_shutdown verb handler
  gossiper: add try_get_host_id function
  gossiper: add ip to endpoint_state
  serialization: fix std::map de-serializer to not invoke value's default constructor
  gossiper: drop template from  wait_alive_helper function
  gossiper: move get_supported_features and its users to host id
  storage_service: make candidates_for_removal host id based
  gossiper: use peers table to detect address change
  storage_service: use std::views::keys instead of std::views::transform that returns a key
  gossiper: move _pending_mark_alive_endpoints to host id
  gossiper: do not allow to assassinate endpoint in raft topology mode
  gossiper: fix indentation after previous patch
  gossiper: do not allow to assassinate non existing endpoint
2025-04-02 12:30:00 +03:00
Pavel Emelyanov
832d83ae4b sstables_loader: Do not stop sharded<progress_monitor> unconditionally
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.
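
The start/stop asymmetry can be illustrated with a toy model. This Python sketch uses hypothetical `ProgressMonitor` and `LoadTask` classes and is not the sstables_loader code. It only shows the shape of the guard, stopping a component only if it was actually started, and is not a claim about how the patch implements it.

```python
class ProgressMonitor:
    """Toy component that must not be stopped unless it was started."""

    def __init__(self):
        self.started = False

    def start(self):
        self.started = True

    def stop(self):
        if not self.started:
            raise RuntimeError("stop() called on a monitor that never started")
        self.started = False


class LoadTask:
    def __init__(self, fail_early=False):
        self.monitor = ProgressMonitor()
        self.fail_early = fail_early

    def run(self):
        if self.fail_early:
            # E.g. an early abort, or an error while collecting sstables.
            raise RuntimeError("failed before the monitor was started")
        self.monitor.start()

    def release_resources(self):
        # Guarded stop: skip it when run() never reached start().
        if self.monitor.started:
            self.monitor.stop()
```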

fixes: #23231

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23483
2025-04-02 12:09:02 +03:00
Kefu Chai
6da758d74c config: mark uuid_sstable_identifiers_enabled unused
the option `uuid_sstable_identifiers_enabled` was introduced in
f014ccf3. the first version which has this change was 5.4, and
6.1 has been branched. during the discussion of backup and restore,
we realized that we've been making efforts to address problems which
could have been addressed by sstables with UUID-based identifiers.
see also #10459, which is the issue that proposed implementing UUID-v1
based sstable identifiers.

now that two major releases have passed, we should have the luxury to mark
this option "unused". this option was previously introduced to
keep backward compatibility, and to allow users to opt out of the
feature for some reason.

so in this change, mark the option unused, so that if any user still
sets this option on the command line, they will get a clear error. but
we still parse and handle this setting in `scylla.yaml`, so that this
option is still respected for existing settings, and for existing tests,
which are not yet prepared for the uuid-based sstable identifiers.

Refs #10459
Fixes #20337

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20341
2025-04-01 20:21:47 +03:00
Botond Dénes
3bad46a6e2 docs/dev: add tombstone.md
An exhaustive document on the tombstone-related internal logic as well
as the user-facing aspects.

Closes scylladb/scylladb#23454
2025-04-01 20:17:57 +03:00
Botond Dénes
a0d8102a1f replica/memtable: s/make_flat_reader/make_mutation_reader/
Following the recent refactoring that removed "flat" and "v2" from reader
names, replace all the fully qualified names with simply "mutation_reader".

Closes scylladb/scylladb#23346
2025-04-01 17:58:13 +03:00
Artsiom Mishuta
032b28d793 test.py: remove pylib_test from test.py/CI run
pylib_test contains one pure Python test. This test does not test Scylla.
This test is not deleted because it can be useful to run during pre-commit,
for example, but it definitely should not be run in CI in modes with 3 repeats each.
It does not make sense. It is a unit test for the test.py framework.

Note: test still can be easily run by pytest via the command:
./tools/toolchain/dbuild pytest test/pylib_test

Closes scylladb/scylladb#23181
2025-04-01 16:43:45 +03:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed.

This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Avi Kivity
69684e16d8 Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski
This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes:

- We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards.
- We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options).
- We add a cluster feature which guards the creation of dictionary-compressed SSTables.
- We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards.
- We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries.
- We add an HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that).
- We add an HTTP API call which estimates the impact of various available compression configurations on the compression ratio.
- We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement.

Known imperfections:
- The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first.

New feature, no backporting involved.

Closes scylladb/scylladb#23025

* github.com:scylladb/scylladb:
  docs: add user-facing documentation for SSTable compression with shared dicts
  docs/dev: add sstable-compression-dicts.md
  test: add test_sstable_compression_dictionaries_autotrain.py
  test: add test_sstable_compression_dictionaries_basic.py
  test/pylib/rest_client: add `keyspace_upgrade_sstables` helper
  main: run a sstable_dict_autotrainer
  api: add the estimate_compression_ratios API call
  dict_autotrainer: introduce sstable_dict_autotrainer
  db/system_keyspace: add query_dict_timestamp
  compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor
  main: clean up sstable compression dicts after table drops
  sstables/compress: discard hidden compression options after the decompressor is created
  compress: change compressor_ptr from shared_ptr to unique_ptr
  api: add the retrain_dict API call
  storage_service: add some dict-related routines
  main: in compression_dict_updated_callback, recognize and use SSTable compression dicts
  storage_service: add do_sample_sstables()
  messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs
  db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict
  raft/group0_state_machine: on `system.dicts` mutations, pass the affected partition keys to the callback
  database: add sample_data_files()
  database: add take_sstable_set_snapshot()
  compress: teach `lz4_processor` about dictionaries
  compress: teach `zstd_processor` about dictionaries
  sstables: delegate compressor creation to the compressor factory
  sstables: plug an `sstable_compressor_factory` into `sstables_manager`
  sstables: introduce sstable_compressor_factory
  utils/hashers: add get_sha256()
  gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature
  compress: add hidden dictionary options
  compress: remove `compression_parameters::get_compressor()`
  sstables/compress: remove get_sstable_compressor()
  sstables/compress: move ownership of `compressor` to `sstable::compression`
  compress: remove compressor::option_names()
  compress: clean up the constructor of zstd_processor
  compress: squash zstd.cc into compress.cc
  sstables/compress: break the dependency of `compression_parameters` on `compressor`
  compress.hh: switch compressor::name() from an instance member to a virtual call
  bytes: adapt fmt_hex to std::span<const std::byte>
2025-04-01 12:47:34 +03:00
Aleksandra Martyniuk
1dc29ddc86 repair: release erm in repair_writer_impl::create_writer when possible
Currently, repair_writer_impl::create_writer keeps erm to ensure
that a sharder is valid. If we repair a tablet, erm blocks the state
machine and no operation on any tablet of this table can be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the
operation is safe and that tablet operations on the whole table
aren't blocked.

Fixes: #23453.
2025-04-01 11:34:21 +02:00
Calle Wilund
c6674619b7 encryption::gcp: Use seastar http client wrapper
Refs #22925

Remove direct usage of seastar http client, and instead share this
with other connectors via the http client wrapper type.
2025-04-01 08:18:05 +00:00
Calle Wilund
491748cde3 encryption::kms: Drop local http client and use seastar wrapper
Fixes #22925

Removes the boost based http client in favour of our seastar
wrapper.
2025-04-01 08:18:05 +00:00
Calle Wilund
878f76df1f encryption: Break out a "httpclient" wrapper for seastar httpclient
Refs #22925

Adds some wrapping and helpers for the kind of REST operations we
expect to perform.

Some things like stream formatting are redundant vis-à-vis seastar,
but on that level we only have \r\n encoded writing to
output_stream and similar, which is less useful for things like
logging.
2025-04-01 08:18:05 +00:00
Piotr Smaron
370707b111 service: restore default timeout in announce_with_raft
This restored timeout seems to have been accidentally removed in
7081215552 (r2005352424).
Without it, `raft_server_with_timeouts::run_with_timeout` will get
`std::nullopt` as a value of the `timeout` parameter and perform an
operation without any timeout, whereas previously it would have waited
for the default timeout specified in
`raft_server_for_group::default_op_timeout`.

Closes scylladb/scylladb#23380
2025-04-01 10:20:16 +03:00
David Garcia
6e61fc323b docs: redirect to docs.scylladb.com/manual/
Define a custom alert to redirect users to the latest version of the docs in https://docs.scylladb.com/manual/

Closes scylladb/scylladb#22636
2025-04-01 09:22:56 +03:00
Botond Dénes
bd9f51a29c Merge 'transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Vladislav Zolotarov
A default timestamp (not to be confused with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

Closes scylladb/scylladb#23174

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-01 09:16:02 +03:00
Pavel Emelyanov
b5a124f60c sstable_directory: Move highest_generation_seen() to distributed_loader.cc
This method is only used by the loader code (and tests). Also, its
highest_version_seen() peer already sits in the loader code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23324
2025-04-01 09:15:14 +03:00
Pavel Emelyanov
eafc767cc6 sstable/filesystem: Add convenience helper to generate filename
In its operations the fs storage carefully generates full filename from
all sstable parameters -- version, format, generation, keyspace and
table names and component type or name. However, in all of the cases
format, version and keyspace:table names are inherited from the sstable
being operated on. This calls for a filename generation helper that
wraps most of the arguments thus making the lines shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23384
2025-04-01 09:14:44 +03:00
Botond Dénes
0fdf2a2090 Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy
So that a multi-dc/multi-rack cluster can be populated
in a single call.

* Enhancement, no backport required

Closes scylladb/scylladb#23341

* github.com:scylladb/scylladb:
  test/pylib: servers_add: add auto_rack_dc parameter
  test/pylib: servers_add: support list of property_files
2025-04-01 09:14:20 +03:00
Botond Dénes
94e8971308 scylla-gdb.py: improve scylla repairs command
Make output more readable by:
* group follower/master repair instances separately
* split repair details into one line for repair summary, then one line
  for each host info
* add indentation to make the output easier to follow

Also add -m|--memory option to calculate memory usage of repair buffers.

Example output:

    (gdb) scylla repairs -m
    Repairs for which this node is leader:
      (repair_meta*) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
      (repair_meta*) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished
      (repair_meta*) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished
      (repair_meta*) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
    Repairs for which this node is follower:
2025-04-01 01:53:35 -04:00
Botond Dénes
47c62a4cf2 scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__
There is currently no easy way to null-check seastar_lw_shared_ptr.
Comparing get() against 0 doesn't work: if _p is null, get() will return
comparing _p with 0 instead.
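
A minimal sketch of the pattern, assuming a wrapper that stores the raw pointer value in `_p`. `LwSharedPtrSketch` is a hypothetical stand-in, not the class in scylla-gdb.py, and it uses plain integers instead of gdb values.

```python
class LwSharedPtrSketch:
    """Hypothetical stand-in for a pointer wrapper with a null check."""

    def __init__(self, raw_p):
        # raw_p models the `_p` field; 0 means the pointer is null.
        self._p = raw_p

    def __bool__(self):
        # Compare the stored pointer with 0 instead of dereferencing it.
        return int(self._p) != 0

    # Python 2 spelling of the same truthiness hook.
    __nonzero__ = __bool__


null_ptr = LwSharedPtrSketch(0)
live_ptr = LwSharedPtrSketch(0x60503AB7F7B0)
```

With both hooks defined, callers can write `if ptr:` regardless of which Python version gdb embeds.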
2025-04-01 01:53:34 -04:00
Botond Dénes
f84bf43c96 scylla-gdb.py: introduce managed_bytes
Extracted from managed_bytes_printer. Make working with managed_bytes
easier. Abstracts how size and content is obtained.
2025-04-01 01:53:34 -04:00
Jenkins Promoter
6c528f5027 Update pgo profiles - aarch64 2025-04-01 04:45:44 +03:00
Jenkins Promoter
3c12029584 Update pgo profiles - x86_64 2025-04-01 04:27:11 +03:00
Michał Chojnowski
36be9d1c9b docs: add user-facing documentation for SSTable compression with shared dicts 2025-04-01 00:07:31 +02:00
Michał Chojnowski
d33ffb221b docs/dev: add sstable-compression-dicts.md 2025-04-01 00:07:31 +02:00
Michał Chojnowski
f851efd4fa test: add test_sstable_compression_dictionaries_autotrain.py
Adds a test which checks that sstable compression dict autotraining
does its job.
2025-04-01 00:07:31 +02:00
Michał Chojnowski
62da3d8363 test: add test_sstable_compression_dictionaries_basic.py
Add a basic integration test for SSTable compression with shared dictionaries.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
7b0eeefd79 test/pylib/rest_client: add keyspace_upgrade_sstables helper 2025-04-01 00:07:30 +02:00
Michał Chojnowski
3f7969313f main: run a sstable_dict_autotrainer
Create an instance of `sstable_dict_autotrainer` in `scylla_main`
and run it.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
a19d6d95f7 api: add the estimate_compression_ratios API call
Add an API call which estimates the effectiveness of possible
compression config changes.

This can be used to make an informed decision about whether to
change the compression method, without actually recompressing
any SSTables.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
4f0d453acf dict_autotrainer: introduce sstable_dict_autotrainer
Add a fiber responsible for periodic re-training of compression dictionaries
(for tables which opted into dict-aware compression).

As of this patch, it works like this:
every `$tick_period` (15 minutes), if we are the current Raft leader,
we check for dict-aware tables which have no dict, or a dict older
than `$retrain_period`.

For those tables, if they have enough data (>1GiB) for a training,
we train a new dict and check if it's significantly better
than the current one (provides ratio smaller than 95% of current ratio),
and if so, we update the dict.
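
The decision rules above can be sketched as follows. This is an illustration under stated assumptions, not the autotrainer's code: the function names are hypothetical, and "ratio" is taken to mean compressed size divided by uncompressed size, so smaller is better.

```python
from typing import Optional

MIN_DATASET_BYTES = 1 << 30   # ">1GiB" from the commit message
IMPROVEMENT_FACTOR = 0.95     # new ratio must be below 95% of the current one


def needs_training(is_leader: bool, dict_age_s: Optional[float],
                   retrain_period_s: float, data_bytes: int) -> bool:
    """Decide on one tick whether a table should get a training attempt.

    dict_age_s is None when the table has no dictionary yet.
    """
    if not is_leader:
        return False  # only the current Raft leader trains
    if dict_age_s is not None and dict_age_s <= retrain_period_s:
        return False  # the existing dictionary is still fresh
    return data_bytes > MIN_DATASET_BYTES


def should_publish(current_ratio: float, new_ratio: float) -> bool:
    """Publish only if the new dictionary is significantly better."""
    return new_ratio < IMPROVEMENT_FACTOR * current_ratio
```

For example, with a current ratio of 0.50 a new dictionary is published only if it reaches a ratio below 0.475.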
2025-04-01 00:07:30 +02:00
Michał Chojnowski
9d02e2c005 db/system_keyspace: add query_dict_timestamp
Adds a helper method which queries the creation timestamp
of a given dict in `system.dicts`.

We will later use the age of the current SSTable compression dict
to decide if another training should be done already.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
cb1b291051 compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor
Add new compressor names to `sstable_compression`.
When those names are configured in the schema,
new SSTables will be compressed with dict-aware Zstd or LZ4
respectively.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
bea866a46f main: clean up sstable compression dicts after table drops
When a table is dropped, its corresponding dictionary in `system.dicts`
-- if any -- should be deleted, otherwise it will remain forever as
garbage.

This commit implements such cleanup.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
cee504f66f sstables/compress: discard hidden compression options after the decompressor is created
Dictionary contents are kept in the list of "compression options" in the
header of `CompressionInfo.db`, and they are loaded from disk into
memory when the `sstable::compression` object is populated.

After the decompressor for the SSTable is created based on those
dict contents, they are not needed in RAM anymore. And since
they take up a sizeable amount of memory, we would like to free them.

In this patch, we discard all "hidden compression options"
(currently: only the dictionary contents) from the
`sstable::compression` object right after the decompressor is created.
(Those options are not supposed to be used for anything else anyway).
2025-04-01 00:07:30 +02:00
Michał Chojnowski
10fa4abde7 compress: change compressor_ptr from shared_ptr to unique_ptr
Cleanup patch. After we moved the ownership of compressors
to sstables, compressor objects never have shared lifetime.
`unique_ptr` is more appropriate for them than `shared_ptr` now.
(And besides expressing the intent better, using `unique_ptr`
prevents an accidental cross-shard `shared_ptr` copy).
2025-04-01 00:07:29 +02:00
Michał Chojnowski
58ae278d10 api: add the retrain_dict API call
Add an API call which will retrain the SSTable compression dictionary
for a given table.

Currently, it needs all nodes to be alive to succeed. We can relax this later.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
4115a6fece storage_service: add some dict-related routines
storage_service will be the interface between the API layer
(or the automatic training loop) and the dict machinery.
This commit implements the relevant interface for that.

It adds methods that:
1. Take SSTable samples from the cluster, using the new RPC verbs.
2. Train a dict on the sample. (The trainer will be plugged in from `main`).
3. Publish the trained dictionary. (By adding mutations to Raft group 0).

Perhaps this should be moved to a separate "service".
But it's not like `storage_service` has a clear purpose anyway.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
94d244ab49 main: in compression_dict_updated_callback, recognize and use SSTable compression dicts
Currently, there is at most one dictionary in `system.dicts`:
named "general", used by RPC compression. So the callback called
on `system.dicts` just always refreshes the RPC compression dict.

In a follow-up commit, we will publish SSTable compression dicts to
`system.dicts` rows with a name in the "sstables/{table_uuid}" format.
We want modification to such rows to be passed as new dictionary
recommendations to the SSTable compressor factory. This commit teaches
the `system.dicts` modification callback to recognize such modifications
and forward them to the compressor factory.
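
A minimal sketch of the name-based dispatch, assuming only the two name formats mentioned above. `classify_dict_name` and its return values are hypothetical; the real callback is C++ code in `main`.

```python
import uuid

RPC_DICT_NAME = "general"
SSTABLE_DICT_PREFIX = "sstables/"


def classify_dict_name(name: str):
    """Map a `system.dicts` row name to (kind, table_uuid_or_None)."""
    if name == RPC_DICT_NAME:
        return ("rpc", None)
    if name.startswith(SSTABLE_DICT_PREFIX):
        try:
            table_id = uuid.UUID(name[len(SSTABLE_DICT_PREFIX):])
        except ValueError:
            return ("unknown", None)  # malformed table UUID
        return ("sstable", table_id)
    return ("unknown", None)
```

A callback built on this would refresh the RPC dictionary for `"rpc"` and forward a dictionary recommendation for the given table for `"sstable"`.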
2025-04-01 00:07:29 +02:00
Michał Chojnowski
380f409c46 storage_service: add do_sample_sstables()
Adds a helper which uses ESTIMATE_SSTABLE_VOLUME and SAMPLE_SSTABLES
RPC calls to gather a combined sample of SSTable Data files for the given table
from the entire cluster.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
94c33b6760 messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs
Add two verbs needed to implement dictionary training for SSTable
compression.

SAMPLE_SSTABLES returns a list of randomly-selected chunks of Data files
with a given cardinality and using a given chunk size,
for the given table.

ESTIMATE_SSTABLE_VOLUME returns the total uncompressed size of all Data
files of the given table.
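
As an illustration of what "randomly-selected chunks with a given cardinality and chunk size" could look like, here is a sketch over an in-memory byte string. `sample_chunks` is hypothetical, and the chunk-aligned, without-replacement selection is an assumption, not a description of the verb's implementation.

```python
import random


def sample_chunks(data: bytes, chunk_size: int, n_chunks: int,
                  rng: random.Random) -> list:
    """Pick up to n_chunks distinct, chunk-aligned pieces of `data` at random."""
    total = len(data) // chunk_size  # only whole chunks are candidates
    picked = rng.sample(range(total), min(n_chunks, total))
    return [data[i * chunk_size:(i + 1) * chunk_size] for i in picked]
```

A cluster-wide sampler could use the sizes reported by ESTIMATE_SSTABLE_VOLUME to decide how many chunks to request from each node.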
2025-04-01 00:07:29 +02:00
Michał Chojnowski
4856f4acca db/system_keyspace: let system.dicts helpers be used for dicts other than the RPC compression dict
Extend the `system.dicts` helper for querying and modifying
`system.dicts` with an ability to use names other than "general".
We will use that in later commits to publish dictionaries for SSTable compression.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
b77c611c00 raft/group0_state_machine: on system.dicts mutations, pass the affected partition keys to the callback
Before this patch, `system.dicts` contains only one dictionary, for RPC
compression, with the fixed name "general".

In later parts of this series, we will add more dictionaries to
system.dicts, one per table, for SSTable compression.

To enable that, this patch adjusts the callback mechanism for group0's `write_mutations`
command, so that the mutation callbacks for group0-managed tables can see which
partition keys were affected. This way, the callbacks can query only the
modified partitions instead of doing a full scan. (This is necessary to
prevent quadratic behaviours.)

For now, only the `system.dicts` callback uses the partition keys.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
d920ab5366 database: add sample_data_files()
Add a helper for sampling the Data files for a given table.
We will use it to take samples for dictionary training.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
48c06c7e4b database: add take_sstable_set_snapshot()
We want a method that will allow us to take a stable snapshot of
SSTables, to asynchronously compute some stats on them.
But `take_storage_snapshot` is overly invasive for that, because
it flushes memtables on each call.
(If `take_storage_snapshot` was, for example, called repetitively,
it could create a ton of small memtables and lead to trouble).

This commit adds a weaker version which only takes a snapshot of
*existing SSTables*, and doesn't flush memtables by itself.

This will be useful for dictionary training, which doesn't
care about the semantics of SSTables, only their rough statistical
properties.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
64f3d7e364 compress: teach lz4_processor about dictionaries
Extend `lz4_processor` with the ability to use dictionaries.
We won't use this ability yet. It will be used when new
compressor names are added.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
b65101b371 compress: teach zstd_processor about dictionaries
Extend `zstd_processor` with the ability to use dictionaries.
We won't use this ability yet. It will be used when new
compressor names are added.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
b18ddcb92e sstables: delegate compressor creation to the compressor factory
Remove `compressor::create()`. This enforces that compressors
are only created through the `sstable_compressor_factory`.

Unlike the synchronous `compressor::create()`, the factory will be able
to create dict-aware compressors.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
30a9d471fa sstables: plug an sstable_compressor_factory into sstables_manager
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.

In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
ebf02913a2 sstables: introduce sstable_compressor_factory
Before this commit, `compressor` objects are synchronously
created, during the creation or opening of SSTables,
from `compression_parameters` objects.

But we want to add compression dictionaries to SSTables and we want
to share dictionary contents across shards.
To do that, we need to make the creation of `compressor` objects asynchronous,
and give it access to a global dictionary registry.

We encapsulate that in a `sstable_compressor_factory`. Instead of
calling `compressor::create()` on SSTable opening or creation, we will
ask the factory, asynchronously, for a new compressor, and it will return
a compressor with a deduplicated, up-to-date dictionary.

This commit introduces such a factory. It's not used anywhere yet,
and the compressors it produces don't use the provided dictionaries yet.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
2bd393849c utils/hashers: add get_sha256()
Add a helper function which computes the SHA256 for a blob.
We will use it to compute identifiers for SSTable compression
dictionaries later.
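
For illustration only, the same operation in Python with the standard `hashlib` module. The function name mirrors the commit title, but this is not the C++ helper.

```python
import hashlib


def get_sha256(blob: bytes) -> bytes:
    """Return the 32-byte SHA-256 digest of a blob."""
    return hashlib.sha256(blob).digest()
```

Because the digest depends only on the content, identical dictionaries get identical identifiers.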
2025-04-01 00:07:28 +02:00
Michał Chojnowski
61316e29df gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature
This feature will guard against writing SSTables containing compression
dictionaries before the entire cluster is able to understand them.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
dd932ebb2f compress: add hidden dictionary options
Before this commit, "compression options" written into
CompressionInfo.db (and used to construct a decompressor)
have a 1:1 correspondence to "compression options" specified
in the schema.

But we want to add a new "compression option" -- the compression
dictionary -- which will be written into CompressionInfo.db
and used to construct decompressors, but won't be specified in the
schema.

To reconcile that, in this commit we introduce the notion of a "hidden
option". If an option name in `CompressionInfo.db` begins with a dot,
then this option will be used to construct decompressors, but won't
be visible for other uses. (I.e. for the `sstable_info` API call
and for recovering a fake `schema` from `CompressionInfo.db` in the
`scylla sstable` tool).

Then, we introduce the hidden `.dictionary.{0,1,2,..}` options,
which hold the contents of the dictionary blob for this SSTable.

(The dictionary is split into several parts because the SSTable
format limits the length of a single option value to 16 bits,
and dictionaries usually have a length greater than that).

This commit only introduces helpers which translate dictionary blobs
into "options" for CompressionInfo.db, and vice-versa, but it doesn't
use those helpers yet. They will be used in later commits.
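
The translation described above can be sketched like this. The helper names, the use of `bytes` values, and the choice of 65535 bytes per part are assumptions made for illustration. The commit only states that an option value's length is limited to 16 bits and that hidden options start with a dot.

```python
MAX_OPTION_VALUE_LEN = 0xFFFF  # largest length a 16-bit field can express


def dict_to_options(blob: bytes) -> dict:
    """Split a dictionary blob into hidden `.dictionary.N` options."""
    offsets = range(0, len(blob), MAX_OPTION_VALUE_LEN)
    return {
        f".dictionary.{i}": blob[off:off + MAX_OPTION_VALUE_LEN]
        for i, off in enumerate(offsets)
    }


def options_to_dict(options: dict) -> bytes:
    """Reassemble the blob by reading the parts in index order."""
    parts = []
    i = 0
    while f".dictionary.{i}" in options:
        parts.append(options[f".dictionary.{i}"])
        i += 1
    return b"".join(parts)


def visible_options(options: dict) -> dict:
    """Drop hidden options (names starting with a dot)."""
    return {k: v for k, v in options.items() if not k.startswith(".")}
```

A 153600-byte dictionary, for example, would be stored as three parts under this scheme.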
2025-04-01 00:07:28 +02:00
Michał Chojnowski
11be7c0704 compress: remove compression_parameters::get_compressor()
Following up on the previous commits, we avoid constructing
compressors where not necessary,
by checking things directly on `compression_parameters` instead.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
006c631642 sstables/compress: remove get_sstable_compressor()
Following up on the previous commit, we avoid constructing
a compressor in the `sstable_info` API call, and we instead
read the compression options from the `sstable::compression`.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
8e611536b0 sstables/compress: move ownership of compressor to sstable::compression
SSTable readers and writers use `compressor` objects to compress and
decompress chunks of SSTable data files.

`compressor` objects are read-only, so only one of them is needed
for each SSTable. Before this commit, each reader and writer has
its own `compressor` object. This isn't necessary, but it's okay.

But later in this series it will stop being okay, because the creation
of a `compressor` will become an expensive cross-shard
operation (because it might require sharing a compression dictionary
from another shard). So we have to adjust the code so that there is
only one `compressor` per sstable, not one per reader/writer.

We stuff the ownership of this compressor into `sstable::compression`.

To make the ownership clear, we remove `compression_ptr` shared
pointers from readers and writers, and make them access the
compressor via the `sstable::compression` instead.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
7bdcd5e8c1 compress: remove compressor::option_names()
It used to be used by `compression_parameters` validation logic
to ask the created `compressor` for compressor-specific option names.

Since we no longer delegate this to `compressor`, but we just
put the knowledge of those options directly into
`compression_parameters`, it's dead code now.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
3b0ab8e1ee compress: clean up the constructor of zstd_processor
Since we now parse and validate the compression level during the
construction of `compression_parameters`, we can just pass the
structured params to `zstd_processor` instead of passing
a raw string map.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
6470035a74 compress: squash zstd.cc into compress.cc
Unlike all other implementations of `compressor`, `zstd_processor`
has its own special object file and its own special
late binding mechanism (via the `class_registry`).
It doesn't need either.

Let's squash it into `compress.cc`. Keeping `zstd_processor` a separate "module"
would require adding even more headers and source files later in the
series (when adding dictionaries), and there's no benefit in being
so granular. All `compressor` logic can be in `compress.cc` and it will
still be small enough.

This commit also gets rid of the pointless `class_registry` late binding
mechanism and just constructs the `zstd_processor` in
`compressor::create()` with a regular constructor call.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
cfe69e057f sstables/compress: break the dependency of compression_parameters on compressor
Note: this commit is meant to be a code refactoring only and is not intended
to change the observable behaviour.

Today `schema` contains a `compression_parameters`.
`compression_parameters` contains an instance of
`compressor`, and SSTable writers just share that instance.

This is fine because `compressor` is a stateless object,
functionally dependent on the schema.

But in later parts of the series, we will break this functional
dependency by adding dictionaries to compressors. Two writers
for the same schema might have different dictionaries, so they won't
be able to just share a single instance contained in the schema.

And when that happens, having a `compressor` instance
in the `schema`/`compression_parameters` will become awkward,
since it won't be actually used. It will be only a container for options.

In addition, for performance reasons, we will want to share some pieces
of compressors across shards, which will require -- in the general case --
a construction of a compressor to be asynchronous, and therefore not
possible inside the constructor of `compression_parameters`.

This commit modifies `compression_parameters` so that it doesn't hold or
construct instances of `compressor`.

Before this patch, the `compressor` instance constructed in
`compression_parameters` has an additional role of validating and
holding compressor-specific options.
(Today the only such option is the zstd compression level).

This means that the pieces of logic responsible for compressor-specific
options have to be rewritten. That ends up being the bulk of this commit.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
f4ca94d13b compress.hh: switch compressor::name() from an instance member to a virtual call
Before this patch, `compressor` is designed to be a proper abstract
class, where the creator of a compressor doesn't even know
what he's creating -- he passes a name, and it gets turned into a
`compressor` behind the scenes.

But later, when creation of compressors will involve looking up
dictionaries, this abstraction will only get in the way.
So we give up on keeping `compressor` abstract, and instead of
using "opaque" names we turn to an explicit enum of possible compressor types.

The main point of this patch is to add the `algorithm` enum and the `algorithm_to_name()`
function. The rest of the patch switches the `compressor::name()` function
to use `algorithm_to_name()` instead of the passed-by-constructor
`compressor::_name`, to keep a single source of truth for the names.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
4f634de2e9 bytes: adapt fmt_hex to std::span<const std::byte>
This allows us to hexdump things other than `bytes_view`.
(That is, without reinterpret_casting them to `bytes_view`,
which -- aside from the inconvenience -- isn't quite legal.
In contrast, any span can be legally cast to `std::span<const std::byte>`).
2025-04-01 00:07:27 +02:00
Robert Bindar
b647196121 Remove db::config::object_storage_config
That map became redundant once we added
object_storage_endpoints in the config, this patch removes
it and switches all the user code to use the new option.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 17:15:12 +03:00
Gleb Natapov
3abe5de8bf gossiper: make examine_gossiper private 2025-03-31 16:50:50 +03:00
Gleb Natapov
afdfde8300 gossiper: rename get_nodes_with_host_id to get_node_ip
Also change it to return std::optional instead of std::set since now
there can be only one ip mapped to an id.
2025-03-31 16:50:50 +03:00
Gleb Natapov
28fb84117d treewide: drop id parameter from gossiper::for_each_endpoint_state
We have it in endpoint_state anyway, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
4609bbbbb2 treewide: move gossiper to index nodes by host id
This patch changes gossiper to index nodes by host ids instead of ips.
The main data structure that changes is _endpoint_state_map, but this
results in a lot of changes since everything that uses the map directly
or indirectly has to be changed. The big victim of this outside of the
gossiper itself is topology over gossiper code. It works on IPs and
assumes the gossiper does the same and both need to be changed together.
Changes to other subsystems are much smaller since they already mostly
work on host ids anyway.
2025-03-31 16:50:50 +03:00
Gleb Natapov
19ac05b0ba gossiper: drop ip from replicate function parameters
We have it in endpoint_state now, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
c5b8429bec gossiper: drop ip from apply_new_states parameters
We have it in endpoint_state now, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
6da5f541a2 gossiper: drop address from handle_major_state_change parameter list
We have it in endpoint_state now, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
5e06bf76e0 gossiper: pass rpc::client_info to gossiper_shutdown verb handler
It will be needed later to obtain host id of the peer.
2025-03-31 16:50:50 +03:00
Gleb Natapov
704580b197 gossiper: add try_get_host_id function
The function returns unengaged std::optional if id is not found instead
of throwing like get_host_id does.
2025-03-31 16:50:45 +03:00
Tomasz Grabiec
29d1c2adc6 Merge 'Finalize tablet splits earlier' from Lakshmi Narayanan Sreethar
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

This PR fixes the issue by updating the load balancer to not schedule any
migrations and to not make any repair plans when a resize
finalization is pending in any table.

Also added a testcase to verify the fix.

Fixes #21762

Improvement : No need to backport.

Closes scylladb/scylladb#22148

* github.com:scylladb/scylladb:
  topology_coordinator: fix indentation in generate_migration_updates
  topology_coordinator: do not schedule migrations when there are pending resize finalizations
  load_balancer: make repair plans only when there is no pending resize finalization
2025-03-31 14:42:34 +02:00
Gleb Natapov
6999b474a1 gossiper: add ip to endpoint_state
Store endpoint's IP in the endpoint state. Currently it is stored as a key
in gossiper's endpoint map, but we are going to change that. The new field
is not serialized when endpoint state is sent over rpc, so it is set by
the rpc handler from the value in the map that is in the rpc message. This
map will not be changed to be host id based to not break interoperability.
2025-03-31 15:42:08 +03:00
Gleb Natapov
9bb2edcae6 serialization: fix std::map de-serializer to not invoke value's default constructor 2025-03-31 15:42:07 +03:00
Gleb Natapov
e5cc3b75f8 gossiper: drop template from wait_alive_helper function
Move ip to id translation to the caller.
2025-03-31 15:42:07 +03:00
Gleb Natapov
0dd86b4f1d gossiper: move get_supported_features and its users to host id 2025-03-31 15:42:07 +03:00
Gleb Natapov
f97bb6922d storage_service: make candidates_for_removal host id based 2025-03-31 15:42:07 +03:00
Gleb Natapov
82491cec19 gossiper: use peers table to detect address change
This requires serializing entire handle_state_normal with a lock since
it both reads and updates peers table now (it only updated it before the
change). This is not a big deal since most of it is already serialized
with token metadata lock. We cannot use it to serialize peers writes
as well since the code that removes an endpoint from peers table also
removes it from gossiper which causes on_remove notification to be called
and it may take the metadata lock as well causing deadlock.
2025-03-31 15:41:44 +03:00
Tomasz Grabiec
6bff596fce tablets: Make tablet allocation equalize per-shard load
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.
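
A toy comparison of the two policies, assuming a simple greedy allocator. `allocate` is hypothetical and much simpler than the real tablet allocator; it only shows why per-node balancing overloads shards on small nodes.

```python
def allocate(tablets: int, shards_per_node: dict, per_shard: bool = True) -> dict:
    """Greedily place tablets one by one on the least-loaded node.

    Load is tablets per shard when per_shard is True, else tablets per node.
    Ties go to the node listed first.
    """
    counts = {node: 0 for node in shards_per_node}
    for _ in range(tablets):
        def load_after(node):
            new_count = counts[node] + 1
            return new_count / shards_per_node[node] if per_shard else new_count
        target = min(counts, key=load_after)
        counts[target] += 1
    return counts
```

With 10 tablets on a 2-shard node and an 8-shard node, per-node balancing gives 5 tablets to each node (2.5 per shard on the small one), while per-shard balancing gives 2 and 8, which is 1 tablet per shard everywhere.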

Refs #23378
2025-03-31 14:34:30 +02:00
Gleb Natapov
1c2a9257e9 storage_service: use std::views::keys instead of std::views::transform that returns a key 2025-03-31 15:25:39 +03:00
Gleb Natapov
a581a99dbf gossiper: move _pending_mark_alive_endpoints to host id
Index _pending_mark_alive_endpoints map by host id instead of ip
2025-03-31 15:25:39 +03:00
Gleb Natapov
555149c153 gossiper: do not allow to assassinate endpoint in raft topology mode
It does nothing but harm in raft topology mode.
2025-03-31 15:25:39 +03:00
Gleb Natapov
4cc1c10035 gossiper: fix indentation after previous patch 2025-03-31 15:25:39 +03:00
Gleb Natapov
e8b7aaa0d4 gossiper: do not allow to assassinate non existing endpoint
We assume that all endpoint states have HOST_ID set or the host id is
available locally, but the assassinate code injects a state without
HOST_ID for not existing endpoint violating this assumption.
2025-03-31 15:25:39 +03:00
Botond Dénes
90c20858ed Merge 'test/database: Remove most of take_snapshot() helper overloads and re-use them more' from Pavel Emelyanov
This helper facilitates snapshot creation by various test cases in database_test.cc. This PR generalizes all overloads into one that suits all callers and patches one more test case to use it as well.

Closes scylladb/scylladb#23482

* github.com:scylladb/scylladb:
  test/database: Re-use take_snapshot() helper once more
  test/database: Remove most of take_snapshot() helper overloads
2025-03-31 15:20:51 +03:00
Benny Halevy
5f2ce0b022 loading_cache_test: test_loading_cache_reload_during_eviction: use manual_clock
Rather than lowres_clock, as since
32b7cab917,
loading_cache_for_test uses manual_clock for timing
and relying on lowres_clock to time the test might
run out of memory on fast test machines.

Fixes #23497

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23498
2025-03-31 14:53:06 +03:00
Robert Bindar
e3a3508960 Move object_storage.yaml endpoints to scylla.yaml
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 13:39:39 +03:00
Pavel Emelyanov
ac582efb44 test/database: Re-use take_snapshot() helper once more
There's a test case that can call the recently patched take_snapshot()
helper as well. This changes nothing, but makes further patching a bit
simpler (not in this branch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-31 13:18:06 +03:00
Pavel Emelyanov
7e6380b6bd test/database: Remove most of take_snapshot() helper overloads
There are 3 of those that help tests (re)shuffle cql_test_env/database,
skip_flush == true/false options and keyspace/table/snapshot names.
There's little sense in having that many of those, just one overload
with default arguments suits most of the callers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-31 13:18:06 +03:00
Botond Dénes
ea55eed037 Merge 'Snapshot several tables at once in scrub API handler' from Pavel Emelyanov
The scrub API handler may want to snapshot several tables. For that, it calls snapshot-ctl method to snapshot a single table for each table in the list. That's excessive, snapshot-ctl has a method to snapshot a bunch of tables at once, just what the scrub handler needs.

It's an improvement, so no need to backport

Closes scylladb/scylladb#23472

* github.com:scylladb/scylladb:
  snapshot-ctl: Remove unused snapshot-single-table method
  api: Snapshot all tables at once in scrub handler
2025-03-31 13:00:32 +03:00
Piotr Smaron
aff8cbc6f3 CODEOWNERS: remove expired owners
Removing krzaq, who's no longer with the company.
Removing core-frontend team members from Alternator areas, as it's no
longer the domain of this team.

Closes scylladb/scylladb#23500
2025-03-31 11:37:51 +03:00
Pavel Emelyanov
0077acd1bb api: Properly validate table in tablet add|del replica handlers
The handlers in question call database.find_column_family directly, so
if the table doesn't exist, a no_such_column_family exception is
thrown, which is not nice. The proper behavior is to throw bad_param,
and there's a helper that does it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23389
2025-03-31 10:03:17 +02:00
Andrzej Jackowski
c89d8c6566 cql3: prevent from empty option use in cf_statement::column_family()
The implementation of cf_statement::column_family() dereferences the
_cf_name option without checking that the option is non-empty. On the
enterprise branch, there is a safeguard that prevents dereferencing an
empty option. Although the current code on master does not seem to call
column_family() when _cf_name is empty, it is safer to introduce the
same safeguard on master, to avoid any regression.

This change:
 - Prevent use of an empty option in cf_statement::column_family()

Fixes: scylla-enterprise#5273

Closes scylladb/scylladb#23366
2025-03-31 09:43:22 +03:00
Michał Chojnowski
e23fdc0799 table: fix a race in table::take_storage_snapshot()
`safe_foreach_sstable` doesn't do its job correctly.

It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.

The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.

Remove this function and fix its usages to obtain the set and iterate
over it under the lock.
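
The ordering problem described above can be sketched in Python. This is only an illustrative model, not the ScyllaDB code: the `Table` class, `racy_foreach`, `locked_foreach`, and the `before_lock` hook are hypothetical names introduced here to make the race window deterministic.

```python
import threading


class Table:
    """Hypothetical stand-in for a table that owns a set of sstables."""

    def __init__(self, sstables):
        self._sstables = set(sstables)
        self.deletion_lock = threading.Lock()

    def sstable_set(self):
        # Return a copy, like taking a snapshot of the current set.
        return set(self._sstables)

    def unlink(self, name):
        # Deletions are serialized by the deletion lock.
        with self.deletion_lock:
            self._sstables.discard(name)

    def exists(self, name):
        return name in self._sstables


def racy_foreach(table, visit, before_lock=lambda: None):
    snapshot = table.sstable_set()  # set obtained BEFORE taking the lock
    before_lock()                   # window in which a deletion can slip in
    with table.deletion_lock:
        for name in sorted(snapshot):
            visit(name)


def locked_foreach(table, visit):
    # Obtain the set and iterate over it under the same lock.
    with table.deletion_lock:
        for name in sorted(table.sstable_set()):
            visit(name)


# The racy variant visits an sstable that was already unlinked.
racy_table = Table({"a", "b"})
dangling = []
racy_foreach(
    racy_table,
    lambda name: dangling.append(name) if not racy_table.exists(name) else None,
    before_lock=lambda: racy_table.unlink("b"),
)

# The locked variant only sees sstables that still exist.
safe_table = Table({"a", "b"})
visited = []
locked_foreach(safe_table, visited.append)
```

In the locked variant a concurrent `unlink()` would have to wait until the iteration is finished, which is the property the fix relies on.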

Closes scylladb/scylladb#23397
2025-03-31 09:40:32 +03:00
Avi Kivity
2b9e1e61d0 docs: reader_concurrency_semaphore: document CPU concurrency limit
Document the CPU concurrency implemented in 3d816b7c16
and adjusted in 3d12451d1f.

Closes scylladb/scylladb#23404
2025-03-31 09:39:55 +03:00
Dawid Mędrek
b0b0c5905e test/cluster/test_multidc: Clean up RF-rack-valid keyspaces tests
There are some minor things we should fix that are a remnant
of the original changes (scylladb/scylladb@7646e14).

Closes scylladb/scylladb#23429
2025-03-31 09:38:42 +03:00
David Garcia
1a7be07b8c docs: renders os-support from json file
docs: renders os-support from json file

Closes scylladb/scylladb#23436
2025-03-31 09:36:49 +03:00
Marcin Maliszkiewicz
e3f2ebd4fb cql3: remove not needed cmd copy in indexed_table_select_statement
The variable is unused. Removing it should give a tiny performance
gain, as it saves an allocation.

Closes scylladb/scylladb#23473
2025-03-31 09:34:32 +03:00
Avi Kivity
73e4a3c581 sstables: store features early in write path
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know if we can rely on certain
properties in an sstable that we read.

Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).

However, we happen to read features before we set them in the write path -
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it was engaged. This somehow worked so far, but fails with libstdc++
hash table implementation.

Fix it by moving storage of the features to the sstable itself, and
setting it early in the write path.

Fixes #23484

Closes scylladb/scylladb#23485
2025-03-31 09:33:56 +03:00
Pavel Emelyanov
693387bda6 Merge 'test.py: topology: allow to run tests with bare pytest command' from Evgeniy Naydanov
Add the ability to run topology tests using a bare pytest command.

To achieve this goal the following changes were made:

- Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`.
- To build `TestSuite` object we need to discover a corresponding `suite.xml` file.  Do this by walking up thru the fs tree starting from the current test file.
- Run ScyllaClusterManager using pytest fixture if `--manager-api` option is not provided.

And made some refactoring:

- Add path constants to `test` module and use them in different test suites instead of own dups of the same code:
  - TOP_SRC_DIR : ScyllaDB's source code root directory
  - TEST_DIR : the directory with test.py tests and libs
  - BUILD_DIR : directory with ScyllaDB's build artifacts
- Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path provided using `--tmpdir` CLI argument. Don't use `tmpdir` name because it gets mixed up with pytest's built-in fixture and `--tmpdir` option itself.
- Change default value for `--tmpdir` from `./testlog` to `TOP_SRC_DIR/testlog`
- Refactor `ResourceGather*` classes to use path from a `test` object instead of providing it separately.
- Move modes constants (`all_modes`/`ALL_MODES` and `debug_modes`/`DEBUG_MODES`) to `test` module and remove duplication.
- Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to `pylib.suite.base` to avoid circular imports.
- In some places refactor to use f-strings for formatting.

Also minor changes related to running with pytest-xdist:

- When running tests in parallel, we need to ensure that filenames are unique by adding the xdist worker ID to them.
- Pass random seed across xdist workers using env variable.

Closes scylladb/scylladb#22960

* github.com:scylladb/scylladb:
  test.py: async_cql: remove unused event_loop fixture
  test.py: random_failures: make it play well with xdist
  test.py: add xdist worker ID to log filenames
  test.py: topology: run tests using bare pytest command
  test.py: add fixtures for current test suite and test
  test.py: refactor paths constants and options
2025-03-31 09:30:06 +03:00
Benny Halevy
a4aa4d74c1 test/pylib: servers_add: add auto_rack_dc parameter
To quickly populate nodes in a single dc,
each node in its own rack.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:23:40 +03:00
Benny Halevy
c4dbb11c87 test/pylib: servers_add: support list of property_files
So that a multi-dc/multi-rack cluster can be populated
in a single call.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:12:39 +03:00
Piotr Smaron
a2bbbc6904 auth: forbid modifying system ks by non-superusers
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces.

Fixes: scylladb/scylladb#23218

Closes scylladb/scylladb#23219
2025-03-30 16:55:04 +03:00
Ferenc Szili
2c9b312b58 test: port of test and reproducer for resurrection during file based streaming
This change ports test/cluster/test_resurrection.py from enterprise to
master. Because the underlying issue deals with file based streaming,
this test was a part of the enterprise repo. It contains the test and
reproducer for the issue described below:

When tablets are migrated with file-based streaming, we can have a situation
where a tombstone is garbage collected before the data it shadows lands. For
instance, if we have a tablet replica with 3 sstables:

1 sstable containing an expired tombstone
2 sstable with additional data
3 sstable containing data which is shadowed by the expired tombstone in sstable 1

If this tablet is migrated, and the sstables are streamed in the order listed
above, the first two sstables can be compacted before the third sstable arrives.
In that case, the expired tombstone will be garbage collected, and data in the
third sstable will be resurrected after it arrives to the pending replica.
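
The scenario above can be modeled with a toy compaction in Python. This is a simplified sketch, not ScyllaDB's compaction logic: `Cell`, `compact`, and `read` are hypothetical names, each sstable is reduced to a dict, and an `expired` flag stands in for the real tombstone GC eligibility check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    timestamp: int          # write timestamp; the highest one wins on merge
    value: Optional[str]    # None marks a tombstone (a deletion)
    expired: bool = False   # True once the tombstone may be garbage collected


def compact(*sstables):
    """Merge sstables, keep the newest cell per key, drop expired tombstones."""
    merged = {}
    for sstable in sstables:
        for key, cell in sstable.items():
            current = merged.get(key)
            if current is None or cell.timestamp > current.timestamp:
                merged[key] = cell
    return {
        key: cell
        for key, cell in merged.items()
        if not (cell.value is None and cell.expired)
    }


def read(sstable, key):
    cell = sstable.get(key)
    return None if cell is None else cell.value


sstable1 = {"k": Cell(timestamp=20, value=None, expired=True)}  # expired tombstone
sstable2 = {"other": Cell(timestamp=30, value="x")}             # additional data
sstable3 = {"k": Cell(timestamp=10, value="old")}               # shadowed data

# sstables 1 and 2 are compacted before sstable 3 arrives:
# the tombstone is purged, so the old value comes back.
early = compact(sstable1, sstable2)
resurrected = compact(early, sstable3)

# If all three are compacted together, the tombstone still shadows the data.
safe = compact(sstable1, sstable2, sstable3)
```

Here `read(resurrected, "k")` yields the old value, while `read(safe, "k")` yields nothing, which mirrors the resurrection described above.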

The fix for the issue was merged in b66479ea98

This patch only ports the missing test.

Closes scylladb/scylladb#23466
2025-03-30 13:39:40 +03:00
Andrzej Jackowski
b8adbcbc84 audit: fix empty query string in BATCH query
Function modification_statement::add_raw() is never called, which
makes query string in audit_info of batch queries empty. In enterprise
branch, add_raw is called in Cql.g and those changes were never merged
to master.

This change:
 - Add missing call of add_raw() to Cql.g
 - Include other related changes (from PR#3228 in scylla-enterprise)

Fixes scylladb#23311

Closes scylladb/scylladb#23315
2025-03-30 13:37:11 +03:00
Michał Chojnowski
79a477ecb6 cmake: add the -dynamic-linker=... form to the -dynamic-linker regex
On my system (Nix), the compiler produces a `-dynamic-linker=/nix/store/...` in
the linker call scanned by get_padded_dynamic_linker_option.
But the regex can't deal with the `=` there, it requires a ` `. Fix that.
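
As a rough illustration of the idea, a pattern that accepts both separators could look like the Python sketch below. The pattern, the `find_dynamic_linker` helper, and the example paths are assumptions made for this example; they are not the actual regex used in the CMake code or in configure.py.

```python
import re

# Hypothetical pattern: accept "-dynamic-linker <path>" as well as
# "-dynamic-linker=<path>" by allowing either a space or "=" as separator.
DYNAMIC_LINKER_RE = re.compile(r"-dynamic-linker[ =](\S+)")


def find_dynamic_linker(linker_command: str):
    """Return the dynamic linker path found in a linker command, or None."""
    match = DYNAMIC_LINKER_RE.search(linker_command)
    return match.group(1) if match else None


# Illustrative command lines (the paths are made up for the example).
with_space = find_dynamic_linker("ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out")
with_equals = find_dynamic_linker("ld -dynamic-linker=/nix/store/example/ld.so -o a.out")
without_option = find_dynamic_linker("ld -o a.out main.o")
```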

We also do the same in configure.py, and remove the Nix-specific hack
which used to disable the entire mechanism.

Closes scylladb/scylladb#22308
2025-03-30 11:58:47 +03:00
Kefu Chai
7814f6d374 github: improve seastar bad include check
for better developer experience:

- add inline annotations using problem matchers, see
  https://github.com/actions/toolkit/blob/main/docs/problem-matchers.md
- use a single step for uploading both output files, because the `path`
  setting is actually passed to
  [@actions/glob](https://github.com/actions/toolkit/tree/main/packages/glob),
  I removed the double quotes and the leading "./"
  from the paths.
- use "::error" workflow command to signify the failure, see
  https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#example-creating-an-annotation-for-an-error

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23310
2025-03-30 11:56:18 +03:00
Evgeniy Naydanov
1a0c14aa50 test.py: async_cql: remove unused event_loop fixture
A newer version of pytest-asyncio (0.24.0) allows controlling the scope
of the async loop per fixture, so this workaround is no longer needed.
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
cac0257914 test.py: random_failures: make it play well with xdist
Pass random seed across xdist workers using env variable.
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
9bba59631f test.py: add xdist worker ID to log filenames
When running tests in parallel, we need to ensure that filenames
are unique by adding the xdist worker ID to them.
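
A minimal sketch of the idea, assuming the worker ID is taken from the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets in worker processes. The `unique_log_path` helper and the naming scheme are hypothetical and are not taken from the test.py code.

```python
import os
from pathlib import Path


def unique_log_path(log_path: Path, environ=os.environ) -> Path:
    """Insert the xdist worker ID (e.g. "gw0") before the file suffix.

    Outside of xdist the variable is not set and the path is returned as-is.
    """
    worker_id = environ.get("PYTEST_XDIST_WORKER")
    if not worker_id:
        return log_path
    return log_path.with_name(f"{log_path.stem}.{worker_id}{log_path.suffix}")


# The environment is passed explicitly here to keep the example deterministic.
in_worker = unique_log_path(Path("testlog/dev/test.log"), {"PYTEST_XDIST_WORKER": "gw2"})
no_xdist = unique_log_path(Path("testlog/dev/test.log"), {})
```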
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
9cb0ec2b42 test.py: topology: run tests using bare pytest command
Run ScyllaClusterManager using pytest fixture if `--manager-api`
option is not provided.

At this stage we're trying to stay as close to test.py as possible.
test.py runs tests file-by-file, so, effectively, the scopes `session`,
`package`, and `module` are pretty much the same.  Also, test.py starts
ScyllaClusterManager for every test module, which is the reason
why the fixture `manager_api_sock_path` has scope=`module`.  As a
result, we need to change the scope of the fixture `manager_internal` too.
2025-03-30 03:19:29 +00:00
Evgeniy Naydanov
42075170d1 test.py: add fixtures for current test suite and test
Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`
To build TestSuite object we need to discover a corresponding `suite.xml`
file.  Do this by walking up thru the fs tree starting from the current
test file.
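
The discovery step could be sketched as follows. This is only an assumption about the approach, not the code added to `test/conftest.py`; `find_suite_config` and its parameters are hypothetical names, and the suite file name is passed in by the caller.

```python
from pathlib import Path
from typing import Optional


def find_suite_config(test_file: Path, config_name: str) -> Optional[Path]:
    """Walk up from the test file's directory and return the first match.

    Returns None when no directory up to the filesystem root contains
    a file called config_name.
    """
    for directory in test_file.resolve().parents:
        candidate = directory / config_name
        if candidate.is_file():
            return candidate
    return None
```

A fixture could call this helper with the path of the current test module and the suite file name, and fail the test early if it returns `None`.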
2025-03-30 03:19:29 +00:00
Evgeniy Naydanov
c4ae4e247a test.py: refactor paths constants and options
Add path constants to `test` module and use them in different test suites
instead of own dups of the same code:

 - TOP_SRC_DIR : ScyllaDB's source code root directory
 - TEST_DIR : the directory with test.py tests and libs
 - BUILD_DIR : directory with ScyllaDB's build artefacts

Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path
provided using `--tmpdir` CLI argument.  Don't use `tmpdir` name because it
gets mixed up with pytest's built-in fixture and `--tmpdir` option itself.

Change default value for `--tmpdir` from `./testlog` to `TOP_SRC_DIR/testlog`

Refactor `ResourceGather*` classes to use path from a `test` object instead of
providing it separately.

Move modes constants to `test` module and remove duplications.

Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to
`pylib.suite.base` to avoid circular imports (with little refactoring to
use `pathlib.Path` instead of `str` as paths.)

Also, in some places refactor to use f-strings for formatting.
2025-03-30 03:19:29 +00:00
Michał Jadwiszczak
0ee0696959 test/cqlpy/test_service_level_api: update to service levels on raft and remove flakiness
Tests in `test_service_level_api` were written before
scylladb/scylladb#16585 and they were doing 10s sleeps to wait for
service level controller to update its configuration. Now performing
a read barrier is sufficient to ensure SL configuration is up-to-date,
which significantly reduces tests time (from ~60s to ~2-3s).

Moreover, there was flakiness in the `test_switch_tenants` test.
Until now, the test waited up to 60s for the connections to update
their scheduling groups. However, it is difficult to determine
how long the process might take because a connection may be blocked
while waiting for the next request to be processed,
and the scheduling group will be updated only after a request is processed
(see `generic_server::connection::process_until_tenant_switch()`).
To address this issue, 100 simple queries are executed so that
connections on all shards process at least one request
and update their scheduling groups.

Fixes scylladb/scylladb#22768

Closes scylladb/scylladb#23381
2025-03-28 17:14:21 +03:00
Pavel Emelyanov
9aa986a49a snapshot-ctl: Remove unused snapshot-single-table method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-28 10:45:31 +03:00
Pavel Emelyanov
5162f75d0b api: Snapshot all tables at once in scrub handler
The handler walks the list of tables and snapshots each one individually
(if needed). That's not very optimal, each such call starts a "snapshot
modification operation", which is switching to shard-0 for a lock, then
calls the snapshot of multiple tables giving it vector of a single name.
There's a method of snapshot-ctl that snapshots several tables at once,
no need to open-code it here.

One thing to take care of -- take_column_family_snapshot() throws when
the vector of table names is empty, so an explicit check is needed to
skip the call in that case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-28 10:44:47 +03:00
Avi Kivity
6d7cb68aab test: ldap: avoid io_uring Seastar reactor backend
It tends to fail sometimes with ENOMEM:

```
ERROR 2025-03-24 01:05:22,983 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found)
ERROR 2025-03-24 01:05:30,984 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found)
ERROR 2025-03-24 01:05:47,123 [shard 0:main] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:12, Cannot allocate memory)
ERROR 2025-03-24 01:05:47,139 [shard 0:main] table - failed to write sstable /scylladir/testlog/x86_64/debug/scylla-33787f64/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa/me-3got_1s5n_0lfls1y4z7vkkts07a-big-Data.db: storage_io_error (Storage I/O error: 12: Cannot allocate memory)
ERROR 2025-03-24 01:05:47,140 [shard 0:main] table - Memtable flush failed due to: storage_io_error (Storage I/O error: 12: Cannot allocate memory). Aborting, at 0x30f5605 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514f14 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514b96 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x45165b1 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4518dcf 0x3fde842 0x35dc5c6 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674 0x314b8b6 /lib64/libc.so.6+0x70ba7 /lib64/libc.so.6+0xf4b8b
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::shared_future<>::shared_state
Aborting on shard 0, in scheduling group main.
Backtrace:
  0x30f5605
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x384a0e4
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x3849db2
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x369bd84
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d42a2
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a5ed9
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a61d5
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a601f
  /lib64/libc.so.6+0x1a04f
  /lib64/libc.so.6+0x72b53
  /lib64/libc.so.6+0x19f9d
  /lib64/libc.so.6+0x1941
  0x3fde8b1
  0x35dc5c6
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674
  0x314b8b6
  /lib64/libc.so.6+0x70ba7
  /lib64/libc.so.6+0xf4b8b
=== TEST.PY SUMMARY START ===
Test exited with code -6

=== TEST.PY SUMMARY END ===

=== decoded ===
Backtrace:
[Backtrace #0]
__interceptor_backtrace at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4369
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/debug/seastar/./seastar/include/seastar/util/backtrace.hh:70
seastar::backtrace_buffer::append_backtrace() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:805
seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:838
seastar::print_with_backtrace(char const*, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:850
seastar::sigabrt_action() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:4004
seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3981
seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3976
/lib64/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=c8c3fa52aaee3f5d73b6fd862e39e9d4c010b6ba, for GNU/Linux 3.2.0, not stripped

?? ??:0
printf_positional at ??:?
?? ??:0
?? ??:0
replica::table::seal_active_memtable(replica::compaction_group&, replica::flush_permit&&)::$_0::operator()(std::function<seastar::future<void> ()>) const at ././replica/table.cc:1512
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:122
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:2616
seastar::reactor::run_some_tasks() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3088
seastar::reactor::do_run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3256
seastar::reactor::run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3146
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:167
seastar::testing::test_runner::start_thread(int, char**)::$_0::operator()() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/testing/test_runner.cc:77
void std::__invoke_impl<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>(std::__invoke_other, seastar::testing::test_runner::start_thread(int, char**)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
std::enable_if<is_invocable_r_v<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>, void>::type std::__invoke_r<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>(seastar::testing::test_runner::start_thread(int, char**)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111
std::_Function_handler<void (), seastar::testing::test_runner::start_thread(int, char**)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
seastar::posix_thread::start_routine(void*) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/posix.cc:90
asan_thread_start(void*) at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/asan_interceptors.cpp:239
__vfscanf_internal at :?
peek_token at ??:?
```

In ce65164315, we banned io_uring from tests, but missed the ldap
tests. This extends coverage to ldap tests.

I verified that the new options indeed reach the test.

Refs #23411.

Credit to Botond for recognizing the failure reason.

Closes scylladb/scylladb#23422
2025-03-28 07:45:53 +02:00
Tomasz Grabiec
d6232a4f5f tablets: load_balancer: Fix reporting of total load per node
Load is now utilization, not count, so we should report average
per-shard load, which is equivalent to node's utilization.
2025-03-27 23:28:20 +01:00
Botond Dénes
bd8973a025 tools/scylla-nodetool: s/GetInt()/GetInt64()/
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in an assert failure. This was hit in the field, where invoking
nodetool netstats resulted in nodetool crashing when the streamed byte
counts were higher than maxint.

To avoid such bugs in the future, replace all usages of GetInt() in
nodetool with GetInt64(), just to be sure.

A reproducer is added for the nodetool netstats crash.

Fixes: scylladb/scylladb#23394

Closes scylladb/scylladb#23395
2025-03-27 14:05:39 +02:00
Botond Dénes
d57e71837f Merge 'Improve scoped restore test' from Pavel Emelyanov
This PR includes several fixes to the currently flaky test_restore_with_streaming_scopes test.

1. Check that backup and restore APIs don't fail. Currently, if either of them fails, the test case fails anyway at the later check because the data is not restored, but it's better to know what exactly failed

2. For the restore API, the test collects the list of sstables to restore from. Currently, collecting this list races with background compaction and sometimes causes the restore API to fail, which, in turn, makes the whole test fail

3. Add a test case that validates that restore-from-missing-sstable fails nicely

refs: #23189
No backport, as it's a relatively new test

Closes scylladb/scylladb#23445

* github.com:scylladb/scylladb:
  test/backup: Validate that restoring from non-existing sstables fails
  test/backup: Collect sstables names after snapshot
  test/backup: Check that backup and restore succeed
2025-03-27 13:23:41 +02:00
Piotr Dulikowski
288216a89e Merge 'Ignore wrapped exceptions gate_closed_exception and rpc::closed_error when node shuts down.' from Sergey Zolotukhin
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325
Fixes scylladb/scylladb#23305
Fixes scylladb/scylladb#21815

Backport: looks like this is quite a frequent issue, therefore backport to 2025.1.

Closes scylladb/scylladb#23336

* github.com:scylladb/scylladb:
  database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`
  database: Unify exception handling in `do_apply` and `apply_with_commitlog`
  storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down.
  exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.
2025-03-27 11:39:42 +01:00
Pavel Emelyanov
9f036d957a Merge 'test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables' from Botond Dénes
Filter out sstables which don't have a TOC or have a temporary TOC. Such sstables are incomplete and can disappear if the compaction which writes them is interrupted.

Fixes: #23203

This PR fixes a flaky test which is only on master, no backports required.

Closes scylladb/scylladb#23450

* github.com:scylladb/scylladb:
  test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context
  test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables
2025-03-27 09:45:07 +03:00
Tomasz Grabiec
8e506c5a8f test: tablets: Fix flakiness due to ungraceful shutdown
The test fails sporadically with:

cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

That's because a server is stopped in the middle of the workload.

The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.

Fixes #20492

Closes scylladb/scylladb#23451
2025-03-27 09:44:07 +03:00
Lakshmi Narayanan Sreethar
dccce670c1 topology_coordinator: fix indentation in generate_migration_updates
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar
5b47d84399 topology_coordinator: do not schedule migrations when there are pending resize finalizations
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

To fix this, do not schedule tablet migrations on any tables when there
are pending resize finalizations. This ensures that migrations from the
same table and other unrelated tables do not block resize finalization.

Also added a testcase to verify the fix.

Fixes #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar
8cabc66f07 load_balancer: make repair plans only when there is no pending resize finalization
Do not make repair plans if any table has pending resize finalization.
This is to ensure that the finalization doesn't get delayed by repair
tasks.

Refs #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Avi Kivity
b292b5800b Merge 'test.py: move starting LDAP service to dedicate method' from Andrei Chekun
Move starting LDAP to the method where the rest of the services are started. This will unify the way of starting the 3rd party services.
Fix LDAP test flakiness caused by failures to connect to the LDAP server.
Capture stdout and stderr of toxiproxy-cli in case of errors.

Related: https://github.com/scylladb/scylladb/pull/23333

This PR is based on https://github.com/scylladb/scylladb/pull/23221, so #23221 should be merged first.

Closes scylladb/scylladb#23235

* github.com:scylladb/scylladb:
  test.py: Refactor nodetool/conftest
  test.py: Refactor test/pylib/cpp/ldap
  test.py: move starting LDAP service to dedicate method
2025-03-26 15:31:00 +02:00
Botond Dénes
801339bad9 test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context
To just system.local, the table these tests operate on. No need to
disable autocompaction for all of the system keyspace.
2025-03-26 09:19:38 -04:00
Botond Dénes
3ec863c4ce test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables
Filter out sstables which don't have a TOC or have a temporary TOC. Such
sstables are incomplete and can disappear if the compaction which writes
them is interrupted.
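
The filter described above can be sketched roughly as follows. This is an illustration only, not the code from the patch: the `<prefix>-<component>` file naming, the `TOC.txt` / `TOC.txt.tmp` component names, and the `sealed_sstables()` helper are assumptions made for the example.

```python
from collections import defaultdict

# Assumed component names for the final and the temporary TOC.
TOC = "TOC.txt"
TEMPORARY_TOC = "TOC.txt.tmp"


def sealed_sstables(filenames):
    """Return the sstable prefixes that have a TOC and no temporary TOC."""
    components = defaultdict(set)
    for name in filenames:
        # Split "<prefix>-<component>" on the last dash (hypothetical layout).
        prefix, _, component = name.rpartition("-")
        components[prefix].add(component)
    return sorted(
        prefix
        for prefix, comps in components.items()
        if TOC in comps and TEMPORARY_TOC not in comps
    )


files = [
    "me-1-big-Data.db", "me-1-big-TOC.txt",      # sealed
    "me-2-big-Data.db", "me-2-big-TOC.txt.tmp",  # still being written
    "me-3-big-Data.db",                          # no TOC at all
]
print(sealed_sstables(files))
```

Only the first sstable in the sample list passes the filter; the other two are the incomplete kind that an interrupted compaction may remove.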
2025-03-26 09:18:34 -04:00
Pavel Emelyanov
1da889f239 Merge 'Allow abort during join_cluster' from Benny Halevy
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

* requires backport on top of https://github.com/scylladb/scylladb/pull/23184

Closes scylladb/scylladb#23306

* github.com:scylladb/scylladb:
  main: allow abort during join_cluster
  main: add checkpoint before joining cluster
  storage_service: add start_sys_dist_ks
2025-03-26 15:48:58 +03:00
Sergey Zolotukhin
d448f3de77 database: Pass schema_ptr as const ref in wrap_commitlog_add_error 2025-03-26 11:15:26 +01:00
Sergey Zolotukhin
0d9d0fe60e database: Unify exception handling in do_apply and apply_with_commitlog
Move exception wrapping logic from `do_apply` and `apply_with_commitlog`
to `wrap_commitlog_add_error` to ensure consistent error handling.
2025-03-26 11:15:18 +01:00
Sergey Zolotukhin
b1e89246d4 storage_proxy: Ignore wrapped gate_closed_exception and rpc::closed_error when node shuts down.
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325
2025-03-26 11:15:16 +01:00
Sergey Zolotukhin
6abfed9817 exceptions: Add try_catch_nested to universally handle nested exceptions of the same type. 2025-03-26 11:15:13 +01:00
Evgeniy Naydanov
574c81eac6 test.py: random_failures: deselect topology ops for some injections
After recent changes, #18640 and #19151 started to reproduce for
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.

The solution is the same: deselect the tests.

Fixes #23302

Closes scylladb/scylladb#23405
2025-03-26 12:07:12 +03:00
Pavel Emelyanov
38f37763d6 test/backup: Validate that restoring from non-existing sstables fails
When restore API is called and is given a non-existing sstable (object
name) the task should complete with failed status and some meaningful
message in the error text.

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-26 10:55:42 +03:00
Pavel Emelyanov
02610a9072 test/backup: Collect sstables names after snapshot
The scoped restore test works like this:

- populate table
- flush it
- collect list of sstables
- take snapshot
- backup
- restore (with the list of sstables as argument)
- check the data is back

Steps 2 and 3 are racy -- in case compaction comes in the middle, the
list of collected sstables would differ from those snapshotted (and
backed up), which will later lead to restore failure due to a missing
sstable.

Fix by collecting the list of sstables after taking snapshot, and
collect those not from the datadir, but from the snapshot dir.

fixes: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-26 10:40:54 +03:00
Pavel Emelyanov
08004fe470 test/backup: Check that backup and restore succeed
The scoped-restore test calls backup and restore APIs on several nodes,
but doesn't check if any of the operations actually succeeds. Sometimes
they indeed don't, and the test captures this, but in a weird manner --
the post-test check for data presence fails, because the expected data
is not in fact in its place.

It's more debugging-friendly if we know in advance if backup or restore
fails, rather than see that some data is missing after (failed) restore.

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-25 19:45:56 +03:00
Gleb Natapov
0aa4a82c83 messaging_service: do not call uninitialized _address_to_host_id_mapper std::function
During messaging_service object creation remove_rpc_client function may
be called if prefer_local snitch setting is true. The caller does not
provide host id, so _address_to_host_id_mapper is called to obtain it,
but at this point the function is not initialized yet.

The patch fixes the code to not call the function if not initialized.
This is not a problem since during messaging_service creation there
is no connection to drop.

Fixes: #23353

Message-ID: <Z-J2KbBK8NoFNYZZ@scylladb.com>
2025-03-25 18:41:16 +02:00
Wojciech Mitros
88d3fc68b5 alter_table_statement: fix renaming multiple columns in tables with views
When we rename columns in a table which has materialized views depending
on it, we need to also rename them in the materialized views' WHERE
clauses.
Currently, we do that by creating a new WHERE clause after each rename,
with the updated column. This is later converted to a mutation that
overwrites the WHERE clause. After multiple renames, we have multiple
mutations, each overwriting the WHERE clause with one column renamed.
As a result, the final WHERE clause is one of the modified clauses with
one column renamed.
Instead, we should prepare one new WHERE clause which includes all the
renamed columns. This patch accomplishes this by processing all the
column renames first, and only preparing the new view schema with the
new WHERE clause afterwards.

This patch also includes a test reproducer for this scenario.

Fixes scylladb/scylladb#22194

Closes scylladb/scylladb#23152
2025-03-25 09:58:58 +01:00
Benny Halevy
9fac0045d1 boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 15:39:53 +02:00
Benny Halevy
62aeba759b tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`.

Refs scylladb/scylla-enterprise#4355

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 15:32:16 +02:00
Benny Halevy
c62865df90 db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced",
which will enable tablets by default for new keyspaces but
without the possibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 14:54:45 +02:00
Michael Litvak
49b8cf2d1d storage_service: fix tablet split of materialized views
This fixes an issue where materialized view tablets are not split
because they are not registered as split candidates by the storage
service.

The code in storage_service::replicate_to_all_cores was changed in
4bfa3060d0 to handle normal tables and view tables separately, but with
that change register_tablet_split_candidate is applied only to normal
tables and not every table like before. We fix it by registering view
tables as well.

We add a test to verify that split of MV tables works.

Closes scylladb/scylladb#23335
2025-03-24 08:23:58 +01:00
Pavel Emelyanov
79b9626d16 Merge 'service: do not include unused headers ' from Kefu Chai
these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. also, updated the "iwyu.yaml" (short for include what you use) workflow to include "service" and "raft" subdirectories to prevent future regressions of including unused headers in them.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#23373

* github.com:scylladb/scylladb:
  .github: add "raft" and "service" subdirectories to CLEANER_DIR
  service: do not include unused headers
2025-03-24 10:20:15 +03:00
Avi Kivity
cc5fe542ed test: ignore unused fmt::to_string() result
fmt 11.1 apparently marks to_string() as [[nodiscard]]. Here we aren't
interested in the result, so explicitly ignore it to avoid an error.

Closes scylladb/scylladb#23403
2025-03-24 10:19:09 +03:00
Avi Kivity
9d49c3254f install-dependencies.sh: disabiguate python magic package
There are in fact two python magic packages: file-magic (which binds
to libmagic and comes from the file package) and magic, an independent
one. The name we use in install-dependencies.sh, python3-magic,
resolves to file-magic.

In Fedora 42, the resolution from the name python3-magic to
file-magic was removed [1], and so install-dependencies.sh now tries
to install the wrong magic package, which turns out not to coexist
with the one we want anyway.

Fix by naming python3-file-magic directly instead. Since this is what's
installed in the current frozen toolchain, there's no need to
regenerate it; we're just making the package list work in Fedora 42.

[1] 81910b7d88

Closes scylladb/scylladb#23402
2025-03-24 10:18:27 +03:00
Avi Kivity
cd04ab1a4e test: avoid spaces when defining user-defined literal operator
Clang 20 complains when it sees a user-defined literal operator
defined with a space before the underscore. Assume it's adhering
to the standard and comply.

Closes scylladb/scylladb#23401
2025-03-24 10:17:12 +03:00
Pavel Emelyanov
d436fb8045 Merge 'Fix EAR not applied on write to S3 (but on read).' from Calle Wilund
Fixes #23225
Fixes #23185

Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves
extension wrapping of file and sink objects to storage level.
(Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not).

This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail.
Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR.

Unit tests for io objects and a macro test for S3 encrypted storage included.

Closes scylladb/scylladb#23261

* github.com:scylladb/scylladb:
  encryption: Add "wrap_sink" to encryption sstable extension
  encrypted_file_impl: Add encrypted_data_sink
  sstables::storage: Move wrapping sstable components to storage provider
  sstables::file_io_extension: Add a "wrap_sink" method.
  sstables::file_io_extension: Make sstable argument to "wrap" const
  utils: Add "io-wrappers", useful IO helper types
2025-03-24 10:12:46 +03:00
Artsiom Mishuta
8bb6414037 test.py: reuse clusters in Python suite
PR https://github.com/scylladb/scylladb/pull/22274 was introduced due to
CI instability, with the intent of marking the cluster dirty after each
topology test. In fact, it affects only the Python suites, which are quite
stable, and CI was stabilized by PR https://github.com/scylladb/scylladb/pull/22252

This PR restores cluster reuse in the Python test suites.

Closes scylladb/scylladb#23179
2025-03-23 20:08:36 +02:00
Kefu Chai
fdc5255eb8 build: disable DPDK for all release builds
Previously, DPDK was enabled by default in standard release builds but disabled
in "release-pgo" and "release-cs-pgo" builds. This inconsistency caused linking
warnings during PGO phase 2, when trained profiles from non-DPDK builds were
used with DPDK-enabled builds:

```
[1980/1983] LINK build/release/scylla
ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor14run_some_tasksEv Hash = 2095857468992035112 up to 0 count discarded
ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor6do_runEv Hash = 2184396189398169723 up to 50134372 count discarded
ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar18syscall_work_queue11submit_itemESt10unique_ptrINS0_9work_itemESt14default_deleteIS2_EE Hash = 1533150042646546219 up to 1979931 count discarded
```

Since DPDK is not used in production and increases build time, this
change disables DPDK across all release build types. This both silences
the warnings and improves build performance.

Fixes #23323
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23391
2025-03-23 15:26:10 +02:00
Avi Kivity
9adfb91f46 Merge 'Introduce s3 data_source_impl for optimized object streaming' from Pavel Emelyanov
Currently, to stream data from sstable component the sstables code uses file_data_source_impl. In case the component is on S3, the s3::readable_file is put into that data source. The data source is configured with 128k buffers and at most 4 read-ahead-s. With that configuration, downloading full object from S3 becomes too slow -- GET-ing file with 128k requests is not nice even with 4 parallel read-ahead-s.

Better solution for S3 downloading is to request way larger chunk with one GET and then produce smaller, 128k or alike, buffers upon data arrival. This is what the newly introduced data source impl does -- it spawns a background GET and lets the upper input stream read buffers directly from the arriving body.

This PR doesn't yet make the sstable layer use the new data source, just introduces it and adds unit and perf tests.

Testing

|Test|Download speed, MB/s|
|-|-|
|file_input_stream (*), 1 socket | 4.996|
|file_input_stream (*), 2 sockets | 9.403|
|s3_data_source (**) | 93.164|

(*) The file_input_stream test renders 128k GETs and is configured to issue at most 4 read-ahead-s
(**) The s3_data_source uses at most 1 socket regardless of what perf-test configures it to

refs: #22458

Closes scylladb/scylladb#22907

* github.com:scylladb/scylladb:
  test: Extend s3-perf test with stream download one
  test/perf: Tune-up s3 test options parsing
  test: Add unit test for newly introduced download source
  s3/client: Introduce data_source_impl for object downloading
  s3/client: Detach format_range_header() helper
2025-03-23 14:22:04 +02:00
Pavel Emelyanov
ca3b604afa test: Extend s3-perf test with stream download one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:07 +03:00
Pavel Emelyanov
283e8e0706 test/perf: Tune-up s3 test options parsing
Rename the `--upload bool` into `--operation string` one, so that new
tests can be added in the future. Also rename run_download() to
run_contiguous_get() because this is what the internals of this method
do -- just GET contiguous ranges sequentially.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:07 +03:00
Pavel Emelyanov
bd313c581f test: Add unit test for newly introduced download source
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:06 +03:00
Pavel Emelyanov
1f301b1c5d s3/client: Introduce data_source_impl for object downloading
The new data source implementation runs a single GET for the whole range
specified and lends the body input_stream for the upper input_stream's
get()-s. Eventually, getting the data from the body stream EOFs or
fails. In either case, the existing body is closed and a new GET is
spawn with the updater Range header so that not to include the bytes
read so far.
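
The Range arithmetic behind such a restart can be sketched as below. This is a rough Python illustration, not the C++ code from the patch: the `format_range_header()` signature shown here and the `resume_range_header()` helper are hypothetical.

```python
def format_range_header(offset, length):
    """Return an HTTP Range value for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, hence the "- 1".
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return f"bytes={offset}-{offset + length - 1}"


def resume_range_header(offset, length, bytes_read):
    """Range value for the part of the original range not yet received."""
    return format_range_header(offset + bytes_read, length - bytes_read)


# Original request: 500 bytes starting at offset 1000.
first = format_range_header(1000, 500)
# The body failed after 200 bytes, so the next GET skips them.
resumed = resume_range_header(1000, 500, 200)
print(first, resumed)
```

The restarted request keeps the original end of the range and only moves the start forward by the number of bytes already consumed.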

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:06 +03:00
Pavel Emelyanov
d47719f70e s3/client: Detach format_range_header() helper
get_object_contiguous() formats the 'bytes=X-Y' Range header for its GET
request. The very same code will be needed by the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:06 +03:00
Avi Kivity
7646e1448a Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.

The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
  keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
  so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
  keyspaces are RF-rack-valid. If not, the initialization fails.

We provide tests verifying that the changes behave as intended.

---

Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.

---

Implementation strategy (going beyond the scope of this PR):

1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.

---

Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300

---

Backport: this is part of the requirements for releasing 2025.1.

Closes scylladb/scylladb#23138

* github.com:scylladb/scylladb:
  main: Refuse to start node when RF-rack-invalid keyspace exists
  cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
  db/config: Introduce RF-rack-valid keyspaces
2025-03-20 19:10:36 +02:00
Paweł Zakrzewski
0d14177409 audit/syslog: escape quotes and add explicit section names
Before this change we output a CSV-like structure that looked like the
following:
Feb 27 12:31:30 scylla-audit: "10.200.200.41:0", "AUTH", "", "", "", "", "10.200.200.41:0", "cassandra", "false"

While this is passably readable for humans, the ordering of fields is
not clear and can be confusing. Furthermore, the `"` character (double
quote) was not escaped. This is not an issue for CQL, but will be a
problem for auditing Alternator, which will require logging JSON
payloads.

The new format will consist of key=value pairs and will escape the quote
character, making it easy to parse programmatically.
Feb 28 02:21:56 scylla-audit: node="10.200.200.41:0", category="AUTH", cl="", error="false", keyspace="", query="", client_ip="10.200.200.41:0", table="", username="cassandra"

This is required for the auditing alternator feature.
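
A rough sketch of producing such a line is shown below. It is an illustration only: the commit says the quote character is escaped but does not spell out the scheme, so the backslash escaping and the `format_audit_line()` helper are assumptions.

```python
def format_audit_line(fields):
    """Render (key, value) pairs as key="value", escaping backslashes and double quotes."""
    parts = []
    for key, value in fields:
        # Assumed scheme: backslash-escape backslashes first, then quotes.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{key}="{escaped}"')
    return ", ".join(parts)


line = format_audit_line([
    ("node", "10.200.200.41:0"),
    ("category", "AUTH"),
    ("query", 'SELECT "x" FROM t'),
    ("username", "cassandra"),
])
print(line)
```

With the quotes inside values escaped, a parser can split on `key="..."` pairs without being confused by payloads that themselves contain double quotes, such as JSON.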

Closes scylladb/scylladb#23099
2025-03-20 19:55:51 +03:00
Calle Wilund
5c6337b887 encryption: Add "wrap_sink" to encryption sstable extension
Creates a more efficient data_sink wrapper for encrypted output
stream (S3).
2025-03-20 14:54:24 +00:00
Calle Wilund
9ac9813c62 encrypted_file_impl: Add encrypted_data_sink
Adds a sibling type to encrypted file, a data_sink, that
will write a data stream in the same block format as a file
object would. Including end padding.

For making encrypted data sink writing less cumbersome.
2025-03-20 14:54:24 +00:00
Calle Wilund
e02be77af7 sstables::storage: Move wrapping sstable components to storage provider
Fixes #23225
Fixes #23185

Moved wrapping component files/sinks to storage provider. Also ensures
to wrap data_sinks as well as actual files. This ensures that we actually
write encryption if active.
2025-03-20 14:54:24 +00:00
Calle Wilund
d46dcbb769 sstables::file_io_extension: Add a "wrap_sink" method.
Similar to wrap file, should wrap a data_sink (used for
sstable writers), in obvious write-only, simple stream
mode.

Default impl will detect if we wrap files for this component,
and if so, generate a file wrapper for the input sink, wrap
this, and then wrap it in a file_data_sink_impl.

This is obviously not efficient, so extensions used in actual
non-test code should implement the method.
2025-03-20 14:54:22 +00:00
Calle Wilund
e100af5280 sstables::file_io_extension: Make sstable argument to "wrap" const
This matches the signature of call sites. Since the only "real"
extension to actually make a marker in the sstable will do so in
the scylla component, which is writable even in a const sstable,
this is ok.
2025-03-20 14:54:09 +00:00
Calle Wilund
98a6d0f79c utils: Add "io-wrappers", useful IO helper types
Mainly to add a somewhat functional file-impl wrapping
a data_sink. This can implement a rudimentary, write-only,
file based on any output sink.

For testing, and because they fit there, place memory
sink and source types there as well.
2025-03-20 14:54:09 +00:00
David Garcia
209ea2ea27 docs: update issues label
Closes scylladb/scylladb#23304
2025-03-20 17:46:58 +03:00
Kefu Chai
c37149d106 test: stop using seastar::at_exit()
seastar::at_exit() was marked deprecated recently. so let's use
the recommended approach to perform cleanups.

the following tests were updated in this change:

- scylla perf-tablets: tested with
  scylla perf-tablets
- scylla perf-row-cache-update: tested with
  scylla perf-row-cache-update
- scylla perf-fast-forward: tested with
  scylla perf-fast-forward --populate --run-tests small-partition-skips \
    --smp 1
  scylla perf-fast-forward --run-tests small-partition-skips \
    --smp 1
- scylla perf-load-balancing: tested with
  scylla perf-load-balancing --nodes 3 --tablets1 16 --tablets2 16 --rf1 3 --rf2 3 --shards 16
- unit/row_cache_stress_test: tested with
  row_cache_stress_test --seconds 10
- perf/perf_cache_eviction: tested with
  ./perf_cache_eviction --seconds 1 --smp 1
- perf/perf_row_cache_reads: tested with
  ./perf_row_cache_reads

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23356
2025-03-20 17:44:57 +03:00
Ernest Zaslavsky
2fb5c7402e s3_client: Rearrange credentials providers chain
As the IAM role is not configured to assume a role at this moment, it
makes sense to move the instance metadata credentials provider up in
the chain. This avoids unnecessary network calls and prevents log
clutter caused by failure messages.

Closes scylladb/scylladb#23360
2025-03-20 17:43:04 +03:00
Pavel Emelyanov
23089e1387 Merge 'Enhance S3 client robustness' from Ernest Zaslavsky
This PR introduces several key improvements to bolster the reliability of our S3 client, particularly in handling intermittent authentication and TLS-related issues. The changes include:

1. **Automatic Credential Renewal and Request Retry**: When credentials expire, the new retry strategy now resets the credentials and sets the client to the retryable state, so the client will re-authenticate and automatically retry the request. This change prevents transient authentication failures from propagating as fatal errors.
2. **Enhanced Exception Unwrapping**: The client now extracts the embedded std::system_error from std::nested_exception instances that may be raised by the Seastar HTTP client when using TLS. This allows for more precise error reporting and handling.
3. **Expanded TLS Error Handling**: We've added support for retryable TLS error codes within the std::system_error handler. This modification enables the client to detect and recover from transient TLS issues by retrying the affected operations.

Together, these enhancements improve overall client robustness by ensuring smoother recovery from both credential and TLS-related errors.

No backport needed since it is an enhancement

Closes scylladb/scylladb#22150

* github.com:scylladb/scylladb:
  aws_error: Add GNU TLS codes
  s3_client: Handle nested std::system_error exceptions
  s3_client: Start using new retry strategy
  retry_strategy: Add custom retry strategy for S3 client
  retry_strategy: Make `should_retry` awaitable
2025-03-20 16:52:20 +03:00
Andrei Chekun
502b31d9c2 test.py: Refactor nodetool/conftest
Remove using method for finding root dir of the project and start using
the constant defined in package.
2025-03-20 11:41:30 +01:00
Andrei Chekun
1ea7b99385 test.py: Refactor test/pylib/cpp/ldap
Rename and move prepare_instance from ldap tests directory to
pylib/ldap_server.
2025-03-20 11:41:30 +01:00
Andrei Chekun
33e53565c4 test.py: move starting LDAP service to dedicate method
Move starting LDAP to the method where the rest of the services are
started. This will unify the way of starting the 3rd party services.
Fix LDAP test flakiness caused by failures to connect to the LDAP server.
Add catching stdout and stderr of toxiproxy-cli in case of errors
2025-03-20 11:37:04 +01:00
Pavel Emelyanov
339a849f13 transport: Remove connection::make_client_key()
It's effectively unused, there's one place where connection initializes
the client_data object using this helper, but that initialization looks
better without it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23321
2025-03-20 10:22:05 +01:00
Calle Wilund
5cc3fc4f14 cluster/test_encryption: bring test from enterprise (and enable)
Fixes scylladb/scylla-enterprise#5262

Part of the source-available code migration from scylla-enterprise.git
to scylla.git.

Original comment: topology_custom: add test_file_streaming_respects_encryption

Reproducer for issue scylladb/scylla-enterprise#4246.

Closes scylladb/scylladb#23320
2025-03-20 10:07:16 +02:00
Kefu Chai
ebf9125728 storage_proxy: Prevent integer overflow in abstract_read_executor::execute
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
   modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
   last_modified timestamps

This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.

Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```

Related to previous fix 39325cf which handled negative read_timestamp cases.

Fixes #23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23359
2025-03-20 10:05:42 +02:00
Botond Dénes
d06bc27979 Merge 'Don't export string filenames from sstable' from Pavel Emelyanov
There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print them into logs, sometimes are used to be put into exception messages. And there are places that use these strings as file names. Since now sstables can also be stored on S3, generic code shouldn't consider those strings as on disk file names.

Other than that, even when the methods are used to put component names into logs, in many cases these log messages come with debug or trace level, so generated strings are immediately dropped on the floor, but generating it is not extremely cheap. Code would benefit from using lazily-printed names.

This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit.

refs: #14122 (previous ugly attempt to achieve the same goal)

Closes scylladb/scylladb#23194

* github.com:scylladb/scylladb:
  sstable: Remove unused malformed_sstable_exctpion(string filename)
  sstables: Make filename() return component_name
  sstables: Make file_writer keep component_name on board
  sstables: Make get_filename() return component_name
  sstables: Make toc_filename() return component_name
  sstables: Make sstable::index_filename() return component_name
  sstables: Introduce struct component_name
  sstables: Remove unused sstable::component_filenames() method
  sstables: Do not print component filenames on load-and-stream wrap-up
  sstables: Explicitly format prefix in S3 object name making
  sstables: Don't include directory name in exception
  sstables: Use fmt::format instead of string concatenation
  sstables: Rename filename($component) calls to ${component}_filename()
  sstables: Rename local filename variable to component_name
2025-03-20 09:51:03 +02:00
Kefu Chai
fd14a23aab .github: add "raft" and "service" subdirectories to CLEANER_DIR
in order to prevent future inclusion of unused headers, let's include
"raft" and "service" subdirectories to CLEANER_DIR, so that this
workflow can identify the regressions in future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-20 11:18:16 +08:00
Kefu Chai
b3e2561ed8 service: do not include unused headers
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-20 11:18:16 +08:00
Avi Kivity
a62ab824e6 schema: deprecate schema_extension
schema_extension allows making invisible changes to system_schema
that evade upgrade rollback tests. They appear in system_schema
as an encoded blob which reduces serviceability, as they cannot
be read.

Deprecate it and point users to adding explicit columns in scylla_tables.

We could probably make use of the data structure, after we teach it
to encode its payload into proper named and typed columns instead of
using IDL.

Closes scylladb/scylladb#23151
2025-03-19 20:36:16 +02:00
Kefu Chai
8fdaaf6491 service/storage_proxy: Improve digest comparison
Previously, the code used a find_if to compare each digest to the first
one to check for any mismatches. This was less readable. This change
replaces that with `std::ranges::all_of`, which checks if all elements
in the range are equal to the first digest, improving readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23332
2025-03-19 18:21:14 +03:00
Nadav Har'El
317de64281 test/alternator: enable debugging output during Python crashes
For a long time now, we've been seeing (see #17564), once in a while,
Alternator tests crashing with the Python process getting killed on
SIGSEGV after the tests have already finished successfully and all
pytest had to do is exit. We have not been able to figure out where the
bug is. Unfortunately, we've never been able to reproduce this bug
locally - and only rarely do we see it in CI runs, and when it happens
we don't have any information on why it happened.

So the goal of this patch is to print more information that might
hopefully help us next time we see this problem in CI (this patch
does NOT fix the bug). This patch adds to test/alternator's conftest.py
a call to faulthandler.enable(). This traps SIGSEGV and prints a stack
trace (for each thread, if there are several) showing what Python was
trying to do while it is crashing. Hopefully we'll see in this output
some specific cleanup function belonging to boto3 or urllib or whatever,
and be able to figure out where the bug is and how to avoid it.
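
For reference, enabling the handler takes only a couple of lines. The snippet below is a minimal sketch rather than the exact code added to conftest.py; passing `sys.__stderr__` explicitly is a choice made for this example so that the dump goes to the process's original stderr.

```python
# conftest.py (sketch)
import faulthandler
import sys

# Dump the Python traceback of every thread if the process receives a
# fatal signal such as SIGSEGV. sys.__stderr__ is the original stderr
# of the process, unaffected by later reassignment of sys.stderr.
faulthandler.enable(file=sys.__stderr__, all_threads=True)
```

`faulthandler` is part of the Python standard library, so this adds no new dependency to the tests.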

We could have added this faulthandler.enable() call to the top-level
conftest.py or to test.py, but since we only ever had this Python
crash in Alternator tests, I think it is more suitable that we limit
this desperate debugging attempt only to Alternator tests.

Refs #17564

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23340
2025-03-19 18:18:51 +03:00
Dawid Mędrek
0e04a6f3eb main: Refuse to start node when RF-rack-invalid keyspace exists
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.

Fixes scylladb/scylladb#23300
2025-03-19 15:13:44 +01:00
Dawid Mędrek
41f862d7ba cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
In this commit, we refuse to create or alter a keyspace when that operation
would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is
enabled.

We provide two tests verifying that the changes work as intended.

Fixes scylladb/scylladb#23276
2025-03-19 14:51:47 +01:00
Dawid Mędrek
32879ec0d5 db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
that all keyspaces remain RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in the form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356
2025-03-19 14:46:35 +01:00
Pavel Emelyanov
6e7d6b06f0 api: Squash two parse_table_infos into one
There are currently three of them:
- one that works on query parameter value
- one that works on query parameters map
- one that works on the request itself

The second one is no longer used by anyone but the third one, so
squash them together.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:53:38 +03:00
Pavel Emelyanov
851bd38953 api: Generalize keyspaces:tables parsing a little bit more
Continuation of the previous patch -- there's one caller that uses "non
standard" name for the tables query parameter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:52:54 +03:00
Pavel Emelyanov
dc3455bc55 api: Provide general pair<keyspace, vector<table>> parsing
Lots of API handlers get "keyspace" path parameter and parse the "cf"
query one into a vector of table_infos. Generalize those places.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:51:57 +03:00
Pavel Emelyanov
722f282748 api: Remove ks_cf_func and related code
The type in question is used by two endpoint handlers that are called
with validated keyspace name and parsed vector of table_info-s. Both
handlers can parse what they need on their own, all the more so as the next
patches will make this parsing even simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:49:55 +03:00
Pavel Emelyanov
73187a2e19 Merge 'mutation/mutation_consumer_concepts: simplify consumer hierarchy' from Botond Dénes
The reader consumer concept hierarchy is a sprawling, confusing jungle of deeply nested concepts. Looking at `FlattenedConsumer[V2]` -- the subject of this PR: this consumer is defined in terms of the `StreamedMutationConsumer[V2]`, which in turn is defined in terms of the `FragmentConsumer[V2]`.
This amount of nesting makes it really hard to see what a concept actually comes down to: made even more difficult by the fact that the concepts are scattered across two header files.
In theory, this nesting allows for greater flexibility: some code can use a lower-level concept directly while it can also serve as the basis for the higher-level concepts. But the fact of the matter is that none of the lower-level concepts are used directly, so we pay the price in hard-to-follow code for no benefit.

This PR cuts down the complexity by folding up the entire hierarchy into the top-level `FlattenedConsumer[V2]` and `FlattenedConsumerReturning[V2]` concepts.
Doing this immediately reveals just how similar the two major consumer concepts (`FlattenedConsumer[V2]` and `MutationFragmentConsumer[V2]`) supported by `mutation_reader` are. In a follow-up PR, we will attempt to unify the two.

Refactoring, no backport needed.

Closes scylladb/scylladb#23344

* github.com:scylladb/scylladb:
  mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2]
  mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2]
  test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
2025-03-19 15:43:00 +03:00
Pavel Emelyanov
a408a7abe1 sstable: Remove unused malformed_sstable_exception(string filename)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
f06cc32812 sstables: Make filename() return component_name
Similarly to toc_, index_ and data filenames, make the generic component
name getter return a wrapper object rather than a string. Most of the
callers are log messages and exception generations. Other than that
there are tests, filesystem storage driver and a few more places in
generic code that "know" that they work with real files, so make them use
explicit fmt::to_string().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
68c41f0459 sstables: Make file_writer keep component_name on board
The class in question is a wrapper around output_stream that writes,
flushes and closes the stream in async context. For logging it also
keeps the component filename on board, and now is a good time to patch
it to keep the component_name instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
1ba91e28cb sstables: Make get_filename() return component_name
Similarly to previous patches -- mostly the result is used as log
argument. The remaining users include

- scylla sstable tool that dumps component names to json output
- API endpoint that returns component names to user
- tests

these are all good to explicitly convert component_names to strings.

There are a few more places that expect strings instead of component name
objects. For now they also use fmt::to_string() explicitly, partially it
will be fixed later, mostly -- as future follow-ups.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
0cdeed858c sstables: Make toc_filename() return component_name
Most of the callers use the returned value as log message parameter,
some construct malformed_sstable_exception that was prepared by previous
patch.

The remaining callers explicitly use fmt::to_string(), these are

- pending deletion log creation
- filesystem storage code
- tests
- stream-blob code that re-loads sstable

All but the last one are OK to use string toc name, the last one is not
very correct in its usage of toc_filename string, but it needs more care
to be fixed properly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
80e0030613 sstables: Make sstable::index_filename() return component_name
Most of the method callers use it as log parameter. There are a few more
places that push it to malformed_sstable_exception, which immediately
converts it to string, so this patch makes the exception be constructed
with the component_name as well.

And there's one more place that passes this string to file_writer
constructor. For now, convert it to string explicitly, but next patches
will fix that place to use pure component_name too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:01:23 +03:00
Pavel Emelyanov
dbb9ee15c1 sstables: Introduce struct component_name
The structure wraps const reference to sstable and component_name value
(it's an enum of several elements). It also has a formatter so that it
can be directly printed in logs (main usage) as well as converted to
strings (auxiliary and discouraged usage).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
aba400f5d9 sstables: Remove unused sstable::component_filenames() method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
24e5c30cc8 sstables: Do not print component filenames on load-and-stream wrap-up
When load-and-stream finishes it may call sstable::unlink() method to
drop the loaded (and streamed) sstable. Before calling it it prints a
log message about its intention that includes component_filenames()
vector. This log message is ugly in several ways.

First, it prints only recognized components, while unlink() method
unlinks all of them, so it's sort of misleading (it doesn't seem that
anyone ever read this message IRL though)

Next, that's the only place that is _that_ verbose about sstable
unlinking. "Common" unlinking paths don't print that much info.

Finally, the log message is emitted at debug level, so it hardly ever
appears in any logs, but collecting several filenames takes time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
fb2bd91009 sstables: Explicitly format prefix in S3 object name making
Sometimes a component object name looks like
s3://bucket/prefix/component. For that the path formatting code formats
bucket name with the result of sstable->filename() invocation. This
patch changes it to format bucket name, prefix itself and
sstable->component_filename().

The change does not alter the resulting name, as sstable::filename() just concatenates prefix
with sstable::component_filename(). This change will help to remove the
former method from sstable soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
f212b5efa9 sstables: Don't include directory name in exception
When filesystem storage throws an exception about failure to create
components hardlinks, it includes three paths in it -- source file
name, destination file name and the directory name. The directory name
is excessive, source file name already has it. Also, this change will
make it possible to remove one of the malformed_sstable_exception
constructors soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
a8bc81eb3c sstables: Use fmt::format instead of string concatenation
There are some places that concatenate filenames with something else to
get a different filename (tool does it) or a message for an exception
(read_toc() helper). This patch uses fmt::format() instead to facilitate
future patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
dcc9167734 sstables: Rename filename($component) calls to ${component}_filename()
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the dedicated ones.
It's shorter, nicer and makes further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
e6898a8854 sstables: Rename local filename variable to component_name
This is to be consistent with future changes and not to bloat them with
extra renames

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:20 +03:00
Kefu Chai
1ab2b7e7a0 tree: fix misspellings
these two misspellings were flagged by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23357
2025-03-19 09:13:20 +02:00
Botond Dénes
8f0d0daf53 Merge 'repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk
Do not hold erm during repair of a tablet that is started with tablet
repair scheduler. This way two different tablets can be repaired
and migrated concurrently. The same tablet won't be migrated while
being repaired, as this is ensured by the topology coordinator.

Use topology_guard to maintain safety.

Fixes: https://github.com/scylladb/scylladb/issues/22408.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#22842

* github.com:scylladb/scylladb:
  test: add test to check concurrent tablets migration and repair
  repair: do not hold erm for repair scheduled by scheduler
  repair: get total rf based on current erm
  repair: make shard_repair_task_impl::erm private
  repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
  repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
  repair: pass session_id to repair_writer_impl::create_writer
  repair: keep materialized topology guard in shard_repair_task_impl
  repair: pass session_id to repair_meta
2025-03-19 08:55:24 +02:00
Kefu Chai
aca00118fb service: fix misspellings
these misspellings were flagged by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23334
2025-03-18 22:21:45 +02:00
Piotr Dulikowski
2ca1c0b6f9 Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak
This PR introduces the new Raft-based recovery procedure for group 0
majority loss.

The Raft-based recovery procedure works with tablets. The old
gossip-based recovery procedure does not because we have no code
for tablet migrations after the gossip-based topology changes.

The Raft-based procedure requires the Raft-based topology to be
enabled in the cluster. If the Raft-based topology is not enabled, the
gossip-based procedure must be used.

We will be able to get rid of the gossip-based procedure when we make
the Raft-based topology mandatory (we can do both in the same version,
2025.2 is the plan). Before we do it, we will have to keep both procedures
and explain when each of them should be used.

The idea behind the new procedure is to recreate group 0 without
touching the topology structures. Once we create a new group 0, we
can remove all dead nodes using the standard `removenode` and
`replace` operations.

For the procedure to be safe, we must ensure that each member of the
new group 0 moves to the same initial group 0 state. Also, the only safe
choice for the state is the latest persistent state available among the
live nodes.

The solution to the problem above is to ensure that the leader of the new
group 0 (called the recovery leader) is one of the nodes with the latest
state available. Other members will receive the snapshot from the
recovery leader when they join the new group 0 and move to its state.

Below is the shortened description of the new recovery procedure from
the perspective of the administrator. For the full description, refer to the
design document.
1. Find the set of live nodes.
2. Kill any live node that shouldn't be a member of the new group 0.
3. Ensure the full network connectivity between live nodes.
4. Rolling restart live nodes to ensure they are healthy and ready for
recovery.
5. Check if some data could have been lost. If yes, restore it from
backup after the recovery procedure.
6. Find the recovery leader (the node with the largest `group0_state_id`).
7. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the
recovery leader on each live node.
9. Rolling restart all live nodes, but the recovery leader must be
restarted first.
10. Remove all dead nodes using `removenode` or `replace`.
11. Unset `recovery_leader` on all nodes.
12. Delete data of the old group 0 from `system.raft`,
`system.raft_snapshots`, and `system.raft_snapshot_config`.

In the future, we could automate some of these steps or even introduce
a tool that will do all (or most) of them by itself. For now, we are fine with
a procedure that is reliable and simple enough.

This PR makes using 2025.1 with tablets much safer. We want to
backport it to 2025.1. We will also want to backport a few follow-ups.

Fixes scylladb/scylladb#20657

Closes scylladb/scylladb#22286

* github.com:scylladb/scylladb:
  test: mark tests with the gossip-based recovery procedure
  test: add tests for the Raft-based recovery procedure
  test: topology: util: fix the tokens consistency check for left nodes
  test: topology: util: extend start_writes
  gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
  raft_group0: modify_raft_voter_status: do not add new members
  treewide: allow recreating group 0 in the Raft-based recovery procedure
2025-03-18 19:10:56 +01:00
Yaron Kaikov
b375222408 ./github/scripts/auto-backport.py: don't remove backport label when backport process has an error
Today, when the `Fixes` prefix is missing or the developer is not a collaborator with `scylladbbot` we remove the backport labels to prevent the process from starting and notifying the developers.

Developers are worried that removing these backport labels will cause us to forget we need to do these backports. @nyh suggested adding a `scylladbbot/backport_error` label instead.

Applied those changes, so when a `Fixes` prefix is missing we will add a `scylladbbot/backport_error` label and stop the process

When a user doesn't accept the invite we will still open the PR but he will not be assigned and will not be able to edit the branch when we have conflicts

Fixes: https://github.com/scylladb/scylla-pkg/issues/4898
Fixes: https://github.com/scylladb/scylla-pkg/issues/4897

Closes scylladb/scylladb#23259
2025-03-18 16:19:09 +02:00
Pavel Emelyanov
420b5bee20 test/s3: Increase boost/s3_test log levels
When something goes wrong, it's impossible to find anything out without
s3 and http logs, so increase them for boost tests.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23245
2025-03-18 15:59:05 +02:00
Botond Dénes
a2d0d7b9a0 mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2]
FragmentConsumer[V2] also has no direct users, so fold it into
FlattenedConsumer[V2] as well. With this, FlattenedConsumer[V2] has a
nice and simple definition, with a single nesting level required due to
the return-type flexibility.
2025-03-18 09:24:49 -04:00
Botond Dénes
8768e2e08e mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2]
No code uses StreamedMutationConsumer[V2] directly, so let's take this
opportunity to reduce the jungle of consumer concepts.
2025-03-18 09:24:44 -04:00
Botond Dénes
969b07fdfd test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
The class actually implements the FlattenedConsumer, so fix the comment.
This eliminates the only reference to the StreamedMutationConsumer
concept.
2025-03-18 07:57:04 -04:00
Avi Kivity
9867129c7b Update seastar submodule
* seastar 412d058cf9...2f13c461bb (2):
  > smp: prefaulter: don't leave zombie worker threads
Fixes #23316
  > demos/tcp_sctp_server_demo:  Modernize with seastar::async and proper teardown

Closes scylladb/scylladb#23317
2025-03-18 13:36:05 +02:00
Botond Dénes
2795d83b32 Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund
Fixes #23017

When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed.

Now, eventually, this should be released, once we do deletion again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure.

As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0
db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with disk space. But deleting segments early here is also wasteful; a better solution is
to simply give the segments to all CL shards, thus distributing the available space.

Closes scylladb/scylladb#23150

* github.com:scylladb/scylladb:
  main/commitlog: wait for file deletion and distribute recycled segments to shards
  commitlog: Serialize file deletion
2025-03-18 11:47:17 +02:00
Avi Kivity
176bb464a2 github: error if we see #include "seastar/..."
Seastar is a system library from ScyllaDB's perspective and
so should use angle brackets for #include statements.

Closes scylladb/scylladb#23308
2025-03-17 21:56:48 +02:00
Ernest Zaslavsky
08b9e4d87b aws_error: Add GNU TLS codes
Add GNU TLS error codes to std::system_error handler since we can start getting these once they seep from seastar's http client
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
012f0e6d8c s3_client: Handle nested std::system_error exceptions
Enhance error handling by detecting and processing std::system_error exceptions
nested within std::nested_exception. This improvement ensures that system-level
errors wrapped in the exception chain are properly caught and managed, leading
to more robust error reporting and recovery.
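
A minimal sketch of how a std::system_error buried in a chain of nested
exceptions can be located, using only the standard library. The helper name
`find_system_error` is hypothetical and is not taken from the s3_client code:

```cpp
#include <exception>
#include <optional>
#include <stdexcept>
#include <system_error>

// Walks the chain of nested exceptions and returns the error code of the
// first std::system_error found, or std::nullopt if there is none.
std::optional<std::error_code> find_system_error(const std::exception& e) {
    if (const auto* se = dynamic_cast<const std::system_error*>(&e)) {
        return se->code();
    }
    const auto* nested = dynamic_cast<const std::nested_exception*>(&e);
    // rethrow_nested() terminates when nothing is stored, so check first.
    if (nested && nested->nested_ptr()) {
        try {
            nested->rethrow_nested();
        } catch (const std::exception& inner) {
            return find_system_error(inner);
        } catch (...) {
            // The nested object is not a std::exception; nothing to inspect.
        }
    }
    return std::nullopt;
}
```

A caller can then map the returned error code to a retryable or fatal
condition without caring how deeply the original error was wrapped.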
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
367140a9c5 s3_client: Start using new retry strategy
* Previously, token expiration was considered a fatal error. With this change,
the `s3_client` uses a new retry strategy that tries to renew expired
credentials
* Added related test to the `s3_proxy`
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
ed09614c27 retry_strategy: Add custom retry strategy for S3 client
Introduced a new retry strategy that extends the default implementation.
The should_retry method is overridden to handle a specific case for expired credential tokens.
When an expired token error is detected, the credentials are reset so it is expected that the client will re-authenticate, and the
original request is retried.
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
26062c65e4 retry_strategy: Make should_retry awaitable 2025-03-17 16:36:26 +02:00
Avi Kivity
0e4b303339 tools: toolchain: regenerate for python3-pytest-asyncio 0.24
Fixes a bug related to load_scope="module".

python-driver fixed to version 3.28.2, as it looks like
3.29.0 regressed TLS handling [1]. In any case tools/cqlsh
fixes it to 3.28.2.

Optimized clang from

 https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
 https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Ref #22960.

Fixes #23213

[1] https://github.com/scylladb/python-driver/issues/456

Closes scylladb/scylladb#23236
2025-03-17 15:41:55 +02:00
Botond Dénes
fda3486770 Merge 'Remove some excessive ks:cf -> table_id conversions in API and schema_tables' from Pavel Emelyanov
Actually, the main goal of this PR was to remove parse_tables() helpers from api/ in favor of more flexible (yet equally complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps.

There are several places in API and schema_tables that have table_id at hand, but at some point drop it and carry keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id with the help of table_info struct in those places. This change allows removing the aforementioned parse_tables() helpers from api/ and also saves a few lookups in database maps.

Removing the parse_tables() from api/ is the continuation of the previous effort to reduce the set of helpers in api/ code that help handlers "parse" keyspace and table names; see #22742 and #21533.

Closes scylladb/scylladb#23216

* github.com:scylladb/scylladb:
  api: Remove the remaining parse_tables() overload
  database: Sanitize flush_tables_on_all_shards()
  schema_tables: Remove all_table_names()
  database: Make tables flushing helper use table_info-s, not names
  api: Make keyspace flush endpoint use parse_table_infos() (and a bit more)
  schema_tables,client_state: Switch to using all_table_infos()
  schema_tables: Tune up some methods to benefit from table_infos
  schema_tables: Introduce all_table_infos()
2025-03-17 15:40:41 +02:00
Pavel Emelyanov
6217124d1d s3/client: Make "expected" reply status truly optional
Currently, a caller of client::make_request() can pass a
std::optional<status> argument indicating which status it expects from
the server. If the status doesn't match, the request body handler won't be
called, and the request will fail with an unexpected status exception.

However, a disengaged expected status implicitly means that the requestor
expects the OK (200) status. This makes it impossible to make a query
whose return status is not known in advance and is left for the handler
to check.

The lower-level http client allows a disengaged expected status with the
described semantics -- the handler checks the status on its own. The s3
client needs this behavior for GET requests. The server can respond with OK
or partial content status depending on the Range header. If the header is
absent or is large enough for the requested object to fit into it, the status
would be OK; if the object is "trimmed", the status is partial content.
At the end of the day, the requestor cannot "guess" the returning status in
advance and should check it upon response arrival.
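
The intended semantics can be sketched as follows. This is an illustration
only: `http_status`, `should_call_handler` and `get_reply_is_acceptable` are
hypothetical names, not the s3 client's actual API:

```cpp
#include <optional>

// Hypothetical subset of HTTP statuses, for illustration only.
enum class http_status { ok = 200, partial_content = 206, not_found = 404 };

// A disengaged `expected` means "do not pre-filter": the handler runs and
// is responsible for checking the status itself.
bool should_call_handler(std::optional<http_status> expected, http_status actual) {
    return !expected.has_value() || *expected == actual;
}

// Handler-side check for a ranged GET: either status is acceptable.
bool get_reply_is_acceptable(http_status actual) {
    return actual == http_status::ok || actual == http_status::partial_content;
}
```

With this shape, a GET caller passes a disengaged value and applies its own
check, while callers that know the exact status keep passing it explicitly.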

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23243
2025-03-17 15:34:58 +02:00
Botond Dénes
afa305ffb4 Merge 'perf/perf_sstable: stop using at_exit() ' from Kefu Chai
`seastar::at_exit()` was marked deprecated recently, so let's use the recommended approach to perform cleanups.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#23253

* github.com:scylladb/scylladb:
  perf/perf_sstable: fix the indent
  perf/perf_sstable: stop using at_exit()
2025-03-17 15:30:10 +02:00
Andrei Chekun
d68e54c26d test.py: Remove reuse cluster in cluster tests
The pool is not aware of the cluster configuration, so it can hand a test
a cluster that is not suitable for it. Removing reuse eliminates that
possibility, so there will be fewer flaky tests.

Closes scylladb/scylladb#23277
2025-03-17 15:27:59 +02:00
Calle Wilund
1525cb2dba main/commitlog: wait for file deletion and distribute recycled segments to shards
Refs #23017

When replaying a large commitlog from an unclean node, we can cause shard 0
db commitlog to reach footprint limit, and then remain there (because we
never release segments lower than limit). This is wasteful with disk space.
But deleting segments early here is also wasteful; a better solution is
to simply give the segments to all CL shards, thus distributing the available
space.

v2:
* Do segment distribution using ranges. go c++23
2025-03-17 12:09:00 +00:00
Calle Wilund
4ed81e05bf commitlog: Serialize file deletion
Fixes #23017

When deleting segments while our footprint is over the limit,
mainly when recycling/deleting segments after replay (recover
boot) we can cause two deletion passes to be running at the same
time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are
considering many files for delete/recycle, we can, due to task
switch, end up considering segments ok to keep, in parallel,
even though one of them should be deleted. The end result
will be us keeping one more segment than should be allowed.
Now, eventually, this should be released, once we do deletion
again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might
cause some delay in processing cycles for recycle, but in
practice, this should never happen when we are in fact under
pressure.
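
The idea of serializing deletion passes can be illustrated with a toy model.
Everything here is an assumption made for the sketch: `segment_pool` is a
hypothetical class, and a plain std::mutex stands in for whatever Seastar
primitive the patch actually uses:

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <numeric>
#include <utility>

// Hypothetical model of a commitlog segment pool; not the real class.
class segment_pool {
    std::mutex _deletion_lock;         // only one deletion pass at a time
    std::deque<std::size_t> _segments; // segment sizes, oldest first
    std::size_t _limit;                // allowed total footprint

public:
    segment_pool(std::deque<std::size_t> segments, std::size_t limit)
        : _segments(std::move(segments)), _limit(limit) {}

    // Not synchronized on its own; delete_pass() calls it under the lock.
    std::size_t footprint() const {
        return std::accumulate(_segments.begin(), _segments.end(), std::size_t{0});
    }

    // Drops the oldest segments until the footprint fits the limit and
    // returns how many were dropped. Because the whole pass holds the lock,
    // two passes can never decide on the same stale footprint.
    std::size_t delete_pass() {
        std::lock_guard<std::mutex> guard(_deletion_lock);
        std::size_t total = footprint();
        std::size_t dropped = 0;
        while (!_segments.empty() && total > _limit) {
            total -= _segments.front();
            _segments.pop_front();
            ++dropped;
        }
        return dropped;
    }
};
```

A second pass that starts while the first is running simply waits and then
sees the updated footprint, which is the property the commit relies on.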

Small unit test included.
2025-03-17 12:09:00 +00:00
Anna Stuchlik
cd61f60549 doc: fix product names in the 2025.1 upgrade guides
This commit fixes the product names in the upgrade 2025.1 guides so that:

- 6.2 is preceded with "ScyllaDB Open Source"
- 2024.x is preceded with "ScyllaDB Enterprise"
- 2025.1 is preceded with "ScyllaDB"

Fixes https://github.com/scylladb/scylladb/issues/23154

Closes scylladb/scylladb#23223
2025-03-17 13:54:11 +03:00
Anna Stuchlik
dbbf9e19e4 doc: remove the outdated info on seeds-info
This commit removes the outdated information about seed nodes.
We no longer need it in the docs, as a) the documentation is versioned,
and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions
mentioned in the docs are no longer supported.

In addition, some clarification has been added to the existing sections.

Fixes https://github.com/scylladb/scylladb/issues/22400

Closes scylladb/scylladb#23282
2025-03-17 13:53:48 +03:00
Andrei Chekun
7423edb1f7 test.py: Increase verbosity of pytest
Currently, pytest truncates long objects in assertions.
This makes understanding the failure message difficult.
This will increase verbosity and pytest will stop truncating messages.

Closes scylladb/scylladb#23263
2025-03-17 12:51:41 +02:00
Aleksandra Martyniuk
20f9d7b6eb test: add test to check concurrent tablets migration and repair
Add a test to check whether a tablet can be migrated while another
tablet is repaired.
2025-03-17 10:37:03 +01:00
Aleksandra Martyniuk
5b792bdc98 repair: do not hold erm for repair scheduled by scheduler
Do not hold erm for tablet repair scheduled by scheduler. Thanks to
that one tablet repair won't exclude migration of other tablets.

Concurrent repair and migration of the same tablet isn't possible,
since a tablet can be in only one type of transition at a time.
Hence the change is safe.

Refs: https://github.com/scylladb/scylladb/issues/22408.
2025-03-17 10:37:02 +01:00
Aleksandra Martyniuk
a1375896df repair: get total rf based on current erm
Get total rf based on erm. Currently, it does not change anything
because erm stays the same during the whole repair.
2025-03-17 10:36:18 +01:00
Aleksandra Martyniuk
34cd485553 repair: make shard_repair_task_impl::erm private
Make shard_repair_task_impl::erm private. Access it with getter.
2025-03-17 10:36:14 +01:00
Andrei Chekun
a20d848c01 test.py: Refactor test/conftest.py
Move functions responsible for preparation of the environment to the util file.
This is extracted from https://github.com/scylladb/scylladb/pull/22894 to make it easier to work together.

Closes scylladb/scylladb#23221
2025-03-17 11:31:00 +02:00
Avi Kivity
4416b0c732 treewide: use angle brackets for including seastar headers
Seastar is an external library, so we use angle brackets to
include its interfaces.

Closes scylladb/scylladb#23301
2025-03-17 10:03:06 +02:00
Andrei Chekun
1e1d213592 test.py: Remove additional report generation for python tests
Pytest is responsible for generating the report of the failed tests and
there is no need to generate it one more time.

Closes scylladb/scylladb#23237
2025-03-17 09:36:08 +02:00
Kefu Chai
f8800b3f19 ent/encryption: rename "padd" to "padding"/"pad" and use structured bindings
Replace the abbreviated term "padd" with either "padding" or "pad" throughout
the encryption module. While "padd" was originally chosen to align with other
variable names ("type" and "mode"), using standard terminology improves code
readability and resolves codespell warnings.

Additionally, refactor relevant code to use C++ structured bindings for cleaner
implementation.
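
As an illustration of the structured-bindings style mentioned above, here is
a small self-contained sketch. The helper names and the "type/mode/padding"
string layout are assumptions for the example, not code from the encryption
module:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <tuple>

// Hypothetical helper: splits "type/mode/padding" into three parts.
// Missing parts come back as empty strings.
std::tuple<std::string, std::string, std::string> split_cipher_spec(std::string_view spec) {
    std::string parts[3];
    std::size_t index = 0;
    for (char c : spec) {
        if (c == '/' && index < 2) {
            ++index;
        } else {
            parts[index] += c;
        }
    }
    return std::make_tuple(parts[0], parts[1], parts[2]);
}

std::string describe_cipher(std::string_view spec) {
    // Structured bindings name each element instead of using std::get<N>().
    const auto [type, mode, padding] = split_cipher_spec(spec);
    return type + " in " + mode + " mode with " + padding + " padding";
}
```

Naming the third element `padding` at the point of use is what makes the
abbreviated "padd" unnecessary.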

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23251
2025-03-17 09:23:42 +02:00
Raphael S. Carvalho
e9944f0b7c service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, so T1 will temporarily be
on only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack.
    This is similar to how we match replicas within the same host.
    It reduces number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack overload in case
    RF!=#racks.

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations
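
The matching order described in points 1 and 2 can be sketched in Python as
follows. This is only an illustration under simplified assumptions, not the
load balancer's code: `Replica`, `plan_colocation` and the host names are
hypothetical, and the real implementation also has to account for load,
shards and other constraints.

```python
from typing import NamedTuple


class Replica(NamedTuple):
    host: str
    rack: str


def plan_colocation(t0, t1):
    """Pair each T1 replica with a T0 replica it should be co-located with.

    Preference order: same host (no migration needed), then same rack,
    then across racks only when nothing else is left.
    Returns (src, dst, across_rack) for every T1 replica that has to move.
    """
    free = list(t0)      # T0 replicas that have not been paired yet
    pending = list(t1)   # T1 replicas that still need a partner
    migrations = []

    def take(match):
        nonlocal pending
        still_pending = []
        for src in pending:
            dst = next((d for d in free if match(src, d)), None)
            if dst is None:
                still_pending.append(src)
                continue
            free.remove(dst)
            if src.host != dst.host:
                migrations.append((src, dst, src.rack != dst.rack))
        pending = still_pending

    take(lambda s, d: s.host == d.host)  # already co-located
    take(lambda s, d: s.rack == d.rack)  # migration stays within the rack
    take(lambda s, d: True)              # last resort: across racks
    return migrations


# The replica sets from the example above, with made-up host names.
t0 = [Replica("h1", "rack1"), Replica("h3", "rack3"), Replica("h2", "rack2")]
t1 = [Replica("h5", "rack2"), Replica("h4", "rack1"), Replica("h6", "rack3")]
for src, dst, across_rack in plan_colocation(t0, t1):
    print(src.host, "->", dst.host, "across racks:", across_rack)
```

With RF=#racks every T1 replica finds a same-rack partner, so no migration
crosses racks. The last pass only fires when the rack sets differ.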

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>

Closes scylladb/scylladb#23247
2025-03-16 22:45:00 +02:00
Pavel Emelyanov
95809a3ed1 Update seastar submodule
* seastar 5b95d1d7...412d058c (62):
  > fstream: Export functions for making file_data_source
  > build: Include DPDK dependency libraries in Seastar linkage
  > demos/tls_echo_server_demo: Modernize with seastar::async
  > http/client: Pass abort source by pointer
  > rpc: remove deprecated logging function support
  > github: Add Alpine Linux workflow to test builds with musl libc
  > exception_hacks: Make dl_iterate_phdr resolution manual
  > tests: relax test_file_system_space check for empty filesystems
  > demos/udp_server_demo:  Modernize with seastar::async and proper teardown
  > future: remove deprecated functions/concepts
  > util: logger: remove deprecated set_stdout_enabled and logger_ostream_type::{stdout,stderr}
  > memory: guard __GLIBC_PREREQ usage with __GLIBC__ check
  > scheduling_specific: Add noexcept wrapper for free()
  > file: Replace __gid_t with standard POSIX gid_t
  > aio_storage_context: Use reactor::do_at_exit()
  > json2code: support chunked_fifo
  > json: remove unused headers
  > httpd: test cases for streaming
  > build: use find_dependency() instead find_package() in config file
  > build: stop using a loop for finding dependencies
  > dns: Fix event processing to work safely with recent c-ares
  > tutorial: add a section about initialization and cleanup
  > reactor: deprecate at_exit()
  > httpclient: Add exception handling to connection::close
  > file: document max_length-limits for dma_read/write funcs taking vector<iovec>
  > build: fix P2582R1 detection in GCC compatibility check
  > json2code: optimize string handling using std::string_view
  > tests/unit: fix typo in test output
  > doc: Update documentation after removing build.sh
  > test: Add direct exception passing for awaits for perf test
  > github:  add Docker build verification workflow
  > docker: update LLVM debian repo for Ubuntu Orcular migration
  > tests/unit: Use http.HTTPStatus constants instead of raw status codes
  > tests/unit: Fix exception verification in json2code_test.py
  > httpd: handle streaming results in more handlers
  > json: stream_object now moves value
  > json: support for rvalue ranges
  > chunked_fifo: make copyable
  > reactor: deprecate at_destroy()
  > testing: prevent test scheduling after reactor exit
  > net: Add bytes sent/received metrics
  > net: switch rss_key_type to std::span instead of std::string_view
  > log: fixes for libc++ 19
  > sstring: fixes for lib++ 19
  > build: finalize numactl dependency removal
  > build: link DPDK against libnuma when detected during build
  > memory: remove libnuma dependency
  > treewide: replace assert with SEASTAR_ASSERT
  > future: fix typo in comment
  > http: Unwrap nested exceptions to handle retryable transport errors
  > net/ip, net: sed -i 's/to_ulong/to_uint/'
  > core: function_traits noexcept specializations
  > util/variant: seastar::visit forward value arg
  > net/tls: fix missing include
  > tls: Add a way to inspect peer certificate chain
  > websocket: Extract encode_base64() function
  > websocket: Rename wlogger to websocket_logger
  > websocket: Extract parts of server_connection usable for client
  > websocket: Rename connection to server_connection
  > websocket: Extract websocket parser to separate file
  > json2code_test: factor out query method
  > seastar-json2code: fix error handling

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23281
2025-03-16 21:57:43 +02:00
Benny Halevy
41f02c521d main: allow abort during join_cluster
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-16 12:21:15 +02:00
Benny Halevy
f269480f53 main: add checkpoint before joining cluster
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-16 12:08:04 +02:00
Benny Halevy
0fc196991a storage_service: add start_sys_dist_ks
Currently, there's a call to
`supervisor::notify("starting system distributed keyspace")`
which is misleading as it is identical to a similar
message in main() when starting the sharded service.

Change that to a storage_service log message
and be more specific that the sys_dist_ks shards are started.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-16 12:05:23 +02:00
Jenkins Promoter
d84da3dc11 Update pgo profiles - x86_64 2025-03-15 04:57:28 +02:00
Jenkins Promoter
6e8e2ae333 Update pgo profiles - aarch64 2025-03-15 04:48:49 +02:00
Pavel Emelyanov
604fdd86e9 test: Count mutation fragments verbosely in scoped restore test
Sometimes after scoped restore a key is not found in nodes' mutation
fragments. This patch makes the counting more verbose to get better
understanding of what's going on in case of test failure

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23296
2025-03-14 21:31:36 +02:00
Pavel Emelyanov
bfbe802632 streaming: Relax load_sstable_for_tablet()
The method does several excessive things that can be relaxed:

1. In order to transfer a table-id to another shard, finds the table on
   source shard, gets schema and captures schema id on invoke_on()'s
   lambda. It can just capture the original table-id

2. In order to get sstable parameters (format, version, etc.) generates
   toc_filename(), then calls parse_path() to convert it into the
   entry_descriptor. The descriptor can be read from sstable directly.

3. Logging "success" includes target shard into the message, but happens
   on the source shard. The message can be just logged on target shard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23197
2025-03-14 15:26:48 +02:00
Botond Dénes
39bcf99f8e Merge 'Apply hard limit to partition range vectors in secondary index queries' from Nikos Dragazis
Secondary index queries fetch partition keys from the index view and store them in an `std::vector`. The vector size is currently limited by the user's page size and the page memory limit (1MiB). These are not enough to prevent large contiguous allocations (which can lead to stalls).

This series introduces a hard limit to the vector size to ensure it does not exceed the allocator's preferred max contiguous allocation size (128KiB). With the size of each element being 120 bytes, this allows for 1092 partition keys. The limit was set to 1000. Any partitions above this limit are discarded.
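
The arithmetic behind the limit can be checked with a few lines of Python.
The constant names are made up for this sketch; the numbers come from the
paragraph above.

```python
# 128 KiB preferred max contiguous allocation and 120 bytes per element,
# both taken from the description above. The names are illustrative only.
PREFERRED_MAX_CONTIGUOUS_ALLOCATION = 128 * 1024
ELEMENT_SIZE = 120
CHOSEN_LIMIT = 1000

max_elements = PREFERRED_MAX_CONTIGUOUS_ALLOCATION // ELEMENT_SIZE
print(max_elements)                 # 1092
print(CHOSEN_LIMIT * ELEMENT_SIZE)  # 120000 bytes, below the 131072-byte cap
```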

Discarding partitions breaks the querier cache on the replicas, causing a performance regression, as can be seen from the following measurements:
```
* Cluster: 3 nodes (local Docker containers), 1 vCPU, 4GB memory, dev mode
* Schema:
  CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false};
  CREATE TABLE ks.t1 (pk1 int, pk2 int, ck int, value int, PRIMARY KEY ((pk1, pk2), ck));
  CREATE INDEX t1_pk2_idx ON ks.t1(pk2);
* Query: CONSISTENCY LOCAL_QUORUM; SELECT * FROM ks.t1 where pk2 = 1;

+------------+-------------------+-------------------+
|  Page Size |      Master       |   Vector Limit    |
+============+===================+===================+
|            |   Latency (sec)   |   Latency (sec)   |
+------------+-------------------+-------------------+
|     100    |  5.80 ± 0.13      |  5.64 ± 0.10      |
+------------+-------------------+-------------------+
|    1000    |  4.77 ± 0.07      |  4.62 ± 0.06      |
+------------+-------------------+-------------------+
|    2000    |  4.67 ± 0.07      |  5.13 ± 0.03      |
+------------+-------------------+-------------------+
|    5000    |  4.82 ± 0.09      |  6.25 ± 0.06      |
+------------+-------------------+-------------------+
|   10000    |  4.89 ± 0.36      |  7.52 ± 0.13      |
+------------+-------------------+-------------------+
|     -1     |  4.90 ± 0.67      |  4.79 ± 0.33      |
+------------+-------------------+-------------------+
```
We expect this to be fixed with adaptive paging in a future PR. Until then, users can avoid regressions by adjusting their page size.

Additionally, this series changes the `untyped_result_set` to store rows in a `chunked_vector` instead of an `std::vector`, similarly to the `result_set`. Secondary index queries use an `untyped_result_set` to store the raw result from the index view before processing. With 1MiB results, the `std::vector` would cause a large allocation of this magnitude.

Finally, a unit test is added to reproduce the bug.

Fixes #18536.

The PR fixes stalls of up to 100ms, but there is an easy workaround: adjust the page size. No need to backport.

Closes scylladb/scylladb#22682

* github.com:scylladb/scylladb:
  cql3: secondary index: Limit page size for single-row partitions
  cql3: secondary index: Limit the size of partition range vectors
  cql3: untyped_result_set: Store rows in chunked_vector
  test: Reproduce bug with large allocations from secondary index
2025-03-14 15:06:07 +02:00
Botond Dénes
83ea1877ab Merge 'scylla-sstable: add native S3 support' from Ernest Zaslavsky
scylla-sstable: Enable support for S3-stored sstables

Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532)

This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst
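
As an illustration of the location format only, here is a minimal Python
sketch that splits such a name into its parts. `split_s3_fqn` is a
hypothetical helper written for this example; it is not the C++ code added
by the series and may not match its validation rules.

```python
from urllib.parse import urlparse


def split_s3_fqn(fqn):
    """Split 's3://bucket/prefix/sstable_name' into (bucket, prefix, name)."""
    parsed = urlparse(fqn)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 location: {fqn!r}")
    # Everything after the bucket is the object key; its last component is
    # the sstable name and the rest (possibly empty) is the prefix.
    prefix, _, name = parsed.path.lstrip("/").rpartition("/")
    if not name:
        raise ValueError(f"missing sstable name: {fqn!r}")
    return parsed.netloc, prefix, name


print(split_s3_fqn("s3://bucket/prefix/sstable_name"))
```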

ref: https://github.com/scylladb/scylladb/issues/20532
fixes: https://github.com/scylladb/scylladb/issues/20535

No backport needed since the S3 functionality was never released

Closes scylladb/scylladb#22321

* github.com:scylladb/scylladb:
  tests: Add Tests for Scylla-SSTable S3 Functionality
  docs: Update Scylla Tools Documentation for S3 SSTable Support
  scylla-sstable: Enable Support for S3 SSTables
  s3: Implement S3 Fully Qualified Name Manipulation Functions
  object_storage: Refactor `object_storage.yaml` parsing logic
2025-03-14 15:05:52 +02:00
Patryk Jędrzejczak
ca5c223505 test: mark tests with the gossip-based recovery procedure
This patch makes it clear which Raft recovery procedure is used in
each test.

Tests with "This test uses the gossip-based recovery procedure." are
the tests that use the gossip-based topology. These tests should be
deleted once we make the Raft-based topology mandatory.

Tests with the new FIXME are the tests that use the Raft-based
topology. They should be changed to use the Raft-based recovery
procedure or removed if they don't test anything important with
the new procedure.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
4fd0e93154 test: add tests for the Raft-based recovery procedure 2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
4e055882c1 test: topology: util: fix the tokens consistency check for left nodes
When we remove a node in the Raft-based topology
(by remove/replace/decommission), we remove its
tokens from `system.topology`, but we do not
change `num_tokens`. Hence, the old check could
fail for left nodes.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
d0efc77d20 test: topology: util: extend start_writes
We extend `start_writes` to allow:
- providing `ks_name` from the test,
- restarting it (by starting it again with the same `ks_name`),
- running it in the presence of shutdowns.

We use these features in a new test in one of the following patches.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
9970c1fcc3 gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
This patch ensures that members of the new group 0 can gossip with
members of the old group 0 during rolling restart in the Raft-based
recovery procedure. Without this change, restarted nodes (members of
the new group 0) wouldn't be marked as UP by other nodes (members of
the old group 0), which would decrease availability.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
3b9765dac8 raft_group0: modify_raft_voter_status: do not add new members
In the new Raft-based recovery procedure, we create a new group 0.
Dead nodes are not members of this group 0. Also, the removenode
handler makes a node being removed a non-voter. So, with the previous
implementation of `modify_raft_voter_status`, the node being removed
would become a non-voting member of the new group 0, which is very
weird. It should not cause problems, but we'd better avoid it and
keep the procedure clean.

This change also makes `modify_raft_voter_status` more intuitive in
general.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
fd51d7e448 treewide: allow recreating group 0 in the Raft-based recovery procedure
This patch adds support for recreating group 0 after losing
majority. This is the only part of the new Raft-based recovery
procedure that touches Scylla core.

The following steps are necessary to recreate group 0:
1. Determine the new group 0 members. These are alive nodes that
are normal or rebuilding.
2. Choose the recovery leader - the node which will become the
new group 0 leader. This must be one of the nodes with the
latest persistent group 0 state.
3. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
4. Set the new scylla.yaml parameter - `recovery_leader` - to Host
ID of the recovery leader on each live node.
5. Rolling restart all live nodes, but the recovery leader must be
restarted first.
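
Steps 1 and 2 amount to a filter followed by a maximum. A minimal Python
sketch, assuming each node's persistent group 0 state can be reduced to a
comparable number: the `Node` class and its `group0_version` field are
hypothetical and only stand in for "how recent the persistent group 0 state
is".

```python
from dataclasses import dataclass


@dataclass
class Node:
    host_id: str
    alive: bool
    state: str           # e.g. "normal", "rebuilding", "left"
    group0_version: int  # hypothetical stand-in for the group 0 state age


def new_group0_members(nodes):
    # Step 1: alive nodes that are normal or rebuilding.
    return [n for n in nodes if n.alive and n.state in ("normal", "rebuilding")]


def choose_recovery_leader(nodes):
    # Step 2: one of the members with the latest persistent group 0 state.
    members = new_group0_members(nodes)
    if not members:
        raise ValueError("no live node can join the new group 0")
    return max(members, key=lambda n: n.group0_version)


nodes = [
    Node("a", alive=True, state="normal", group0_version=41),
    Node("b", alive=True, state="rebuilding", group0_version=42),
    Node("c", alive=False, state="normal", group0_version=43),
]
print(choose_recovery_leader(nodes).host_id)  # b
```

The returned Host ID is what step 4 would put into `recovery_leader` on
each live node.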

In the implementation, restarts in step 5 are very similar to normal
restarts with the Raft-based topology enabled. The only differences
are:
1. Steps 3-4 make the restarting node discover the new group 0
in `join_cluster`.
2. The group 0 server is started in `join_group0`, not
`setup_group0_if_exists`.
3. The restarting node joins the new group 0 in `join_topology` using
`legacy_handshaker`. There is no reason to contact the topology
coordinator since the node has already joined the topology.

Unfortunately, this patch creates another execution path for the
starting logic. `join_cluster` becomes even messier. However, there
is nothing we can do about it. Joining group 0 without joining
topology is something completely new. Having a few small changes
without touching other execution paths is the best we can do.
We will start removing the old stuff soon, after making the
Raft-based topology mandatory, and the situation will improve.
2025-03-14 13:52:57 +01:00
Nadav Har'El
de7c1d526a test/cqlpy: test DESC doesn't list an index as a view
Issue #6058 complained that "DESCRIBE TABLE" or "DESCRIBE KEYSPACE" list
a secondary index as a materialized view (the view used to back the index
in Scylla's implementation of secondary indexes). This patch adds a test
to verify that this issue no longer exists in server-side describe - so we
can mark the issue as fixed.

While preparing this test, I noticed that Scylla and Cassandra behave
differently on whether DESC TABLE should list materialized views or not,
so this patch includes a test for that as well - and I opened
issue #23014 on Scylla and CASSANDRA-20365 on Cassandra to further
discuss that new issue.

Fixes #6058
Refs #23014.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23015
2025-03-14 14:40:19 +03:00
Nadav Har'El
c0821842de alternator: document the state of tablet support in Alternator
In commit c24bc3b we decided that creating a new table in Alternator
will by default use vnodes - not tablets - because of all the missing
features in our tablets implementation that are important for
Alternator, namely - LWT, CDC and Alternator TTL.

We never documented this, or the fact that we support a tag
`experimental:initial_tablets` which allows overriding this decision
and creating an Alternator table using tablets. We also never documented
what exactly doesn't work when Alternator uses tablets.

This patch adds the missing documentation in docs/alternator/new-apis.md
(which is a good place for describing the `experimental:initial_tablets`
tag). The patch also adds a new test file, test_tablets.py, which
includes tests for all the statements made in the document regarding
how `experimental:initial_tablets` works and what works or doesn't
work when tablets are enabled.

Two existing tests - for TTL and Streams non-support with tablets -
are moved to the new test file.

When the tablets feature is finally completed, both the document
and the tests will need to be modified (some of the tests should be
outright deleted). But it seems this will not happen for at least
several months, and that is too long to wait without accurate
documentation.

Fixes #21629

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22462
2025-03-14 14:03:15 +03:00
Pavel Emelyanov
2bb455ec75 Merge 'Main: stop system_keyspace' from Benny Halevy
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.

This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.

* Enhancement, no backport needed

Closes scylladb/scylladb#23113

* github.com:scylladb/scylladb:
  main: stop system keyspace
  system_keyspace: call shutdown from stop
  system_keyspace: shutdown: allow calling more than once
  database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
  utils: add class pluggable
2025-03-14 13:23:28 +03:00
Aleksandra Martyniuk
444c7eab90 repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
When small_table_optimization isn't enabled, put_row_diff_with_rpc_stream
does not access erm. Pass small_table_optimization_params containing erm
only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.
2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk
e56bb5b6e2 repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
When small_table_optimization isn't enabled, flush_rows_in_working_row_buf
does not access erm. Add small_table_optimization_params containing erm and
pass it only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.
2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk
09c74aa294 repair: pass session_id to repair_writer_impl::create_writer 2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk
47bb9dcf78 repair: keep materialized topology guard in shard_repair_task_impl
Keep materialized topology guard in shard_repair_task_impl and check
it in check_in_abort_or_shutdown and before each range repair.
2025-03-14 10:41:10 +01:00
Aleksandra Martyniuk
928f92c780 repair: pass session_id to repair_meta
Pass session_id of tablet repair down the stack from the repair request
to repair_meta.

The session_id will be utilized in the following patches.
2025-03-14 10:20:12 +01:00
Nadav Har'El
a72dde2ee6 test/cqlpy: add test for long table names
Scylla inherited a 48-character limit on the length of table (and
keyspace) names from Cassandra 3. It turns out that Cassandra 4 and
5 unintentionally dropped this limit (see history lesson in
CASSANDRA-20425), and now Cassandra accepts longer table names.
Some Cassandra users are using such longer names and are disappointed
that Scylla doesn't allow them.

This patch includes tests for this feature. One test tries a
48-character table name - it passes on Scylla and all versions
of Cassandra. A second test tries a 100-character table name - this
one passes on Cassandra version 4 and above (but not on 3), and
fails on Scylla so marked "xfail". A third test tries a 500-character
table name. This one fails badly on Cassandra (see CASSANDRA-20389),
but passes on Scylla today. This test is important because we need to
be sure that it continues to pass on Scylla even after Scylla is
fixed to allow the 100-character test.

Refs #4480 - an issue we already have about supporting longer names

Note on the test implementation:
Ideally, the test for a particular table-name length shouldn't just
create the table - it should also make sure we can write data to it
and flush it, i.e., that sstables can get written correctly. But in
practice, these complications are not needed, because in modern Scylla
it is the directory name which contains the table's name, and the
individual sstable files do not contain the table's name. Just creating
the table already creates the long directory name, so that is the part
that needs to be tested. If we created this directory successfully,
later creating the short-named sstables inside it can't fail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23229
2025-03-14 11:15:07 +03:00
Kefu Chai
a82cfbecad test: perf_sstable: close frag_stream before destroying it
The underlying reader should be closed before being destroyed. Otherwise,
we'd have the following failure when testing the "full_scan_streaming":

```
$ scylla perf-sstable --parallelism 1 --iterations 20 --partitions 20 --testdir /tmp/sstable --mode full_scan_streaming

ERROR 2025-03-13 15:04:26,321 [shard  0:main] mutation_reader - N8sstables2mx27mx_sstable_full_scan_readerE [0x60015a36b650]: permit *.*:test: was not closed before destruction, at: 0x235931e 0x2359470 0x239deb3 0x62a1ed3 0x89fd156 0x89c3fba 0x22a6ed3 0x22a8fea 0x22aae17 0x22a9928 0x26bb7d0 0x26bbe3e 0x89bca67 0x246bd8d /lib64/libc.so.6+0x3247 /lib64/libc.so.6+0x330a 0x1657774
------
   seastar::internal::coroutine_traits_base<double>::promise_type
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23270
2025-03-14 11:12:44 +03:00
Piotr Smaron
d365d9b2ad test/ldap: assign non-busy ports to ldap
It may happen that the ports we randomly choose for LDAP are busy, which
would fail the test suite. So now, after randomly selecting ports, we
check whether they're busy, and if they are, we select the next ones
until we finally have some free ports for LDAP.
Tested with: `./test.py ldap/ldap_connection_test --repeat 1000 -j 10`:
before the fix, this command fails after ~112 runs, and of course it
passes with the fix.
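
A minimal Python sketch of the "pick randomly, probe, move on to the next
one" idea follows. It is not the code from this patch: `is_port_free` and
`pick_free_port` are hypothetical names, and the probe only checks a TCP
bind on the loopback address.

```python
import random
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def pick_free_port(low=1024, high=65535, attempts=1000):
    """Start from a random port and walk forward until a free one is found."""
    port = random.randint(low, high)
    for _ in range(attempts):
        if is_port_free(port):
            return port
        port = port + 1 if port < high else low  # wrap around at the top
    raise RuntimeError("no free port found")


print(pick_free_port())
```

Note that a probe like this is inherently racy: another process can grab
the port between the check and the moment the server binds it, so it
lowers the failure rate rather than eliminating it.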

Fixes: scylladb/scylla-enterprise#5120
Fixes: scylladb/scylladb#23149
Fixes: scylladb/scylladb#23242

Closes scylladb/scylladb#23275
2025-03-14 11:09:19 +03:00
Botond Dénes
68b2ac541c Merge 'streaming: fix the way a reason of streaming failure is determined' from Aleksandra Martyniuk
During streaming, the receiving node gets and processes mutation fragments.
If this operation fails, the receiver responds with a -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and checks whether the table exists.
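
The difference between the two approaches can be illustrated with a small
Python analogy. All names here are stand-ins invented for the example (the
real code is C++), and the table check is reduced to a callback, whereas
table_sync_and_check also synchronizes the schema first.

```python
class NoSuchColumnFamily(Exception):
    """Stand-in for data_dictionary::no_such_column_family."""


class NestedException(Exception):
    """Stand-in for seastar::nested_exception: an error raised while
    cleaning up after another error."""

    def __init__(self, inner, outer):
        super().__init__(f"{inner} (while cleaning up after {outer})")
        self.inner = inner
        self.outer = outer


def classify_by_type(exc):
    # Old approach: only an exact type match means "table was dropped".
    return "skip_table" if isinstance(exc, NoSuchColumnFamily) else "error"


def classify_by_table_check(table_exists):
    # New approach: ignore the exception type and ask whether the table
    # still exists.
    return "error" if table_exists() else "skip_table"


nested = NestedException(NoSuchColumnFamily("cf"), NoSuchColumnFamily("cf"))
print(classify_by_type(nested))                # error (misclassified)
print(classify_by_table_check(lambda: False))  # skip_table
```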

Fixes: https://github.com/scylladb/scylladb/issues/22834.

Needs backport to all live versions, as they all contain the bug.

Closes scylladb/scylladb#22868

* github.com:scylladb/scylladb:
  streaming: fix the way a reason of streaming failure is determined
  streaming: save a continuation lambda
  streaming: use streaming namespace in table_check.{cc,hh}
  repair: streaming: move table_check.{cc,hh} to streaming
2025-03-14 07:25:00 +02:00
Kefu Chai
31320399e8 test: sstable_test: use auto instead of statistics to avoid name collision
Replace explicit `statistics` type with `auto` in sstable_test to
resolve name collision. This addresses ambiguity introduced by commit
87c221cb which added `struct statistics` in
`seastar/include/seastar/net/api.hh`, conflicting with the existing
definition in `scylladb/sstables/types.hh` when the `seastar` namespace
is opened.

The `auto` keyword avoids the need to explicitly reference either type,
cleanly resolving the collision while maintaining functionality.
This change prepares for the upcoming change to bump up seastar
submodule.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23249
2025-03-13 22:51:21 +02:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This series starts the conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series, one of which is dropping code
related to old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new nodes can co-exist in the same
cluster.
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Kefu Chai
5eba29e376 ent/encryption: correct misspellings
these misspellings were flagged by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23254
2025-03-13 13:07:34 +02:00
Kefu Chai
9f411f9962 tools/scylla-nodetool: refactor to use std::tie() for cleaner code
Replace explicit pair member access with std::tie() throughout
scylla-nodetool. This simplifies the code by eliminating repetitive
pair.first/pair.second references and makes the codebase more
maintainable and readable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23250
2025-03-13 11:56:07 +02:00
Dawid Mędrek
0a6137218a db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To make up for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've replayed before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811
2025-03-13 11:55:15 +02:00
Paweł Zakrzewski
d483051e44 cql3/select_statement: reject aggregate functions when PER PARTITION LIMIT is present
Before this patch we silently allowed and ignored PER PARTITION LIMIT.
While using aggregate functions in conjunction with PER PARTITION LIMIT
can make sense, we want to disable it until we can offer proper
implementation, see #9879 for discussion.

We want to match Cassandra, and for queries with aggregate functions it
behaves as follows:
- it silently ignores PER PARTITION LIMIT if GROUP BY is present, which
  matches our previous implementation.
- rejects PER PARTITION LIMIT when GROUP BY is *not* present.
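
The two cases above can be written as a small decision function. This is a
sketch of the rule as described, assuming the query already carries PER
PARTITION LIMIT; the function name and return strings are hypothetical and do
not come from the cql3 code.

```python
def per_partition_limit_handling(has_aggregate: bool, has_group_by: bool) -> str:
    """How a query that specifies PER PARTITION LIMIT is treated."""
    if not has_aggregate:
        return "apply"   # ordinary query: the limit is honoured
    if has_group_by:
        return "ignore"  # aggregate + GROUP BY: limit silently ignored
    return "reject"      # aggregate without GROUP BY: query is rejected
```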

This patch adds rejection of the second group.

Fixes #9879

Closes scylladb/scylladb#23086
2025-03-13 10:29:53 +02:00
Pavel Emelyanov
f50bcbf4d0 test/perf/s3: Don't forget to stop sharded<tester> on error
In case invoke_on_all(tester::start) throws, the sharded<tester>
instance remains non-stopped and calltrace is reported on test stop. Not
nice, fix it so that sharded<> thing is stopped in any case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23244
2025-03-13 09:54:09 +02:00
Anna Stuchlik
562b5db5b8 doc: Remove "experimental" from ALTER KEYSPACE with Tablets
Altering a keyspace with tablets is no longer experimental.
This commit removes the "Experimental" label from the feature.

Fixes https://github.com/scylladb/scylladb/issues/23166

Closes scylladb/scylladb#23183
2025-03-12 17:41:36 +02:00
Kefu Chai
68fc067106 perf/perf_sstable: fix the indent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-12 19:00:50 +08:00
Kefu Chai
4f62f79622 perf/perf_sstable: stop using at_exit()
seastar::at_exit() was marked deprecated recently, so let's use
the recommended approach to perform cleanups.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-12 19:00:50 +08:00
Nadav Har'El
3ca2e6ddda Merge 's3_client: Add retries to Security Token Service/EC2 instance metadata credentials providers' from Ernest Zaslavsky
This PR makes several updates and improvements to the retryable HTTP client functionality, and enhances error handling and integration with AWS services. Below is a summary of the changes:

- Moved the retryable HTTP client functionality out of the S3 client to improve modularity and reusability across other services like AWS STS.

- Isolated the retryable_http_client into its own file, improving clarity and maintainability.

- Added a make_request method that introduces a response-skipping handler.

- Introduced a custom error handler constructor, providing greater flexibility in handling errors.

- Updated the STS and Instance Metadata Service credentials providers to utilize the new retryable HTTP client, enhancing their robustness and reliability.

- Extended the AWS error list to handle errors specific to the STS service, ensuring more granular and accurate error management for STS operations.

- Enhanced error handling for system errors returned by Seastar’s HTTP client, ensuring smoother operations.

- Properly closed the HTTP client in instance_profile_credentials_provider and sts_assume_role_credentials_provider to prevent resource leaks.

- Reduced the log severity in the retry strategy to avoid SCT test failures that occur when any log message is tagged as an ERROR.

No backport needed, since no S3-related functionality has been released on the Scylla side yet.

Closes scylladb/scylladb#21933

* github.com:scylladb/scylladb:
  s3_client: Adjust Log Severity in Retry Strategy
  aws_error: Enhance error handling for AWS HTTP client
  aws_error: Add STS specific error handling
  credentials_providers: Close retryable clients in Credentials Providers
  credentials_providers: Integrate retryable_http_client with Credentials Providers
  s3_client: enhance `retryable_http_client` functionality
  s3_client: isolate `retryable_http_client`
  s3_client: Prepare for `retryable_http_client` relocation
  s3_client: Remove `is_redirect_status` function
  s3_client: Move retryable functionality out of s3 client
2025-03-12 10:19:15 +02:00
Avi Kivity
b1d9f80d85 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
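
To illustrate the difference between balancing by tablet count and balancing
by storage utilization, here is a minimal Python sketch. It assumes equal
tablet sizes, as the text does; the helper names and the proportional split
are illustrative assumptions, not the load balancer's algorithm.

```python
def utilization(tablet_count, avg_tablet_size, shard_capacity):
    """Fraction of a shard's storage capacity used by its tablets."""
    return tablet_count * avg_tablet_size / shard_capacity


def capacity_aware_targets(total_tablets, capacities):
    """Split tablets across shards in proportion to capacity.

    Uses largest-remainder rounding so the targets sum to total_tablets.
    """
    total_capacity = sum(capacities)
    exact = [total_tablets * c / total_capacity for c in capacities]
    targets = [int(x) for x in exact]
    leftover = total_tablets - sum(targets)
    by_remainder = sorted(range(len(capacities)),
                          key=lambda i: exact[i] - targets[i], reverse=True)
    for i in by_remainder[:leftover]:
        targets[i] += 1
    return targets
```

With two shards of capacity 100 and 200, an equal split of 30 tablets leaves
the smaller shard twice as full, while `capacity_aware_targets(30, [100, 200])`
gives both the same utilization.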

One consequence is that to be able to balance, the load balancer needs
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
2025-03-11 14:34:27 +02:00
Gleb Natapov
57f2b6d825 gossiper: drop unneeded code
host_id is already available at this point.
2025-03-11 12:09:22 +02:00
Gleb Natapov
cca228265e gossiper: move _expire_time_endpoint_map to host_id
Index _expire_time_endpoint_map map by host id instead of ip
2025-03-11 12:09:22 +02:00
Gleb Natapov
c45b50bbe6 gossiper: move _just_removed_endpoints to host id
Index _just_removed_endpoints map by host id instead of ip
2025-03-11 12:09:22 +02:00
Gleb Natapov
22739bb39a gossiper: drop unused get_msg_addr function 2025-03-11 12:09:22 +02:00
Gleb Natapov
b3720b80b6 messaging_service: change connection dropping notification to pass host id only
Only host id is needed in the callback anyway.
2025-03-11 12:09:22 +02:00
Gleb Natapov
24d30073f9 messaging_service: pass host id to remove_rpc_client in down notification
Do not iterate over all clients indexed by host id to search for those
with a given IP. Look up by host id directly since now we know it in the down
notification. If the host id is not known, look it up by IP.
2025-03-11 12:09:22 +02:00
Gleb Natapov
4ca627b533 treewide: pass host id to endpoint_lifecycle_subscriber 2025-03-11 12:09:22 +02:00
Gleb Natapov
8a747fbc2a treewide: drop endpoint life cycle subscribers that do nothing
Provide default implementation for them instead. Will be easier to rework them later.
2025-03-11 12:09:22 +02:00
Gleb Natapov
525b88f877 load_meter: move to host id
Use host id indexing in load_meter and only convert to ips on api level.
2025-03-11 12:09:22 +02:00
Gleb Natapov
48a1030c91 treewide: use host id directly in endpoint state change subscribers
Now that we have host ids in endpoint state change subscribers some of
them can be simplified by using the id directly instead of looking it up
by ip.
2025-03-11 12:09:22 +02:00
Gleb Natapov
499eb4d17f treewide: pass host id to endpoint state change subscribers 2025-03-11 12:09:22 +02:00
Gleb Natapov
eb59205caf gossiper: drop deprecated unsafe_assassinate_endpoint operation
It was always deprecated.
2025-03-11 12:09:21 +02:00
Gleb Natapov
c17a8b4a76 storage_service: drop unused code in handle_state_removed 2025-03-11 12:09:21 +02:00
Gleb Natapov
696aee3adc treewide: drop endpoint state change subscribers that do nothing
Provide default implementation for them instead. Will be easier to rework them later.
2025-03-11 12:09:21 +02:00
Gleb Natapov
7dcffda6bd gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory 2025-03-11 12:09:21 +02:00
Gleb Natapov
8425c26462 gossiper: start using host ids to send messages earlier
Send digest ack and ack2 by host ids as well now since the id->ip
mapping is available after receiving digest syn. It allows to convert
more code to host id here.
2025-03-11 12:09:21 +02:00
Gleb Natapov
f0af3f261e messaging_service: add temporary address map entry on incoming connection
We want to move to use host ids as soon as possible. Currently it is
possible only after the full gossiper exchange (because only at this
point gossiper state is added and with it address map entry). To make it
possible to move to host ids earlier this patch adds address map entries
on incoming communication during CLIENT_ID verb processing. The patch
also adds generation to CLIENT_ID to use it when address map is updated.
It is done so that older gossiper entries can be overwritten with newer
mapping in case of IP change.
2025-03-11 12:09:21 +02:00
Gleb Natapov
c3035caeb5 topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
Currently sync_raft_topology_nodes() only sends a join notification if a
node is new in the topology, but sometimes a node changes IP and the
join notification should be sent for the new IP as well. Usually it is
done from ip_address_updater, but topology reload can run first and then
the notification will be missed. The solution is to send notification
during topology reload as well.
2025-03-11 12:09:21 +02:00
Gleb Natapov
0e3dcb7954 treewide: move everyone to use host id based gossiper::is_alive and drop ip based one 2025-03-11 12:09:21 +02:00
Gleb Natapov
56c6e04079 storage_proxy: drop unused template
The storage_proxy::is_alive is called with host_id only.
2025-03-11 12:09:21 +02:00
Gleb Natapov
e47f251178 gossiper: move _live_endpoints and _unreachable_endpoints endpoint to host_id
Index live and dead endpoints by host id. It also allows to simplify
some code that does a translation.
2025-03-11 12:09:21 +02:00
Gleb Natapov
6f05608b5e gossiper: chunk vector using std::views::chunk instead of explicitly coding it 2025-03-11 12:09:21 +02:00
Gleb Natapov
0437f558cd idl: generate ip based version of a verb only for verbs that need it
The patch adds a new marker for a verb, [[ip]], which means that an IP-based
version of the verb needs to be generated. Most of the verbs
do not need it.
2025-03-11 12:09:21 +02:00
Gleb Natapov
3734afe8a5 gossiper: send shutdown notification by host id 2025-03-11 12:09:21 +02:00
Gleb Natapov
ee59baf6fc gossiper: drop old shadow round code
It is no longer used. It was replaced with explicit GOSSIP_GET_ENDPOINT_STATES verb in
cd7d64f588 which is in scylla-4.3.0
2025-03-11 12:09:20 +02:00
Gleb Natapov
f1a82c1d01 gossiper: drop unused get_endpoint_states function 2025-03-11 12:09:20 +02:00
Gleb Natapov
c4a0fbae16 gossiper: check id match inside force_remove_endpoint
Before calling force_remove_endpoint (which works on ip) the code checks
that the ip maps to the correct id (so as not to remove a new node that
inherited this ip by mistake). Move the check to the function itself.
2025-03-11 12:09:20 +02:00
Gleb Natapov
52c9217f1b migration_manager: drop unneeded id to ip translation 2025-03-11 12:09:20 +02:00
Gleb Natapov
4420ddaf86 gossiper: move is_gossip_only_member and its users to work on host id 2025-03-11 12:09:20 +02:00
Gleb Natapov
cb2b874942 table: use host id based get_endpoint_state_ptr and skip id->ip translation 2025-03-11 12:09:20 +02:00
Gleb Natapov
2746d391af gossiper: do not ping outdated address
A node may change its IP but some other node in the cluster may still
try to ping it using an old IP because it may receive an outdated gossiper
entry with the old IP. Do not send echo message to the old IP. It will
cause a misleading UP message with the old address to be printed.
2025-03-11 12:09:20 +02:00
Gleb Natapov
aaba55073d storage_service: drop outdated code that checks whether raft topology should be used
After raft_topology_change_enabled() was introduced the code does
nothing useful. The function is responsible for the decision if raft topology
is enabled or not.
2025-03-11 12:09:20 +02:00
Gleb Natapov
6952f62869 gossiper: drop unused field from loaded_endpoint_state 2025-03-11 12:09:20 +02:00
Nikos Dragazis
7a6a4f54a5 cql3: secondary index: Limit page size for single-row partitions
The size of the partition range vector was constrained in the previous
patch. Any rows beyond the vector's capacity are discarded.

In the special case of single-row partitions, we know the size of each
partition, so we can enforce this limit on the query itself via the page
size.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-10 12:18:49 +02:00
Nikos Dragazis
76b31a3acc cql3: secondary index: Limit the size of partition range vectors
The partition range vector is an std::vector, which means it performs
contiguous allocations. Large allocations are known to cause problems
(e.g., reactor stalls).

For paged queries, limit the vector size to 1000. If more partition keys
are available in the query result, discard them. Ideally, we should not
be fetching them at all, but this is not possible without knowing the
size of each partition.

Currently, each vector element is 120 bytes and the standard allocator's
max preferred contiguous allocation is 128KiB. Therefore, the chosen
value of 1000 satisfies the constraint (128 KiB / 120 = 1092 > 1000).
This should be good enough for most cases. Since secondary index queries
involve one base table query per partition key, these queries are slow.
A higher limit would only make them slower and increase the probability
of a timeout. For the same reason, saving a follow-up paged request from
the client would not increase the efficiency much.
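
The arithmetic behind the limit can be checked with a few lines of Python.
The constants are the figures quoted above; the helper names are made up for
this illustration.

```python
ELEMENT_SIZE_BYTES = 120                  # size of one vector element, per the text
MAX_CONTIGUOUS_ALLOC_BYTES = 128 * 1024   # 128 KiB preferred allocation ceiling
MAX_PARTITION_RANGES = 1000               # limit chosen for paged queries


def max_elements(alloc_bytes, element_bytes):
    """Largest element count whose backing array fits in alloc_bytes."""
    return alloc_bytes // element_bytes


def fits_in_one_allocation(count, element_bytes, alloc_bytes):
    """True if `count` elements need no more than alloc_bytes contiguously."""
    return count * element_bytes <= alloc_bytes
```

A vector of 1000 elements needs 120,000 bytes, which leaves some headroom
below the 131,072-byte ceiling.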

For unpaged queries, do not apply any limit. This means they remain
susceptible to stalls, but unpaged queries are considered unoptimized
anyway.

Finally, update the unit test reproducer since the bug is now fixed.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-10 12:18:42 +02:00
Pavel Emelyanov
db70c7bbf7 api: Remove the remaining parse_tables() overload
There's only one caller of it left -- the scrub handler. It can use the
parse_table_infos() one and get table names from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:14:10 +03:00
Pavel Emelyanov
89f3c1a91e database: Sanitize flush_tables_on_all_shards()
Previous patch left this method with few uglinesses

- the vector<table_id> argument is named table_names
- the sstring keyspace argument is unused
- the keyspace argument is captured for no use

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:13:10 +03:00
Pavel Emelyanov
0f9cc956f4 schema_tables: Remove all_table_names()
Now it's unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:12:56 +03:00
Pavel Emelyanov
c2d23d7948 database: Make tables flushing helper use table_info-s, not names
The database::flush_tables_on_all_shards() method accepts a keyspace
name and a vector of table names. Then it converts ks:cf pair for each
of the table name into a table-id and flushes the table with the ID.

All the callers of that method already have or can easily get the vector
of table_id-s, not just names, so make use of this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:11:32 +03:00
Pavel Emelyanov
e94dce1725 api: Make keyspace flush endpoint use parse_table_infos() (and a bit more)
Currently the handler in question calls parse_tables() which returns
empty list of tables if the "cf" parameter is missing, or the table
names if it's present. In the former case the handler will call
flush_keyspace_on_all_shards() that just gets all table names from the
keyspace and flushes them all.

This change makes the handler use parse_table_infos() which is different
-- when the "cf" parameter is missing, it gets all tables from the
keyspace. So the handler no longer needs to call the keyspace flush, it
can always call the "flush the list of tables" helper.

With that change one of the parse_tables() helpers becomes unused, so
remove it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:06:55 +03:00
Pavel Emelyanov
5a897d7368 schema_tables,client_state: Switch to using all_table_infos()
There are few more places left that can use all_table_infos() as a
replacement for all_table_names(), patch them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:05:59 +03:00
Pavel Emelyanov
da05765746 schema_tables: Tune up some methods to benefit from table_infos
There are convert_schema_to_mutations() and calculate_schema_digest()
that collect table names and then use them to find schema and query
mutations from the table.

Both can use the newly introduced all_table_infos() and use the returned
table_id-s to do the same, thus avoiding re-lookups (which are fast
anyway, but still).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:01:50 +03:00
Pavel Emelyanov
d7bfa5a545 schema_tables: Introduce all_table_infos()
This method is like all_table_names(), but returns a vector of
table_info-s which is effectively a pair of string name and uuid id.
To be used later, and the string-returning all_table_names() will be
removed very soon too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 12:59:03 +03:00
Ernest Zaslavsky
c8de7619e5 s3_client: Adjust Log Severity in Retry Strategy
* Reduced log severity in retry_strategy.
* Rationale: SCT fails tests when any message is logged as ERROR.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
8e46929474 aws_error: Enhance error handling for AWS HTTP client
- Seastar's HTTP client is known to throw exceptions for various reasons, including network errors, TLS errors and other transient issues.
- Update error handling to correctly capture and process all exceptions from Seastar's HTTP client.
- Previously, only aws_exception was handled, causing retryable errors to be missed and `should_retry` not invoked.
- Now, all exceptions trigger the appropriate retry logic per the intended strategy.
- Add tests for the S3 proxy to ensure robustness and reliability of these enhancements.
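
The behaviour described in the list above, where every exception reaches the
retry decision rather than only one exception type, might look like this in a
simplified Python sketch. `call_with_retries` and `is_transient` are
hypothetical; only the idea of a `should_retry` hook comes from the commit
message.

```python
def call_with_retries(request, should_retry, max_attempts=3):
    """Call request(); on any exception, ask should_retry whether to try again."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return request()
        except Exception as exc:  # every failure reaches the retry decision
            if attempt >= max_attempts or not should_retry(exc):
                raise


def is_transient(exc):
    """Hypothetical policy: treat network-level failures as retryable."""
    return isinstance(exc, (ConnectionError, TimeoutError))
```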
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
92a12c96a2 aws_error: Add STS specific error handling
Updated the AWS error list to include handling for errors specific to the STS service. This enhancement ensures more comprehensive error management for STS-related operations.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
a371d6cf62 credentials_providers: Close retryable clients in Credentials Providers
Updated `instance_profile_credentials_provider` and `sts_assume_role_credentials_provider` to close the HTTP client appropriately.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
45a6e88954 credentials_providers: Integrate retryable_http_client with Credentials Providers
* Updated STS and Instance Metadata Service credentials providers to utilize retryable_http_client.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
7c49ee4520 s3_client: enhance retryable_http_client functionality
Enhanced `retryable_http_client` by allowing the injection of a custom error handler through its constructor.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
b589a882bb s3_client: isolate retryable_http_client
Relocated `retryable_http_client` into its own dedicated file for improved clarity and maintainability.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
5eff83af95 s3_client: Prepare for retryable_http_client relocation
Expose `map_s3_client_exception` outside the S3 client class to facilitate moving `retryable_http_client` to a separate file.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
2b3abba10a s3_client: Remove is_redirect_status function
Eliminate the `is_redirect_status` function in favor of the equivalent functionality provided by Seastar's HTTP client.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
5b7d4a4136 s3_client: Move retryable functionality out of s3 client
This commit moves the retryable HTTP client functionality out of the S3 client implementation. Since this functionality is also required for other services, such as AWS STS, it has been separated to ensure broader applicability.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
050c3cdbc2 tests: Add Tests for Scylla-SSTable S3 Functionality
Extended existing Scylla Tools tests to cover the new functionality of
reading SSTables from S3. This ensures that the new S3 integration is
thoroughly tested and performs as expected.
2025-03-09 10:17:48 +02:00
Ernest Zaslavsky
112b4c8764 docs: Update Scylla Tools Documentation for S3 SSTable Support
Updated the Scylla Tools documentation to include changes related to
the enhanced support for S3-stored SSTables. This update ensures that
the documentation accurately reflects the latest functionality and
improvements.
2025-03-09 09:50:37 +02:00
Ernest Zaslavsky
17e3c01f4e scylla-sstable: Enable Support for S3 SSTables
Configure the sstable manager to correctly handle storage options based
on the input type (local or S3-stored sstables). This tweak allows for
mixing both storage types within a single call, improving flexibility
and functionality.
2025-03-09 09:50:36 +02:00
Ernest Zaslavsky
88c4fa6569 s3: Implement S3 Fully Qualified Name Manipulation Functions
Added utility functions to handle S3 Fully Qualified Names (FQN). These
functions enable parsing, splitting, and identification of S3 paths,
enhancing our ability to work with S3 object storage more effectively.
2025-03-09 09:50:36 +02:00
Ernest Zaslavsky
38165fd285 object_storage: Refactor object_storage.yaml parsing logic
Refactored the parsing of `object_storage.yaml` out of Scylla's `main`
function. This change is made to facilitate reusability of the parsing
logic in other parts of the codebase.
2025-03-09 09:50:36 +02:00
Vlad Zolotarov
f7e1695068 CQL Tracing: set common query parameters in a single function
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.

Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.

This also allows an easy addition of such parameters to this function in
the future.
2025-03-06 09:30:51 -05:00
Aleksandra Martyniuk
35bc1fe276 streaming: fix the way a reason of streaming failure is determined
During streaming, the receiving node gets and processes mutation fragments.
If this operation fails, the receiver responds with a -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and checks if the table exists.
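
Why matching on the exception type is fragile can be shown with a Python
analogy. The classes below are stand-ins invented for this sketch (they are
not Seastar or Scylla types), and `skip_by_table_check` only models the idea
of asking whether the table still exists.

```python
class NoSuchColumnFamily(Exception):
    """Stand-in for data_dictionary::no_such_column_family."""


class NestedException(Exception):
    """Stand-in for seastar::nested_exception: an error raised during cleanup."""

    def __init__(self, inner, outer):
        super().__init__(f"{inner!r} (while cleaning up after {outer!r})")
        self.inner = inner
        self.outer = outer


def skip_by_exception_type(exc):
    """Old approach: skip the table only if the error has the expected type."""
    return isinstance(exc, NoSuchColumnFamily)


def skip_by_table_check(table_exists):
    """New approach: ignore the error's shape and check the table itself."""
    return not table_exists
```

A nested error wrapping two `NoSuchColumnFamily` instances fails the type
check even though the table is gone, whereas the existence check gives the
same answer regardless of how the failure was wrapped.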

Fixes: https://github.com/scylladb/scylladb/issues/22834.
2025-03-06 15:07:14 +01:00
Aleksandra Martyniuk
44748d624d streaming: save a continuation lambda
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.

Assign a lambda to a variable to prolong the captures lifetime.
2025-03-06 15:07:09 +01:00
Tomasz Grabiec
c4714180cc tablets: Make load balancing capacity-aware
Before this patch the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer needs
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
2025-03-06 13:35:38 +01:00
Tomasz Grabiec
3c0b733943 topology_coordinator: Fix confusing log message
There can be other reasons the plan is empty, tablets may not actually
be balanced. For example, capacity for all the nodes may not be known,
or nodes may be down.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
40414c4985 topology_coordinator: Refresh load stats after adding a new node
Stats are refreshed every minute by default. Load balancing cannot
happen without capacity information for all normal nodes. To avoid the
delay, trigger refresh after adding a new node.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
d6f8810e66 topology_coordinator: Allow capacity stats to be refreshed with some nodes down
With capacity-aware balancing, if we're missing capacity for a normal
node, we won't be able to proceed with tablet drain. Consider the
following scenario:

1. Nodes: A, B
2. refresh stats with A and B
3. Add node C
4. Node B goes down
5. removenode B starts
6. stats refreshing fails because B is down

If we don't have capacity stats for node C, load balancer cannot make
decisions and removenode is blocked indefinitely. A reproducer is
added in this patch.

To alleviate that, we allow capacity stats to be collected for nodes
which are reachable, we just don't update the table size part.

To keep table stats monotonic, we cache previous results per node, so
even if it's unreachable now, we use its last reported sizes. It's
still more accurate than not refreshing stats at all. A node can be
down for a long period, and other replicas can grow in size. It's not
perfect, because the stale node can skew the stats in its direction,
but ignoring it completely has its pitfalls too. Better solution is
left for later.
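
The per-node caching described above could be sketched as follows. This is a
simplified Python model under the assumption that each node reports a single
size figure; `TableSizeCache` is a hypothetical name, not a class from the
topology coordinator.

```python
class TableSizeCache:
    """Remember the last size each node reported, for use while it is down."""

    def __init__(self):
        self._last_reported = {}

    def refresh(self, reports):
        """Merge a refresh round into the cache and return the sizes to use.

        `reports` maps node -> reported size, or None if the node was
        unreachable in this round.
        """
        for node, size in reports.items():
            if size is not None:
                self._last_reported[node] = size
        # Unreachable nodes fall back to their previously cached value.
        return {node: self._last_reported[node]
                for node in reports if node in self._last_reported}
```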
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
af3dce4c8a topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
Use serialized_action for serialization and batching.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
69c49fb1a7 test: boost: tablets_test: Always provide capacity in load_stats
Move shared_load_stats to topology_builder.hh so that topology_builder
can maintain it. It will set capacity for all created nodes. Needed
after load balancer requires capacity to make decisions.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
dfc9101dfd test: perf_load_balancing: Set node capacity
Otherwise, load balancer will not make any plan once it becomes
capacity-aware.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
6169401dbc test: perf_load_balancing: Convert to topology_builder
The test no longer worked because load balancer requires proper schema
in the database now. Convert to topology_builder which builds topology
in the database and creates schema in the database (which needs proper
topology).
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
d01cc16d1e config, disk_space_monitor: Allow overriding capacity via config
Intended for testing, or hot-fixing out-of-space issues in production.

Tablet load balancer uses this information for determining per-shard load
so reducing capacity will cause tablets to be migrated away from the node.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
New RPC is introduced because load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Vlad Zolotarov
ca6bddef35 transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
A default timestamp (not to be confused with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to
set it back then).

This patch fixes this.

Fixes #23173
2025-03-05 20:37:37 -05:00
Aleksandra Martyniuk
faf3aa13db streaming: use streaming namespace in table_check.{cc,hh} 2025-03-05 11:00:03 +01:00
Aleksandra Martyniuk
876cf32e9d repair: streaming: move table_check.{cc,hh} to streaming 2025-03-05 11:00:03 +01:00
Benny Halevy
8ae8275f17 main: stop system keyspace
To prevent internal queries coming
from system_keyspace (like updating compaction history,
for example)

Refs scylladb/scylla-dtest#5581
Refs #22886
Refs #8995

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Benny Halevy
7a624e3df8 system_keyspace: call shutdown from stop
and use that to replace the explicit shutdown when stopped
in cql_test_env.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Benny Halevy
102aec64d5 system_keyspace: shutdown: allow calling more than once
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:22 +02:00
Benny Halevy
fba88bdd62 database, compaction_manager, large_data_handler: use pluggable<system_keyspace>
To allow safe plug and unplug of the system_keyspace.

This patch follows-up on 917fdb9e53
(more specifically - f9b57df471)
Since just keeping a shared_ptr<system_keyspace> doesn't prevent
stopping the system_keyspace shards, while using the `pluggable`
interface allows safe draining of outstanding async calls
on shutdown, before stopping the system_keyspace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:27:23 +02:00
Benny Halevy
13a22cb6fd utils: add class pluggable
A wrapper around a shared service allowing
safe plug and unplug of the service from its user
using a phased-barrier operation permit guarding
the service while in use.

Also add a unit test for this class.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:25:50 +02:00
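The plug/unplug idea described in this commit message might look roughly like the sketch below. It is a simplified, single-threaded illustration: `service_slot`, `permit`, and `outstanding` are hypothetical names, a plain counter stands in for the phased barrier, and none of this is the actual interface of the new class.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Simplified, single-threaded sketch. `service_slot` and `permit` are
// hypothetical names, not the interface of the class added by the commit.
template <typename Service>
class service_slot {
    std::shared_ptr<Service> _service;
    // Number of permits currently alive (stands in for a phased barrier).
    std::shared_ptr<std::size_t> _in_use = std::make_shared<std::size_t>(0);
public:
    // A permit keeps the service alive and is counted for as long as it exists.
    class permit {
        std::shared_ptr<Service> _service;
        std::shared_ptr<std::size_t> _in_use;
    public:
        permit() = default;
        permit(std::shared_ptr<Service> s, std::shared_ptr<std::size_t> c)
            : _service(std::move(s)), _in_use(std::move(c)) { ++*_in_use; }
        permit(permit&&) noexcept = default;
        permit(const permit&) = delete;
        ~permit() { if (_in_use) { --*_in_use; } }
        explicit operator bool() const { return bool(_service); }
        Service& operator*() const { return *_service; }
    };

    void plug(std::shared_ptr<Service> s) { _service = std::move(s); }

    // Returns an empty permit if nothing is plugged.
    permit get_permit() const {
        if (!_service) {
            return permit();
        }
        return permit(_service, _in_use);
    }

    // Stops handing out new permits. Returns true if no permit is outstanding;
    // a real implementation would wait asynchronously for them to drain.
    bool unplug() {
        _service.reset();
        return *_in_use == 0;
    }

    std::size_t outstanding() const { return *_in_use; }
};
```

A user would take a permit before each call into the service and drop it afterwards, so that shutdown can unplug first and stop the service only once no permit remains.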
Nikos Dragazis
03902e5f17 cql3: untyped_result_set: Store rows in chunked_vector
The `untyped_result_set` stores rows in std::vector.

Switch to `chunked_vector` to prevent large allocations and data copies.
One such case is in secondary index queries, where we convert the result
of the internal index view query into an `untyped_result_set` for
processing. The result is bounded by the page size memory limit (1MiB by
default), so it can cause large allocations of this magnitude.

This patch aligns `untyped_result_set` with `result_set`, which also
uses a `chunked_vector`.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-04 18:39:32 +02:00
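To illustrate why a chunked container avoids large contiguous allocations, here is a minimal sketch. `tiny_chunked_vector` is a hypothetical toy, not Scylla's `chunked_vector`; it only shows that growth allocates one fixed-size chunk at a time instead of reallocating and copying a single large buffer.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy illustration only; not the container used by the patch.
template <typename T, std::size_t ChunkItems = 1024>
class tiny_chunked_vector {
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T value) {
        if (_size % ChunkItems == 0) {
            // Start a new fixed-capacity chunk rather than growing one big buffer.
            _chunks.push_back(std::make_unique<std::vector<T>>());
            _chunks.back()->reserve(ChunkItems);
        }
        _chunks.back()->push_back(std::move(value));
        ++_size;
    }
    T& operator[](std::size_t i) { return (*_chunks[i / ChunkItems])[i % ChunkItems]; }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
};
```

The largest single allocation for element storage is one chunk, regardless of how many rows the result set holds; only the small array of chunk pointers grows with the row count.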
Nikos Dragazis
892690b953 test: Reproduce bug with large allocations from secondary index
Secondary index queries which fetch partitions from the base table can
cause large allocations that can lead to reactor stalls.

Reproduce this with a unit test that runs an indexed query on a table
with thousands of single-row partitions, and checks the memory stats for
any large contiguous allocations.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-04 18:39:28 +02:00
1433 changed files with 78602 additions and 22380 deletions

.github/CODEOWNERS

@@ -1,5 +1,5 @@
# AUTH
auth/* @nuivall @ptrsmrn @KrzaQ
auth/* @nuivall @ptrsmrn
# CACHE
row_cache* @tgrabiec
@@ -25,15 +25,15 @@ compaction/* @raphaelsc
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @nuivall @ptrsmrn @KrzaQ
cql3/* @tgrabiec @nuivall @ptrsmrn
# COUNTERS
counters* @nuivall @ptrsmrn @KrzaQ
tests/counter_test* @nuivall @ptrsmrn @KrzaQ
counters* @nuivall @ptrsmrn
tests/counter_test* @nuivall @ptrsmrn
# DOCS
docs/* @annastuchlik @tzach
docs/alternator @annastuchlik @tzach @nyh @nuivall @ptrsmrn @KrzaQ
docs/alternator @annastuchlik @tzach @nyh
# GOSSIP
gms/* @tgrabiec @asias @kbr-scylla
@@ -74,8 +74,8 @@ streaming/* @tgrabiec @asias
service/storage_service.* @tgrabiec @asias
# ALTERNATOR
alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
test/alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
alternator/* @nyh
test/alternator/* @nyh
# HINTED HANDOFF
db/hints/* @piodul @vladzcloudius @eliransin

@@ -1,15 +1,86 @@
This is Scylla's bug tracker, to be used for reporting bugs only.
name: "Report a bug"
description: "File a bug report."
title: "[Bug]: "
type: "bug"
labels: bug
body:
- type: checkboxes
id: terms
attributes:
label: Code of Conduct
description: "This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
our forum at https://forum.scylladb.com/ or in our slack channel https://slack.scylladb.com/ "
options:
- label: I have read the disclaimer above and am reporting a suspected malfunction in Scylla.
required: true
- [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.
*Installation details*
Scylla version (or git commit hash):
Cluster size:
OS (RHEL/CentOS/Ubuntu/AWS AMI):
*Hardware details (for performance issues)* Delete if unneeded
Platform (physical/VM/cloud instance type/docker):
Hardware: sockets= cores= hyperthreading= memory=
Disks: (SSD/HDD, count)
- type: input
id: product-version
attributes:
label: product version
description: Scylla version (or git commit hash)
placeholder: ex. scylla-6.1.1
validations:
required: true
- type: input
id: cluster-size
attributes:
label: Cluster Size
validations:
required: true
- type: input
id: os
attributes:
label: OS
placeholder: RHEL/CentOS/Ubuntu/AWS AMI
validations:
required: true
- type: textarea
id: additional-data
attributes:
label: Additional Environmental Data
#description:
placeholder: Add additional data
value: "Platform (physical/VM/cloud instance type/docker):\n
Hardware: sockets= cores= hyperthreading= memory=\n
Disks: (SSD/HDD, count)"
validations:
required: false
- type: textarea
id: reproducer-steps
attributes:
label: Reproduction Steps
placeholder: Describe how to reproduce the problem
value: "The steps to reproduce the problem are:"
validations:
required: true
- type: textarea
id: the-problem
attributes:
label: What is the problem?
placeholder: Describe the problem you found
value: "The problem is that"
validations:
required: true
- type: textarea
id: what-happened
attributes:
label: Expected behavior?
placeholder: Describe what should have happened
value: "I expected that "
validations:
required: true
- type: textarea
id: logs
attributes:
label: Relevant log output
description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.
render: shell

@@ -112,10 +112,15 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
repo_local.git.push(fork_repo, new_branch_name, force=True)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
# Check if the branch already exists in the remote fork
remote_refs = repo_local.git.ls_remote('--heads', fork_repo, new_branch_name)
if not remote_refs:
# Branch does not exist, create it with a regular push
repo_local.git.push(fork_repo, new_branch_name)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
else:
logging.info(f"Remote branch {new_branch_name} already exists in fork. Skipping push.")
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")

.github/seastar-bad-include.json

@@ -0,0 +1,16 @@
{
"problemMatcher": [
{
"owner": "seastar-bad-include",
"severity": "error",
"pattern": [
{
"regexp": "^(.+):(\\d+):(.+)$",
"file": 1,
"line": 2,
"message": 3
}
]
}
]
}
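The matcher's pattern splits each `file:line:message` line into three capture groups. The sketch below applies the same expression with `std::regex` (ECMAScript grammar) as an approximation of what the GitHub runner does; `matched_line` and `parse_grep_line` are hypothetical names, and the sample path in the usage note is made up.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper types for illustration.
struct matched_line {
    std::string file;
    int line;
    std::string message;
};

// Applies the same pattern as the problem matcher: group 1 is the file,
// group 2 the line number, group 3 the message.
inline std::optional<matched_line> parse_grep_line(const std::string& text) {
    static const std::regex pattern(R"(^(.+):(\d+):(.+)$)");
    std::smatch m;
    if (!std::regex_match(text, m, pattern)) {
        return std::nullopt;
    }
    return matched_line{m[1].str(), std::stoi(m[2].str()), m[3].str()};
}
```

For a line such as `db/config.cc:12:#include "seastar/core/future.hh"`, the groups would be the path, `12`, and the offending include directive.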

@@ -0,0 +1,11 @@
name: Call Jira Status In Progress
on:
pull_request:
types: [opened]
jobs:
call-jira-status-in-progress:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_progress.yml@main
secrets:
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

@@ -0,0 +1,11 @@
name: Call Jira Status In Review
on:
pull_request:
types: [ready_for_review, review_requested]
jobs:
call-jira-status-in-review:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_review.yml@main
secrets:
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

@@ -0,0 +1,13 @@
name: Call Jira Status Ready For Merge
on:
pull_request:
types: [labeled]
jobs:
call-jira-status-update:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_ready_for_merge.yml@main
with:
label_name: 'status/merge_candidate'
secrets:
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

@@ -1,9 +1,16 @@
name: Notify PR Authors of Conflicts
permissions:
issues: write
pull-requests: write
on:
push:
branches:
- 'master'
- 'branch-*'
schedule:
- cron: '0 10 * * 1,4' # Runs every Monday and Thursday at 10:00am
workflow_dispatch: # Manual trigger for testing
- cron: '0 10 * * 1' # Runs every Monday at 10:00am
jobs:
notify_conflict_prs:
@@ -14,32 +21,134 @@ jobs:
uses: actions/github-script@v7
with:
script: |
console.log("Starting conflict reminder script...");
// Print trigger event
if (process.env.GITHUB_EVENT_NAME) {
console.log(`Workflow triggered by: ${process.env.GITHUB_EVENT_NAME}`);
} else {
console.log("Could not determine workflow trigger event.");
}
const isPushEvent = process.env.GITHUB_EVENT_NAME === 'push';
console.log(`isPushEvent: ${isPushEvent}`);
const twoMonthsAgo = new Date();
twoMonthsAgo.setMonth(twoMonthsAgo.getMonth() - 2);
const prs = await github.paginate(github.rest.pulls.list, {
owner: context.repo.owner,
repo: context.repo.repo,
state: 'open',
per_page: 100
});
console.log(`Fetched ${prs.length} open PRs`);
const recentPrs = prs.filter(pr => new Date(pr.created_at) >= twoMonthsAgo);
const validBaseBranches = ['master'];
const branchPrefix = 'branch-';
const threeDaysAgo = new Date();
const conflictLabel = 'conflicts';
threeDaysAgo.setDate(threeDaysAgo.getDate() - 3);
for (const pr of prs) {
if (!pr.base.ref.startsWith(branchPrefix)) continue;
const hasConflictLabel = pr.labels.some(label => label.name === conflictLabel);
if (!hasConflictLabel) continue;
const oneWeekAgo = new Date();
const conflictLabel = 'conflicts';
oneWeekAgo.setDate(oneWeekAgo.getDate() - 7);
console.log(`One week ago: ${oneWeekAgo.toISOString()}`);
for (const pr of recentPrs) {
console.log(`Checking PR #${pr.number} on base branch '${pr.base.ref}'`);
const isBranchX = pr.base.ref.startsWith(branchPrefix);
const isMaster = validBaseBranches.includes(pr.base.ref);
if (!(isBranchX || isMaster)) {
console.log(`PR #${pr.number} skipped: base branch is not 'master' or does not start with '${branchPrefix}'`);
continue;
}
const updatedDate = new Date(pr.updated_at);
if (updatedDate >= threeDaysAgo) continue;
if (pr.assignee === null) continue;
const assignee = pr.assignee.login;
if (assignee) {
await github.rest.issues.createComment({
console.log(`PR #${pr.number} last updated at: ${updatedDate.toISOString()}`);
if (!isPushEvent && updatedDate >= oneWeekAgo) {
console.log(`PR #${pr.number} skipped: updated within last week`);
continue;
}
if (pr.assignee === null) {
console.log(`PR #${pr.number} skipped: no assignee`);
continue;
}
// Fetch PR details to check mergeability
let { data: prDetails } = await github.rest.pulls.get({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: pr.number,
});
console.log(`PR #${pr.number} mergeable: ${prDetails.mergeable}`);
// Wait and re-fetch if mergeable is null
if (prDetails.mergeable === null) {
console.log(`PR #${pr.number} mergeable is null, waiting 2 seconds and retrying...`);
await new Promise(resolve => setTimeout(resolve, 2000)); // wait 2 seconds
prDetails = (await github.rest.pulls.get({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: pr.number,
})).data;
console.log(`PR #${pr.number} mergeable after retry: ${prDetails.mergeable}`);
}
if (prDetails.mergeable === false) {
const hasConflictLabel = pr.labels.some(label => label.name === conflictLabel);
console.log(`PR #${pr.number} has conflict label: ${hasConflictLabel}`);
// Fetch comments to check for existing notifications
const comments = await github.paginate(github.rest.issues.listComments, {
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.number,
body: `@${assignee}, this PR has been open with conflicts. Please resolve the conflicts so we can merge it.`,
per_page: 100,
});
console.log(`Notified @${assignee} for PR #${pr.number}`);
}
// Find last notification comment from the bot
const notificationPrefix = `@${pr.assignee.login}, this PR has merge conflicts with the base branch.`;
const lastNotification = comments
.filter(c =>
c.user.type === "Bot" &&
c.body.startsWith(notificationPrefix)
)
.sort((a, b) => new Date(b.created_at) - new Date(a.created_at))[0];
// Check if we should skip notification based on recent notification
let shouldSkipNotification = false;
if (lastNotification) {
const lastNotified = new Date(lastNotification.created_at);
if (lastNotified >= oneWeekAgo) {
console.log(`PR #${pr.number} skipped: last notification was less than 1 week ago`);
shouldSkipNotification = true;
}
}
// Additional check for push events on draft PRs with conflict labels
if (
isPushEvent &&
pr.draft === true &&
hasConflictLabel &&
shouldSkipNotification
) {
continue;
}
if (!hasConflictLabel) {
await github.rest.issues.addLabels({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.number,
labels: [conflictLabel],
});
console.log(`Added 'conflicts' label to PR #${pr.number}`);
}
const assignee = pr.assignee.login;
if (assignee && !shouldSkipNotification) {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.number,
body: `@${assignee}, this PR has merge conflicts with the base branch. Please resolve the conflicts so we can merge it.`,
});
console.log(`Notified @${assignee} for PR #${pr.number}`);
}
} else {
console.log(`PR #${pr.number} is mergeable, no action needed.`);
}
}
console.log(`Total PRs checked: ${prs.length}`);
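The decision of whether the new version of the script posts a reminder comment can be condensed into a single predicate. The sketch below restates that logic in C++ purely for illustration; `pr_state_sketch` and `should_notify` are hypothetical names, and the label handling and API calls are left out.

```cpp
#include <chrono>
#include <optional>

using clock_type = std::chrono::system_clock;

// Hypothetical summary of the PR fields the script inspects.
struct pr_state_sketch {
    bool has_assignee = false;
    bool has_conflicts = false;  // corresponds to mergeable === false
    clock_type::time_point updated_at{};
    std::optional<clock_type::time_point> last_notified_at;
};

// Returns true when a reminder comment would be posted.
inline bool should_notify(const pr_state_sketch& pr, bool is_push_event,
                          clock_type::time_point now) {
    const auto one_week_ago = now - std::chrono::hours(24 * 7);
    // Scheduled runs skip PRs touched within the last week; push runs do not.
    if (!is_push_event && pr.updated_at >= one_week_ago) {
        return false;
    }
    if (!pr.has_assignee || !pr.has_conflicts) {
        return false;
    }
    // Never repeat a notification within one week of the previous one.
    if (pr.last_notified_at && *pr.last_notified_at >= one_week_ago) {
        return false;
    }
    return true;
}
```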

@@ -11,7 +11,8 @@ env:
CLEANER_OUTPUT_PATH: build/clang-include-cleaner.log
# the "idl" subdirectory does not contain C++ source code. the .hh files in it are
# supposed to be processed by idl-compiler.py, so we don't check them using the cleaner
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops redis replica
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log
permissions: {}
@@ -80,7 +81,24 @@ jobs:
done
- run: |
echo "::remove-matcher owner=clang-include-cleaner::"
- run: |
echo "::add-matcher::.github/seastar-bad-include.json"
- name: check for seastar includes
run: |
git -c safe.directory="$PWD" \
grep -nE '#include +"seastar/' \
| tee "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH"
- run: |
echo "::remove-matcher owner=seastar-bad-include::"
- uses: actions/upload-artifact@v4
with:
name: Logs (clang-include-cleaner)
path: "./${{ env.CLEANER_OUTPUT_PATH }}"
name: Logs
path: |
${{ env.CLEANER_OUTPUT_PATH }}
${{ env.SEASTAR_BAD_INCLUDE_OUTPUT_PATH }}
- name: fail if seastar headers are included as an internal library
run: |
if [ -s "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH" ]; then
echo "::error::Found #include \"seastar/ in the source code. Use angle brackets instead."
exit 1
fi

@@ -16,6 +16,13 @@ jobs:
pull-requests: write
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 1
- name: Mark pull request as ready for review
run: gh pr ready "${{ github.event.pull_request.number }}"
env:

@@ -13,6 +13,8 @@ jobs:
issues: write
pull-requests: write
steps:
- name: Wait for label to be added
run: sleep 1m
- uses: mheap/github-action-required-labels@v5
with:
mode: minimum

@@ -2,7 +2,7 @@ name: Urgent Issue Reminder
on:
schedule:
- cron: '10 8 * * 1' # Runs every Monday at 8 AM
- cron: '10 8 * * *' # Runs daily at 8 AM
jobs:
reminder:

.gitignore

@@ -35,3 +35,5 @@ compile_commands.json
.envrc
clang_build
.idea/
nuke
rust/target

.gitmodules

@@ -9,9 +9,6 @@
[submodule "abseil"]
path = abseil
url = ../abseil-cpp
[submodule "scylla-tools"]
path = tools/java
url = ../scylla-tools-java
[submodule "scylla-python3"]
path = tools/python3
url = ../scylla-python3

@@ -163,14 +163,6 @@ file(MAKE_DIRECTORY "${scylla_gen_build_dir}")
include(add_version_library)
generate_scylla_version()
add_library(scylla-zstd STATIC
zstd.cc)
target_link_libraries(scylla-zstd
PRIVATE
db
Seastar::seastar
zstd::libzstd)
add_library(scylla-main STATIC)
target_sources(scylla-main
PRIVATE
@@ -179,17 +171,16 @@ target_sources(scylla-main
client_data.cc
clocks-impl.cc
collection_mutation.cc
compress.cc
converting_mutation_partition_applier.cc
counters.cc
direct_failure_detector/failure_detector.cc
sstable_dict_autotrainer.cc
duration.cc
exceptions/exceptions.cc
frozen_schema.cc
generic_server.cc
debug.cc
init.cc
keys.cc
keys/keys.cc
multishard_mutation_query.cc
mutation_query.cc
node_ops/task_manager_module.cc
@@ -204,6 +195,7 @@ target_sources(scylla-main
reader_concurrency_semaphore_group.cc
schema_mutations.cc
serializer.cc
service/direct_failure_detector/failure_detector.cc
sstables_loader.cc
table_helper.cc
tasks/task_handler.cc
@@ -214,7 +206,6 @@ target_sources(scylla-main
vint-serialization.cc)
target_link_libraries(scylla-main
PRIVATE
"$<LINK_LIBRARY:WHOLE_ARCHIVE,scylla-zstd>"
db
absl::headers
absl::btree
@@ -371,3 +362,6 @@ endif()
if(Scylla_BUILD_INSTRUMENTED)
add_subdirectory(pgo)
endif()
add_executable(patchelf
tools/patchelf.cc)

@@ -220,28 +220,9 @@ On a development machine, one might run Scylla as
$ SCYLLA_HOME=$HOME/scylla build/release/scylla --overprovisioned --developer-mode=yes
```
To interact with scylla it is recommended to build our versions of
cqlsh and nodetool. They are available at
https://github.com/scylladb/scylla-tools-java and can be built with
```bash
$ sudo ./install-dependencies.sh
$ ant jar
```
cqlsh should work out of the box, but nodetool depends on a running
scylla-jmx (https://github.com/scylladb/scylla-jmx). It can be build
with
```bash
$ mvn package
```
and must be started with
```bash
$ ./scripts/scylla-jmx
```
To interact with scylla it is recommended to build our version of
cqlsh. It is available at
https://github.com/scylladb/scylla-cqlsh and is available as a submodule.
### Branches and tags

@@ -1,9 +1,6 @@
This project includes code developed by the Apache Software Foundation (http://www.apache.org/),
especially Apache Cassandra.
It includes files from https://github.com/antonblanchard/crc32-vpmsum (author Anton Blanchard <anton@au.ibm.com>, IBM).
These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.
It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2025.2.0-dev
VERSION=2025.4.0-dev
if test -f version
then

@@ -24,7 +24,7 @@ static constexpr uint64_t KB = 1024ULL;
static constexpr uint64_t RCU_BLOCK_SIZE_LENGTH = 4*KB;
static constexpr uint64_t WCU_BLOCK_SIZE_LENGTH = 1*KB;
static bool should_add_capacity(const rjson::value& request) {
bool consumed_capacity_counter::should_add_capacity(const rjson::value& request) {
const rjson::value* return_consumed = rjson::find(request, "ReturnConsumedCapacity");
if (!return_consumed) {
return false;
@@ -62,15 +62,22 @@ static uint64_t calculate_half_units(uint64_t unit_block_size, uint64_t total_by
rcu_consumed_capacity_counter::rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum) :
consumed_capacity_counter(should_add_capacity(request)),_is_quorum(is_quorum) {
}
uint64_t rcu_consumed_capacity_counter::get_half_units(uint64_t total_bytes, bool is_quorum) noexcept {
return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, total_bytes, is_quorum);
}
uint64_t rcu_consumed_capacity_counter::get_half_units() const noexcept {
return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, _total_bytes, _is_quorum);
return get_half_units(_total_bytes, _is_quorum);
}
uint64_t wcu_consumed_capacity_counter::get_half_units() const noexcept {
return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, _total_bytes, true);
}
uint64_t wcu_consumed_capacity_counter::get_units(uint64_t total_bytes) noexcept {
return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, total_bytes, true) * HALF_UNIT_MULTIPLIER;
}
wcu_consumed_capacity_counter::wcu_consumed_capacity_counter(const rjson::value& request) :
consumed_capacity_counter(should_add_capacity(request)) {
}
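The body of `calculate_half_units` is not part of this hunk. The sketch below is one plausible reading, assuming DynamoDB's documented capacity model (bytes rounded up to whole blocks, with a strongly consistent operation costing twice as much as an eventually consistent one). The two block-size constants are copied from the diff; `sketch_half_units` and its rounding rule are assumptions.

```cpp
#include <cstdint>

constexpr uint64_t KB = 1024ULL;
constexpr uint64_t RCU_BLOCK_SIZE_LENGTH = 4 * KB;  // from the diff
constexpr uint64_t WCU_BLOCK_SIZE_LENGTH = 1 * KB;  // from the diff

// Assumed logic, not the real function: round the byte count up to whole
// blocks; a quorum operation costs two half-units per block, a non-quorum
// operation costs one.
constexpr uint64_t sketch_half_units(uint64_t block_size, uint64_t total_bytes, bool is_quorum) {
    const uint64_t blocks = (total_bytes + block_size - 1) / block_size;
    return blocks * (is_quorum ? 2 : 1);
}
```

Counting in half-units keeps the arithmetic in integers even though an eventually consistent read of one block costs half a capacity unit.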

@@ -42,21 +42,25 @@ public:
*/
virtual uint64_t get_half_units() const noexcept = 0;
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_reponse = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {
virtual uint64_t get_half_units() const noexcept;
bool _is_quorum = false;
public:
rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum);
rcu_consumed_capacity_counter(): consumed_capacity_counter(false), _is_quorum(false){}
virtual uint64_t get_half_units() const noexcept;
static uint64_t get_half_units(uint64_t total_bytes, bool is_quorum) noexcept;
};
class wcu_consumed_capacity_counter : public consumed_capacity_counter {
virtual uint64_t get_half_units() const noexcept;
public:
wcu_consumed_capacity_counter(const rjson::value& request);
static uint64_t get_units(uint64_t total_bytes) noexcept;
};
}

@@ -167,4 +167,8 @@ future<> controller::request_stop_server() {
});
}
future<utils::chunked_vector<client_data>> controller::get_client_data() {
return _server.local().get_client_data();
}
}

@@ -90,6 +90,10 @@ public:
virtual future<> start_server() override;
virtual future<> stop_server() override;
virtual future<> request_stop_server() override;
// This virtual function is called (on each shard separately) when the
// virtual table "system.clients" is read. It is expected to generate a
// list of clients connected to this server (on this shard).
virtual future<utils::chunked_vector<client_data>> get_client_data() override;
};
}

File diff suppressed because it is too large

@@ -10,8 +10,8 @@
#include <seastar/core/future.hh>
#include "seastarx.hh"
#include <seastar/json/json_elements.hh>
#include <seastar/core/sharded.hh>
#include <seastar/util/noncopyable_function.hh>
#include "service/migration_manager.hh"
#include "service/client_state.hh"
@@ -58,29 +58,6 @@ namespace alternator {
class rmw_operation;
struct make_jsonable : public json::jsonable {
rjson::value _value;
public:
explicit make_jsonable(rjson::value&& value);
std::string to_json() const override;
};
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
json::json_return_type make_streamed(rjson::value&&);
struct json_string : public json::jsonable {
std::string _value;
public:
explicit json_string(std::string&& value);
std::string to_json() const override;
};
namespace parsed {
class path;
};
@@ -169,8 +146,23 @@ class executor : public peering_sharded_service<executor> {
public:
using client_state = service::client_state;
using request_return_type = std::variant<json::json_return_type, api_error>;
// request_return_type is the return type of the executor methods, which
// can be one of:
// 1. A string, which is the response body for the request.
// 2. A body_writer, an asynchronous function (returning future<>) that
// takes an output_stream and writes the response body into it.
// 3. An api_error, which is an error response that should be returned to
// the client.
// The body_writer is used for streaming responses, where the response body
// is written in chunks to the output_stream. This allows for efficient
// handling of large responses without needing to allocate a large buffer
// in memory.
using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;
using request_return_type = std::variant<std::string, body_writer, api_error>;
stats _stats;
// The metric_groups object holds this stat object's metrics registered
// as long as the stats object is alive.
seastar::metrics::metric_groups _metrics;
static constexpr auto ATTRS_COLUMN_NAME = ":attrs";
static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";
@@ -220,6 +212,7 @@ public:
private:
static thread_local utils::updateable_value<uint32_t> s_default_timeout_in_ms;
public:
static schema_ptr find_table(service::storage_proxy&, std::string_view table_name);
static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);
private:
@@ -241,7 +234,8 @@ public:
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
@@ -250,7 +244,7 @@ public:
uint64_t* item_length_in_bytes = nullptr,
bool = false);
static void add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp);
};
@@ -271,4 +265,13 @@ bool is_big(const rjson::value& val, int big_size = 100'000);
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
executor::body_writer make_streamed(rjson::value&&);
}

@@ -165,7 +165,9 @@ static std::optional<std::string> resolve_path_component(const std::string& colu
fmt::format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
return std::string(rjson::to_string_view(*value));
auto result = std::string(rjson::to_string_view(*value));
validate_attr_name_length("", result.size(), false, "ExpressionAttributeNames contains invalid value: ");
return result;
}
return std::nullopt;
}
@@ -737,6 +739,26 @@ rjson::value calculate_value(const parsed::set_rhs& rhs,
return rjson::null_value();
}
void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix) {
constexpr const size_t DYNAMODB_KEY_ATTR_NAME_SIZE_MAX = 255;
constexpr const size_t DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX = 65535;
const size_t max_length = is_key ? DYNAMODB_KEY_ATTR_NAME_SIZE_MAX : DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX;
if (attr_name_length > max_length) {
std::string error_msg;
if (!error_msg_prefix.empty()) {
error_msg += error_msg_prefix;
}
if (!supplementary_context.empty()) {
error_msg += "in ";
error_msg += supplementary_context;
error_msg += " - ";
}
error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));
throw api_error::validation(error_msg);
}
}
} // namespace alternator
auto fmt::formatter<alternator::parsed::path>::format(const alternator::parsed::path& p, fmt::format_context& ctx) const

@@ -91,5 +91,7 @@ rjson::value calculate_value(const parsed::value& v,
rjson::value calculate_value(const parsed::set_rhs& rhs,
const rjson::value* previous_item);
void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix = {});
} /* namespace alternator */

@@ -10,11 +10,12 @@
#include "seastarx.hh"
#include "service/paxos/cas_request.hh"
#include "service/cas_shard.hh"
#include "utils/rjson.hh"
#include "consumed_capacity.hh"
#include "executor.hh"
#include "tracing/trace_state.hh"
#include "keys.hh"
#include "keys/keys.hh"
namespace alternator {
@@ -114,13 +115,15 @@ public:
const rjson::value& request() const { return _request; }
rjson::value&& move_request() && { return std::move(_request); }
future<executor::request_return_type> execute(service::storage_proxy& proxy,
std::optional<service::cas_shard> cas_shard,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit,
bool needs_read_before_write,
stats& stats,
stats& global_stats,
stats& per_table_stats,
uint64_t& wcu_total);
std::optional<shard_id> shard_for_execute(bool needs_read_before_write);
std::optional<service::cas_shard> shard_for_execute(bool needs_read_before_write);
};
} // namespace alternator

View File

@@ -13,7 +13,7 @@
#include <optional>
#include "types/types.hh"
#include "schema/schema_fwd.hh"
#include "keys.hh"
#include "keys/keys.hh"
#include "utils/rjson.hh"
#include "utils/big_decimal.hh"

@@ -13,7 +13,6 @@
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/json/json_elements.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
@@ -31,6 +30,7 @@
#include "gms/gossiper.hh"
#include "utils/overloaded_functor.hh"
#include "utils/aws_sigv4.hh"
#include "client_data.hh"
static logging::logger slogger("alternator-server");
@@ -124,22 +124,22 @@ public:
}
auto res = resf.get();
std::visit(overloaded_functor {
[&] (const json::json_return_type& json_return_value) {
slogger.trace("api_handler success case");
if (json_return_value._body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(json_return_value._body_writer));
} else {
rep->_content += json_return_value._res;
}
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
}
}, res);
[&] (std::string&& str) {
// Note that despite the move, there is a copy here -
// as str is std::string and rep->_content is sstring.
rep->_content = std::move(str);
},
[&] (executor::body_writer&& body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// correct one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(body_writer));
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
}
}, std::move(res));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
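The three-way dispatch in this hunk can be reproduced outside Seastar as a sketch. Here `std::function` and `std::ostream` stand in for `noncopyable_function` and `output_stream<char>`, and `api_error_sketch`, `body_writer_sketch`, `return_type_sketch`, and `render` are hypothetical names; the point is only to show how an rvalue `std::variant` is visited with one lambda per alternative.

```cpp
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <variant>

// Stand-ins for illustration; not the Alternator types.
struct api_error_sketch { std::string message; };
using body_writer_sketch = std::function<void(std::ostream&)>;
using return_type_sketch = std::variant<std::string, body_writer_sketch, api_error_sketch>;

// Helper that merges several lambdas into one visitor.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Produces the response body for each of the three possible results.
inline std::string render(return_type_sketch res) {
    std::ostringstream out;
    std::visit(overloaded{
        [&](std::string&& body) { out << body; },                 // ready-made body
        [&](body_writer_sketch&& writer) { writer(out); },        // streamed body
        [&](const api_error_sketch& err) { out << "error: " << err.message; }
    }, std::move(res));
    return out.str();
}
```

Passing the variant as an rvalue is what allows the string and writer alternatives to be taken by rvalue reference, mirroring the `std::move(res)` in the hunk above.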
@@ -228,9 +228,8 @@ protected:
// If the rack does not exist, we return an empty list - not an error.
sstring query_rack = req->get_query_param("rack");
for (auto& id : local_dc_nodes) {
auto ip = _gossiper.get_address_map().get(id);
if (!query_rack.empty()) {
auto rack = _gossiper.get_application_state_value(ip, gms::application_state::RACK);
auto rack = _gossiper.get_application_state_value(id, gms::application_state::RACK);
if (rack != query_rack) {
continue;
}
@@ -238,10 +237,10 @@ protected:
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We alive *and* normal. See #19694, #21538.
if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {
if (_gossiper.is_alive(id) && _gossiper.is_normal(id)) {
// Use the gossiped broadcast_rpc_address if available instead
// of the internal IP address "ip". See discussion in #18711.
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(id)));
}
}
rep->set_status(reply::status_type::ok);
@@ -432,6 +431,13 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
SCYLLA_ASSERT(req->content_stream);
chunked_content content = co_await util::read_entire_stream(*req->content_stream);
auto username = co_await verify_signature(*req, content);
// As long as the system_clients_entry object is alive, this request will
// be visible in the "system.clients" virtual table. When requested, this
// entry will be formatted by server::ongoing_request::make_client_data().
auto system_clients_entry = _ongoing_requests.emplace(
req->get_client_address(), req->get_header("User-Agent"),
username, current_scheduling_group(),
req->get_protocol_name() == "https");
if (slogger.is_enabled(log_level::trace)) {
std::string buf;
@@ -463,6 +469,9 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
client_state = std::move(client_state), trace_state = std::move(trace_state),
units = std::move(units), req = std::move(req)] () mutable -> future<executor::request_return_type> {
rjson::value json_request = co_await _json_parser.parse(std::move(content));
if (!json_request.IsObject()) {
co_return api_error::validation("Request content must be an object");
}
co_return co_await callback(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
};
@@ -505,7 +514,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _enabled_servers{}
, _pending_requests{}
, _pending_requests("alternator::server::pending_requests")
, _timeout_config(_proxy.data_dictionary().get_config())
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
@@ -679,6 +688,37 @@ future<> server::json_parser::stop() {
return std::move(_run_parse_json_thread);
}
// Convert an entry in the server's list of ongoing Alternator requests
// (_ongoing_requests) into a client_data object. This client_data object
// will then be used to produce a row for the "system.clients" virtual table.
client_data server::ongoing_request::make_client_data() const {
client_data cd;
cd.ct = client_type::alternator;
cd.ip = _client_address.addr();
cd.port = _client_address.port();
cd.shard_id = this_shard_id();
cd.connection_stage = client_connection_stage::established;
cd.username = _username;
cd.scheduling_group_name = _scheduling_group.name();
cd.ssl_enabled = _is_https;
// For now, we save the full User-Agent header as the "driver name"
// and keep "driver_version" unset.
cd.driver_name = _user_agent;
// Leave "protocol_version" unset, it has no meaning in Alternator.
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset.
// As reported in issue #9216, we never set these fields in CQL
// either (see cql_server::connection::make_client_data()).
return cd;
}
future<utils::chunked_vector<client_data>> server::get_client_data() {
utils::chunked_vector<client_data> ret;
co_await _ongoing_requests.for_each_gently([&ret] (const ongoing_request& r) {
ret.emplace_back(r.make_client_data());
});
co_return ret;
}
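
The ongoing-request bookkeeping above relies on an entry staying registered exactly as long as its handle object is alive. A minimal sketch of that idea follows, assuming a plain `std::list` and a move-only handle; `scoped_registry` and `handle` are hypothetical names for illustration, not the real `utils::scoped_item_list` API, and the sketch has no `for_each_gently` or any Seastar integration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <utility>

// Hypothetical, simplified registry; NOT the real utils::scoped_item_list.
template <typename T>
class scoped_registry {
    std::list<T> _items;
public:
    // Move-only handle: erases its entry from the registry when destroyed.
    class handle {
        std::list<T>* _list;
        typename std::list<T>::iterator _it;
    public:
        handle(std::list<T>& l, typename std::list<T>::iterator it) : _list(&l), _it(it) {}
        handle(handle&& o) noexcept : _list(std::exchange(o._list, nullptr)), _it(o._it) {}
        handle(const handle&) = delete;
        handle& operator=(const handle&) = delete;
        handle& operator=(handle&&) = delete;
        ~handle() {
            if (_list) {
                _list->erase(_it);
            }
        }
    };

    // Construct an item in place and return the handle that owns its lifetime.
    template <typename... Args>
    handle emplace(Args&&... args) {
        _items.emplace_back(std::forward<Args>(args)...);
        return handle(_items, std::prev(_items.end()));
    }

    std::size_t size() const { return _items.size(); }

    // Visit every currently registered item.
    template <typename Func>
    void for_each(Func f) const {
        for (const auto& item : _items) {
            f(item);
        }
    }
};
```

In this sketch a handle must not outlive its registry, and `std::list` is chosen because erasing one element leaves iterators to the others valid.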
const char* api_error::what() const noexcept {
if (_what_string.empty()) {
_what_string = fmt::format("{} {}: {}", std::to_underlying(_http_code), _type, _msg);


@@ -9,6 +9,7 @@
#pragma once
#include "alternator/executor.hh"
#include "utils/scoped_item_list.hh"
#include <seastar/core/future.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/http/httpd.hh>
@@ -20,6 +21,8 @@
#include "utils/updateable_value.hh"
#include <seastar/core/units.hh>
struct client_data;
namespace alternator {
using chunked_content = rjson::chunked_content;
@@ -41,7 +44,7 @@ class server : public peering_sharded_service<server> {
key_cache _key_cache;
utils::updateable_value<bool> _enforce_authorization;
utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
gate _pending_requests;
named_gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.
@@ -74,12 +77,30 @@ class server : public peering_sharded_service<server> {
};
json_parser _json_parser;
// The server maintains a list of ongoing requests, that are being handled
// by handle_api_request(). It uses this list in get_client_data(), which
// is called when reading the "system.clients" virtual table.
struct ongoing_request {
socket_address _client_address;
sstring _user_agent;
sstring _username;
scheduling_group _scheduling_group;
bool _is_https;
client_data make_client_data() const;
};
utils::scoped_item_list<ongoing_request> _ongoing_requests;
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
// get_client_data() is called (on each shard separately) when the virtual
// table "system.clients" is read. It is expected to generate a list of
// clients connected to this server (on this shard). This function is
// called by alternator::controller::get_client_data().
future<utils::chunked_vector<client_data>> get_client_data();
private:
void set_routes(seastar::httpd::routes& r);
// If verification succeeds, returns the authenticated user's username


@@ -14,28 +14,58 @@
namespace alternator {
const char* ALTERNATOR_METRICS = "alternator";
static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {
seastar::metrics::histogram res;
res.buckets.resize(histogram.bucket_offsets.size());
uint64_t cumulative_count = 0;
res.sample_count = histogram._count;
res.sample_sum = histogram._sample_sum;
for (size_t i = 0; i < res.buckets.size(); i++) {
auto& v = res.buckets[i];
v.upper_bound = histogram.bucket_offsets[i];
cumulative_count += histogram.buckets[i];
v.count = cumulative_count;
}
return res;
}
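
The conversion above turns per-bucket counts into the cumulative counts that a Prometheus-style histogram expects, where each bucket reports the number of samples at or below its upper bound. A self-contained sketch of the same loop, assuming plain vectors in place of `utils::estimated_histogram`; the `bucket` struct and `to_cumulative` are hypothetical names, and the sketch assumes `counts` has at least as many entries as `offsets`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one exported histogram bucket.
struct bucket {
    double upper_bound = 0;
    uint64_t count = 0; // cumulative: number of samples <= upper_bound
};

// Pair each bucket offset with the running total of samples seen so far.
// Assumes counts.size() >= offsets.size().
std::vector<bucket> to_cumulative(const std::vector<int64_t>& offsets,
                                  const std::vector<uint64_t>& counts) {
    std::vector<bucket> res(offsets.size());
    uint64_t running = 0;
    for (std::size_t i = 0; i < res.size(); i++) {
        res[i].upper_bound = static_cast<double>(offsets[i]);
        running += counts[i];
        res[i].count = running;
    }
    return res;
}
```

With offsets `{1, 2, 4}` and per-bucket counts `{3, 0, 5}`, the cumulative counts are 3, 3 and 8.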
static seastar::metrics::label column_family_label("cf");
static seastar::metrics::label keyspace_label("ks");
static void register_metrics_with_optional_table(seastar::metrics::metric_groups& metrics, const stats& stats, const sstring& ks, const sstring& table) {
stats::stats() : api_operations{} {
// Register the
seastar::metrics::label op("op");
_metrics.add_group("alternator", {
bool has_table = table.length();
std::vector<seastar::metrics::label> aggregate_labels;
std::vector<seastar::metrics::label_instance> labels = {alternator_label};
sstring group_name = (has_table)? "alternator_table" : "alternator";
if (has_table) {
labels.push_back(column_family_label(table));
labels.push_back(keyspace_label(ks));
aggregate_labels.push_back(seastar::metrics::shard_label);
}
metrics.add_group(group_name, {
#define OPERATION(name, CamelCaseName) \
seastar::metrics::make_total_operations("operation", api_operations.name, \
seastar::metrics::description("number of operations via Alternator API"), {op(CamelCaseName), alternator_label, basic_level}).set_skip_when_empty(),
seastar::metrics::make_total_operations("operation", stats.api_operations.name, \
seastar::metrics::description("number of operations via Alternator API"), labels)(basic_level)(op(CamelCaseName)).aggregate(aggregate_labels).set_skip_when_empty(),
#define OPERATION_LATENCY(name, CamelCaseName) \
metrics.add_group(group_name, { \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName), alternator_label, basic_level}, [this]{return to_metrics_histogram(api_operations.name.histogram());}).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(), \
seastar::metrics::description("Latency histogram of an operation via Alternator API"), labels, [&stats]{return to_metrics_histogram(stats.api_operations.name.histogram());})(op(CamelCaseName))(basic_level).aggregate({seastar::metrics::shard_label}).set_skip_when_empty()}); \
if (!has_table) {\
metrics.add_group("alternator", { \
seastar::metrics::make_summary("op_latency_summary", \
seastar::metrics::description("Latency summary of an operation via Alternator API"), [this]{return to_metrics_summary(api_operations.name.summary());})(op(CamelCaseName))(basic_level)(alternator_label).set_skip_when_empty(),
seastar::metrics::description("Latency summary of an operation via Alternator API"), [&stats]{return to_metrics_summary(stats.api_operations.name.summary());})(op(CamelCaseName))(basic_level)(alternator_label).set_skip_when_empty()}); \
}
OPERATION(batch_get_item, "BatchGetItem")
OPERATION(batch_write_item, "BatchWriteItem")
OPERATION(create_backup, "CreateBackup")
OPERATION(create_global_table, "CreateGlobalTable")
OPERATION(create_table, "CreateTable")
OPERATION(delete_backup, "DeleteBackup")
OPERATION(delete_item, "DeleteItem")
OPERATION(delete_table, "DeleteTable")
OPERATION(describe_backup, "DescribeBackup")
OPERATION(describe_continuous_backups, "DescribeContinuousBackups")
OPERATION(describe_endpoints, "DescribeEndpoints")
@@ -64,55 +94,74 @@ stats::stats() : api_operations{} {
OPERATION(update_item, "UpdateItem")
OPERATION(update_table, "UpdateTable")
OPERATION(update_time_to_live, "UpdateTimeToLive")
OPERATION_LATENCY(put_item_latency, "PutItem")
OPERATION_LATENCY(get_item_latency, "GetItem")
OPERATION_LATENCY(delete_item_latency, "DeleteItem")
OPERATION_LATENCY(update_item_latency, "UpdateItem")
OPERATION_LATENCY(batch_write_item_latency, "BatchWriteItem")
OPERATION_LATENCY(batch_get_item_latency, "BatchGetItem")
OPERATION(list_streams, "ListStreams")
OPERATION(describe_stream, "DescribeStream")
OPERATION(get_shard_iterator, "GetShardIterator")
OPERATION(get_records, "GetRecords")
OPERATION_LATENCY(get_records_latency, "GetRecords")
});
_metrics.add_group("alternator", {
seastar::metrics::make_total_operations("unsupported_operations", unsupported_operations,
seastar::metrics::description("number of unsupported operations via Alternator API"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("total_operations", total_operations,
seastar::metrics::description("number of total operations via Alternator API"))(basic_level)(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("reads_before_write", reads_before_write,
seastar::metrics::description("number of performed read-before-write operations"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("write_using_lwt", write_using_lwt,
seastar::metrics::description("number of writes that used LWT"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("shard_bounce_for_lwt", shard_bounce_for_lwt,
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,
seastar::metrics::description("Counts a number of requests blocked due to memory pressure."))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("requests_shed", requests_shed,
seastar::metrics::description("Counts a number of requests shed due to overload."))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,
seastar::metrics::description("number of rows read during filtering operations"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,
seastar::metrics::description("number of rows read and matched during filtering operations")),
seastar::metrics::make_counter("rcu_total", rcu_total,
seastar::metrics::description("total number of consumed read units, counted as half units"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::PUT_ITEM],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("PutItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::DELETE_ITEM],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("DeleteItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::UPDATE_ITEM],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("UpdateItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::INDEX],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("Index")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_dropped_total", [this] { return cql_stats.filtered_rows_read_total - cql_stats.filtered_rows_matched_total; },
seastar::metrics::description("number of rows read and dropped during filtering operations"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},
api_operations.batch_write_item_batch_total)(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},
api_operations.batch_get_item_batch_total)(alternator_label).set_skip_when_empty(),
OPERATION_LATENCY(put_item_latency, "PutItem")
OPERATION_LATENCY(get_item_latency, "GetItem")
OPERATION_LATENCY(delete_item_latency, "DeleteItem")
OPERATION_LATENCY(update_item_latency, "UpdateItem")
OPERATION_LATENCY(batch_write_item_latency, "BatchWriteItem")
OPERATION_LATENCY(batch_get_item_latency, "BatchGetItem")
OPERATION_LATENCY(get_records_latency, "GetRecords")
if (!has_table) {
// Create and delete operations are not applicable to per-table metrics;
// register them only for the global metrics.
metrics.add_group("alternator", {
OPERATION(create_table, "CreateTable")
OPERATION(delete_table, "DeleteTable")
});
}
metrics.add_group(group_name, {
seastar::metrics::make_total_operations("unsupported_operations", stats.unsupported_operations,
seastar::metrics::description("number of unsupported operations via Alternator API"), labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("total_operations", stats.total_operations,
seastar::metrics::description("number of total operations via Alternator API"), labels)(basic_level).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("reads_before_write", stats.reads_before_write,
seastar::metrics::description("number of performed read-before-write operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("write_using_lwt", stats.write_using_lwt,
seastar::metrics::description("number of writes that used LWT"), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("shard_bounce_for_lwt", stats.shard_bounce_for_lwt,
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements"), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("requests_blocked_memory", stats.requests_blocked_memory,
seastar::metrics::description("Counts a number of requests blocked due to memory pressure."), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("requests_shed", stats.requests_shed,
seastar::metrics::description("Counts a number of requests shed due to overload."), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_read_total", stats.cql_stats.filtered_rows_read_total,
seastar::metrics::description("number of rows read during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_matched_total", stats.cql_stats.filtered_rows_matched_total,
seastar::metrics::description("number of rows read and matched during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_counter("rcu_total", [&stats]{return 0.5 * stats.rcu_half_units_total;},
seastar::metrics::description("total number of consumed read units"), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::PUT_ITEM],
seastar::metrics::description("total number of consumed write units"), labels)(op("PutItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::DELETE_ITEM],
seastar::metrics::description("total number of consumed write units"), labels)(op("DeleteItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::UPDATE_ITEM],
seastar::metrics::description("total number of consumed write units"), labels)(op("UpdateItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::INDEX],
seastar::metrics::description("total number of consumed write units"), labels)(op("Index")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_dropped_total", [&stats] { return stats.cql_stats.filtered_rows_read_total - stats.cql_stats.filtered_rows_matched_total; },
seastar::metrics::description("number of rows read and dropped during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,
stats.api_operations.batch_write_item_batch_total)(op("BatchWriteItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,
stats.api_operations.batch_get_item_batch_total)(op("BatchGetItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
}
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats) {
register_metrics_with_optional_table(metrics, stats, "", "");
}
table_stats::table_stats(const sstring& ks, const sstring& table) {
_stats = make_lw_shared<stats>();
register_metrics_with_optional_table(_metrics, *_stats, ks, table);
}
}


@@ -12,6 +12,7 @@
#include <seastar/core/metrics_registration.hh>
#include "utils/histogram.hh"
#include "utils/estimated_histogram.hh"
#include "cql3/stats.hh"
namespace alternator {
@@ -21,7 +22,6 @@ namespace alternator {
// visible by the metrics REST API, with the "alternator" prefix.
class stats {
public:
stats();
// Count of DynamoDB API operations by types
struct {
uint64_t batch_get_item = 0;
@@ -75,6 +75,9 @@ public:
utils::timed_rate_moving_average_summary_and_histogram batch_write_item_latency;
utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram get_records_latency;
utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
} api_operations;
// Miscellaneous event counters
uint64_t total_operations = 0;
@@ -84,7 +87,7 @@ public:
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
uint64_t requests_shed = 0;
uint64_t rcu_total = 0;
uint64_t rcu_half_units_total = 0;
// wcu can result from put, update, delete and index
// Index related will be done on top of the operation it comes with
enum wcu_types {
@@ -98,10 +101,13 @@ public:
uint64_t wcu_total[NUM_TYPES] = {0};
// CQL-derived stats
cql3::cql_stats cql_stats;
private:
// The metric_groups object holds this stat object's metrics registered
// as long as the stats object is alive.
seastar::metrics::metric_groups _metrics;
};
struct table_stats {
table_stats(const sstring& ks, const sstring& table);
seastar::metrics::metric_groups _metrics;
lw_shared_ptr<stats> _stats;
};
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);
}
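
The rename from `rcu_total` to `rcu_half_units_total`, together with the `0.5 * stats.rcu_half_units_total` lambda in the metric registration, suggests the counter is kept as an integer number of half units and scaled only when reported. A minimal sketch of that pattern; `rcu_counter` and its members are hypothetical names, not part of the patch:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter that stores consumed read units as integer half units,
// so that a half-unit read can be counted without floating-point accumulation.
struct rcu_counter {
    uint64_t half_units = 0;

    void add_half_units(uint64_t n) { half_units += n; }

    // Value to report, in whole read units.
    double units() const { return 0.5 * static_cast<double>(half_units); }
};
```

Keeping the stored value integral means increments stay exact, and the conversion to whole units happens once, at read time.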


@@ -217,7 +217,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
rjson::add(ret, "LastEvaluatedStreamArn", *last);
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct shard_id {
@@ -491,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// TODO: label
@@ -617,7 +617,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
}
@@ -770,7 +770,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
auto ret = rjson::empty_object();
rjson::add(ret, "ShardIterator", iter);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct event_id {
@@ -808,6 +808,9 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
if (limit > 1000) {
throw api_error::validation("Limit must be less than or equal to 1000");
}
auto db = _proxy.data_dictionary();
schema_ptr schema, base;
@@ -1018,7 +1021,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
// will notice the end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// ugh. figure out if we are at end-of-shard
@@ -1044,12 +1047,12 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (is_big(ret)) {
return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
});
}
void executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {
auto stream_enabled = rjson::find(stream_specification, "StreamEnabled");
if (!stream_enabled || !stream_enabled->IsBool()) {
throw api_error::validation("StreamSpecification needs boolean StreamEnabled");
@@ -1083,10 +1086,12 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche
break;
}
builder.with_cdc_options(opts);
return true;
} else {
cdc::options opts;
opts.enabled(false);
builder.with_cdc_options(opts);
return false;
}
}


@@ -81,11 +81,6 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
co_return api_error::validation("UpdateTimeToLive requires boolean Enabled");
}
bool enabled = v->GetBool();
// Alternator TTL doesn't yet work when the table uses tablets (#16567)
if (enabled && _proxy.local_db().find_keyspace(schema->ks_name()).get_replication_strategy().uses_tablets()) {
co_return api_error::validation("TTL not yet supported on a table using tablets (issue #16567). "
"Create a table with the tag 'experimental:initial_tablets' set to 'none' to use vnodes.");
}
v = rjson::find(*spec, "AttributeName");
if (!v || !v->IsString()) {
co_return api_error::validation("UpdateTimeToLive requires string AttributeName");
@@ -123,7 +118,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
// basically identical to the request's
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveSpecification", std::move(*spec));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
@@ -140,7 +135,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
}
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveDescription", std::move(desc));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
// expiration_service is a sharded service responsible for cleaning up expired
@@ -291,7 +286,7 @@ static future<> expire_item(service::storage_proxy& proxy,
auto ck = clustering_key::from_exploded(exploded_ck);
m.partition().clustered_row(*schema, ck).apply(tombstone(ts, gc_clock::now()));
}
std::vector<mutation> mutations;
utils::chunked_vector<mutation> mutations;
mutations.push_back(std::move(m));
return proxy.mutate(std::move(mutations),
db::consistency_level::LOCAL_QUORUM,
@@ -315,8 +310,10 @@ static size_t random_offset(size_t min, size_t max) {
// this range's primary node is down. For this we need to return not just
// a list of this node's secondary ranges - but also the primary owner of
// each of those ranges.
//
// The function is to be used with vnodes only
static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_secondary_ranges(
const locator::effective_replication_map_ptr& erm,
const locator::effective_replication_map* erm,
locator::host_id ep) {
const auto& tm = *erm->get_token_metadata_ptr();
const auto& sorted_tokens = tm.sorted_tokens();
@@ -327,6 +324,7 @@ static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_se
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
co_await coroutine::maybe_yield();
// FIXME: pass is_vnode=true to get_natural_replicas since the token is in tm.sorted_tokens()
host_id_vector_replica_set eps = erm->get_natural_replicas(tok);
if (eps.size() <= 1 || eps[1] != ep) {
prev_tok = tok;
@@ -396,7 +394,7 @@ class ranges_holder_primary {
dht::token_range_vector _token_ranges;
public:
explicit ranges_holder_primary(dht::token_range_vector token_ranges) : _token_ranges(std::move(token_ranges)) {}
static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map_ptr& erm, locator::host_id ep) {
static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map* erm, locator::host_id ep) {
co_return ranges_holder_primary(co_await erm->get_primary_ranges(ep));
}
std::size_t size() const { return _token_ranges.size(); }
@@ -416,7 +414,7 @@ public:
explicit ranges_holder_secondary(std::vector<std::pair<dht::token_range, locator::host_id>> token_ranges, const gms::gossiper& g)
: _token_ranges(std::move(token_ranges))
, _gossiper(g) {}
static future<ranges_holder_secondary> make(const locator::effective_replication_map_ptr& erm, locator::host_id ep, const gms::gossiper& g) {
static future<ranges_holder_secondary> make(const locator::vnode_effective_replication_map* erm, locator::host_id ep, const gms::gossiper& g) {
co_return ranges_holder_secondary(co_await get_secondary_ranges(erm, ep), g);
}
std::size_t size() const { return _token_ranges.size(); }
@@ -429,6 +427,8 @@ public:
}
};
// The token_ranges_owned_by_this_shard class is only used for vnodes, where the vnodes give a partition range for the entire node
// and such range still needs to be divided between the shards.
template<class primary_or_secondary_t>
class token_ranges_owned_by_this_shard {
schema_ptr _s;
@@ -522,7 +522,7 @@ struct scan_ranges_context {
// should be possible (and a must for issue #7751!).
lw_shared_ptr<service::pager::paging_state> paging_state = nullptr;
auto regular_columns =
s->regular_columns() | std::views::transform([] (const column_definition& cdef) { return cdef.id; })
s->regular_columns() | std::views::transform(&column_definition::id)
| std::ranges::to<query::column_id_vector>();
selection = cql3::selection::selection::wildcard(s);
query::partition_slice::option_set opts = selection->get_query_options();
@@ -655,6 +655,17 @@ static future<> scan_table_ranges(
}
}
static future<> scan_tablet(locator::tablet_id tablet, service::storage_proxy& proxy, abort_source& abort_source, named_semaphore& page_sem,
expiration_service::stats& expiration_stats, const scan_ranges_context& scan_ctx, const locator::tablet_map& tablet_map) {
auto tablet_token_range = tablet_map.get_token_range(tablet);
dht::ring_position tablet_start(tablet_token_range.start()->value(), dht::ring_position::token_bound::start),
tablet_end(tablet_token_range.end()->value(), dht::ring_position::token_bound::end);
auto partition_range = dht::partition_range::make(std::move(tablet_start), std::move(tablet_end));
// Note that because of issue #9167 we need to run a separate query on each partition range, and can't pass
// several of them into one partition_range_vector that is passed to scan_table_ranges().
return scan_table_ranges(proxy, scan_ctx, {partition_range}, abort_source, page_sem, expiration_stats);
}
// scan_table() scans, in one table, data "owned" by this shard, looking for
// expired items and deleting them.
// We consider each node to "own" its primary token ranges, i.e., the tokens
@@ -730,34 +741,69 @@ static future<bool> scan_table(
expiration_stats.scan_table++;
// FIXME: need to pace the scan, not do it all at once.
scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};
auto erm = db.real_database().find_keyspace(s->ks_name()).get_vnode_effective_replication_map();
auto my_host_id = erm->get_topology().my_host_id();
token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_host_id));
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
// them into one partition_range_vector.
dht::partition_range_vector partition_ranges;
partition_ranges.push_back(std::move(*range));
// FIXME: if scanning a single range fails, including network errors,
// we fail the entire scan (and rescan from the beginning). Need to
// reconsider this. Saving the scan position might be a good enough
// solution for this problem.
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);
}
// If each node only scans its own primary ranges, then when any node is
// down part of the token range will not get scanned. This can be viewed
// as acceptable (when the node comes back online, it will resume its scan),
// but as noted in issue #9787, we can allow more prompt expiration
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_host_id, gossiper));
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
expiration_stats.secondary_ranges_scanned++;
dht::partition_range_vector partition_ranges;
partition_ranges.push_back(std::move(*range));
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);
if (s->table().uses_tablets()) {
locator::effective_replication_map_ptr erm = s->table().get_effective_replication_map();
auto my_host_id = erm->get_topology().my_host_id();
const auto &tablet_map = erm->get_token_metadata().tablets().get_tablet_map(s->id());
for (std::optional tablet = tablet_map.first_tablet(); tablet; tablet = tablet_map.next_tablet(*tablet)) {
auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet);
// check if this is the primary replica for the current tablet
if (tablet_primary_replica.host == my_host_id && tablet_primary_replica.shard == this_shard_id()) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);
} else if(erm->get_replication_factor() > 1) {
// Check if this is the secondary replica for the current tablet
// and if the primary replica is down which means we will take over this work.
// If each node only scans its own primary ranges, then when any node is
// down part of the token range will not get scanned. This can be viewed
// as acceptable (when the node comes back online, it will resume its scan),
// but as noted in issue #9787, we can allow more prompt expiration
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica
if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {
if (!gossiper.is_alive(tablet_primary_replica.host)) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);
}
}
}
}
} else { // VNodes
locator::static_effective_replication_map_ptr ermp =
db.real_database().find_keyspace(s->ks_name()).get_static_effective_replication_map();
auto* erm = ermp->maybe_as_vnode_effective_replication_map();
if (!erm) {
on_internal_error(tlogger, format("Keyspace {} is local", s->ks_name()));
}
auto my_host_id = erm->get_topology().my_host_id();
token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_host_id));
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
// them into one partition_range_vector.
dht::partition_range_vector partition_ranges;
partition_ranges.push_back(std::move(*range));
// FIXME: if scanning a single range fails, including network errors,
// we fail the entire scan (and rescan from the beginning). Need to
// reconsider this. Saving the scan position might be a good enough
// solution for this problem.
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);
}
// If each node only scans its own primary ranges, then when any node is
// down part of the token range will not get scanned. This can be viewed
// as acceptable (when the node comes back online, it will resume its scan),
// but as noted in issue #9787, we can allow more prompt expiration
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_host_id, gossiper));
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
expiration_stats.secondary_ranges_scanned++;
dht::partition_range_vector partition_ranges;
partition_ranges.push_back(std::move(*range));
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);
}
}
co_return true;
}
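The tablet branch of scan_table() above boils down to one per-tablet decision: scan when this shard is the primary replica, or when it is the secondary replica and the primary's host is down. The following is a minimal standalone sketch of that decision only, not the ScyllaDB implementation. The names `replica_ref` and `should_scan_tablet` are hypothetical stand-ins for `locator::tablet_replica` and the inline checks in the diff, and the `is_alive` callback stands in for `gossiper.is_alive()`.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for locator::tablet_replica: a (host, shard) pair.
struct replica_ref {
    std::string host;
    unsigned shard;
    bool operator==(const replica_ref& other) const {
        return host == other.host && shard == other.shard;
    }
};

// Decide whether `me` should scan a tablet for expired items.
// - The primary replica always scans.
// - The secondary replica scans only when the primary's host is down,
//   and only when there is more than one replica to begin with.
bool should_scan_tablet(const replica_ref& me,
                        const replica_ref& primary,
                        const replica_ref& secondary,
                        unsigned replication_factor,
                        const std::function<bool(const std::string&)>& is_alive) {
    assert(replication_factor >= 1);
    if (primary == me) {
        return true;
    }
    if (replication_factor > 1 && secondary == me) {
        return !is_alive(primary.host);
    }
    return false;
}
```

As in the diff, the secondary replica is consulted only when the replication factor is greater than 1, because a tablet with a single replica has no secondary to take over.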


@@ -246,6 +246,24 @@
}
}
},
"sstableinfo":{
"id":"sstableinfo",
"description":"Compacted sstable information",
"properties":{
"generation":{
"type": "string",
"description":"Generation of the sstable"
},
"origin":{
"type":"string",
"description":"Origin of the sstable"
},
"size":{
"type":"long",
"description":"Size of the sstable"
}
}
},
"compaction_info" :{
"id": "compaction_info",
"description":"A key value mapping",
@@ -327,6 +345,10 @@
"type":"string",
"description":"The UUID"
},
"shard_id":{
"type":"int",
"description":"The shard id the compaction was executed on"
},
"cf":{
"type":"string",
"description":"The column family name"
@@ -335,9 +357,17 @@
"type":"string",
"description":"The keyspace name"
},
"compaction_type":{
"type":"string",
"description":"Type of compaction"
},
"started_at":{
"type":"long",
"description":"The time compaction started"
},
"compacted_at":{
"type":"long",
"description":"The time of compaction"
"description":"The time compaction completed"
},
"bytes_in":{
"type":"long",
@@ -353,6 +383,32 @@
"type":"row_merged"
},
"description":"The merged rows"
},
"sstables_in": {
"type":"array",
"items":{
"type":"sstableinfo"
},
"description":"List of input sstables for compaction"
},
"sstables_out": {
"type":"array",
"items":{
"type":"sstableinfo"
},
"description":"List of output sstables from compaction"
},
"total_tombstone_purge_attempt":{
"type":"long",
"description":"Total number of tombstone purge attempts"
},
"total_tombstone_purge_failure_due_to_overlapping_with_memtable":{
"type":"long",
"description":"Number of tombstone purge failures due to data overlapping with memtables"
},
"total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable":{
"type":"long",
"description":"Number of tombstone purge failures due to data overlapping with non-compacting sstables"
}
}
}


@@ -136,14 +136,6 @@
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"unsafe",
"description":"Set to True to perform an unsafe assassination",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}


@@ -2144,6 +2144,31 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"skip_cleanup",
"description":"Don't cleanup keys from loaded sstables. Invalid if load_and_stream is true",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"skip_reshape",
"description":"Don't reshape the loaded sstables. Invalid if load_and_stream is true",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"scope",
"description":"Defines the set of nodes to which mutations can be streamed",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query",
"enum": ["all", "dc", "rack", "node"]
}
]
}
@@ -3027,6 +3052,73 @@
}
]
},
{
"path":"/storage_service/retrain_dict",
"operations":[
{
"method":"POST",
"summary":"Retrain the SSTable compression dictionary for the target table.",
"type":"void",
"nickname":"retrain_dict",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"Name of the keyspace containing the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"cf",
"description":"Name of the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/estimate_compression_ratios",
"operations":[
{
"method":"GET",
"summary":"Compute an estimated compression ratio for SSTables of the given table, for various compression configurations.",
"type":"array",
"items":{
"type":"compression_config_result"
},
"nickname":"estimate_compression_ratios",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"Name of the keyspace containing the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"cf",
"description":"Name of the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/raft_topology/reload",
"operations":[
@@ -3069,6 +3161,54 @@
]
}
]
},
{
"path":"/storage_service/raft_topology/cmd_rpc_status",
"operations":[
{
"method":"GET",
"summary":"Get information about currently running topology cmd rpc",
"type":"string",
"nickname":"raft_topology_get_cmd_status",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_service/drop_quarantined_sstables",
"operations":[
{
"method":"POST",
"summary":"Drops all quarantined sstables in all keyspaces or specified keyspace and tables",
"type":"void",
"nickname":"drop_quarantined_sstables",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace name to drop quarantined sstables from.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"tables",
"description":"Comma-separated table names to drop quarantined sstables from.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
}
],
"models":{
@@ -3328,6 +3468,32 @@
"type":"string"
}
}
},
"compression_config_result":{
"id":"compression_config_result",
"description":"Compression ratio estimation result for one config",
"properties":{
"level":{
"type":"long",
"description":"The used value of `compression_level`"
},
"chunk_length_in_kb":{
"type":"long",
"description":"The used value of `chunk_length_in_kb`"
},
"dict":{
"type":"string",
"description":"The used dictionary: `none`, `past` (== current), or `future`"
},
"sstable_compression":{
"type":"string",
"description":"The used compressor name (aka `sstable_compression`)"
},
"ratio":{
"type":"float",
"description":"The resulting compression ratio (estimated on a random sample of files)"
}
}
}
}
}


@@ -391,32 +391,5 @@ future<> unset_server_raft(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_raft(ctx, r); });
}
void req_params::process(const request& req) {
// Process mandatory parameters
for (auto& [name, ent] : params) {
if (!ent.is_mandatory) {
continue;
}
try {
ent.value = req.get_path_param(name);
} catch (std::out_of_range&) {
throw httpd::bad_param_exception(fmt::format("Mandatory parameter '{}' was not provided", name));
}
}
// Process optional parameters
for (auto& [name, value] : req.query_parameters) {
try {
auto& ent = params.at(name);
if (ent.is_mandatory) {
throw httpd::bad_param_exception(fmt::format("Parameter '{}' is expected to be provided as part of the request url", name));
}
ent.value = value;
} catch (std::out_of_range&) {
throw httpd::bad_param_exception(fmt::format("Unsupported optional parameter '{}'", name));
}
}
}
}


@@ -23,17 +23,6 @@
namespace api {
template<class T>
std::vector<sstring> container_to_vec(const T& container) {
std::vector<sstring> res;
res.reserve(std::size(container));
for (const auto& i : container) {
res.push_back(fmt::to_string(i));
}
return res;
}
template<class T>
std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {
std::vector<T> res;
@@ -67,17 +56,6 @@ T map_sum(T&& dest, const S& src) {
return std::move(dest);
}
template <typename MAP>
std::vector<sstring> map_keys(const MAP& map) {
std::vector<sstring> res;
res.reserve(std::size(map));
for (const auto& i : map) {
res.push_back(fmt::to_string(i.first));
}
return res;
}
/**
* General sstring splitting function
*/
@@ -252,67 +230,6 @@ public:
operator T() const { return value; }
};
using mandatory = bool_class<struct mandatory_tag>;
class req_params {
public:
struct def {
std::optional<sstring> value;
mandatory is_mandatory = mandatory::no;
def(std::optional<sstring> value_ = std::nullopt, mandatory is_mandatory_ = mandatory::no)
: value(std::move(value_))
, is_mandatory(is_mandatory_)
{ }
def(mandatory is_mandatory_)
: is_mandatory(is_mandatory_)
{ }
};
private:
std::unordered_map<sstring, def> params;
public:
req_params(std::initializer_list<std::pair<sstring, def>> l) {
for (const auto& [name, ent] : l) {
add(std::move(name), std::move(ent));
}
}
void add(sstring name, def ent) {
params.emplace(std::move(name), std::move(ent));
}
void process(const request& req);
const std::optional<sstring>& get(const char* name) const {
return params.at(name).value;
}
template <typename T = sstring>
const std::optional<T> get_as(const char* name) const {
return get(name);
}
template <typename T = sstring>
requires std::same_as<T, bool>
const std::optional<bool> get_as(const char* name) const {
auto value = get(name);
if (!value) {
return std::nullopt;
}
std::transform(value->begin(), value->end(), value->begin(), ::tolower);
if (value == "true" || value == "yes" || value == "1") {
return true;
}
if (value == "false" || value == "no" || value == "0") {
return false;
}
throw boost::bad_lexical_cast{};
}
};
httpd::utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimated_histogram& val);
}
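The removed `req_params::get_as<bool>()` accepted `true`/`yes`/`1` and `false`/`no`/`0` case-insensitively and left the default to the caller via `value_or()`. A minimal standalone sketch of that parsing rule follows. The function name `parse_bool_param` is hypothetical, it throws `std::invalid_argument` where the original threw `boost::bad_lexical_cast`, and it is not claimed to match the behavior of `validate_bool_x()`, whose definition is not shown in this diff.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <stdexcept>
#include <string>

// Parse an optional query-parameter value as a boolean.
// A missing value yields `default_value`; unrecognized text throws.
bool parse_bool_param(std::optional<std::string> value, bool default_value) {
    if (!value) {
        return default_value;
    }
    // Lower-case a local copy so the comparison is case-insensitive.
    std::transform(value->begin(), value->end(), value->begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (*value == "true" || *value == "yes" || *value == "1") {
        return true;
    }
    if (*value == "false" || *value == "no" || *value == "0") {
        return false;
    }
    throw std::invalid_argument("not a boolean: " + *value);
}
```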


@@ -360,13 +360,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){
std::vector<sstring> res;
const flat_hash_map<sstring, replica::keyspace>& keyspaces = ctx.db.local().get_keyspaces();
res.reserve(keyspaces.size());
for (const auto& i : keyspaces) {
res.push_back(i.first);
}
return res;
return ctx.db.local().get_all_keyspaces();
});
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -902,17 +896,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, std::move(tables), true);
});
ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, std::move(tables), false);
});
@@ -936,25 +926,19 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
ss::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("enable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, std::move(tables), true);
});
ss::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("disable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, std::move(tables), false);
});
cf::get_built_indexes.set(r, [&ctx, &sys_ks](std::unique_ptr<http::request> req) {
auto ks_cf = parse_fully_qualified_cf_name(req->get_path_param("name"));
auto&& ks = std::get<0>(ks_cf);
auto&& cf_name = std::get<1>(ks_cf);
auto [ks, cf_name] = parse_fully_qualified_cf_name(req->get_path_param("name"));
// Use of load_built_views() as filtering table should be in sync with
// built_indexes_virtual_reader filtering with BUILT_VIEWS table
return sys_ks.local().load_built_views().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace::view_name>& vb) mutable {
@@ -1054,13 +1038,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
return ctx.db.map_reduce0([key, uuid] (replica::database& db) -> future<std::unordered_set<sstring>> {
auto sstables = co_await db.find_column_family(uuid).get_sstables_by_partition_key(key);
co_return sstables | std::views::transform([] (auto s) { return s->get_filename(); }) | std::ranges::to<std::unordered_set>();
co_return sstables | std::views::transform([] (auto s) -> sstring { return fmt::to_string(s->get_filename()); }) | std::ranges::to<std::unordered_set>();
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.merge(b);
return a;
}).then([](const std::unordered_set<sstring>& res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
return make_ready_future<json::json_return_type>(res | std::ranges::to<std::vector>());
});
});
@@ -1082,19 +1066,12 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::force_major_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto params = req_params({
std::pair("name", mandatory::yes),
std::pair("flush_memtables", mandatory::no),
std::pair("consider_only_existing_data", mandatory::no),
std::pair("split_output", mandatory::no),
});
params.process(*req);
if (params.get("split_output")) {
if (req->query_parameters.contains("split_output")) {
fail(unimplemented::cause::API);
}
auto [ks, cf] = parse_fully_qualified_cf_name(*params.get("name"));
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
apilog.info("column_family/force_major_compaction: name={} flush={} consider_only_existing_data={}", req->get_path_param("name"), flush, consider_only_existing_data);
auto keyspace = validate_keyspace(ctx, ks);


@@ -28,10 +28,14 @@ template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer) {
auto uuid = parse_table_info(name, ctx.db.local()).id;
using mapper_type = std::function<std::unique_ptr<std::any>(replica::database&)>;
using mapper_type = std::function<future<std::unique_ptr<std::any>>(replica::database&)>;
using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;
return ctx.db.map_reduce0(mapper_type([mapper, uuid](replica::database& db) {
return std::make_unique<std::any>(I(mapper(db.find_column_family(uuid))));
return futurize_invoke([mapper, &db, uuid] {
return mapper(db.find_column_family(uuid));
}).then([] (auto result) {
return std::make_unique<std::any>(I(std::move(result)));
});
}), std::make_unique<std::any>(std::move(init)), reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {
return std::make_unique<std::any>(I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));
})).then([] (std::unique_ptr<std::any> r) {
@@ -61,13 +65,12 @@ future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, c
struct map_reduce_column_families_locally {
std::any init;
std::function<std::unique_ptr<std::any>(replica::column_family&)> mapper;
std::function<future<std::unique_ptr<std::any>>(replica::column_family&)> mapper;
std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)> reducer;
future<std::unique_ptr<std::any>> operator()(replica::database& db) const {
auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));
return db.get_tables_metadata().for_each_table_gently([res, this] (table_id, seastar::lw_shared_ptr<replica::table> table) {
*res = reducer(std::move(*res), mapper(*table.get()));
return make_ready_future();
return db.get_tables_metadata().for_each_table_gently([res, this] (table_id, seastar::lw_shared_ptr<replica::table> table) -> future<> {
*res = reducer(std::move(*res), co_await mapper(*table.get()));
}).then([res] () {
return std::move(*res);
});
@@ -77,10 +80,14 @@ struct map_reduce_column_families_locally {
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
using mapper_type = std::function<std::unique_ptr<std::any>(replica::column_family&)>;
using mapper_type = std::function<future<std::unique_ptr<std::any>>(replica::column_family&)>;
using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;
auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (replica::column_family& cf) mutable {
return std::make_unique<std::any>(I(mapper(cf)));
return futurize_invoke([&cf, mapper] {
return mapper(cf);
}).then([] (auto result) {
return std::make_unique<std::any>(I(std::move(result)));
});
});
auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {
return std::make_unique<std::any>(I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));


@@ -14,6 +14,7 @@
#include "api/api.hh"
#include "api/api-doc/compaction_manager.json.hh"
#include "api/api-doc/storage_service.json.hh"
#include "db/compaction_history_entry.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"
#include "unimplemented.hh"
@@ -71,10 +72,9 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man
cm::get_pending_tasks_by_table.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return ctx.db.map_reduce0([](replica::database& db) {
return do_with(std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>(), [&db](std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>& tasks) {
return db.get_tables_metadata().for_each_table_gently([&tasks] (table_id, lw_shared_ptr<replica::table> table) {
return db.get_tables_metadata().for_each_table_gently([&tasks] (table_id, lw_shared_ptr<replica::table> table) -> future<> {
replica::table& cf = *table.get();
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.estimate_pending_compactions();
return make_ready_future<>();
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = co_await cf.estimate_pending_compactions();
}).then([&tasks] {
return std::move(tasks);
});
@@ -111,14 +111,13 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man
});
cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto ks_name = validate_keyspace(ctx, req);
auto tables = parse_table_infos(ks_name, ctx, req->query_parameters, "tables");
auto [ks_name, tables] = parse_table_infos(ctx, *req, "tables");
auto type = req->get_query_param("type");
co_await ctx.db.invoke_on_all([&] (replica::database& db) {
auto& cm = db.get_compaction_manager();
return parallel_for_each(tables, [&] (const table_info& ti) {
auto& t = db.find_column_family(ti.id);
return t.parallel_foreach_table_state([&] (compaction::table_state& ts) {
return t.parallel_foreach_compaction_group_view([&] (compaction::compaction_group_view& ts) {
return cm.stop_compaction(type, &ts);
});
});
@@ -160,8 +159,11 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man
co_await cm.local().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {
cm::history h;
h.id = fmt::to_string(entry.id);
h.shard_id = entry.shard_id;
h.ks = std::move(entry.ks);
h.cf = std::move(entry.cf);
h.compaction_type = entry.compaction_type;
h.started_at = entry.started_at;
h.compacted_at = entry.compacted_at;
h.bytes_in = entry.bytes_in;
h.bytes_out = entry.bytes_out;
@@ -173,6 +175,24 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man
e.value = it.second;
h.rows_merged.push(std::move(e));
}
for (const auto& data : entry.sstables_in) {
httpd::compaction_manager_json::sstableinfo sstable;
sstable.generation = fmt::to_string(data.generation),
sstable.origin = data.origin,
sstable.size = data.size,
h.sstables_in.push(std::move(sstable));
}
for (const auto& data : entry.sstables_out) {
httpd::compaction_manager_json::sstableinfo sstable;
sstable.generation = fmt::to_string(data.generation),
sstable.origin = data.origin,
sstable.size = data.size,
h.sstables_out.push(std::move(sstable));
}
h.total_tombstone_purge_attempt = entry.total_tombstone_purge_attempt;
h.total_tombstone_purge_failure_due_to_overlapping_with_memtable = entry.total_tombstone_purge_failure_due_to_overlapping_with_memtable;
h.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable = entry.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable;
if (!first) {
co_await s.write(", ");
}


@@ -23,22 +23,6 @@ using namespace seastar::httpd;
namespace sp = httpd::storage_proxy_json;
namespace ss = httpd::storage_service_json;
template<class T>
json::json_return_type get_json_return_type(const T& val) {
return json::json_return_type(val);
}
/*
* As commented on db::seed_provider_type is not used
* and probably never will.
*
* Just in case, we will return its name
*/
template<>
json::json_return_type get_json_return_type(const db::seed_provider_type& val) {
return json::json_return_type(val.class_name);
}
std::string_view format_type(std::string_view type) {
if (type == "int") {
return "integer";
@@ -187,7 +171,7 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx
});
ss::get_all_data_file_locations.set(r, [&cfg](const_req req) {
return container_to_vec(cfg.data_file_directories());
return cfg.data_file_directories();
});
ss::get_saved_caches_location.set(r, [&cfg](const_req req) {


@@ -22,10 +22,10 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::vector<fd::endpoint_state> res;
res.reserve(g.num_endpoints());
g.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& eps) {
g.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {
fd::endpoint_state val;
val.addrs = fmt::to_string(addr);
val.is_alive = g.is_alive(addr);
val.addrs = fmt::to_string(eps.get_ip());
val.is_alive = g.is_alive(eps.get_host_id());
val.generation = eps.get_heart_beat_state().get_generation().value();
val.version = eps.get_heart_beat_state().get_heart_beat_version().value();
val.update_time = eps.get_update_timestamp().time_since_epoch().count();
@@ -40,7 +40,9 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
}
res.emplace_back(std::move(val));
});
return make_ready_future<json::json_return_type>(res);
return make_ready_future<json::json_return_type>(json::stream_range_as_array(res, [](const fd::endpoint_state& i){
return i;
}));
});
});
@@ -64,11 +66,15 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::map<sstring, sstring> nodes_status;
g.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state&) {
nodes_status.emplace(fmt::to_string(node), g.is_alive(node) ? "UP" : "DOWN");
std::vector<fd::mapper> nodes_status;
nodes_status.reserve(g.num_endpoints());
g.for_each_endpoint_state([&] (const gms::endpoint_state& es) {
fd::mapper val;
val.key = fmt::to_string(es.get_ip());
val.value = g.is_alive(es.get_host_id()) ? "UP" : "DOWN";
nodes_status.emplace_back(std::move(val));
});
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
return make_ready_future<json::json_return_type>(std::move(nodes_status));
});
});
@@ -81,7 +87,7 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {
auto state = g.get_endpoint_state_ptr(gms::inet_address(req->get_path_param("addr")));
auto state = g.get_endpoint_state_ptr(g.get_host_id(gms::inet_address(req->get_path_param("addr"))));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->get_path_param("addr")));
}

View File

@@ -21,51 +21,45 @@ using namespace json;
void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto res = co_await g.get_unreachable_members_synchronized();
co_return json::json_return_type(container_to_vec(res));
co_return json::json_return_type(res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>());
});
httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) {
return g.get_live_members_synchronized().then([] (auto res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
});
httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto res = co_await g.get_live_members_synchronized();
co_return json::json_return_type(res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>());
});
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
gms::inet_address ep(req->get_path_param("addr"));
// synchronize unreachable_members on all shards
co_await g.get_unreachable_members_synchronized();
co_return g.get_endpoint_downtime(ep);
co_return g.get_endpoint_downtime(g.get_host_id(ep));
});
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_generation_number(ep).then([] (gms::generation_type res) {
return g.get_current_generation_number(g.get_host_id(ep)).then([] (gms::generation_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
});
httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_heart_beat_version(ep).then([] (gms::version_type res) {
return g.get_current_heart_beat_version(g.get_host_id(ep)).then([] (gms::version_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
});
httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
if (req->get_query_param("unsafe") != "True") {
return g.assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
return g.unsafe_assassinate_endpoint(req->get_path_param("addr")).then([] {
return g.assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
gms::inet_address ep(req->get_path_param("addr"));
return g.force_remove_endpoint(ep, gms::null_permit_id).then([] {
return g.force_remove_endpoint(g.get_host_id(ep), gms::null_permit_id).then([] () {
return make_ready_future<json::json_return_type>(json_void());
});
});

@@ -148,7 +148,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
hf::inject_disconnect.set(r, [&ms] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto ip = msg_addr(req->get_path_param("ip"));
co_await ms.invoke_on_all([ip] (netw::messaging_service& ms) {
ms.remove_rpc_client(ip);
ms.remove_rpc_client(ip, std::nullopt);
});
co_return json::json_void();
});

@@ -11,7 +11,7 @@
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/consistency_level_type.hh"
#include "seastar/json/json_elements.hh"
#include <seastar/json/json_elements.hh>
#include "transport/controller.hh"
#include <unordered_map>

@@ -14,6 +14,9 @@
#include "api/scrub_status.hh"
#include "db/config.hh"
#include "db/schema_tables.hh"
#include "gms/feature_service.hh"
#include "schema/schema_builder.hh"
#include "sstables/sstables_manager.hh"
#include "utils/hash.hh"
#include <optional>
#include <sstream>
@@ -29,6 +32,7 @@
#include "service/raft/raft_group0_client.hh"
#include "service/storage_service.hh"
#include "service/load_meter.hh"
#include "gms/feature_service.hh"
#include "gms/gossiper.hh"
#include "db/system_keyspace.hh"
#include <seastar/http/exception.hh>
@@ -55,6 +59,7 @@
#include "db/view/view_builder.hh"
#include "utils/rjson.hh"
#include "utils/user_provided_param.hh"
#include "sstable_dict_autotrainer.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
@@ -122,37 +127,26 @@ bool validate_bool(const sstring& param) {
}
}
bool validate_bool_x(const sstring& param, bool default_value) {
if (param.empty()) {
return default_value;
}
if (strcasecmp(param.c_str(), "true") == 0 || strcasecmp(param.c_str(), "yes") == 0 || param == "1") {
return true;
}
if (strcasecmp(param.c_str(), "false") == 0 || strcasecmp(param.c_str(), "no") == 0 || param == "0") {
return false;
}
throw std::runtime_error("Invalid boolean parameter value");
}
static
int64_t validate_int(const sstring& param) {
return std::atoll(param.c_str());
}
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
static std::vector<sstring> parse_tables(const sstring& ks_name, const http_context& ctx, sstring value) {
if (value.empty()) {
return map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());
}
std::vector<sstring> names = split(value, ",");
try {
for (const auto& table_name : names) {
ctx.db.local().find_column_family(ks_name, table_name);
}
} catch (const replica::no_such_column_family& e) {
throw bad_param_exception(e.what());
}
return names;
}
static std::vector<sstring> parse_tables(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
if (it == query_params.end()) {
return {};
}
return parse_tables(ks_name, ctx, it->second);
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, sstring value) {
std::vector<table_info> res;
try {
@@ -178,9 +172,12 @@ std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_con
return res;
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
return parse_table_infos(ks_name, ctx, it != query_params.end() ? it->second : "");
std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name) {
auto keyspace = validate_keyspace(ctx, req);
const auto& query_params = req.query_parameters;
auto it = query_params.find(cf_param_name);
auto tis = parse_table_infos(keyspace, ctx, it != query_params.end() ? it->second : "");
return std::make_pair(std::move(keyspace), std::move(tis));
}
static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {
@@ -201,16 +198,6 @@ static ss::token_range token_range_endpoints_to_json(const dht::token_range_endp
return r;
}
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<http::request>, sstring, std::vector<table_info>)>;
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
}
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {
return q.scatter().then([&q, legacy_request] {
return sleep(q.duration()).then([&q, legacy_request] {
@@ -243,28 +230,19 @@ seastar::future<json::json_return_type> run_toppartitions_query(db::toppartition
future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl, std::unique_ptr<http::request> req) {
scrub_info info;
auto rp = req_params({
{"keyspace", {mandatory::yes}},
{"cf", {""}},
{"scrub_mode", {}},
{"skip_corrupted", {}},
{"disable_snapshot", {}},
{"quarantine_mode", {}},
});
rp.process(*req);
info.keyspace = validate_keyspace(ctx, *rp.get("keyspace"));
info.column_families = parse_tables(info.keyspace, ctx, *rp.get("cf"));
auto scrub_mode_opt = rp.get("scrub_mode");
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
info.keyspace = std::move(keyspace);
info.column_families = table_infos | std::views::transform(&table_info::name) | std::ranges::to<std::vector>();
auto scrub_mode_str = req->get_query_param("scrub_mode");
auto scrub_mode = sstables::compaction_type_options::scrub::mode::abort;
if (!scrub_mode_opt) {
const auto skip_corrupted = rp.get_as<bool>("skip_corrupted").value_or(false);
if (scrub_mode_str.empty()) {
const auto skip_corrupted = validate_bool_x(req->get_query_param("skip_corrupted"), false);
if (skip_corrupted) {
scrub_mode = sstables::compaction_type_options::scrub::mode::skip;
}
} else {
auto scrub_mode_str = *scrub_mode_opt;
if (scrub_mode_str == "ABORT") {
scrub_mode = sstables::compaction_type_options::scrub::mode::abort;
} else if (scrub_mode_str == "SKIP") {
@@ -278,11 +256,9 @@ future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snap
}
}
if (!req_param<bool>(*req, "disable_snapshot", false)) {
if (!req_param<bool>(*req, "disable_snapshot", false) && !info.column_families.empty()) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
co_await coroutine::parallel_for_each(info.column_families, [&snap_ctl, keyspace = info.keyspace, tag](sstring cf) {
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::skip_flush::no);
});
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, tag, db::snapshot_ctl::skip_flush::no);
}
info.opts = {
@@ -383,6 +359,9 @@ void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair, s
// if the option is not sane, repair_start() throws immediately, so
// convert the exception to an HTTP error
throw httpd::bad_param_exception(e.what());
} catch (const tablets_unsupported& e) {
throw base_exception("Cannot repair tablet keyspace. Use /storage_service/tablets/repair to repair tablet keyspaces.",
http::reply::status_type::forbidden);
}
});
@@ -483,17 +462,27 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
auto cf = req->get_query_param("cf");
auto stream = req->get_query_param("load_and_stream");
auto primary_replica = req->get_query_param("primary_replica_only");
auto skip_cleanup_p = req->get_query_param("skip_cleanup");
boost::algorithm::to_lower(stream);
boost::algorithm::to_lower(primary_replica);
bool load_and_stream = stream == "true" || stream == "1";
bool primary_replica_only = primary_replica == "true" || primary_replica == "1";
bool skip_cleanup = skip_cleanup_p == "true" || skip_cleanup_p == "1";
auto scope = parse_stream_scope(req->get_query_param("scope"));
auto skip_reshape_p = req->get_query_param("skip_reshape");
auto skip_reshape = skip_reshape_p == "true" || skip_reshape_p == "1";
if (scope != sstables_loader::stream_scope::all && !load_and_stream) {
throw httpd::bad_param_exception("scope takes no effect without load-and-stream");
}
// No need to add the keyspace, since all we want is to avoid always sending this to the same
// CPU. Even then I am being overzealous here. This is not something that happens all the time.
auto coordinator = std::hash<sstring>()(cf) % smp::count;
return sst_loader.invoke_on(coordinator,
[ks = std::move(ks), cf = std::move(cf),
load_and_stream, primary_replica_only] (sstables_loader& loader) {
return loader.load_new_sstables(ks, cf, load_and_stream, primary_replica_only, sstables_loader::stream_scope::all);
load_and_stream, primary_replica_only, skip_cleanup, skip_reshape, scope] (sstables_loader& loader) {
return loader.load_new_sstables(ks, cf, load_and_stream, primary_replica_only, skip_cleanup, skip_reshape, scope);
}).then_wrapped([] (auto&& f) {
if (f.failed()) {
auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());
@@ -662,7 +651,7 @@ rest_get_range_to_endpoint_map(http_context& ctx, sharded<service::storage_servi
auto& ks = ctx.db.local().find_keyspace(keyspace);
if (table.empty()) {
ensure_tablets_disabled(ctx, keyspace, "storage_service/range_to_endpoint_map");
return ks.get_vnode_effective_replication_map();
return ks.get_static_effective_replication_map();
} else {
auto table_id = validate_table(ctx.db.local(), keyspace, table);
auto& cf = ctx.db.local().find_column_family(table_id);
@@ -725,7 +714,7 @@ rest_get_load(http_context& ctx, std::unique_ptr<http::request> req) {
static
future<json::json_return_type>
rest_get_current_generation_number(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto ep = ss.local().get_token_metadata().get_topology().my_address();
auto ep = ss.local().get_token_metadata().get_topology().my_host_id();
return ss.local().gossiper().get_current_generation_number(ep).then([](gms::generation_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
@@ -735,8 +724,8 @@ static
json::json_return_type
rest_get_natural_endpoints(http_context& ctx, sharded<service::storage_service>& ss, const_req req) {
auto keyspace = validate_keyspace(ctx, req);
return container_to_vec(ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"),
req.get_query_param("key")));
auto res = ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"), req.get_query_param("key"));
return res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>();
}
static
@@ -753,13 +742,8 @@ static
future<json::json_return_type>
rest_force_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto params = req_params({
std::pair("flush_memtables", mandatory::no),
std::pair("consider_only_existing_data", mandatory::no),
});
params.process(*req);
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
apilog.info("force_compaction: flush={} consider_only_existing_data={}", flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
@@ -768,13 +752,7 @@ rest_force_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
apilog.error("force_compaction failed: {}", std::current_exception());
throw;
}
co_await task->done();
co_return json_void();
}
@@ -782,17 +760,9 @@ static
future<json::json_return_type>
rest_force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto params = req_params({
std::pair("keyspace", mandatory::yes),
std::pair("cf", mandatory::no),
std::pair("flush_memtables", mandatory::no),
std::pair("consider_only_existing_data", mandatory::no),
});
params.process(*req);
auto keyspace = validate_keyspace(ctx, *params.get("keyspace"));
auto table_infos = parse_table_infos(keyspace, ctx, params.get("cf").value_or(""));
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
@@ -801,13 +771,7 @@ rest_force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request>
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
apilog.error("force_keyspace_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json_void();
}
@@ -815,11 +779,10 @@ static
future<json::json_return_type>
rest_force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
if (rs.get_type() == locator::replication_strategy_type::local || !rs.is_vnode_based()) {
auto reason = rs.get_type() == locator::replication_strategy_type::local ? "require" : "support";
if (rs.is_local() || !rs.is_vnode_based()) {
auto reason = rs.is_local() ? "require" : "support";
apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);
co_return json::json_return_type(0);
}
@@ -833,13 +796,7 @@ rest_force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>(
{}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);
try {
co_await task->done();
} catch (...) {
apilog.error("force_keyspace_cleanup: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
@@ -861,49 +818,34 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::
auto& db = ctx.db;
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<global_cleanup_compaction_task_impl>({}, db);
try {
co_await task->done();
} catch (...) {
apilog.error("cleanup_all failed: {}", std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
static
future<json::json_return_type>
rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {
rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);
bool res = false;
auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, &res);
try {
co_await task->done();
} catch (...) {
apilog.error("perform_keyspace_offstrategy_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(res);
}
static
future<json::json_return_type>
rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {
rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
try {
co_await task->done();
} catch (...) {
apilog.error("upgrade_sstables: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
@@ -920,15 +862,10 @@ rest_force_flush(http_context& ctx, std::unique_ptr<http::request> req) {
static
future<json::json_return_type>
rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, table_infos);
auto& db = ctx.db;
if (column_families.empty()) {
co_await replica::database::flush_keyspace_on_all_shards(db, keyspace);
} else {
co_await replica::database::flush_tables_on_all_shards(db, keyspace, std::move(column_families));
}
co_await replica::database::flush_tables_on_all_shards(db, std::move(table_infos));
co_return json_void();
}
@@ -1060,7 +997,7 @@ rest_get_keyspaces(http_context& ctx, const_req req) {
} else if (type == "non_local_strategy") {
keyspaces = ctx.db.local().get_non_local_strategy_keyspaces();
} else {
keyspaces = map_keys(ctx.db.local().get_keyspaces());
keyspaces = ctx.db.local().get_all_keyspaces();
}
if (replication.empty() || replication == "all") {
return keyspaces;
@@ -1448,6 +1385,95 @@ rest_get_effective_ownership(http_context& ctx, sharded<service::storage_service
});
}
static
future<json::json_return_type>
rest_estimate_compression_ratios(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().sstable_compression_dicts) {
apilog.warn("estimate_compression_ratios: called before the cluster feature was enabled");
throw std::runtime_error("estimate_compression_ratios requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("estimate_compression_ratios: called with ks={} cf={}", ks, cf);
auto s = ctx.db.local().find_column_family(ks, cf).schema();
auto training_sample = co_await ss.local().do_sample_sstables(s->id(), 4096, 4096);
auto validation_sample = co_await ss.local().do_sample_sstables(s->id(), 16*1024, 1024);
apilog.debug("estimate_compression_ratios: got training sample with {} blocks and validation sample with {}", training_sample.size(), validation_sample.size());
auto dict = co_await ss.local().train_dict(std::move(training_sample));
apilog.debug("estimate_compression_ratios: got dict of size {}", dict.size());
std::vector<ss::compression_config_result> res;
auto make_result = [](std::string_view name, int chunk_length_kb, std::string_view dict, int level, float ratio) -> ss::compression_config_result {
ss::compression_config_result x;
x.sstable_compression = sstring(name);
x.chunk_length_in_kb = chunk_length_kb;
x.dict = sstring(dict);
x.level = level;
x.ratio = ratio;
return x;
};
using algorithm = compression_parameters::algorithm;
for (const auto& algo : {algorithm::lz4_with_dicts, algorithm::zstd_with_dicts}) {
for (const auto& chunk_size_kb : {1, 4, 16}) {
std::vector<int> levels;
if (algo == compressor::algorithm::zstd_with_dicts) {
for (int i = 1; i <= 5; ++i) {
levels.push_back(i);
}
} else {
levels.push_back(1);
}
for (auto level : levels) {
auto algo_name = compression_parameters::algorithm_to_name(algo);
auto m = std::map<sstring, sstring>{
{compression_parameters::CHUNK_LENGTH_KB, std::to_string(chunk_size_kb)},
{compression_parameters::SSTABLE_COMPRESSION, sstring(algo_name)},
};
if (algo == compressor::algorithm::zstd_with_dicts) {
m.insert(decltype(m)::value_type{sstring("compression_level"), sstring(std::to_string(level))});
}
auto params = compression_parameters(std::move(m));
auto ratio_with_no_dict = co_await try_one_compression_config({}, s, params, validation_sample);
auto ratio_with_past_dict = co_await try_one_compression_config(ctx.db.local().get_user_sstables_manager().get_compressor_factory(), s, params, validation_sample);
auto ratio_with_future_dict = co_await try_one_compression_config(dict, s, params, validation_sample);
res.push_back(make_result(algo_name, chunk_size_kb, "none", level, ratio_with_no_dict));
res.push_back(make_result(algo_name, chunk_size_kb, "past", level, ratio_with_past_dict));
res.push_back(make_result(algo_name, chunk_size_kb, "future", level, ratio_with_future_dict));
}
}
}
co_return res;
}
static
future<json::json_return_type>
rest_retrain_dict(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().sstable_compression_dicts) {
apilog.warn("retrain_dict: called before the cluster feature was enabled");
throw std::runtime_error("retrain_dict requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("retrain_dict: called with ks={} cf={}", ks, cf);
const auto t_id = ctx.db.local().find_column_family(ks, cf).schema()->id();
constexpr uint64_t chunk_size = 4096;
constexpr uint64_t n_chunks = 4096;
auto sample = co_await ss.local().do_sample_sstables(t_id, chunk_size, n_chunks);
apilog.debug("retrain_dict: got sample with {} blocks", sample.size());
auto dict = co_await ss.local().train_dict(std::move(sample));
apilog.debug("retrain_dict: got dict of size {}", dict.size());
co_await ss.local().publish_new_sstable_dict(t_id, dict, group0_client);
apilog.debug("retrain_dict: published new dict");
co_return json_void();
}
static
future<json::json_return_type>
rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {
@@ -1509,21 +1535,23 @@ rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {
info.version = sstable->get_version();
if (sstable->has_component(sstables::component_type::CompressionInfo)) {
auto& c = sstable->get_compression();
auto cp = sstables::get_sstable_compressor(c);
const auto& cp = sstable->get_compression().get_compressor();
ss::named_maps nm;
nm.group = "compression_parameters";
for (auto& p : cp->options()) {
for (auto& p : cp.options()) {
if (compressor::is_hidden_option_name(p.first)) {
continue;
}
ss::mapper e;
e.key = p.first;
e.value = p.second;
nm.attributes.push(std::move(e));
}
if (!cp->options().contains(compression_parameters::SSTABLE_COMPRESSION)) {
if (!cp.options().contains(compression_parameters::SSTABLE_COMPRESSION)) {
ss::mapper e;
e.key = compression_parameters::SSTABLE_COMPRESSION;
e.value = cp->name();
e.value = sstring(cp.name());
nm.attributes.push(std::move(e));
}
info.extended_properties.push(std::move(nm));
@@ -1610,6 +1638,18 @@ rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::un
co_return sstring(format("{}", ustate));
}
static
future<json::json_return_type>
rest_raft_topology_get_cmd_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
const auto status = co_await ss.invoke_on(0, [] (auto& ss) {
return ss.get_topology_cmd_status();
});
if (status.active_dst.empty()) {
co_return sstring("none");
}
co_return sstring(fmt::format("{}[{}]: {}", status.current, status.index, fmt::join(status.active_dst, ",")));
}
static
future<json::json_return_type>
rest_move_tablet(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -1640,7 +1680,7 @@ rest_add_tablet_replica(http_context& ctx, sharded<service::storage_service>& ss
auto token = dht::token::from_int64(validate_int(req->get_query_param("token")));
auto ks = req->get_query_param("ks");
auto table = req->get_query_param("table");
auto table_id = ctx.db.local().find_column_family(ks, table).schema()->id();
auto table_id = validate_table(ctx.db.local(), ks, table);
auto force_str = req->get_query_param("force");
auto force = service::loosen_constraints(force_str == "" ? false : validate_bool(force_str));
@@ -1659,7 +1699,7 @@ rest_del_tablet_replica(http_context& ctx, sharded<service::storage_service>& ss
auto token = dht::token::from_int64(validate_int(req->get_query_param("token")));
auto ks = req->get_query_param("ks");
auto table = req->get_query_param("table");
auto table_id = ctx.db.local().find_column_family(ks, table).schema()->id();
auto table_id = validate_table(ctx.db.local(), ks, table);
auto force_str = req->get_query_param("force");
auto force = service::loosen_constraints(force_str == "" ? false : validate_bool(force_str));
@@ -1736,6 +1776,7 @@ future<json::json_return_type>
rest_get_schema_versions(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
return ss.local().describe_schema_versions().then([] (auto result) {
std::vector<sp::mapper_list> res;
res.reserve(result.size());
for (auto e : result) {
sp::mapper_list entry;
entry.key = std::move(e.first);
@@ -1746,6 +1787,36 @@ rest_get_schema_versions(sharded<service::storage_service>& ss, std::unique_ptr<
});
}
static
future<json::json_return_type>
rest_drop_quarantined_sstables(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto keyspace = req->get_query_param("keyspace");
try {
if (!keyspace.empty()) {
keyspace = validate_keyspace(ctx, keyspace);
auto it = req->query_parameters.find("tables");
auto table_infos = parse_table_infos(keyspace, ctx, it != req->query_parameters.end() ? it->second : "");
co_await ctx.db.invoke_on_all([&table_infos](replica::database& db) -> future<> {
return parallel_for_each(table_infos, [&db](const auto& table) -> future<> {
const auto& [table_name, table_id] = table;
return db.find_column_family(table_id).drop_quarantined_sstables();
});
});
} else {
co_await ctx.db.invoke_on_all([](replica::database& db) -> future<> {
return db.get_tables_metadata().parallel_for_each_table([](table_id, lw_shared_ptr<replica::table> t) -> future<> {
return t->drop_quarantined_sstables();
});
});
}
} catch (...) {
apilog.error("drop_quarantined_sstables: failed with exception: {}", std::current_exception());
throw;
}
co_return json_void();
}
// Disambiguate between a function that returns a future and a function that returns a plain value, also
// add std::ref() as a courtesy. Also handles ks_cf_func signatures.
@@ -1767,12 +1838,6 @@ rest_bind(FuncType func, BindArgs&... args) {
return std::bind_front(func, std::ref(args)...);
}
static
seastar::httpd::future_json_function
rest_bind(ks_cf_func func, http_context& ctx) {
return wrap_ks_cf(ctx, func);
}
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));
@@ -1841,10 +1906,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::get_total_hints.set(r, rest_bind(rest_get_total_hints));
ss::get_ownership.set(r, rest_bind(rest_get_ownership, ctx, ss));
ss::get_effective_ownership.set(r, rest_bind(rest_get_effective_ownership, ctx, ss));
ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));
ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));
ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
ss::raft_topology_get_cmd_status.set(r, rest_bind(rest_raft_topology_get_cmd_status, ss));
ss::move_tablet.set(r, rest_bind(rest_move_tablet, ctx, ss));
ss::add_tablet_replica.set(r, rest_bind(rest_add_tablet_replica, ctx, ss));
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
@@ -1852,6 +1920,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::tablet_balancing_enable.set(r, rest_bind(rest_tablet_balancing_enable, ss));
ss::quiesce_topology.set(r, rest_bind(rest_quiesce_topology, ss));
sp::get_schema_versions.set(r, rest_bind(rest_get_schema_versions, ss));
ss::drop_quarantined_sstables.set(r, rest_bind(rest_drop_quarantined_sstables, ctx, ss));
}
void unset_storage_service(http_context& ctx, routes& r) {
@@ -1926,6 +1995,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::reload_raft_topology_state.unset(r);
ss::upgrade_to_raft_topology.unset(r);
ss::raft_topology_upgrade_status.unset(r);
ss::raft_topology_get_cmd_status.unset(r);
ss::move_tablet.unset(r);
ss::add_tablet_replica.unset(r);
ss::del_tablet_replica.unset(r);
@@ -1933,6 +2003,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::tablet_balancing_enable.unset(r);
ss::quiesce_topology.unset(r);
sp::get_schema_versions.unset(r);
ss::drop_quarantined_sstables.unset(r);
}
void set_load_meter(http_context& ctx, routes& r, service::load_meter& lm) {


@@ -52,10 +52,11 @@ table_id validate_table(const replica::database& db, sstring ks_name, sstring ta
// containing the description of the respective no_such_column_family error.
// Returns a vector of all table infos given by the parameter, or
// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, sstring value);
std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name = "cf");
struct scrub_info {
sstables::compaction_type_options::scrub opts;
sstring keyspace;
@@ -82,4 +83,11 @@ void set_load_meter(http_context& ctx, httpd::routes& r, service::load_meter& lm
void unset_load_meter(http_context& ctx, httpd::routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
// Converts the string value of a boolean parameter into a bool.
// Maps (case-insensitively):
// "true", "yes" and "1" to true,
// "false", "no" and "0" to false.
// Any other value throws runtime_error.
bool validate_bool_x(const sstring& param, bool default_value);
} // namespace api
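
The comment on validate_bool_x above describes the accepted spellings but the diff does not show the body. A minimal sketch of that mapping follows. The name parse_bool_param is hypothetical, and the rule that an empty parameter yields default_value is an assumption based only on the signature, not on the upstream implementation.

```cpp
#include <algorithm>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for validate_bool_x(); the real body is not part of this diff.
// Assumption: an empty parameter means "not supplied" and yields default_value.
bool parse_bool_param(const std::string& param, bool default_value) {
    if (param.empty()) {
        return default_value;
    }
    // Lower-case a copy so the comparison is case-insensitive.
    std::string lowered(param.size(), '\0');
    std::transform(param.begin(), param.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lowered == "true" || lowered == "yes" || lowered == "1") {
        return true;
    }
    if (lowered == "false" || lowered == "no" || lowered == "0") {
        return false;
    }
    // Anything else is rejected, matching the "otherwise throws runtime_error" rule.
    throw std::runtime_error("Invalid boolean parameter value '" + param + "'");
}
```

Under these assumptions a call such as parse_bool_param("Yes", false) returns true, while an unrecognized value such as "maybe" raises std::runtime_error instead of silently falling back to the default.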


@@ -31,8 +31,7 @@ using ks_cf_func = std::function<future<json::json_return_type>(http_context&, s
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
}
@@ -40,15 +39,8 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {
t::force_keyspace_compaction_async.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto params = req_params({
std::pair("keyspace", mandatory::yes),
std::pair("cf", mandatory::no),
std::pair("flush_memtables", mandatory::no),
});
params.process(*req);
auto keyspace = validate_keyspace(ctx, *params.get("keyspace"));
auto table_infos = parse_table_infos(keyspace, ctx, params.get("cf").value_or(""));
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
apilog.debug("force_keyspace_compaction_async: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
@@ -63,8 +55,7 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";


@@ -54,12 +54,12 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to
for (const auto host_id: leaving_host_ids) {
eps.insert(g.local().get_address_map().get(host_id));
}
return container_to_vec(eps);
return eps | std::views::transform([] (auto& i) { return fmt::to_string(i); }) | std::ranges::to<std::vector>();
});
ss::get_moving_nodes.set(r, [](const_req req) {
std::unordered_set<sstring> addr;
return container_to_vec(addr);
return addr | std::ranges::to<std::vector>();
});
ss::get_joining_nodes.set(r, [&tm, &g](const_req req) {
@@ -70,15 +70,21 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to
for (const auto& [token, host_id]: points) {
eps.insert(g.local().get_address_map().get(host_id));
}
return container_to_vec(eps);
return eps | std::views::transform([] (auto& i) { return fmt::to_string(i); }) | std::ranges::to<std::vector>();
});
ss::get_host_id_map.set(r, [&tm, &g](const_req req) {
std::vector<ss::mapper> res;
auto map = tm.local().get()->get_host_ids() |
std::views::transform([&g] (locator::host_id id) { return std::make_pair(g.local().get_address_map().get(id), id); }) |
std::ranges::to<std::unordered_map>();
return map_to_key_value(std::move(map), res);
if (!g.local().is_enabled()) {
throw std::runtime_error("The gossiper is not ready yet");
}
return tm.local().get()->get_host_ids()
| std::views::transform([&g] (locator::host_id id) {
ss::mapper m;
m.key = fmt::to_string(g.local().get_address_map().get(id));
m.value = fmt::to_string(id);
return m;
})
| std::ranges::to<std::vector<ss::mapper>>();
});
static auto host_or_broadcast = [&tm](const_req req) {


@@ -209,6 +209,11 @@ future<> audit::log(const audit_info* audit_info, service::query_state& query_st
static const sstring anonymous_username("anonymous");
const sstring& username = client_state.user() ? client_state.user()->name.value_or(anonymous_username) : no_username;
socket_address client_ip = client_state.get_client_address().addr();
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("Log written: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {}",
node_ip, audit_info->category_string(), cl, error, audit_info->keyspace(),
audit_info->query(), client_ip, audit_info->table(), username);
}
return futurize_invoke(std::mem_fn(&storage_helper::write), _storage_helper_ptr, audit_info, node_ip, client_ip, cl, username, error)
.handle_exception([audit_info, node_ip, client_ip, cl, username, error] (auto ep) {
logger.error("Unexpected exception when writing log with: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {} exception {}",
@@ -219,6 +224,10 @@ future<> audit::log(const audit_info* audit_info, service::query_state& query_st
future<> audit::log_login(const sstring& username, socket_address client_ip, bool error) noexcept {
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("Login log written: node_ip {}, client_ip {}, username {}, error {}",
node_ip, client_ip, username, error ? "true" : "false");
}
return futurize_invoke(std::mem_fn(&storage_helper::write_login), _storage_helper_ptr, username, node_ip, client_ip, error)
.handle_exception([username, node_ip, client_ip, error] (auto ep) {
logger.error("Unexpected exception when writing login log with: node_ip {} client_ip {} username {} error {} exception {}",


@@ -33,20 +33,6 @@ namespace audit {
namespace {
future<> syslog_send_helper(net::datagram_channel& sender,
const socket_address& address,
const sstring& msg) {
return sender.send(address, net::packet{msg.data(), msg.size()}).handle_exception([address](auto&& exception_ptr) {
auto error_msg = seastar::format(
"Syslog audit backend failed (sending a message to {} resulted in {}).",
address,
exception_ptr
);
logger.error("{}", error_msg);
throw audit_exception(std::move(error_msg));
});
}
static auto syslog_address_helper(const db::config& cfg)
{
return cfg.audit_unix_socket_path.is_set()
@@ -54,11 +40,40 @@ static auto syslog_address_helper(const db::config& cfg)
: unix_domain_addr(_PATH_LOG);
}
static std::string json_escape(std::string_view str) {
std::string result;
result.reserve(str.size() * 1.2);
for (auto c : str) {
if (c == '"' || c == '\\') {
result.push_back('\\');
}
result.push_back(c);
}
return result;
}
}
future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {
try {
auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));
co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});
}
catch (const std::exception& e) {
auto error_msg = seastar::format(
"Syslog audit backend failed (sending a message to {} resulted in {}).",
_syslog_address,
e
);
logger.error("{}", error_msg);
throw audit_exception(std::move(error_msg));
}
}
audit_syslog_storage_helper::audit_syslog_storage_helper(cql3::query_processor& qp, service::migration_manager&) :
_syslog_address(syslog_address_helper(qp.db().get_config())),
_sender(make_unbound_datagram_channel(AF_UNIX)) {
_sender(make_unbound_datagram_channel(AF_UNIX)),
_semaphore(1) {
}
audit_syslog_storage_helper::~audit_syslog_storage_helper() {
@@ -73,10 +88,10 @@ audit_syslog_storage_helper::~audit_syslog_storage_helper() {
*/
future<> audit_syslog_storage_helper::start(const db::config& cfg) {
if (this_shard_id() != 0) {
return make_ready_future();
co_return;
}
return syslog_send_helper(_sender, _syslog_address, "Initializing syslog audit backend.");
co_await syslog_send_helper("Initializing syslog audit backend.");
}
future<> audit_syslog_storage_helper::stop() {
@@ -93,7 +108,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
tm time;
localtime_r(&now, &time);
sstring msg = seastar::format("<{}>{:%h %e %T} scylla-audit: \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\"",
sstring msg = seastar::format(R"(<{}>{:%h %e %T} scylla-audit: node="{}", category="{}", cl="{}", error="{}", keyspace="{}", query="{}", client_ip="{}", table="{}", username="{}")",
LOG_NOTICE | LOG_USER,
time,
node_ip,
@@ -101,12 +116,12 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
cl,
(error ? "true" : "false"),
audit_info->keyspace(),
audit_info->query(),
json_escape(audit_info->query()),
client_ip,
audit_info->table(),
username);
return syslog_send_helper(_sender, _syslog_address, msg);
co_await syslog_send_helper(msg);
}
future<> audit_syslog_storage_helper::write_login(const sstring& username,
@@ -117,15 +132,15 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,
auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
tm time;
localtime_r(&now, &time);
sstring msg = seastar::format("<{}>{:%h %e %T} scylla-audit: \"{}\", \"AUTH\", \"\", \"\", \"\", \"\", \"{}\", \"{}\", \"{}\"",
sstring msg = seastar::format(R"(<{}>{:%h %e %T} scylla-audit: node="{}", category="AUTH", cl="", error="{}", keyspace="", query="", client_ip="{}", table="", username="{}")",
LOG_NOTICE | LOG_USER,
time,
node_ip,
(error ? "true" : "false"),
client_ip,
username,
(error ? "true" : "false"));
username);
co_await syslog_send_helper(_sender, _syslog_address, msg.c_str());
co_await syslog_send_helper(msg.c_str());
}
using registry = class_registrator<storage_helper, audit_syslog_storage_helper, cql3::query_processor&, service::migration_manager&>;
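
The syslog hunks above switch the audit message to key="value" fields and pass the query text through json_escape. A small standalone sketch of that escaping rule and of building one field is shown here. The names escape_quotes and audit_field are hypothetical; as in the diff, only the double quote and the backslash are escaped, so control characters pass through unchanged.

```cpp
#include <string>
#include <string_view>

// Same rule as json_escape() in the hunk above: prefix '"' and '\' with a backslash.
std::string escape_quotes(std::string_view str) {
    std::string result;
    result.reserve(str.size() + str.size() / 5);
    for (char c : str) {
        if (c == '"' || c == '\\') {
            result.push_back('\\');
        }
        result.push_back(c);
    }
    return result;
}

// Hypothetical helper: renders a single key="value" field with the value escaped.
std::string audit_field(std::string_view key, std::string_view value) {
    std::string out(key);
    out += "=\"";
    out += escape_quotes(value);
    out += '"';
    return out;
}
```

Escaping matters here because a query containing a double quote would otherwise terminate the quoted field early and make the line ambiguous for a log parser.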


@@ -24,6 +24,9 @@ namespace audit {
class audit_syslog_storage_helper : public storage_helper {
socket_address _syslog_address;
net::datagram_channel _sender;
seastar::semaphore _semaphore;
future<> syslog_send_helper(const sstring& msg);
public:
explicit audit_syslog_storage_helper(cql3::query_processor&, service::migration_manager&);
virtual ~audit_syslog_storage_helper();


@@ -9,6 +9,7 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/alien_worker.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -21,6 +22,7 @@ static const class_registrator<
allow_all_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
::service::migration_manager&,
utils::alien_worker&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}


@@ -13,6 +13,7 @@
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/common.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -28,7 +29,7 @@ extern const std::string_view allow_all_authenticator_name;
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&) {
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&) {
}
virtual future<> start() override {


@@ -33,13 +33,14 @@ static const class_registrator<auth::authenticator
, auth::certificate_authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&> cert_auth_reg(CERT_AUTH_NAME);
, ::service::migration_manager&
, utils::alien_worker&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)
: _queries([&] {
auto& conf = qp.db().get_config();
auto queries = conf.auth_certificate_role_queries();


@@ -10,6 +10,7 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
#include <boost/regex_fwd.hpp> // IWYU pragma: keep
namespace cql3 {
@@ -31,7 +32,7 @@ class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;
public:
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
~certificate_authenticator();
future<> start() override;


@@ -119,9 +119,14 @@ future<> create_legacy_metadata_table_if_missing(
return qs;
}
::service::raft_timeout get_raft_timeout() noexcept {
auto dur = internal_distributed_query_state().get_client_state().get_timeout_config().other_timeout;
return ::service::raft_timeout{.value = lowres_clock::now() + dur};
}
static future<> announce_mutations_with_guard(
::service::raft_group0_client& group0_client,
std::vector<canonical_mutation> muts,
utils::chunked_vector<canonical_mutation> muts,
::service::group0_guard group0_guard,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
@@ -149,7 +154,7 @@ future<> announce_mutations_with_batching(
});
size_t memory_usage = 0;
std::vector<canonical_mutation> muts;
utils::chunked_vector<canonical_mutation> muts;
// guard has to be taken before we execute code in gen as
// it can do read-before-write and we want announce_mutations
@@ -199,7 +204,7 @@ future<> announce_mutations(
internal_distributed_query_state(),
timestamp,
std::move(values));
std::vector<canonical_mutation> cmuts = {muts.begin(), muts.end()};
utils::chunked_vector<canonical_mutation> cmuts = {muts.begin(), muts.end()};
co_await announce_mutations_with_guard(group0_client, std::move(cmuts), std::move(group0_guard), as, timeout);
}


@@ -17,6 +17,7 @@
#include "types/types.hh"
#include "service/raft/raft_group0_client.hh"
#include "timeout_config.hh"
using namespace std::chrono_literals;
@@ -77,6 +78,8 @@ future<> create_legacy_metadata_table_if_missing(
///
::service::query_state& internal_distributed_query_state() noexcept;
::service::raft_timeout get_raft_timeout() noexcept;
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
// Use this function when you need to perform a read before write under a single guard, or
// when you have more than one mutation and may exceed the single-command size limit.


@@ -338,8 +338,7 @@ future<std::vector<cql3::description>> ldap_role_manager::describe_role_grants()
}
future<> ldap_role_manager::ensure_superuser_is_created() {
// ldap is responsible for users
co_return;
return _std_mgr.ensure_superuser_is_created();
}
} // namespace auth


@@ -48,14 +48,14 @@ static const class_registrator<
password_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
::service::migration_manager&,
utils::alien_worker&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
static std::string_view get_config_value(std::string_view value, std::string_view def) {
return value.empty() ? def : value;
}
std::string password_authenticator::default_superuser(const db::config& cfg) {
return std::string(get_config_value(cfg.auth_superuser_name(), DEFAULT_USER_NAME));
}
@@ -63,12 +63,13 @@ std::string password_authenticator::default_superuser(const db::config& cfg) {
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)
: _qp(qp)
, _group0_client(g0)
, _migration_manager(mm)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
, _hashing_worker(hashing_worker)
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -117,33 +118,95 @@ future<> password_authenticator::migrate_legacy_metadata() const {
});
}
future<> password_authenticator::create_default_if_missing() {
future<> password_authenticator::legacy_create_default_if_missing() {
SCYLLA_ASSERT(legacy_mode(_qp));
const auto exists = co_await default_role_row_satisfies(_qp, &has_salted_hash, _superuser);
if (exists) {
co_return;
}
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt);
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto query = update_row_query();
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{salted_pwd, _superuser},
cql3::query_processor::cache_internal::no);
plogger.info("Created default superuser authentication record.");
} else {
co_await announce_mutations(_qp, _group0_client, query,
{salted_pwd, _superuser}, _as, ::service::raft_timeout{});
plogger.info("Created default superuser authentication record.");
plogger.info("Created default superuser authentication record.");
}
future<> password_authenticator::maybe_create_default_password() {
auto needs_password = [this] () -> future<bool> {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);
auto results = co_await _qp.execute_internal(query,
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
// Don't add default password if
// - there is no default superuser
// - there is a superuser with a password.
bool has_default = false;
bool has_superuser_with_password = false;
for (auto& result : *results) {
if (result.get_as<sstring>(meta::roles_table::role_col_name) == _superuser) {
has_default = true;
}
if (has_salted_hash(result)) {
has_superuser_with_password = true;
}
}
co_return has_default && !has_superuser_with_password;
};
if (!co_await needs_password()) {
co_return;
}
// We don't want to start operation earlier to avoid quorum requirement in
// a common case.
::service::group0_batch batch(
co_await _group0_client.start_operation(_as, get_raft_timeout()));
// Check again as the state may have changed before we took the guard (batch).
if (!co_await needs_password()) {
co_return;
}
// Set default superuser's password.
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto update_query = update_row_query();
co_await collect_mutations(_qp, batch, update_query, {salted_pwd, _superuser});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
plogger.info("Created default superuser authentication record.");
}
future<> password_authenticator::maybe_create_default_password_with_retries() {
size_t retries = _migration_manager.get_concurrent_ddl_retries();
while (true) {
try {
co_return co_await maybe_create_default_password();
} catch (const ::service::group0_concurrent_modification& ex) {
plogger.warn("Failed to execute maybe_create_default_password due to guard conflict.{}.", retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
// Log error but don't crash the whole node startup sequence.
plogger.error("Failed to create default superuser password due to guard conflict.");
co_return;
} catch (const ::service::raft_operation_timeout_error& ex) {
plogger.error("Failed to create default superuser password due to exception: {}", ex.what());
co_return;
}
}
}
future<> password_authenticator::start() {
return once_among_shards([this] {
// Verify that at least one hashing scheme is supported.
passwords::detail::verify_scheme(_scheme);
plogger.info("Using password hashing scheme: {}", passwords::detail::prefix_for_scheme(_scheme));
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
if (legacy_mode(_qp)) {
@@ -164,11 +227,14 @@ future<> password_authenticator::start() {
migrate_legacy_metadata().get();
return;
}
legacy_create_default_if_missing().get();
}
utils::get_local_injector().inject("password_authenticator_start_pause", utils::wait_for_message(5min)).get();
create_default_if_missing().get();
if (!legacy_mode(_qp)) {
_superuser_created_promise.set_value();
maybe_create_default_password_with_retries().get();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
});
});
@@ -228,7 +294,13 @@ future<authenticated_user> password_authenticator::authenticate(
try {
const std::optional<sstring> salted_hash = co_await get_password_hash(username);
if (!salted_hash || !passwords::check(password, *salted_hash)) {
if (!salted_hash) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
const bool password_match = co_await _hashing_worker.submit<bool>([password = std::move(password), salted_hash = std::move(salted_hash)]{
return passwords::check(password, *salted_hash);
});
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
co_return username;
@@ -252,7 +324,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
auto maybe_hash = options.credentials.transform([&] (const auto& creds) -> sstring {
return std::visit(make_visitor(
[&] (const password_option& opt) {
return passwords::hash(opt.password, rng_for_salt);
return passwords::hash(opt.password, rng_for_salt, _scheme);
},
[] (const hashed_password_option& opt) {
return opt.hashed_password;
@@ -295,11 +367,11 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
query,
consistency_for_user(role_name),
internal_distributed_query_state(),
{passwords::hash(password, rng_for_salt), sstring(role_name)},
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{passwords::hash(password, rng_for_salt), sstring(role_name)});
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)});
}
}
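
maybe_create_default_password_with_retries() in the hunk above retries the operation when the group0 guard reports a concurrent modification, and gives up quietly once the retry budget is spent so that node startup is not aborted. The control flow can be sketched without Seastar as below. The exception type concurrent_modification and the function run_with_retries are hypothetical stand-ins, and the synchronous std::function replaces the coroutine used upstream.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for ::service::group0_concurrent_modification.
struct concurrent_modification : std::runtime_error {
    concurrent_modification() : std::runtime_error("guard conflict") {}
};

// Runs op once, then retries it up to max_retries more times on conflict.
// Returns true if op eventually succeeded, false if the budget was exhausted.
bool run_with_retries(const std::function<void()>& op, std::size_t max_retries) {
    std::size_t retries = max_retries;
    while (true) {
        try {
            op();
            return true;
        } catch (const concurrent_modification&) {
            // Post-decrement: the test sees the remaining budget before it shrinks.
            if (retries--) {
                continue;
            }
            // Budget exhausted: report failure instead of propagating the error.
            return false;
        }
    }
}
```

With a budget of N retries the operation runs at most N + 1 times. Any other exception type still propagates to the caller, which mirrors how the hunk handles only the conflict and timeout errors explicitly.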


@@ -15,7 +15,9 @@
#include "db/consistency_level_type.hh"
#include "auth/authenticator.hh"
#include "auth/passwords.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
namespace db {
class config;
@@ -41,14 +43,17 @@ class password_authenticator : public authenticator {
::service::migration_manager& _migration_manager;
future<> _stopped;
abort_source _as;
std::string _superuser;
std::string _superuser; // default superuser name from the config (may or may not be present in roles table)
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
utils::alien_worker& _hashing_worker;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
~password_authenticator();
@@ -89,7 +94,10 @@ private:
future<> migrate_legacy_metadata() const;
future<> create_default_if_missing();
future<> legacy_create_default_if_missing();
future<> maybe_create_default_password();
future<> maybe_create_default_password_with_retries();
sstring update_row_query() const;
};


@@ -21,18 +21,14 @@ static thread_local crypt_data tlcrypt = {};
namespace detail {
scheme identify_best_supported_scheme() {
const auto all_schemes = { scheme::bcrypt_y, scheme::bcrypt_a, scheme::sha_512, scheme::sha_256, scheme::md5 };
// "Random", for testing schemes.
void verify_scheme(scheme scheme) {
const sstring random_part_of_salt = "aaaabbbbccccdddd";
for (scheme c : all_schemes) {
const sstring salt = sstring(prefix_for_scheme(c)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
const sstring salt = sstring(prefix_for_scheme(scheme)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return c;
}
if (e && (e[0] != '*')) {
return;
}
throw no_supported_schemes();


@@ -21,10 +21,11 @@ class no_supported_schemes : public std::runtime_error {
public:
no_supported_schemes();
};
///
/// Apache Cassandra uses a library to provide the bcrypt scheme. Many Linux implementations do not support bcrypt, so
/// we support alternatives. The cost is loss of direct compatibility with Apache Cassandra system tables.
/// Apache Cassandra uses a library to provide the bcrypt scheme. In ScyllaDB, we use SHA-512
/// instead of bcrypt for performance and for historical reasons (see scylladb#24524).
/// Currently, SHA-512 is always chosen as the hashing scheme for new passwords, but the other
/// algorithms remain supported for CREATE ROLE WITH HASHED PASSWORD and backward compatibility.
///
enum class scheme {
bcrypt_y,
@@ -51,11 +52,11 @@ sstring generate_random_salt_bytes(RandomNumberEngine& g) {
}
///
/// Test each allowed hashing scheme and report the best supported one on the current system.
/// Test given hashing scheme on the current system.
///
/// \throws \ref no_supported_schemes when none of the known schemes is supported.
/// \throws \ref no_supported_schemes when scheme is unsupported.
///
scheme identify_best_supported_scheme();
void verify_scheme(scheme scheme);
std::string_view prefix_for_scheme(scheme) noexcept;
@@ -67,8 +68,7 @@ std::string_view prefix_for_scheme(scheme) noexcept;
/// \throws \ref no_supported_schemes when no known hashing schemes are supported on the system.
///
template <typename RandomNumberEngine>
sstring generate_salt(RandomNumberEngine& g) {
static const scheme scheme = identify_best_supported_scheme();
sstring generate_salt(RandomNumberEngine& g, scheme scheme) {
static const sstring prefix = sstring(prefix_for_scheme(scheme));
return prefix + generate_random_salt_bytes(g);
}
@@ -93,8 +93,8 @@ sstring hash_with_salt(const sstring& pass, const sstring& salt);
/// \throws \ref std::system_error when the implementation-specific implementation fails to hash the cleartext.
///
template <typename RandomNumberEngine>
sstring hash(const sstring& pass, RandomNumberEngine& g) {
return detail::hash_with_salt(pass, detail::generate_salt(g));
sstring hash(const sstring& pass, RandomNumberEngine& g, scheme scheme) {
return detail::hash_with_salt(pass, detail::generate_salt(g, scheme));
}
///
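
The header changes above make the hashing scheme an explicit argument of generate_salt() and hash() instead of probing for the best supported one. A rough standalone sketch of scheme-dependent salt construction follows. The names prefix_for and make_salt, the 16-character salt length, and the salt alphabet are assumptions for illustration; the prefixes are the usual crypt(3) identifiers for each algorithm.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <string_view>

enum class scheme { bcrypt_y, bcrypt_a, sha_512, sha_256, md5 };

// Conventional crypt(3) setting prefixes for each scheme.
constexpr std::string_view prefix_for(scheme s) noexcept {
    switch (s) {
    case scheme::bcrypt_y: return "$2y$";
    case scheme::bcrypt_a: return "$2a$";
    case scheme::sha_512:  return "$6$";
    case scheme::sha_256:  return "$5$";
    case scheme::md5:      return "$1$";
    }
    return "";
}

// Hypothetical salt builder: scheme prefix followed by 16 random characters.
// The prefix is looked up on every call, so different schemes can be mixed freely.
template <typename RandomNumberEngine>
std::string make_salt(RandomNumberEngine& g, scheme s) {
    static constexpr std::string_view alphabet = "0123456789abcdefghijklmnopqrstuvwxyz";
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string salt(prefix_for(s));
    for (int i = 0; i < 16; ++i) {
        salt.push_back(alphabet[pick(g)]);
    }
    return salt;
}
```

One design point worth noting: generate_salt() in the hunk keeps the prefix in a function-local static, which is initialized on the first call of each template instantiation. That is harmless as long as every caller passes the same scheme, as the password authenticator in this diff does, whereas the sketch avoids the question by not caching.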


@@ -193,9 +193,7 @@ service_level_resource_view::service_level_resource_view(const resource &r) {
sstring encode_signature(std::string_view name, std::vector<data_type> args) {
return seastar::format("{}[{}]", name,
fmt::join(args | std::views::transform([] (const data_type t) {
return t->name();
}), "^"));
fmt::join(args | std::views::transform(&abstract_type::name), "^"));
}
std::pair<sstring, std::vector<data_type>> decode_signature(std::string_view encoded_signature) {
@@ -221,9 +219,7 @@ std::pair<sstring, std::vector<data_type>> decode_signature(std::string_view enc
static sstring decoded_signature_string(std::string_view encoded_signature) {
auto [function_name, arg_types] = decode_signature(encoded_signature);
return seastar::format("{}({})", cql3::util::maybe_quote(sstring(function_name)),
fmt::join(arg_types | std::views::transform([] (data_type t) {
return t->cql3_type_name();
}), ", "));
fmt::join(arg_types | std::views::transform(&abstract_type::cql3_type_name), ", "));
}
resource make_functions_resource(const cql3::functions::function& f) {


@@ -34,9 +34,10 @@ static const class_registrator<
saslauthd_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
::service::migration_manager&,
utils::alien_worker&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)
: _socket_path(qp.db().get_config().saslauthd_socket_path())
{}


@@ -11,6 +11,7 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -28,7 +29,7 @@ namespace auth {
class saslauthd_authenticator : public authenticator {
sstring _socket_path; ///< Path to the domain socket on which saslauthd is listening.
public:
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
future<> start() override;


@@ -47,6 +47,7 @@
#include "data_dictionary/keyspace_metadata.hh"
#include "service/storage_service.hh"
#include "service_permit.hh"
#include "utils/managed_string.hh"
using namespace std::chrono_literals;
@@ -83,7 +84,6 @@ private:
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_update_tablet_metadata(const locator::tablet_metadata_change_hint&) override {}
void on_drop_keyspace(const sstring& ks_name) override {
if (!legacy_mode(_qp)) {
@@ -187,14 +187,15 @@ service::service(
::service::migration_notifier& mn,
::service::migration_manager& mm,
const service_config& sc,
maintenance_socket_enabled used_by_maintenance_socket)
maintenance_socket_enabled used_by_maintenance_socket,
utils::alien_worker& hashing_worker)
: service(
std::move(c),
qp,
g0,
mn,
create_object<authorizer>(sc.authorizer_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, hashing_worker),
create_object<role_manager>(sc.role_manager_java_name, qp, g0, mm),
used_by_maintenance_socket) {
}
@@ -240,6 +241,13 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
});
}
co_await _role_manager->start();
if (this_shard_id() == 0) {
// Role manager and password authenticator have this odd startup
// mechanism where they asynchronously create the superuser role
// in the background. Correct password creation depends on role
// creation therefore we need to wait here.
co_await _role_manager->ensure_superuser_is_created();
}
co_await when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
_permissions_cache = std::make_unique<permissions_cache>(_loading_cache_config, *this, log);
co_await once_among_shards([this] {
@@ -468,12 +476,14 @@ future<std::vector<cql3::description>> service::describe_roles(bool with_hashed_
const bool can_login = co_await _role_manager->can_login(role);
const bool is_superuser = co_await _role_manager->is_superuser(role);
sstring create_statement = produce_create_statement(formatted_role_name, maybe_hashed_password, can_login, is_superuser);
result.push_back(cql3::description {
// Roles do not belong to any keyspace.
.keyspace = std::nullopt,
.type = "role",
.name = role,
.create_statement = produce_create_statement(formatted_role_name, maybe_hashed_password, can_login, is_superuser)
.create_statement = managed_string(create_statement)
});
}
@@ -614,19 +624,21 @@ future<std::vector<cql3::description>> service::describe_permissions() const {
for (const auto& permissions : permission_list) {
for (const auto& permission : permissions.permissions) {
sstring create_statement = describe_resource_kind(permission, permissions.resource, permissions.role_name);
result.push_back(cql3::description {
// Permission grants do not belong to any keyspace.
.keyspace = std::nullopt,
.type = "grant_permission",
.name = permissions.role_name,
.create_statement = describe_resource_kind(permission, permissions.resource, permissions.role_name)
.create_statement = managed_string(create_statement)
});
}
co_await coroutine::maybe_yield();
}
std::ranges::sort(result, std::less<>{}, [] (const cql3::description& desc) noexcept {
std::ranges::sort(result, std::less<>{}, [] (const cql3::description& desc) {
return std::make_tuple(std::ref(desc.name), std::ref(*desc.create_statement));
});
@@ -885,7 +897,7 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
for (const auto& col : schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(
col.type->deserialize(row.get_blob(col.name_as_text())));
col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
}


@@ -26,6 +26,7 @@
#include "cql3/description.hh"
#include "seastarx.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
#include "utils/observable.hh"
#include "utils/serialized_action.hh"
#include "service/maintenance_mode.hh"
@@ -126,7 +127,8 @@ public:
::service::migration_notifier&,
::service::migration_manager&,
const service_config&,
maintenance_socket_enabled);
maintenance_socket_enabled,
utils::alien_worker&);
future<> start(::service::migration_manager&, db::system_keyspace&);


@@ -9,6 +9,7 @@
#include "auth/standard_role_manager.hh"
#include <optional>
#include <stdexcept>
#include <unordered_set>
#include <vector>
@@ -28,6 +29,7 @@
#include "cql3/util.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "utils/error_injection.hh"
#include "utils/log.hh"
#include <seastar/core/loop.hh>
#include <seastar/coroutine/maybe_yield.hh>
@@ -35,6 +37,7 @@
#include "utils/class_registrator.hh"
#include "service/migration_manager.hh"
#include "password_authenticator.hh"
#include "utils/managed_string.hh"
namespace auth {
@@ -126,7 +129,7 @@ static future<record> require_record(cql3::query_processor& qp, std::string_view
}
static bool has_can_login(const cql3::untyped_result_set_row& row) {
return row.has("can_login") && !(boolean_type->deserialize(row.get_blob("can_login")).is_null());
return row.has("can_login") && !(boolean_type->deserialize(row.get_blob_unfragmented("can_login")).is_null());
}
standard_role_manager::standard_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
@@ -178,7 +181,8 @@ future<> standard_role_manager::create_legacy_metadata_tables_if_missing() const
_migration_manager)).discard_result();
}
future<> standard_role_manager::create_default_role_if_missing() {
future<> standard_role_manager::legacy_create_default_role_if_missing() {
SCYLLA_ASSERT(legacy_mode(_qp));
try {
const auto exists = co_await default_role_row_satisfies(_qp, &has_can_login, _superuser);
if (exists) {
@@ -188,16 +192,12 @@ future<> standard_role_manager::create_default_role_if_missing() {
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, _as, ::service::raft_timeout{});
}
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
@@ -205,6 +205,60 @@ future<> standard_role_manager::create_default_role_if_missing() {
}
}
future<> standard_role_manager::maybe_create_default_role() {
auto has_superuser = [this] () -> future<bool> {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);
auto results = co_await _qp.execute_internal(query, db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
for (const auto& result : *results) {
if (has_can_login(result)) {
co_return true;
}
}
co_return false;
};
if (co_await has_superuser()) {
co_return;
}
// We don't want to start operation earlier to avoid quorum requirement in
// a common case.
::service::group0_batch batch(
co_await _group0_client.start_operation(_as, get_raft_timeout()));
// Check again as the state may have changed before we took the guard (batch).
if (co_await has_superuser()) {
co_return;
}
// There is no superuser which has can_login field - create default role.
// Note that we don't check if can_login is set to true.
const sstring insert_query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
co_await collect_mutations(_qp, batch, insert_query, {_superuser});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
log.info("Created default superuser role '{}'.", _superuser);
}
future<> standard_role_manager::maybe_create_default_role_with_retries() {
size_t retries = _migration_manager.get_concurrent_ddl_retries();
while (true) {
try {
co_return co_await maybe_create_default_role();
} catch (const ::service::group0_concurrent_modification& ex) {
log.warn("Failed to execute maybe_create_default_role due to guard conflict.{}.", retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
// Log error but don't crash the whole node startup sequence.
log.error("Failed to create default superuser role due to guard conflict.");
co_return;
} catch (const ::service::raft_operation_timeout_error& ex) {
log.error("Failed to create default superuser role due to exception: {}", ex.what());
co_return;
}
}
}
static const sstring legacy_table_name{"users"};
bool standard_role_manager::legacy_metadata_exists() {
@@ -266,10 +320,13 @@ future<> standard_role_manager::start() {
co_await migrate_legacy_metadata();
co_return;
}
co_await legacy_create_default_role_if_missing();
}
co_await create_default_role_if_missing();
if (!legacy) {
_superuser_created_promise.set_value();
co_await maybe_create_default_role_with_retries();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
};
@@ -619,6 +676,12 @@ future<role_set> standard_role_manager::query_all() {
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
if (utils::get_local_injector().enter("standard_role_manager_fail_legacy_query")) {
if (legacy_mode(_qp)) {
throw std::runtime_error("standard_role_manager::query_all: failed due to error injection");
}
}
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
@@ -722,18 +785,20 @@ future<std::vector<cql3::description>> standard_role_manager::describe_role_gran
const auto formatted_grantee = cql3::util::maybe_quote(grantee_role);
const auto formatted_granted = cql3::util::maybe_quote(granted_role);
sstring create_statement = seastar::format("GRANT {} TO {};", formatted_granted, formatted_grantee);
result.push_back(cql3::description {
// Role grants do not belong to any keyspace.
.keyspace = std::nullopt,
.type = "grant_role",
.name = granted_role,
.create_statement = seastar::format("GRANT {} TO {};", formatted_granted, formatted_grantee)
.create_statement = managed_string(create_statement)
});
co_await coroutine::maybe_yield();
}
std::ranges::sort(result, std::less<>{}, [] (const cql3::description& desc) noexcept {
std::ranges::sort(result, std::less<>{}, [] (const cql3::description& desc) {
return std::make_tuple(std::ref(desc.name), std::ref(*desc.create_statement));
});
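Note on `maybe_create_default_role_with_retries()` above: the loop retries a fixed number of times on a concurrent-modification error and then gives up without failing startup. Below is a minimal synchronous sketch of that control flow, not the coroutine-based code in the diff. `conflict_error` and `run_with_retries` are hypothetical names standing in for `group0_concurrent_modification` and the real function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for a "concurrent modification" error type.
struct conflict_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs `op`; on conflict_error it retries at most `retries` additional times.
// Returns true if some attempt succeeded, false once the retries are used up.
// Any other exception type propagates to the caller unchanged.
bool run_with_retries(const std::function<void()>& op, std::size_t retries) {
    while (true) {
        try {
            op();
            return true;
        } catch (const conflict_error&) {
            if (retries--) {
                continue; // budget left: try again
            }
            return false; // give up, but do not rethrow
        }
    }
}
```

With `retries == N` the operation is attempted at most `N + 1` times, which matches the `if (retries--)` test used in the diff.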


@@ -95,7 +95,10 @@ private:
future<> migrate_legacy_metadata();
future<> create_default_role_if_missing();
future<> legacy_create_default_role_if_missing();
future<> maybe_create_default_role();
future<> maybe_create_default_role_with_retries();
future<> create_or_replace(std::string_view role_name, const role_config&, ::service::group0_batch&);


@@ -37,8 +37,8 @@ class transitional_authenticator : public authenticator {
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm)) {
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, hashing_worker)) {
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
@@ -239,7 +239,8 @@ static const class_registrator<
auth::transitional_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
::service::migration_manager&,
utils::alien_worker&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,


@@ -35,8 +35,9 @@ inline bytes_view to_bytes_view(std::string_view view) {
}
struct fmt_hex {
const bytes_view& v;
fmt_hex(const bytes_view& v) noexcept : v(v) {}
std::span<const std::byte> v;
fmt_hex(const bytes_view& v) noexcept : v(std::as_bytes(std::span(v))) {}
fmt_hex(std::span<const std::byte> v) noexcept : v(v) {}
};
bytes from_hex(std::string_view s);


@@ -139,7 +139,7 @@ private:
// size must not be zero.
[[gnu::always_inline]]
value_type* alloc(size_type size) {
if (__builtin_expect(size <= current_space_left(), true)) {
if (size <= current_space_left()) [[likely]] {
auto ret = _current->data + _current->frag_size;
_current->frag_size += size;
_size += size;
@@ -249,7 +249,7 @@ public:
}
auto this_size = std::min(v.size(), size_t(current_space_left()));
if (__builtin_expect(this_size, true)) {
if (this_size) [[likely]] {
memcpy(_current->data + _current->frag_size, v.begin(), this_size);
_current->frag_size += this_size;
_size += this_size;
@@ -268,6 +268,14 @@ public:
write(bytes_view(reinterpret_cast<const signed char*>(ptr), size));
}
// Writes the fragmented view
template<FragmentedView View>
void write(View v) {
for (bytes_view f : fragment_range(v)) {
write(f);
}
}
bool is_linearized() const {
return !_begin || !_begin->next;
}
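Note on the `__builtin_expect` to `[[likely]]` replacements in the hunks above: the C++20 attribute gives the same branch hint in standard syntax. Below is a minimal sketch of the fast-path allocation shape; `bump_buffer` and `try_alloc` are hypothetical and far simpler than the real fragmented buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size buffer with a bump pointer.
struct bump_buffer {
    char data[64];
    std::size_t used = 0;
    std::size_t space_left() const { return sizeof(data) - used; }
};

// Returns a pointer to `size` bytes inside `buf`, or nullptr if they do not fit.
// The in-place branch is marked [[likely]] because it is the expected fast path.
char* try_alloc(bump_buffer& buf, std::size_t size) {
    if (size <= buf.space_left()) [[likely]] {
        char* ret = buf.data + buf.used;
        buf.used += size;
        return ret;
    }
    return nullptr; // slow path: the caller would have to allocate elsewhere
}
```

The attribute is only a hint to the optimizer; it does not change which branch is taken.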


@@ -23,6 +23,10 @@ class cdc_extension : public schema_extension {
public:
static constexpr auto NAME = "cdc";
// cdc_extension was written before schema_extension was deprecated, so support it
// without warnings
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wdeprecated-declarations"
cdc_extension() = default;
cdc_extension(const options& opts) : _cdc_options(opts) {}
explicit cdc_extension(std::map<sstring, sstring> tags) : _cdc_options(std::move(tags)) {}
@@ -30,6 +34,7 @@ public:
explicit cdc_extension(const sstring& s) {
throw std::logic_error("Cannot create cdc info from string");
}
#pragma clang diagnostic pop
bytes serialize() const override {
return ser::serialize_to_buffer<bytes>(_cdc_options.to_map());
}


@@ -12,7 +12,7 @@
#include "sstables/key.hh"
#include "utils/class_registrator.hh"
#include "cdc/generation.hh"
#include "keys.hh"
#include "keys/keys.hh"
namespace cdc {


@@ -16,7 +16,7 @@
#include "gms/endpoint_state.hh"
#include "gms/versioned_value.hh"
#include "keys.hh"
#include "keys/keys.hh"
#include "replica/database.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
@@ -39,12 +39,12 @@
extern logging::logger cdc_log;
static int get_shard_count(const gms::inet_address& endpoint, const gms::gossiper& g) {
static int get_shard_count(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::SHARD_COUNT);
return ep_state ? std::stoi(ep_state->value()) : -1;
}
static unsigned get_sharding_ignore_msb(const gms::inet_address& endpoint, const gms::gossiper& g) {
static unsigned get_sharding_ignore_msb(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::IGNORE_MSB_BITS);
return ep_state ? std::stoi(ep_state->value()) : 0;
}
@@ -198,7 +198,7 @@ static std::vector<stream_id> create_stream_ids(
}
bool should_propose_first_generation(const locator::host_id& my_host_id, const gms::gossiper& g) {
return g.for_each_endpoint_state_until([&] (const gms::inet_address&, const gms::endpoint_state& eps) {
return g.for_each_endpoint_state_until([&] (const gms::endpoint_state& eps) {
return stop_iteration(my_host_id < eps.get_host_id());
}) == stop_iteration::no;
}
@@ -365,6 +365,9 @@ cdc::topology_description make_new_generation_description(
const noncopyable_function<std::pair<size_t, uint8_t>(dht::token)>& get_sharding_info,
const locator::token_metadata_ptr tmptr) {
const auto tokens = get_tokens(bootstrap_tokens, tmptr);
if (tokens.empty()) {
on_internal_error(cdc_log, "Attempted to create a CDC generation from an empty list of tokens");
}
utils::chunked_vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
@@ -402,9 +405,8 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const
throw std::runtime_error(
format("Can't find endpoint for token {}", end));
}
const auto ep = _gossiper.get_address_map().get(*endpoint);
auto sc = get_shard_count(ep, _gossiper);
return {sc > 0 ? sc : 1, get_sharding_ignore_msb(ep, _gossiper)};
auto sc = get_shard_count(*endpoint, _gossiper);
return {sc > 0 ? sc : 1, get_sharding_ignore_msb(*endpoint, _gossiper)};
}
};
@@ -463,7 +465,7 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const
* but if the cluster already supports CDC, then every newly joining node will propose a new CDC generation,
* which means it will gossip the generation's timestamp.
*/
static std::optional<cdc::generation_id> get_generation_id_for(const gms::inet_address& endpoint, const gms::endpoint_state& eps) {
static std::optional<cdc::generation_id> get_generation_id_for(const locator::host_id& endpoint, const gms::endpoint_state& eps) {
const auto* gen_id_ptr = eps.get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);
if (!gen_id_ptr) {
return std::nullopt;
@@ -841,18 +843,18 @@ future<> generation_service::leave_ring() {
co_await _gossiper.unregister_(shared_from_this());
}
future<> generation_service::on_join(gms::inet_address ep, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {
return on_change(ep, ep_state->get_application_state_map(), pid);
future<> generation_service::on_join(gms::inet_address ep, locator::host_id id, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {
return on_change(ep, id, ep_state->get_application_state_map(), pid);
}
future<> generation_service::on_change(gms::inet_address ep, const gms::application_state_map& states, gms::permit_id pid) {
future<> generation_service::on_change(gms::inet_address ep, locator::host_id id, const gms::application_state_map& states, gms::permit_id pid) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (_raft_topology_change_enabled()) {
return make_ready_future<>();
}
return on_application_state_change(ep, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, const gms::versioned_value& v, gms::permit_id) {
return on_application_state_change(ep, id, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, locator::host_id id, const gms::versioned_value& v, gms::permit_id) {
auto gen_id = gms::versioned_value::cdc_generation_id_from_string(v.value());
cdc_log.debug("Endpoint: {}, CDC generation ID change: {}", ep, gen_id);
@@ -867,7 +869,8 @@ future<> generation_service::check_and_repair_cdc_streams() {
}
std::optional<cdc::generation_id> latest = _gen_id;
_gossiper.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& state) {
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& state) {
auto addr = state.get_host_id();
if (_gossiper.is_left(addr)) {
cdc_log.info("check_and_repair_cdc_streams ignored node {} because it is in LEFT state", addr);
return;
@@ -1066,8 +1069,8 @@ future<> generation_service::legacy_scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<cdc::generation_id> latest;
_gossiper.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state& eps) {
auto gen_id = get_generation_id_for(node, eps);
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {
auto gen_id = get_generation_id_for(eps.get_host_id(), eps);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}


@@ -110,13 +110,8 @@ public:
return _cdc_metadata;
}
virtual future<> on_alive(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_dead(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_remove(gms::inet_address, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_restart(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_join(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override;
virtual future<> on_change(gms::inet_address, const gms::application_state_map&, gms::permit_id) override;
virtual future<> on_join(gms::inet_address, locator::host_id id, gms::endpoint_state_ptr, gms::permit_id) override;
virtual future<> on_change(gms::inet_address, locator::host_id id, const gms::application_state_map&, gms::permit_id) override;
future<> check_and_repair_cdc_streams();


@@ -158,7 +158,7 @@ public:
});
}
void on_before_create_column_family(const keyspace_metadata& ksm, const schema& schema, std::vector<mutation>& mutations, api::timestamp_type timestamp) override {
void on_before_create_column_family(const keyspace_metadata& ksm, const schema& schema, utils::chunked_vector<mutation>& mutations, api::timestamp_type timestamp) override {
if (schema.cdc_options().enabled()) {
auto& db = _ctxt._proxy.get_db().local();
auto logname = log_name(schema.cf_name());
@@ -175,7 +175,7 @@ public:
}
}
void on_before_update_column_family(const schema& new_schema, const schema& old_schema, std::vector<mutation>& mutations, api::timestamp_type timestamp) override {
void on_before_update_column_family(const schema& new_schema, const schema& old_schema, utils::chunked_vector<mutation>& mutations, api::timestamp_type timestamp) override {
bool is_cdc = new_schema.cdc_options().enabled();
bool was_cdc = old_schema.cdc_options().enabled();
@@ -216,7 +216,7 @@ public:
}
}
void on_before_drop_column_family(const schema& schema, std::vector<mutation>& mutations, api::timestamp_type timestamp) override {
void on_before_drop_column_family(const schema& schema, utils::chunked_vector<mutation>& mutations, api::timestamp_type timestamp) override {
auto logname = log_name(schema.cf_name());
auto& db = _ctxt._proxy.get_db().local();
auto has_cdc_log = db.has_schema(schema.ks_name(), logname);
@@ -231,15 +231,15 @@ public:
}
}
future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>> augment_mutation_call(
future<std::tuple<utils::chunked_vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>> augment_mutation_call(
lowres_clock::time_point timeout,
std::vector<mutation>&& mutations,
utils::chunked_vector<mutation>&& mutations,
tracing::trace_state_ptr tr_state,
db::consistency_level write_cl
);
template<typename Iter>
future<> append_mutations(Iter i, Iter e, schema_ptr s, lowres_clock::time_point, std::vector<mutation>&);
future<> append_mutations(Iter i, Iter e, schema_ptr s, lowres_clock::time_point, utils::chunked_vector<mutation>&);
private:
static void check_for_attempt_to_create_nested_cdc_log(replica::database& db, const schema& schema) {
@@ -960,8 +960,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given value for the given row.
void set_value(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes_view& value) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
}
// Each regular and static column in the base schema has a corresponding column in the log schema
@@ -969,7 +973,13 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to `true` for the given row. If not called, the column will be `null`.
void set_deleted(const clustering_key& log_ck, const column_definition& base_cdef) {
_log_mut.set_cell(log_ck, log_data_column_deleted_name_bytes(base_cdef.name()), data_value(true), _ts, _ttl);
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*log_cdef.type, _ts, log_cdef.type->decompose(true), _ttl));
}
// Each regular and static non-atomic column in the base schema has a corresponding column in the log schema
@@ -978,7 +988,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given set of keys for the given row.
void set_deleted_elements(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes& deleted_elements) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*log_cdef.type, _ts, deleted_elements, _ttl));
}
@@ -1461,7 +1476,7 @@ private:
row_states_map _clustering_row_states;
cell_map _static_row_state;
std::vector<mutation> _result_mutations;
utils::chunked_vector<mutation> _result_mutations;
std::optional<log_mutation_builder> _builder;
// When enabled, process_change will update _clustering_row_states and _static_row_state
@@ -1591,8 +1606,8 @@ public:
// Takes and returns generated cdc log mutations and associated statistics about parts touched during transformer's lifetime.
// The `transformer` object on which this method was called on should not be used anymore.
std::tuple<std::vector<mutation>, stats::part_type_set> finish() && {
return std::make_pair<std::vector<mutation>, stats::part_type_set>(std::move(_result_mutations), std::move(_touched_parts));
std::tuple<utils::chunked_vector<mutation>, stats::part_type_set> finish() && {
return std::make_pair<utils::chunked_vector<mutation>, stats::part_type_set>(std::move(_result_mutations), std::move(_touched_parts));
}
static db::timeout_clock::time_point default_timeout() {
@@ -1763,8 +1778,8 @@ public:
};
template <typename Func>
future<std::vector<mutation>>
transform_mutations(std::vector<mutation>& muts, decltype(muts.size()) batch_size, Func&& f) {
future<utils::chunked_vector<mutation>>
transform_mutations(utils::chunked_vector<mutation>& muts, decltype(muts.size()) batch_size, Func&& f) {
return parallel_for_each(
boost::irange(static_cast<decltype(muts.size())>(0), muts.size(), batch_size),
std::forward<Func>(f))
@@ -1773,8 +1788,8 @@ transform_mutations(std::vector<mutation>& muts, decltype(muts.size()) batch_siz
} // namespace cdc
future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>
cdc::cdc_service::impl::augment_mutation_call(lowres_clock::time_point timeout, std::vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {
future<std::tuple<utils::chunked_vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>
cdc::cdc_service::impl::augment_mutation_call(lowres_clock::time_point timeout, utils::chunked_vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {
// we do all this because in the case of batches, we can have mixed schemas.
auto e = mutations.end();
auto i = std::find_if(mutations.begin(), e, [](const mutation& m) {
@@ -1782,14 +1797,14 @@ cdc::cdc_service::impl::augment_mutation_call(lowres_clock::time_point timeout,
});
if (i == e) {
return make_ready_future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>(std::make_tuple(std::move(mutations), lw_shared_ptr<cdc::operation_result_tracker>()));
return make_ready_future<std::tuple<utils::chunked_vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>(std::make_tuple(std::move(mutations), lw_shared_ptr<cdc::operation_result_tracker>()));
}
tracing::trace(tr_state, "CDC: Started generating mutations for log rows");
mutations.reserve(2 * mutations.size());
return do_with(std::move(mutations), service::query_state(service::client_state::for_internal_calls(), empty_service_permit()), operation_details{},
[this, tr_state = std::move(tr_state), write_cl] (std::vector<mutation>& mutations, service::query_state& qs, operation_details& details) {
[this, tr_state = std::move(tr_state), write_cl] (utils::chunked_vector<mutation>& mutations, service::query_state& qs, operation_details& details) {
return transform_mutations(mutations, 1, [this, &mutations, &qs, tr_state = tr_state, &details, write_cl] (int idx) mutable {
auto& m = mutations[idx];
auto s = m.schema();
@@ -1849,21 +1864,26 @@ cdc::cdc_service::impl::augment_mutation_call(lowres_clock::time_point timeout,
tracing::trace(tr_state, "CDC: Generated {} log mutations from {}", generated_count, mutations[idx].decorated_key());
details.touched_parts.add(touched_parts);
});
}).then([this, tr_state, &details](std::vector<mutation> mutations) {
}).then([this, tr_state, &details](utils::chunked_vector<mutation> mutations) {
tracing::trace(tr_state, "CDC: Finished generating all log mutations");
auto tracker = make_lw_shared<cdc::operation_result_tracker>(_ctxt._proxy.get_cdc_stats(), details);
return make_ready_future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>(std::make_tuple(std::move(mutations), std::move(tracker)));
return make_ready_future<std::tuple<utils::chunked_vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>(std::make_tuple(std::move(mutations), std::move(tracker)));
});
});
}
bool cdc::cdc_service::needs_cdc_augmentation(const std::vector<mutation>& mutations) const {
bool cdc::cdc_service::needs_cdc_augmentation(const utils::chunked_vector<mutation>& mutations) const {
return std::any_of(mutations.begin(), mutations.end(), [](const mutation& m) {
return m.schema()->cdc_options().enabled();
});
}
future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>
cdc::cdc_service::augment_mutation_call(lowres_clock::time_point timeout, std::vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {
future<std::tuple<utils::chunked_vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>
cdc::cdc_service::augment_mutation_call(lowres_clock::time_point timeout, utils::chunked_vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {
if (utils::get_local_injector().enter("sleep_before_cdc_augmentation")) {
return seastar::sleep(std::chrono::milliseconds(100)).then([this, timeout, mutations = std::move(mutations), tr_state = std::move(tr_state), write_cl] () mutable {
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
});
}
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
}
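
Review note: the body of `needs_cdc_augmentation` is unchanged by the container swap because `std::any_of` only needs an iterator range. Below is a minimal sketch of that property. `fake_mutation` and `needs_cdc_augmentation_sketch` are hypothetical names, and `std::deque` merely stands in for `utils::chunked_vector`, which is not available outside the Scylla tree.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-in for `mutation`; only the CDC-enabled flag matters here.
struct fake_mutation {
    bool cdc_enabled;
};

// Container-agnostic form of the check: any type exposing begin()/end()
// over fake_mutation works, which is why the diff only touches the signature.
template <typename Container>
bool needs_cdc_augmentation_sketch(const Container& mutations) {
    return std::any_of(mutations.begin(), mutations.end(), [](const fake_mutation& m) {
        return m.cdc_enabled;
    });
}
```

Calling the sketch with a `std::vector<fake_mutation>` or a `std::deque<fake_mutation>` goes through the same code path, mirroring how the real function's body stays the same on both sides of the diff.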


@@ -75,13 +75,13 @@ public:
// appropriate augments to set the log entries.
// Iff post-image is enabled for any of these, a non-empty callback is also
// returned to be invoked post the mutation query.
future<std::tuple<std::vector<mutation>, lw_shared_ptr<operation_result_tracker>>> augment_mutation_call(
future<std::tuple<utils::chunked_vector<mutation>, lw_shared_ptr<operation_result_tracker>>> augment_mutation_call(
lowres_clock::time_point timeout,
std::vector<mutation>&& mutations,
utils::chunked_vector<mutation>&& mutations,
tracing::trace_state_ptr tr_state,
db::consistency_level write_cl
);
bool needs_cdc_augmentation(const std::vector<mutation>&) const;
bool needs_cdc_augmentation(const utils::chunked_vector<mutation>&) const;
};
struct db_context final {


@@ -20,8 +20,6 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|AARCH64")
set(kmip_arch "aarch64")
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "amd64|x86_64")
set(kmip_arch "64")
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "(powerpc|ppc)64le")
set(kmip_arch "ppc64le")
endif()
set(kmip_ROOT "${PROJECT_SOURCE_DIR}/kmipc/kmipc-${kmip_ver}-${kmip_distrib}_${kmip_arch}")


@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_COVERAGE
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_COVERAGE
update_build_flags(Coverage
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL "g")


@@ -1,6 +1,6 @@
set(OptimizationLevel "g")
update_cxx_flags(CMAKE_CXX_FLAGS_DEBUG
update_build_flags(Debug
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL ${OptimizationLevel})


@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_DEV
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_DEV
update_build_flags(Dev
OPTIMIZATION_LEVEL "2")
set(scylla_build_mode_Dev "dev")


@@ -8,7 +8,7 @@ set(CMAKE_CXX_FLAGS_RELWITHDEBINFO
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_RELWITHDEBINFO
update_build_flags(RelWithDebInfo
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL "3")


@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_SANITIZE
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_SANITIZE
update_build_flags(Sanitize
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL "s")


@@ -72,7 +72,7 @@ function(get_padded_dynamic_linker_option output length)
ERROR_VARIABLE driver_command_line
ERROR_STRIP_TRAILING_WHITESPACE)
# extract the argument for the "-dynamic-linker" option
if(driver_command_line MATCHES ".*\"?${dynamic_linker_option}\"? \"?([^ \"]*)\"? .*")
if(driver_command_line MATCHES ".*\"?${dynamic_linker_option}\"?[ =]\"?([^ \"]*)\"?[ \n].*")
set(dynamic_linker ${CMAKE_MATCH_1})
else()
message(FATAL_ERROR "Unable to find ${dynamic_linker_option} in driver-generated command: "
@@ -80,7 +80,7 @@ function(get_padded_dynamic_linker_option output length)
endif()
# prefixing a path with "/"s does not actually change what it means
pad_at_begin(padded_dynamic_linker "/" "${dynamic_linker}" ${length})
set(${output} "${dynamic_linker_option}=${padded_dynamic_linker}" PARENT_SCOPE)
set(${output} "--dynamic-linker=${padded_dynamic_linker}" PARENT_SCOPE)
endfunction()
# We want to strip the absolute build paths from the binary,
@@ -135,7 +135,7 @@ function(maybe_limit_stack_usage_in_KB stack_usage_threshold_in_KB config)
endif()
endfunction()
macro(update_cxx_flags flags)
macro(update_build_flags config)
cmake_parse_arguments (
parsed_args
"WITH_DEBUG_INFO"
@@ -145,11 +145,22 @@ macro(update_cxx_flags flags)
if(NOT DEFINED parsed_args_OPTIMIZATION_LEVEL)
message(FATAL_ERROR "OPTIMIZATION_LEVEL is missing")
endif()
string(APPEND ${flags}
string(TOUPPER ${config} CONFIG)
set(cxx_flags "CMAKE_CXX_FLAGS_${CONFIG}")
set(linker_flags "CMAKE_EXE_LINKER_FLAGS_${CONFIG}")
string(APPEND ${cxx_flags}
" -O${parsed_args_OPTIMIZATION_LEVEL}")
if(parsed_args_WITH_DEBUG_INFO)
string(APPEND ${flags} " -g -gz")
string(APPEND ${cxx_flags} " -g -gz")
else()
# If Scylla is compiled without debug info, strip the debug symbols from
# the result in case one of the linked static libraries happens to have
# some debug symbols. See issue #23834.
string(APPEND ${linker_flags} " -Wl,--strip-debug")
endif()
unset(CONFIG)
unset(cxx_flags)
unset(linker_flags)
endmacro()
set(pgo_opts "")
@@ -283,7 +294,7 @@ else()
# that. The 512 includes the null at the end, hence the 511 below.
get_padded_dynamic_linker_option(dynamic_linker_option 511)
endif()
add_link_options("${dynamic_linker_option}")
add_link_options("LINKER:${dynamic_linker_option}")
if(Scylla_ENABLE_LTO)
include(CheckIPOSupported)
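
Review note: the regex change in `get_padded_dynamic_linker_option` lets the option and its argument be separated by `=` as well as by a space, and lets the argument be terminated by a newline. The sketch below approximates both patterns with `std::regex` (ECMAScript syntax, which is not identical to CMake's regex dialect). The option name is hard-coded to `-dynamic-linker` as an assumption, since the value of `${dynamic_linker_option}` is not visible in the hunk. `extract_dynamic_linker`, `old_pattern` and `new_pattern` are hypothetical names.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Approximation of the pattern removed by the diff: the option and its
// argument must be separated by a single space.
static const std::regex old_pattern(R"re(.*"?-dynamic-linker"? "?([^ "]*)"? .*)re");

// Approximation of the pattern added by the diff: the separator may be
// ' ' or '=', and the argument may be followed by a space or a newline.
static const std::regex new_pattern(R"re(.*"?-dynamic-linker"?[ =]"?([^ "]*)"?[ \n].*)re");

// Returns the captured dynamic linker path, or nullopt if the pattern
// does not match the driver-generated command line.
std::optional<std::string> extract_dynamic_linker(const std::regex& pattern,
                                                  const std::string& driver_command_line) {
    std::smatch m;
    if (std::regex_search(driver_command_line, m, pattern)) {
        return m[1].str();
    }
    return std::nullopt;
}
```

With these approximations, a command line such as `ld --dynamic-linker=/lib64/ld.so -o a.out` is rejected by the old pattern and accepted by the new one, while the space-separated form is accepted by both.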


@@ -54,6 +54,62 @@
#include "replica/database.hh"
#include "timestamp.hh"
can_gc_fn always_gc = [] (tombstone, is_shadowable) { return true; };
can_gc_fn never_gc = [] (tombstone, is_shadowable) { return false; };
max_purgeable_fn can_always_purge = [] (const dht::decorated_key&, is_shadowable) -> max_purgeable { return max_purgeable(api::max_timestamp); };
max_purgeable_fn can_never_purge = [] (const dht::decorated_key&, is_shadowable) -> max_purgeable { return max_purgeable(api::min_timestamp); };
max_purgeable& max_purgeable::combine(max_purgeable other) {
if (!other) {
return *this;
}
if (!*this) {
*this = std::move(other);
return *this;
}
if (_timestamp > other._timestamp) {
_source = other._source;
_timestamp = other._timestamp;
}
if (_expiry_threshold && other._expiry_threshold) {
_expiry_threshold = std::min(*_expiry_threshold, *other._expiry_threshold);
} else {
_expiry_threshold = std::nullopt;
}
return *this;
}
max_purgeable::can_purge_result max_purgeable::can_purge(tombstone t) const {
if (!*this) {
return { };
}
return {
.can_purge = (t.deletion_time < _expiry_threshold.value_or(gc_clock::time_point::min()) || t.timestamp < _timestamp),
.timestamp_source = _source,
};
}
auto fmt::formatter<max_purgeable::timestamp_source>::format(max_purgeable::timestamp_source s, fmt::format_context& ctx) const -> decltype(ctx.out()) {
switch (s) {
case max_purgeable::timestamp_source::none:
return format_to(ctx.out(), "none");
case max_purgeable::timestamp_source::memtable_possibly_shadowing_data:
return format_to(ctx.out(), "memtable_possibly_shadowing_data");
case max_purgeable::timestamp_source::other_sstables_possibly_shadowing_data:
return format_to(ctx.out(), "other_sstables_possibly_shadowing_data");
}
}
auto fmt::formatter<max_purgeable>::format(max_purgeable mp, fmt::format_context& ctx) const -> decltype(ctx.out()) {
const sstring expiry_str = mp.expiry_threshold() ? fmt::format("{}", mp.expiry_threshold()->time_since_epoch().count()) : "nullopt";
return format_to(ctx.out(), "max_purgeable{{timestamp={}, expiry_threshold={}, source={}}}", mp.timestamp(), expiry_str, mp.source());
}
namespace sstables {
bool is_eligible_for_compaction(const shared_sstable& sst) noexcept {
@@ -135,20 +191,25 @@ std::string_view to_string(compaction_type_options::scrub::quarantine_mode quara
return "(invalid)";
}
static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_s, sstable_set::incremental_selector& selector,
static max_purgeable get_max_purgeable_timestamp(const compaction_group_view& table_s, sstable_set::incremental_selector& selector,
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk, uint64_t& bloom_filter_checks,
const api::timestamp_type compacting_max_timestamp, const bool gc_check_only_compacting_sstables, const is_shadowable is_shadowable) {
if (!table_s.tombstone_gc_enabled()) [[unlikely]] {
return api::min_timestamp;
clogger.trace("get_max_purgeable_timestamp {}.{}: tombstone_gc_enabled=false, returning min_timestamp",
table_s.schema()->ks_name(), table_s.schema()->cf_name());
return max_purgeable(api::min_timestamp);
}
auto timestamp = api::max_timestamp;
if (gc_check_only_compacting_sstables) {
// If gc_check_only_compacting_sstables is enabled, do not
// check memtables and other sstables not being compacted.
return timestamp;
clogger.trace("get_max_purgeable_timestamp {}.{}: gc_check_only_compacting_sstables=true, returning max_timestamp",
table_s.schema()->ks_name(), table_s.schema()->cf_name());
return max_purgeable(timestamp);
}
auto source = max_purgeable::timestamp_source::none;
api::timestamp_type memtable_min_timestamp;
if (is_shadowable) {
// For shadowable tombstones, check the minimum live row_marker timestamp
@@ -166,7 +227,8 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
// See https://github.com/scylladb/scylladb/issues/20423
memtable_min_timestamp = table_s.min_memtable_live_timestamp();
}
clogger.trace("memtable_min_timestamp={} compacting_max_timestamp={} memtable_has_key={} is_shadowable={} min_memtable_live_timestamp={} min_memtable_live_row_marker_timestamp={}",
clogger.trace("get_max_purgeable_timestamp {}.{}: memtable_min_timestamp={} compacting_max_timestamp={} memtable_has_key={} is_shadowable={} min_memtable_live_timestamp={} min_memtable_live_row_marker_timestamp={}",
table_s.schema()->ks_name(), table_s.schema()->cf_name(),
memtable_min_timestamp, compacting_max_timestamp, table_s.memtable_has_key(dk), is_shadowable, table_s.min_memtable_live_timestamp(), table_s.min_memtable_live_row_marker_timestamp());
// Use memtable timestamp if it contains live data older than the sstables being compacted,
// and if the memtable also contains the key we're calculating max purgeable timestamp for.
@@ -174,6 +236,7 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
// newer data.
if (memtable_min_timestamp <= compacting_max_timestamp && table_s.memtable_has_key(dk)) {
timestamp = memtable_min_timestamp;
source = max_purgeable::timestamp_source::memtable_possibly_shadowing_data;
}
std::optional<utils::hashed_key> hk;
for (auto&& sst : boost::range::join(selector.select(dk).sstables, table_s.compacted_undeleted_sstables())) {
@@ -217,12 +280,13 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
if (sst->filter_has_key(*hk)) {
bloom_filter_checks++;
timestamp = min_timestamp;
source = max_purgeable::timestamp_source::other_sstables_possibly_shadowing_data;
}
}
return timestamp;
return max_purgeable(timestamp, source);
}
static std::vector<shared_sstable> get_uncompacting_sstables(const table_state& table_s, std::vector<shared_sstable> sstables) {
static std::vector<shared_sstable> get_uncompacting_sstables(const compaction_group_view& table_s, std::vector<shared_sstable> sstables) {
auto sstable_set = table_s.sstable_set_for_tombstone_gc();
auto all_sstables = *sstable_set->all() | std::ranges::to<std::vector>();
auto& compacted_undeleted = table_s.compacted_undeleted_sstables();
@@ -235,17 +299,23 @@ static std::vector<shared_sstable> get_uncompacting_sstables(const table_state&
return not_compacted_sstables;
}
static std::vector<basic_info> extract_basic_info_from_sstables(const std::vector<shared_sstable>& sstables) {
return sstables | std::views::transform([] (auto&& sst) {
return sstables::basic_info{.generation = sst->generation(), .origin = sst->get_origin(), .size = sst->bytes_on_disk()};
}) | std::ranges::to<std::vector<basic_info>>();
}
class compaction;
class compaction_write_monitor final : public sstables::write_monitor, public backlog_write_progress_manager {
sstables::shared_sstable _sst;
table_state& _table_s;
compaction_group_view& _table_s;
const sstables::writer_offset_tracker* _tracker = nullptr;
uint64_t _progress_seen = 0;
api::timestamp_type _maximum_timestamp;
unsigned _sstable_level;
public:
compaction_write_monitor(sstables::shared_sstable sst, table_state& table_s, api::timestamp_type max_timestamp, unsigned sstable_level)
compaction_write_monitor(sstables::shared_sstable sst, compaction_group_view& table_s, api::timestamp_type max_timestamp, unsigned sstable_level)
: _sst(sst)
, _table_s(table_s)
, _maximum_timestamp(max_timestamp)
@@ -381,7 +451,7 @@ using use_backlog_tracker = bool_class<class use_backlog_tracker_tag>;
struct compaction_read_monitor_generator final : public read_monitor_generator {
class compaction_read_monitor final : public sstables::read_monitor, public backlog_read_progress_manager {
sstables::shared_sstable _sst;
table_state& _table_s;
compaction_group_view& _table_s;
const sstables::reader_position_tracker* _tracker = nullptr;
uint64_t _last_position_seen = 0;
use_backlog_tracker _use_backlog_tracker;
@@ -414,7 +484,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
_sst = {};
}
compaction_read_monitor(sstables::shared_sstable sst, table_state& table_s, use_backlog_tracker use_backlog_tracker)
compaction_read_monitor(sstables::shared_sstable sst, compaction_group_view& table_s, use_backlog_tracker use_backlog_tracker)
: _sst(std::move(sst)), _table_s(table_s), _use_backlog_tracker(use_backlog_tracker) { }
~compaction_read_monitor() {
@@ -433,7 +503,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
return p.first->second;
}
explicit compaction_read_monitor_generator(table_state& table_s, use_backlog_tracker use_backlog_tracker = use_backlog_tracker::yes)
explicit compaction_read_monitor_generator(compaction_group_view& table_s, use_backlog_tracker use_backlog_tracker = use_backlog_tracker::yes)
: _table_s(table_s), _use_backlog_tracker(use_backlog_tracker) {}
uint64_t compacted() const {
@@ -449,7 +519,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
}
}
private:
table_state& _table_s;
compaction_group_view& _table_s;
std::unordered_map<generation_type, compaction_read_monitor> _generated_monitors;
use_backlog_tracker _use_backlog_tracker;
@@ -477,12 +547,13 @@ uint64_t compaction_progress_monitor::get_progress() const {
class compaction {
protected:
compaction_data& _cdata;
table_state& _table_s;
compaction_group_view& _table_s;
const compaction_sstable_creator_fn _sstable_creator;
const schema_ptr _schema;
const reader_permit _permit;
std::vector<shared_sstable> _sstables;
std::vector<generation_type> _input_sstable_generations;
std::vector<basic_info> _input_sstables_basic_info;
// Unused sstables are tracked because if compaction is interrupted we can only delete them.
// Deleting used sstables could potentially result in data loss.
std::unordered_set<shared_sstable> _new_partial_sstables;
@@ -501,6 +572,7 @@ protected:
double _estimated_droppable_tombstone_ratio = 0;
uint64_t _bloom_filter_checks = 0;
combined_reader_statistics _reader_statistics;
tombstone_purge_stats _tombstone_purge_stats;
db::replay_position _rp;
encoding_stats_collector _stats_collector;
const bool _can_split_large_partition = false;
@@ -525,6 +597,7 @@ protected:
utils::observable<> _stop_request_observable;
// optional tombstone_gc_state that is used when gc has to check only the compacting sstables to collect tombstones.
std::optional<tombstone_gc_state> _tombstone_gc_state_with_commitlog_check_disabled;
int64_t _output_repaired_at = 0;
private:
// Keeps track of monitors for input sstable.
// If _update_backlog_tracker is set to true, monitors are responsible for adjusting backlog as compaction progresses.
@@ -554,7 +627,7 @@ private:
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
}
protected:
compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker)
compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker)
: _cdata(init_compaction_data(cdata, descriptor))
, _table_s(table_s)
, _sstable_creator(std::move(descriptor.creator))
@@ -607,6 +680,7 @@ protected:
}
void finish_new_sstable(compaction_writer* writer) {
writer->writer.set_repaired_at(_output_repaired_at);
writer->writer.consume_end_of_stream();
writer->sst->open_data().get();
_end_size += writer->sst->bytes_on_disk();
@@ -762,14 +836,14 @@ private:
return dht::to_partition_range(*r);
};
return make_flat_multi_range_reader(_schema, _permit, std::move(source),
return make_multi_range_reader(_schema, _permit, std::move(source),
std::move(owned_range_generator),
_schema->full_slice(),
tracing::trace_state_ptr());
}
virtual sstables::sstable_set make_sstable_set_for_input() const {
return _table_s.get_compaction_strategy().make_sstable_set(_schema);
return _table_s.get_compaction_strategy().make_sstable_set(_table_s);
}
const tombstone_gc_state& get_tombstone_gc_state() const {
@@ -783,12 +857,19 @@ private:
double sum_of_estimated_droppable_tombstone_ratio = 0;
_input_sstable_generations.reserve(_sstables.size());
_input_sstables_basic_info.reserve(_sstables.size());
int64_t repaired_at = 0;
std::vector<int64_t> repaired_at_for_compacted_sstables;
for (auto& sst : _sstables) {
co_await coroutine::maybe_yield();
auto& sst_stats = sst->get_stats_metadata();
repaired_at_for_compacted_sstables.push_back(sst_stats.repaired_at);
repaired_at = std::max(sst_stats.repaired_at, repaired_at);
timestamp_tracker.update(sst_stats.min_timestamp);
timestamp_tracker.update(sst_stats.max_timestamp);
_input_sstables_basic_info.emplace_back(sst->generation(), sst->get_origin(), sst->bytes_on_disk());
// Compacted sstable keeps track of its ancestors.
_input_sstable_generations.push_back(sst->generation());
_start_size += sst->bytes_on_disk();
@@ -816,7 +897,11 @@ private:
_rp = std::max(_rp, sst_stats.position);
}
}
log_info("{} [{}]", report_start_desc(), fmt::join(_sstables | std::views::transform([] (auto sst) { return to_string(sst, true); }), ","));
log_debug("{} [{}]", report_start_desc(), fmt::join(_sstables | std::views::transform([] (auto sst) { return to_string(sst, true); }), ","));
if (repaired_at) {
_output_repaired_at = repaired_at;
}
log_debug("repaired_at_vec={} output_repaired_at={}", repaired_at_for_compacted_sstables, _output_repaired_at);
if (ssts->size() < _sstables.size()) {
log_debug("{} out of {} input sstables are fully expired sstables that will not be actually compacted",
_sstables.size() - ssts->size(), _sstables.size());
@@ -842,7 +927,8 @@ private:
});
});
const auto& gc_state = get_tombstone_gc_state();
return consumer(make_compacting_reader(setup_sstable_reader(), compaction_time, max_purgeable_func(), gc_state));
return consumer(make_compacting_reader(setup_sstable_reader(), compaction_time, max_purgeable_func(), gc_state,
streamed_mutation::forwarding::no, &_tombstone_purge_stats));
}
future<> consume() {
@@ -859,22 +945,24 @@ private:
auto close_reader = deferred_close(reader);
if (enable_garbage_collected_sstable_writer()) {
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, compacted_fragments_writer>;
using compact_mutations = compact_for_compaction<compacted_fragments_writer, compacted_fragments_writer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
get_tombstone_gc_state(),
get_compacted_fragments_writer(),
get_gc_compacted_fragments_writer());
get_gc_compacted_fragments_writer(),
&_tombstone_purge_stats);
reader.consume_in_thread(std::move(cfc));
return;
}
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, noop_compacted_fragments_consumer>;
using compact_mutations = compact_for_compaction<compacted_fragments_writer, noop_compacted_fragments_consumer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
get_tombstone_gc_state(),
get_compacted_fragments_writer(),
noop_compacted_fragments_consumer());
noop_compacted_fragments_consumer(),
&_tombstone_purge_stats);
reader.consume_in_thread(std::move(cfc));
});
});
@@ -897,7 +985,7 @@ private:
// if the derived compaction wants to opt in for this behavior, in addition
// to overriding `make_interposer_consumer()`, it would have to override
// `use_interposer_consumer()` so it returns true.
virtual reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) {
virtual mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) {
return _table_s.get_compaction_strategy().make_interposer_consumer(_ms_metadata, std::move(end_consumer));
}
@@ -907,13 +995,19 @@ private:
protected:
virtual compaction_result finish(std::chrono::time_point<db_clock> started_at, std::chrono::time_point<db_clock> ended_at) {
compaction_result ret {
.shard_id = this_shard_id(),
.type = _type,
.sstables_in = std::move(_input_sstables_basic_info),
.sstables_out = extract_basic_info_from_sstables(_all_new_sstables),
.new_sstables = std::move(_all_new_sstables),
.stats {
.started_at = started_at,
.ended_at = ended_at,
.start_size = _start_size,
.end_size = _end_size,
.bloom_filter_checks = _bloom_filter_checks,
.reader_statistics = std::move(_reader_statistics),
.tombstone_purge_stats = std::move(_tombstone_purge_stats),
},
};
@@ -928,7 +1022,7 @@ protected:
// - add support to merge summary (message: Partition merge counts were {%s}.).
// - there is no easy way, currently, to know the exact number of total partitions.
// For the time being, using estimated key count.
log_info("{} {} sstables to [{}]. {} to {} (~{}% of original) in {}ms = {}. ~{} total partitions merged to {}.",
log_debug("{} {} sstables to [{}]. {} to {} (~{}% of original) in {}ms = {}. ~{} total partitions merged to {}.",
report_finish_desc(), _input_sstable_generations.size(),
fmt::join(ret.new_sstables | std::views::transform([] (auto sst) { return to_string(sst, false); }), ","),
utils::pretty_printed_data_size(_start_size), utils::pretty_printed_data_size(_end_size), int(ratio * 100),
@@ -1154,7 +1248,7 @@ void compacted_fragments_writer::consume_end_of_stream() {
class regular_compaction : public compaction {
seastar::semaphore _replacer_lock = {1};
public:
regular_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker = use_backlog_tracker::yes)
regular_compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker = use_backlog_tracker::yes)
: compaction(table_s, std::move(descriptor), cdata, progress_monitor, use_backlog_tracker)
{
}
@@ -1296,12 +1390,12 @@ private:
return bool(_replacer);
}
public:
reshape_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor)
reshape_compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor)
: regular_compaction(table_s, std::move(descriptor), cdata, progress_monitor, use_backlog_tracker::no) {
}
virtual sstables::sstable_set make_sstable_set_for_input() const override {
return sstables::make_partitioned_sstable_set(_schema, false);
return sstables::make_partitioned_sstable_set(_schema, _table_s.token_range());
}
// Unconditionally enable incremental compaction if the strategy specifies a max output size, e.g. LCS.
@@ -1364,7 +1458,7 @@ public:
class cleanup_compaction final : public regular_compaction {
public:
cleanup_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor)
cleanup_compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor)
: regular_compaction(table_s, std::move(descriptor), cdata, progress_monitor)
{
}
@@ -1381,14 +1475,14 @@ public:
class split_compaction final : public regular_compaction {
compaction_type_options::split _options;
public:
split_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_type_options::split options,
split_compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_type_options::split options,
compaction_progress_monitor& progress_monitor)
: regular_compaction(table_s, std::move(descriptor), cdata, progress_monitor)
, _options(std::move(options))
{
}
reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) override {
mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {
return [this, end_consumer = std::move(end_consumer)] (mutation_reader reader) mutable -> future<> {
return mutation_writer::segregate_by_token_group(std::move(reader),
_options.classifier,
@@ -1640,7 +1734,7 @@ private:
uint64_t _validation_errors = 0;
public:
scrub_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_type_options::scrub options, compaction_progress_monitor& progress_monitor)
scrub_compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_type_options::scrub options, compaction_progress_monitor& progress_monitor)
: regular_compaction(table_s, std::move(descriptor), cdata, progress_monitor, use_backlog_tracker::no)
, _options(options)
, _scrub_start_description(fmt::format("Scrubbing in {} mode", _options.operation_mode))
@@ -1682,7 +1776,7 @@ public:
}
}
reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) override {
mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {
if (!use_interposer_consumer()) {
return end_consumer;
}
@@ -1737,7 +1831,7 @@ private:
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimation_per_shard[s].estimated_partitions, _schema));
}
public:
resharding_compaction(table_state& table_s, sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor)
resharding_compaction(compaction_group_view& table_s, sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor)
: compaction(table_s, std::move(descriptor), cdata, progress_monitor, use_backlog_tracker::no)
, _estimation_per_shard(smp::count)
, _run_identifiers(smp::count)
@@ -1778,7 +1872,7 @@ public:
}
reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) override {
mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {
return [end_consumer = std::move(end_consumer)] (mutation_reader reader) mutable -> future<> {
return mutation_writer::segregate_by_shard(std::move(reader), std::move(end_consumer));
};
@@ -1852,9 +1946,9 @@ compaction_type compaction_type_options::type() const {
return index_to_type[_options.index()];
}
static std::unique_ptr<compaction> make_compaction(table_state& table_s, sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor) {
static std::unique_ptr<compaction> make_compaction(compaction_group_view& table_s, sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor) {
struct {
table_state& table_s;
compaction_group_view& table_s;
sstables::compaction_descriptor&& descriptor;
compaction_data& cdata;
compaction_progress_monitor& progress_monitor;
@@ -1885,7 +1979,7 @@ static std::unique_ptr<compaction> make_compaction(table_state& table_s, sstable
return descriptor.options.visit(visitor_factory);
}
static future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor descriptor, compaction_data& cdata, table_state& table_s, read_monitor_generator& monitor_generator) {
static future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_group_view& table_s, read_monitor_generator& monitor_generator) {
auto schema = table_s.schema();
auto permit = table_s.make_compaction_reader_permit();
@@ -1923,7 +2017,7 @@ static future<compaction_result> scrub_sstables_validate_mode(sstables::compacti
};
}
future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor descriptor, compaction_data& cdata, table_state& table_s, compaction_progress_monitor& progress_monitor) {
future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_group_view& table_s, compaction_progress_monitor& progress_monitor) {
progress_monitor.set_generator(std::make_unique<compaction_read_monitor_generator>(table_s, use_backlog_tracker::no));
auto d = defer([&] { progress_monitor.reset_generator(); });
auto res = co_await scrub_sstables_validate_mode(descriptor, cdata, table_s, *progress_monitor._generator);
@@ -1931,7 +2025,7 @@ future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_desc
}
future<compaction_result>
compact_sstables(sstables::compaction_descriptor descriptor, compaction_data& cdata, table_state& table_s, compaction_progress_monitor& progress_monitor) {
compact_sstables(sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_group_view& table_s, compaction_progress_monitor& progress_monitor) {
if (descriptor.sstables.empty()) {
return make_exception_future<compaction_result>(std::runtime_error(format("Called {} compaction with empty set on behalf of {}.{}",
compaction_name(descriptor.options.type()), table_s.schema()->ks_name(), table_s.schema()->cf_name())));
@@ -1945,7 +2039,7 @@ compact_sstables(sstables::compaction_descriptor descriptor, compaction_data& cd
}
std::unordered_set<sstables::shared_sstable>
get_fully_expired_sstables(const table_state& table_s, const std::vector<sstables::shared_sstable>& compacting, gc_clock::time_point compaction_time) {
get_fully_expired_sstables(const compaction_group_view& table_s, const std::vector<sstables::shared_sstable>& compacting, gc_clock::time_point compaction_time) {
clogger.debug("Checking droppable sstables in {}.{}", table_s.schema()->ks_name(), table_s.schema()->cf_name());
if (compacting.empty()) {
@@ -1953,6 +2047,8 @@ get_fully_expired_sstables(const table_state& table_s, const std::vector<sstable
}
std::unordered_set<sstables::shared_sstable> candidates;
// Note: This contains both repaired and unrepaired sstables which means
// compaction consults both repaired and unrepaired sstables for tombstone gc.
auto uncompacting_sstables = get_uncompacting_sstables(table_s, compacting);
// Get list of uncompacting sstables that overlap the ones being compacted.
std::vector<sstables::shared_sstable> overlapping = leveled_manifest::overlapping(*table_s.schema(), compacting, uncompacting_sstables);
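
Review note: the new `max_purgeable::combine` and `max_purgeable::can_purge` introduced at the top of this file can be read as "keep the most conservative bound". The sketch below mirrors that logic with simplified stand-ins: plain integers replace `api::timestamp_type` and `gc_clock::time_point`, and every `*_sketch` name is hypothetical. How the real class represents its disengaged state, and what a default-constructed `can_purge_result` means, are not visible in the diff, so the sketch uses `std::optional` for both as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Simplified stand-ins for api::timestamp_type and gc_clock::time_point.
using timestamp_type = std::int64_t;
using time_point = std::int64_t;

enum class timestamp_source { none, memtable, other_sstables };

struct tombstone_sketch {
    timestamp_type timestamp;
    time_point deletion_time;
};

class max_purgeable_sketch {
    // Empty means "disengaged" (assumption; the real representation is not shown).
    std::optional<timestamp_type> _timestamp;
    std::optional<time_point> _expiry_threshold;
    timestamp_source _source = timestamp_source::none;
public:
    max_purgeable_sketch() = default;
    explicit max_purgeable_sketch(timestamp_type ts,
                                  timestamp_source source = timestamp_source::none,
                                  std::optional<time_point> expiry_threshold = std::nullopt)
        : _timestamp(ts), _expiry_threshold(expiry_threshold), _source(source) {}

    explicit operator bool() const { return _timestamp.has_value(); }
    timestamp_type timestamp() const { return *_timestamp; }
    std::optional<time_point> expiry_threshold() const { return _expiry_threshold; }
    timestamp_source source() const { return _source; }

    // Keep the smaller timestamp (with its source) and the smaller expiry
    // threshold; if either side lacks a threshold, the result has none.
    max_purgeable_sketch& combine(const max_purgeable_sketch& other) {
        if (!other) {
            return *this;
        }
        if (!*this) {
            *this = other;
            return *this;
        }
        if (*_timestamp > *other._timestamp) {
            _source = other._source;
            _timestamp = other._timestamp;
        }
        if (_expiry_threshold && other._expiry_threshold) {
            _expiry_threshold = std::min(*_expiry_threshold, *other._expiry_threshold);
        } else {
            _expiry_threshold = std::nullopt;
        }
        return *this;
    }

    // Returns nullopt when disengaged, deferring to whatever default the
    // real can_purge_result carries (not modelled here).
    std::optional<bool> can_purge(tombstone_sketch t) const {
        if (!*this) {
            return std::nullopt;
        }
        const time_point threshold =
            _expiry_threshold.value_or(std::numeric_limits<time_point>::min());
        return t.deletion_time < threshold || t.timestamp < *_timestamp;
    }
};
```

In this sketch a tombstone is purgeable if it was deleted before the expiry threshold or if its write timestamp is below the combined maximum purgeable timestamp. A missing threshold falls back to the minimum time point, so only the timestamp comparison can succeed.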


@@ -11,11 +11,14 @@
#include "readers/combined_reader_stats.hh"
#include "sstables/shared_sstable.hh"
#include "sstables/generation_type.hh"
#include "compaction/compaction_descriptor.hh"
#include "mutation/mutation_tombstone_stats.hh"
#include "gc_clock.hh"
#include "utils/UUID.hh"
#include "table_state.hh"
#include "compaction_group_view.hh"
#include <seastar/core/abort_source.hh>
#include "sstables/basic_info.hh"
using namespace compaction;
@@ -72,6 +75,7 @@ struct compaction_data {
};
struct compaction_stats {
std::chrono::time_point<db_clock> started_at;
std::chrono::time_point<db_clock> ended_at;
uint64_t start_size = 0;
uint64_t end_size = 0;
@@ -79,13 +83,16 @@ struct compaction_stats {
// Bloom filter checks during max purgeable calculation
uint64_t bloom_filter_checks = 0;
combined_reader_statistics reader_statistics;
tombstone_purge_stats tombstone_purge_stats;
compaction_stats& operator+=(const compaction_stats& r) {
started_at = std::max(started_at, r.started_at);
ended_at = std::max(ended_at, r.ended_at);
start_size += r.start_size;
end_size += r.end_size;
validation_errors += r.validation_errors;
bloom_filter_checks += r.bloom_filter_checks;
tombstone_purge_stats += r.tombstone_purge_stats;
return *this;
}
friend compaction_stats operator+(const compaction_stats& l, const compaction_stats& r) {
@@ -96,6 +103,10 @@ struct compaction_stats {
};
struct compaction_result {
shard_id shard_id;
compaction_type type;
std::vector<sstables::basic_info> sstables_in;
std::vector<sstables::basic_info> sstables_out;
std::vector<sstables::shared_sstable> new_sstables;
compaction_stats stats;
};
@@ -112,7 +123,7 @@ public:
uint64_t get_progress() const;
friend class compaction;
friend future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor, compaction_data&, table_state&, compaction_progress_monitor&);
friend future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor, compaction_data&, compaction_group_view&, compaction_progress_monitor&);
};
// Compact a list of N sstables into M sstables.
@@ -120,7 +131,7 @@ public:
//
// compaction_descriptor is responsible for specifying the type of compaction, and influencing
// compaction behavior through its available member fields.
future<compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, compaction_data& cdata, table_state& table_s, compaction_progress_monitor& progress_monitor);
future<compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, compaction_data& cdata, compaction_group_view& table_s, compaction_progress_monitor& progress_monitor);
// Return list of expired sstables for column family cf.
// A sstable is fully expired *iff* its max_local_deletion_time precedes gc_before and its
@@ -128,7 +139,7 @@ future<compaction_result> compact_sstables(sstables::compaction_descriptor descr
// In simpler words, a sstable is fully expired if all of its live cells with TTL are expired
// and possibly doesn't contain any tombstone that covers cells in other sstables.
std::unordered_set<sstables::shared_sstable>
get_fully_expired_sstables(const table_state& table_s, const std::vector<sstables::shared_sstable>& compacting, gc_clock::time_point gc_before);
get_fully_expired_sstables(const compaction_group_view& table_s, const std::vector<sstables::shared_sstable>& compacting, gc_clock::time_point gc_before);
// For tests, can drop after we virtualize sstables.
mutation_reader make_scrubbing_reader(mutation_reader rd, compaction_type_options::scrub::mode scrub_mode, uint64_t& validation_errors);

@@ -15,7 +15,7 @@
namespace compaction {
class table_state;
class compaction_group_view;
class strategy_control;
struct compaction_state;

@@ -22,7 +22,91 @@ using can_gc_fn = std::function<bool(tombstone, is_shadowable)>;
extern can_gc_fn always_gc;
extern can_gc_fn never_gc;
using max_purgeable_fn = std::function<api::timestamp_type(const dht::decorated_key&, is_shadowable)>;
// For the purposes of overlap with live data, a tombstone is purgeable if:
// tombstone.timestamp ∈ (-inf, max_purgeable._timestamp)
//
// The above overlap check can be omitted iff:
// tombstone.deletion_time ∈ (-inf, max_purgeable._expiry_threshold.value_or(gc_clock::time_point::min()))
//
// So in other words, a tombstone is purgeable iff:
// tombstone.deletion_time < max_purgeable._expiry_threshold.value_or(gc_clock::time_point::min()) || tombstone.timestamp < max_purgeable._timestamp
//
// See can_purge() for more details.
class max_purgeable {
public:
enum class timestamp_source {
none,
memtable_possibly_shadowing_data,
other_sstables_possibly_shadowing_data
};
using expiry_threshold_opt = std::optional<gc_clock::time_point>;
private:
api::timestamp_type _timestamp { api::missing_timestamp };
expiry_threshold_opt _expiry_threshold;
timestamp_source _source { timestamp_source::none };
public:
max_purgeable() = default;
explicit max_purgeable(api::timestamp_type timestamp, timestamp_source source = timestamp_source::none)
: _timestamp(timestamp), _source(source)
{ }
explicit max_purgeable(api::timestamp_type timestamp, expiry_threshold_opt expiry_threshold, timestamp_source source = timestamp_source::none)
: _timestamp(timestamp), _expiry_threshold(expiry_threshold), _source(source)
{ }
operator bool() const { return _timestamp != api::missing_timestamp; }
bool operator==(const max_purgeable&) const = default;
bool operator!=(const max_purgeable&) const = default;
api::timestamp_type timestamp() const noexcept { return _timestamp; }
expiry_threshold_opt expiry_threshold() const noexcept { return _expiry_threshold; }
timestamp_source source() const noexcept { return _source; }
max_purgeable& combine(max_purgeable other);
struct can_purge_result {
bool can_purge { true };
timestamp_source timestamp_source { timestamp_source::none };
// can purge?
operator bool() const noexcept {
return can_purge;
}
bool operator!() const noexcept {
return !can_purge;
}
};
// Determines whether the tombstone can be purged.
//
// If available, the expiry threshold is used to maybe elide the overlap
// check against the min live timestamp. The overlap check elision is
// possible if the tombstone's deletion time is less than the expiry threshold
// or in other words: the tombstone was already expired when the data
// source(s) represented by this max_purgeable were created. Consequently,
// all writes in these data sources arrived *after* the tombstone was already
// expired and hence it is not relevant to these writes, even if they
// otherwise overlap with the tombstone's timestamp.
//
// The overlap check elision is an optimization, checking whether a tombstone
// can be purged by just looking at the timestamps is still correct (but
// stricter).
can_purge_result can_purge(tombstone) const;
};
template <>
struct fmt::formatter<max_purgeable::timestamp_source> : fmt::formatter<string_view> {
auto format(max_purgeable::timestamp_source, fmt::format_context& ctx) const -> decltype(ctx.out());
};
template <>
struct fmt::formatter<max_purgeable> : fmt::formatter<string_view> {
auto format(max_purgeable, fmt::format_context& ctx) const -> decltype(ctx.out());
};
using max_purgeable_fn = std::function<max_purgeable(const dht::decorated_key&, is_shadowable)>;
extern max_purgeable_fn can_always_purge;
extern max_purgeable_fn can_never_purge;
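The purge rule documented on `max_purgeable` above can be illustrated with a minimal standalone sketch. It is not the ScyllaDB implementation: `tombstone_sketch`, `max_purgeable_sketch`, and the plain `std::chrono` types are hypothetical stand-ins for `tombstone`, `max_purgeable`, `api::timestamp_type`, and `gc_clock`, and the sketch returns a plain `bool` rather than a `can_purge_result` carrying a `timestamp_source`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for api::timestamp_type and gc_clock::time_point.
using timestamp_type = std::int64_t;
using time_point = std::chrono::time_point<std::chrono::system_clock, std::chrono::seconds>;

struct tombstone_sketch {
    timestamp_type timestamp;  // write timestamp of the deletion
    time_point deletion_time;  // wall-clock time at which the deletion was issued
};

struct max_purgeable_sketch {
    timestamp_type timestamp;                    // minimum live timestamp of the other data sources
    std::optional<time_point> expiry_threshold;  // absent: the overlap check is never elided

    bool can_purge(const tombstone_sketch& t) const {
        // Overlap-check elision: the tombstone had already expired when the
        // data sources behind this object were created.
        const time_point threshold = expiry_threshold.value_or(time_point::min());
        if (t.deletion_time < threshold) {
            return true;
        }
        // Otherwise fall back to the (stricter) timestamp overlap check.
        return t.timestamp < timestamp;
    }
};
```

With no expiry threshold, `value_or(time_point::min())` makes the first condition unsatisfiable, so the decision falls back to the timestamp comparison alone, which matches the "still correct (but stricter)" remark in the comment.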

@@ -30,15 +30,16 @@ class compaction_strategy_state;
namespace compaction {
class table_state {
class compaction_group_view {
public:
virtual ~table_state() {}
virtual ~compaction_group_view() {}
virtual dht::token_range token_range() const noexcept = 0;
virtual const schema_ptr& schema() const noexcept = 0;
// min threshold as defined by table.
virtual unsigned min_compaction_threshold() const noexcept = 0;
virtual bool compaction_enforce_min_threshold() const noexcept = 0;
virtual const sstables::sstable_set& main_sstable_set() const = 0;
virtual const sstables::sstable_set& maintenance_sstable_set() const = 0;
virtual future<lw_shared_ptr<const sstables::sstable_set>> main_sstable_set() const = 0;
virtual future<lw_shared_ptr<const sstables::sstable_set>> maintenance_sstable_set() const = 0;
virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
@@ -60,6 +61,7 @@ public:
virtual const std::string get_group_id() const noexcept = 0;
virtual seastar::condition_variable& get_staging_done_condition() noexcept = 0;
virtual dht::token_range get_token_range_after_split(const dht::token& t) const noexcept = 0;
virtual int64_t get_sstables_repaired_at() const noexcept = 0;
};
} // namespace compaction
@@ -67,9 +69,9 @@ public:
namespace fmt {
template <>
struct formatter<compaction::table_state> : formatter<string_view> {
struct formatter<compaction::compaction_group_view> : formatter<string_view> {
template <typename FormatContext>
auto format(const compaction::table_state& t, FormatContext& ctx) const {
auto format(const compaction::compaction_group_view& t, FormatContext& ctx) const {
auto s = t.schema();
return fmt::format_to(ctx.out(), "{}.{} compaction_group={}", s->ks_name(), s->cf_name(), t.get_group_id());
}

File diff suppressed because it is too large

@@ -16,6 +16,7 @@
#include <seastar/core/metrics_registration.hh>
#include <seastar/core/abort_source.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/core/rwlock.hh>
#include "sstables/shared_sstable.hh"
#include "utils/exponential_backoff_retry.hh"
#include "utils/updateable_value.hh"
@@ -32,10 +33,12 @@
#include "seastarx.hh"
#include "sstables/exceptions.hh"
#include "tombstone_gc.hh"
#include "utils/pluggable.hh"
#include "compaction/compaction_reenabler.hh"
namespace db {
class system_keyspace;
class compaction_history_entry;
class system_keyspace;
}
namespace sstables { class test_env_compaction_manager; }
@@ -122,12 +125,12 @@ private:
future<> _waiting_reevalution = make_ready_future<>();
condition_variable _postponed_reevaluation;
// tables that wait for compaction but had their submission postponed due to an ongoing compaction.
std::unordered_set<compaction::table_state*> _postponed;
std::unordered_set<compaction::compaction_group_view*> _postponed;
// tracks taken weights of ongoing compactions, only one compaction per weight is allowed.
// weight is value assigned to a compaction job that is log base N of total size of all input sstables.
std::unordered_set<int> _weight_tracker;
std::unordered_map<compaction::table_state*, compaction_state> _compaction_state;
std::unordered_map<compaction::compaction_group_view*, compaction_state> _compaction_state;
// Purpose is to serialize all maintenance (non regular) compaction activity to reduce aggressiveness and space requirement.
// If the operation must be serialized with regular, then the per-table write lock must be taken.
@@ -138,7 +141,7 @@ private:
// being picked more than once.
seastar::named_semaphore _off_strategy_sem = {1, named_semaphore_exception_factory{"off-strategy compaction"}};
seastar::shared_ptr<db::system_keyspace> _sys_ks;
utils::pluggable<db::system_keyspace> _sys_ks;
std::function<void()> compaction_submission_callback();
// all registered tables are reevaluated at a constant interval.
@@ -159,14 +162,17 @@ private:
class strategy_control;
std::unique_ptr<strategy_control> _strategy_control;
per_table_history_maps _reconcile_history_maps;
shared_tombstone_gc_state _shared_tombstone_gc_state;
// TODO: tombstone_gc_state should now have value semantics, but the code
// still uses it with reference semantics (inconsistently though).
// Drop this member, once the code is converted into using value semantics.
tombstone_gc_state _tombstone_gc_state;
private:
// Requires task->_compaction_state.gate to be held and task to be registered in _tasks.
future<compaction_stats_opt> perform_task(shared_ptr<compaction::compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);
// Return nullopt if compaction cannot be started
std::optional<gate::holder> start_compaction(table_state& t);
std::optional<gate::holder> start_compaction(compaction_group_view& t);
template<typename TaskExecutor, typename... Args>
requires std::is_base_of_v<compaction_task_executor, TaskExecutor> &&
@@ -176,14 +182,15 @@ private:
}
future<compaction_manager::compaction_stats_opt> perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args);
future<> stop_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>> tasks, sstring reason) noexcept;
void stop_tasks(const std::vector<shared_ptr<compaction::compaction_task_executor>>& tasks, sstring reason) noexcept;
future<> await_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>>, bool task_stopped) const noexcept;
future<> update_throughput(uint32_t value_mbs);
// Return the largest fan-in of currently running compactions
unsigned current_compaction_fan_in_threshold() const;
// Return true if compaction can be initiated
bool can_register_compaction(compaction::table_state& t, int weight, unsigned fan_in) const;
bool can_register_compaction(compaction::compaction_group_view& t, int weight, unsigned fan_in) const;
// Register weight for a table. Do that only if can_register_weight()
// returned true.
void register_weight(int weight);
@@ -191,14 +198,14 @@ private:
void deregister_weight(int weight);
// Get candidates for compaction strategy, which are all sstables but the ones being compacted.
std::vector<sstables::shared_sstable> get_candidates(compaction::table_state& t) const;
future<std::vector<sstables::shared_sstable>> get_candidates(compaction::compaction_group_view& t) const;
bool eligible_for_compaction(const sstables::shared_sstable& sstable) const;
bool eligible_for_compaction(const sstables::frozen_sstable_run& sstable_run) const;
template <std::ranges::range Range>
requires std::convertible_to<std::ranges::range_value_t<Range>, sstables::shared_sstable> || std::convertible_to<std::ranges::range_value_t<Range>, sstables::frozen_sstable_run>
std::vector<std::ranges::range_value_t<Range>> get_candidates(table_state& t, const Range& sstables) const;
std::vector<std::ranges::range_value_t<Range>> get_candidates(compaction_group_view& t, const Range& sstables) const;
template <std::ranges::range Range>
requires std::same_as<std::ranges::range_value_t<Range>, sstables::shared_sstable>
@@ -210,23 +217,23 @@ private:
// gets the table's compaction state
// throws std::out_of_range exception if not found.
compaction_state& get_compaction_state(compaction::table_state* t);
const compaction_state& get_compaction_state(compaction::table_state* t) const {
compaction_state& get_compaction_state(compaction::compaction_group_view* t);
const compaction_state& get_compaction_state(compaction::compaction_group_view* t) const {
return const_cast<compaction_manager*>(this)->get_compaction_state(t);
}
// Return true if compaction manager is enabled and
// table still exists and compaction is not disabled for the table.
inline bool can_proceed(compaction::table_state* t) const;
inline bool can_proceed(compaction::compaction_group_view* t) const;
future<> postponed_compactions_reevaluation();
void reevaluate_postponed_compactions() noexcept;
// Postpone compaction for a table that couldn't be executed due to ongoing
// similar-sized compaction.
void postpone_compaction_for_table(compaction::table_state* t);
void postpone_compaction_for_table(compaction::compaction_group_view* t);
using quarantine_invalid_sstables = sstables::compaction_type_options::scrub::quarantine_invalid_sstables;
future<compaction_stats_opt> perform_sstable_scrub_validate_mode(compaction::table_state& t, tasks::task_info info, quarantine_invalid_sstables quarantine_sstables);
future<compaction_stats_opt> perform_sstable_scrub_validate_mode(compaction::compaction_group_view& t, tasks::task_info info, quarantine_invalid_sstables quarantine_sstables);
future<> update_static_shares(float shares);
using get_candidates_func = std::function<future<std::vector<sstables::shared_sstable>>()>;
@@ -236,9 +243,10 @@ private:
template<typename TaskType, typename... Args>
requires std::derived_from<TaskType, compaction_task_executor> &&
std::derived_from<TaskType, compaction_task_impl>
future<compaction_manager::compaction_stats_opt> perform_task_on_all_files(tasks::task_info info, table_state& t, sstables::compaction_type_options options, owned_ranges_ptr owned_ranges_ptr, get_candidates_func get_func, Args... args);
future<compaction_stats_opt> rewrite_sstables(compaction::table_state& t, sstables::compaction_type_options options, owned_ranges_ptr, get_candidates_func, tasks::task_info info,
future<compaction_manager::compaction_stats_opt> perform_task_on_all_files(sstring reason, tasks::task_info info, compaction_group_view& t, sstables::compaction_type_options options, owned_ranges_ptr owned_ranges_ptr, get_candidates_func get_func, Args... args);
future<compaction_stats_opt> rewrite_sstables(compaction::compaction_group_view& t, sstables::compaction_type_options options, owned_ranges_ptr, get_candidates_func, tasks::task_info info,
can_purge_tombstones can_purge = can_purge_tombstones::yes, sstring options_desc = "");
// Stop all fibers, without waiting. Safe to be called multiple times.
@@ -246,7 +254,7 @@ private:
future<> really_do_stop() noexcept;
// Propagate replacement of sstables to all ongoing compaction of a given table
void propagate_replacement(compaction::table_state& t, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added);
void propagate_replacement(compaction::compaction_group_view& t, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added);
// This constructor is supposed to only be used for testing so let's be more explicit
// about invoking it. Ref #10146
@@ -304,18 +312,18 @@ public:
future<> get_compaction_history(compaction_history_consumer&& f);
// Submit a table to be compacted.
void submit(compaction::table_state& t);
void submit(compaction::compaction_group_view& t);
// Can regular compaction be performed in the given table
bool can_perform_regular_compaction(compaction::table_state& t);
bool can_perform_regular_compaction(compaction::compaction_group_view& t);
// Maybe wait before adding more sstables
// if there are too many sstables.
future<> maybe_wait_for_sstable_count_reduction(compaction::table_state& t);
future<> maybe_wait_for_sstable_count_reduction(compaction::compaction_group_view& t);
// Submit a table to be off-strategy compacted.
// Returns true iff off-strategy compaction was required and performed.
future<bool> perform_offstrategy(compaction::table_state& t, tasks::task_info info);
future<bool> perform_offstrategy(compaction::compaction_group_view& t, tasks::task_info info);
// Submit a table to be cleaned up and wait for its termination.
//
@@ -324,34 +332,34 @@ public:
// Cleanup is about discarding keys that are no longer relevant for a
// given sstable, e.g. after node loses part of its token range because
// of a newly added node.
future<> perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, tasks::task_info info);
future<> perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::compaction_group_view& t, tasks::task_info info);
private:
future<> try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, tasks::task_info info);
future<> try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::compaction_group_view& t, tasks::task_info info);
// Add sst to or remove it from the respective compaction_state.sstables_requiring_cleanup set.
bool update_sstable_cleanup_state(table_state& t, const sstables::shared_sstable& sst, const dht::token_range_vector& sorted_owned_ranges);
bool update_sstable_cleanup_state(compaction_group_view& t, const sstables::shared_sstable& sst, const dht::token_range_vector& sorted_owned_ranges);
future<> on_compaction_completion(table_state& t, sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy);
future<> on_compaction_completion(compaction_group_view& t, sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy);
public:
// Submit a table to be upgraded and wait for its termination.
future<> perform_sstable_upgrade(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, bool exclude_current_version, tasks::task_info info);
future<> perform_sstable_upgrade(owned_ranges_ptr sorted_owned_ranges, compaction::compaction_group_view& t, bool exclude_current_version, tasks::task_info info);
// Submit a table to be scrubbed and wait for its termination.
future<compaction_stats_opt> perform_sstable_scrub(compaction::table_state& t, sstables::compaction_type_options::scrub opts, tasks::task_info info);
future<compaction_stats_opt> perform_sstable_scrub(compaction::compaction_group_view& t, sstables::compaction_type_options::scrub opts, tasks::task_info info);
// Submit a table for major compaction.
future<> perform_major_compaction(compaction::table_state& t, tasks::task_info info, bool consider_only_existing_data = false);
future<> perform_major_compaction(compaction::compaction_group_view& t, tasks::task_info info, bool consider_only_existing_data = false);
// Splits a compaction group by segregating all its sstables according to the classifier[1].
// [1]: See sstables::compaction_type_options::splitting::classifier.
// Returns when all sstables in the main sstable set are split. The only exception is shutdown
// or user aborted splitting using stop API.
future<compaction_stats_opt> perform_split_compaction(compaction::table_state& t, sstables::compaction_type_options::split opt, tasks::task_info info);
future<compaction_stats_opt> perform_split_compaction(compaction::compaction_group_view& t, sstables::compaction_type_options::split opt, tasks::task_info info);
// Splits a single SSTable by segregating all its data according to the classifier.
// If SSTable doesn't need split, the same input SSTable is returned as output.
// If SSTable needs split, then output SSTables are returned and the input SSTable is deleted.
future<std::vector<sstables::shared_sstable>> maybe_split_sstable(sstables::shared_sstable sst, table_state& t, sstables::compaction_type_options::split opt);
future<std::vector<sstables::shared_sstable>> maybe_split_sstable(sstables::shared_sstable sst, compaction_group_view& t, sstables::compaction_type_options::split opt);
// Run a custom job for a given table, defined by a function
// it completes when future returned by job is ready or returns immediately
@@ -360,64 +368,60 @@ public:
// parameter type is the compaction type the operation can most closely be
// associated with, use compaction_type::Compaction, if none apply.
// parameter job is a function that will carry the operation
future<> run_custom_job(compaction::table_state& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&, sstables::compaction_progress_monitor&)> job, tasks::task_info info, throw_if_stopping do_throw_if_stopping);
class compaction_reenabler {
compaction_manager& _cm;
compaction::table_state* _table;
compaction::compaction_state& _compaction_state;
gate::holder _holder;
public:
compaction_reenabler(compaction_manager&, compaction::table_state&);
compaction_reenabler(compaction_reenabler&&) noexcept;
~compaction_reenabler();
compaction::table_state* compacting_table() const noexcept {
return _table;
}
const compaction::compaction_state& compaction_state() const noexcept {
return _compaction_state;
}
};
future<> run_custom_job(compaction::compaction_group_view& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&, sstables::compaction_progress_monitor&)> job, tasks::task_info info, throw_if_stopping do_throw_if_stopping);
// Disable compaction temporarily for a table t.
// Caller should call the compaction_reenabler::reenable
future<compaction_reenabler> stop_and_disable_compaction(compaction::table_state& t);
future<compaction_reenabler> stop_and_disable_compaction(sstring reason, compaction::compaction_group_view& t);
future<compaction_reenabler> await_and_disable_compaction(compaction::compaction_group_view& t);
future<seastar::rwlock::holder> get_incremental_repair_read_lock(compaction::compaction_group_view& t, const sstring& reason);
future<seastar::rwlock::holder> get_incremental_repair_write_lock(compaction::compaction_group_view& t, const sstring& reason);
// Run a function with compaction temporarily disabled for a table T.
future<> run_with_compaction_disabled(compaction::table_state& t, std::function<future<> ()> func);
future<> run_with_compaction_disabled(compaction::compaction_group_view& t, std::function<future<> ()> func, sstring reason = "custom operation");
void plug_system_keyspace(db::system_keyspace& sys_ks) noexcept;
void unplug_system_keyspace() noexcept;
future<> unplug_system_keyspace() noexcept;
// Adds a table to the compaction manager.
// Creates a compaction_state structure that can be used for submitting
// compaction jobs of all types.
void add(compaction::table_state& t);
void add(compaction::compaction_group_view& t);
// Adds a group with compaction temporarily disabled. Compaction is only enabled back
// when the compaction_reenabler returned is destroyed.
compaction_reenabler add_with_compaction_disabled(compaction::compaction_group_view& view);
// Remove a table from the compaction manager.
// Cancel requests on table and wait for possible ongoing compactions.
future<> remove(compaction::table_state& t, sstring reason = "table removal") noexcept;
future<> remove(compaction::compaction_group_view& t, sstring reason = "table removal") noexcept;
const stats& get_stats() const {
return _stats;
}
const std::vector<sstables::compaction_info> get_compactions(compaction::table_state* t = nullptr) const;
const std::vector<sstables::compaction_info> get_compactions(compaction::compaction_group_view* t = nullptr) const;
// Returns true if table has an ongoing compaction, running on its behalf
bool has_table_ongoing_compaction(const compaction::table_state& t) const;
bool has_table_ongoing_compaction(const compaction::compaction_group_view& t) const;
bool compaction_disabled(compaction::table_state& t) const;
bool compaction_disabled(compaction::compaction_group_view& t) const;
// Stops ongoing compaction of a given type.
future<> stop_compaction(sstring type, compaction::table_state* table = nullptr);
future<> stop_compaction(sstring type, compaction::compaction_group_view* table = nullptr);
private:
std::vector<shared_ptr<compaction_task_executor>>
do_stop_ongoing_compactions(sstring reason, compaction_group_view* t, std::optional<sstables::compaction_type> type_opt) noexcept;
public:
// Stops ongoing compaction of a given table and/or compaction_type.
future<> stop_ongoing_compactions(sstring reason, compaction::table_state* t = nullptr, std::optional<sstables::compaction_type> type_opt = {}) noexcept;
future<> stop_ongoing_compactions(sstring reason, compaction::compaction_group_view* t = nullptr, std::optional<sstables::compaction_type> type_opt = {}) noexcept;
future<> await_ongoing_compactions(compaction_group_view* t);
compaction_reenabler stop_and_disable_compaction_no_wait(compaction_group_view& t, sstring reason);
double backlog() {
return _backlog_manager.backlog();
@@ -426,29 +430,32 @@ public:
void register_backlog_tracker(compaction_backlog_tracker& backlog_tracker) {
_backlog_manager.register_backlog_tracker(backlog_tracker);
}
void register_backlog_tracker(compaction::table_state& t, compaction_backlog_tracker new_backlog_tracker);
compaction_backlog_tracker& get_backlog_tracker(compaction::table_state& t);
compaction_backlog_tracker& get_backlog_tracker(compaction::compaction_group_view& t);
static sstables::compaction_data create_compaction_data();
compaction::strategy_control& get_strategy_control() const noexcept;
tombstone_gc_state& get_tombstone_gc_state() noexcept {
return _tombstone_gc_state;
};
const tombstone_gc_state& get_tombstone_gc_state() const noexcept {
return _tombstone_gc_state;
};
shared_tombstone_gc_state& get_shared_tombstone_gc_state() noexcept {
return _shared_tombstone_gc_state;
};
const shared_tombstone_gc_state& get_shared_tombstone_gc_state() const noexcept {
return _shared_tombstone_gc_state;
};
// Unconditionally erase sst from `sstables_requiring_cleanup`
// Returns true iff sst was found and erased.
bool erase_sstable_cleanup_state(table_state& t, const sstables::shared_sstable& sst);
bool erase_sstable_cleanup_state(compaction_group_view& t, const sstables::shared_sstable& sst);
// checks if the sstable is in the respective compaction_state.sstables_requiring_cleanup set.
bool requires_cleanup(table_state& t, const sstables::shared_sstable& sst) const;
const std::unordered_set<sstables::shared_sstable>& sstables_requiring_cleanup(table_state& t) const;
bool requires_cleanup(compaction_group_view& t, const sstables::shared_sstable& sst) const;
const std::unordered_set<sstables::shared_sstable>& sstables_requiring_cleanup(compaction_group_view& t) const;
friend class compacting_sstable_registration;
friend class compaction_weight_registration;
@@ -464,6 +471,7 @@ public:
friend class compaction::rewrite_sstables_compaction_task_executor;
friend class compaction::cleanup_sstables_compaction_task_executor;
friend class compaction::validate_sstables_compaction_task_executor;
friend compaction_reenabler;
};
namespace compaction {
@@ -488,7 +496,7 @@ public:
};
protected:
compaction_manager& _cm;
::compaction::table_state* _compacting_table = nullptr;
::compaction::compaction_group_view* _compacting_table = nullptr;
compaction::compaction_state& _compaction_state;
sstables::compaction_data _compaction_data;
state _state = state::none;
@@ -504,7 +512,7 @@ private:
compaction_manager::compaction_stats_opt _stats = std::nullopt;
public:
explicit compaction_task_executor(compaction_manager& mgr, throw_if_stopping do_throw_if_stopping, ::compaction::table_state* t, sstables::compaction_type type, sstring desc);
explicit compaction_task_executor(compaction_manager& mgr, throw_if_stopping do_throw_if_stopping, ::compaction::compaction_group_view* t, sstables::compaction_type type, sstring desc);
compaction_task_executor(compaction_task_executor&&) = delete;
compaction_task_executor(const compaction_task_executor&) = delete;
@@ -548,7 +556,7 @@ protected:
future<sstables::compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement&,
compaction_manager::can_purge_tombstones can_purge = compaction_manager::can_purge_tombstones::yes,
sstables::offstrategy offstrategy = sstables::offstrategy::no);
future<> update_history(::compaction::table_state& t, const sstables::compaction_result& res, const sstables::compaction_data& cdata);
future<> update_history(::compaction::compaction_group_view& t, sstables::compaction_result&& res, const sstables::compaction_data& cdata);
bool should_update_history(sstables::compaction_type ct) {
return ct == sstables::compaction_type::Compaction;
}
@@ -559,7 +567,7 @@ public:
future<compaction_manager::compaction_stats_opt> run_compaction() noexcept;
const ::compaction::table_state* compacting_table() const noexcept {
const ::compaction::compaction_group_view* compacting_table() const noexcept {
return _compacting_table;
}
@@ -595,7 +603,7 @@ private:
return _compaction_done.get_future();
}
future<sstables::sstable_set> sstable_set_for_tombstone_gc(::compaction::table_state& t);
future<sstables::sstable_set> sstable_set_for_tombstone_gc(::compaction::compaction_group_view& t);
public:
bool stopping() const noexcept {
return _compaction_data.abort.abort_requested();
@@ -616,7 +624,8 @@ public:
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args);
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);
friend fmt::formatter<compaction_task_executor>;
friend future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) noexcept;
friend void compaction_manager::stop_tasks(const std::vector<shared_ptr<compaction_task_executor>>& tasks, sstring reason) noexcept;
friend future<> compaction_manager::await_tasks(std::vector<shared_ptr<compaction_task_executor>>, bool task_stopped) const noexcept;
friend sstables::test_env_compaction_manager;
};
@@ -637,4 +646,4 @@ struct fmt::formatter<compaction::compaction_task_executor> {
bool needs_cleanup(const sstables::shared_sstable& sst, const dht::token_range_vector& owned_ranges);
// Return all sstables but those that are off-strategy like the ones in maintenance set and staging dir.
std::vector<sstables::shared_sstable> in_strategy_sstables(compaction::table_state& table_s);
future<std::vector<sstables::shared_sstable>> in_strategy_sstables(compaction::compaction_group_view& table_s);


@@ -0,0 +1,40 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/gate.hh>
class compaction_manager;
namespace compaction {
class compaction_group_view;
class compaction_state;
}
class compaction_reenabler {
compaction_manager& _cm;
compaction::compaction_group_view* _table;
compaction::compaction_state& _compaction_state;
seastar::gate::holder _holder;
public:
compaction_reenabler(compaction_manager&, compaction::compaction_group_view&);
compaction_reenabler(compaction_reenabler&&) noexcept;
~compaction_reenabler();
compaction::compaction_group_view* compacting_table() const noexcept {
return _table;
}
const compaction::compaction_state& compaction_state() const noexcept {
return _compaction_state;
}
};
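
The header above only declares `compaction_reenabler`; its definition is not part of this hunk. As a minimal sketch, assuming it follows the usual RAII pattern its name suggests (disable compaction while the object lives, re-enable it when the last owner goes away), the idea could look like the following. `toy_compaction_state` and `toy_reenabler` are hypothetical names for illustration, not ScyllaDB types, and the gate holder is left out.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the per-group state; only the counter matters here.
struct toy_compaction_state {
    long compaction_disabled_counter = 0;
    bool compaction_enabled() const noexcept { return compaction_disabled_counter == 0; }
};

// Move-only guard: raises the counter while alive, lowers it on destruction.
class toy_reenabler {
    toy_compaction_state* _state;  // null once this object has been moved from
public:
    explicit toy_reenabler(toy_compaction_state& s) noexcept : _state(&s) {
        ++_state->compaction_disabled_counter;
    }
    // Moving transfers responsibility for the decrement to the new object.
    toy_reenabler(toy_reenabler&& o) noexcept : _state(std::exchange(o._state, nullptr)) {}
    toy_reenabler(const toy_reenabler&) = delete;
    toy_reenabler& operator=(const toy_reenabler&) = delete;
    toy_reenabler& operator=(toy_reenabler&&) = delete;
    ~toy_reenabler() {
        if (_state) {
            assert(_state->compaction_disabled_counter > 0);
            --_state->compaction_disabled_counter;
        }
    }
};
```

A counter rather than a boolean lets nested or concurrent disablers coexist: compaction counts as enabled again only once every guard has been destroyed.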


@@ -19,28 +19,41 @@
namespace compaction {
// There's a 1:1 relationship between compaction_group_view and compaction_state.
// Two or more compaction_group_view can be served by the same instance of sstable::sstable_set,
// so it's not safe to track any sstable state here.
struct compaction_state {
// Used both by compaction tasks that refer to the compaction_state
// and by any function running under run_with_compaction_disabled().
seastar::gate gate;
seastar::named_gate gate;
// Prevents table from running major and minor compaction at the same time.
// Used for synchronizing selection of sstable for compaction.
// Write lock is held when getting sstable list, feeding them into strategy, and registering compacting sstables.
// The lock prevents two concurrent compaction tasks from picking the same sstables. And it also helps major
// to synchronize with minor, such that major doesn't miss any sstable.
seastar::rwlock lock;
// Compactions like major need to work on all sstables in the unrepaired
// set, no matter if the sstable is being repaired or not. The
// incremental_repair_lock lock is introduced to serialize repair and such
// compactions. This lock guarantees that no sstables are being repaired.
// Note that the minor compactions do not need to take this lock because
// they ignore sstables that are being repaired.
seastar::rwlock incremental_repair_lock;
// Raised by any function running under run_with_compaction_disabled();
long compaction_disabled_counter = 0;
// Signaled whenever a compaction task completes.
condition_variable compaction_done;
std::optional<compaction_backlog_tracker> backlog_tracker;
// Used only with vnodes, will not work with tablets. Can be removed once vnodes are gone.
std::unordered_set<sstables::shared_sstable> sstables_requiring_cleanup;
compaction::owned_ranges_ptr owned_ranges_ptr;
gc_clock::time_point last_regular_compaction;
explicit compaction_state(table_state& t);
explicit compaction_state(compaction_group_view& t);
compaction_state(compaction_state&&) = delete;
~compaction_state();
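
The comment on `lock` says the write side is held while the sstable list is fetched, fed to the strategy, and registered as compacting, so that two tasks never pick the same sstables. A rough sketch of that selection step, assuming a plain `std::mutex` in place of `seastar::rwlock` and integer ids in place of `shared_sstable` (`toy_group` and `pick_and_register` are hypothetical names), might be:

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <vector>

struct toy_group {
    std::mutex selection_lock;   // stand-in for the write side of the rwlock
    std::set<int> all_sstables;  // ids of sstables owned by the group
    std::set<int> compacting;    // ids already registered by a running task
};

// Pick up to `max` sstables that no task is compacting yet and register them.
// Selection and registration happen under one lock, so they are atomic
// with respect to other callers.
std::vector<int> pick_and_register(toy_group& g, std::size_t max) {
    std::lock_guard<std::mutex> guard(g.selection_lock);
    std::vector<int> picked;
    for (int id : g.all_sstables) {
        if (picked.size() == max) {
            break;
        }
        // insert() reports whether the id was newly added, i.e. not yet compacting.
        if (g.compacting.insert(id).second) {
            picked.push_back(id);
        }
    }
    return picked;
}
```

If registration happened after the lock was released, a second caller could see the same candidates in the gap, which is the race the comment describes.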


@@ -46,7 +46,7 @@ compaction_descriptor compaction_strategy_impl::make_major_compaction_job(std::v
return compaction_descriptor(std::move(candidates), level, max_sstable_bytes);
}
std::vector<compaction_descriptor> compaction_strategy_impl::get_cleanup_compaction_jobs(table_state& table_s, std::vector<shared_sstable> candidates) const {
std::vector<compaction_descriptor> compaction_strategy_impl::get_cleanup_compaction_jobs(compaction_group_view& table_s, std::vector<shared_sstable> candidates) const {
// The default implementation is suboptimal and causes the writeamp problem described in issue #10097.
// The compaction strategy relying on it should strive to implement its own method, to make cleanup bucket aware.
return candidates | std::views::transform([] (const shared_sstable& sst) {
@@ -55,7 +55,7 @@ std::vector<compaction_descriptor> compaction_strategy_impl::get_cleanup_compact
}) | std::ranges::to<std::vector>();
}
bool compaction_strategy_impl::worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time, const table_state& t) {
bool compaction_strategy_impl::worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time, const compaction_group_view& t) {
if (_disable_tombstone_compaction) {
return false;
}
@@ -77,7 +77,7 @@ uint64_t compaction_strategy_impl::adjust_partition_estimate(const mutation_sour
return partition_estimate;
}
reader_consumer_v2 compaction_strategy_impl::make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const {
mutation_reader_consumer compaction_strategy_impl::make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {
return end_consumer;
}
@@ -581,12 +581,12 @@ struct null_backlog_tracker final : public compaction_backlog_tracker::impl {
//
class null_compaction_strategy : public compaction_strategy_impl {
public:
virtual compaction_descriptor get_sstables_for_compaction(table_state& table_s, strategy_control& control) override {
return sstables::compaction_descriptor();
virtual future<compaction_descriptor> get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) override {
return make_ready_future<sstables::compaction_descriptor>();
}
virtual int64_t estimated_pending_compactions(table_state& table_s) const override {
return 0;
virtual future<int64_t> estimated_pending_compactions(compaction_group_view& table_s) const override {
return make_ready_future<int64_t>(0);
}
virtual compaction_strategy_type type() const override {
@@ -700,19 +700,19 @@ compaction_strategy_type compaction_strategy::type() const {
return _compaction_strategy_impl->type();
}
compaction_descriptor compaction_strategy::get_sstables_for_compaction(table_state& table_s, strategy_control& control) {
future<compaction_descriptor> compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
return _compaction_strategy_impl->get_sstables_for_compaction(table_s, control);
}
compaction_descriptor compaction_strategy::get_major_compaction_job(table_state& table_s, std::vector<sstables::shared_sstable> candidates) {
compaction_descriptor compaction_strategy::get_major_compaction_job(compaction_group_view& table_s, std::vector<sstables::shared_sstable> candidates) {
return _compaction_strategy_impl->get_major_compaction_job(table_s, std::move(candidates));
}
std::vector<compaction_descriptor> compaction_strategy::get_cleanup_compaction_jobs(table_state& table_s, std::vector<shared_sstable> candidates) const {
std::vector<compaction_descriptor> compaction_strategy::get_cleanup_compaction_jobs(compaction_group_view& table_s, std::vector<shared_sstable> candidates) const {
return _compaction_strategy_impl->get_cleanup_compaction_jobs(table_s, std::move(candidates));
}
void compaction_strategy::notify_completion(table_state& table_s, const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added) {
void compaction_strategy::notify_completion(compaction_group_view& table_s, const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added) {
_compaction_strategy_impl->notify_completion(table_s, removed, added);
}
@@ -720,7 +720,7 @@ bool compaction_strategy::parallel_compaction() const {
return _compaction_strategy_impl->parallel_compaction();
}
int64_t compaction_strategy::estimated_pending_compactions(table_state& table_s) const {
future<int64_t> compaction_strategy::estimated_pending_compactions(compaction_group_view& table_s) const {
return _compaction_strategy_impl->estimated_pending_compactions(table_s);
}
@@ -741,7 +741,7 @@ uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_me
return _compaction_strategy_impl->adjust_partition_estimate(ms_meta, partition_estimate, std::move(schema));
}
reader_consumer_v2 compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const {
mutation_reader_consumer compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {
return _compaction_strategy_impl->make_interposer_consumer(ms_meta, std::move(end_consumer));
}
@@ -789,8 +789,8 @@ future<reshape_config> make_reshape_config(const sstables::storage& storage, res
};
}
std::unique_ptr<sstable_set_impl> incremental_compaction_strategy::make_sstable_set(schema_ptr schema) const {
return std::make_unique<partitioned_sstable_set>(std::move(schema), false);
std::unique_ptr<sstable_set_impl> incremental_compaction_strategy::make_sstable_set(const compaction_group_view& ts) const {
return std::make_unique<partitioned_sstable_set>(ts.schema(), ts.token_range());
}
}
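
Several hunks in this file change strategy methods from returning a value to returning a future, with trivial implementations such as `null_compaction_strategy` wrapping their result in a ready future. The shape of that change can be sketched with the standard library, assuming `std::future` as a stand-in for `seastar::future` (the two are not interchangeable) and hypothetical `toy_*` names:

```cpp
#include <cstdint>
#include <future>
#include <utility>

// Rough analogue of a ready-future helper, built on std::promise.
template <typename T>
std::future<T> make_ready(T value) {
    std::promise<T> p;
    p.set_value(std::move(value));
    return p.get_future();  // the shared state keeps the value after p is destroyed
}

struct toy_strategy {
    virtual ~toy_strategy() = default;
    // Previously this would have returned std::int64_t directly.
    virtual std::future<std::int64_t> estimated_pending_compactions() const = 0;
};

// A strategy with nothing to compute still has to satisfy the future-based
// interface, so it returns an already-resolved future.
struct toy_null_strategy final : toy_strategy {
    std::future<std::int64_t> estimated_pending_compactions() const override {
        return make_ready<std::int64_t>(0);
    }
};
```

Callers pay for the change by having to wait on or chain the result, but implementations that need to do asynchronous work while estimating can now do so without blocking.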


@@ -12,7 +12,7 @@
#include "sstables/shared_sstable.hh"
#include "exceptions/exceptions.hh"
#include "compaction_strategy_type.hh"
#include "table_state.hh"
#include "compaction_group_view.hh"
#include "strategy_control.hh"
struct mutation_source_metadata;
@@ -41,15 +41,15 @@ public:
compaction_strategy& operator=(compaction_strategy&&);
// Return a list of sstables to be compacted after applying the strategy.
compaction_descriptor get_sstables_for_compaction(table_state& table_s, strategy_control& control);
future<compaction_descriptor> get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control);
compaction_descriptor get_major_compaction_job(table_state& table_s, std::vector<shared_sstable> candidates);
compaction_descriptor get_major_compaction_job(compaction_group_view& table_s, std::vector<shared_sstable> candidates);
std::vector<compaction_descriptor> get_cleanup_compaction_jobs(table_state& table_s, std::vector<shared_sstable> candidates) const;
std::vector<compaction_descriptor> get_cleanup_compaction_jobs(compaction_group_view& table_s, std::vector<shared_sstable> candidates) const;
// Some strategies may look at the compacted and resulting sstables to
// get some useful information for subsequent compactions.
void notify_completion(table_state& table_s, const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added);
void notify_completion(compaction_group_view& table_s, const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added);
// Return if parallel compaction is allowed by strategy.
bool parallel_compaction() const;
@@ -58,7 +58,7 @@ public:
bool use_clustering_key_filter() const;
// An estimate of the number of compactions needed for the strategy to be satisfied.
int64_t estimated_pending_compactions(table_state& table_s) const;
future<int64_t> estimated_pending_compactions(compaction_group_view& table_s) const;
static sstring name(compaction_strategy_type type) {
switch (type) {
@@ -105,13 +105,13 @@ public:
return name(type());
}
sstable_set make_sstable_set(schema_ptr schema) const;
sstable_set make_sstable_set(const compaction_group_view& ts) const;
compaction_backlog_tracker make_backlog_tracker() const;
uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr) const;
reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const;
mutation_reader_consumer make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const;
// Returns whether or not interposer consumer is used by a given strategy.
bool use_interposer_consumer() const;


@@ -45,18 +45,18 @@ protected:
uint64_t max_sstable_bytes = compaction_descriptor::default_max_sstable_bytes);
public:
virtual ~compaction_strategy_impl() {}
virtual compaction_descriptor get_sstables_for_compaction(table_state& table_s, strategy_control& control) = 0;
virtual compaction_descriptor get_major_compaction_job(table_state& table_s, std::vector<sstables::shared_sstable> candidates) {
virtual future<compaction_descriptor> get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) = 0;
virtual compaction_descriptor get_major_compaction_job(compaction_group_view& table_s, std::vector<sstables::shared_sstable> candidates) {
return make_major_compaction_job(std::move(candidates));
}
virtual std::vector<compaction_descriptor> get_cleanup_compaction_jobs(table_state& table_s, std::vector<shared_sstable> candidates) const;
virtual void notify_completion(table_state& table_s, const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added) { }
virtual std::vector<compaction_descriptor> get_cleanup_compaction_jobs(compaction_group_view& table_s, std::vector<shared_sstable> candidates) const;
virtual void notify_completion(compaction_group_view& table_s, const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added) { }
virtual compaction_strategy_type type() const = 0;
virtual bool parallel_compaction() const {
return true;
}
virtual int64_t estimated_pending_compactions(table_state& table_s) const = 0;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const;
virtual future<int64_t> estimated_pending_compactions(compaction_group_view& table_s) const = 0;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const compaction_group_view& ts) const;
bool use_clustering_key_filter() const {
return _use_clustering_key_filter;
@@ -64,7 +64,7 @@ public:
// Check if a given sstable is eligible for tombstone compaction based on its
// droppable tombstone histogram and gc_before.
bool worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time, const table_state& t);
bool worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time, const compaction_group_view& t);
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const = 0;
@@ -82,7 +82,7 @@ public:
/// @return A new functor that wraps the end consumer with additional processing capabilities
/// @note The returned functor preserves the original consumer's semantics while allowing
/// preprocessing of data
virtual reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const;
virtual mutation_reader_consumer make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const;
virtual bool use_interposer_consumer() const {
return false;

Some files were not shown because too many files have changed in this diff.