This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.
* New feature, no backport required
Closesscylladb/scylladb#22090
* github.com:scylladb/scylladb:
tablets: resize_decision: get rid of initial_decision
tablet_allocator: consider tablet options for resize decision
tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
cql3/create_keyspace_statement: add deprecation warning for initial tablets
test: cqlpy: test_tablets: add tests for per-table tablet options
schema: add per-table tablet options
feature_service: add TABLET_OPTIONS cluster schema feature
It's possible to modify 'memtable_flush_period_in_ms' option only and as
single option, not with any other options together
Refs #20999Fixes#21223Closesscylladb/scylladb#22536
Do not merge tablets if that would drop the tablet_count
below the minimum provided by hints.
Split tablets if the current tablet_count is less than
the minimum tablet count calculated using the table's tablet options.
TODO: override min_tablet_count if the tablet count per shard
is greater than the maximum allowed. In this case
the tables tablet counts should be scaled down proportionally.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than target_max_tablet_size. We need both the target
as well as max and min tablet sizes, so there is no sense
in keeping the max and deriving the target and the minimum
for the max value.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the keyspace initial_tablets for min_tablet_count, if the latter
isn't set, then take the maximum of the option-based tablet counts:
- min_tablet_count
- and expected_data_size_in_gb / target_tablet_size
- min_per_shard_tablet_count (via
calculate_initial_tablets_from_topology)
If none of the hints produce a positive tablet_count,
fall back to calculate_initial_tablets_from_topology * initial_scale.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.
Fixes: #22620Closesscylladb/scylladb#22622
Currently, the session ID under which the truncate for tablets request is
running is created during the request creation and queuing. This is a problem
because this could overwrite the session ID of any ongoing operation on
system.topology#session
This change moves the creation of the session ID for truncate from the request
creation to the request handling.
Fixes#22613Closesscylladb/scylladb#22615
task_stats contains short info about a task. To get a list of task_stats
in the module, one needs to request /task_manager/list_module_tasks/{module}.
To make identification and navigation between tasks easier, extend
task_stats to contain shard, start_time, and end_time.
Closesscylladb/scylladb#22351
tablet_repair_task_impl is run as a part of tablet repair. Make it
a child of tablet repair virtual task.
tablet_repair_task_impl started by /storage_service/repair_async API
(vnode repair) does not have a parent, as it is the top-level task
in that case.
No backport needed; new functionality
Closesscylladb/scylladb#22372
* github.com:scylladb/scylladb:
test: add test to check tablet repair child
service: add child for tablet repair virtual task
Currently, when the tablet repair is started, info regarding
the operation is kept in the system.tablets. The new tablet states
are reflected in memory after load_topology_state is called.
Before that, the data in the table and the memory aren't consistent.
To check the supported operations, tablet_virtual_task uses in-memory
tablet_metadata. Hence, it may not see the operation, even though
its info is already kept in system.tablets table.
Run read barrier in tablet_virtual_task::contains to ensure it will
see the latest data. Add a test to check it.
Fixes: #21975.
Closesscylladb/scylladb#21995
tablet_repair_task_impl is run as a part of tablet repair. Make it
a child of tablet repair virtual task.
tablet_repair_task_impl started by /storage_service/repair_async API
(vnode repair) does not have a parent, as it is the top-level task
in that case.
Repair service is started after storage service, while storage service needs to reference repair one for its needs. Recently it was noticed, that this reverse order may cause troubles and was fixed with the help of an extra gate. That's not nice and makes the start-stop mess even worse. The correct fix is to fix the order both services start/stop in.
Closesscylladb/scylladb#22368
* github.com:scylladb/scylladb:
Revert "repair: add repair_service gate"
main: Start repair before storage service
repair: Check for sharded<view-builder> when constructing row_level_repair
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated. The series factors out peers table update code from
sync_raft_topology_nodes() and calls it on topology and ip address
updates. As a side effect it fixes#22293 since now topology loading
does not require IP do be present, so the assert that is triggered in
this bug is removed.
Fixes: scylladb/scylladb#22293Closesscylladb/scylladb#22519
* github.com:scylladb/scylladb:
topology coordinator: do not update topology on address change
topology coordinator: split out the peer table update functionality from raft state application
Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in `topology_coordinator::handle_global_request()` while `topology::tstate` remains empty. This creates problems because `topology::is_busy()` uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing.
This change introduces a new topology transition `topology::transition_state::truncate_table` and moves the truncate logic to a new method `topology_coordinator::handle_truncate_table()`. This method is now called as a handler of the `truncate_table` transition state instead of a handler of the `trunacate_table` global topology request.
This PR is a bugfix for truncate with tables and needs to be backported to 2025.1
Closesscylladb/scylladb#22452
* github.com:scylladb/scylladb:
truncate: trigger truncate logic from transition state instead of global request handler
truncate: add truncate_table transition state
It is almost always a bad idea to run removenode force. This means a
node is removed without the remaining nodes to stream data that they
should own after the removal. This will make the cluster into a worse
state than a node being down.
One can use one of the following procedure instead:
1) Fix the dead node and move it back to the cluster
2) Run replace ops to replace the dead node
3) Run removenode ops again
We have seen misuse of nodetool removenode force by users again and
again. This patch rejects it so it can not be misused anymore.
Fixesscylladb/scylladb#15833Closesscylladb/scylladb#15834
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated, so call a function that does peers table update only.
Raft topology state application does two things: re-creates token metadata
and updates peers table if needed. The code for both task is intermixed
now. The patch separates it into separate functions. Will be needed in
the next patch.
There's such a reference on storage_service itself, it can use this->_sys_dist_ks instead thus making its API (both internal and external) a bit simpler.
Closesscylladb/scylladb#22483
* github.com:scylladb/scylladb:
storage_service: Drop sys_dist_ks argument from track_upgrade_progress_to_topology_coordinator()
storage_service: Drop sys_dist_ks argument from raft_state_monitor_fiber()
storage_service: Drop sys_dist_ks argument from join_topology()
storage_service: Drop sys_dist_ks argument from join_cluster()
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22467
This reverts commit 32ab58cdea.
Now repair service starts before and stops after storage server, so the
problem described in the commit is no longer relevant.
request handler
Before this change, the logic of truncate for tablets was triggered from
topology_coordinator::handle_global_request(). This was done without
using a topology transition state which remained empty throughout the
truncate handler's execution.
This change moves the truncate logic to a new method
topology_coordinator::handle_truncate_table(). This method is now called
as a handler of the truncate_table topology transition state instead of
a handler of the trunacate_table global topology request.
Truncate table for tablets is implemented as a global topology operation.
However, it does not have a transition state associated with it, and
performs the truncate logic in handle_global_request() while
topology::tstate remains empty. This creates problems because
topology::is_busy() uses transition_state to determine if the topology
state machine is busy, and will return false even though a truncate
operation is ongoing.
This change adds a new transition state: truncate_table
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.
To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.
In this series we also make sure that the view schemas in the registry are
kept up-to-date in regards to base schema changes.
Fixes https://github.com/scylladb/scylladb/issues/21354
This issue is a bug, so adding backport labels 6.1 and 6.2
Closesscylladb/scylladb#21862
* github.com:scylladb/scylladb:
test: add test for schema registry maintaining base info for views
schema_registry: avoid setting base info when getting the schema from registry
schema_registry: update cached base schemas when updating a view
schema_registry: cache base schemas for views
db: set base info before adding schema to registry
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.
Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.
Fixes#17507
New feature. No backport is needed.
Closesscylladb/scylladb#21896
* github.com:scylladb/scylladb:
repair: Stop using rpc to update repair time for repairs scheduled by scheduler
repair: Wire repair_time in system.tablets for tombstone gc
test: Disable flush_cache_time for two tablet repair tests
test: Introduce guarantee_repair_time_next_second helper
repair: Return repair time for repair_service::repair_tablet
service: Add tablet_operation.hh
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22034
Replace remaining uses of boost::adaptors::transformed with std::views::transform
to reduce Boost dependencies, following the migration pattern established in
bab12e3a. This change addresses recently merged code that reintroduced Boost
header dependencies through boost::adaptors::transformed usage.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22365
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).
Users can see running resize tasks - finished tasks are not
presented with the task manager API.
A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.
Fixes: #21366.
Fixes: #21367.
No backport, new feature
Closesscylladb/scylladb#21891
* github.com:scylladb/scylladb:
test: boost: check resize_task_info in tablet_test.cc
test: add tests to check revoked resize virtual tasks
test: add tests to check the list of resize virtual tasks
test: add tests to check spilt and merge virtual tasks status
test: test_tablet_tasks: generalize functions
replica: service: add split virtual task's children
replica: service: pass parent info down to storage_group::split
tasks: children of virtual tasks aren't internal by default
tasks: initialize shard in task_info ctor
service: extend tablet_virtual_task::abort
service: retrun status_helper struct from tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::wait
tasks: add suspended task state
service: extend tablet_virtual_task::get_status
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: add service::task_manager_module::get_nodes
tasks: add task_manager::get_nodes
tasks: drop noexcept from module::get_nodes
replica: service: add resize_task_info static column to system.tablets
locator: extend tablet_task_info to cover resize tasks
Introduces a comprehensive audit system to track database operations for security
and compliance purposes. This change includes:
Core Components:
- New audit subsystem for logging database operations
- Service level integration for proper resource management
- CQL statement tracking with operation categories
- Login process integration for tenant management
Key Features:
- Configurable audit logging (syslog/table)
- Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN)
- Selective auditing by keyspace/table
- Password sanitization in audit logs
- Service level shares support (1-1000) for workload prioritization
- Proper lifecycle management and cleanup
I ran the dtests for audit (manually enabled) and they pass.
The in-repo tests pass.
Notably, there should be no non-whitespace changes between this and scylla-enterprise
Fixesscylladb/scylla-enterprise#4999Closesscylladb/scylladb#22147
* github.com:scylladb/scylladb:
audit: Add shares support to service level management
audit: Add service level support to CQL login process
audit: Add support to CQL statements
audit: Integrate audit subsystem into Scylla main process
audit: Add documentation for the audit subsystem
audit: Add the audit subsystem
Now that all topology related code uses host ids there is not point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.
Fixes: scylladb/scylla#21777
* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
hint manager: do not translate ip to id in case hint manager is stopped already
locator: token_metadata: drop update_host_id() function that does nothing now
locator: topology: drop indexing by ips
repair: drop unneeded code
storage_service: use host_id to look for a node in on_alive handler
storage_proxy: translate ips to ids in forward array using gossiper
locator: topology: remove unused functions
storage_service: check for outdated ip in on_change notification in the peers table
storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
topology coordinator: change connection dropping code to work on host ids
cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
locator: drop unused function from tablet_effective_replication_map
api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
locator: token_metadata: remove unused ip based functions
locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
gossiper: drop get_unreachable_token_owners functions
storage_service: use gossiper to map ip to id in node_ops operations
storage_service: fix indentation after the last patch
storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
token_metadata: drop no longer used functions
...
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.
Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.
Fixes#17507
The code checks that it does not run for an ip address that is no longer
in use (after ip address change). To check that we can use peers table
and see if the host id is mapped to the address. If yes, this is the
latest address for this host id otherwise this is an outdated entry.
It is used by truncate code only and even there it only check if the
returned set is not empty. Check for dead token owners in the truncation
code directly.
Replace operation is special though. In case of replacing with the same
IP the gossiper will not have the mapping, and node_ops RPC
unfortunately does not send host id of a replaced node. For replace we
consult peers table instead to find the old owner of the IP. A node that
is replacing (the coordinator of the replace) will not have it though,
but luckily it is not needed since it updates metadata during
join_topology() anyway. The only thing that is missing there is
add_replacing_endpoint() call which the patch adds.
host_id_or_endpoint is a helper class that hold either id or ip and
translate one into another on demand. Use gossiper to do a translation
there instead of token_metadata since we want to drop ip based APIs from
the later.