scylla

Author	SHA1	Message	Date
Avi Kivity	9712390336	Merge 'Add per-table tablet options in schema' from Benny Halevy This series extends the table schema with per-table tablet options. The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions, when the table size changes. * New feature, no backport required Closes scylladb/scylladb#22090 * github.com:scylladb/scylladb: tablets: resize_decision: get rid of initial_decision tablet_allocator: consider tablet options for resize decision tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member network_topology_strategy: allocate_tablets_for_new_table: consider tablet options network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0 cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0 cql3/create_keyspace_statement: add deprecation warning for initial tablets test: cqlpy: test_tablets: add tests for per-table tablet options schema: add per-table tablet options feature_service: add TABLET_OPTIONS cluster schema feature	2025-02-08 20:32:19 +02:00
Alexey Novikov	cc35905531	Allow to use memtable_flush_period_in_ms schema option for system tables The 'memtable_flush_period_in_ms' option can be modified only on its own, as a single option, not together with any other options. Refs #20999 Fixes #21223 Closes scylladb/scylladb#22536	2025-02-07 10:33:05 +02:00
Benny Halevy	20c6ca2813	tablet_allocator: consider tablet options for resize decision Do not merge tablets if that would drop the tablet_count below the minimum provided by hints. Split tablets if the current tablet_count is less than the minimum tablet count calculated using the table's tablet options. TODO: override min_tablet_count if the tablet count per shard is greater than the maximum allowed. In this case the tables' tablet counts should be scaled down proportionally. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 18:43:35 +02:00
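The merge and split rules described in the commit above can be sketched roughly as follows. This is an illustrative Python model, not the load balancer's C++ code: the function name `resize_decision`, the string results, the two boolean inputs standing in for the size-based decision, and the assumption that a merge halves the tablet count are all hypothetical.

```python
def resize_decision(tablet_count, min_tablet_count, wants_merge, wants_split):
    """Return "split", "merge", or "none" for one table.

    wants_merge / wants_split are hypothetical stand-ins for the
    size-based decision; min_tablet_count is the minimum derived from
    the table's tablet options (0 when no hint is set).
    """
    # Split if the table has fewer tablets than the hinted minimum.
    if tablet_count < min_tablet_count:
        return "split"
    if wants_split:
        return "split"
    # Assumption: a merge halves the tablet count. Do not merge if that
    # would drop the count below the hinted minimum.
    if wants_merge and tablet_count > 1 and tablet_count // 2 >= min_tablet_count:
        return "merge"
    return "none"
```

With a hinted minimum of 8, a table with 8 tablets is never merged in this model, while a table with 16 tablets still can be.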
Benny Halevy	559f083dc6	tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member Rather than target_max_tablet_size. We need the target as well as the max and min tablet sizes, so there is no sense in keeping the max and deriving the target and the minimum from the max value. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:32 +02:00
Benny Halevy	32c2f7579f	network_topology_strategy: allocate_tablets_for_new_table: consider tablet options Use the keyspace initial_tablets for min_tablet_count if the latter isn't set, then take the maximum of the option-based tablet counts: - min_tablet_count - expected_data_size_in_gb / target_tablet_size - min_per_shard_tablet_count (via calculate_initial_tablets_from_topology) If none of the hints produces a positive tablet_count, fall back to calculate_initial_tablets_from_topology * initial_scale. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:32 +02:00
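The selection order described in the commit above can be sketched as follows. This is a minimal Python model under stated assumptions, not the actual network_topology_strategy code: the function and parameter names are hypothetical, the two topology-derived counts are passed in precomputed instead of calling calculate_initial_tablets_from_topology, both sizes are taken in the same unit (GB), and rounding the size-based count up is an assumption.

```python
import math


def initial_tablet_count(min_tablet_count, keyspace_initial_tablets,
                         expected_data_size_in_gb, target_tablet_size_in_gb,
                         per_shard_hint_count, topology_default_count,
                         initial_scale):
    """Pick a tablet count for a new table from its hints (0 means unset).

    per_shard_hint_count: hypothetical precomputed count derived from
    min_per_shard_tablet_count and the topology.
    topology_default_count: hypothetical precomputed topology-based
    default used by the fallback.
    """
    # Keyspace initial_tablets stands in for an unset min_tablet_count.
    if min_tablet_count <= 0:
        min_tablet_count = keyspace_initial_tablets

    # Size-based hint; rounding up is an assumption of this sketch.
    by_size = 0
    if expected_data_size_in_gb > 0:
        by_size = math.ceil(expected_data_size_in_gb / target_tablet_size_in_gb)

    count = max(min_tablet_count, by_size, per_shard_hint_count)
    if count <= 0:
        # No hint produced a positive count: fall back to the
        # topology-based default scaled by initial_scale.
        count = topology_default_count * initial_scale
    return count
```

For example, with no minimum set, a keyspace initial_tablets of 4, and an expected size of 100 GB at a 5 GB target, the size-based hint (20) wins over the keyspace value.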
Botond Dénes	7ce932ce01	service: query_pager: fix last-position for filtering queries On short-pages, cut short because of a tombstone prefix. When page-results are filtered and the filter drops some rows, the last-position is taken from the page visitor, which does the filtering. This means that last partition and row position will be that of the last row the filter saw. This will not match the last position of the replica, when the replica cut the page due to tombstones. When fetching the next page, this means that all the tombstone suffix of the last page, will be re-fetched. Worse still: the last position of the next page will not match that of the saved reader left on the replica, so the saved reader will be dropped and a new one created from scratch. This wasted work will show up as elevated tail latencies. Fix by always taking the last position from raw query results. Fixes: #22620 Closes scylladb/scylladb#22622	2025-02-05 17:23:30 +02:00
Ferenc Szili	a59618e83d	truncate: create session during request handling Currently, the session ID under which the truncate for tablets request is running is created during the request creation and queuing. This is a problem because this could overwrite the session ID of any ongoing operation on system.topology#session This change moves the creation of the session ID for truncate from the request creation to the request handling. Fixes #22613 Closes scylladb/scylladb#22615	2025-02-04 22:11:24 +01:00
Aleksandra Martyniuk	683176d3db	tasks: add shard, start_time, and end_time to task_stats task_stats contains short info about a task. To get a list of task_stats in the module, one needs to request /task_manager/list_module_tasks/{module}. To make identification and navigation between tasks easier, extend task_stats to contain shard, start_time, and end_time. Closes scylladb/scylladb#22351	2025-02-04 12:11:24 +02:00
Botond Dénes	8c8db2052e	Merge 'service: add child for tablet repair virtual task' from Aleksandra Martyniuk tablet_repair_task_impl is run as a part of tablet repair. Make it a child of tablet repair virtual task. tablet_repair_task_impl started by /storage_service/repair_async API (vnode repair) does not have a parent, as it is the top-level task in that case. No backport needed; new functionality Closes scylladb/scylladb#22372 * github.com:scylladb/scylladb: test: add test to check tablet repair child service: add child for tablet repair virtual task	2025-02-04 12:08:24 +02:00
Aleksandra Martyniuk	610a761ca2	service: use read barrier in tablet_virtual_task::contains Currently, when the tablet repair is started, info regarding the operation is kept in the system.tablets. The new tablet states are reflected in memory after load_topology_state is called. Before that, the data in the table and the memory aren't consistent. To check the supported operations, tablet_virtual_task uses in-memory tablet_metadata. Hence, it may not see the operation, even though its info is already kept in system.tablets table. Run read barrier in tablet_virtual_task::contains to ensure it will see the latest data. Add a test to check it. Fixes: #21975. Closes scylladb/scylladb#21995	2025-02-04 12:07:42 +02:00
Aleksandra Martyniuk	c23ce40f50	service: add child for tablet repair virtual task tablet_repair_task_impl is run as a part of tablet repair. Make it a child of tablet repair virtual task. tablet_repair_task_impl started by /storage_service/repair_async API (vnode repair) does not have a parent, as it is the top-level task in that case.	2025-02-03 10:31:14 +01:00
Botond Dénes	98fdf05b0e	Merge 'Fix repair vs storage services initialization order' from Pavel Emelyanov Repair service is started after storage service, while storage service needs to reference repair one for its needs. Recently it was noticed, that this reverse order may cause troubles and was fixed with the help of an extra gate. That's not nice and makes the start-stop mess even worse. The correct fix is to fix the order both services start/stop in. Closes scylladb/scylladb#22368 * github.com:scylladb/scylladb: Revert "repair: add repair_service gate" main: Start repair before storage service repair: Check for sharded<view-builder> when constructing row_level_repair	2025-01-30 11:26:24 +02:00
Kamil Braun	add97ccc15	Merge 'Do not update topology on address change' from Gleb Natapov Since now topology does not contain ip addresses there is no need to create topology on an ip address change. Only peers table has to be updated. The series factors out peers table update code from sync_raft_topology_nodes() and calls it on topology and ip address updates. As a side effect it fixes #22293 since now topology loading does not require IP to be present, so the assert that is triggered in this bug is removed. Fixes: scylladb/scylladb#22293 Closes scylladb/scylladb#22519 * github.com:scylladb/scylladb: topology coordinator: do not update topology on address change topology coordinator: split out the peer table update functionality from raft state application	2025-01-28 12:52:29 +01:00
Tomasz Grabiec	50d9d5b98e	Merge 'truncate: trigger truncate logic from a transition state instead of global topology request' from Ferenc Szili Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in `topology_coordinator::handle_global_request()` while `topology::tstate` remains empty. This creates problems because `topology::is_busy()` uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing. This change introduces a new topology transition `topology::transition_state::truncate_table` and moves the truncate logic to a new method `topology_coordinator::handle_truncate_table()`. This method is now called as a handler of the `truncate_table` transition state instead of a handler of the `truncate_table` global topology request. This PR is a bugfix for truncate with tables and needs to be backported to 2025.1 Closes scylladb/scylladb#22452 * github.com:scylladb/scylladb: truncate: trigger truncate logic from transition state instead of global request handler truncate: add truncate_table transition state	2025-01-28 12:05:57 +01:00
Asias He	0ab64551c5	storage_service: Reject nodetool removenode force It is almost always a bad idea to run removenode force. It removes a node without the remaining nodes streaming the data that they should own after the removal. This puts the cluster into a worse state than a node being down. Use one of the following procedures instead: 1) Fix the dead node and move it back to the cluster 2) Run replace ops to replace the dead node 3) Run removenode ops again We have seen misuse of nodetool removenode force by users again and again. This patch rejects it so it cannot be misused anymore. Fixes scylladb/scylladb#15833 Closes scylladb/scylladb#15834	2025-01-27 14:50:18 +01:00
Gleb Natapov	fbfef6b28a	topology coordinator: do not update topology on address change Since now topology does not contain ip addresses there is no need to create topology on an ip address change. Only peers table has to be updated, so call a function that does peers table update only.	2025-01-26 17:49:05 +02:00
Gleb Natapov	ef929c5def	topology coordinator: split out the peer table update functionality from raft state application Raft topology state application does two things: re-creates token metadata and updates peers table if needed. The code for both tasks is intermixed now. The patch splits it into separate functions. Will be needed in the next patch.	2025-01-26 17:47:38 +02:00
Avi Kivity	60cdf62fae	Merge 'Remove sharded<system_distributed_keyspace>& argument from storage_service::join_cluster()' from Pavel Emelyanov There's such a reference on storage_service itself, it can use this->_sys_dist_ks instead thus making its API (both internal and external) a bit simpler. Closes scylladb/scylladb#22483 * github.com:scylladb/scylladb: storage_service: Drop sys_dist_ks argument from track_upgrade_progress_to_topology_coordinator() storage_service: Drop sys_dist_ks argument from raft_state_monitor_fiber() storage_service: Drop sys_dist_ks argument from join_topology() storage_service: Drop sys_dist_ks argument from join_cluster()	2025-01-26 15:56:37 +02:00
Asias He	4018dc7f0d	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragment - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migrate data to node 2 with the file stream. 
Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. 
- The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). 
The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. 
Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22467	2025-01-26 12:51:59 +02:00
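The ratios in the test summary of the commit above can be rechecked from the quoted log figures. The snippet below is only a sanity check of that arithmetic; the variable names are made up for illustration, and the KiB/s to MB/s conversion assumes 1 MB = 10^6 bytes.

```python
# Figures copied from the commit message's test results.
file_bytes = 1132006250          # file stream: bytes on wire
file_seconds = 1.08004           # file stream: migration time
file_mb_per_s = 836              # file stream: reported bandwidth

mutation_bytes = 3030004736      # mutation stream: bytes on wire
mutation_seconds = 23.5992       # mutation stream: migration time
mutation_kib_per_s = 125410.87   # mutation stream: reported bandwidth

# Convert KiB/s to MB/s (assumption: 1 MB = 10**6 bytes), then round
# to a whole number as the commit message does.
mutation_mb_per_s = round(mutation_kib_per_s * 1024 / 1e6)

bandwidth_gain = file_mb_per_s / mutation_mb_per_s
time_gain = mutation_seconds / file_seconds
bytes_gain = mutation_bytes / file_bytes

print(f"bandwidth: {bandwidth_gain:.2f}X")
print(f"time:      {time_gain:.2f}X")
print(f"bytes:     {bytes_gain:.4f}X")
```

The bandwidth and time ratios match the summary when rounded to two decimals; the bytes-on-wire ratio falls between 2.67 and 2.68.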
Pavel Emelyanov	856832911d	storage_service: Drop sys_dist_ks argument from track_upgrade_progress_to_topology_coordinator() It's an unused argument. The only caller is relaxed too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:29:40 +03:00
Pavel Emelyanov	1e93f51977	storage_service: Drop sys_dist_ks argument from raft_state_monitor_fiber() And the final drop of that kind -- switch to using this->_sys_dist_ks here too Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:29:03 +03:00
Pavel Emelyanov	248456cb9a	storage_service: Drop sys_dist_ks argument from join_topology() Similarly to previous patch, there's this->_sys_dist_ks thing Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:28:27 +03:00
Pavel Emelyanov	ca9b59f3b2	storage_service: Drop sys_dist_ks argument from join_cluster() Storage service has _sys_dist_ks onboard and can just use it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:26:32 +03:00
Pavel Emelyanov	4edd327c4f	Revert "repair: add repair_service gate" This reverts commit `32ab58cdea`. Now repair service starts before and stops after storage service, so the problem described in the commit is no longer relevant.	2025-01-22 19:25:56 +03:00
Ferenc Szili	9fa254e9a8	truncate: trigger truncate logic from transition state instead of global request handler Before this change, the logic of truncate for tablets was triggered from topology_coordinator::handle_global_request(). This was done without using a topology transition state which remained empty throughout the truncate handler's execution. This change moves the truncate logic to a new method topology_coordinator::handle_truncate_table(). This method is now called as a handler of the truncate_table topology transition state instead of a handler of the truncate_table global topology request.	2025-01-22 11:08:26 +01:00
Ferenc Szili	29ead7014e	truncate: add truncate_table transition state Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in handle_global_request() while topology::tstate remains empty. This creates problems because topology::is_busy() uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing. This change adds a new transition state: truncate_table	2025-01-22 10:44:36 +01:00
Avi Kivity	59d3a66d18	Revert "Introduce file stream for tablet" This reverts commit `8208688178`. It was contributed from enterprise, but is too different from the original for me to merge back.	2025-01-22 09:42:20 +02:00
Tomasz Grabiec	8059090a29	Merge 'Cache base info for view schemas in the schema registry' from Wojciech Mitros Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated due to all `schema_ptr`s temporarily dying. To fix this, this patch adds the base schema to the registry, alongside the view schema. We store just the frozen base schema, so that we can transfer it across shards. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. In this series we also make sure that the view schemas in the registry are kept up-to-date in regards to base schema changes. Fixes https://github.com/scylladb/scylladb/issues/21354 This issue is a bug, so adding backport labels 6.1 and 6.2 Closes scylladb/scylladb#21862 * github.com:scylladb/scylladb: test: add test for schema registry maintaining base info for views schema_registry: avoid setting base info when getting the schema from registry schema_registry: update cached base schemas when updating a view schema_registry: cache base schemas for views db: set base info before adding schema to registry	2025-01-21 00:17:54 +01:00
Tomasz Grabiec	c7f78edc78	Merge 'repair: Wire repair_time in system.tablets for tombstone gc' from Asias He The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507 New feature. No backport is needed. Closes scylladb/scylladb#21896 * github.com:scylladb/scylladb: repair: Stop using rpc to update repair time for repairs scheduled by scheduler repair: Wire repair_time in system.tablets for tombstone gc test: Disable flush_cache_time for two tablet repair tests test: Introduce guarantee_repair_time_next_second helper repair: Return repair time for repair_service::repair_tablet service: Add tablet_operation.hh	2025-01-20 18:08:49 +01:00
Benny Halevy	88ae067ddb	everywhere: add skeletal support for the in_memory_tables feature Forward-ported from scylla-enterprise. Note that the feature has been deprecated and the implementation is provided only for backward compatibility with pre-existing features and schema. Tested manually after adding the following to feature_service: ``` gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv }; ``` Launched a single-node cluster running 2023.1.10 ``` cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; cqlsh> create TABLE ks.test ( pk int PRIMARY KEY, val int ) WITH compaction = {'class': 'InMemoryCompactionStrategy'}; ``` log: ``` Scylla version 2023.1.10-0.20241227.21cffccc1ccd with build-id bd65b8399cb13b713a87e57fe333cfcabfd50be7 starting ... ... INFO 2024-12-27 19:45:16,563 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x600000f1b400[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName=ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.Int32Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class 
org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,in_memory=false,version=5529c631-c47a-11ef-bd1d-4295734ce5a8,droppedColumns={},collections={},indices={}] INFO 2024-12-27 19:45:16,564 [shard 0] schema_tables - Creating ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 ``` Upgraded to this branch and started scylla. Verified that ks.test was successfully loaded: log: ``` INFO 2024-12-27 19:48:58,115 [shard 0:main] init - Scylla version 6.3.0~dev-0.20241227.a64c6dfc153e with build-id f9496134a09cf2e55d3865b9e9ff499f672aa7da starting ... ... WARN 2024-12-27 19:53:02,948 [shard 1:main] CompactionStrategy - InMemoryCompactionStrategy is no longer supported. Defaulting to NullCompactionStrategy. ... 
INFO 2024-12-27 19:53:02,948 [shard 0:main] database - Keyspace ks: Reading CF test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 storage=/home/bhalevy/scylladb/data/ks/test-5529c630c47a11efbd1d4295734ce5a8 ``` Then, tested: ``` cqlsh> describe KEYSPACE ks; CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.test ( pk int, val int, PRIMARY KEY (pk) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'InMemoryCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE'; cqlsh> alter TABLE ks.test with compaction = {'class': 'SizeTieredCompactionStrategy'}; cqlsh> describe KEYSPACE ks; CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.test ( pk int, val int, PRIMARY KEY (pk) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'SizeTieredCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE' AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'}; ``` log: ``` INFO 2024-12-27 19:56:40,465 [shard 0:stmt] 
migration_manager - Update table 'ks.test' From org.apache.cassandra.config.CFMetaData@0x60000362d800[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ec88d510-6aff-344a-914d-541d37081440,droppedColumns={},collections={},indices={}] To org.apache.cassandra.config.CFMetaData@0x60000336e000[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, 
droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ecccf010-c47b-11ef-b52c-622f2f0e87c4,droppedColumns={},collections={},indices={}] INFO 2024-12-27 19:56:40,466 [shard 0: gms] schema_tables - Altering ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ecccf010-c47b-11ef-b52c-622f2f0e87c4 ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22068	2025-01-20 16:55:17 +02:00
Asias He	8208688178	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragment - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migrate data to node 2 with the file stream. 
Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. 
- The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). 
The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. 
Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22034	2025-01-20 16:43:21 +02:00
Kefu Chai	1ef2d9d076	tree: migrate from boost::adaptors::transformed to std::views::transform Replace remaining uses of boost::adaptors::transformed with std::views::transform to reduce Boost dependencies, following the migration pattern established in `bab12e3a`. This change addresses recently merged code that reintroduced Boost header dependencies through boost::adaptors::transformed usage. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22365	2025-01-17 16:56:40 +02:00
Botond Dénes	47989b1503	Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk In this change, tablet_virtual_task starts supporting tablet resize (i.e. split and merge). Users can see running resize tasks - finished tasks are not presented with the task manager API. A new task state "suspended" is added. If a resize was revoked, it will appear to users as suspended. We assume that the resize was revoked when the tablet number didn't change. Fixes: #21366. Fixes: #21367. No backport, new feature Closes scylladb/scylladb#21891 * github.com:scylladb/scylladb: test: boost: check resize_task_info in tablet_test.cc test: add tests to check revoked resize virtual tasks test: add tests to check the list of resize virtual tasks test: add tests to check spilt and merge virtual tasks status test: test_tablet_tasks: generalize functions replica: service: add split virtual task's children replica: service: pass parent info down to storage_group::split tasks: children of virtual tasks aren't internal by default tasks: initialize shard in task_info ctor service: extend tablet_virtual_task::abort service: retrun status_helper struct from tablet_virtual_task::get_status_helper service: extend tablet_virtual_task::wait tasks: add suspended task state service: extend tablet_virtual_task::get_status service: extend tablet_virtual_task::contains service: extend tablet_virtual_task::get_stats service: add service::task_manager_module::get_nodes tasks: add task_manager::get_nodes tasks: drop noexcept from module::get_nodes replica: service: add resize_task_info static column to system.tablets locator: extend tablet_task_info to cover resize tasks	2025-01-17 14:24:07 +02:00
Piotr Dulikowski	6aa962f5f4	Merge 'Add audit subsystem for database operations' from Paweł Zakrzewski Introduces a comprehensive audit system to track database operations for security and compliance purposes. This change includes: Core Components: - New audit subsystem for logging database operations - Service level integration for proper resource management - CQL statement tracking with operation categories - Login process integration for tenant management Key Features: - Configurable audit logging (syslog/table) - Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN) - Selective auditing by keyspace/table - Password sanitization in audit logs - Service level shares support (1-1000) for workload prioritization - Proper lifecycle management and cleanup I ran the dtests for audit (manually enabled) and they pass. The in-repo tests pass. Notably, there should be no non-whitespace changes between this and scylla-enterprise Fixes scylladb/scylla-enterprise#4999 Closes scylladb/scylladb#22147 * github.com:scylladb/scylladb: audit: Add shares support to service level management audit: Add service level support to CQL login process audit: Add support to CQL statements audit: Integrate audit subsystem into Scylla main process audit: Add documentation for the audit subsystem audit: Add the audit subsystem	2025-01-17 13:14:55 +01:00
Kamil Braun	89ee2a6834	Merge 'drop ip addresses from token metadata' from Gleb Now that all topology related code uses host ids there is no point to maintain ip to id (and back) mappings in the token metadata. After the patch the mapping will be maintained in the gossiper only. The rest of the system will use host ids and in rare cases where translation is needed (mostly for UX compatibility reasons) the translation will be done using gossiper. Fixes: scylladb/scylla#21777 * 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits) hint manager: do not translate ip to id in case hint manager is stopped already locator: token_metadata: drop update_host_id() function that does nothing now locator: topology: drop indexing by ips repair: drop unneeded code storage_service: use host_id to look for a node in on_alive handler storage_proxy: translate ips to ids in forward array using gossiper locator: topology: remove unused functions storage_service: check for outdated ip in on_change notification in the peers table storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology topology coordinator: change connection dropping code to work on host ids cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query locator: drop unused function from tablet_effective_replication_map api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead locator: token_metadata: remove unused ip based functions locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs gossiper: drop get_unreachable_token_owners functions storage_service: use gossiper to map ip to id in node_ops operations storage_service: fix indentation after the last patch storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node token_metadata: drop no longer used functions ...	2025-01-17 11:00:52 +01:00
Asias He	53e6025aa6	repair: Wire repair_time in system.tablets for tombstone gc The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507	2025-01-17 16:12:05 +08:00
Asias He	614c3380c6	service: Add tablet_operation.hh A tablet_operation_result struct is added to track the result of a tablet operation.	2025-01-17 16:12:05 +08:00
Gleb Natapov	1e4b2f25dc	locator: token_metadata: drop update_host_id() function that does nothing now	2025-01-16 16:37:08 +02:00
Gleb Natapov	12da203cae	storage_service: use host_id to look for a node in on_alive handler	2025-01-16 16:37:08 +02:00
Gleb Natapov	d45ce6fa12	storage_proxy: translate ips to ids in forward array using gossiper We already use it to translate reply_to, so do it for consistency and to drop ip based API usage.	2025-01-16 16:37:08 +02:00
Gleb Natapov	fb28ff5176	storage_service: check for outdated ip in on_change notification in the peers table The code checks that it does not run for an ip address that is no longer in use (after ip address change). To check that we can use peers table and see if the host id is mapped to the address. If yes, this is the latest address for this host id otherwise this is an outdated entry.	2025-01-16 16:37:07 +02:00
Gleb Natapov	163099678e	storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology We want to drop ip from the locator::node.	2025-01-16 16:37:07 +02:00
Gleb Natapov	49fa1130ef	topology coordinator: change connection dropping code to work on host ids Do not use ip from topology::node, but look it up in address map instead. We want to drop ip from the topology::node.	2025-01-16 16:37:07 +02:00
Gleb Natapov	0ec9f7de64	gossiper: drop get_unreachable_token_owners functions It is used by truncate code only, and even there it only checks whether the returned set is not empty. Check for dead token owners in the truncation code directly.	2025-01-16 16:37:07 +02:00
Gleb Natapov	a7a7cdcf42	storage_service: use gossiper to map ip to id in node_ops operations Replace operation is special though. In case of replacing with the same IP the gossiper will not have the mapping, and node_ops RPC unfortunately does not send host id of a replaced node. For replace we consult peers table instead to find the old owner of the IP. A node that is replacing (the coordinator of the replace) will not have it though, but luckily it is not needed since it updates metadata during join_topology() anyway. The only thing that is missing there is add_replacing_endpoint() call which the patch adds.	2025-01-16 16:37:07 +02:00
Gleb Natapov	0db6136fa5	storage_service: fix indentation after the last patch	2025-01-16 16:37:07 +02:00
Gleb Natapov	9197b88e48	storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node The call already throws an error if there is more than one. Throw if there are zero as well, and drop the loops.	2025-01-16 16:37:07 +02:00
Gleb Natapov	7c4c485651	host_id_or_endpoint: use gossiper to resolve ip to id and back mappings host_id_or_endpoint is a helper class that holds either id or ip and translates one into the other on demand. Use gossiper to do the translation there instead of token_metadata, since we want to drop ip based APIs from the latter.	2025-01-16 16:37:07 +02:00
Gleb Natapov	70cc014307	storage_service: ip_address_updater: check peers table instead of token_metadata whether ip was changed As part of changing IP address peers table is updated. If it has a new address the update can be skipped.	2025-01-16 16:37:07 +02:00
Gleb Natapov	8e55cc6c78	storage_service: fix logging When logger outputs a range it already does join, so no other join is needed.	2025-01-16 16:37:07 +02:00


5158 Commits