scylla

Author	SHA1	Message	Date
Kefu Chai	9a20fb43ab	tree: replace boost::min_element() with std::ranges::min_element() in order to reduce the external header dependency, let's switch to the standardized std::ranges::min_element(). Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22572	2025-02-05 21:54:01 +02:00
Botond Dénes	f2d5819645	reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload with_permit() creates a permit, with a self-reference, to avoid attaching a continuation to the permit's run function. This self-reference is used to keep the permit alive, until the execution loop processes it. This self reference has to be carefully cleared on error-paths, otherwise the permit will become a zombie, effectively leaking memory. Instead of trying to handle all loose ends, get rid of this self-reference altogether: ask caller to provide a place to save the permit, where it will survive until the end of the call. This makes the call-site a little bit less nice, but it gets rid of a whole class of possible bugs. Fixes: #22588 Closes scylladb/scylladb#22624	2025-02-04 21:27:16 +02:00
Ran Regev	edd56a2c1c	moved cache files to db As requested in #22097, moved the files and fixed other includes and build system. Fixes: #22097 Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#22495	2025-02-04 12:21:31 +03:00
Botond Dénes	af46894bb7	Merge 'Rack aware view pairing' from Benny Halevy Enabled with the tablets_rack_aware_view_pairing cluster feature rack-aware pairing pairs base to view replicas that are in the same dc and rack, using their ordinality in the replica map We distinguish between 2 cases: - Simple rack-aware pairing: when the replication factor in the dc is a multiple of the number of racks and the minimum number of nodes per rack in the dc is greater than or equal to rf / nr_racks. In this case (that includes the single rack case), all racks would have the same number of replicas, so we first filter all replicas by dc and rack, retaining their ordinality in the process, and finally, we pair between the base replicas and view replicas, that are in the same rack, using their original order in the tablet-map replica set. For example, nr_racks=2, rf=4: base_replicas = { N00, N01, N10, N11 } view_replicas = { N11, N12, N01, N02 } pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 } Note that we don't optimize for self-pairing if it breaks pairing ordinality. - Complex rack-aware pairing: when the replication factor is not a multiple of nr_racks. In this case, we attempt best-match pairing in all racks, using the minimum number of base or view replicas in each rack (given their global ordinality), while pairing all the other replicas, across racks, sorted by their ordinality. 
For example, nr_racks=4, rf=3: base_replicas = { N00, N10, N20 } view_replicas = { N11, N21, N31 } pairing would be: { N00, N31 }*, { N10, N11 }, { N20, N21 } (* cross-rack pair) If we'd simply stable-sort both base and view replicas by rack, we might end up with much worse pairing across racks: { N00, N11 }*, { N10, N21 }*, { N20, N31 }* (* cross-rack pair) Fixes scylladb/scylladb#17147 * This is an improvement so no backport is required Closes scylladb/scylladb#21453 * github.com:scylladb/scylladb: network_topology_strategy_test: add tablets rack_aware_view_pairing tests view: get_view_natural_endpoint: implement rack-aware pairing for tablets view: get_view_natural_endpoint: handle case when there are too few view replicas view: get_view_natural_endpoint: track replica locator::nodes locator: topology: consult local_dc_rack if node not found by host_id locator: node: add dc and rack getters feature_service: add tablet_rack_aware_view_pairing feature view: get_view_natural_endpoint: refactor predicate function view: get_view_natural_endpoint: clarify documentation view: mutate_MV: optimize remote_endpoints filtering check view: mutate_MV: lookup base and view erms synchronously view: mutate_MV: calculate keyspace-dependent flags once	2025-01-30 11:32:19 +02:00
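The simple rack-aware case described in the merge above (every rack holds the same number of replicas) can be illustrated with a small sketch. This is not the code from `get_view_natural_endpoint`; the `replica` struct and the `pair_simple_rack_aware` function are hypothetical names introduced only for illustration, and the sketch covers only the simple case, not the complex cross-rack fallback.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a replica: a node name plus the rack it lives in.
struct replica {
    std::string node;
    int rack;
};

// Simple rack-aware pairing: within each rack, the i-th base replica is
// paired with the i-th view replica, both taken in the order in which they
// appear in their original replica sets (their "ordinality").
std::vector<std::pair<std::string, std::string>>
pair_simple_rack_aware(const std::vector<replica>& base,
                       const std::vector<replica>& view) {
    // Group view replicas by rack, preserving their original order.
    std::map<int, std::vector<std::string>> view_by_rack;
    for (const auto& v : view) {
        view_by_rack[v.rack].push_back(v.node);
    }

    // For each rack, remember how many base replicas were already paired.
    std::map<int, std::size_t> next_in_rack;

    std::vector<std::pair<std::string, std::string>> pairs;
    for (const auto& b : base) {
        const auto& candidates = view_by_rack[b.rack];
        const std::size_t idx = next_in_rack[b.rack]++;
        if (idx < candidates.size()) {
            pairs.emplace_back(b.node, candidates[idx]);
        }
        // A base replica with no same-rack partner is left unpaired here;
        // the complex case in the commit message handles that differently.
    }
    return pairs;
}
```

With the `nr_racks=2, rf=4` example from the commit message, base `{ N00, N01, N10, N11 }` and view `{ N11, N12, N01, N02 }` yield the pairs `{ N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }`.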
Aleksandra Martyniuk	328818a50f	replica: mark registry entry as synch after the table is added When a replica gets a write request it performs get_schema_for_write, which waits until the schema is synced. However, database::add_column_family marks a schema as synced before the table is added. Hence, the write may see the schema as synced, but hit no_such_column_family as the table hasn't been added yet. Mark schema as synced after the table is added to database::_tables_metadata. Fixes: #22347. Closes scylladb/scylladb#22348	2025-01-30 11:30:07 +02:00
Kefu Chai	57b14220ce	tree: remove unused "#include"s these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`, use `using seastar::future` to include `future` in the global namespace, this makes `readers/mutation_reader.hh` a header exposing `future<>`, but this is not a good practice, because, unlike `seastarx.hh` or `seastar/core/future.hh`, `reader/mutation_reader.hh` is not responsible for exposing seastar declarations. so, we trade the using statement for `#include "seastarx.hh"` in that file to decouple the source files including it from this header because of this statement. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22439	2025-01-28 14:12:06 +03:00
Avi Kivity	6b85c03221	Merge 'split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Ferenc Szili `tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups. The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that `tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but `tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE Fixes #22431 This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1 Closes scylladb/scylladb#22330 * github.com:scylladb/scylladb: test: add reproducer and test for fix to split ready CG creation table: run set_split_mode() on all storage groups during all_storage_groups_split()	2025-01-27 13:13:42 +01:00
Benny Halevy	23284f038f	table: flush: synchronize with stop() When the table is stopped, all compaction groups are stopped, and as part of that, they are flushing their memtables. To synchronize with stop-induced flush operation, move _pending_flushes_phaser.stop() later in table::stop(), after all compaction groups are flushed and stopped. This way, in table::flush, if we see that the phaser is already closed, we know that there is nothing to flush, otherwise we start a flush operation that would be waited on by a parallel table::stop(). Fixes #22243 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22339	2025-01-22 09:23:09 +02:00
Benny Halevy	0e388a1594	view: get_view_natural_endpoint: handle case when there are too few view replicas Currently, when reducing RF, we may drop replicas from the view before dropping replicas from the base table. Since get_view_natural_endpoint is allowed to return a disengaged optional if it can't find a pair for the base replica, replace the existing assertion with code handling this case, and count those events in a new table metric: total_view_updates_failed_pairing. Note that this does not fix the root cause for the issue which is the unsynchronized dropping of replicas, that should be atomic, using a single group0 transaction. Refs scylladb/scylladb#21492 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Ferenc Szili	8bff7786a8	test: add reproducer and test for fix to split ready CG creation This adds a reproducer for #22431 In cases where a tablet storage group manager had more than one storage group, it was possible to create compaction groups outside the group0 guard, which could create problems with operations which should exclude with compaction group creation.	2025-01-21 18:43:10 +01:00
Ferenc Szili	24e8d2a55c	table: run set_split_mode() on all storage groups during all_storage_groups_split() tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode() for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using std::ranges::all_of() which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (set_split_mode()) returning false. set_split_mode() creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups. The missing split compaction groups are later created in tablet_storage_group_manager::split_all_storage_groups() which also calls set_split_mode(), and that is the reason why split completes successfully. The problem is that tablet_storage_group_manager::all_storage_groups_split() runs under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups() does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE	2025-01-21 18:42:53 +01:00
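The pitfall described in the commit above is that a predicate with side effects is not run on every element once an `all_of`-style algorithm sees the first `false`. The sketch below contrasts that pattern with a loop that visits every element. The `storage_group` struct and both function names are hypothetical simplifications, not the Scylla types, and `std::all_of` is used here as a stand-in for `std::ranges::all_of`, which is likewise free to stop early.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a storage group.
struct storage_group {
    bool empty;                // are the main/merging compaction groups empty?
    bool split_ready = false;  // set once split-ready groups were created

    // Creates split-ready compaction groups as a side effect and returns
    // false while the group still holds unsplit data.
    bool set_split_mode() {
        split_ready = true;
        return empty;
    }
};

// Problematic pattern: the algorithm may stop at the first group whose
// set_split_mode() returns false, so later groups are never prepared.
bool all_split_short_circuit(std::vector<storage_group>& groups) {
    return std::all_of(groups.begin(), groups.end(),
                       [](storage_group& g) { return g.set_split_mode(); });
}

// Alternative: call set_split_mode() on every group, then combine results.
bool all_split_visit_all(std::vector<storage_group>& groups) {
    bool all = true;
    for (auto& g : groups) {
        // Call first, combine second, so that `all == false` never
        // suppresses the call through && short-circuiting.
        all = g.set_split_mode() && all;
    }
    return all;
}
```

Both functions return the same boolean; they differ only in whether every group is guaranteed to have had `set_split_mode()` called on it by the time the function returns.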
Tomasz Grabiec	8059090a29	Merge 'Cache base info for view schemas in the schema registry' from Wojciech Mitros Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated due to all `schema_ptr`s temporarily dying. To fix this, this patch adds the base schema to the registry, alongside the view schema. We store just the frozen base schema, so that we can transfer it across shards. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. In this series we also make sure that the view schemas in the registry are kept up-to-date in regards to base schema changes. Fixes https://github.com/scylladb/scylladb/issues/21354 This issue is a bug, so adding backport labels 6.1 and 6.2 Closes scylladb/scylladb#21862 * github.com:scylladb/scylladb: test: add test for schema registry maintaining base info for views schema_registry: avoid setting base info when getting the schema from registry schema_registry: update cached base schemas when updating a view schema_registry: cache base schemas for views db: set base info before adding schema to registry	2025-01-21 00:17:54 +01:00
Tomasz Grabiec	c7f78edc78	Merge 'repair: Wire repair_time in system.tablets for tombstone gc' from Asias He The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507 New feature. No backport is needed. Closes scylladb/scylladb#21896 * github.com:scylladb/scylladb: repair: Stop using rpc to update repair time for repairs scheduled by scheduler repair: Wire repair_time in system.tablets for tombstone gc test: Disable flush_cache_time for two tablet repair tests test: Introduce guarantee_repair_time_next_second helper repair: Return repair time for repair_service::repair_tablet service: Add tablet_operation.hh	2025-01-20 18:08:49 +01:00
Botond Dénes	47989b1503	Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk In this change, tablet_virtual_task starts supporting tablet resize (i.e. split and merge). Users can see running resize tasks - finished tasks are not presented with the task manager API. A new task state "suspended" is added. If a resize was revoked, it will appear to users as suspended. We assume that the resize was revoked when the tablet number didn't change. Fixes: #21366. Fixes: #21367. No backport, new feature Closes scylladb/scylladb#21891 * github.com:scylladb/scylladb: test: boost: check resize_task_info in tablet_test.cc test: add tests to check revoked resize virtual tasks test: add tests to check the list of resize virtual tasks test: add tests to check spilt and merge virtual tasks status test: test_tablet_tasks: generalize functions replica: service: add split virtual task's children replica: service: pass parent info down to storage_group::split tasks: children of virtual tasks aren't internal by default tasks: initialize shard in task_info ctor service: extend tablet_virtual_task::abort service: retrun status_helper struct from tablet_virtual_task::get_status_helper service: extend tablet_virtual_task::wait tasks: add suspended task state service: extend tablet_virtual_task::get_status service: extend tablet_virtual_task::contains service: extend tablet_virtual_task::get_stats service: add service::task_manager_module::get_nodes tasks: add task_manager::get_nodes tasks: drop noexcept from module::get_nodes replica: service: add resize_task_info static column to system.tablets locator: extend tablet_task_info to cover resize tasks	2025-01-17 14:24:07 +02:00
Botond Dénes	55963f8f79	replica: remove noexcept from token -> tablet resolution path The methods to resolve a key/token/range to a table are all noexcept. Yet the method below all of these, `storage_group_for_id()` can throw. This means that if due to any mistake a tablet without local replica is attempted to be looked up, it will result in a crash, as the exception bubbles up into the noexcept methods. There is no value in pretending that looking up the tablet replica is noexcept, remove the noexcept specifiers so that any bad lookup only fails the operation at hand and doesn't crash the node. This is especially relevant to replace, which still has a window where writes can arrive for tablets that don't (yet) have a local replica. Currently, this results in a crash. After this patch, this will only fail the writes and the replace can move on. Fixes: #21480 Closes scylladb/scylladb#22251	2025-01-17 11:24:09 +03:00
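To make the reasoning in the commit above concrete: in C++, an exception that escapes a `noexcept` function calls `std::terminate`, whereas without the specifier the caller can catch it and fail only the current operation. The sketch below shows the second behaviour. `table_sketch`, the token-to-tablet mapping, and `try_write` are hypothetical names for illustration, not the Scylla implementation.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

struct storage_group {
    std::string name;
};

class table_sketch {
    std::unordered_map<unsigned, storage_group> _groups;
    static constexpr unsigned tablet_count = 4;  // assumed for the sketch
public:
    void add(unsigned tablet_id, std::string name) {
        _groups.emplace(tablet_id, storage_group{std::move(name)});
    }

    // Throws when this node has no local replica of the tablet.
    const storage_group& storage_group_for_id(unsigned tablet_id) const {
        auto it = _groups.find(tablet_id);
        if (it == _groups.end()) {
            throw std::runtime_error("Storage wasn't found for tablet "
                                     + std::to_string(tablet_id));
        }
        return it->second;
    }

    // Deliberately not noexcept: if it were, the exception thrown above
    // would terminate the process instead of reaching the caller.
    const storage_group& storage_group_for_token(unsigned token) const {
        return storage_group_for_id(token % tablet_count);
    }
};

// A bad lookup fails only this write; the process keeps running.
bool try_write(const table_sketch& t, unsigned token) {
    try {
        t.storage_group_for_token(token);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

For a table that holds only tablet 0, `try_write(t, 4)` succeeds and `try_write(t, 5)` returns `false` without taking the process down.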
Asias He	53e6025aa6	repair: Wire repair_time in system.tablets for tombstone gc The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507	2025-01-17 16:12:05 +08:00
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Asias He	cd96fb5a78	repair: Add repair_hosts_filter and repair_dcs_filter They will be useful for hosts and DCs selection for the repair scheduler. It is not implemented yet. Adding it earlier, so we do not need to change the system table later. Closes scylladb/scylladb#21985	2025-01-14 08:46:26 +02:00
Aleksandra Martyniuk	062f155fd6	replica: service: add split virtual task's children offstrategy_compaction_task_executor and split_compaction_task_executor running as a part of the split become children of a split virtual task.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	7ef6900837	replica: service: pass parent info down to storage_group::split Pass task_info down to storage_group::split. In the following patches, it will be used to set the parent of offstrategy_compaction_task_executor and split_compaction_task_executor running as a part of the split. The task_info param will contain task info of a split virtual task.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	18b829add8	replica: service: add resize_task_info static column to system.tablets Add resize_task_info static column to system.tablets. Set or delete resize_task_info value when the resize_decision is changed. Reflect the column content in tablet_map.	2025-01-10 10:03:07 +01:00
Kefu Chai	569f8e9246	treewide: fix misspellings these misspellings were identified by codespell. let's fix them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22154	2025-01-05 16:13:09 +02:00
Tomasz Grabiec	4c89e62470	Merge 'Phased barrier improvements' from Benny Halevy - utils: phased_barrier: advance_and_await: allocate new gate only when needed - utils: phased_barrier: add close() method - and use in existing services * Improvement. No backport needed Closes scylladb/scylladb#22018 * github.com:scylladb/scylladb: utils: phased_barrier: add close() method utils: phased_barrier: advance_and_await: allocate new gate only when needed	2025-01-03 18:51:23 +01:00
Piotr Dulikowski	7383013f43	replica/database: add reader concurrency semaphore groups Replace the reader concurrency semaphores for user reads and view updates with the newly introduced reader concurrency semaphore group, which assigns a semaphore for each service level. Each group is statically assigned to some pool of memory on startup and dynamically distributes this memory between the semaphores, relative to the number of shares of the corresponding scheduling group. The intent of having a separate reader concurrency semaphore for each scheduling group is to prevent priority inversion issues due to reads with different priorities waiting on the same semaphore, as well as make memory allocation more fair between service levels due to the adjusted number of shares.	2025-01-02 07:13:34 +01:00
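As a rough illustration of distributing a fixed memory pool "relative to the number of shares", the sketch below splits a pool proportionally. The commit message does not spell out the exact formula, so plain proportional division is an assumption here, and `distribute_memory` is a hypothetical helper rather than anything from the reader concurrency semaphore group.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Splits `pool_bytes` between semaphores in proportion to `shares`
// (assumed formula: pool * shares[i] / sum(shares)).
// Integer division may leave a few bytes unassigned, and the product
// pool_bytes * shares[i] is assumed to fit in 64 bits.
std::vector<std::uint64_t>
distribute_memory(std::uint64_t pool_bytes,
                  const std::vector<std::uint64_t>& shares) {
    std::vector<std::uint64_t> out(shares.size(), 0);
    const std::uint64_t total =
        std::accumulate(shares.begin(), shares.end(), std::uint64_t{0});
    if (total == 0) {
        return out;  // nothing to distribute against
    }
    for (std::size_t i = 0; i < shares.size(); ++i) {
        out[i] = pool_bytes * shares[i] / total;
    }
    return out;
}
```

For a pool of 1000 bytes and shares of 100 and 300, this gives 250 and 750 bytes respectively.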
Avi Kivity	4905b1bf76	Merge 'table: make update_effective_replication_map sync again' from Benny Halevy Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`. Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, that caused scylladb/scylladb#17786 This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a member and storage_group_manager::stop() consumes this future during table::stop(). - storage_service: replicate_to_all_cores: update base and view tables atomically Currently, the loop updating all tables (including views) with the new effective_replication_map may yield, and therefore expose a state where the base and view tables effective_replication_map and topology are out of sync (as seen in scylladb/scylladb#17786) To prevent that, loop over all base tables and for each table update the base table and all views atomically, without yielding, and so allow yielding only between base tables. * Regression was introduced in `f2ff701489`, so backport is required to 6.x, 2024.2 Closes scylladb/scylladb#21781 * github.com:scylladb/scylladb: storage_service: replicate_to_all_cores: clear_gently pending erms test_mv_topology_change: drop delay_after_erm_update injection case storage_service: replicate_to_all_cores: update base and view tables atomically table: make update_effective_replication_map sync again	2024-12-30 23:42:06 +02:00
Wojciech Mitros	82f2e1b44c	schema_registry: update cached base schemas when updating a view The schema registry now holds base schemas for view schemas. The base schema may change without changing the view schema, so to preserve the change in the schema registry, we also update the base schema in the registry when updating the base info in the view schema.	2024-12-30 14:56:18 +01:00
Wojciech Mitros	6f11edbf3f	db: set base info before adding schema to registry In the following patches, we'll ensure that view schemas returned by the schema registry always have base info set. To prepare for that, make sure that the base info is always set before inserting it into schema registry.	2024-12-30 14:56:17 +01:00
Kefu Chai	6acc5294a4	treewide: migrate from boost::copy_range to std::ranges::to now that we are allowed to use C++23. we now have the luxury of using `std::ranges::to`. in this change, we: - replace `boost::copy_range` to `std::ranges::to` - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21880	2024-12-26 11:46:26 +02:00
Benny Halevy	a25c3eaa1c	utils: phased_barrier: add close() method When services are stopped we generally want to call advance_and_await(), but we should also prevent starting new operations, so close() would do that by closing the phased_barrier active gate (which implicitly also awaits past operations similar to advance_and_await()). Add unit tests for that and use in existing services. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-26 06:54:07 +02:00
Takuya ASADA	03461d6a54	test: compile unit tests into a single executable To reduce test executable size and speed up compilation time, compile unit tests into a single executable. Here is a file size comparison of the unit test executable: - Before applying the patch $ du -h --exclude='.o' --exclude='.o.d' build/release/test/boost/ build/debug/test/boost/ 11G build/release/test/boost/ 29G build/debug/test/boost/ - After applying the patch $ du -h --exclude='.o' --exclude='.o.d' build/release/test/boost/ build/debug/test/boost/ 5.5G build/release/test/boost/ 19G build/debug/test/boost/ It reduces executable sizes by 5.5GB on release, and by 10GB on debug. Closes #9155 Closes scylladb/scylladb#21443	2024-12-22 19:14:09 +02:00
Aleksandra Martyniuk	1c29726477	replica: do not set tablet_task_info if it isn't valid Currently, in tablet_map_to_mutation, repair's and migration's tablet_task_info is always set. Do not set the tablet_task_info if there is no running operation. Closes scylladb/scylladb#22005	2024-12-20 16:10:53 +02:00
Pavel Emelyanov	bb094cc099	Merge 'Make restore task abortable' from Calle Wilund Fixes #20717 Enables abortable interface and propagates abort_source to all s3 objects used for reading the restore data. Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and do a background per-shard abort on abort call. This is synced at the end of "run()". Abort source is added as an optional parameter to s3 storage and the s3 path in distributed loader. There is no attempt to "clean up" an aborted restore. As we read on a mutation level from remote sstables, we should not cause incomplete sstables as such, even though we might end up of course with partial data restored. Closes scylladb/scylladb#21567 * github.com:scylladb/scylladb: test_backup: Add restore abort test case sstables_loader: Make restore task abortable distributed_loader: Add optional abort_source to get_sstables_from_object_store s3_storage: Add optional abort_source to params/object s3::client: Make "readable_file" abortable	2024-12-19 12:23:33 +03:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Kefu Chai	e65fc35b5e	replica: do not include unused headers these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21836	2024-12-18 13:52:57 +02:00
Avi Kivity	5a849b0a6a	Merge "Move more subsystems to use host ids instead of ips" from Gleb " This series converts repair, streaming and node_ops (and some parts of alternator) to work on host ids instead of ips. This allows to remove a lot of (but not all) functions that work on ips from effective replication map. CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/ Refs: scylladb/scylladb#21777 " * 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev: locator: topology: remove no longer use get_all_ips() gossiper: change get_unreachable_nodes to host ids locator: drop no longer used ip based functions from effective replication map and friends test: move network_topology_strategy_test and token_metadata_test to use host id based APIs replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges replica/mutation_dump: use host ids instead of ips alternator: move ttl to work with host ids instead of ips storage_service: move node_ops code to use host ids instead of host ips streaming: move streaming code to use host ids instead of host ips repair: move repair code to use host ids instead of host ips gossiper: add get_unreachable_host_ids() function locator: topology: add more function that return host ids to effective replication map locator: add more function that return host ids to effective replication map	2024-12-18 13:48:22 +02:00
Aleksandra Martyniuk	d0cda8ebef	replica: check enabled features in tablet_map_to_mutation Before adding a value to a new column in tablet_map_to_mutation check if the column is supported by the whole cluster. Closes scylladb/scylladb#21941	2024-12-17 07:02:11 +02:00
Raphael S. Carvalho	013e0d53ff	replica: Fix use-after-free due to a race between split and cleanup There is an assumption that every destroyed compaction_group will be stopped first. Otherwise, the group is still referenced by compaction manager and can use it after freed. That's what happened in issue #21867 in the context of merge. The issue is pre-existing but was made more likely with merge. One problem is a race between split and cleanup, where if split is emitted while cleanup is stopping groups, it can happen split preparation adds new groups that will never be closed, since cleanup is already past the group stopping step. Another problem found is that split completion handler is not accounting for possible existence of merging groups, if split happens right after merge. Split completion handler should stop all empty groups that previously had data split from them. The problems will be fixed by guaranteeing that new groups will not be added for a tablet being migrated away, and that empty groups are properly closed when handling split completion. A reproducer was added. Fixes #21867. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#21920	2024-12-16 13:19:26 +02:00
Benny Halevy	10c4cf930c	table: make update_effective_replication_map sync again Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`). Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, that caused scylladb/scylladb#17786 This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a member and storage_group_manager::stop() consumes this future during table::stop(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-15 11:45:08 +02:00
Gleb Natapov	ca55d1e658	replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges	2024-12-15 11:31:11 +02:00
Gleb Natapov	77f8abb19a	replica/mutation_dump: use host ids instead of ips	2024-12-15 11:31:11 +02:00
Aleksandra Martyniuk	3f9c76c52d	replica: add repair related fields to tablet_map_to_mutation	2024-12-12 11:40:40 +01:00
Aleksandra Martyniuk	9fad3a621a	replica: service: add migration_task_info column to system.tablets Add migration_task_info column to system.tablets. Set migration_task_info value on migration request if the feature is enabled in the cluster. Reflect the column content in tablet_metadata.	2024-12-11 12:07:36 +01:00
Botond Dénes	5d040e0206	Merge 'truncate: commit log replay positions are not saved correctly' from Ferenc Szili TRUNCATE TABLE saves the current commit log replay positions in case there is a crash so that replay knows where to begin replaying the mutations. These are collected and saved per shard into `system.truncated`. In case a shard received no mutations, its replay position will be an empty, default constructed object of type `db::replay_position` with its members set to 0. Truncate will incorrectly interpret these empty replay positions as if they were coming from shard 0, and save them as such, potentially overwriting an actual valid replay position coming from the actual shard 0. In the case of a crash, this will cause the commit log on shard 0 to be replayed from the beginning, and result with data resurrection. Fixes #21719 Closes scylladb/scylladb#21722 * github.com:scylladb/scylladb: test: add test for truncate saving replay positions database: correctly save replay position for truncate	2024-12-10 10:05:30 +02:00
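The collision described in the merge above can be reproduced with a toy model: if the owning shard is derived from the replay position itself, a default-constructed (all-zero) position looks like it belongs to shard 0. The sketch below is a simplified illustration; the `replay_position` layout (shard in the top 8 bits of `id`), `make_rp`, and both save functions are assumptions made for this example, not the actual `db::replay_position` or the fix that was merged.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy replay position; the shard encoding is an assumption of this sketch.
struct replay_position {
    std::uint64_t id = 0;
    std::uint32_t pos = 0;
    unsigned shard_id() const { return static_cast<unsigned>(id >> 56); }
};

replay_position make_rp(unsigned shard, std::uint64_t segment, std::uint32_t pos) {
    replay_position rp;
    rp.id = (static_cast<std::uint64_t>(shard) << 56) | segment;
    rp.pos = pos;
    return rp;
}

// Problematic: the key is derived from the position itself, so an empty
// position reported by any shard is stored under shard 0 and can overwrite
// shard 0's valid position.
std::map<unsigned, replay_position>
save_by_encoded_shard(const std::vector<replay_position>& per_shard) {
    std::map<unsigned, replay_position> saved;
    for (const auto& rp : per_shard) {
        saved[rp.shard_id()] = rp;
    }
    return saved;
}

// One way to avoid the collision: key by the shard that reported the
// position (its index in `per_shard`), not by the encoded shard.
std::map<unsigned, replay_position>
save_by_reporting_shard(const std::vector<replay_position>& per_shard) {
    std::map<unsigned, replay_position> saved;
    for (std::size_t shard = 0; shard < per_shard.size(); ++shard) {
        saved[static_cast<unsigned>(shard)] = per_shard[shard];
    }
    return saved;
}
```

With shard 0 reporting a valid position and shard 1 reporting an empty one, the first function ends up storing the empty position for shard 0, which in the scenario from the commit message would cause the commit log on shard 0 to be replayed from the beginning.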
Botond Dénes	924189c50e	Merge 'replica/table: improve error message when encountering orphaned sstables' from Lakshmi Narayanan Sreethar On startup, if a server reads an sstable that belongs to a tablet that doesn't have any local replica, it throws an error in the following format and refuses to start : ``` Storage wasn't found for tablet 1 of table test.test ``` This patch updates the code path to throw a nicer error that includes the sstable name that caused the problem. This patch also adds a testcase to verify the error being thrown. Fixes https://github.com/scylladb/scylladb/issues/18038 PR improves an error message - no need to backport. Closes scylladb/scylladb#21805 * github.com:scylladb/scylladb: replica/table: fix indent in compaction_group_for_sstable replica/table: improve error message when encountering orphaned sstables	2024-12-10 06:34:12 +02:00
Kefu Chai	ce2f80c227	treewide: migrate from boost::make_iterator_range to ranges::subrange Replace boost::make_iterator_range() with std::ranges::subrange. This change improves code modernization and reduces external dependencies: - Replace boost::make_iterator_range() with std::ranges::subrange - Remove boost/range/iterator_range.hpp include - Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range> This is part of ongoing efforts to modernize our codebase and minimize external dependencies. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21787	2024-12-09 21:31:53 +02:00
Pavel Emelyanov	6eb6b96456	dirty-memory-manager: Brush up "blocked" state check One of the run_when_memory_available() checks mirrors the one done by the execution_permitted() helper, so it's worth re-using it. Since the former helper is a header template, the latter is worth moving to the header too. And, once re-used, the `bool blocking` variable becomes excessive, and the `if (blocking)` check can also be expressed with fewer LOCs. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#21812	2024-12-09 20:44:22 +02:00
Kefu Chai	48c8d24345	treewide: drop support for fmt < v10 Since Fedora 38 is EOL, Fedora 39 comes with fmt v10.0.0, and we've switched to the build image based on Fedora 40, which ships fmt-devel v10.2.1, there is no need to support fmt < 10. In this change, we drop support for fmt < 10. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21847	2024-12-09 20:42:38 +02:00
Tomasz Grabiec	7e2875d648	Merge 'Add tablet merge support' from Raphael Raph Carvalho The goal of merge is to reduce the tablet count for a shrinking table. Similar to how split increases the count while the table is growing. The load balancer decision to merge is implemented today (came with infrastructure introduced for split), but it wasn't handled until now. Initial tablet count is respected while the table is in "growing mode". For example, the table leaves it if there was a need to split above the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. Merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillations between split and merges. Similar to split, the decision to merge is recorded in tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so new coordinator continues from where the old left off. Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator by co-locating sibling tablets in the same node's shard. We can define sibling tablets as tablets that have contiguous range and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets of ids 0 and 1 are siblings, 2 and 3 are also siblings. The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it will emit migrations so that "odd" tablet will follow the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location is low in priority, it's not the end of the world to delay merge, but it's not ideal to delay e.g. decommission or even regular load balancing as that can translate into temporary unbalancing, impacting the user activities. 
So co-location migrations will happen when there is no more important work to do. While regular balancing is higher in priority, it will not undo the co-location work done so far. It does that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so the balancer understands when two tablets are being migrated instead of one, to avoid oscillations. When the balancer completes co-location work for a table undergoing merge, it will put the id of the table into the resize_plan, which tells the topology coordinator that the table is ready for merge. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map into group0. All the replicas will react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one. Fixes #18181. system test details: test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml instance type: i3.8xlarge nodes: 3 target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges) description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges. data_set_size: ~100G initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge. 
latency of reads and writes that happened in parallel to split and merge: ``` $ for i in scylla-bench; do cat $i \| grep "Mode\\|99th:\\|99\.9th:"; done Mode: write 99.9th: 3.145727ms 99th: 1.998847ms 99.9th: 3.145727ms 99th: 2.031615ms Mode: read 99.9th: 3.145727ms 99th: 2.031615ms 99.9th: 3.145727ms 99th: 2.031615ms Mode: write 99.9th: 3.047423ms 99th: 1.933311ms 99.9th: 3.047423ms 99th: 1.933311ms Mode: read 99.9th: 3.145727ms 99th: 1.900543ms 99.9th: 3.145727ms 99th: 1.900543ms Mode: write 99.9th: 5.079039ms 99th: 3.604479ms 99.9th: 35.389439ms 99th: 25.624575ms Mode: write 99.9th: 3.047423ms 99th: 1.998847ms 99.9th: 3.047423ms 99th: 1.998847ms Mode: read 99.9th: 3.080191ms 99th: 2.031615ms 99.9th: 3.112959ms 99th: 2.031615ms ``` Closes scylladb/scylladb#20572 github.com:scylladb/scylladb: docs: Document tablet merging tests/boost: Add test to verify correctness of balancer decisions during merge tests/topology_experimental_raft: Add tablet merge test service: Handle exception when retrying split service: Co-locate sibling tablets for a table undergoing merge gms: Add cluster feature for tablet merge service: Make merge of resize plan commutative replica: Implement merging of compaction groups on merge completion replica: Handle tablet merge completion service: Implement tablet map resize for merge locator: Introduce merge_tablet_info() service: Rename topology::transition_state::tablet_split_finalization service: Respect initial_tablet_count if table is in growing mode service: Wire migration_tablet_set into the load balancer locator: Add tablet_map::sibling_tablets() service: Introduce sorted_replicas_for_tablet_load() locator/tablets: Extend tablet_replica equality comparator to three-way service: Introduce alias to per-table candidate map type service: Add replication constraint check variant for migration_tablet_set service: Add convergence check variant for migration_tablet_set service: Add migration helpers for migration_tablet_set 
service/tablet_allocator: Introduce migration_tablet_set service: Introduce migration_plan::add(migrations_vector) locator/tablets: Introduce tablet_map::for_each_sibling_tablets() locator/tablets: Introduce tablet_map::needs_merge() locator/tablets: Introduce resize_decision::initial_decision() locator/tablets: Fix return type of three-way comparison operators service: Extract update of node load on migrations service: Extract converge check for intra-node migration service: Extract erase of tablet replicas from candidate list scripts/tablet-mon: Allow visualization of tablet id	2024-12-06 18:06:20 +01:00
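The sibling relationship described in the merge entry above (ids 0 and 1 are siblings, 2 and 3 are siblings, and the tablet map shrinks by a factor of 2) can be sketched as follows. All function names here are hypothetical and are not the `tablet_map` API. The rule that a merged tablet takes the id `tablet_id / 2` is an assumption made for illustration, based on the power-of-two constraint the message mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: the (even, odd) pair of sibling tablet ids that
// contains tablet_id. Clearing the lowest bit yields the even sibling.
constexpr std::pair<std::size_t, std::size_t> sibling_pair(std::size_t tablet_id) {
    const std::size_t even = tablet_id & ~static_cast<std::size_t>(1);
    return {even, even + 1};
}

// Co-location target: the "odd" tablet follows the "even" one, so a
// migration for either sibling is directed to where the even one lives.
constexpr std::size_t colocation_leader(std::size_t tablet_id) {
    return sibling_pair(tablet_id).first;
}

// Assumed id of the merged tablet once both siblings become one.
constexpr std::size_t merged_tablet_id(std::size_t tablet_id) {
    return tablet_id / 2;
}

// Merge reduces the tablet count by a factor of 2.
constexpr std::size_t tablet_count_after_merge(std::size_t tablet_count) {
    return tablet_count / 2;
}
```

In the system test quoted above, a table that grew to 128 tablets returns to 64 after a merge, which matches `tablet_count_after_merge(128)`.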
Lakshmi Narayanan Sreethar	401e7c8f69	replica/table: fix indent in compaction_group_for_sstable Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-12-06 21:22:24 +05:30
Lakshmi Narayanan Sreethar	fa10b0b390	replica/table: improve error message when encountering orphaned sstables On startup, if a server reads an sstable that belongs to a tablet that doesn't have any local replica, it throws an error in the following format and refuses to start : ``` Storage wasn't found for tablet 1 of table test.test ``` This patch updates the code path to throw a nicer error that includes the sstable name that caused the problem. This patch also adds a testcase to verify the error being thrown. Fixes #18038	2024-12-06 21:22:24 +05:30

1448 Commits