scylla

Author	SHA1	Message	Date
Gleb Natapov	0e80c5162a	storage_service: avoid unneeded copies in on_change Move array of strings instead of copying.	2023-05-25 14:51:14 +03:00
Gleb Natapov	3a201c25c8	storage_service: remove check that is always true The array cannot be empty since we access the first element of the array before we call this function.	2023-05-25 14:50:23 +03:00
Gleb Natapov	715897ff31	storage_service: rename handle_state_removing to handle_state_removed The function no longer handles the REMOVING_TOKENS state, so rename the function and drop the no-longer-needed checks for the nonexistent state.	2023-05-25 14:48:58 +03:00
Gleb Natapov	4103281648	storage_service: avoid string copy	2023-05-25 14:48:39 +03:00
Gleb Natapov	05aa07835d	storage_service: delete code that handled REMOVING_TOKENS state The state is never advertised so the code is never used.	2023-05-25 14:48:09 +03:00
Tomasz Grabiec	809ddd7f79	Merge 'Move pending_ranges and endpoints_for_reading from token_metadata to erm' from Gusev Petr This refactoring is a follow-up for https://github.com/scylladb/scylladb/pull/13376, move per keyspace data structures related to topology changes from `token_metadata` to `erm`. We move `pending_endpoints` and `read_endpoints`, along with their computation logic, from `token_metadata` to `vnode_effective_replication_map`. The `vnode_effective_replication_map` seems more appropriate for them since it contains functionally similar `replication_map` and we will be able to reuse `pending_endpoints/read_endpoints` across keyspaces sharing the same `factory_key`. At present, `pending_endpoints` and `read_endpoints` are updated in the `update_pending_ranges` function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the `migration_info` for specific keyspaces. In this PR we introduce a new `topology_change_info` structure to hold the first part's data and create an `update_topology_change_info` function to update it. This structure will be used in `vnode_effective_replication_map` to compute `pending_endpoints` and `read_endpoints`. This enables the reuse of `topology_change_info` across all keyspaces, unlike the current `update_pending_ranges` implementation, which is another benefit of this refactoring. The PR also optimises `replication_map` memory usage for the case `natural_endpoints_depend_on_token == false`. We store endpoints list only once with special key instead of duplicating them for each `vnode` token. The original `update_pending_ranges` remains unchanged during the PR commits, and will be removed entirely upon transitioning to the new implementation. 
Closes #13715 * github.com:scylladb/scylladb: token_metadata_test: add a test for everywhere strategy token_metadata_test: check read_endpoints when bootstrapping first node token_metadata_test: refactor tests, extract create_erm token_metadata: drop has_pending_ranges and migration_info effective_replication_map: add has_pending_ranges token_metadata: drop update_pending_ranges effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading calculate_effective_replication_map: compute pending_endpoints and read_endpoints vnode_erm: optimize replication_map vnode_erm::get_range_addresses: use sorted_tokens abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token sequenced_set: add extract_vector method effective_replication_map: clone_endpoints_gently -> clone_data_gently vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints stall_free.hh: add clear_gently for rvalues stall_free.hh: relax Container requirement token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map token_metadata: introduce topology_change_info token_metadata: replace set_topology_transition_state with set_read_new	2023-05-22 21:37:06 +02:00
Tomasz Grabiec	9d4bca26cc	Merge 'raft topology: implement `check_and_repair_cdc_streams` API' from Kamil Braun `check_and_repair_cdc_streams` is an existing API which you can use when the current CDC generation is suboptimal, e.g. after you decommissioned a node the current generation has more stream IDs than you need. In that case you can do `nodetool checkAndRepairCdcStreams` to create a new generation with fewer streams. It also works when you change number of shards on some node. We don't automatically introduce a new generation in that case but you can use `checkAndRepairCdcStreams` to create a new generation with restored shard-colocation. This PR implements the API on top of raft topology, it was originally implemented using gossiper. It uses the `commit_cdc_generation` topology transition state and a new `publish_cdc_generation` state to create new CDC generations in a cluster without any nodes changing their `node_state`s in the process. Closes #13683 * github.com:scylladb/scylladb: docs: update topology-over-raft.md test: topology_experimental_raft: test `check_and_repair_cdc` API raft topology: implement `check_and_repair_cdc_streams` API raft topology: implement global request handling raft topology: introduce `prepare_new_cdc_generation_data` raft_topology: `get_node_to_work_on_opt`: return guard if no node found raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition raft topology: separate `publish_cdc_generation` state raft topology: non-node-specific `exec_global_command` raft topology: introduce `start_operation()` raft topology: non-node-specific `topology_mutation_builder` topology_state_machine: introduce `global_topology_request` topology_state_machine: use `uint16_t` for `enum_class`es raft topology: make `new_cdc_generation_data_uuid` topology-global	2023-05-22 11:33:58 +02:00
Petr Gusev	5976277c2c	token_metadata: drop has_pending_ranges and migration_info Use the new erm::has_pending_ranges function, drop the old implementation from token_metadata.	2023-05-21 13:17:42 +04:00
Petr Gusev	8cb709d3d6	token_metadata: drop update_pending_ranges The function storage_service::update_pending_ranges is turned into update_topology_change_info. The pending_endpoints and read_endpoints will be computed later, when the erms are rebuilt.	2023-05-21 13:17:42 +04:00
Petr Gusev	10bf8c7901	token_metadata: introduce topology_change_info We plan to move pending_endpoints and read_endpoints, along with their computation logic, from token_metadata to vnode_effective_replication_map. The vnode_effective_replication_map seems more appropriate for them since it contains functionally similar _replication_map and we will be able to reuse pending_endpoints/read_endpoints across keyspaces sharing the same factory_key. At present, pending_endpoints and read_endpoints are updated in the update_pending_ranges function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the migration_info for specific keyspaces. In this commit, we introduce a new topology_change_info structure to hold the first part's data and create an update_topology_change_info function to update it. This structure will later be used in vnode_effective_replication_map to compute pending_endpoints and read_endpoints. This enables the reuse of topology_change_info across all keyspaces, unlike the current update_pending_ranges implementation, which is another benefit of this refactoring. The update_topology_change_info implementation is mostly derived from update_pending_ranges, there are a few differences though: * replacing async and thread with plain co_awaits; * adding a utils::clear_gently call for the previous value to mitigate reactor stalls if target_token_metadata grows large; * substituting immediately invoked lambdas with simple variables and blocks to reduce noise, as lambdas would need to be converted into coroutines. The original update_pending_ranges remains unchanged, and will be removed entirely upon transitioning to the new implementation. Meanwhile, we add an update_topology_change_info call to storage_service::update_pending_ranges so that we can iteratively switch the system to the new implementation.	2023-05-19 19:04:43 +04:00
Petr Gusev	51e80691ef	token_metadata: replace set_topology_transition_state with set_read_new This helps isolate topology::transition_state dependencies, token_metadata doesn't need the entire enum, just this boolean flag.	2023-05-19 19:04:43 +04:00
Kamil Braun	13df85ea11	Merge 'Cut feature_service -> system_keyspace dependency' from Pavel Emelyanov This implicit link is pretty bad, because feature service is a low-level one which lots of other services depend on. System keyspace is the opposite -- a high-level one that needs e.g. query processor and database to operate. This inverse dependency is created by the feature service's need to commit enabled features' names into system keyspace on cluster join. And it uses the qctx thing for that in a best-effort manner (not doing anything if it's null). The dependency can be cut. The only place where enabled features are committed is when gossiper enables features on join or by receiving state changes from other nodes. By that time the sharded<system_keyspace> is up and running and can be used. Although gossiper already has a system keyspace dependency, it's better not to overload it with the need to mess with enabling and persisting features. Instead, the feature_enabler instance is equipped with the needed dependencies and takes care of it. Eventually the enabler is also moved to feature_service.cc where it naturally belongs. 
Fixes: #13837 Closes #13172 * github.com:scylladb/scylladb: gossiper: Remove features and sysks from gossiper system_keyspace: De-static save_local_supported_features() system_keyspace: De-static load_\|save_local_enabled_features() system_keyspace: Move enable_features_on_startup to feature_service (cont) system_keyspace: Move enable_features_on_startup to feature_service feature_service: Open-code persist_enabled_feature_info() into enabler gms: Move feature enabler to feature_service.cc gms: Move gossiper::enable_features() to feature_service::enable_features_on_join() gms: Persist features explicitly in features enabler feature_service: Make persist_enabled_feature_info() return a future system_keyspace: De-static load_peer_features() gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features() gossiper: Enable features and register enabler from outside gms: Add feature_service and system_keyspace to feature_enabler	2023-05-18 18:21:06 +02:00
Gleb Natapov	701d6941a5	storage_proxy: raft topology: use gossiper state to populate peers table Some state that is used to fill in the 'peers' table is still propagated over gossiper. When moving a node into the normal state in raft topology code, use the data from the gossiper to populate the peers table, because storage_service::on_change() will not do it in case the node was not in normal state at the time it was called. Fixes: #13911 Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>	2023-05-18 16:00:29 +02:00
Pavel Emelyanov	5216dcb1b3	Merge 'db/system_keyspace: remove the dependency on storage_proxy' from Botond Dénes The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone functions in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`. This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups. This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which became a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR. Refs: #11870 Closes #13869 * github.com:scylladb/scylladb: db/system_keyspace: remove dependency on storage_proxy db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent replica: add query.hh	2023-05-18 10:53:27 +03:00
Avi Kivity	d2d53fc1db	Merge 'Do not yield while traversing the gossiper endpoint state map' from Benny Halevy This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map. get_endpoints is used here by gossiper and storage_service for iterations that may preempt, instead of iterating directly over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`), so as to prevent a use-after-free that may happen if the map is rehashed while the function yields, invalidating the loop iterators. Fixes #13899 Closes #13900 * github.com:scylladb/scylladb: storage_service: do not preempt while traversing endpoint_state_map gossiper: do not preempt while traversing endpoint_state_map	2023-05-16 18:04:35 +03:00
Benny Halevy	1da0b0ff76	storage_service: do not preempt while traversing endpoint_state_map The map iterators might be invalidated while yielding on insert if the map is rehashed. See https://en.cppreference.com/w/cpp/container/unordered_map/insert Refs #13899 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-05-16 12:24:44 +03:00
Benny Halevy	502b5522ca	storage_service: handle_state_normal: on_internal_error on "owns no tokens" Although this condition should not happen, we suspect that certain timing conditions might lead to a state where the node in handle_state_normal (possibly during shutdown) has no tokens. Currently we call on_internal_error_noexcept, so if abort_on_internal_error is false, we will just print an error and continue on with handle_state_normal. Change that to `on_internal_error` so as to throw an exception in production in this unexpected state. Refs #13801 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-05-15 12:49:17 +03:00
Botond Dénes	157fdb2f6d	db/system_keyspace: remove dependency on storage_proxy The methods that take storage_proxy as argument can now accept a replica::database instead. So update their signatures and update all callers. With that, system_keyspace.* no longer depends on storage_proxy directly.	2023-05-12 07:27:55 -04:00
Petr Gusev	08529a1c6c	storage_proxy, storage_service: use new read endpoints We use set_topology_transition_state to set read_new state in storage_service::topology_state_load based on _topology_state_machine._topology.tstate. This triggers update_pending_ranges to compute and store new ranges for read requests. We use this information in storage_proxy::get_endpoints_for_reading when we need to decide which nodes to use for reading.	2023-05-09 18:42:03 +04:00
Kamil Braun	372a06f735	raft topology: implement `check_and_repair_cdc_streams` API The original API is gossiper-based. Since we're moving CDC generations handling to Raft-based topology, we need to implement this API as well. For now the API creates a new generation unconditionally, in a follow-up I'll introduce a check to skip the creation if the current generation is optimal.	2023-05-08 16:49:01 +02:00
Kamil Braun	a09ed01ffa	raft topology: implement global request handling The only possible request for now is creating a new CDC generation.	2023-05-08 16:49:01 +02:00
Kamil Braun	2bd333dd84	raft topology: introduce `prepare_new_cdc_generation_data` Refactor the code, taking a bulk of the CDC-specific code used when there's a bootstrap request to a separate function. We'll use it elsewhere as well.	2023-05-08 16:48:59 +02:00
Kamil Braun	2863ef3df4	raft_topology: `get_node_to_work_on_opt`: return guard if no node found We'll need the guard back.	2023-05-08 16:48:29 +02:00
Kamil Braun	afcf17f168	raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition We don't need it for anything in this state, and this change allows us to commit CDC generations without transitioning nodes.	2023-05-08 16:47:37 +02:00
Kamil Braun	1b21a3c5ae	raft topology: separate `publish_cdc_generation` state Previously the generation committed in `commit_cdc_generation` state would be published by the coordinator in `write_both_read_old` state. This logic assumed that we only create new CDC generations during node bootstrap. We'll allow committing new generations without bootstrap (without any node transitions in fact), so we need this separate state. After publishing the generation, we check whether there is a transitioning node; if so, we'll enter `write_both_read_old` as next state, otherwise we'll make the topology non-transitioning.	2023-05-08 16:47:24 +02:00
Kamil Braun	6d5b8c1b7c	raft topology: non-node-specific `exec_global_command` This function broadcasts a command to cluster members. It takes a `node_to_work_on`. We'll need a version which works in situations where there is no 'node to work on'.	2023-05-08 16:47:13 +02:00
Kamil Braun	8b5237a058	raft topology: introduce `start_operation()` This calls `raft_group0_client::start_operation` and checks if the term is different from the term that the coordinator was initially created with; if so, we must no longer continue coordinating the topology. There was one direct call to `raft_group0_client::start_operation` without a term check, replace it with the introduced function.	2023-05-08 16:47:13 +02:00
Kamil Braun	90770f712c	raft topology: non-node-specific `topology_mutation_builder` The existing `topology_mutation_builder` took a `raft::server_id` in its constructor and immediately created a clustering row in the `system.topology` mutation that it was building for the given node. This does not allow building mutations which only affect the static columns. Split the class into two: - `topology_mutation_builder` doesn't take `raft::server_id` in its constructor and contains only the methods that are used to set static columns. It also has a `with_node` method taking a `raft::server_id` which returns a `topology_node_mutation_builder&`. - `topology_node_mutation_builder` creates the clustering row and allows setting its columns. We'll use `topology_mutation_builder` when we only want to transition the cluster-global topology state, without affecting any specific nodes' states.	2023-05-08 16:47:11 +02:00
Kamil Braun	93dcdcd4eb	raft topology: make `new_cdc_generation_data_uuid` topology-global - make it a static column in `system.topology` - move it from node-specific `ring_slice` to cluster-global `topology` We will use it in scenarios where no node is transitioning. Also make it `std::optional` in topology for consistency with other fields (previously, the 'no value' state for this field was represented using default-constructed `utils::UUID`).	2023-05-08 16:46:14 +02:00
Kamil Braun	aba31ad06c	storage_service: use `seastar::format` instead of `fmt::format` For some reason Scylla crashes on `aarch64` in release mode when calling `fmt::format` in `raft_removenode` and `raft_decommission`. E.g. on this line: ``` group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, fmt::format("decomission: request decomission for {}", raft_server.id())); ``` I found this in our configure.py: ``` def get_clang_inline_threshold(): if args.clang_inline_threshold != -1: return args.clang_inline_threshold elif platform.machine() == 'aarch64': # we see miscompiles with 1200 and above with format("{}", uuid) # also coroutine miscompiles with 600 return 300 else: return 2500 ``` but reducing it to `0` didn't help. I managed to get the following backtrace (with inline threshold 0): ``` void boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) at /usr/include/boost/intrusive/list.hpp:751 (inlined by) boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear() at /usr/include/boost/intrusive/list.hpp:728 (inlined by) ~list_impl at /usr/include/boost/intrusive/list.hpp:255 void fmt::v9::detail::buffer<wchar_t>::append<wchar_t>(wchar_t const, wchar_t const) at ??:? 
void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) at ??:? fmt::v9::vformat[abi:cxx11](fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<fmt::v9::appender, char> >) at ??:? std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v9::format<utils::tagged_uuid<raft::server_id_tag>&>(fmt::v9::basic_format_string<char, fmt::v9::type_identity<utils::tagged_uuid<raft::server_id_tag>&>::type>, utils::tagged_uuid<raft::server_id_tag>&) at /usr/include/fmt/core.h:3206 (inlined by) service::storage_service::raft_removenode(utils::tagged_uuid<locator::host_id_tag>) at ./service/storage_service.cc:3572 ``` Maybe it's a bug in `fmt` library? In any case replacing the call with `::format` (i.e. `seastar::format` from seastar/core/print.hh) helps. Do it for the entire file for consistency (and avoiding this bug). Also, for the future, replace `format` calls with `::format` - now it's the same thing, but the latter won't clash with `std::format` once we switch to libstdc++13. Fixes #13707 Closes #13711	2023-05-05 19:23:22 +02:00
Tomasz Grabiec	e385ce8a2b	Merge "fix stack use after free during shutdown" from Gleb storage_service uses raft_group0, but during shutdown the latter is destroyed before the former is stopped. This series moves raft_group0 destruction to after storage_service is stopped. For the move to work, some existing dependencies of raft_group0 are dropped since they are not really needed during the object creation. Fixes #13522	2023-05-04 15:14:18 +02:00
Gleb Natapov	e9fb885e82	service/raft: raft_group0: drop dependency on cdc::generation_service raft_group0 does not really depend on cdc::generation_service; it needs it only transiently, so pass it to the appropriate methods of raft_group0 instead of during its creation.	2023-05-04 13:03:07 +03:00
Tomasz Grabiec	aba5667760	Merge 'raft topology: refactor the coordinator to allow non-node specific topology transitions' from Kamil Braun We change the meaning and name of `replication_state`: previously it was meant to describe the "state of tokens" of a specific node; now it describes the topology as a whole - the current step in the 'topology saga'. It was moved from `ring_slice` into `topology`, renamed into `transition_state`, and the topology coordinator code was modified to switch on it first instead of node state - because there may be no single transitioning node, but the topology itself may be transitioning. This PR was extracted from #13683, it contains only the part which refactors the infrastructure to prepare for non-node specific topology transitions. Closes #13690 * github.com:scylladb/scylladb: raft topology: rename `update_replica_state` -> `update_topology_state` raft topology: remove `transition_state::normal` raft topology: switch on `transition_state` first raft topology: `handle_ring_transition`: rename `res` to `exec_command_res` raft topology: parse replaced node in `exec_global_command` raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on` storage_service: extract raft topology coordinator fiber to separate class raft topology: rename `replication_state` to `transition_state` raft topology: make `replication_state` a topology-global state	2023-04-30 10:55:24 +02:00
Asias He	a8040306bb	storage_service: Fix removing replace node as pending Consider - n1, n2, n3 - n3 is down - n4 replaces n3 with the same ip address 127.0.0.3 - Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2 ``` auto host_id = _gossiper.get_host_id(endpoint); auto existing = tmptr->get_endpoint_for_host_id(host_id); ``` host_id = new host id existing = empty As a result, del_replacing_endpoint() will not be called. This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when replacing is done. This is wrong. This is a regression since commit `9942c60d93` (storage_service: do not inherit the host_id of a replaced a node), where the replacing node uses a different host id from the node being replaced. To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing is empty. Before: n1: storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3 token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3 storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3 storage_service - Node 127.0.0.3 state jump to normal storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3 After: n1: storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3 token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3 storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3 storage_service - Node 127.0.0.3 state jump to normal token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3 storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by 
node=127.0.0.3 Tests: https://github.com/scylladb/scylla-dtest/pull/3126 Closes #13677	2023-04-27 21:03:01 +03:00
Kamil Braun	0bee872fb1	raft topology: rename `update_replica_state` -> `update_topology_state` The new name is more generic and appropriate for topology transitions which don't affect any specific replica but the entire cluster as a whole (which we'll introduce later). Also take `guard` directly instead of `node_to_work_on` in this more generic function. Since we want `node_to_work_on` to die when we steal its guard, introduce `take_guard` which takes ownership of the object and returns the guard.	2023-04-27 15:22:19 +02:00
Kamil Braun	22ab5982e7	raft topology: remove `transition_state::normal` What this state really represented is that there is currently no transition. So remove it and make `transition_state` optional instead.	2023-04-27 15:18:32 +02:00
Kamil Braun	61c4e0ae20	raft topology: switch on `transition_state` first Previously the code assumed that there was always a 'node to work on' (a node which wants to change its state) or there was no work to do at all. It would find such a node, switch on its state (e.g. check if it's bootstrapping), and in some states switch on the topology `transition_state` (e.g. check if it's `write_both_read_old`). We want to introduce transitions that are not node-specific and can work even when all nodes are 'normal' (so there's no 'node to work on'). As a first step, we refactor the code so it switches on `transition_state` first. In some of these states, like `write_both_read_old`, there must be a 'node to work on' for the state to make sense; but later in some states it will be optional (such as `commit_cdc_generation`).	2023-04-27 15:14:59 +02:00
Kamil Braun	a023ca2cf1	raft topology: `handle_ring_transition`: rename `res` to `exec_command_res` A more descriptive name.	2023-04-27 15:12:12 +02:00
Kamil Braun	4ddfce8213	raft topology: parse replaced node in `exec_global_command` Will make following commits easier.	2023-04-27 15:10:49 +02:00
Kamil Braun	bafce8fd28	raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`	2023-04-27 15:04:36 +02:00
Kamil Braun	98f69f52aa	storage_service: extract raft topology coordinator fiber to separate class The lambdas defined inside the fiber are now methods of this class. Currently `handle_node_transition` is calling `handle_ring_transition`, in a later commit we will reverse this: `handle_ring_transition` will call `handle_node_transition`. We won't have to shuffle the functions around because they are members of the same class, making the change easier to review. In general, the code will be easier to maintain in this new form (no need to deal with so many lambda captures etc.) Also break up some lines which exceeded the 120 character limit (as per Seastar coding guidelines).	2023-04-27 15:04:35 +02:00
Kamil Braun	defa63dc20	raft topology: rename `replication_state` to `transition_state` The new name is more generic - it describes the current step of a 'topology saga' (a sequence of steps used to implement a larger topology operation such as bootstrap).	2023-04-27 11:39:38 +02:00
Kamil Braun	af1ea2bb16	raft topology: make `replication_state` a topology-global state Previously it was part of `ring_slice`, belonging to a specific node. This commit moves it into `topology`, making it a cluster-global property. The `replication_state` column in `system.topology` is now `static`. This will allow us to easily introduce topology transition states that do not refer to any specific node. `commit_cdc_generation` will be such a state, allowing us to commit a new CDC generation even though all nodes are normal (none are transitioning). One could argue that the other states are conceptually already cluster-global: for example, `write_both_read_new` doesn't affect only the tokens of a bootstrapping (or decommissioning etc.) node; it affects replica sets of other tokens as well (with RFs greater than 1).	2023-04-27 11:39:38 +02:00
Kamil Braun	30cc07b40d	Merge 'Introduce tablets' from Tomasz Grabiec This PR introduces an experimental feature called "tablets". Tablets are a way to distribute data in the cluster, which is an alternative to the current vnode-based replication. Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. With tablets, the aim is to start from a different side. Divide resources of replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets). This will allow us to balance the load in a more flexible manner, by moving individual tablets around. Also, unlike with vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind raft groups to tablets. Those goals are not yet achieved with this PR, but it lays the ground for this. Things achieved in this PR: - You can start a cluster and create a keyspace whose tables will use tablet-based replication. This is done by setting `initial_tablets` option: ``` CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3, 'initial_tablets': 8}; ``` All tables created in such a keyspace will be tablet-based. Tablet-based replication is a trait, not a separate replication strategy. Tablets don't change the spirit of replication strategy, it just alters the way in which data ownership is managed. In theory, we could use it for other strategies as well like EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy is augmented to support tablets. 
- You can create and drop tablet-based tables (no DDL language changes) - DML / DQL work with tablet-based tables Replicas for tablet-based tables are chosen from tablet metadata instead of token metadata Things which are not yet implemented: - handling of views, indexes, CDC created on tablet-based tables - sharding is done using the old method, it ignores the shard allocated in tablet metadata - node operations (topology changes, repair, rebuild) are not handling tablet-based tables - not integrated with compaction groups - tablet allocator piggy-backs on tokens to choose replicas. Eventually we want to allocate based on current load, not statically Closes #13387 * github.com:scylladb/scylladb: test: topology: Introduce test_tablets.py raft: Introduce 'raft_server_force_snapshot' error injection locator: network_topology_strategy: Support tablet replication service: Introduce tablet_allocator locator: Introduce tablet_aware_replication_strategy locator: Extract maybe_remove_node_being_replaced() dht: token_metadata: Introduce get_my_id() migration_manager: Send tablet metadata as part of schema pull storage_service: Load tablet metadata when reloading topology state storage_service: Load tablet metadata on boot and from group0 changes db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() migration_notifier: Introduce before_drop_keyspace() migration_manager: Make prepare_keyspace_drop_announcement() return a future<> test: perf: Introduce perf-tablets test: Introduce tablets_test test: lib: Do not override table id in create_table() utils, tablets: Introduce external_memory_usage() db: tablets: Add printers db: tablets: Add persistence layer dht: Use last_token_of_compaction_group() in split_token_range_msb() locator: Introduce tablet_metadata dht: Introduce first_token() dht: Introduce next_token() storage_proxy: Improve trace-level logging locator: token_metadata: Fix confusing comment on ring_range() 
dht, storage_proxy: Abstract token space splitting Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries" db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms() db: Introduce get_non_local_vnode_based_strategy_keyspaces() service: storage_proxy: Avoid copying keyspace name in write handler locator: Introduce per-table replication strategy treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type locator: Introduce effective_replication_map locator: Rename effective_replication_map to vnode_effective_replication_map locator: effective_replication_map: Abstract get_pending_endpoints() db: Propagate feature_service to abstract_replication_strategy::validate_options() db: config: Introduce experimental "TABLETS" feature db: Log replication strategy for debugging purposes db: Log full exception on error in do_parse_schema_tables() db: keyspace: Remove non-const replication strategy getter config: Reformat	2023-04-27 09:40:18 +02:00
Tomasz Grabiec	ce94a2a5b0	Merge 'Fixes and tests for raft-based topology changes' from Kamil Braun Fix two issues with the replace operation introduced by recent PRs. Add a test which performs a sequence of basic topology operations (bootstrap, decommission, removenode, replace) in a new suite that enables the `raft` experimental feature (so that the new topology change coordinator code is used). Fixes: #13651 Closes #13655 * github.com:scylladb/scylladb: test: new suite for testing raft-based topology test: remove topology_custom/test_custom.py raft topology: don't require new CDC generation UUID to always be present raft topology: include shard_count/ignore_msb during replace	2023-04-26 11:38:07 +02:00
Pavel Emelyanov	5cbc8fe2f9	system_keyspace: De-static save_local_supported_features() That's, in fact, an independent change, because feature enabler doesn't need this method. So this patch is like "while at it" thing, but on the other hand it ditches one more qctx usage. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 17:04:54 +03:00
Pavel Emelyanov	dcf88b07a4	gms: Move gossiper::enable_features() to feature_service::enable_features_on_join() This will make it possible to move the enabler to feature_service.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 16:56:07 +03:00
Pavel Emelyanov	1ee04e4934	system_keyspace: De-static load_peer_features() This makes use of feature_enabler::_sys_ks dependency and gets rid of one more global qctx usage. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 16:50:00 +03:00
Pavel Emelyanov	ac60d8afca	gossiper: Enable features and register enabler from outside It's a bit hairy. The maybe_enable_features() is called from two places -- the feature_enabler upon notifications from gossiper and directly by gossiper from wait_for_gossip_to_settle(). The _latter_ is called only when the wait_for_gossip_to_settle() is called for the first time because of the _gossip_settled checks in it. For the first time this method is called by storage_service when it tries to join the ring (next it's called from main, but that's not of interest here). Next, although feature_enabler is registered early -- when gossiper instance is constructed by sharded<gossiper>::start() -- it checks that _gossip_settled is true before taking any action. Considering both, calling maybe_enable_features() _and_ registering enabler after storage_service's call to wait_for_gossip_to_settle() doesn't break the code logic, but makes further patching possible. In particular, the feature_enabler will move to feature_service so as not to pollute gossiper code with anything that's not gossiping. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 16:42:17 +03:00
Kefu Chai	5804eb6d81	storage_service: specialize fmt::formatter<storage_service::mode> this is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `storage_service::mode` without the help of `operator<<`. the corresponding `operator<<()` for `storage_service::mode` is removed in this change, as all its callers now use fmtlib for formatting. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13640	2023-04-25 14:20:57 +02:00

1387 Commits