scylla

Author	SHA1	Message	Date
Botond Dénes	a48881801a	replica/tablets: drop keyspace_name from system.tablets partition-key The name of the keyspace being part of the partition key is not useful, the table_id already uniquely identifies the table. The keyspace name being part of the key, means that code wanting to interact with this table, often has to resolve the table id, just to be able to provide the keyspace name. This is counter productive, so make the keyspace_name just a static column instead, just like table_name already is. Fixes: #16377 Closes scylladb/scylladb#16881	2024-01-22 13:12:02 +01:00
Kamil Braun	1007ac4956	Merge 'sync_raft_topology_nodes: force_remove_endpoint for left nodes only if an IP is not used by other nodes' from Petr Gusev Before the patch we called `gossiper.remove_endpoint` for IP-s of the left nodes. The problem is that in replace-with-same-ip scenario we called `gossiper.remove_endpoint` for IP which is used by the new, replacing node. The `gossiper.remove_endpoint` method puts the IP into quarantine, which means gossiper will ignore all events about this IP for `quarantine_delay` (one minute by default). If we immediately replace just replaced node with the same IP again, the bootstrap will fail since the gossiper events are blocked for this IP, and we won't be able to resolve an IP for the new host_id. Another problem was that we called gossiper.remove_endpoint method, which doesn't remove an endpoint from `_endpoint_state_map`, only from live and unreachable lists. This means the IP will keep circulating in the gossiper message exchange between cluster nodes until full cluster restart. This patch fixes both of these problems. First, we rely on the fact that when topology coordinator moves the `being_replaced` node to the left state, the IP of the `replacing` node is known to all nodes. This means before removing an IP from the gossiper we can check if this IP is currently used by another node in the current raft topology. This is done by constructing the `used_ips` map based on normal and transition nodes. This map is cached to avoid quadratic behaviour. Second, we call `gossiper.force_remove_endpoint`, not `gossiper.remove_endpoint`. This function removes an IP from `_endpoint_state_map`, as well as from live and unreachable lists. 
Closes scylladb/scylladb#16820 * github.com:scylladb/scylladb: get_peer_info_for_update: update only required fields in raft topology mode get_peer_info_for_update: introduce set_field lambda storage_service::on_change: fix indent storage_service::on_change: skip handle_state functions in raft topology mode test_replace_different_ip: check old IP is removed from gossiper test_replace: check two replace with same IP one after another storage_service: sync_raft_topology_nodes: force_remove_endpoint for left nodes only if an IP is not used by other nodes	2024-01-22 11:25:55 +01:00
Petr Gusev	5de970e430	get_peer_info_for_update: update only required fields in raft topology mode Some fields of system.peers table are updated through raft, we don't need to peek them from gossiper. The goal of the patch is to declare explicitly which code is responsible for which fields. In particular, in raft topology mode we don't need to update raft-managed fields since it's done in topology_state_load and raft_ip_address_updater.	2024-01-19 20:37:12 +04:00
Petr Gusev	f51f843b67	get_peer_info_for_update: introduce set_field lambda This is a refactoring commit. In the next commit we'll add a parameter to this unified lambda and this is easy to do if we have only one lambda and not three.	2024-01-19 20:37:12 +04:00
Petr Gusev	37063e2432	storage_service::on_change: fix indent	2024-01-19 20:37:12 +04:00
Petr Gusev	8e6b569de5	storage_service::on_change: skip handle_state functions in raft topology mode We don't need them in raft topology mode since the token_metadata update happens in topology_state_load function. We lift the _raft_topology_change_enabled checks from those functions to on_change.	2024-01-19 20:37:12 +04:00
Pavel Emelyanov	e62114214f	Merge 'More logging for Raft-based topology' from Kamil Braun Currently if topology coordinator gets stuck in a CI test run it's hard to debug this (e.g. scylladb/scylladb#16708). We can add a lot of logging inside topology coordinator code to aid debugging, without spamming the logs -- these are relatively rare control plane events. Closes scylladb/scylladb#16749 * github.com:scylladb/scylladb: test/pylib: scylla_cluster: enable raft_topology=debug level by default raft topology: increase level of some TRACE messages raft topology: log when entering transition states raft topology: don't include null ID in exclude_nodes raft topology: INFO log when executing global commands and updating topology state storage_service: separate logger for raft topology	2024-01-19 16:19:44 +03:00
Petr Gusev	30b2e5838c	storage_service: sync_raft_topology_nodes: force_remove_endpoint for left nodes only if an IP is not used by other nodes Before the patch we called gossiper.remove_endpoint for IP-s of the left nodes. The problem is that in replace-with-same-ip scenario we called gossiper.remove_endpoint for IP which is used by the new, replacing node. The gossiper.remove_endpoint method puts the IP into quarantine, which means gossiper will ignore all events about this IP for quarantine_delay (one minute by default). If we immediately replace just replaced node with the same IP again, the bootstrap will fail since the gossiper events are blocked for this IP, and we won't be able to resolve an IP for the new host_id. Another problem was that we called gossiper.remove_endpoint method, which doesn't remove an endpoint from _endpoint_state_map, only from live and unreachable lists. This means the IP will keep circulating in the gossiper message exchange between cluster nodes until full cluster restart. This patch fixes both of these problems. First, we rely on the fact that when topology coordinator moves the being_replaced node to the left state, the IP of the replacing node is known to all nodes. This means before removing an IP from the gossiper we can check if this IP is currently used by another node in the current raft topology. This is done by constructing the used_ips map based on normal and transition nodes. This map is cached to avoid quadratic behaviour. Second, we call gossiper.force_remove_endpoint, not gossiper.remove_endpoint. This function removes an IP from _endpoint_state_map, as well as from live and unreachable lists. The tests for both of these improvements will be added in subsequent commits.	2024-01-19 12:24:04 +04:00
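The "is this IP still in use" check described in the commit above can be illustrated with a minimal sketch. This is not the ScyllaDB implementation (which is C++); the `Node` type, the state strings, and the function `ips_safe_to_remove` are hypothetical names chosen for illustration, and a set stands in for the `used_ips` map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical stand-in for a raft topology entry.
    host_id: str
    ip: str
    state: str  # assumed values: "normal", "transition", "left"


def ips_safe_to_remove(nodes):
    """Return IPs of left nodes that no normal or transition node still uses."""
    # Build the set of in-use IPs once, so the per-node check below is O(1)
    # instead of rescanning all nodes for every left node.
    used_ips = {n.ip for n in nodes if n.state in ("normal", "transition")}
    return [n.ip for n in nodes if n.state == "left" and n.ip not in used_ips]


nodes = [
    Node("a", "10.0.0.1", "left"),        # replaced node
    Node("b", "10.0.0.1", "transition"),  # replacing node reusing the same IP
    Node("c", "10.0.0.2", "left"),
    Node("d", "10.0.0.3", "normal"),
]
removable = ips_safe_to_remove(nodes)
```

In this example only `10.0.0.2` would be removed from the gossiper; `10.0.0.1` is kept because the replacing node still uses it.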
Asias He	d3efb3ab6f	storage_service: Set session id for raft_rebuild Raft rebuild is broken because the session id is not set. The following was seen when running rebuild: stream_session - [Stream #8cfca940-afc9-11ee-b6f1-30b8f78c1451] stream_transfer_task: Fail to send to 127.0.70.1:0: seastar::rpc::remote_verb_error (Session not found: 00000000-0000-0000-0000-000000000000) with raft topology, e.g., scylla --enable-repair-based-node-ops 0 --consistent-cluster-management true --experimental-features consistent-topology-changes Fix by setting the session id. Fixes #16741 Closes scylladb/scylladb#16814	2024-01-18 12:47:20 +02:00
Kamil Braun	52e67ca121	raft topology: increase level of some TRACE messages Increased them to DEBUG level, and in one case to WARN (inside an exception handler). The selected messages are still relatively rare (per-node per-transition control plane events, plus events such as fibers sleeping and waking up) although more low level. They are also small messages. Messages that are large such as those which print all tokens of nodes or large mutations are left on TRACE level. The plan is to enable DEBUG level logging in test.py tests for raft_topology, while not spamming the logs completely such as by printing large mutations.	2024-01-18 11:24:16 +01:00
Kamil Braun	92e6604127	raft topology: log when entering transition states Those are rare control plane events, but might be useful when debugging problems with topology coordinator (e.g. where it got stuck).	2024-01-18 11:24:15 +01:00
Kamil Braun	aeb53ea31d	raft topology: don't include null ID in exclude_nodes Observed with newly added logs: ``` raft topology - executing global topology command barrier_and_drain, excluded nodes: {00000000-0000-0000-0000-000000000000} ```	2024-01-18 11:24:15 +01:00
Kamil Braun	ae25f703c4	raft topology: INFO log when executing global commands and updating topology state Those are rare control plane events, but useful for debugging e.g. if topology coordinator gets stuck at some point.	2024-01-18 11:24:15 +01:00
Kamil Braun	71957b4320	storage_service: separate logger for raft topology Allows selectively enabling higher logging levels for just raft-topology related things, without doing it for the entire storage_service (which includes things like gossiper callbacks). Also gets rid of the redundant "raft topology:" prefix which was also not included everywhere.	2024-01-18 11:24:14 +01:00
Kefu Chai	0ae81446ef	./: not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16766	2024-01-17 16:30:14 +02:00
Kamil Braun	787b24cd24	Merge 'raft topology: join: shut down a node on error in response handler' from Patryk Jędrzejczak If the joining node fails while handling the response from the topology coordinator, it hangs even though it knows the join operation has failed. Therefore, we ensure it shuts down in this patch. Additionally, we ensure that if the first join request response was a rejection or the node failed while handling it, the following acceptances by the (possibly different) coordinator don't succeed. The node considers the join operation as failed. We shouldn't add it to the cluster. Fixes scylladb/scylladb#16333 Closes scylladb/scylladb#16650 * github.com:scylladb/scylladb: topology_coordinator: clarify warnings raft topology: join: allow only the first response to be a successful acceptance storage_service: join_node_response_handler: fix indentation raft topology: join: shut down a node on error in response handler	2024-01-17 14:55:26 +01:00
Botond Dénes	f22fc88a64	Merge 'Configure service levels interval' from Michał Jadwiszczak Service level controller updates itself in interval. However the interval time is hardcoded in main to 10 seconds and it leads to long sleeps in some of the tests. This patch moves this value to `service_levels_interval_ms` command line option and sets this value to 0.5s in cql-pytest. Closes scylladb/scylladb#16394 * github.com:scylladb/scylladb: test:cql-pytest: change service levels intervals in tests configure service levels interval	2024-01-17 12:24:49 +02:00
Tomasz Grabiec	3d76aefb98	Merge "Enhance topology request status tracking" from Gleb Currently, to figure out if a topology request is complete, a submitter checks the topology state and tries to figure out from that the status of the request. This is not exact. Let's look at rebuild handling for instance. To figure out if the request is completed the code waits for the request object to disappear from the topology, but if another rebuild starts between the end of the previous one and the code noticing that it completed, the code will continue waiting for the next rebuild. Another problem is that in case of operation failure there is no way to pass an error back to the initiator. This series solves those problems by assigning an id for each request and tracking the status of each request in a separate table. The initiator can query the request status from the table and see if the request was completed successfully or if it failed with an error, which is also available from the table. The schema for the table is: CREATE TABLE system.topology_requests ( id timeuuid PRIMARY KEY, initiating_host uuid, start_time timestamp, done boolean, error text, end_time timestamp, ); and all entries have TTL of one month.	2024-01-17 00:37:19 +01:00
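The per-request tracking described in the merge above can be sketched in a few lines. This is an in-memory illustration only, not the ScyllaDB code: `RequestTable`, `RequestStatus`, and the method names are hypothetical, a plain dict stands in for `system.topology_requests`, and representing "no error" as `None` is an assumption.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestStatus:
    # Mirrors a subset of the columns in the schema quoted above.
    initiating_host: str
    done: bool = False
    error: Optional[str] = None  # assumption: None means "no error"


class RequestTable:
    """Hypothetical in-memory stand-in for the request status table."""

    def __init__(self):
        self._rows = {}

    def start(self, initiating_host):
        # uuid1 is time-based, matching the timeuuid key in the schema.
        request_id = uuid.uuid1()
        self._rows[request_id] = RequestStatus(initiating_host)
        return request_id

    def finish(self, request_id, error=None):
        row = self._rows[request_id]
        row.done = True
        row.error = error

    def poll(self, request_id):
        row = self._rows[request_id]
        if not row.done:
            return "pending"
        return "ok" if row.error is None else f"failed: {row.error}"


table = RequestTable()
first = table.start("host-1")
table.finish(first)
second = table.start("host-1")  # a new rebuild starts right after the first
```

Because each request has its own id, the initiator of `first` sees it as complete even though `second` is still pending, which is the ambiguity the series sets out to remove.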
Gleb Natapov	9a7243d71a	storage_service: topology coordinator: Consolidate some mutation builder code	2024-01-16 17:02:54 +02:00
Gleb Natapov	a145a73136	storage_service: topology coordinator: make topology operation rollback error more informative Include an error which caused the rollback.	2024-01-16 17:02:54 +02:00
Gleb Natapov	bf91eb37f2	storage_service: topology coordinator: make topology operation cancellation error more informative Include the list of nodes that were down when cancellation happened.	2024-01-16 17:02:54 +02:00
Gleb Natapov	8beb399b72	storage_service: topology coordinator: consolidate some code in cancel_all_requests There is a code duplication that can be avoided.	2024-01-16 17:02:54 +02:00
Gleb Natapov	fba6877b3e	storage_service: topology coordinator: TTL topology request table To prevent topology_request table growth TTL all writes to expire after a month.	2024-01-16 17:02:54 +02:00
Gleb Natapov	d576ed31dc	storage_service: topology request: drop explicit shutdown rpc Now that we have explicit status for each request we may use it to replace shutdown notification rpc. During a decommission, in left_token_ring state, we set done to true after the metadata barrier that waits for all requests to the decommissioning node to complete, and notify the decommissioning node with a regular barrier. At this point the node will see that the request is complete and exit.	2024-01-16 17:02:54 +02:00
Gleb Natapov	84197ff735	storage_service: topology coordinator: check topology operation completion using status in topology_requests table Instead of trying to guess if a request completed by looking into the topology state (which can sometimes be error prone), look at the request status in the new topology_requests table. If the request failed, report the reason for the failure from the table.	2024-01-16 17:02:54 +02:00
Kefu Chai	2dbf044b91	cql3: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16791	2024-01-16 16:43:17 +02:00
Gleb Natapov	1c18476385	storage_service: topology coordinator: update topology_requests table with requests progress Make topology coordinator update request's status in topology_requests table as it changes.	2024-01-16 15:35:18 +02:00
Gleb Natapov	1ce1c5001d	topology coordinator: add topology_requests table to group0 snapshot Since the table is updated through raft's group0 state machine its content needs to be part of the snapshot.	2024-01-16 13:57:27 +02:00
Gleb Natapov	584551f849	topology coordinator: add request_id to the topology state machine Provide a unique ID for each topology request and store it in the topology state machine. It will be used to index the new topology requests table in order to retrieve request status.	2024-01-16 13:57:27 +02:00
Avi Kivity	5e70dd1dbe	database: don't allow keyspace objects to be copied keyspace objects are heavyweight and copies are immediately out-of-date, so copying them is bad. Fix by deleting the copy constructor and copy assignment operator. One call site is fixed. This call site is safe since it's only used for accessing a few attributes (introduced in `f70c4127c6`). Closes scylladb/scylladb#16782	2024-01-15 21:48:32 +01:00
Kefu Chai	e5300f3e21	topology_state_machine: add formatter for service::cleanup_status before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for service::cleanup_status, and remove its operator<<(). Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16778	2024-01-15 21:31:42 +02:00
Kefu Chai	ece2bd2f6e	service: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16764	2024-01-15 13:29:33 +02:00
Gleb Natapov	ba7aa0d582	storage_service: topology coordinator: add error injection point to be able to pause the topology coordinator	2024-01-14 15:45:53 +02:00
Gleb Natapov	1afc891bd5	storage_service: topology coordinator: add logging to removenode and decommission Add some useful logging to removenode and decommission to be used by tests later.	2024-01-14 15:45:53 +02:00
Gleb Natapov	97ab3f6622	storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator Introduce new REST API "/storage_service/cleanup_all" that, when triggered, instructs the topology coordinator to initiate cluster wide cleanup on all dirty nodes. It is done by introducing new global command "global_topology_request::cleanup".	2024-01-14 15:45:53 +02:00
Gleb Natapov	0adb3904d8	storage_service: topology coordinator: manage cluster cleanup as part of the topology management Sometimes it is unsafe to start a new topology operation before cleanup runs on dirty nodes. This patch detects the situation when the topology operation to be executed cannot be run safely until all dirty nodes do cleanup and initiates the cleanup automatically. It also waits for cleanup to complete before proceeding with the topology operation. There can be a situation where a node that needs cleanup dies and will never clear the flag. In this case, if a topology operation that wants to run next does not have this node in its ignore node list, it may get stuck forever. To fix this the patch also introduces the "liveness aware" request queue management: we do not simply choose _a_ request to run next, but go over the queue and find requests that can proceed considering the nodes' liveness situation. If there are multiple requests eligible to run, the patch introduces an order based on the operation type: replace, join, remove, leave, rebuild. The order is chosen so as to not trigger cleanup needlessly.	2024-01-14 15:45:50 +02:00
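The "liveness aware" selection described in the commit above can be sketched as follows. The priority order comes from the commit message; everything else is an assumption made for illustration: the `Request` type, the `pick_next` function, and in particular the eligibility rule (a request may proceed only if every dead node is in its ignore list), which is a simplification of whatever the coordinator actually checks.

```python
from dataclasses import dataclass

# Order taken from the commit message: earlier kinds run first.
PRIORITY = ["replace", "join", "remove", "leave", "rebuild"]


@dataclass(frozen=True)
class Request:
    # Hypothetical queue entry.
    kind: str
    node: str
    ignored: frozenset = frozenset()


def pick_next(queue, dead_nodes):
    """Pick the highest-priority request that can proceed, or None."""
    # Simplified eligibility rule (an assumption): all dead nodes must be
    # explicitly ignored by the request.
    eligible = [r for r in queue if set(dead_nodes) <= r.ignored]
    if not eligible:
        return None
    # min() keeps the first of equal-priority requests, preserving queue order.
    return min(eligible, key=lambda r: PRIORITY.index(r.kind))


queue = [
    Request("rebuild", "n1"),
    Request("remove", "n3", frozenset({"n3"})),
]
chosen = pick_next(queue, {"n3"})
```

With `n3` dead, the rebuild request is skipped because it does not ignore `n3`, and the remove request is chosen. If a node dies that no queued request ignores, `pick_next` returns `None`, the case in which the coordinator would otherwise be stuck.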
Gleb Natapov	c9b7bd5a33	storage_service: topology coordinator: provide a version of get_excluded_nodes that does not need node_to_work_on as a parameter Needed by the next patch.	2024-01-14 14:44:07 +02:00
Gleb Natapov	067267ff76	storage_service: topology coordinator: make topology coordinator lifecycle subscriber We want to change the coordinator to consider node liveness when processing the topology operation queue. If there are not enough live nodes to process any of the ops we want to cancel them. For that to work we need to be able to kick the coordinator if the liveness situation changes.	2024-01-14 14:44:07 +02:00
Gleb Natapov	f70c4127c6	storage_service: topology coordinator: introduce sstable cleanup fiber Introduce a fiber that waits on a topology event and when it sees that the node it runs on needs to perform sstable cleanup it initiates one for each non tablet, non local table and resets "cleanup" flag back to "clean" in the topology.	2024-01-14 14:44:07 +02:00
Gleb Natapov	5b246920ae	storage_proxy: allow to wait for all ongoing writes We want to be able to wait for all writes started through the storage proxy before a fence is advanced. Add phased_barrier that is entered on each local write operation before checking the fence to do so. A write will be either tracked by the phased_barrier or fenced. This will be needed to wait for all non fenced local writes to complete before starting a cleanup.	2024-01-14 14:44:07 +02:00
Gleb Natapov	b2ba77978c	storage_service: topology coordinator: mark nodes as needing cleanup when required A cleanup needs to run when a node loses ownership of a range (during bootstrap) or if a range movement to a normal node failed (removenode, decommission failure). Mark all dirty nodes as "cleanup needed" in those cases.	2024-01-14 14:43:59 +02:00
Gleb Natapov	dbededb1a6	storage_service: add mark_nodes_as_cleanup_needed function The function creates a mutation that sets cleanup to "needed" for each normal node that, according to the erm, has data it does not own after successful or unsuccessful topology operation.	2024-01-14 14:43:33 +02:00
Gleb Natapov	cc54796e23	raft topology: add cleanup state to the topology state machine The patch adds cleanup state to the persistent and in-memory state and handles the loading. The state can be "clean", which means no cleanup is needed; "needed", which means the node is dirty and needs to run cleanup at some point; or "running", which means that the node is running cleanup right now and the state will be reset to "clean" when it completes.	2024-01-14 13:30:54 +02:00
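The three cleanup states named in the commit above form a small cycle, which can be sketched like this. The state names come from the commit message; the `CleanupStatus` enum, the `ALLOWED` transition table, and the `advance` function are hypothetical, and the assumption that only clean -> needed -> running -> clean is permitted may be stricter than the real code.

```python
from enum import Enum


class CleanupStatus(Enum):
    CLEAN = "clean"      # no cleanup needed
    NEEDED = "needed"    # node is dirty and must run cleanup at some point
    RUNNING = "running"  # cleanup is in progress on the node


# Assumed transition table: a strict cycle through the three states.
ALLOWED = {
    CleanupStatus.CLEAN: {CleanupStatus.NEEDED},
    CleanupStatus.NEEDED: {CleanupStatus.RUNNING},
    CleanupStatus.RUNNING: {CleanupStatus.CLEAN},
}


def advance(current, new):
    """Return the new state, or raise if the transition is not in ALLOWED."""
    if new not in ALLOWED[current]:
        raise ValueError(f"invalid transition {current.value} -> {new.value}")
    return new


state = CleanupStatus.CLEAN
state = advance(state, CleanupStatus.NEEDED)   # topology operation left stale data
state = advance(state, CleanupStatus.RUNNING)  # node starts sstable cleanup
state = advance(state, CleanupStatus.CLEAN)    # cleanup finished
```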
Kamil Braun	4e18f8b453	Merge 'topology_state_load: stop waiting for IP-s' from Petr Gusev The loop in `id2ip` lambda makes problems if we are applying an old raft log that contains long-gone nodes. In this case, we may never receive the `IP` for a node and stuck in the loop forever. In this series we replace the loop with an if - we just don't update the `host_id <-> ip` mapping in the `token_metadata.topology` if we don't have an `IP` yet. The PR moves `host_id -> IP` resolution to the data plane, now it happens each time the IP-based methods of `erm` are called. We need this because IPs may not be known at the time the erm is built. The overhead of `raft_address_map` lookup is added to each data plane request, but it should be negligible. In this PR `erm/resolve_endpoints` continues to treat missing IP for `host_id` as `internal_error`, but we plan to relax this in the follow-up (see this PR first comment). Closes scylladb/scylladb#16639 * github.com:scylladb/scylladb: raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes storage_service: topology_state_load: remove IP waiting loop storage_service: sync_raft_topology_nodes: add target_node parameter storage_service: sync_raft_topology_nodes: move loops to the end storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node storage_service: sync_raft_topology_nodes: move update_topology up storage_service: topology_state_load: remove clone_async/clear_gently overhead storage_service: fix indentation storage_service: extract sync_raft_topology_nodes storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata address_map: move gossiper subscription logic into storage_service topology_coordinator: exec_global_command: small refactor, use contains + reformat 
storage_service: wait_for_ip for new nodes storage_service.idl.hh: fix raft_topology_cmd.command declaration erm: for_each_natural_endpoint_until: use is_vnode == true erm: switch the internal data structures to host_id-s erm: has_pending_ranges: switch to host_id	2024-01-12 18:46:51 +01:00
Petr Gusev	e24bee545b	raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater	2024-01-12 18:29:22 +04:00
Petr Gusev	6e7bbc94f4	gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes When a node changes its IP we need to store the mapping in system.peers and update token_metadata.topology and erm in-memory data structures. The test_change_ip was improved to verify this new behaviour. Before this patch the test didn't check that IPs used for data requests are updated on IP change. In this commit we add the read/write check. It fails on insert with 'node unavailable' error without the fix.	2024-01-12 18:28:57 +04:00
Petr Gusev	6d6e1ba8fb	storage_service: topology_state_load: remove IP waiting loop The loop causes problems if we are applying an old raft log that contains long-gone nodes. In this case, we may never receive the IP for a node and get stuck in the loop forever. The idea of the patch is to replace the loop with an if - we just don't update the host_id <-> ip mapping in the token_metadata.topology if we don't have an IP yet. When we get the mapping later, we'll call sync_raft_topology_nodes again from gossiper_state_change_subscriber_proxy.	2024-01-12 15:37:50 +04:00
Petr Gusev	260874c860	storage_service: sync_raft_topology_nodes: add target_node parameter If it's set, instead of going over all the nodes in raft topology, the function will update only the specified node. This parameter will be used in the next commit, in the call to sync_raft_topology_nodes from gossiper_state_change_subscriber_proxy.	2024-01-12 15:37:50 +04:00
Petr Gusev	a9d58c3db5	storage_service: sync_raft_topology_nodes: move loops to the end	2024-01-12 15:37:50 +04:00
Petr Gusev	d1bce3651b	storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node	2024-01-12 15:37:50 +04:00


4116 Commits