Commit Graph

4116 Commits

Author SHA1 Message Date
Botond Dénes
a48881801a replica/tablets: drop keyspace_name from system.tablets partition-key
Having the name of the keyspace in the partition key is not useful:
the table_id already uniquely identifies the table. Because the keyspace
name is part of the key, code wanting to interact with this
table often has to resolve the table id just to be able to provide the
keyspace name. This is counterproductive, so make the keyspace_name
just a static column instead, just like table_name already is.

Fixes: #16377

Closes scylladb/scylladb#16881
2024-01-22 13:12:02 +01:00
Kamil Braun
1007ac4956 Merge 'sync_raft_topology_nodes: force_remove_endpoint for left nodes only if an IP is not used by other nodes' from Petr Gusev
Before the patch we called `gossiper.remove_endpoint` for IP-s of the
left nodes. The problem is that in the replace-with-same-ip scenario we
called `gossiper.remove_endpoint` for an IP which is used by the new,
replacing node. The `gossiper.remove_endpoint` method puts the IP into
quarantine, which means the gossiper will ignore all events about this IP
for `quarantine_delay` (one minute by default). If we immediately
replace the just-replaced node with the same IP again, the bootstrap will
fail since the gossiper events are blocked for this IP, and we won't be
able to resolve an IP for the new host_id.

Another problem was that we called gossiper.remove_endpoint method,
which doesn't remove an endpoint from `_endpoint_state_map`, only from
live and unreachable lists. This means the IP will keep circulating in
the gossiper message exchange between cluster nodes until full cluster
restart.

This patch fixes both of these problems. First, we rely on the fact that
when topology coordinator moves the `being_replaced` node to the left
state, the IP of the `replacing` node is known to all nodes. This means
before removing an IP from the gossiper we can check if this IP is
currently used by another node in the current raft topology. This is
done by constructing the `used_ips` map based on normal and transition
nodes. This map is cached to avoid quadratic behaviour.

Second, we call `gossiper.force_remove_endpoint`, not
`gossiper.remove_endpoint`. This function removes an IP from
`_endpoint_state_map`, as well as from live and unreachable lists.

Closes scylladb/scylladb#16820

* github.com:scylladb/scylladb:
  get_peer_info_for_update: update only required fields in raft topology mode
  get_peer_info_for_update: introduce set_field lambda
  storage_service::on_change: fix indent
  storage_service::on_change: skip handle_state functions in raft topology mode
  test_replace_different_ip: check old IP is removed from gossiper
  test_replace: check two replace with same IP one after another
  storage_service: sync_raft_topology_nodes: force_remove_endpoint for left nodes only if an IP is not used by other nodes
2024-01-22 11:25:55 +01:00
Petr Gusev
5de970e430 get_peer_info_for_update: update only required fields in raft topology mode
Some fields of the system.peers table are updated
through raft, so we don't need to read them from the gossiper.

The goal of the patch is to declare explicitly
which code is responsible for which fields.
In particular, in raft topology mode we don't
need to update raft-managed fields since
it's done in topology_state_load and
raft_ip_address_updater.
2024-01-19 20:37:12 +04:00
Petr Gusev
f51f843b67 get_peer_info_for_update: introduce set_field lambda
This is a refactoring commit. In the next commit
we'll add a parameter to this unified lambda and
this is easy to do if we have only one lambda and
not three.
2024-01-19 20:37:12 +04:00
Petr Gusev
37063e2432 storage_service::on_change: fix indent 2024-01-19 20:37:12 +04:00
Petr Gusev
8e6b569de5 storage_service::on_change: skip handle_state functions in raft topology mode
We don't need them in raft topology mode since the token_metadata
update happens in topology_state_load function. We lift the
_raft_topology_change_enabled checks from those functions to on_change.
2024-01-19 20:37:12 +04:00
Pavel Emelyanov
e62114214f Merge 'More logging for Raft-based topology' from Kamil Braun
Currently if topology coordinator gets stuck in a CI test run it's hard to debug this (e.g. scylladb/scylladb#16708). We can add a lot of logging inside topology coordinator code to aid debugging, without spamming the logs -- these are relatively rare control plane events.

Closes scylladb/scylladb#16749

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: enable raft_topology=debug level by default
  raft topology: increase level of some TRACE messages
  raft topology: log when entering transition states
  raft topology: don't include null ID in exclude_nodes
  raft topology: INFO log when executing global commands and updating topology state
  storage_service: separate logger for raft topology
2024-01-19 16:19:44 +03:00
Petr Gusev
30b2e5838c storage_service: sync_raft_topology_nodes: force_remove_endpoint for left nodes only if an IP is not used by other nodes
Before the patch we called gossiper.remove_endpoint for IP-s
of the left nodes. The problem is that in the replace-with-same-ip
scenario we called gossiper.remove_endpoint for an IP which is
used by the new, replacing node. The gossiper.remove_endpoint
method puts the IP into quarantine, which means the gossiper will
ignore all events about this IP for quarantine_delay (one minute by
default). If we immediately replace the just-replaced node with
the same IP again, the bootstrap will fail since the gossiper
events are blocked for this IP, and we won't be able to
resolve an IP for the new host_id.

Another problem was that we called gossiper.remove_endpoint
method, which doesn't remove an endpoint from _endpoint_state_map,
only from live and unreachable lists. This means the IP
will keep circulating in the gossiper message exchange between cluster
nodes until full cluster restart.

This patch fixes both of these problems. First, we rely on
the fact that when topology coordinator moves the being_replaced
node to the left state, the IP of the replacing node is known to all nodes.
This means before removing an IP from the gossiper we can check if
this IP is currently used by another node in the current raft topology.
This is done by constructing the used_ips map based on normal and
transition nodes. This map is cached to avoid quadratic behaviour.

Second, we call gossiper.force_remove_endpoint, not
gossiper.remove_endpoint. This function removes an IP from
_endpoint_state_map, as well as from live and unreachable lists.

The tests for both of these improvements will be added in subsequent
commits.
2024-01-19 12:24:04 +04:00
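A minimal sketch of the used_ips check described in 30b2e5838c, written in Python for illustration and not taken from the patch. The `Node` type and the `ips_to_force_remove` function are hypothetical names, and a plain set stands in for the cached used_ips map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    host_id: str
    ip: str


def ips_to_force_remove(normal, transition, left):
    """Return IPs of left nodes that no normal or transition node still uses."""
    # Build the in-use IPs once so each left node costs one lookup
    # instead of a scan over every other node (avoids quadratic behaviour).
    used_ips = {n.ip for n in normal} | {n.ip for n in transition}
    return [n.ip for n in left if n.ip not in used_ips]


normal = [Node("h1", "10.0.0.1")]
# Hypothetical replacing node that reuses the IP of the node it replaces.
transition = [Node("h4", "10.0.0.3")]
left = [Node("h2", "10.0.0.2"), Node("h3", "10.0.0.3")]

# Only 10.0.0.2 is safe to drop; 10.0.0.3 now belongs to the replacing node.
print(ips_to_force_remove(normal, transition, left))
```

In this sketch the IP shared with the replacing node is never handed to the removal step, so it would not be quarantined.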
Asias He
d3efb3ab6f storage_service: Set session id for raft_rebuild
Raft rebuild is broken because the session id is not set.

The following was seen when running rebuild

stream_session - [Stream #8cfca940-afc9-11ee-b6f1-30b8f78c1451]
stream_transfer_task: Fail to send to 127.0.70.1:0:
seastar::rpc::remote_verb_error (Session not found:
00000000-0000-0000-0000-000000000000)

with raft topology, e.g.,

scylla --enable-repair-based-node-ops 0 --consistent-cluster-management true --experimental-features consistent-topology-changes

Fix by setting the session id.

Fixes #16741

Closes scylladb/scylladb#16814
2024-01-18 12:47:20 +02:00
Kamil Braun
52e67ca121 raft topology: increase level of some TRACE messages
Increased them to DEBUG level, and in one case to WARN (inside an
exception handler).

The selected messages are still relatively rare (per-node per-transition
control plane events, plus events such as fibers sleeping and waking up)
although more low level. They are also small messages. Messages that are
large such as those which print all tokens of nodes or large mutations
are left on TRACE level.

The plan is to enable DEBUG level logging in test.py tests for
raft_topology, while not spamming the logs completely such as by
printing large mutations.
2024-01-18 11:24:16 +01:00
Kamil Braun
92e6604127 raft topology: log when entering transition states
Those are rare control plane events, but might be useful when debugging
problems with topology coordinator (e.g. where it got stuck).
2024-01-18 11:24:15 +01:00
Kamil Braun
aeb53ea31d raft topology: don't include null ID in exclude_nodes
Observed with newly added logs:
```
raft topology - executing global topology command barrier_and_drain, excluded nodes: {00000000-0000-0000-0000-000000000000}
```
2024-01-18 11:24:15 +01:00
Kamil Braun
ae25f703c4 raft topology: INFO log when executing global commands and updating topology state
Those are rare control plane events, but useful for debugging, e.g. if
topology coordinator gets stuck at some point.
2024-01-18 11:24:15 +01:00
Kamil Braun
71957b4320 storage_service: separate logger for raft topology
Allows selectively enabling higher logging levels for just raft-topology
related things, without doing it for the entire storage_service (which
includes things like gossiper callbacks).

Also gets rid of the redundant "raft topology:" prefix which was also
not included everywhere.
2024-01-18 11:24:14 +01:00
Kefu Chai
0ae81446ef ./: not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16766
2024-01-17 16:30:14 +02:00
Kamil Braun
787b24cd24 Merge 'raft topology: join: shut down a node on error in response handler' from Patryk Jędrzejczak
If the joining node fails while handling the response from the
topology coordinator, it hangs even though it knows the join
operation has failed. Therefore, we ensure it shuts down in
this patch.

Additionally, we ensure that if the first join request response
was a rejection or the node failed while handling it, the
following acceptances by the (possibly different) coordinator
don't succeed. The node considers the join operation as failed.
We shouldn't add it to the cluster.

Fixes scylladb/scylladb#16333

Closes scylladb/scylladb#16650

* github.com:scylladb/scylladb:
  topology_coordinator: clarify warnings
  raft topology: join: allow only the first response to be a successful acceptance
  storage_service: join_node_response_handler: fix indentation
  raft topology: join: shut down a node on error in response handler
2024-01-17 14:55:26 +01:00
Botond Dénes
f22fc88a64 Merge 'Configure service levels interval' from Michał Jadwiszczak
The service level controller updates itself at a fixed interval. However, the interval is hardcoded in main to 10 seconds, which leads to long sleeps in some of the tests.

This patch moves this value to the `service_levels_interval_ms` command line option and sets it to 0.5s in cql-pytest.

Closes scylladb/scylladb#16394

* github.com:scylladb/scylladb:
  test:cql-pytest: change service levels intervals in tests
  configure service levels interval
2024-01-17 12:24:49 +02:00
Tomasz Grabiec
3d76aefb98 Merge "Enhance topology request status tracking" from Gleb
Currently, to figure out if a topology request is complete, a submitter
checks the topology state and tries to infer the status of the request
from it. This is not exact. Let's look at rebuild handling, for
instance. To figure out if the request is completed, the code waits for
the request object to disappear from the topology, but if another rebuild
starts between the end of the previous one and the code noticing that
it completed, the code will continue waiting for the next rebuild.
Another problem is that in case of operation failure there is no way to
pass an error back to the initiator.

This series solves those problems by assigning an id to each request and
tracking the status of each request in a separate table. The initiator
can query the request status from the table and see if the request was
completed successfully or if it failed with an error, which is also
available from the table.

The schema for the table is:

    CREATE TABLE system.topology_requests (
        id timeuuid PRIMARY KEY,

        initiating_host uuid,
        start_time timestamp,

        done boolean,
        error text,
        end_time timestamp
    );

and all entries have TTL of one month.
2024-01-17 00:37:19 +01:00
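The per-request tracking described in 3d76aefb98 can be sketched as follows. This is an illustrative in-memory model in Python, not the actual implementation: the `TopologyRequests` class and its `submit`, `finish` and `check` methods are hypothetical, and only the `done`, `error` and `initiating_host` fields mirror columns of the schema above.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestStatus:
    initiating_host: str
    done: bool = False
    error: Optional[str] = None


class TopologyRequests:
    """In-memory stand-in for a per-request status table."""

    def __init__(self):
        self._rows = {}

    def submit(self, initiating_host):
        # uuid1 is time-based, similar in spirit to a CQL timeuuid.
        request_id = uuid.uuid1()
        self._rows[request_id] = RequestStatus(initiating_host)
        return request_id

    def finish(self, request_id, error=None):
        row = self._rows[request_id]
        row.done = True
        row.error = error

    def check(self, request_id):
        """Return False while running, True on success; raise on failure."""
        row = self._rows[request_id]
        if not row.done:
            return False
        if row.error is not None:
            raise RuntimeError(row.error)
        return True


table = TopologyRequests()
first = table.submit("host-a")
second = table.submit("host-a")  # a second rebuild queued right behind the first
table.finish(first)
print(table.check(first), table.check(second))
```

Because each request has its own id, the completion of the first rebuild is visible even though a second one is already pending, which is the ambiguity the series sets out to remove.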
Gleb Natapov
9a7243d71a storage_service: topology coordinator: Consolidate some mutation builder code 2024-01-16 17:02:54 +02:00
Gleb Natapov
a145a73136 storage_service: topology coordinator: make topology operation rollback error more informative
Include an error which caused the rollback.
2024-01-16 17:02:54 +02:00
Gleb Natapov
bf91eb37f2 storage_service: topology coordinator: make topology operation cancellation error more informative
Include the list of nodes that were down when cancellation happened.
2024-01-16 17:02:54 +02:00
Gleb Natapov
8beb399b72 storage_service: topology coordinator: consolidate some code in cancel_all_requests
There is a code duplication that can be avoided.
2024-01-16 17:02:54 +02:00
Gleb Natapov
fba6877b3e storage_service: topology coordinator: TTL topology request table
To prevent the topology_requests table from growing, TTL all writes to
expire after a month.
2024-01-16 17:02:54 +02:00
Gleb Natapov
d576ed31dc storage_service: topology request: drop explicit shutdown rpc
Now that we have an explicit status for each request we may use it to
replace the shutdown notification rpc. During a decommission, in the
left_token_ring state, we set done to true after the metadata barrier
that waits for all requests to the decommissioning node to complete
and notify the decommissioning node with a regular barrier. At this
point the node will see that the request is complete and exit.
2024-01-16 17:02:54 +02:00
Gleb Natapov
84197ff735 storage_service: topology coordinator: check topology operation completion using status in topology_requests table
Instead of trying to guess if a request completed by looking into the
topology state (which can sometimes be error-prone), look at the
request status in the new topology_requests table. If the request failed,
report the reason for the failure from the table.
2024-01-16 17:02:54 +02:00
Kefu Chai
2dbf044b91 cql3: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16791
2024-01-16 16:43:17 +02:00
Gleb Natapov
1c18476385 storage_service: topology coordinator: update topology_requests table with requests progress
Make topology coordinator update request's status in topology_requests table as it changes.
2024-01-16 15:35:18 +02:00
Gleb Natapov
1ce1c5001d topology coordinator: add topology_requests table to group0 snapshot
Since the table is updated through raft's group0 state machine its
content needs to be part of the snapshot.
2024-01-16 13:57:27 +02:00
Gleb Natapov
584551f849 topology coordinator: add request_id to the topology state machine
Provide a unique ID for each topology request and store it in the topology
state machine. It will be used to index the new topology requests table in
order to retrieve request status.
2024-01-16 13:57:27 +02:00
Avi Kivity
5e70dd1dbe database: don't allow keyspace objects to be copied
keyspace objects are heavyweight and copies are immediately out-of-date,
so copying them is bad.

Fix by deleting the copy constructor and copy assignment operator. One
call site is fixed. This call site is safe since it's only used
for accessing a few attributes (introduced in f70c4127c6).

Closes scylladb/scylladb#16782
2024-01-15 21:48:32 +01:00
Kefu Chai
e5300f3e21 topology_state_machine: add formatter for service::cleanup_status
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define a formatter for service::cleanup_status,
and remove its operator<<().

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16778
2024-01-15 21:31:42 +02:00
Kefu Chai
ece2bd2f6e service: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16764
2024-01-15 13:29:33 +02:00
Gleb Natapov
ba7aa0d582 storage_service: topology coordinator: add error injection point to be able to pause the topology coordinator 2024-01-14 15:45:53 +02:00
Gleb Natapov
1afc891bd5 storage_service: topology coordinator: add logging to removenode and decommission
Add some useful logging to removenode and decommission to be used by
tests later.
2024-01-14 15:45:53 +02:00
Gleb Natapov
97ab3f6622 storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator
Introduce new REST API "/storage_service/cleanup_all"
that, when triggered, instructs the topology coordinator to initiate
cluster wide cleanup on all dirty nodes. It is done by introducing new
global command "global_topology_request::cleanup".
2024-01-14 15:45:53 +02:00
Gleb Natapov
0adb3904d8 storage_service: topology coordinator: manage cluster cleanup as part of the topology management
Sometimes it is unsafe to start a new topology operation before cleanup
runs on dirty nodes. This patch detects the situation when the topology
operation to be executed cannot be run safely until all dirty nodes do
cleanup and initiates the cleanup automatically. It also waits for
cleanup to complete before proceeding with the topology operation.

There can be a situation where a node that needs cleanup dies and will
never clear the flag. In this case, if a topology operation that wants to
run next does not have this node in its ignore node list, it may get stuck
forever. To fix this the patch also introduces "liveness aware"
request queue management: we do not simply choose _a_ request to run next,
but go over the queue and find requests that can proceed considering
the nodes' liveness situation. If there are multiple requests eligible to
run, the patch introduces an order based on the operation type: replace,
join, remove, leave, rebuild. The order is chosen so as not to trigger
cleanup needlessly.
2024-01-14 15:45:50 +02:00
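A rough Python sketch of the "liveness aware" selection described in 0adb3904d8. The `Request` type, the `next_request` function and the rule used to decide whether a request can proceed (every dead node must be on its ignore list) are simplifying assumptions for illustration; only the replace, join, remove, leave, rebuild order comes from the commit message.

```python
from dataclasses import dataclass, field

# Order from the commit message: earlier kinds are preferred.
PRIORITY = ["replace", "join", "remove", "leave", "rebuild"]


@dataclass
class Request:
    kind: str
    node: str
    ignored: frozenset = field(default_factory=frozenset)


def next_request(queue, dead_nodes):
    """Pick the runnable request with the most preferred kind, or None."""

    def can_proceed(req):
        # Assumed rule: a request may run only if it ignores every dead node.
        return dead_nodes <= req.ignored

    runnable = [r for r in queue if can_proceed(r)]
    # min() keeps queue order among requests of the same kind.
    return min(runnable, key=lambda r: PRIORITY.index(r.kind), default=None)


queue = [Request("rebuild", "n1"), Request("join", "n2"), Request("replace", "n3")]
print(next_request(queue, frozenset()).kind)
```

With a dead node that no queued request ignores, `next_request` returns `None` in this model, which corresponds to the case where the coordinator has nothing it can safely run.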
Gleb Natapov
c9b7bd5a33 storage_service: topology coordinator: provide a version of get_excluded_nodes that does not need node_to_work_on as a parameter
Needed by the next patch.
2024-01-14 14:44:07 +02:00
Gleb Natapov
067267ff76 storage_service: topology coordinator: make topology coordinator lifecycle subscriber
We want to change the coordinator to consider node liveness when
processing the topology operation queue. If there are not enough live
nodes to process any of the ops we want to cancel them. For that to work
we need to be able to kick the coordinator if the liveness situation
changes.
2024-01-14 14:44:07 +02:00
Gleb Natapov
f70c4127c6 storage_service: topology coordinator: introduce sstable cleanup fiber
Introduce a fiber that waits on a topology event and when it sees that
the node it runs on needs to perform sstable cleanup it initiates one
for each non tablet, non local table and resets "cleanup" flag back to
"clean" in the topology.
2024-01-14 14:44:07 +02:00
Gleb Natapov
5b246920ae storage_proxy: allow to wait for all ongoing writes
We want to be able to wait for all writes started through the storage
proxy before a fence is advanced. Add phased_barrier that is entered
on each local write operation before checking the fence to do so. A
write will be either tracked by the phased_barrier or fenced. This will
be needed to wait for all non fenced local writes to complete before
starting a cleanup.
2024-01-14 14:44:07 +02:00
Gleb Natapov
b2ba77978c storage_service: topology coordinator: mark nodes as needing cleanup when required
A cleanup needs to run when a node loses ownership of a range (during
bootstrap) or if a range movement to a normal node failed (removenode,
decommission failure). Mark all dirty nodes as "cleanup needed" in those cases.
2024-01-14 14:43:59 +02:00
Gleb Natapov
dbededb1a6 storage_service: add mark_nodes_as_cleanup_needed function
The function creates a mutation that sets cleanup to "needed" for each
normal node that, according to the erm, has data it does not own after
successful or unsuccessful topology operation.
2024-01-14 14:43:33 +02:00
Gleb Natapov
cc54796e23 raft topology: add cleanup state to the topology state machine
The patch adds cleanup state to the persistent and in-memory state and
handles the loading. The state can be "clean", which means no cleanup is
needed; "needed", which means the node is dirty and needs to run cleanup
at some point; or "running", which means that cleanup is being run by the
node right now and the state will be reset to "clean" when it completes.
2024-01-14 13:30:54 +02:00
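The three cleanup states from cc54796e23 can be modelled as a small enum. This Python sketch is illustrative only: the `CleanupStatus`, `advance` and `load` names are hypothetical, and the clean -> needed -> running cycle is an assumption, since the commit message only names the states and the reset from "running" back to "clean".

```python
from enum import Enum


class CleanupStatus(Enum):
    CLEAN = "clean"      # no cleanup needed
    NEEDED = "needed"    # node is dirty and must run cleanup at some point
    RUNNING = "running"  # cleanup is in progress on the node


# Assumed transition cycle; only running -> clean is stated in the commit.
NEXT = {
    CleanupStatus.CLEAN: CleanupStatus.NEEDED,
    CleanupStatus.NEEDED: CleanupStatus.RUNNING,
    CleanupStatus.RUNNING: CleanupStatus.CLEAN,
}


def advance(status):
    """Move to the next state in the assumed cycle."""
    return NEXT[status]


def load(persisted):
    """Map a persisted string back to the in-memory state."""
    return CleanupStatus(persisted)


print(advance(load("running")).value)
```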
Kamil Braun
4e18f8b453 Merge 'topology_state_load: stop waiting for IP-s' from Petr Gusev
The loop in the `id2ip` lambda causes problems if we are applying an old raft
log that contains long-gone nodes. In this case, we may never receive
the `IP` for a node and get stuck in the loop forever. In this series we
replace the loop with an if - we just don't update the `host_id <-> ip`
mapping in the `token_metadata.topology` if we don't have an `IP` yet.

The PR moves `host_id -> IP` resolution to the data plane, now it
happens each time the IP-based methods of `erm` are called. We need this
because IPs may not be known at the time the erm is built. The overhead
of `raft_address_map` lookup is added to each data plane request, but it
should be negligible. In this PR `erm/resolve_endpoints` continues to
treat a missing IP for `host_id` as `internal_error`, but we plan to relax
this in the follow-up (see the first comment on this PR).

Closes scylladb/scylladb#16639

* github.com:scylladb/scylladb:
  raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater
  gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
  storage_service: topology_state_load: remove IP waiting loop
  storage_service: sync_raft_topology_nodes: add target_node parameter
  storage_service: sync_raft_topology_nodes: move loops to the end
  storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node
  storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node
  storage_service: sync_raft_topology_nodes: move update_topology up
  storage_service: topology_state_load: remove clone_async/clear_gently overhead
  storage_service: fix indentation
  storage_service: extract sync_raft_topology_nodes
  storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata
  address_map: move gossiper subscription logic into storage_service
  topology_coordinator: exec_global_command: small refactor, use contains + reformat
  storage_service: wait_for_ip for new nodes
  storage_service.idl.hh: fix raft_topology_cmd.command declaration
  erm: for_each_natural_endpoint_until: use is_vnode == true
  erm: switch the internal data structures to host_id-s
  erm: has_pending_ranges: switch to host_id
2024-01-12 18:46:51 +01:00
Petr Gusev
e24bee545b raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater 2024-01-12 18:29:22 +04:00
Petr Gusev
6e7bbc94f4 gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
When a node changes its IP we need to store the mapping in
system.peers and update token_metadata.topology and erm
in-memory data structures.

The test_change_ip was improved to verify this new
behaviour. Before this patch the test didn't check
that IPs used for data requests are updated on
IP change. In this commit we add the read/write check.
It fails on insert with 'node unavailable'
error without the fix.
2024-01-12 18:28:57 +04:00
Petr Gusev
6d6e1ba8fb storage_service: topology_state_load: remove IP waiting loop
The loop causes problems if we are applying an old
raft log that contains long-gone nodes. In this case, we may
never receive the IP for a node and get stuck in the loop forever.

The idea of the patch is to replace the loop with an
if - we just don't update the host_id <-> ip mapping
in the token_metadata.topology if we don't have an IP yet.
When we get the mapping later, we'll call
sync_raft_topology_nodes again from
gossiper_state_change_subscriber_proxy.
2024-01-12 15:37:50 +04:00
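The "if instead of a loop" idea from 6d6e1ba8fb can be sketched like this. It is a simplified Python model, not the real function: `sync_topology_nodes` and its arguments are hypothetical, with plain dicts standing in for the raft address map and the `host_id <-> ip` mapping. The optional `target_node` argument mirrors the parameter described in 260874c860.

```python
def sync_topology_nodes(node_ids, address_map, host_to_ip, target_node=None):
    """Update host_id -> IP for nodes whose IP is known; return the skipped ids."""
    skipped = []
    # With target_node set, only that node is processed.
    to_process = [target_node] if target_node is not None else node_ids
    for host_id in to_process:
        ip = address_map.get(host_id)
        if ip is None:
            # No IP yet: skip instead of waiting, so a long-gone node
            # in an old raft log cannot block the caller forever.
            skipped.append(host_id)
            continue
        host_to_ip[host_id] = ip
    return skipped


address_map = {"h1": "10.0.0.1"}
host_to_ip = {}
print(sync_topology_nodes(["h1", "h2"], address_map, host_to_ip))

# Later the IP for h2 becomes known and only that node is synced again.
address_map["h2"] = "10.0.0.2"
print(sync_topology_nodes(["h1", "h2"], address_map, host_to_ip, target_node="h2"))
```

The second call stands in for the later re-sync that the commit says is triggered once the mapping arrives.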
Petr Gusev
260874c860 storage_service: sync_raft_topology_nodes: add target_node parameter
If it's set, instead of going over all the nodes in raft topology,
the function will update only the specified node. This parameter
will be used in the next commit, in the call to sync_raft_topology_nodes
from gossiper_state_change_subscriber_proxy.
2024-01-12 15:37:50 +04:00
Petr Gusev
a9d58c3db5 storage_service: sync_raft_topology_nodes: move loops to the end 2024-01-12 15:37:50 +04:00
Petr Gusev
d1bce3651b storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node 2024-01-12 15:37:50 +04:00