This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.
We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
It's a bug to use get_host_id on a non-existent endpoint,
so on_internal_error is more appropriate. Also, it's
easier to debug since it provides a backtrace.
If a missing inet_address is expected, get_host_id_if_known
should be used instead. We update one such case in
storage_service::force_remove_completion. Other
usages of get_host_id are correct.
In the next commits token_metadata will be
made a template with NodeId=inet_address|host_id
parameter. This parameter will be passed to dc_rack_fn
function, so it also should be made a template.
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.
Also will be used in tests to ensure tablets stick to expected locations.
Prevents stale streaming operation from running beyond topology
operation they were started in. After the session field is cleared, or
changed to something else, the old topology_guard used by streaming is
interrupted and fenced and the next barrier will join with any
remaining work.
Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors.
Refs: https://github.com/scylladb/scylladb/issues/16255Closesscylladb/scylladb#16289
* github.com:scylladb/scylladb:
Update unified/build_unified.sh
Update main.cc
Update dist/common/scripts/scylla-housekeeping
Typos: fix typos in code
utils::fb_utilities is a global in-memory registry for storing and retrieving broadcast_address and broadcat_rpc_address.
As part of the effort to get rid of all global state, this series gets rid of fb_utilities.
This will eventually allow e.g. cql_test_env to instantiate multiple scylla server nodes, each serving on its own address.
Closesscylladb/scylladb#16250
* github.com:scylladb/scylladb:
treewide: get rid of now unused fb_utilities
tracing: use locator::topology rather than fb_utilities
streaming: use locator::topology rather than fb_utilities
raft: use locator::topology/messaging rather than fb_utilities
storage_service: use locator::topology rather than fb_utilities
storage_proxy: use locator::topology rather than fb_utilities
service_level_controller: use locator::topology rather than fb_utilities
misc_services: use locator::topology rather than fb_utilities
migration_manager: use messaging rather than fb_utilities
forward_service: use messaging rather than fb_utilities
messaging_service: accept broadcast_addr in config rather than via fb_utilities
messaging_service: move listen_address and port getters inline
test: manual: modernize message test
table: use gossiper rather than fb_utilities
repair: use locator::topology rather than fb_utilities
dht/range_streamer: use locator::topology rather than fb_utilities
db/view: use locator::topology rather than fb_utilities
database: use locator::topology rather than fb_utilities
db/system_keyspace: use topology via db rather than fb_utilities
db/system_keyspace: save_local_info: get broadcast addresses from caller
db/hints/manager: use locator::topology rather than fb_utilities
db/consistency_level: use locator::topology rather than fb_utilities
api: use locator::topology rather than fb_utilities
alternator: ttl: use locator::topology rather than fb_utilities
gossiper: use locator::topology rather than fb_utilities
gossiper: add get_this_endpoint_state_ptr
test: lib: cql_test_env: pass broadcast_address in cql_test_config
init: get_seeds_from_db_config: accept broadcast_address
locator: replication strategies: use locator::topology rather than fb_utilities
locator: topology: add helpers to retrieve this host_id and address
snitch: pass broadcast_address in snitch_config
snitch: add optional get_broadcast_address method
locator: ec2_multi_region_snitch: keep local public address as member
ec2_multi_region_snitch: reindent load_config
ec2_multi_region_snitch: coroutinize load_config
ec2_snitch: reindent load_config
ec2_snitch: coroutinize load_config
thrift: thrift_validation: use std::numeric_limits rather than fb_utilities
topology_guard is used to track distributed operations started by the
topology change coordinator, e.g. streaming, to make sure that those
operations have no side effects after topology change coordinator
moved to the next migration stage, of a given tablet or of the whole
ring.
topology_guard can be sent over the wire in the form of
frozen_topology_guard. It can be materialized again on the other
side. While in transit, it doesn't block the coordinator barriers. But
if the coordinator moved on, materialization of the guard will
fail. So tracking safety is preserved.
In this patch, the guard implementation is based on tracking work
under global sessions, but the concept is flexible and other
mechanisms can be used without changing user code.
Storage service uses group0 internally, but group0 is create long after
storage service is initialized and passed to it using ss::set_group0()
function. What it means is that during shutdown group0 is destroyed
before ss::stop() is called and thus storage service is left with a
dangling reference. Fix it by introducing a function that cancels all
group0 operations and waits for background fibers to complete. For that
we need separate abort source for group0 operation which the patch
series also introduces.
* 'gleb/group0-ss-shutdown' of github.com:scylladb/scylla-dev:
storage_service: topology coordinator: ignore abort_requested_exception in background fibers
storage_service: fix de-initialization order between storage service and group0_service
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Storage service uses group0 internally, but group0 is create long after
storage service is initialized and passed to it using ss::set_group0()
function. But what it means is that during shutdown group0 is destroyed
before ss::stop() is called and thus storage service is left with a
dangling reference. Fix it by introducing a function that cancels all
group0 operations and waits for background fibers to complete. For that
we need separate abort source for group0 operation which the patch also
introduces.
This series adds handling for more failures during a topology operation
(we already handle a failure during streaming). Here we add handling of
tablet draining errors by aborting the operation and handling of errors
after streaming where an operation cannot be aborted any longer. If the
error happens when rollback is no longer possible we wait for ring delay
and proceed to the next step. Each individual patch that adds the sleep
has an explanation what the consequences of the patch are.
* 'gleb/topology-coordinator-failures' of github.com:scylladb/scylla-dev:
test: add test to check errro handling during tablet draining
test: fix test_topology_streaming_failure test to not grep the whole file
storage_service: add error injection into the tablet migration code
storage_service: topology coordinator: rollback on handle_tablet_migration failure during tablet_draining stage
storage_service: topology coordinator: do not retry the metadata barrier forever in write_both_read_new state
storage_service: topology coordinator: do not retry the metadata barrier forever in left_token_ring state
storage_service: topology coordinator: return a node that is being removed from get_excluded_nodes
storage_service: topology_coordinator: use new rollback_to_normal state in the rollback procedure
storage_service: topology coordinator: add rollback_to_normal node state
storage_service: topology coordinator: put fence version into the raft state
storage_service: topology coordinator: do fencing even if draining failed
This bug manifested as delays in DDL statement execution, which had to
wait until streaming is finished so that the topology change
coordinator releases the guard.
The reason is that topology change coordinator didn't release the
group0 guard if there is no work to do with active migrations, and
awaits the condition variable without leaving the scope.
Fixes#16182Closesscylladb/scylladb#16183
During remove or decommission as a first step tables are drained from
the leaving node. Theoretically this step may fail. Rollback the
topology operation if it happen. Since some tables may stay in migration
state the topology needs to go to the tablet_migration state. Lets do it
always since it should be save to do it even if there is no on going
tablet migrations.
Handle the barrier failure by sleeping for a "ring delay" and
continuing. The purpose of the barrier is to wait for all reads to
old replica set to complete and fence the remaining requests. If the
barrier fails we give the fence some time to propagate and continue with
the topology change. Of fence did not propagate we may have stale reads,
but this is not worse that we have with gossiper.
Handle the barrier failure by sleeping for a "ring delay" and
continuing. The purpose of the barrier is to wait for unfinished writes
to decommissioned node complete. If barrier fails we give them some time
to complete and then proceed with node decommission. The worse thing
that may happen if some write will fail because the node will be
shutdown.
Go through the rollback_to_normal state when the node needs to move to
normal during the rollback and update fence in this state before moving
the node to normal. This guaranties that the fence update will not
be missed. Not that when a node moves to left state it already passes
through left_token_ring which guaranties the same.
When a topology coordinator rolls back from unsuccessful topology operation it
advances the fence (which is now in the raft state) after moving to normal
state. We do not want this to fail (only majority of nodes is needed for
it to not to), but currently it may fail in case the coordinator moves
to another node after changing the rollback node's state to normal, but
before updating the fence. To solve that the rollback operation needs to
go through a new rollback_to_normal state that will do the fencing
before moving to normal. This patch introduces that state, but does not use
it yet.
The replace operation is defined to succeed only if the node being
replaced is dead. We should reject this operation when the failure
detector considers the node being replaced alive.
Apart from adding this change, this PR adds a test case -
`test_replacing_alive_node_fails` - that verifies it. A few testing
framework adjustments were necessary to implement this test and
to avoid flakiness in other tests that use the replace operation after
the change. From now, we need to ensure that all nodes see the
node being replaced as dead before starting the replace. Otherwise,
the check added in this PR could reject the replace.
Additionally, this PR changes the replace procedure in a way that
if the replacing node reuses the IP of the node being replaced, other
nodes can see it as alive only after the topology coordinator accepts
its join request. The replacing node may become alive before the
topology coordinator checks if the node being replaced is dead. If
that happens and the replacing node reuses the IP of the node being
replaced, the topology coordinator cannot know which of these two
nodes is alive and whether it should reject the join request.
Fixes#15863Closesscylladb/scylladb#15926
* github.com:scylladb/scylladb:
test: add test_replacing_alive_node_fails
raft topology: reject replace if the node being replaced is not dead
raft topology: add the gossiper ref to topology_coordinator
test: test_cluster_features: stop gracefully before replace
test: decrease failure_detector_timeout_in_ms in replace tests
test: move test_replace to topology_custom
test: server_add: wait until the node being replaced is dead
test: server_add: add support for expected errors
raft topology: join: delay advertising replacing node if it reuses IP
raft topology: join: fix a condition in validate_joining_node
The replace operation is defined to succeed only if the node being
replaced is dead. We should reject this operation when the failure
detector considers the node being replaced alive.
After this change, other nodes can see the replacing node as alive
only after the topology coordinator accepts its join request.
In the following commits, we make the topology coordinator reject
join requests if the node being replaced is considered alive by the
gossiper. However, the replacing node may become alive before the
topology coordinator does the validation. If the replacing node
reuses the IP of the node being replaced, the topology coordinator
cannot know which of these two nodes is alive and whether it should
reject the join request.
The gossiper-based topology also delays the replacing node from
advertising itself if it reuses the IP. To achieve the same effect
in raft-based topology, we only need to move the definition of
replacing_a_node_with_same_ip. However, there is a code that puts
bootstrap tokens of the node being replaced into the gossiper
state, and it depends on replacing_a_node_with_same_ip and
replacing_a_node_with_diff_ip being always false in the raft-based
topology mode. We prevent it from breaking by changing the
condition.
All the services that need to register RPC handlers do it in service
constructor or .start() method. Unregistration happens in .stop().
Storage service explicitly (de)initializes its RPC handlers in dedicated
calls, but there's no point in that. The handlers' accessibility is
determined by messaging service start_lister/shutdown, handlers
themselves can be registered any time before it and unregistered any
time after it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The handlers are about to be initialized from inside storage_service
constructor. At that time container() is not yet available and its
invalid to capture it on handlers' lambda. Fortunately, there's only one
handler that does it, other handlers capture 'this' and call container()
explicitly. This patch fixes the remaining one to do the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main goal here is to drop sys.dist.ks argument from the
init_messaging_service call to make future patching simpler. While doing
it it turned out that the argument was needed to be passed all the way
down to the mark_existing_views_as_built(), so this patch also dropes
this argument from this whole call-trace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This effectively reverts bc051387c5 (storage_service: Remove sys_dist_ks
from storage_service dependencies) since now storage service needs the
sys. disk. ks not only cluster join time. Next patch will make more use
of it as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now set via a dedicated call that happens after query processor is
started. Now query processor is started before storage service and the
latter can get the q.p. local reference via constructor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently when the coordinator decides to move the fence it issues an
RPC to each node and each node locally advances fence version. This is
fine if there are no failures or failures are handled by retrying
fencing, but if we want to allow topology changes to progress even in
the presence of barrier failures it is easier to store the fence version
in the raft state. The nodes that missed fence rpc may easily catch up
to the latest fence version by simply executing a raft barrier.
Token metadata barrier consists for two steps. First old request are
drained and then requests that are not drained are fenced. But currently
if draining fails then fencing is note done. This is fine if the
barrier's failure handled by retrying, but we when to start handling
errors differently. In fact during topology operation rollback we
already do not retry failed barrier.
The patch fixes the metadata barrier to do fencing even if draining
failed.
When a node tries to join the cluster, it asks the topology
coordinator to add them and then waits for the response. The
response is not guaranteed to come back. If the topology
coordinator cannot contact the joining node, it moves the node to
the left state and moves on.
Currently, to handle the case when the response does not come back,
the joining node gives up waiting for it after 3 minutes. However,
it might take more time for the topology coordinator to start
handling the request to join, as it might be working on other tasks
like adding other nodes, performing tablet migrations, etc. In
general, any timeout duration would be unreliable.
Therefore, we get rid of the timeout. From now on, the operator
will be responsible for shutting down the node if the topology
coordinator fails to deliver the rejection.
Additionally, after removing the timeout, we adjust the topology
coordinator. We make it try sending the response (both acceptance
and rejection) only once since we do not care if it fails anymore. We
only need to ensure that the joining node is moved to the left state
if sending fails.
Fixes#15865Closesscylladb/scylladb#15944
* github.com:scylladb/scylladb:
raft topology: fix indentation
raft topology: join: try sending the response only once
raft topology: join: do not time out waiting for the node to be joined
group 0: group0_handshaker: add the abort_source parameter to post_server_start
Remove `fall_back_to_syn_msg` which is not necessary in newer Scylla versions.
Fix the calculation of `nodes_down` which could count a single node multiple times.
Make shadow round mandatory during bootstrap and replace -- these operations are unsafe to do without checking features first, which are obtained during the shadow round (outside raft-topology mode).
Finally, during node restart, allow the shadow round to be skipped when getting `timeout_error`s from contact points, not only when getting `closed_error`s (during restart it's best-effort anyway, and in general it's impossible to distinguish between a dead node and a partitioned node).
More details in commit messages.
Ref: https://github.com/scylladb/scylladb/issues/15675Closesscylladb/scylladb#15941
* github.com:scylladb/scylladb:
gossiper: do_shadow_round: increment `nodes_down` in case of timeout
gossiper: do_shadow_round: fix `nodes_down` calculation
storage_service: make shadow round mandatory during bootstrap/replace
gossiper: do_shadow_round: remove default value for nodes param
gossiper: do_shadow_round: remove `fall_back_to_syn_msg`
When a node tries to join the cluster, it asks the topology
coordinator to add them and then waits for the response.
In the previous commit, we have made the operator responsible for
shutting down the joining node if the topology coordinator fails
to deliver a response by removing the timeout. In this commit, we
adjust the topology coordinator. We make it try sending the
response (both acceptance and rejection) only once since we do not
care if it fails anymore. We only need to ensure that the joining
node is moved to the left state if sending fails.
When a node tries to join the cluster, it asks the topology
coordinator to add them and then waits for the response. The
response is not guaranteed to come back. If the topology
coordinator cannot contact the joining node, it moves the node to
the left state and moves on.
Currently, to handle the case when the response does not come back,
the joining node gives up waiting for it after 3 minutes. However,
it might take more time for the topology coordinator to start
handling the request to join, as it might be working on other tasks
like adding other nodes, performing tablet migrations, etc. In
general, any timeout duration would be unreliable.
Therefore, we get rid of the timeout. From now on, the operator
will be responsible for shutting down the node if the topology
coordinator fails to deliver the rejection.
This change additionally fixes the TODO in
raft_group0::join_group0.