scylla/service
Kamil Braun 03ecc8457c Merge 'raft topology: reject replace if the node being replaced is not dead' from Patryk Jędrzejczak
The replace operation is defined to succeed only if the node being
replaced is dead. We should reject this operation when the failure
detector considers the node being replaced alive.
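
As a rough illustration of the rule above, the check can be modelled as follows. This is a minimal sketch, not the actual coordinator code: `LivenessView` and `check_replace_request` are hypothetical names standing in for the failure detector and the validation step.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class LivenessView:
    """Hypothetical stand-in for the failure detector's view of the cluster."""
    alive_hosts: Set[str] = field(default_factory=set)

    def is_alive(self, host_id: str) -> bool:
        return host_id in self.alive_hosts


def check_replace_request(view: LivenessView, replaced_host_id: str) -> Optional[str]:
    """Return a rejection reason, or None if the replace may proceed."""
    # The replace is only valid when the node being replaced is seen as dead.
    if view.is_alive(replaced_host_id):
        return f"cannot replace {replaced_host_id}: the node is still alive"
    return None
```

A request naming a host that is in `alive_hosts` yields a rejection message; any other host yields `None`.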

In addition to this change, this PR adds a test case,
`test_replacing_alive_node_fails`, that verifies it. A few testing
framework adjustments were necessary to implement this test and
to avoid flakiness in other tests that use the replace operation after
the change. From now on, we need to ensure that all nodes see the
node being replaced as dead before starting the replace. Otherwise,
the check added in this PR could reject the replace.
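
The "wait until every node sees the target as dead" step could look roughly like the polling helper below. It is a sketch under assumptions: `wait_until_seen_dead` is a hypothetical name, and each node's view is modelled as a callable returning the set of hosts it considers alive, not the test framework's real API.

```python
import time
from typing import Callable, Iterable, Set


def wait_until_seen_dead(
    alive_views: Iterable[Callable[[], Set[str]]],
    target: str,
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.05,
) -> bool:
    """Poll until no view lists `target` as alive; return False on timeout."""
    views = list(alive_views)
    deadline = time.monotonic() + timeout_s
    while True:
        # Every node must agree that the target is dead before we proceed.
        if all(target not in view() for view in views):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A lower failure detector timeout shortens how long such a wait takes, which is presumably why the replace tests decrease `failure_detector_timeout_in_ms`.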

Additionally, this PR changes the replace procedure so that,
if the replacing node reuses the IP of the node being replaced, other
nodes can see it as alive only after the topology coordinator accepts
its join request. Without this delay, the replacing node could become
alive before the topology coordinator checks whether the node being
replaced is dead. If that happened with a reused IP, the topology
coordinator could not know which of these two nodes is alive and
whether it should reject the join request.
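
The ambiguity can be seen in a toy model where liveness is tracked per IP address. This is an assumption made for illustration only; `liveness_by_ip` and the node dictionaries are hypothetical.

```python
from typing import Dict, List


def liveness_by_ip(nodes: List[dict]) -> Dict[str, bool]:
    """Collapse per-node liveness into a per-IP view (illustrative model)."""
    view: Dict[str, bool] = {}
    for node in nodes:
        # An IP counts as alive if any node using it is alive.
        view[node["ip"]] = view.get(node["ip"], False) or node["alive"]
    return view


# Case A: the old node is dead, the replacing node (same IP) is already up.
case_a = liveness_by_ip([
    {"name": "old", "ip": "10.0.0.3", "alive": False},
    {"name": "new", "ip": "10.0.0.3", "alive": True},
])

# Case B: the old node is still alive, the replacing node has not started.
case_b = liveness_by_ip([
    {"name": "old", "ip": "10.0.0.3", "alive": True},
    {"name": "new", "ip": "10.0.0.3", "alive": False},
])

# Both cases produce the same per-IP view, so they cannot be told apart.
print(case_a == case_b)
```

Keeping the replacing node from being advertised as alive until its join request is accepted removes case A, so an alive IP can only mean the node being replaced.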

Fixes #15863

Closes scylladb/scylladb#15926

* github.com:scylladb/scylladb:
  test: add test_replacing_alive_node_fails
  raft topology: reject replace if the node being replaced is not dead
  raft topology: add the gossiper ref to topology_coordinator
  test: test_cluster_features: stop gracefully before replace
  test: decrease failure_detector_timeout_in_ms in replace tests
  test: move test_replace to topology_custom
  test: server_add: wait until the node being replaced is dead
  test: server_add: add support for expected errors
  raft topology: join: delay advertising replacing node if it reuses IP
  raft topology: join: fix a condition in validate_joining_node
2023-11-23 10:31:59 +01:00