raft topology: add error for removal of non-normal nodes

In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.

This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#20271
(cherry picked from commit b25b8dccbd98ee5af319849211ef7de3ffc95d86)

Closes scylladb/scylladb#20801
This commit is contained in:
Abhinav
2024-09-17 14:36:03 +05:30
committed by Kamil Braun
parent c75d58aef5
commit d8b66cf6ef
2 changed files with 21 additions and 0 deletions

View File

@@ -3926,6 +3926,11 @@ future<> storage_service::raft_removenode(locator::host_id host_id, std::list<lo
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, ::format("removenode: request remove for {}", id));
request_id = guard.new_group0_state_id();
if (auto itr = _topology_state_machine._topology.requests.find(id);
itr != _topology_state_machine._topology.requests.end() && itr->second == topology_request::remove) {
throw std::runtime_error("Removenode failed. Concurrent request for removal already in progress");
}
try {
// Make non voter during request submission for better HA
co_await _group0->make_nonvoters(ignored_ids, _group0_as, raft_timeout{});

View File

@@ -135,3 +135,19 @@ async def test_rebuild_node(manager: ManagerClient, random_tables: RandomTables)
servers = await manager.running_servers()
await manager.rebuild_node(servers[0].server_id)
await check_token_ring_and_group0_consistency(manager)
@pytest.mark.asyncio
async def test_concurrent_removenode(manager: ManagerClient):
servers = await manager.running_servers()
assert len(servers) >= 3
await manager.server_stop_gracefully(servers[2].server_id)
try:
await asyncio.gather(*[manager.remove_node(servers[0].server_id, servers[2].server_id),
manager.remove_node(servers[1].server_id, servers[2].server_id)])
except Exception as e:
logger.info(f"exception raised due to concurrent remove node requests: {e}")
else:
raise Exception("concurrent removenode request should result in a failure, but unexpectedly succeeded")