scylla/service
Petr Gusev 30b2e5838c storage_service: sync_raft_topology_nodes: force_remove_endpoint for left nodes only if an IP is not used by other nodes
Before this patch we called gossiper.remove_endpoint for the IPs
of the left nodes. The problem is that in the replace-with-same-ip
scenario we called gossiper.remove_endpoint for an IP that is
already used by the new, replacing node. The gossiper.remove_endpoint
method puts the IP into quarantine, which means the gossiper will
ignore all events about this IP for quarantine_delay (one minute by
default). If we then immediately replace the just-replaced node with
the same IP again, the bootstrap will fail because the gossiper
events are blocked for this IP, and we won't be able to
resolve an IP for the new host_id.
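
The quarantine effect described above can be illustrated with a toy
model. This is only a sketch, not the gossiper's code: toy_quarantine,
accepts_event and the explicit `now` parameter are hypothetical names
introduced for illustration, and only the "ignore events for
quarantine_delay after remove_endpoint" behaviour is modelled.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

using clock_type = std::chrono::steady_clock;

// Toy model (hypothetical): remembers when an IP was removed and
// rejects events about it until the quarantine delay has elapsed.
class toy_quarantine {
    std::chrono::seconds _delay;
    std::unordered_map<std::string, clock_type::time_point> _quarantined_at;
public:
    explicit toy_quarantine(std::chrono::seconds delay) : _delay(delay) {}

    // Models the side effect of remove_endpoint: the IP enters quarantine.
    void remove_endpoint(const std::string& ip, clock_type::time_point now) {
        _quarantined_at[ip] = now;
    }

    // Events are accepted for IPs that were never quarantined, or whose
    // quarantine started at least _delay ago.
    bool accepts_event(const std::string& ip, clock_type::time_point now) const {
        auto it = _quarantined_at.find(ip);
        return it == _quarantined_at.end() || now - it->second >= _delay;
    }
};
```

In this model, a replacing node that reuses the quarantined IP inside
the delay window has its events dropped, which is the failure mode the
paragraph above describes.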

Another problem was that we called the gossiper.remove_endpoint
method, which doesn't remove an endpoint from _endpoint_state_map,
only from the live and unreachable lists. This means the IP
will keep circulating in the gossiper message exchange between cluster
nodes until a full cluster restart.

This patch fixes both of these problems. First, we rely on
the fact that when the topology coordinator moves the being_replaced
node to the left state, the IP of the replacing node is known to all nodes.
This means before removing an IP from the gossiper we can check if
this IP is currently used by another node in the current raft topology.
This is done by constructing the used_ips map based on normal and
transition nodes. This map is cached to avoid quadratic behaviour.
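
A minimal sketch of the check described above, under stated
assumptions: topology_view, build_used_ips and ips_to_force_remove are
hypothetical names, and host IDs and IPs are plain strings. The actual
types and code in storage_service are not shown here and will differ.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

using host_id = std::string;     // hypothetical stand-in
using ip_address = std::string;  // hypothetical stand-in

// Hypothetical view of the raft topology: host_id -> IP per node group.
struct topology_view {
    std::unordered_map<host_id, ip_address> normal_nodes;
    std::unordered_map<host_id, ip_address> transition_nodes;
    std::unordered_map<host_id, ip_address> left_nodes;
};

// Build the IP -> host_id map from normal and transition nodes.
std::unordered_map<ip_address, host_id> build_used_ips(const topology_view& t) {
    std::unordered_map<ip_address, host_id> used;
    for (const auto& entry : t.normal_nodes) {
        used[entry.second] = entry.first;
    }
    for (const auto& entry : t.transition_nodes) {
        used[entry.second] = entry.first;
    }
    return used;
}

// Return the IPs of left nodes that no normal or transition node uses.
// The map is built once and reused for every left node, so the scan is
// not repeated per node.
std::vector<ip_address> ips_to_force_remove(const topology_view& t) {
    const auto used_ips = build_used_ips(t);
    std::vector<ip_address> result;
    for (const auto& entry : t.left_nodes) {
        if (used_ips.count(entry.second) == 0) {
            result.push_back(entry.second);
        }
    }
    return result;
}
```

With a left node and a normal node sharing one IP, that IP is not
returned, so the replacing node's address stays in the gossiper.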

Second, we call gossiper.force_remove_endpoint, not
gossiper.remove_endpoint. This function removes an IP from
_endpoint_state_map, as well as from the live and unreachable lists.
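
The difference between the two calls, as this message describes it, can
be pictured with a toy structure. This is an illustrative sketch only:
toy_gossiper is a hypothetical type, the endpoint state is reduced to
an int, and quarantine handling is deliberately left out.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <unordered_map>

// Toy model (hypothetical) of the three containers named in the message.
struct toy_gossiper {
    std::unordered_map<std::string, int> endpoint_state_map; // ip -> opaque state
    std::set<std::string> live;
    std::set<std::string> unreachable;

    // Drops the IP from the live and unreachable lists only;
    // the entry in endpoint_state_map stays.
    void remove_endpoint(const std::string& ip) {
        live.erase(ip);
        unreachable.erase(ip);
    }

    // Drops the IP from the lists and from endpoint_state_map.
    void force_remove_endpoint(const std::string& ip) {
        live.erase(ip);
        unreachable.erase(ip);
        endpoint_state_map.erase(ip);
    }
};
```

In this model, an IP that survives in endpoint_state_map after
remove_endpoint is what would keep circulating in gossip exchanges.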

The tests for both of these improvements will be added in subsequent
commits.
2024-01-19 12:24:04 +04:00