test: unflake test_topology_remove_garbage_group0

The test boots nodes and then immediately starts shutting them down and
removing them from the cluster. This may happen before the driver
manages to connect to all nodes in the cluster; in particular, the
driver may not yet have connected to the last bootstrapped node. It can
even happen that the driver has connected, but its control connection is
established to the first node, and the driver fetched the topology from
the first node at a time when that node didn't yet consider the last
node to be NORMAL. The driver then decides to close its connection to
the last node, like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
   peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```

Eventually, at the end of the test, only the last node remains; all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.

Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.
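
The two conditions above are both "poll until true, or fail at a
deadline" waits. A minimal sketch of that pattern follows; `wait_until`,
`demo` and the simulated node list are hypothetical names made up for
illustration, not the helpers from `test.pylib.util` or
`test.topology.util`, whose implementations are not shown in this commit:
```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def wait_until(pred: Callable[[], Awaitable[Optional[T]]],
                     deadline: float, period: float = 0.1) -> T:
    """Poll `pred` until it returns a non-None value.

    `deadline` is an absolute time.time() value, mirroring the
    `time.time() + 60` arguments used in the test. Hypothetical helper.
    """
    while True:
        result = await pred()
        if result is not None:
            return result
        if time.time() >= deadline:
            raise TimeoutError("condition not met before deadline")
        await asyncio.sleep(period)


async def demo() -> list[str]:
    # Simulated cluster (an assumption for this sketch): one more node
    # becomes visible to the client on every poll.
    expected = ["n1", "n2", "n3", "n4"]
    visible: list[str] = []

    async def all_nodes_visible() -> Optional[list[str]]:
        if len(visible) < len(expected):
            visible.append(expected[len(visible)])
        # Report success only once every expected node is visible.
        return list(visible) if visible == expected else None

    return await wait_until(all_nodes_visible, time.time() + 5, period=0.01)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```
Passing an absolute deadline rather than a timeout lets several
consecutive waits share one time budget if desired.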

Fixes scylladb/scylladb#16373

(cherry picked from commit a68701ed4fe1a0848044f2d692342810ef868dae)

Closes scylladb/scylladb#17702
Kamil Braun
2024-03-07 13:04:23 +01:00
parent 02182caff4
commit 6e01e821d7

@@ -9,8 +9,10 @@ Test removenode with node with node no longer member
 import logging
 from test.pylib.manager_client import ManagerClient
 from test.pylib.rest_client import inject_error_one_shot
+from test.pylib.util import wait_for_cql_and_get_hosts
 from test.topology.util import get_token_ring_host_ids, get_current_group0_config, \
-                               check_token_ring_and_group0_consistency
+                               check_token_ring_and_group0_consistency, wait_for_token_ring_and_group0_consistency
+import time
 import pytest
@@ -27,6 +29,15 @@ async def test_remove_garbage_group0_members(manager: ManagerClient):
     cfg = {'enable_user_defined_functions': False,
            'experimental_features': list[str]()}
     servers = [await manager.server_add(config=cfg) for _ in range(4)]
+    # Make sure that the driver has connected to all nodes, and they see each other as NORMAL
+    # (otherwise the driver may remove connection to some host, even after it manages to connect to it,
+    # because the node that it has control connection to considers that host as not NORMAL yet).
+    # This ensures that after we stop/remove some nodes in the test, the driver will still
+    # be able to connect to the remaining nodes. See scylladb/scylladb#16373
+    await wait_for_token_ring_and_group0_consistency(manager, time.time() + 60)
+    await wait_for_cql_and_get_hosts(manager.get_cql(), servers, time.time() + 60)
     removed_host_id = await manager.get_host_id(servers[0].server_id)
     await manager.server_stop_gracefully(servers[0].server_id)