test/auth_cluster: make test_service_level_metric_name_change useful

The test test_service_level_metric_name_change was originally introduced
to serve as a regression test for scylladb/scylla-enterprise#4912.
Before the fix, some per-scheduling-group metrics would not get adjusted
when the scheduling group gets renamed (which does happen for SL-managed
scheduling groups) and it would be possible to attempt to register
metrics with the same set of labels, resulting in an error.

However, in scylladb/scylla-enterprise#4764, another bug was fixed which
affected the test. Before a service level is created, a "test"
scheduling group can be created by service level controller if it is
unsure whether it is allowed to create more scheduling groups or not. If
creation of the scheduling group succeeds, it is put into the pool of
scheduling groups to be reused when a new service level is created.
Therefore, the node handling CREATE SERVICE LEVEL would always use the
scheduling group that was originally created for the sake of the test as
a SG for the new service level.

All of the above is intentional and was actually fixed by the
aforementioned issue. However, the test scheduling groups would always
get unique names and, therefore, the error would no longer reproduce.
However, the faulty logic that ran previously and caused the bug still
runs - when a node updates its service levels cache on group0 reload.

The test previously used only one node. Fix it by starting two nodes
instead of one at the beginning of the test and by serving all service
level commands to the first node - were the issue not fixed, the error
would get triggered on the second node.
This commit is contained in:
Piotr Dulikowski
2025-01-20 17:26:58 +01:00
parent de153a2ba7
commit 780ff17ff5

View File

@@ -445,21 +445,27 @@ async def test_service_levels_over_limit(manager: ManagerClient):
# Reproduces issue scylla-enterprise#4912
@pytest.mark.asyncio
async def test_service_level_metric_name_change(manager: ManagerClient) -> None:
s = await manager.server_add()
await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)
servers = await manager.servers_add(2)
s = servers[0]
cql = manager.get_cql()
[h] = await wait_for_cql_and_get_hosts(cql, [s], time.time() + 60)
await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)
sl1 = unique_name()
sl2 = unique_name()
# All service level commands need to run on the first node. It is the logic
# that is exercised during service level data reload from group0 which is
# prone to name reuse and, hence, could trigger the error fixed by 4912.
# creates scheduling group `sl:sl1`
await cql.run_async(f"CREATE SERVICE LEVEL {sl1}")
await cql.run_async(f"CREATE SERVICE LEVEL {sl1}", host=h)
# renames scheduling group `sl:sl1` to `sl_deleted:sl1`
await cql.run_async(f"DROP SERVICE LEVEL {sl1}")
await cql.run_async(f"DROP SERVICE LEVEL {sl1}", host=h)
# renames scheduling group `sl_deleted:sl1` to `sl:sl2`
await cql.run_async(f"CREATE SERVICE LEVEL {sl2}")
await cql.run_async(f"CREATE SERVICE LEVEL {sl2}", host=h)
# creates scheduling group `sl:sl1`
await cql.run_async(f"CREATE SERVICE LEVEL {sl1}")
await cql.run_async(f"CREATE SERVICE LEVEL {sl1}", host=h)
# In issue #4912, service_level_controller thought there was no room
# for `sl:sl1` scheduling group because create_scheduling_group() failed due to
# `seastar::metrics::double_registration (registering metrics twice for metrics: transport_cql_requests_count)`
@@ -467,7 +473,7 @@ async def test_service_level_metric_name_change(manager: ManagerClient) -> None:
# When sl2 is dropped, service_level_controller tries to rename its
# scheduling group to `sl:sl1`, triggering
# `seastar::metrics::double_registration (registering metrics twice for metrics: scheduler_runtime_ms)`
await cql.run_async(f"DROP SERVICE LEVEL {sl2}")
await cql.run_async(f"DROP SERVICE LEVEL {sl2}", host=h)
# Check if group0 is healthy
s2 = await manager.server_add()