test/auth_cluster: make test_service_level_metric_name_change useful

The test test_service_level_metric_name_change was originally introduced to serve as a regression test for scylladb/scylla-enterprise#4912. Before the fix, some per-scheduling-group metrics would not get adjusted when the scheduling group gets renamed (which does happen for SL-managed scheduling groups) and it would be possible to attempt to register metrics with the same set of labels, resulting in an error. However, in scylladb/scylla-enterprise#4764, another bug was fixed which affected the test. Before a service level is created, a "test" scheduling group can be created by service level controller if it is unsure whether it is allowed to create more scheduling groups or not. If creation of the scheduling group succeeds, it is put into the pool of scheduling groups to be reused when a new service level is created. Therefore, the node handling CREATE SERVICE LEVEL would always use the scheduling group that was originally created for the sake of the test as a SG for the new service level. All of the above is intentional and was actually fixed by the aforementioned issue. However, the test scheduling groups would always get unique names and, therefore, the error would no longer reproduce. However, the faulty logic that ran previously and caused the bug still runs - when a node updates its service levels cache on group0 reload. The test previously used only one node. Fix it by starting two nodes instead of one at the beginning of the test and by serving all service level commands to the first node - were the issue not fixed, the error would get triggered on the second node.
2025-01-20 17:26:58 +01:00
parent de153a2ba7
commit 780ff17ff5
1 changed files with 13 additions and 7 deletions
--- a/test/auth_cluster/test_raft_service_levels.py
+++ b/test/auth_cluster/test_raft_service_levels.py
@@ -445,21 +445,27 @@ async def test_service_levels_over_limit(manager: ManagerClient):
 # Reproduces issue scylla-enterprise#4912
@pytest.mark.asyncio
 async def test_service_level_metric_name_change(manager: ManagerClient) -> None:
-    s = await manager.server_add()
-    await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)
+    servers = await manager.servers_add(2)
+    s = servers[0]
    cql = manager.get_cql()
+    [h] = await wait_for_cql_and_get_hosts(cql, [s], time.time() + 60)
+    await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)

    sl1 = unique_name()
    sl2 = unique_name()

+    # All service level commands need to run on the first node. It is the logic
+    # that is exercised during service level data reload from group0 which is
+    # prone to name reuse and, hence, could trigger the error fixed by 4912.
+
    # creates scheduling group `sl:sl1`
-    await cql.run_async(f"CREATE SERVICE LEVEL {sl1}")
+    await cql.run_async(f"CREATE SERVICE LEVEL {sl1}", host=h)
    # renames scheduling group `sl:sl1` to `sl_deleted:sl1`
-    await cql.run_async(f"DROP SERVICE LEVEL {sl1}")
+    await cql.run_async(f"DROP SERVICE LEVEL {sl1}", host=h)
    # renames scheduling group `sl_deleted:sl1` to `sl:sl2`
-    await cql.run_async(f"CREATE SERVICE LEVEL {sl2}")
+    await cql.run_async(f"CREATE SERVICE LEVEL {sl2}", host=h)
    # creates scheduling group `sl:sl1`
-    await cql.run_async(f"CREATE SERVICE LEVEL {sl1}")
+    await cql.run_async(f"CREATE SERVICE LEVEL {sl1}", host=h)
    # In issue #4912, service_level_controller thought there was no room
    # for `sl:sl1` scheduling group because create_scheduling_group() failed due to
    # `seastar::metrics::double_registration (registering metrics twice for metrics: transport_cql_requests_count)`
@@ -467,7 +473,7 @@ async def test_service_level_metric_name_change(manager: ManagerClient) -> None:
    # When sl2 is dropped, service_level_controller tries to rename its
    # scheduling group to `sl:sl1`, triggering 
    # `seastar::metrics::double_registration (registering metrics twice for metrics: scheduler_runtime_ms)`
-    await cql.run_async(f"DROP SERVICE LEVEL {sl2}")
+    await cql.run_async(f"DROP SERVICE LEVEL {sl2}", host=h)

    # Check if group0 is healthy
    s2 = await manager.server_add()