service: migration_manager: Run group0 barrier in gossip scheduling group

Fixes two issues.

One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.

The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.

Fixes #24925

(cherry picked from commit ee2fa58bd6)
This commit is contained in:
Tomasz Grabiec
2025-07-11 15:48:44 +02:00
committed by GitHub Action
parent 36d2f80f38
commit 434ecdee0e

View File

@@ -56,7 +56,9 @@ migration_manager::migration_manager(migration_notifier& notifier, gms::feature_
, _group0_barrier(this_shard_id() == 0 ?
std::function<future<>()>([this] () -> future<> {
// This will run raft barrier and will sync schema with the leader
(void)co_await start_group0_operation();
return with_scheduling_group(_storage_proxy.get_db().local().get_gossip_scheduling_group(), [this] {
return start_group0_operation().discard_result();
});
}) :
std::function<future<>()>([this] () -> future<> {
co_await container().invoke_on(0, [] (migration_manager& mm) -> future<> {