logalloc: prevent false positives in reclaim_timer

reclaim_timer uses a coarse clock, but does not account for the measurement error introduced by that -- it can falsely report reclaims as stalls, even if they are shorter by a full coarse clock tick from the requested threshold (blocked-reactor-notify-ms). Notably, if the stall threshold happens to be smaller or equal to coarse clock resolution, Scylla's log gets spammed with false stall reports. The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is typically equal to 1 ms or 4 ms, and stall thresholds of this order can occur in practice. Eliminate false positives by requiring the measured reclaim duration to be at least 1 clock tick longer than the configured threshold for it to be considered a stall. Fixes #10981 Closes #11680
2022-09-30 20:15:30 +02:00
parent 372eadf542
commit 4563cbe595
2 changed files with 34 additions and 1 deletions
--- a/utils/coarse_steady_clock.hh
+++ b/utils/coarse_steady_clock.hh
@@ -32,6 +32,12 @@ struct coarse_steady_clock {
        clock_gettime(CLOCK_MONOTONIC_COARSE, &tp);
        return time_point(std::chrono::seconds(tp.tv_sec) + std::chrono::nanoseconds(tp.tv_nsec));
    };
+
+    static duration get_resolution() noexcept {
+        timespec tp;
+        clock_getres(CLOCK_MONOTONIC_COARSE, &tp);
+        return std::chrono::seconds(tp.tv_sec) + std::chrono::nanoseconds(tp.tv_nsec);
+    }
 };

 };
--- a/utils/logalloc.cc
+++ b/utils/logalloc.cc
@@ -1094,6 +1094,12 @@ public:
 struct reclaim_timer {
    using extra_logger = noncopyable_function<void(log_level)>;
 private:
+    // CLOCK_MONOTONIC_COARSE is not quite what we want -- to look for stalls,
+    // we want thread time, not wall time. Wall time will give false positives
+    // if the process is descheduled.
+    // For this reason Seastar uses CLOCK_THREAD_CPUTIME_ID in its stall detector.
+    // Unfortunately, CLOCK_THREAD_CPUTIME_ID_COARSE does not exist.
+    // It's not an important problem, though.
    using clock = utils::coarse_steady_clock;
    struct stats {
        occupancy_stats region_occupancy;
@@ -1427,7 +1433,28 @@ size_t segment_pool::segments_in_use() const noexcept {
 }

 reclaim_timer::reclaim_timer(const char* name, is_preemptible preemptible, size_t memory_to_release, size_t segments_to_release, tracker::impl& tracker, segment_pool& segment_pool, extra_logger extra_logs)
-    : _duration_threshold(engine().get_blocked_reactor_notify_ms())
+    : _duration_threshold(
+            // We only report reclaim stalls when their measured duration is
+            // bigger than the threshold by at least one measurement error
+            // (clock resolution). This prevents false positives.
+            //
+            // Explanation for the 10us: The clock value is not always an
+            // integral multiply of its resolution. In the case of coarse
+            // clocks, resolution only describes the frequency of syncs with
+            // the hardware clock -- no effort is made to round the values to
+            // resolution. Therefore, tick durations vary slightly in both
+            // directions. We subtract something slightly bigger than these
+            // variations, to accomodate blocked-reactor-notify-ms values which
+            // are multiplies of resolution.
+            // E.g. with kernel CONFIG_HZ=250, coarse clock resolution is 4ms.
+            // If also we also have blocked-reactor-notify-ms=4, then we would
+            // like to report two-tick stalls, since they have durations of
+            // 4ms-8ms. But two-tick durations can be just slightly smaller
+            // than 8ms (e.g. 7999us) due to the inaccuracy. So we set the
+            // threshold not to (blocked_reactor_notify_ms + resolution) = 8000us,
+            // but to (blocked_reactor_notify_ms + resolution - 10us) = 7990us,
+            // to account for this.
+            engine().get_blocked_reactor_notify_ms() + std::max(0ns, clock::get_resolution() - 10us))
    , _name(name)
    , _preemptible(preemptible)
    , _memory_to_release(memory_to_release)