logalloc: prevent false positives in reclaim_timer

reclaim_timer uses a coarse clock, but does not account for
the measurement error introduced by that -- it can falsely
report reclaims as stalls, even if they are shorter by a full
coarse clock tick from the requested threshold
(blocked-reactor-notify-ms).

Notably, if the stall threshold happens to be smaller or equal to coarse
clock resolution, Scylla's log gets spammed with false stall reports.
The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is
typically equal to 1 ms or 4 ms, and stall thresholds of this order
can occur in practice.

Eliminate false positives by requiring the measured reclaim duration to
be at least 1 clock tick longer than the configured threshold for it to
be considered a stall.

Fixes #10981

Closes #11680
This commit is contained in:
Michał Chojnowski
2022-09-30 20:15:30 +02:00
committed by Avi Kivity
parent 372eadf542
commit 4563cbe595
2 changed files with 34 additions and 1 deletions

View File

@@ -32,6 +32,12 @@ struct coarse_steady_clock {
clock_gettime(CLOCK_MONOTONIC_COARSE, &tp);
return time_point(std::chrono::seconds(tp.tv_sec) + std::chrono::nanoseconds(tp.tv_nsec));
};
static duration get_resolution() noexcept {
timespec tp;
clock_getres(CLOCK_MONOTONIC_COARSE, &tp);
return std::chrono::seconds(tp.tv_sec) + std::chrono::nanoseconds(tp.tv_nsec);
}
};
};

View File

@@ -1094,6 +1094,12 @@ public:
struct reclaim_timer {
using extra_logger = noncopyable_function<void(log_level)>;
private:
// CLOCK_MONOTONIC_COARSE is not quite what we want -- to look for stalls,
// we want thread time, not wall time. Wall time will give false positives
// if the process is descheduled.
// For this reason Seastar uses CLOCK_THREAD_CPUTIME_ID in its stall detector.
// Unfortunately, CLOCK_THREAD_CPUTIME_ID_COARSE does not exist.
// It's not an important problem, though.
using clock = utils::coarse_steady_clock;
struct stats {
occupancy_stats region_occupancy;
@@ -1427,7 +1433,28 @@ size_t segment_pool::segments_in_use() const noexcept {
}
reclaim_timer::reclaim_timer(const char* name, is_preemptible preemptible, size_t memory_to_release, size_t segments_to_release, tracker::impl& tracker, segment_pool& segment_pool, extra_logger extra_logs)
: _duration_threshold(engine().get_blocked_reactor_notify_ms())
: _duration_threshold(
// We only report reclaim stalls when their measured duration is
// bigger than the threshold by at least one measurement error
// (clock resolution). This prevents false positives.
//
// Explanation for the 10us: The clock value is not always an
// integral multiply of its resolution. In the case of coarse
// clocks, resolution only describes the frequency of syncs with
// the hardware clock -- no effort is made to round the values to
// resolution. Therefore, tick durations vary slightly in both
// directions. We subtract something slightly bigger than these
// variations, to accomodate blocked-reactor-notify-ms values which
// are multiplies of resolution.
// E.g. with kernel CONFIG_HZ=250, coarse clock resolution is 4ms.
// If also we also have blocked-reactor-notify-ms=4, then we would
// like to report two-tick stalls, since they have durations of
// 4ms-8ms. But two-tick durations can be just slightly smaller
// than 8ms (e.g. 7999us) due to the inaccuracy. So we set the
// threshold not to (blocked_reactor_notify_ms + resolution) = 8000us,
// but to (blocked_reactor_notify_ms + resolution - 10us) = 7990us,
// to account for this.
engine().get_blocked_reactor_notify_ms() + std::max(0ns, clock::get_resolution() - 10us))
, _name(name)
, _preemptible(preemptible)
, _memory_to_release(memory_to_release)