partition_snapshot_row_cursor: fix reversed maybe_refresh() losing latest version entry

In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version()
path calls lower_bound(_position) on the latest version's rows to find the
cursor's position in that version. When lower_bound returns null (the cursor
is positioned above all entries in the latest version in table order), the code
unconditionally sets _background_continuity = true and allows the subsequent
if(!it) block to erase the latest version's entry from the heap.

This is correct for forward traversal: null means there are no more entries
ahead, so removing the version from the heap is safe.

However, in reversed mode, null from lower_bound means the cursor is above
all entries in table order -- those entries are BELOW the cursor in query
order and will be visited LATER during reversed traversal. Erasing the heap
entry permanently loses them, causing live rows to be skipped.

The fix mirrors what prepare_heap() already does correctly: when lower_bound
returns null in reversed mode, use std::prev(rows.end()) to keep the last
entry in the heap instead of erasing it.

Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test,
alongside the existing reversed cursor tests. The test creates a two-version
partition snapshot (v0 with range tombstones, v1 with a live row positioned
below all v0 entries in table order), and
traverses in reverse calling maybe_refresh() at each step -- directly
exercising the buggy code path. The test fails without the fix.

The bug was introduced by 6b7473be53 ("Handle non-evictable snapshots",
2022-11-21), which added null-iterator handling for non-evictable snapshots
(memtable snapshots lack the trailing dummy entry that evictable snapshots
have). prepare_heap() got correct reversed-mode handling at that time, but
maybe_refresh() received only forward-mode logic.

The bug is intermittent because multiple mechanisms cause iterators_valid()
to return false, forcing maybe_refresh() to take the full rebuild path via
prepare_heap() (which handles reversed mode correctly):
  - Mutation cleaner merging versions in the background (changes change_mark)
  - LSA segment compaction during reserve() (invalidates references)
  - B-tree rebalancing on partition insertion (invalidates references)
  - Debug mode's always-true need_preempt() creating many multi-version
    partitions via preempted apply_monotonically()

A dtest reproducer confirmed the same root cause: with 100K overlapping range
tombstones creating a massively multi-version memtable partition (287K preemption
events), the reversed scan's latest_iterator was observed jumping discontinuously
during a version transition -- the latest version's heap entry was erased --
causing the query to walk the entire partition without finding the live row.

Fixes: SCYLLADB-1253

Closes scylladb/scylladb#29368

(cherry picked from commit 21d9f54a9a)

Closes scylladb/scylladb#29480
This commit is contained in:
Avi Kivity
2026-04-07 10:53:50 +03:00
parent e436db01e3
commit 196db8931e
2 changed files with 121 additions and 1 deletions

View File

@@ -461,7 +461,17 @@ public:
}
}
} else {
_background_continuity = true; // Default continuity
if (_reversed) [[unlikely]] {
if (!rows.empty()) {
it = std::prev(rows.end());
cont = is_continuous::yes;
rt = {};
} else {
_background_continuity = true;
}
} else {
_background_continuity = true;
}
}
if (!it) {

View File

@@ -1990,6 +1990,116 @@ SEASTAR_TEST_CASE(test_reverse_cursor_refreshing_on_nonevictable_snapshot_with_e
});
}
// Reproducer for SCYLLADB-1253 (https://github.com/scylladb/scylladb/issues/18732)
// A reversed query with many overlapping range tombstones and a single live row
// near the end of the tombstone range fails to return the live row.
//
// The bug is in partition_snapshot_row_cursor::maybe_refresh(), in the
// !is_in_latest_version() path. When the cursor is reversed and positioned
// above all entries in the latest version (in table order), it incorrectly
// removes the latest version's entry from the heap, causing the live row
// to be skipped.
//
// This test creates a multi-version partition snapshot:
// - v0 (older): contains overlapping range tombstones
// - v1 (latest): contains the live row with a higher timestamp
// and directly exercises the cursor to verify that advance+maybe_refresh
// correctly keeps the live row in the heap during reversed traversal.
SEASTAR_TEST_CASE(test_reversed_maybe_refresh_keeps_latest_version_entry) {
return seastar::async([] {
logalloc::region region;
mutation_application_stats app_stats;
mutation_cleaner cleaner(region, no_cache_tracker, app_stats);
simple_schema ss(simple_schema::with_static::no);
auto s = ss.schema();
auto rev_s = s->make_reversed();
const int num_tombstones = 100;
const int range_span = num_tombstones; // each range covers [i, i + range_span)
const int row_ck = 5; // below all range tombstone boundary entries
// Step 1: Create a partition_entry with all range tombstones.
auto pe_ptr = with_allocator(region.allocator(), [&] {
logalloc::allocating_section as;
return as(region, [&] () -> std::unique_ptr<partition_entry> {
mutation m(s, ss.make_pkey(0));
for (int i = 0; i < num_tombstones; ++i) {
auto range = query::clustering_range::make(
query::clustering_range::bound(ss.make_ckey(i), true),
query::clustering_range::bound(ss.make_ckey(i + range_span), false));
ss.delete_range(m, range);
}
return std::make_unique<partition_entry>(*s, std::move(m.partition()));
});
});
// Step 2: Take a snapshot to pin the current version (v0 with tombstones).
auto snap1 = with_allocator(region.allocator(), [&] {
logalloc::allocating_section as;
return as(region, [&] {
return pe_ptr->read(region, cleaner, no_cache_tracker);
});
});
// Step 3: Apply the live row. Since v0 is pinned by snap1,
// this creates v1 (latest version) with just the live row.
with_allocator(region.allocator(), [&] {
logalloc::allocating_section as;
as(region, [&] {
mutation m(s, ss.make_pkey(0));
ss.add_row(m, ss.make_ckey(row_ck), "live_value");
pe_ptr->apply(region, cleaner, *s, m.partition(), *m.schema(), app_stats);
});
});
// Step 4: Take a second snapshot (sees both versions) and test cursor.
auto snap2 = with_allocator(region.allocator(), [&] {
logalloc::allocating_section as;
return as(region, [&] {
return pe_ptr->read(region, cleaner, no_cache_tracker);
});
});
{
logalloc::reclaim_lock rl(region);
partition_snapshot_row_cursor cursor(*rev_s, *snap2, false /* unique_owner */, true /* reversed */);
// Position cursor at the very end (in reversed/query order, this means
// the highest table-order position).
cursor.maybe_advance_to(position_in_partition_view::before_all_clustered_rows());
bool has_row = cursor.at_a_row();
// Traverse all entries in reversed order, calling maybe_refresh()
// before processing each row. This simulates what
// partition_snapshot_reader::next_interval() does and is where
// the bug manifests.
bool found_live_row = false;
while (has_row) {
cursor.maybe_refresh();
if (!cursor.dummy()) {
found_live_row = true;
break;
}
has_row = cursor.next();
}
BOOST_REQUIRE_MESSAGE(found_live_row,
fmt::format("Reversed cursor failed to find the live row at ck={}. "
"The !is_in_latest_version() path in maybe_refresh() "
"incorrectly removed the latest version's entry from the heap.",
row_ck));
}
// Cleanup
snap2 = {};
snap1 = {};
with_allocator(region.allocator(), [&] {
pe_ptr.reset();
});
});
}
SEASTAR_TEST_CASE(test_apply_to_incomplete_with_dummies) {
return seastar::async([] {