scylla

Author	SHA1	Message	Date
Kefu Chai	0ae81446ef	./: not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16766	2024-01-17 16:30:14 +02:00
Michał Chojnowski	47299d6b06	partition_snapshot_row_cursor: fix a continuity loss in ensure_entry_in_latest() with reverse reads The FIXME comment claims that setting continuity isn't very important in this place, but in fact this is just wrong. If two calls to read_from_underlying() get into a race, the one which finishes later can call ensure_entry_in_latest() on a position which lies inside a continuous interval in the newest version. If we don't take care to preserve the total continuity of the version, this can punch a hole in the continuity of the newest version, potentially reverting the affected interval to an older version. Fix that.	2023-11-16 19:01:18 +01:00
Alexey Novikov	ca4e7f91c6	compact and remove expired rows from cache on read when read from cache compact and expire row tombstones remove expired empty rows from cache do not expire range tombstones in this patch Refs #2252, #6033 Closes #12917	2023-06-26 15:29:01 +02:00
Michał Chojnowski	d56b0c20f4	cache_flat_mutation_reader: use the correct schema in prepare_hash Since `mvcc: make schema upgrades gentle` (`51e3b9321b`), rows pointed to by the cursor can have different (older) schema than the schema of the cursor's snapshot. However, one place in the code wasn't updated accordingly, causing a row to be processed with the wrong schema in the right circumstances. This passed through unit testing because it requires a digest-computing cache read after a schema change, and no test exercised this. Fixes #14110	2023-06-19 22:50:43 +02:00
Michał Chojnowski	5f68409934	partition_snapshot_row_cursor: handle multi-schema snapshots To support gentle schema upgrades, each version has its own schema. Currently this facility is unused, and the schema is equal for all versions in a snapshot. But in upcoming commits this will change. In the new design, after an entry upgrade, there will be a transitional period where two versions with different schemas will coexist in a snapshot. Eventually, these versions will be merged by mutation_cleaner into one version with the current schema, but until then reads have to merge multi-schema snapshots on the fly. This commit implements in the cursor support for per-version schemas.	2023-05-04 02:37:29 +02:00
Kefu Chai	1cb95b8cff	mutation: specialize fmt::formatter<tombstone> and fmt::formatter<shadowable_tombstone> this is a part of a series to migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `tombstone` and `shadowable_tombstone` without the help of `operator<<`. in this change, only `operator<<(ostream&, const shadowable_tombstone&)` is dropped, and all its callers are now using fmtlib for formatting the instances of `shadowable_tombstone` now. `operator<<(ostream&, const tombstone&)` is preserved. as it is still used by Boost::test for printing the operands in case the comparing tests fail. please note, before this change we were using a concrete string for indent. after this change, some of the places are changed to using fmtlib for indent. Refs scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-12 10:57:03 +08:00
Kefu Chai	76dde9fd50	partition_snapshot_row_cursor: do not use operator<< when printing position in order to prepare for dropping the `operator<<()` for `position_in_partition_view`, let's use fmtlib to print `position()`. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-31 19:03:14 +08:00
Avi Kivity	c5e4bf51bd	Introduce mutation/ module Move mutation-related files to a new mutation/ directory. The names are kept in the global namespace to reduce churn; the names are unambiguous in any case. mutation_reader remains in the readers/ module. mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this patch. This is a step forward towards librarization or modularization of the source base. Closes #12788	2023-02-14 11:19:03 +02:00
Tomasz Grabiec	026f8cc1e7	db: Use mutation_partition_v2 in mvcc This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes #2578 Fixes #3288 Fixes #10587	2023-01-27 21:56:28 +01:00
Tomasz Grabiec	6b7473be53	mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots This is a prerequisite for using the cursor in memtable readers. Non-evictable snapshots are those which live in memtables. Unlike evictable snapshots, they don't have a dummy entry at position after all clustering rows. In evictable snapshots, lookup always finds an entry, not so with non-evictable snapshots. The cursor was not prepared for this case, this patch handles it.	2023-01-27 19:15:39 +01:00
Tomasz Grabiec	091ad8f6ee	mvcc: partition_snapshot_row_cursor: Support digest calculation Prerequisite for using in memtable reader.	2023-01-27 19:15:39 +01:00
Tomasz Grabiec	a6a61eaf96	row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy Scenario: cache = [ row(pos=2, continuous=false), row(pos=after(2), dummy=true) ] Scanning read starts, starts populating [-inf, before(2)] from sstables. row(pos=2) is evicted. cache = [ row(pos=after(2), dummy=true) ] Scanning read finishes reading from sstables. Refreshes cache cursor via partition_snapshot_row_cursor::maybe_refresh(), which calls partition_snapshot_row_cursor::advance_to() because iterators are invalidated. This advances the cursor to after(2). no_clustering_row_between(2, after(2)) returns true, so advance_to() returns true, and maybe_refresh() returns true. This is interpreted by the cache reader as "the cursor has not moved forward", so it marks the range as complete, without emitting the row with pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads will also miss the row. The bug is in advance_to(), which is using no_clustering_row_between(a, b) to determine its result, which by definition excludes the starting key. Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction with reduced key range in the random_mutation_generator (1024 -> 16). Fixes #11239	2022-08-09 02:28:56 +02:00
Tomasz Grabiec	a58fee1dcf	partition_snapshot_row_cursor: Fix over-counting of rows insert_before() may need to allocate memory for a btree, so may fail. Call cache_tracker::insert() only after successful instance so that row counters reflect the correct state. On failure, the entry will be unlinked automatically by rows_entry destructor, but row counters in the cache_tracker will not be automatically decremented.	2022-08-02 11:02:22 +02:00
Michał Chojnowski	5570354f44	partition_snapshot_row_cursor: construct the clustering_row directly in row() Currently row() creates an empty clustering_row, then applies deletable_rows from the cursor to the empty clustering_row. But the apply logic is unnecessary for the first apply(), and it's cheaper to simply copy the row.	2022-06-20 15:45:19 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Tomasz Grabiec	d678890757	row_cache: partition_snapshot_row_cursor: Print more details about the current version vector Now the format is the same as for the "heap" version vector. Contains positions and continuity flags. Helps in debugging. Before: {cursor: position={position: clustered,ckp{...},-1}, cont=0, rev=1, current=[0], heap=[ ], latest_iterator=[{position: clustered,ckp{...},-1}]} After: {cursor: position={position: clustered,ckp{...},-1}, cont=0, rev=1, current=[{v=0, pos={position: clustered,ckp{...},-1}, cont=false}], heap=[ ], latest_iterator=[{position: clustered,ckp{...},-1}]}	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	63351483f0	row_cache: Support reverse reads natively Some implementation notes below. When iterating in reverse, _last_row is after the current entry (_next_row) in table schema order, not before like in the forward mode. Since there is no dummy row before all entries, reverse iteration must now be prepared for the fact that advancing _next_row may land not pointing at any row. The partition_snapshot_row_cursor maintains continuity() correctly in this case, and positions the cursor before all rows, so most of the code works unchanged. The only exception is in move_to_next_entry(), which now cannot assume that failure to advance to an entry means it can end a read. maybe_drop_last_entry() is not implemented in reverse mode, which may expose reverse-only workload to the problem of accumulating dummy entries. ensure_population_lower_bound() was not updating _last_row after inserting the entry in latest version. This was not a problem for forward reads because they do not modify the row in the partition snapshot represented by _last_row. They only need the row to be there in the latest version after the call. It's different for reversed reads, which change the continuity of the entry represented by _last_row, hence _last_row needs to have the iterator updated to point to the entry from the latest version, otherwise we'd set the continuity of the previous version entry which would corrupt the continuity.	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	757fc1275f	partition_snapshot_row_cursor: Support reverse iteration	2021-12-19 22:41:35 +01:00
Pavel Emelyanov	ee103636ac	row-cache: Handle exception (un)safety of rows_entry insertion The B-tree's insert_before() is a throwing operation, its caller must account for that. When the rows_entry's collection was switched on B-tree all the risky places were fixed by `ee9e1045`, but a few places went under the radar. In the cache_flat_mutation_reader there's a place where a C-pointer is inserted into the tree, thus potentially leaking the entry. In the partition_snapshot_row_cursor there are two places that not only leak the entry, but also leave it in the LRU list. The latter is quite nasty, because those entries can be evicted, eviction code tries to get rows_entry iterator from "this", but the hook happens to be unattached (because insertion threw) and fails the assert. fixes: #9728 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-10 12:35:12 +03:00
Pavel Emelyanov	9fd8db318d	partition_snapshot_row_cursor: Shuffle ensure_result creation Both places get the C-pointer on the freshly allocated rows_entry, insert it where needed and return back the dereferenced pointer. The C-pointer is going to become smart-pointer that would go out of scope before return. This change prepares for that by constructing the ensure_result from the iterator, that's returned from insertion of the entry. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-10 12:35:12 +03:00
Pavel Emelyanov	05b8cdfd24	mutation_partition: Return immutable collection for rows Patch the .clustered_rows() method to return the btree of rows wrapped into the immutable_collection<> so that callers are guaranteed not to touch the collection itself, but still can modify the elements in it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	ad27bf40e6	mutation_partition: Pin mutable access to rows Some callers of mutation_partition::clustered_rows() don't want (and shouldn't) modify the tree of rows, while they may want to modify the rows themselves. This patch explicitly locates those that need to modify the collection, because the next patch will return immutable collection for the others. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	6ef27c9fa1	btree: Make iterators not modify the tree itself The const_iterator cannot modify anything, but the plain iterator has public methods to remove the key from the tree. To control how the tree is modified this method must be marked private and modification by iterator should come from somewhere else. This somewhere else is the existing key_grabber that's already used to move keys between trees. Generalize this ability to move a key out of a tree (i.e. -- erase). Once done -- mark the iterator::erase_and_dispose private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Avi Kivity	a55b434a2b	treewide: extent copyright statements to present day	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	fff7ef1fc2	treewide: reduce boost headers usage in scylla header files `dev-headers` target is also ensured to build successfully. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 01:33:18 +03:00
Pavel Emelyanov	89eece3aca	partition_snapshot_row_cursor: Rewrite row() with consume_row() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	ae6b677f9a	partition_snapshot_row-cursor: Add const consume_row() version It's the same as the existing one, but doesn't modify anything (cursor and pointing rows_entry's) and calls consumer with const row reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	5e28075ec0	partition_snapshot_row_cursor: Add concept to .consume_row() Nothing special here Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	d891cfe6cd	partition_snapshot_row_cursor: Don't carry end iterators The btree's iterator can be checked to reach the tree's end without holding the ending iterator itself. This makes the whole p_s_r_c 20% smaller (288 bytes -> 224 bytes) since it now keeps 4 extra iterators on-board -- inside small vectors for heap and current_row. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	4558eb3afc	partition_snapshot_row_cursor: Move cells hash creation to reader Right now call to .row() method may create hash on row's cells. It's counterintuitive to see a const method that transparently changes something it points to. Since the only caller of a row() who knows whether the hash creation is required is the cache reader, it's better to move the call to prepare_hash() into it. Other than making the .row() less surprising this also helps to get rid of the whole method by the next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	00caf5f219	partition_snapshot_row_cursor: Move read_partition into test The method in question is test-only helper, there's no need in keeping it as a part of the API. Another reason to move is that the method is O(number of rows) and doesn't preempt while looping, but cursor code users try hard not to stall the reactor. So even though this method has a meaningful semantics within the class, it will better be reinvented if needed in core code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:16:13 +03:00
Pavel Emelyanov	9f323355a6	partition_snapshot_row_cursor: Move is_in_latest_version inline The method is currently defined outside of the class which gives compiler less chances to really inline it when needed. Also, keeping this simple piece of code inline is less code to read (and compile). Mark the guy noexcept while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	cc57e35c6a	partition_snapshot_row_cursor: Use is_in_latest_version where appropriate Checking for _current_row[0].version being 0 (or not being 0) is better understood if done with a well named existing helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	353a8f66a2	partition_snapshot_row_cursor: Less dereferences in key() method The valid cursor's key is kept on the _position as well, but getting it from there is 1 dereference less: _current_row -()-> row -> key _position -()-> std::optional -> key iterator's -> is pointer dereference ** std::optional is designed not to be a pointer Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	353a1306ce	partition_snapshot_row_cursor: Update change mark in prepare_heap The heap's iterators validity is checked with the change mark, which is updated every time heap is recreated. Factor these updates out and keep the mark together with the heap it protects. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	1a1f05f50b	partition_snapshot_row_cursor: Clear current row when recreating The cursor keeps current row in a separate vector of iterators and reconstructs it in a dedicated method, which _expects_ that the vector is empty on entry. It's better to keep the logic of current row construction in one place. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	2edd072d27	partition_snapshot_row_cursor: Use btree::lower_bound sugar When checking if the lower-bound entry matched the search key it's possible to avoid extra comparison with the help of the collection used to store the rows (btree). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	9aee0ad8b3	partition_snapshot_row_cursor: Factor out next() and erase_and_advance() Both helpers do the same -- advance the cursor to the next row. The latter may additionally remove the row from the uniquely owned version. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	2fb0f7315c	partition_snapshot_row_cursor: Relax vector of iterators The cursor maintains a vector of iterators that correspond to each of the versions scanned. However, only the iterator in the latest one is really needed, so the whole vector can be reduced down to an optional. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Tomasz Grabiec	cb0b8d1903	row_cache: Zap dummy entries when populating or reading a range This will prevent accumulation of unnecessary dummy entries. A single-partition populating scan with clustering key restrictions will insert dummy entries positioned at the boundaries of the clustering query range to mark the newly populated range as continuous. Those dummy entries may accumulate with time, increasing the cost of the scan, which needs to walk over them. In some workloads we could prevent this. If a populating query overlaps with dummy entries, we could erase the old dummy entry since it will not be needed, it will fall inside a broader continuous range. This will be the case for time series workloads which scan with a decreasing (newest) lower bound. Refs #8153. _last_row is now updated atomically with _next_row. Before, _last_row was moved first. If exception was thrown and the section was retried, this could cause the wrong entry to be removed (new next instead of old last) by the new algorithm. I don't think this was causing problems before this patch. The problem is not solved for all the cases. After this patch, we remove dummies only when there is a single MVCC version. We could patch apply_monotonically() to also do it, so that dummies which are inside continuous ranges are eventually removed, but this is left for later. perf_row_cache_reads output after that patch shows that the second scan touches no dummies: $ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 265320 Scanning read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB] read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB] Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>	2021-03-01 20:34:35 +02:00
Pavel Emelyanov	4ccce97396	partition_snapshot_row_cursor: Remove rows pointer The pointer is needed to erase an element by its iterator from the rows container. The B-tree has this method on iterator and it does NOT need to walk up the tree to find its root, so the complexity is still amortized constant. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	5c0f9a8180	mutation_partition: Switch cache of rows onto B-tree The switch is pretty straightforward, and consists of - change less-compare into tri-compare - rename insert/insert_check into insert_before_hint - use tree::key_grabber in mutation_partition::apply_monotonically to exception-safely transfer a row from one tree to another - explicitly erase the row from tree in rows_entry::on_evicted, there's a O(1) tree::iterator method for this - rewrite rows_entry -> cache_entry transformation in the on_evicted to fit the B-tree API - include the B-tree's external memory usage into stats That's it. The number of keys per node is set to 12 with linear search and linear extension of 20 because - experimenting with tree shows that numbers 8 through 10 keys with linear search show the best performance on stress tests for insert/find-s of keys that are memcmp-able arrays of bytes (which is an approximation of current clustering key compare). More keys work slower, but still better than any bigger value with any type of search up to 64 keys per node - having 12 keys per nodes is the threshold at which the memory footprint for B-tree becomes smaller than for boost::intrusive::set for partitions with 32+ keys - 20 keys for linear root eats the first-split peak and still performs well in linear search As a result the footprint for B tree is bigger than the one for BST only for trees filled with 21...32 keys by 0.1...0.7 bytes per key. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Botond Dénes	54357221f0	partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment It is what its callers want anyway.	2020-09-28 10:53:56 +03:00
Botond Dénes	4f5ccf82cb	mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/ We will soon want to update the memory consumption of mutation fragment after each modification done to it, to do that safely we have to forbid direct access to the underlying data and instead have callers pass a lambda doing their modifications. Uses where this method was just used to move the fragment away are converted to use `as_clustering_row() &&`.	2020-09-28 10:53:56 +03:00
Pavel Emelyanov	ca148acbf9	deletable_row: Do not mess with clustering_row The deletable_row accepts clustering_row in constructor and .apply() method. The next patch will make clustering_row embed the deletable_row inside, so those two methods will violate layering and should be fixed in advance. The fix is in providing a clustering_row method to convert itself into a deletable_row. There are two places that need this: mutation_fragment_applier and partition_snapshot_row_cursor. Both methods pass temporary clustering_row value, so the method in question is also move-converter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-09-08 22:18:15 +03:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Avi Kivity	78182a704b	partition_snapshot_row_cursor: initialize _dummy and _continuous Debug mode view_schema_test sometimes complains that a bool member doesn't contain in-range values, apparently in the move constructor. Initialize them for its benefit to avoid false-positive test failures. Message-Id: <20180602184934.31258-1-avi@scylladb.com>	2018-06-02 19:51:36 +01:00
Paweł Dziepak	ec9d166a4f	treewide: require type to compute cell memory usage	2018-05-31 15:51:11 +01:00
Paweł Dziepak	27014a23d7	treewide: require type info for copying atomic_cell_or_collection	2018-05-31 15:51:11 +01:00
Tomasz Grabiec	4561e97efe	mvcc: Use small_vector<> in partition_snapshot_row_cursor I measured 8% improvement in cache update throughput for small partitions.	2018-05-30 14:41:41 +02:00
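
Several entries above (`a58fee1dcf`, `ee103636ac`) turn on the same ordering rule: an insertion that may throw has to complete before any bookkeeping is updated, otherwise counters or LRU links describe an entry that was never linked into the tree. The following is a minimal sketch of that ordering only, not the code from those commits. `row_tracker`, `throwing_rows` and `insert_row` are hypothetical stand-ins invented for illustration; they are not Scylla's `cache_tracker`, B-tree or cursor API.

```cpp
#include <cstddef>
#include <new>
#include <set>

// Hypothetical stand-in for a tracker that counts rows linked into the cache.
struct row_tracker {
    std::size_t rows = 0;
    void on_insert() noexcept { ++rows; }
};

// Hypothetical container whose insert may throw, the way a tree insert can
// fail when it has to allocate a new node.
struct throwing_rows {
    std::set<int> keys;
    bool fail_next = false;  // test hook: make the next insert throw

    void insert(int key) {
        if (fail_next) {
            fail_next = false;
            throw std::bad_alloc();
        }
        keys.insert(key);
    }
};

// Insert first, account second: the tracker is only touched once the
// container has accepted the key, so a failed insert leaves the counter
// unchanged.
inline void insert_row(throwing_rows& rows, row_tracker& tracker, int key) {
    rows.insert(key);     // may throw; nothing has been counted yet
    tracker.on_insert();  // reached only after the insertion succeeded
}
```

If the two statements in `insert_row` were swapped, a throwing insert would leave `tracker.rows` one higher than the number of keys actually stored, which is the kind of over-counting the `a58fee1dcf` message describes.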


87 Commits