
Pavel Emelyanov 3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes

The batchlog table contains an entry for each logged batch that the local node processes as coordinator. These entries are typically very short-lived: they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that none of them holds live data older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table and attempting to replay and delete any live entries that are old enough to be replayed. Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the table has to process thousands to millions of tombstones. This was observed to take up to 20 minutes, slowing repairs to a crawl, as the batchlog flush has to be repeated at the end of the repair of each token range.
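
To make the cost concrete, here is a toy Python model of a replay over the old partition-per-batch layout. The `Entry` type, the `replay_old_layout` function and the integer timestamps are illustrative assumptions, not Scylla's actual code; the only point is that the scan must step over every tombstone to find the few live entries.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    written_at: int  # write time, in arbitrary ticks (assumption)
    live: bool       # False means only a partition tombstone remains


def replay_old_layout(entries, now, min_age):
    """Scan every partition; replay and delete live entries that are old enough."""
    visited = 0
    replayed = []
    for e in entries:
        visited += 1  # a tombstone still costs a visit
        if e.live and now - e.written_at >= min_age:
            replayed.append(e)
            e.live = False  # the delete leaves yet another tombstone behind
    return visited, replayed


# 10,000 already-deleted batches and 3 live ones
entries = [Entry(written_at=t, live=False) for t in range(10_000)]
entries += [Entry(written_at=10_000 + t, live=True) for t in range(3)]
visited, replayed = replay_old_layout(entries, now=20_000, min_age=100)
print(visited, len(replayed))  # 10003 3
```

In this model the work is proportional to the number of tombstones, not to the number of batches that actually need replaying.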

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed; see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be a barrier that is impossible to bypass, preventing quick garbage-collection of tombstones. So long as a single commit-log segment holding content from the batchlog table is alive, all tombstones written after it are blocked from GC.
The second approach, represented by this PR, is to not rely on tombstone GC to reduce the number of tombstones. Instead, restructure the table such that a single higher-order tombstone can be used to shadow, and allow for the eviction of, the myriad individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is implemented by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher-level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog creation time (id). This allows for selecting batchlogs to replay without post-filtering of batchlogs that are too young to be replayed.
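
The goals above can be sketched with a small in-memory Python model. Everything here is an assumption made for illustration: the `BatchlogV2Model` class, the numeric values of the stage constants, integer timestamps, and the use of a list-prefix deletion as a stand-in for a single range tombstone. How the failed_replay stage is itself retried and cleaned up is not modeled.

```python
import bisect
from collections import defaultdict

INITIAL, FAILED_REPLAY = 0, 1  # hypothetical values for the stage column


class BatchlogV2Model:
    def __init__(self):
        # (version, stage, shard) -> list of (written_at, id), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, key, written_at, batch_id):
        bisect.insort(self.partitions[key], (written_at, batch_id))

    def replay_initial_stage(self, version, shard, cutoff, apply_batch):
        """Replay rows with written_at < cutoff, then drop that prefix at once."""
        rows = self.partitions[(version, INITIAL, shard)]
        # Rows are ordered by written_at, so the replayable set is a prefix:
        # no post-filtering of too-young batches is needed.
        end = bisect.bisect_left(rows, (cutoff,))
        for written_at, batch_id in rows[:end]:
            if not apply_batch(batch_id):
                # Failed entries move to the failed_replay stage, so the
                # initial stage can be cleaned up unconditionally.
                self.insert((version, FAILED_REPLAY, shard), written_at, batch_id)
        # Stand-in for one range tombstone covering everything below cutoff.
        del rows[:end]
        return end


model = BatchlogV2Model()
for t in range(5):
    model.insert((1, INITIAL, 0), written_at=t, batch_id=f"batch-{t}")

removed = model.replay_initial_stage(
    version=1, shard=0, cutoff=3, apply_batch=lambda b: b != "batch-1"
)
print(removed)                                  # 3
print(model.partitions[(1, INITIAL, 0)])        # [(3, 'batch-3'), (4, 'batch-4')]
print(model.partitions[(1, FAILED_REPLAY, 0)])  # [(1, 'batch-1')]
```

The cleanup step is a single operation regardless of how many entries were replayed, which is the property the range tombstone is meant to provide in the real table.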

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, so normally not a backport candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs

2025-12-15 15:05:19 +03:00

.github

docs: fix local build

2025-12-14 11:48:48 +02:00

abseil @ d7aaad83b4

…

alternator

alternator: require rf_rack_valid_keyspaces when creating index

2025-12-15 10:36:57 +02:00

api

Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"

2025-12-12 03:55:13 +00:00

audit

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

auth

Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski

2025-12-14 14:01:01 +02:00

bin

treewide: improve bash error reporting

2025-02-10 18:28:52 +03:00

cdc

Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek

2025-11-30 07:20:22 +01:00

cmake

build: disable the -fextend-variable-liveness clang option

2025-10-21 10:47:34 +03:00

compaction

Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho

2025-11-26 06:51:30 +02:00

conf

Revert "Merge 'db/config: enable ms sstable format by default' from Michał Chojnowski"

2025-12-02 14:38:56 +02:00

cql3

vector_search: throw an error when we restrict primary in vector search

2025-12-15 09:45:56 +02:00

data_dictionary

data_dictionary: table: add get_truncation_time()

2025-12-02 14:21:25 +02:00

Merge 'batchlog: make replay (flush) faster' from Botond Dénes

2025-12-15 15:05:19 +03:00

debug

…

dht

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

dist

scylla-node-exporter: Add ethtool to node exporter

2025-12-08 14:27:10 +02:00

docs

docs: fix local build

2025-12-14 11:48:48 +02:00

ent

Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk

2025-12-05 11:36:50 +02:00

exceptions

exceptions.hh: fix message argument passing

2025-08-13 13:39:52 +02:00

gms

Revert "repair: Add tablet repair progress report support"

2025-12-11 12:18:11 +02:00

idl

direct_failure_detector: pass timeout to direct_fd_ping verb

2025-12-02 14:55:20 +02:00

index

vector_index: require tablets for vector indexes

2025-11-26 13:30:43 +02:00

keys

api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'

2025-10-01 15:53:25 +02:00

lang

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

licenses

utils: license: import crypt_sha512.c from musl to the project

2025-12-10 15:36:18 +01:00

locator

locator/token_metadata: Remove get_host_id()

2025-12-15 10:36:52 +01:00

message

direct_failure_detector: run direct failure detector in the gossiper scheduling group

2025-12-04 11:35:43 +02:00

mutation

cache, mvcc: Preempt cache update when applying range tombstone from memtable

2025-12-06 13:45:35 +01:00

mutation_writer

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

node_ops

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

pgo

Update pgo profiles - aarch64

2025-12-15 05:16:31 +02:00

query

code: Replace distributed<> with sharded<>

2025-09-19 12:22:51 +02:00

raft

code: Stop using seastar::compat::source_location

2025-11-27 19:10:11 +02:00

readers

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

reloc

treewide: improve bash error reporting

2025-02-10 18:28:52 +03:00

repair

Revert "repair: Add tablet repair progress report support"

2025-12-11 12:18:11 +02:00

replica

Merge 'batchlog: make replay (flush) faster' from Botond Dénes

2025-12-15 15:05:19 +03:00

rust

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

schema

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

scripts

docs: fix local build

2025-12-14 11:48:48 +02:00

seastar @ 7ec14e836a

Update seastar submodule

2025-12-03 10:55:47 +03:00

service

Merge 'batchlog: make replay (flush) faster' from Botond Dénes

2025-12-15 15:05:19 +03:00

sstables

Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"

2025-12-12 03:55:13 +00:00

streaming

Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk

2025-11-24 06:37:27 +02:00

swagger-ui @ 12f1da1082

…

tasks

Merge 'compaction: handle exception in expected_total_workload' from Aleksandra Martyniuk

2025-09-17 15:10:19 +03:00

test

Merge 'batchlog: make replay (flush) faster' from Botond Dénes

2025-12-15 15:05:19 +03:00

tools

Merge 'batchlog: make replay (flush) faster' from Botond Dénes

2025-12-15 15:05:19 +03:00

tracing

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

transport

transport: remove redundant futurize_invoke from counted data sink and source

2025-12-11 10:32:16 +03:00

types

fix rjson::value to bytes conversion with missing GetStringLength call

2025-12-09 19:27:22 +01:00

unified

treewide: improve bash error reporting

2025-02-10 18:28:52 +03:00

utils

Merge 'batchlog: make replay (flush) faster' from Botond Dénes

2025-12-15 15:05:19 +03:00

vector_search

fix rjson::value to string_view conversion with missing GetStringLength call

2025-12-09 19:27:21 +01:00

.clang-format

…

.dockerignore

…

.gitattributes

…

.gitignore

.gitignore: add rust target

2025-08-19 13:09:18 +03:00

.gitmodules

build: replace tools/java submodule with packaged cassandra-stress

2025-04-15 10:11:28 +03:00

.gitorderfile

…

.mailmap

…

absl-flat_hash_map.cc

…

absl-flat_hash_map.hh

…

amplify.yml

…

backlog_controller_fwd.hh

db/config: introduce new config parameter compaction_max_shares

2025-11-24 12:52:29 -03:00

backlog_controller.hh

db/config: introduce new config parameter compaction_max_shares

2025-11-24 12:52:29 -03:00

build_mode.hh

…

bytes_fwd.hh

…

bytes_ostream.hh

treewide: Replace __builtin_expect with (un)likely

2025-07-03 13:34:04 +03:00

bytes.cc

…

bytes.hh

bytes: adapt fmt_hex to std::span<const std::byte>

2025-04-01 00:07:27 +02:00

cartesian_product.hh

…

client_data.cc

…

client_data.hh

…

clocks-impl.cc

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

clocks-impl.hh

…

CMakeLists.txt

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

configure.py

test/boost: coroutinize auth_passwords_test

2025-12-10 15:36:18 +01:00

CONTRIBUTING.md

docs: fix typos and spelling errors

2025-09-30 13:16:49 +02:00

coverage_excludes.txt

…

coverage_sources.list

…

db_clock.hh

…

debug.cc

gdb: protect debug::the_database from lto

2025-01-23 22:26:04 +02:00

debug.hh

gdb: protect debug::the_database from lto

2025-01-23 22:26:04 +02:00

default.nix

…

Doxyfile

…

encoding_stats.hh

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

enum_set.hh

auth: add possibilty to check for any permission in set

2025-10-03 16:55:57 +02:00

exported_templates.cc

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

exported_templates.hh

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

fix_system_distributed_tables.py

…

flake.lock

…

flake.nix

…

gc_clock.hh

…

gdbinit

…

gen_segmented_compress_params.py

compress: move compress.cc/hh to sstables/compressor

2025-07-31 13:10:41 +03:00

HACKING.md

docs: fix typos and spelling errors

2025-09-30 13:16:49 +02:00

hashing_partition_visitor.hh

…

idl-compiler.py

idl-compiler.py: generate skip() definition for enums serializers

2025-06-24 11:05:31 +03:00

inet_address_vectors.hh

storage_proxy: handle node_local_only in mutate

2025-07-24 19:48:08 +02:00

init.cc

db: experimental consistent-tablets option

2025-10-15 11:27:10 +03:00

init.hh

code: Replace distributed<> with sharded<>

2025-09-19 12:22:51 +02:00

install-dependencies.sh

build: update toolchain to Fedora 43 with clang 21.1.6

2025-12-09 15:16:31 +02:00

install.sh

main: replace p11-kit hack for trust paths override with gnutls hack

2025-12-04 11:33:51 +02:00

LICENSE-ScyllaDB-Source-Available.md

Fix typos

2025-02-13 01:54:08 +02:00

main.cc

Merge 'batchlog: make replay (flush) faster' from Botond Dénes

2025-12-15 15:05:19 +03:00

marshal_exception.hh

…

mutation_query.cc

…

mutation_query.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

NOTICE.txt

PowerPC: remove ppc stuff

2025-07-08 10:38:23 +03:00

ORIGIN

…

partition_builder.hh

mutation: async_utils: add unfreeze_and_split_gently

2025-09-30 17:15:41 +03:00

partition_range_compat.hh

treewide: Move misc files to utils directory

2025-07-21 11:56:40 +03:00

partition_slice_builder.cc

tree: Remove unused boost headers

2025-02-25 10:32:32 +03:00

partition_slice_builder.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

partition_snapshot_reader.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

query_ranges_to_vnodes.cc

interval: rename start_ref() back to start() (and end_ref() etc).

2025-06-14 21:26:16 +03:00

query_ranges_to_vnodes.hh

…

reader_concurrency_semaphore_group.cc

…

reader_concurrency_semaphore_group.hh

tree: Remove unused boost headers

2025-02-15 20:32:22 +02:00

reader_concurrency_semaphore.cc

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

reader_concurrency_semaphore.hh

reader_concurrency_semaphore: use named gate

2025-04-12 11:28:48 +03:00

reader_permit.hh

reader_permit: mark check_abort() as const

2025-02-07 01:32:35 -05:00

README.md

docs: fix typos and spelling errors

2025-09-30 13:16:49 +02:00

real_dirty_memory_accounter.hh

moved cache files to db

2025-02-04 12:21:31 +03:00

release.cc

release: adjust doc_link() for the post source-available world

2025-09-29 17:02:55 +03:00

release.hh

…

reversibly_mergeable.hh

…

schema_upgrader.hh

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

scylla_post_install.sh

…

scylla-gdb.py

gdb: simplify and future-proof looking up coroutine frame type

2025-10-20 12:38:53 +03:00

SCYLLA-VERSION-GEN

Update ScyllaDB version to: 2026.1.0-dev

2025-09-30 18:54:09 +03:00

seastarx.hh

…

serialization_visitors.hh

…

serializer_impl.hh

serializer_impl.hh: add as_input_stream(managed_bytes_view) overload

2025-05-13 10:32:32 +02:00

serializer.cc

…

serializer.hh

treewide: include boost headers as "system" headers

2025-08-22 17:21:24 +03:00

service_permit.hh

…

shell.nix

…

sstable_dict_autotrainer.cc

storage_service: hold group0 gate in publish_new_sstable_dict

2025-07-28 12:42:37 +02:00

sstable_dict_autotrainer.hh

dict_autotrainer: introduce sstable_dict_autotrainer

2025-04-01 00:07:30 +02:00

sstables_loader.cc

Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky

2025-12-09 10:53:57 +03:00

sstables_loader.hh

streaming: refactor get_sstables_for_tablets to make it accessible

2025-12-08 12:30:23 +02:00

stdafx.cc

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

stdafx.hh

code: Stop using seastar::compat::source_location

2025-11-27 19:10:11 +02:00

supervisor.hh

…

table_helper.cc

schema: Allow configuring consistency setting for a keyspace

2025-10-16 13:34:49 +03:00

table_helper.hh

…

test.py

test.py: set worksteal distribution

2025-11-30 18:13:03 +02:00

timeout_config.cc

…

timeout_config.hh

…

tombstone_gc_extension.hh

schema: deprecate schema_extension

2025-03-19 20:36:16 +02:00

tombstone_gc_options.cc

…

tombstone_gc_options.hh

…

tombstone_gc-internals.hh

treewide: Add missing #pragma once

2025-09-01 14:58:21 +03:00

tombstone_gc.cc

tombstone_gc: don't use 'repair' mode for colocated tables

2025-11-25 09:15:46 +01:00

tombstone_gc.hh

tombstone_gc: don't use 'repair' mode for colocated tables

2025-11-25 09:15:46 +01:00

ubsan-suppressions.supp

…

unimplemented.cc

…

unimplemented.hh

…

validation.cc

treewide: Move keys related files to a new keys directory

2025-07-25 10:45:32 +03:00

validation.hh

…

version.hh

…

view_info.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

vint-serialization.cc

treewide: Replace __builtin_expect with (un)likely

2025-07-03 13:34:04 +03:00

vint-serialization.hh

…

README.md

Scylla

What is Scylla?

Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.

For more information, please see the ScyllaDB web site.

Build Prerequisites

Scylla is fairly fussy about its build environment, requiring a very recent C++23 compiler and very recent versions of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).

Building Scylla

Building Scylla with the frozen toolchain dbuild is as easy as:

$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla

For further information, please see:

Developer documentation for more information on building Scylla.
Build documentation on how to build Scylla binaries, tests, and packages.
Docker image build documentation for information on how to build Docker images.

Running Scylla

To start the Scylla server, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1

This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory. The --developer-mode option is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations). Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.

For more run options, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --help

Testing

See test.py manual.

Scylla APIs and compatibility

By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.

Documentation

Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.

Training

Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.

Contributing to Scylla

If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.

If you are a developer working on Scylla, please read the developer guidelines.

Contact

The community forum and Slack channel are for users to discuss configuration, management, and operations of ScyllaDB.
The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.

Languages

C++ 72.7%

Python 26.1%

CMake 0.3%

GAP 0.3%

Shell 0.3%