Avi Kivity 4d9271df98 Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.

This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.
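
As a rough illustration of the component swap described above, the following Python sketch models only the index components of each version. The dictionary and the helper name are hypothetical and simply mirror the description; they are not Scylla's data structures, and all non-index components are omitted.

```python
# Hypothetical model of the *index* components only; not Scylla's actual code.
INDEX_COMPONENTS = {
    "me": {"Summary.db", "Index.db"},
    "ms": {"Partitions.db", "Rows.db"},
}

def index_components(version: str) -> set[str]:
    """Return the index component file names for the given sstable version."""
    return INDEX_COMPONENTS[version]

# Components that disappear and appear when going from `me` to `ms`.
replaced = index_components("me") - index_components("ms")
added = index_components("ms") - index_components("me")
print(sorted(replaced))  # ['Index.db', 'Summary.db']
print(sorted(added))     # ['Partitions.db', 'Rows.db']
```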

Eventually we want to implement `da` itself, but `me` and `da` differ in several other ways that are unrelated to the index files. Adding `ms` as an intermediate step lets us adopt the new index formats without dragging all the other changes into the mix and raising the risk of regressions, which is already high.

The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fallback to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.
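
The silent fallback in step 8 can be pictured with a minimal Python sketch. The function name and its parameters are hypothetical and only restate the rule given above (a configured `ms` is treated as `me` until the `MS_SSTABLE_FORMAT` cluster feature is enabled); this is not the code from the series.

```python
def effective_sstable_format(configured: str, ms_feature_enabled: bool) -> str:
    """Pick the version to write with, given the config value and feature state.

    Hypothetical helper: it only models the fallback rule described in step 8.
    """
    if configured == "ms" and not ms_feature_enabled:
        # Silent fallback: the cluster has not enabled MS_SSTABLE_FORMAT yet.
        return "me"
    return configured

print(effective_sstable_format("ms", ms_feature_enabled=False))  # me
print(effective_sstable_format("ms", ms_feature_enabled=True))   # ms
print(effective_sstable_format("me", ms_feature_enabled=True))   # me
```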

This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.

Per a review request, here is an example from `perf_fast_forward` that demonstrates some of the motivation for a new format. (It is not the main motivation, though. The main ones are removing restrictions on the RAM:disk ratio and improving index read throughput for datasets with tiny partitions.) The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.
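
To see why the branching factor matters, here is a back-of-the-envelope Python sketch that counts how many narrowing steps are needed to locate one block among ~500. It is an idealized model and does not try to reproduce the measured ~11.2 and ~2.3 page reads. The branching factor of 64 is an arbitrary value chosen for illustration, not the actual fan-out of the BTI format.

```python
import math

def search_steps(num_blocks: int, branching_factor: int) -> int:
    """Count the steps needed to narrow num_blocks candidates down to one."""
    steps = 0
    remaining = num_blocks
    while remaining > 1:
        # Each step keeps one of `branching_factor` equally sized groups.
        remaining = math.ceil(remaining / branching_factor)
        steps += 1
    return steps

print(search_steps(500, 2))   # 9  (binary search)
print(search_steps(500, 64))  # 2  (assumed wider fan-out)
```

In this model, each step of a binary search over on-disk index blocks is a dependent read, so the depth of the search is a lower bound on the number of sequential page reads.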

`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                70       18.0     18        128
500001  1       1               647       19.0     19        132
0       1000000 1               748       15.0     15        116
0       500000  2               372       29.0     29        284
0       250000  4               227       56.0     56        504
0       125000  8               116      106.0    106        928
0       62500   16               67      195.0    195       1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                51        5.1      5         20
500001  1       1                64        5.3      5         20
0       1000000 1               679        4.0      4         16
0       500000  2               492        8.0      8         88
0       250000  4               804       16.0     16        232
0       125000  8               409       31.0     31        516
0       62500   16               97       54.0     54       1056
```
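
For a quick comparison of the two runs, the following Python sketch divides the `me` figures by the `ms` figures row by row. The values are copied from the `aio` and `(KiB)` columns of the tables above as they are labeled there; the variable names are just for this example.

```python
# Values copied from the `aio` and `(KiB)` columns of the two tables above.
me_aio = [18, 19, 15, 29, 56, 106, 195]
ms_aio = [5, 5, 4, 8, 16, 31, 54]
me_kib = [128, 132, 116, 284, 504, 928, 1732]
ms_kib = [20, 20, 16, 88, 232, 516, 1056]

# Ratio > 1 means `ms` needed fewer I/O operations (or fewer KiB) than `me`.
aio_ratio = [a / b for a, b in zip(me_aio, ms_aio)]
kib_ratio = [a / b for a, b in zip(me_kib, ms_kib)]

for row, (r_aio, r_kib) in enumerate(zip(aio_ratio, kib_ratio)):
    print(f"row {row}: aio x{r_aio:.2f}, KiB x{r_kib:.2f}")
```

Across these seven rows, `ms` issues roughly 3.4 to 3.8 times fewer aio operations, and reads between about 1.6 and 7.25 times fewer KiB, than `me`.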

Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
Small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.
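
The relative savings implied by these sizes can be worked out with a short Python sketch. It assumes decimal units (1 MB = 1000 kB), which the text does not state explicitly, and the helper name is made up for this example.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (1.0 - after / before)

# Sizes in kB, assuming 1 MB = 1000 kB.
large = reduction_percent(2400, 732)    # large partition table: 2.4 MB -> 732 kB
small = reduction_percent(11000, 8400)  # small partitions table: 11 MB -> 8.4 MB
print(f"large: {large:.1f}%, small: {small:.1f}%")  # large: 69.5%, small: 23.6%
```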

External tests:
I ran the SCT test `longevity-mv-si-4days-streaming-test` on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.

New functionality, no backport needed.

Closes scylladb/scylladb#26215

* github.com:scylladb/scylladb:
  test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
  test/cluster: add test_bti_index.py
  test: prepare bypass_cache_test.py for `ms` sstables
  sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
  test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
  tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
  db/config: expose "ms" format to the users via database config
  test: in Python tests, prepare some sstable filename regexes for `ms`
  sstables: add `ms` to `all_sstable_versions`
  test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
  test/lib/index_reader_assertions: skip some row index checks for BTI indexes
  test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
  test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
  test/resource: add `ms` sample sstable files for relevant tests
  test/boost/sstable_compaction_test: prepare for `ms` sstables.
  test/boost/index_reader_test: prepare for `ms` sstables
  test/boost/bloom_filter_tests: prepare for `ms` sstables
  test/boost/sstable_datafile_test: prepare for `ms` sstables
  test/boost/sstable_test: prepare for `ms` sstables.
  sstables: introduce `ms` sstable format version
  tools/scylla-sstable: default to "preferred" sstable version, not "highest"
  sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
  sstables/trie/bti_index_reader: allow the caller to pass a precalculated murmur hash
  sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
  sstables/mx: make Index and Summary components optional
  sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
  sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
  sstables: implement an alternative way to rebuild bloom filters for sstables without Index
  utils/bloom_filter: add `add(const hashed_key&)`
  sstables: adapt estimated_keys_for_range to sstables without Summary
  sstables: make `sstable::estimated_keys_for_range` asynchronous
  sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
  replica/database: add table::estimated_partitions_in_range()
  sstables/mx: implement sstable::has_partition_key using a regular read
  sstables: use BTI index for queries, when present and enabled
  sstables/mx/writer: populate BTI index files
  sstables: create and open BTI index files, when enabled
  sstables: introduce Partition and Rows component types
  sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`

2025-09-30 09:40:02 +03:00

.github

Update urgent_issue_reminder.yml - run daily

2025-09-25 11:05:51 +03:00

abseil @ d7aaad83b4

…

alternator

treewide: Move transport related files to a transport directory As requested in #22112 , moved the files and fixed other includes and build system.

2025-09-29 11:46:06 +03:00

api

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

audit

Revert "build: add precompiled headers to CMakeLists.txt"

2025-09-03 09:46:00 +03:00

auth

auth: mark some auth-v1 functions as legacy

2025-09-26 14:40:53 +02:00

bin

treewide: improve bash error reporting

2025-02-10 18:28:52 +03:00

cdc

compaction: move code to namespace compaction

2025-09-25 15:03:56 +03:00

cmake

cmake: introduce Scylla_WITH_DEBUG_INFO option

2025-09-25 12:49:36 +03:00

compaction

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

conf

vector_store_client: Rename vector_store_uri to vector_store_primary_uri

2025-09-21 16:33:10 +03:00

cql3

cmake: link vector_search to test-lib instead of cql3

2025-09-29 17:46:58 +03:00

data_dictionary

Revert "build: add precompiled headers to CMakeLists.txt"

2025-09-03 09:46:00 +03:00

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

debug

…

dht

code: Replace distributed<> with sharded<>

2025-09-19 12:22:51 +02:00

dist

packaging: Add adduser as dependnacy

2025-09-10 21:51:25 +03:00

docs

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

ent

sstables: implement an alternative way to rebuild bloom filters for sstables without Index

2025-09-29 13:01:21 +02:00

exceptions

exceptions.hh: fix message argument passing

2025-08-13 13:39:52 +02:00

gms

db/config: expose "ms" format to the users via database config

2025-09-29 22:15:25 +02:00

idl

raft: refactor can_vote logic and type

2025-09-24 13:55:05 +02:00

index

treewide: Move type related files to a type directory As requested in #22110 , moved the files and fixed other includes and build system.

2025-09-17 17:32:19 +03:00

keys

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

lang

treewide: Move type related files to a type directory As requested in #22110 , moved the files and fixed other includes and build system.

2025-09-17 17:32:19 +03:00

licenses

…

locator

Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev

2025-09-30 07:15:16 +02:00

message

code: Replace distributed<> with sharded<>

2025-09-19 12:22:51 +02:00

mutation

compaction: move code to namespace compaction

2025-09-25 15:03:56 +03:00

mutation_writer

replica: Fix split compaction when tablet boundaries change

2025-09-07 05:20:23 -03:00

node_ops

storage_service: change node_ops_info::ignore_nodes to host id

2025-09-15 10:18:24 +02:00

pgo

Update pgo profiles - aarch64

2025-09-15 05:17:07 +03:00

query

code: Replace distributed<> with sharded<>

2025-09-19 12:22:51 +02:00

raft

raft: refactor can_vote logic and type

2025-09-24 13:55:05 +02:00

readers

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

reloc

treewide: improve bash error reporting

2025-02-10 18:28:52 +03:00

repair

sstables: make sstable::estimated_keys_for_range asynchronous

2025-09-29 13:01:21 +02:00

replica

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

rust

Revert "build: add precompiled headers to CMakeLists.txt"

2025-09-03 09:46:00 +03:00

schema

compaction: move code to namespace compaction

2025-09-25 15:03:56 +03:00

scripts

docs: expose alternator metrics

2025-08-22 09:49:52 +03:00

seastar @ c8a3515f9b

Update seastar submodule

2025-09-23 18:20:45 +03:00

service

Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev

2025-09-30 07:15:16 +02:00

sstables

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

streaming

sstables: make sstable::estimated_keys_for_range asynchronous

2025-09-29 13:01:21 +02:00

swagger-ui @ 12f1da1082

…

tasks

Merge 'compaction: handle exception in expected_total_workload' from Aleksandra Martyniuk

2025-09-17 15:10:19 +03:00

test

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

tools

tools/scylla-sstable: add --sstable-version=? to scylla sstable write

2025-09-29 22:15:25 +02:00

tracing

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

transport

treewide: Move transport related files to a transport directory As requested in #22112 , moved the files and fixed other includes and build system.

2025-09-29 11:46:06 +03:00

types

raft: refactor can_vote logic and type

2025-09-24 13:55:05 +02:00

unified

treewide: improve bash error reporting

2025-02-10 18:28:52 +03:00

utils

Merge 'sstables: introduce sstable version ms' from Michał Chojnowski

2025-09-30 09:40:02 +03:00

vector_search

metrics, vector_search: add a dns refresh metric

2025-09-29 12:28:52 +02:00

.clang-format

…

.dockerignore

…

.gitattributes

…

.gitignore

.gitignore: add rust target

2025-08-19 13:09:18 +03:00

.gitmodules

build: replace tools/java submodule with packaged cassandra-stress

2025-04-15 10:11:28 +03:00

.gitorderfile

…

.mailmap

…

absl-flat_hash_map.cc

…

absl-flat_hash_map.hh

…

amplify.yml

…

backlog_controller.hh

…

build_mode.hh

…

bytes_fwd.hh

…

bytes_ostream.hh

treewide: Replace __builtin_expect with (un)likely

2025-07-03 13:34:04 +03:00

bytes.cc

…

bytes.hh

bytes: adapt fmt_hex to std::span<const std::byte>

2025-04-01 00:07:27 +02:00

cartesian_product.hh

…

client_data.cc

…

client_data.hh

…

clocks-impl.cc

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

clocks-impl.hh

…

CMakeLists.txt

treewide: Move transport related files to a transport directory As requested in #22112 , moved the files and fixed other includes and build system.

2025-09-29 11:46:06 +03:00

configure.py

treewide: Move transport related files to a transport directory As requested in #22112 , moved the files and fixed other includes and build system.

2025-09-29 11:46:06 +03:00

CONTRIBUTING.md

Fix typos

2025-02-11 00:17:43 +02:00

coverage_excludes.txt

…

coverage_sources.list

…

db_clock.hh

…

debug.cc

gdb: protect debug::the_database from lto

2025-01-23 22:26:04 +02:00

debug.hh

gdb: protect debug::the_database from lto

2025-01-23 22:26:04 +02:00

default.nix

…

Doxyfile

…

encoding_stats.hh

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

enum_set.hh

…

fix_system_distributed_tables.py

…

flake.lock

…

flake.nix

…

gc_clock.hh

…

gdbinit

…

gen_segmented_compress_params.py

compress: move compress.cc/hh to sstables/compressor

2025-07-31 13:10:41 +03:00

HACKING.md

build: replace tools/java submodule with packaged cassandra-stress

2025-04-15 10:11:28 +03:00

hashing_partition_visitor.hh

…

idl-compiler.py

idl-compiler.py: generate skip() definition for enums serializers

2025-06-24 11:05:31 +03:00

inet_address_vectors.hh

storage_proxy: handle node_local_only in mutate

2025-07-24 19:48:08 +02:00

init.cc

db/config: expose "ms" format to the users via database config

2025-09-29 22:15:25 +02:00

init.hh

code: Replace distributed<> with sharded<>

2025-09-19 12:22:51 +02:00

install-dependencies.sh

tools: toolchain: add e2fsprogs, fuse3 to the dependencies

2025-09-23 18:49:37 +03:00

install.sh

install.sh: simplify check_usermode_support()

2025-02-24 11:29:30 +03:00

LICENSE-ScyllaDB-Source-Available.md

Fix typos

2025-02-13 01:54:08 +02:00

main.cc

Merge 'db/config: Add SSTable compression options for user tables' from Nikos Dragazis

2025-09-28 20:23:23 +03:00

marshal_exception.hh

…

mutation_query.cc

…

mutation_query.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

NOTICE.txt

PowerPC: remove ppc stuff

2025-07-08 10:38:23 +03:00

ORIGIN

…

partition_builder.hh

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

partition_range_compat.hh

treewide: Move misc files to utils directory

2025-07-21 11:56:40 +03:00

partition_slice_builder.cc

tree: Remove unused boost headers

2025-02-25 10:32:32 +03:00

partition_slice_builder.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

partition_snapshot_reader.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

querier.cc

interval: rename start_ref() back to start() (and end_ref() etc).

2025-06-14 21:26:16 +03:00

querier.hh

treewide: Move keys related files to a new keys directory

2025-07-25 10:45:32 +03:00

query_ranges_to_vnodes.cc

interval: rename start_ref() back to start() (and end_ref() etc).

2025-06-14 21:26:16 +03:00

query_ranges_to_vnodes.hh

…

reader_concurrency_semaphore_group.cc

…

reader_concurrency_semaphore_group.hh

tree: Remove unused boost headers

2025-02-15 20:32:22 +02:00

reader_concurrency_semaphore.cc

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

reader_concurrency_semaphore.hh

reader_concurrency_semaphore: use named gate

2025-04-12 11:28:48 +03:00

reader_permit.hh

reader_permit: mark check_abort() as const

2025-02-07 01:32:35 -05:00

README.md

README: adjust to reflect license change

2025-01-30 10:28:32 +03:00

real_dirty_memory_accounter.hh

moved cache files to db

2025-02-04 12:21:31 +03:00

release.cc

…

release.hh

…

reversibly_mergeable.hh

…

schema_upgrader.hh

treewide: Move mutation related files to a mutation directory

2025-09-24 13:23:38 +03:00

scylla_post_install.sh

…

scylla-gdb.py

sstables: introduce ms sstable format version

2025-09-29 22:15:24 +02:00

SCYLLA-VERSION-GEN

Update ScyllaDB version to: 2025.4.0-dev

2025-07-01 11:33:20 +03:00

seastarx.hh

…

serialization_visitors.hh

…

serializer_impl.hh

serializer_impl.hh: add as_input_stream(managed_bytes_view) overload

2025-05-13 10:32:32 +02:00

serializer.cc

…

serializer.hh

treewide: include boost headers as "system" headers

2025-08-22 17:21:24 +03:00

service_permit.hh

…

shell.nix

…

sstable_dict_autotrainer.cc

storage_service: hold group0 gate in publish_new_sstable_dict

2025-07-28 12:42:37 +02:00

sstable_dict_autotrainer.hh

dict_autotrainer: introduce sstable_dict_autotrainer

2025-04-01 00:07:30 +02:00

sstables_loader.cc

sstables: make sstable::estimated_keys_for_range asynchronous

2025-09-29 13:01:21 +02:00

sstables_loader.hh

db/view/view_building_worker: register staging sstable to view building coordinator when needed

2025-08-27 10:23:03 +02:00

supervisor.hh

…

table_helper.cc

everywhere: use utils::chunked_vector for list of mutations

2025-07-13 19:13:11 +03:00

table_helper.hh

audit: Add the audit subsystem

2025-01-15 11:10:35 +01:00

test.py

vector_store_client_test: Relocate to a dedicated directory

2025-09-25 14:04:28 +02:00

timeout_config.cc

…

timeout_config.hh

…

tombstone_gc_extension.hh

schema: deprecate schema_extension

2025-03-19 20:36:16 +02:00

tombstone_gc_options.cc

…

tombstone_gc_options.hh

…

tombstone_gc-internals.hh

treewide: Add missing #pragma once

2025-09-01 14:58:21 +03:00

tombstone_gc.cc

tombstone_gc: Add overload of get_default_tombstone_gc_mode

2025-08-27 13:00:10 +02:00

tombstone_gc.hh

tombstone_gc: Add overload of get_default_tombstone_gc_mode

2025-08-27 13:00:10 +02:00

ubsan-suppressions.supp

…

unimplemented.cc

…

unimplemented.hh

…

validation.cc

treewide: Move keys related files to a new keys directory

2025-07-25 10:45:32 +03:00

validation.hh

…

version.hh

…

view_info.hh

treewide: Move query related files to a new query directory

2025-09-16 23:40:47 +03:00

vint-serialization.cc

treewide: Replace __builtin_expect with (un)likely

2025-07-03 13:34:04 +03:00

vint-serialization.hh

…

README.md

Scylla

What is Scylla?

Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.

For more information, please see the ScyllaDB web site.

Build Prerequisites

Scylla is fairly fussy about its build environment, requiring a very recent C++23 compiler and recent versions of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).

Building Scylla

Building Scylla with the frozen toolchain dbuild is as easy as:

$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla

For further information, please see:

Developer documentation for more information on building Scylla.
Build documentation on how to build Scylla binaries, tests, and packages.
Docker image build documentation for information on how to build Docker images.

Running Scylla

To start the Scylla server, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1

This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory. The --developer-mode option is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations). Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.

For more run options, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --help

Testing

See test.py manual.

Scylla APIs and compatibility

By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.

Documentation

Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.

Training

Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.

Contributing to Scylla

If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.

If you are a developer working on Scylla, please read the developer guidelines.

Contact

The community forum and Slack channel are for users to discuss configuration, management, and operations of ScyllaDB.
The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.

Languages

C++ 72.7%

Python 26.1%

CMake 0.3%

GAP 0.3%

Shell 0.3%