Avi Kivity c8397f0287 Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

The motivation for tablet resizing is that we want to keep the average tablet size reasonable, so that load rebalancing remains efficient. A tablet that is too large makes migration inefficient, which slows down the balancer.

If the average size grows beyond the upper bound (the split threshold), the balancer decides to split. A split spans all tablets of a table, due to the power-of-two constraint on the tablet count.

Likewise, if the average size decreases below the lower bound (the merge threshold), a merge takes place in order to grow the average size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on.

A resize decision can be revoked if the average size changes and the decision is no longer needed. For example, suppose a table is being split and its average size drops below the target size (which is 50% of the split threshold and 100% of the merge threshold). After the split, the average size would fall below the merge threshold and trigger a merge right away, which is wasteful, so it is better to cancel the split.
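The threshold and revocation rules above can be sketched as follows. This is an illustrative model only, not the balancer's code: `next_decision` and its parameters are hypothetical names, merge is left out because it is not implemented by this series, and keeping a split in progress while the average stays at or above the target size is an assumption based on the example above.

```python
def next_decision(avg_tablet_size: int, target_size: int, current: str) -> str:
    """Return "split" or "none" for a table (hypothetical helper).

    Uses the relation stated above: the target size is 50% of the
    split threshold, so split_threshold = 2 * target_size.
    """
    split_threshold = 2 * target_size

    if current == "split":
        # Revoke the split if finishing it would leave the average below
        # the merge threshold (which equals the target size).
        return "none" if avg_tablet_size < target_size else "split"

    # No decision in progress: split only once the upper bound is exceeded.
    return "split" if avg_tablet_size > split_threshold else "none"
```

For instance, with a target size of 5 units, an average of 11 starts a split, and the split is revoked if the average later falls to 4.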

Tablet metadata gains two new fields for managing this:
resize_type: the resize decision type, one of "merge", "split", or "none".
resize_seq_number: a sequence number that serves as the global identifier of the decision (monotonically increasing, incremented by 1 on every new decision emitted by the coordinator).
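A minimal sketch of how the two fields could evolve together is shown below. The `ResizeDecision` class and `emit_decision` helper are hypothetical stand-ins written for illustration; treating a revocation (back to "none") as a new decision that also bumps the sequence number is an assumption.

```python
from dataclasses import dataclass

VALID_TYPES = ("none", "split", "merge")


@dataclass(frozen=True)
class ResizeDecision:
    # Field names follow the commit message; the class itself is hypothetical.
    resize_type: str = "none"
    resize_seq_number: int = 0


def emit_decision(previous: ResizeDecision, new_type: str) -> ResizeDecision:
    """Emit a new decision, increasing the sequence number by exactly 1."""
    if new_type not in VALID_TYPES:
        raise ValueError(f"unknown resize type: {new_type!r}")
    return ResizeDecision(new_type, previous.resize_seq_number + 1)
```

Because the number only grows, a split that is revoked and later emitted again carries a different resize_seq_number than the original one, which is what allows stale decisions to be told apart from the latest.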

A new RPC was implemented to pull stats from each table replica, so that the load balancer can calculate the average tablet size and learn the "split status" of a given table. The average size is aggregated carefully, taking the RF of each DC into account (RFs may differ between DCs).
When a table is done splitting its storage, it loads (mirrors) the resize_seq_number from tablet metadata into its local state (in other words, "my split status is ready"). If a table is split ready, the coordinator will see that the table's seq number is the same as the one in tablet metadata. This helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). The split status is also aggregated carefully, by taking the minimum among all replicas, so the coordinator will only update topology when all replicas are ready.
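The two aggregations described above can be sketched like this. The function names and inputs are hypothetical, and dividing the total replica bytes by the total RF is only one plausible way to account for per-DC RF (it assumes each DC holds RF full copies of the table); the series may normalize differently.

```python
from typing import Mapping, Sequence


def average_tablet_size(bytes_per_dc: Mapping[str, int],
                        rf_per_dc: Mapping[str, int],
                        tablet_count: int) -> float:
    """Estimate the average size of one tablet copy (assumed formula)."""
    total_bytes = sum(bytes_per_dc.values())
    # Each DC stores rf_per_dc[dc] copies, so the total RF is the number
    # of copies of the table across the whole cluster.
    total_rf = sum(rf_per_dc[dc] for dc in bytes_per_dc)
    return total_bytes / total_rf / tablet_count


def split_ready(replica_seq_numbers: Sequence[int],
                metadata_seq_number: int) -> bool:
    """True only if every replica has mirrored the current decision."""
    if not replica_seq_numbers:
        return False
    # Taking the minimum means a single lagging replica blocks finalization.
    return min(replica_seq_numbers) == metadata_seq_number
```

With 300 bytes reported in a DC with RF 3, 100 bytes in a DC with RF 1, and 4 tablets, the estimate is 400 / 4 / 4 = 25 bytes per tablet.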

When the load balancer emits a split decision, replicas detect the need to split through a "split monitor", which is awakened once a table has its replication metadata updated and the update indicates a split (i.e. the resize_type field is "split").
The split monitor then starts splitting the compaction groups of the table (using the mechanism introduced in 081f30d149). Once the splitting work is completed, the table updates its local state to record that the split is complete.
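The replica-side flow can be modeled roughly as below. This is a synchronous toy model under stated assumptions: `TableReplica`, `on_metadata_update` and `split_done` are hypothetical names, the real monitor is asynchronous, and flipping a flag stands in for the actual split compaction work.

```python
from typing import Iterable, Optional


class TableReplica:
    """Toy model of one table replica reacting to metadata updates."""

    def __init__(self, group_ids: Iterable[int]) -> None:
        # Maps each compaction group to whether its split work is done.
        self.split_done = {gid: False for gid in group_ids}
        # None means no split decision has been completed locally yet.
        self.split_ready_seq_number: Optional[int] = None

    def on_metadata_update(self, resize_type: str, resize_seq_number: int) -> bool:
        """Wake-up handler; returns True if this call marked a split ready."""
        if resize_type != "split":
            return False
        if self.split_ready_seq_number == resize_seq_number:
            return False  # already prepared for this exact decision
        for gid in self.split_done:
            self.split_done[gid] = True  # stand-in for split compaction
        # Mirror the sequence number so the coordinator can see readiness.
        self.split_ready_seq_number = resize_seq_number
        return True
```

Marking groups as split is idempotent here, so a decision that is revoked and re-emitted with a new sequence number only needs the mirrored number refreshed.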

When the coordinator pulls the split status of all replicas of a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which means updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update their set of compaction groups accordingly (the groups were already split in the preparation step).
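To illustrate why finalization doubles the tablet count and why the power-of-two constraint makes the remapping simple, here is a hedged sketch. It assumes tokens are unsigned 64-bit integers, that a tablet is selected by the top bits of the token, and that both halves of a split tablet keep the parent's replica set; the function names are hypothetical.

```python
from typing import List, Sequence

TOKEN_BITS = 64  # assumption: tokens are treated as unsigned 64-bit integers


def tablet_id_for_token(token: int, tablet_count: int) -> int:
    """Pick the tablet owning a token from its most significant bits."""
    if tablet_count < 1 or tablet_count & (tablet_count - 1):
        raise ValueError("tablet_count must be a power of two")
    log2_count = tablet_count.bit_length() - 1
    return token >> (TOKEN_BITS - log2_count)


def split_tablet_map(replicas_by_tablet: Sequence[Sequence[str]]) -> List[List[str]]:
    """Split every tablet i into tablets 2*i and 2*i + 1."""
    new_map: List[List[str]] = []
    for replicas in replicas_by_tablet:
        # Both halves inherit the parent's replica set (assumption).
        new_map.append(list(replicas))
        new_map.append(list(replicas))
    return new_map
```

Under these assumptions, a token owned by tablet i before the split is owned by tablet 2*i or 2*i + 1 afterwards, so a replica can remap each pre-split pair of compaction groups to the two new tablets without moving data.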

Fixes #16536.

Closes scylladb/scylladb#16580

* github.com:scylladb/scylladb:
  test/topology_experimental_raft: Add tablet split test
  replica: Bypass reshape on boot with tablets temporarily
  replica: Fix table::compaction_group_for_sstable() for tablet streaming
  test/topology_experimental_raft: Disable load balancer in test fencing
  replica: Remap compaction groups when tablet split is finalized
  service: Split tablet map when split request is finalized
  replica: Update table split status if completed split compaction work
  storage_service: Implement split monitor
  topology_cordinator: Generate updates for resize decisions made by balancer
  load_balancer: Introduce metrics for resize decisions
  db: Make target tablet size a live-updateable config option
  load_balancer: Implement resize decisions
  service: Wire table_resize_plan into migration_plan
  service: Introduce table_resize_plan
  tablet_mutation_builder: Add set_resize_decision()
  topology_coordinator: Wire load stats into load balancer
  storage_service: Allow tablet split and migration to happen concurrently
  topology_coordinator: Periodically retrieve table_load_stats
  locator: Introduce topology::get_datacenter_nodes()
  storage_service: Implement table_load_stats RPC
  replica: Expose table_load_stats in table
  replica: Introduce storage_group::live_disk_space_used()
  locator: Introduce table_load_stats
  tablets: Add resize decision metadata to tablet metadata
  locator: Introduce resize_decision

2024-01-31 13:59:56 +02:00

.github

.git: add more skip words

2024-01-29 14:37:03 +02:00

alternator

Merge 'alternator: enable tablets by default if experimental feature is enabled' from Nadav Har'El

2024-01-29 09:22:13 +02:00

api

build: cmake: include raft.cc in api library

2024-01-31 11:39:41 +02:00

auth

service/maintenance_mode: move maintenance_socket_enabled definition to seperate file

2024-01-25 15:27:53 +01:00

bin

tools: add cqlsh shortcut

2023-07-12 09:36:59 +03:00

cdc

cdc: not include unused headers

2024-01-11 09:13:37 +02:00

cmake

build: cmake: use # for line comment

2024-01-03 15:05:00 +02:00

compaction

compaction/compaction_manager: perform_cleanup(): hold the compaction gate

2024-01-25 14:52:50 +01:00

conf

Merge 'Add maintenance socket' from Mikołaj Grzebieluch

2023-12-20 19:04:40 +02:00

cql3

treewide: fix misspellings in code comments

2024-01-31 09:16:10 +02:00

data_dictionary

keyspace_metadata: Drop vector-of-schemas argument from new_keyspace()

2023-12-26 13:00:44 +03:00

db

Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

2024-01-31 13:59:56 +02:00

debug

…

dht

db: add formatter for dht::decorated_key and repair_sync_boundary

2024-01-29 11:11:41 +02:00

direct_failure_detector

…

dist

scylla_util.py: wait for apt operation on other processes

2023-12-28 19:00:36 +02:00

docs

Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

2024-01-31 13:59:56 +02:00

exceptions

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

gms

Merge 'Add more logging for gossiper::lock_endpoint and storage_service::handle_state_normal' from Kamil Braun

2024-01-12 10:51:21 +02:00

idl

storage_service: Implement table_load_stats RPC

2024-01-25 18:36:08 -03:00

index

Merge 'scylla-sstable: add support for loading schema of views and indexes' from Botond Dénes

2024-01-24 23:36:54 +02:00

interface

Typos: fix typos in comments

2023-12-02 22:37:22 +02:00

lang

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

licenses

…

locator

Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

2024-01-31 13:59:56 +02:00

message

Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

2024-01-31 13:59:56 +02:00

mutation

treewide: fix misspellings in code comments

2024-01-31 09:16:10 +02:00

mutation_writer

mutation_writer: do not include unused headers

2024-01-24 15:20:02 +02:00

node_ops

token_metadata: drop the template

2023-12-12 23:19:54 +04:00

raft

Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun

2024-01-29 15:06:04 +02:00

readers

reader: do not include unused headers

2024-01-29 16:21:42 +02:00

redis

redis: do not include unused headers

2024-01-31 09:17:18 +02:00

reloc

…

repair

treewide: fix misspellings in code comments

2024-01-31 09:16:10 +02:00

replica

Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

2024-01-31 13:59:56 +02:00

rust

rust: update dependencies

2023-12-17 13:20:25 +02:00

schema

schema: provide method to get sharder, iff it is static

2024-01-23 22:20:59 +02:00

scripts

Typos: fix typos in code

2023-12-13 10:45:21 +02:00

seastar @ 85359b2866

Update seastar submodule

2024-01-22 11:29:50 +01:00

service

Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

2024-01-31 13:59:56 +02:00

sstables

treewide: fix misspellings in code comments

2024-01-31 09:16:10 +02:00

streaming

Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec

2024-01-25 14:49:43 +02:00

swagger-ui @ 12f1da1082

…

tasks

tasks: don't keep internal root tasks after they complete

2024-01-09 13:13:54 +01:00

test

Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho

2024-01-31 13:59:56 +02:00

thrift

thrift: remove unused namespace definition

2024-01-30 09:16:47 +02:00

tools

tools/scylla-sstable.cc: use utils::directories to get paths

2024-01-29 13:11:33 +01:00

tracing

tracing: add formatter for tracing::span_id

2024-01-31 13:43:46 +02:00

transport

cql_transport/controler: use utils::directories to get paths of dirs

2024-01-29 13:20:38 +01:00

types

utils: do not include unused headers

2024-01-18 12:50:06 +02:00

unified

Update unified/build_unified.sh

2023-12-05 15:23:38 +02:00

utils

utils: Coroutinize disk_sanity()

2024-01-31 09:20:21 +02:00

.dockerignore

…

.gitattributes

…

.gitignore

docs: download iam csv files

2023-10-02 12:28:56 +03:00

.gitmodules

…

.gitorderfile

…

.mailmap

…

absl-flat_hash_map.cc

…

absl-flat_hash_map.hh

…

amplify.yml

…

backlog_controller.hh

treewide: apply codespell to the comments in source code

2023-12-20 10:25:03 +02:00

build_mode.hh

…

bytes_ostream.hh

…

bytes.cc

…

bytes.hh

bytes.hh: correct spelling of delimiter and delimited

2023-12-18 20:46:21 +02:00

cache_flat_mutation_reader.hh

cache_flat_mutation_reader: fix a broken iterator validity guarantee in ensure_population_lower_bound()

2023-11-16 19:01:18 +01:00

cache_temperature.hh

…

cartesian_product.hh

…

cell_locking.hh

…

checked-file-impl.hh

code: Switch to seastar API level 7

2023-06-06 13:29:16 +03:00

client_data.cc

…

client_data.hh

…

clocks-impl.cc

clocks-impl: format time_point using fmt

2023-11-22 17:44:07 +02:00

clocks-impl.hh

…

clustering_bounds_comparator.hh

…

clustering_interval_set.hh

…

clustering_key_filter.hh

…

clustering_ranges_walker.hh

…

CMakeLists.txt

build: cmake: add "mode_list" target

2023-12-24 12:35:02 +08:00

collection_mutation.cc

…

collection_mutation.hh

…

column_computation.hh

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

combine.hh

…

compound_compat.hh

compound_compat: do not format an sstring with {:d}

2023-07-08 15:13:11 +03:00

compound.hh

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

compress.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

compress.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

concrete_types.hh

make timestamp string format cassandra compatible

2023-07-27 12:01:09 +03:00

configure.py

configure.py: s/-DBOOST_TEST_DYN_LINK/-DBOOST_ALL_DYN_LINK/

2024-01-31 12:21:31 +02:00

CONTRIBUTING.md

…

converting_mutation_partition_applier.cc

…

converting_mutation_partition_applier.hh

…

counters.cc

counters: move fmt::formatter<counter_{shard,cell}_view>::format() to .cc

2023-05-24 09:36:49 +03:00

counters.hh

counters: move fmt::formatter<counter_{shard,cell}_view>::format() to .cc

2023-05-24 09:36:49 +03:00

coverage_excludes.txt

test.py: support code coverage

2024-01-18 11:11:34 +02:00

coverage_sources.list

configure.py support coverage profiles on standrad build modes

2024-01-18 11:11:34 +02:00

cql_serialization_format.hh

…

db_clock.hh

…

debug.cc

…

debug.hh

…

default.nix

…

Doxyfile

…

duration.cc

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

duration.hh

…

encoding_stats.hh

encoding_state: mark helper methods protected

2023-08-29 15:41:13 +03:00

enum_set.hh

…

fix_system_distributed_tables.py

…

flake.lock

…

flake.nix

…

frozen_schema.cc

…

frozen_schema.hh

…

full_position.hh

…

gc_clock.hh

…

gdbinit

…

gen_segmented_compress_params.py

Typos: fix typos in code

2023-12-13 10:45:21 +02:00

generic_server.cc

generic_server: use mutable reference in for_each_gently

2023-11-14 14:25:22 +02:00

generic_server.hh

generic_server: use mutable reference in for_each_gently

2023-11-14 14:25:22 +02:00

HACKING.md

…

hashing_partition_visitor.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

idl-compiler.py

Typos: fix typos in code

2023-12-13 10:45:21 +02:00

inet_address_vectors.hh

abstract_replication_strategy: calculate_natural_endpoints: make it work with both versions of token_metadata

2023-12-12 23:19:53 +04:00

init.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

init.hh

Merge 'Typos: fix typos in code' from Yaniv Kaul

2023-12-06 07:36:41 +02:00

install-dependencies.sh

build: add crypto++ to dependencies

2024-01-11 16:26:20 +02:00

install.sh

install.sh: use a temporary file when packaging scylla.yaml

2024-01-01 21:50:29 +02:00

interval.hh

interval: make default ctor and make_open_ended_both_sides constexpr

2023-11-06 18:39:53 +01:00

keys.cc

keys: Move exploded_clustering_prefix's operator<< to keys.cc

2023-07-19 11:57:27 +03:00

keys.hh

keys: do not use zip_iterator for printing key components

2023-07-01 23:49:02 +03:00

LICENSE.AGPL

…

log.hh

…

main.cc

Merge 'main: refuse startup when tablet resharding is required' from Botond Dénes

2024-01-29 23:39:41 +01:00

map_difference.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

marshal_exception.hh

…

multishard_mutation_query.cc

reader_permit: store schema_ptr instead of raw schema pointer

2024-01-11 08:37:56 +02:00

multishard_mutation_query.hh

treewide: apply codespell to the comments in source code

2023-12-20 10:25:03 +02:00

mutation_query.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

mutation_query.hh

mutation_query: add formatter for reconcilable_result::printer

2023-11-26 20:20:50 +02:00

noexcept_traits.hh

…

NOTICE.txt

…

ORIGIN

…

partition_builder.hh

…

partition_range_compat.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

partition_slice_builder.cc

partition_slice_builder: add set_specific_ranges()

2023-05-08 07:35:39 -04:00

partition_slice_builder.hh

partition_slice_builder: add set_specific_ranges()

2023-05-08 07:35:39 -04:00

partition_snapshot_reader.hh

…

partition_snapshot_row_cursor.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

protocol_server.hh

…

querier.cc

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

querier.hh

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

query_id.hh

…

query_ranges_to_vnodes.cc

everywhere: reduce dependencies on i_partitioner.hh

2023-11-05 20:47:44 +02:00

query_ranges_to_vnodes.hh

everywhere: reduce dependencies on i_partitioner.hh

2023-11-05 20:47:44 +02:00

query_result_merger.hh

…

query-request.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

query-result-reader.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

query-result-set.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

query-result-set.hh

…

query-result-writer.hh

…

query-result.hh

treewide: do not mark return value const if this has no effect

2023-11-17 17:46:19 +08:00

query.cc

treewide: use #include <seastar/...> for seastar headers

2023-06-06 08:36:09 +03:00

range.hh

…

read_context.hh

compact and remove expired rows from cache on read

2023-06-26 15:29:01 +02:00

reader_concurrency_semaphore.cc

reader_concurrency_semaphore.cc: move stringstream content instead of copying it

2024-01-31 09:31:50 +02:00

reader_concurrency_semaphore.hh

reader_permit: store schema_ptr instead of raw schema pointer

2024-01-11 08:37:56 +02:00

reader_permit.hh

reader_permit: store schema_ptr instead of raw schema pointer

2024-01-11 08:37:56 +02:00

README.md

…

real_dirty_memory_accounter.hh

real_dirty_memory_accounter: document what the class is doing

2023-05-23 09:11:31 +03:00

release.cc

…

release.hh

…

reversibly_mergeable.hh

…

row_cache.cc

reader: do not include unused headers

2024-01-29 16:21:42 +02:00

row_cache.hh

Merge 'row_cache: abort on exteral_updater::execute errors' from Benny Halevy

2023-10-31 10:07:01 +02:00

schema_mutations.cc

schema_mutations, migration_manager: Ignore empty partitions in per-table digest

2023-07-03 23:06:55 +02:00

schema_mutations.hh

schema_mutations, migration_manager: Ignore empty partitions in per-table digest

2023-07-03 23:06:55 +02:00

schema_upgrader.hh

…

scylla_post_install.sh

dist: drop legacy control group parameters

2023-12-11 19:38:28 +09:00

scylla-gdb.py

reader_permit: store schema_ptr instead of raw schema pointer

2024-01-11 08:37:56 +02:00

SCYLLA-VERSION-GEN

Typos: fix typos in code

2023-12-13 10:45:21 +02:00

seastarx.hh

…

serialization_visitors.hh

…

serializer_impl.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

serializer.cc

…

serializer.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

service_permit.hh

…

setup.py

…

shell.nix

…

sstables_loader.cc

sstables_loader: load_new_sstables: auto-enable load-and-stream for tablets

2024-01-16 18:43:52 +02:00

sstables_loader.hh

…

supervisor.hh

…

table_helper.cc

keyspace_metadata: Add default value for new_keyspace's durable_writes

2023-12-26 11:47:37 +03:00

table_helper.hh

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

test.py

test.py: add boost_tests() to suite

2024-01-31 13:43:21 +02:00

timeout_config.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

timeout_config.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

timestamp.hh

…

tombstone_gc_extension.hh

…

tombstone_gc_options.cc

…

tombstone_gc_options.hh

…

tombstone_gc.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

tombstone_gc.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

tox.ini

…

ubsan-suppressions.supp

…

unimplemented.cc

unimplemented: add format_as() for unimplemented::cause

2024-01-19 08:38:30 +02:00

unimplemented.hh

./: not include unused headers

2024-01-17 16:30:14 +02:00

validation.cc

…

validation.hh

…

version.hh

…

view_info.hh

everywhere: reduce dependencies on i_partitioner.hh

2023-11-05 20:47:44 +02:00

vint-serialization.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

vint-serialization.hh

Typos: fix typos in code

2023-12-05 15:18:11 +02:00

zstd.cc

./: not include unused headers

2024-01-17 16:30:14 +02:00

README.md

Scylla

What is Scylla?

Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.

For more information, please see the ScyllaDB web site.

Build Prerequisites

Scylla is fairly fussy about its build environment, requiring very recent versions of the C++20 compiler and of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).

Building Scylla

Building Scylla with the frozen toolchain dbuild is as easy as:

$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla

For further information, please see:

Developer documentation for more information on building Scylla.
Build documentation on how to build Scylla binaries, tests, and packages.
Docker image build documentation for information on how to build Docker images.

Running Scylla

To start Scylla server, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1

This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory. The --developer-mode is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations). Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.

For more run options, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --help

Testing

See test.py manual.

Scylla APIs and compatibility

By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and Thrift. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.

Documentation

Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.

Training

Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.

Contributing to Scylla

If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.

If you are a developer working on Scylla, please read the developer guidelines.

Contact

The community forum and Slack channel are for users to discuss configuration, management, and operations of open-source ScyllaDB.
The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.

Languages

C++ 72.7%

Python 26.1%

CMake 0.3%

GAP 0.3%

Shell 0.3%