File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closes scylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes #3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106
Closes scylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes #4265.
Closes scylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes #4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closes scylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254
Closes scylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closes scylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closes scylladb/scylladb#22467
181 lines
8.9 KiB
C++
181 lines
8.9 KiB
C++
/*
|
|
* Copyright (C) 2018-present ScyllaDB
|
|
*/
|
|
|
|
/*
|
|
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
|
|
*/
|
|
|
|
#pragma once
|
|
|
|
#include <seastar/core/sstring.hh>
|
|
#include <seastar/core/future.hh>
|
|
#include <seastar/core/sharded.hh>
|
|
#include <unordered_map>
|
|
#include <functional>
|
|
#include <set>
|
|
#include <unordered_set>
|
|
#include "seastarx.hh"
|
|
#include "db/schema_features.hh"
|
|
#include "gms/feature.hh"
|
|
|
|
namespace db {
|
|
class config;
|
|
class system_keyspace;
|
|
}
|
|
namespace service { class storage_service; }
|
|
|
|
namespace gms {
|
|
|
|
class gossiper;
|
|
class feature_service;
|
|
class i_endpoint_state_change_subscriber;
|
|
|
|
struct feature_config {
|
|
private:
|
|
std::set<sstring> _disabled_features;
|
|
feature_config();
|
|
|
|
friend class feature_service;
|
|
friend feature_config feature_config_from_db_config(const db::config& cfg, std::set<sstring> disabled);
|
|
};
|
|
|
|
feature_config feature_config_from_db_config(const db::config& cfg, std::set<sstring> disabled = {});
|
|
|
|
class unsupported_feature_exception : public std::runtime_error {
|
|
public:
|
|
unsupported_feature_exception(std::string what)
|
|
: runtime_error(std::move(what))
|
|
{}
|
|
};
|
|
|
|
using namespace std::literals;
|
|
|
|
/**
|
|
* A gossip feature tracks whether all the nodes the current one is
|
|
* aware of support the specified feature.
|
|
*
|
|
* A pointer to `cql3::query_processor` can be optionally supplied
|
|
* if the instance needs to persist enabled features in a system table.
|
|
*/
|
|
class feature_service final : public peering_sharded_service<feature_service> {
|
|
void register_feature(feature& f);
|
|
void unregister_feature(feature& f);
|
|
friend class feature;
|
|
std::unordered_map<sstring, std::reference_wrapper<feature>> _registered_features;
|
|
std::unordered_set<sstring> _suppressed_features;
|
|
|
|
feature_config _config;
|
|
|
|
future<> enable_features_on_startup(db::system_keyspace&);
|
|
#ifdef SCYLLA_ENABLE_ERROR_INJECTION
|
|
void initialize_suppressed_features_set();
|
|
#endif
|
|
public:
|
|
explicit feature_service(feature_config cfg);
|
|
~feature_service() = default;
|
|
future<> stop();
|
|
future<> enable(std::set<std::string_view> list);
|
|
db::schema_features cluster_schema_features() const;
|
|
std::set<std::string_view> supported_feature_set() const;
|
|
|
|
// Key in the 'system.scylla_local' table, that is used to
|
|
// persist enabled features
|
|
static constexpr const char* ENABLED_FEATURES_KEY = "enabled_features";
|
|
|
|
public:
|
|
gms::feature user_defined_functions { *this, "UDF"sv };
|
|
gms::feature alternator_streams { *this, "ALTERNATOR_STREAMS"sv };
|
|
gms::feature alternator_ttl { *this, "ALTERNATOR_TTL"sv };
|
|
gms::feature range_scan_data_variant { *this, "RANGE_SCAN_DATA_VARIANT"sv };
|
|
gms::feature cdc_generations_v2 { *this, "CDC_GENERATIONS_V2"sv };
|
|
gms::feature user_defined_aggregates { *this, "UDA"sv };
|
|
// Historically max_result_size contained only two fields: soft_limit and
|
|
// hard_limit. It was somehow obscure because for normal paged queries both
|
|
// fields were equal and meant page size. For unpaged queries and reversed
|
|
// queries soft_limit was used to warn when the size of the result exceeded
|
|
// the soft_limit and hard_limit was used to throw when the result was
|
|
// bigger than this hard_limit. To clean things up, we introduced the third
|
|
// field into max_result_size. It's name is page_size. Now page_size always
|
|
// means the size of the page while soft and hard limits are just what their
|
|
// names suggest. They are no longer interpreted as page size. This is not
|
|
// a backwards compatible change so this new cluster feature is used to make
|
|
// sure the whole cluster supports the new page_size field and we can safely
|
|
// send it to replicas.
|
|
gms::feature separate_page_size_and_safety_limit { *this, "SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT"sv };
|
|
// Replica is allowed to send back empty pages to coordinator on queries.
|
|
gms::feature empty_replica_pages { *this, "EMPTY_REPLICA_PAGES"sv };
|
|
gms::feature empty_replica_mutation_pages { *this, "EMPTY_REPLICA_MUTATION_PAGES"sv };
|
|
gms::feature supports_raft_cluster_mgmt { *this, "SUPPORTS_RAFT_CLUSTER_MANAGEMENT"sv };
|
|
gms::feature tombstone_gc_options { *this, "TOMBSTONE_GC_OPTIONS"sv };
|
|
gms::feature parallelized_aggregation { *this, "PARALLELIZED_AGGREGATION"sv };
|
|
gms::feature keyspace_storage_options { *this, "KEYSPACE_STORAGE_OPTIONS"sv };
|
|
gms::feature typed_errors_in_read_rpc { *this, "TYPED_ERRORS_IN_READ_RPC"sv };
|
|
gms::feature uda_native_parallelized_aggregation { *this, "UDA_NATIVE_PARALLELIZED_AGGREGATION"sv };
|
|
gms::feature aggregate_storage_options { *this, "AGGREGATE_STORAGE_OPTIONS"sv };
|
|
gms::feature collection_indexing { *this, "COLLECTION_INDEXING"sv };
|
|
gms::feature large_collection_detection { *this, "LARGE_COLLECTION_DETECTION"sv };
|
|
gms::feature range_tombstone_and_dead_rows_detection { *this, "RANGE_TOMBSTONE_AND_DEAD_ROWS_DETECTION"sv };
|
|
gms::feature truncate_as_topology_operation { *this, "TRUNCATE_AS_TOPOLOGY_OPERATION"sv };
|
|
gms::feature secondary_indexes_on_static_columns { *this, "SECONDARY_INDEXES_ON_STATIC_COLUMNS"sv };
|
|
gms::feature tablets { *this, "TABLETS"sv };
|
|
gms::feature uuid_sstable_identifiers { *this, "UUID_SSTABLE_IDENTIFIERS"sv };
|
|
gms::feature table_digest_insensitive_to_expiry { *this, "TABLE_DIGEST_INSENSITIVE_TO_EXPIRY"sv };
|
|
// If this feature is enabled, schema versions are persisted by the group 0 command
|
|
// that modifies schema instead of being calculated as a digest (hash) by each node separately.
|
|
// The feature controls both the 'global' schema version (the one gossiped as application_state::SCHEMA)
|
|
// and the per-table schema versions (schema::version()).
|
|
// The feature affects non-Raft mode as well (e.g. during RECOVERY), where we send additional
|
|
// tombstones and flags to schema tables when performing schema changes, allowing us to
|
|
// revert to the digest method when necessary (if we must perform a schema change during RECOVERY).
|
|
gms::feature group0_schema_versioning { *this, "GROUP0_SCHEMA_VERSIONING"sv };
|
|
gms::feature supports_consistent_topology_changes { *this, "SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES"sv };
|
|
gms::feature host_id_based_hinted_handoff { *this, "HOST_ID_BASED_HINTED_HANDOFF"sv };
|
|
gms::feature topology_requests_type_column { *this, "TOPOLOGY_REQUESTS_TYPE_COLUMN"sv };
|
|
gms::feature native_reverse_queries { *this, "NATIVE_REVERSE_QUERIES"sv };
|
|
gms::feature zero_token_nodes { *this, "ZERO_TOKEN_NODES"sv };
|
|
gms::feature view_build_status_on_group0 { *this, "VIEW_BUILD_STATUS_ON_GROUP0"sv };
|
|
gms::feature views_with_tablets { *this, "VIEWS_WITH_TABLETS"sv };
|
|
|
|
// Whether to allow fragmented commitlog entries. While this is a node-local feature as such, hide
|
|
// behind a feature to ensure an upgrading cluster appears to be at least functional before using,
|
|
// to avoid data loss if rolling back in a dirty state, but also because it changes which/how mutations
|
|
// can be applied to a given node - i.e. with it on, a node can accept larger, say, schema mutations,
|
|
// whereas without it, it will fail the insert - i.e. for things like raft etc _all_ nodes should
|
|
// have it or none, otherwise we can get partial failures on writes.
|
|
gms::feature fragmented_commitlog_entries { *this, "FRAGMENTED_COMMITLOG_ENTRIES"sv };
|
|
gms::feature maintenance_tenant { *this, "MAINTENANCE_TENANT"sv };
|
|
|
|
gms::feature tablet_repair_scheduler { *this, "TABLET_REPAIR_SCHEDULER"sv };
|
|
gms::feature tablet_merge { *this, "TABLET_MERGE"sv };
|
|
|
|
gms::feature tablet_migration_virtual_task { *this, "TABLET_MIGRATION_VIRTUAL_TASK"sv };
|
|
gms::feature tablet_resize_virtual_task { *this, "TABLET_RESIZE_VIRTUAL_TASK"sv };
|
|
|
|
// A feature just for use in tests. It must not be advertised unless
|
|
// the "features_enable_test_feature" injection is enabled.
|
|
// This feature MUST NOT be advertised in release mode!
|
|
gms::feature test_only_feature { *this, "TEST_ONLY_FEATURE"sv };
|
|
gms::feature address_nodes_by_host_ids { *this, "ADDRESS_NODES_BY_HOST_IDS"sv };
|
|
|
|
gms::feature in_memory_tables { *this, "IN_MEMORY_TABLES"sv };
|
|
gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv };
|
|
gms::feature file_stream { *this, "FILE_STREAM"sv };
|
|
gms::feature compression_dicts { *this, "COMPRESSION_DICTS"sv };
|
|
public:
|
|
|
|
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;
|
|
|
|
static std::set<sstring> to_feature_set(sstring features_string);
|
|
future<> enable_features_on_join(gossiper&, db::system_keyspace&, service::storage_service&);
|
|
future<> on_system_tables_loaded(db::system_keyspace& sys_ks);
|
|
|
|
// Performs the feature check.
|
|
// Throws an unsupported_feature_exception if there is a feature either
|
|
// in `enabled_features` or `unsafe_to_disable_features` that is not being
|
|
// currently supported by this node.
|
|
void check_features(const std::set<sstring>& enabled_features, const std::set<sstring>& unsafe_to_disable_features);
|
|
};
|
|
|
|
} // namespace gms
|