Identifies a tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share the same global_tablet_id.
Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md
Just a simplification.
Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid: the first node inserts
its tokens as normal without going through the bootstrap procedure.
when comparing signed and unsigned numbers, the compiler converts
the signed number to a common type -- in this case, the unsigned type --
so they can be compared. usually this is harmless, but when the signed
value is negative, the comparison yields the wrong result after the
conversion. this can be demonstrated using a short sample like:
```
#include <fmt/core.h>

int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    // x is converted to unsigned before the comparison, so this prints "false"
    fmt::print("{}\n", x < y);
    return 0;
}
```
this error can be identified by `-Werror=sign-compare`, but before
enabling this compiler option, let's use `std::cmp_*()` to compare
them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.
Closes #14638
Closes #14723
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decommission command
failed with the error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as a value for the single dc in
_dc_endpoints dictionary. As a result,
we got empty nodes list for single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
_rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.
With this fix the test passes both with
--consistent-cluster-management option and
without it.
The specific unit test for this problem was added.
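
One plausible shape of such a bug, shown only as an illustration: the
types and the names index_sketch, unindex_buggy and unindex_fixed are
hypothetical and simplified, not the actual locator::topology code.
Assigning through a C++ reference does not rebind it, it overwrites
the referenced container:

```cpp
#include <map>
#include <set>
#include <string>

using node_set = std::set<std::string>;

struct index_sketch {
    std::map<std::string, node_set> dc_endpoints;    // dc -> nodes
    std::map<std::string, node_set> rack_endpoints;  // rack -> nodes (single dc)

    // Buggy variant: `eps = ...` copies the rack's node set over
    // dc_endpoints[dc] instead of making eps refer to the rack's set.
    void unindex_buggy(const std::string& dc, const std::string& rack,
                       const std::string& node) {
        auto& eps = dc_endpoints[dc];
        eps.erase(node);
        eps = rack_endpoints[rack];   // overwrites the dc entry
        eps.erase(node);              // still operates on the dc entry
    }

    // Fixed variant: each container is accessed on its own.
    void unindex_fixed(const std::string& dc, const std::string& rack,
                       const std::string& node) {
        dc_endpoints[dc].erase(node);
        rack_endpoints[rack].erase(node);
    }
};
```

With three nodes on one rack and one on another, removing the lone
node through the buggy variant leaves the dc entry empty, which
matches the symptom described above.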
Fixes: #14184
Closes #14673
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.
* seastar 99d28ff057...2b7a341210 (23):
> Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes #8828.
> reactor: use structured binding when appropriate
> Simplify payload length and mask parsing.
> memcached: do not used deprecated API
> build: serialize calls to openssl certificate generation
> reactor: epoll backend: initialize _highres_timer_pending
> shared_ptr: deprecate lw_shared_ptr operator=(T&&)
> tests: fail spawn_test if output is empty
> Support specifying the "build root" in configure
> Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
> build: correct the syntax error in comment
> util: print_safe: fix hex print functions
> Add code examples for handling exceptions
> smp: warn if --memory parameter is not supported
> Merge 'gate: track holders' from Benny Halevy
> file: call lambda with std::invoke()
> deleter: Delete move and copy constructors
> file: fix the indent
> file: call close() without the syscall thread
> reactor: use s/::free()/::io_uring_free_probe()/
> Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
> reactor: Don't re-evaliate local reactor for thread_pool
> Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov
Closes #14602
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.
Takes load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
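
A minimal sketch of the idea, assuming a per-node counter of tablets
per shard. The class shard_load_sketch and its methods are
hypothetical and much simpler than the real load_sketch:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for load_sketch: tracks how many tablet replicas
// are assigned to each shard of a single node. Requires shard_count > 0.
class shard_load_sketch {
    std::vector<std::size_t> _load;   // _load[shard] = tablets on that shard
public:
    explicit shard_load_sketch(std::size_t shard_count) : _load(shard_count, 0) {}

    // Account for a tablet replica that is already allocated on `shard`.
    void add_existing(std::size_t shard) { ++_load.at(shard); }

    // Pick the least-loaded shard (lowest index wins ties) and charge it
    // with one more tablet, since every tablet has an equal load weight.
    std::size_t next_shard() {
        auto it = std::min_element(_load.begin(), _load.end());
        ++*it;
        return static_cast<std::size_t>(it - _load.begin());
    }
};
```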
For tablets, sharding depends on the replication map, so the scope of the
sharder should be effective_replication_map rather than the schema
object.
Existing users will be transitioned incrementally in later patches.
on_internal_error is wrong for fence_version
condition violation, since in case of
topology change coordinator migrating to another
node we can have raft_topology_cmd::command::fence
command from the old coordinator running in
parallel with the fence command (or topology version
upgrading raft command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
It's stored outside of topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata;
when an instance is destroyed, the
corresponding phased_barrier::operation is
released.
It's stored as a static column in the topology table and
will be updated at various steps of the topology
change state machine.
The initial value is 1; zero means that topology
versions are not yet supported. It will be
used in RPC handling.
In get_range_addresses we are iterating
over vnode tokens, so we don't need to do a
binary search for them in tmptr->first_token;
they can be directly used as keys
for _replication_map.
We add the has_pending_ranges function to erm. The
implementation for vnode is similar to that of token_metadata.
For tablets, we add new code that checks if the given endpoint
is contained in tablet_map::_transitions.
The function storage_service::update_pending_ranges is
turned into update_topology_changes_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
We already use the new pending_endpoints from erm through
the get_pending_ranges virtual function. In this commit
we update all the remaining places to use the new
implementation in erm, and remove the old implementation
in token_metadata.
In this commit we introduce functions to erm for accessing
pending_endpoints and read_endpoints similar to the
corresponding functions in token_metadata. The only
difference - we no longer need the keyspace_name map.
The functions get_pending_endpoints and get_endpoints_for_reading
are virtual, since they have different implementations
for vnode and for tablets.
The get_pending_endpoints already existed. For tablets it
remained unchanged, while for vnode we just changed
it from calling on token_metadata to using a local field.
We have also removed ks_name from the signature as it's
no longer needed.
For vnodes, the get_endpoints_for_reading also just
employs the local field. In the case of tablets, we currently
return nullptr as the appropriate implementation remains unclear.
In this commit we add logic to calculate pending_endpoints and
read_endpoints, similar to how it was done in update_pending_ranges.
For situations where 'natural_endpoints_depend_on_token'
is false we short-circuit the calculations, breaking out
of the loop after the first iteration. In this case we add a
single item with key=default_replication_map_key
to the replication_map and set pending_endpoints/read_endpoints
key range to the entire set of possible values.
In the loop we iterate over all_tokens, which contains the union of
all boundary tokens, from the old and from the new topology.
In addition to updating pending_endpoints and read_endpoints in the loop,
we remember the new natural endpoints in the replication_map
if the current token is contained in the current set of boundary tokens.
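
As an illustration of how such a union of boundary tokens could be
built, here is a sketch assuming both token lists are already sorted
and tokens are plain integers. The function all_boundary_tokens is a
hypothetical name, not the one used in the patch:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using token = std::int64_t;   // simplified token type for the sketch

// Both inputs must be sorted and free of duplicates; the result is the
// sorted union, with tokens present in both topologies listed once.
std::vector<token> all_boundary_tokens(const std::vector<token>& old_tokens,
                                       const std::vector<token>& new_tokens) {
    std::vector<token> all;
    all.reserve(old_tokens.size() + new_tokens.size());
    std::set_union(old_tokens.begin(), old_tokens.end(),
                   new_tokens.begin(), new_tokens.end(),
                   std::back_inserter(all));
    return all;
}
```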
We optimise memory usage of replication_map by
storing endpoints list only once in case of
natural_endpoints_depend_on_token() == false. For simplicity,
this list is stored in the same unordered_map with
special key default_replication_map_key.
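
A simplified sketch of this layout. The struct replication_map_sketch,
the string endpoint type and the sentinel value chosen for default_key
are assumptions made for illustration; the real
default_replication_map_key may be defined differently:

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using token = std::int64_t;                     // simplified token type
using endpoint_list = std::vector<std::string>; // simplified endpoint type

// Assumed sentinel key; the real default_replication_map_key may differ.
constexpr token default_key = std::numeric_limits<token>::min();

struct replication_map_sketch {
    bool depends_on_token;
    std::unordered_map<token, endpoint_list> map;

    // Returns the endpoints stored for a boundary token, or the single
    // shared list when natural endpoints do not depend on the token.
    const endpoint_list& get(token t) const {
        return map.at(depends_on_token ? t : default_key);
    }
};
```

With depends_on_token == false the map holds exactly one entry, no
matter how many tokens are queried.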
We inline both get_natural_endpoints and
for_each_natural_endpoint_until from abstract_replication_strategy
into vnode_erm since now the overrides in local and everywhere
strategies are redundant. The default implementation works
for them as empty sorted_tokens() is not a problem, we
store endpoints with a special key.
The function do_get_natural_endpoints was extracted,
since get_natural_endpoints returns by value,
but for for_each_natural_endpoint_until a reference is sufficient.
We want to refactor replication_map so that it doesn't
store multiple copies of the same endpoints vector
in case of natural_endpoints_depend_on_token == false.
To preserve get_range_addresses behaviour
we iterate over tm.sorted_tokens() instead of
_replication_map.
It's possible that the callers of this function
are ok with a single range in case of
natural_endpoints_depend_on_token == false,
but to restrict the scope of the refactoring we
refrain from going in that direction.
We need to account for the new fields in the clone implementation.
The signature future<erm> erm::clone() const; doesn't work because
the call will be made via foreign_ptr on an instance from another
shard, so we need to use local values for replication_strategy
and token_metadata.
Refactor ~vnode_effective_replication_map, use
our new clear_gently overload for rvalue references.
Add new fields _pending_endpoints and _read_endpoints
to the call.
vnode_effective_replication_map::clear_gently is removed as
it was not used.
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains functionally
similar _replication_map and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.
At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data and create an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.
The update_topology_change_info implementation is mostly derived from
update_pending_ranges, there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.
The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.
In this patch we add
token_metadata::set_topology_transition_state method.
If the current state is
write_both_read_new update_pending_ranges
will compute new ranges for read requests. The default value
of topology_transition_state is null, meaning no read
ranges are computed. We will add the appropriate
set_topology_transition_state calls later.
Also, we add endpoints_for_reading method to get
read endpoints based on the computed ranges.
We are going to add a function in token_metadata to get read endpoints,
similar to pending_endpoints_for. So in this commit we extract
the maybe_migration_endpoints helper function, which will be
used in both cases.
We are going to store read_endpoints in a way similar
to pending ranges, so in this commit we add
migration_info - a container for two
boost::icl::interval_map.
Also, _pending_ranges_interval_map is renamed to
_keyspace_to_migration_info, since it captures
the meaning better.
Now update_pending_ranges is quite complex, mainly
because it tries to act efficiently and update only
the affected intervals. However, it uses the function
abstract_replication_strategy::get_ranges, which calls
calculate_natural_endpoints for every token
in the ring anyway.
Our goal is to start reading from the new replicas for
ranges in write_both_read_new state. In the current
code structure this is quite difficult to do, so
in this commit we first simplify update_pending_ranges.
The main idea of the refactoring is to build a new version
of token_metadata based on all planned changes
(join, bootstrap, replace) and then for each token
range compare the result of calculate_natural_endpoints on
the old token_metadata and on the new one.
Those endpoints that are in the new version and
are not in the old version should be added to the pending_ranges.
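
The comparison step for a single token range can be sketched as a set
difference. This assumes endpoints are represented as strings; the
function pending_endpoints is a hypothetical helper, not the code
added by the patch:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

using endpoint_set = std::set<std::string>;   // simplified endpoint type

// Endpoints that replicate the range in the new topology but not in the
// old one are the pending endpoints for that range.
endpoint_set pending_endpoints(const endpoint_set& old_eps,
                               const endpoint_set& new_eps) {
    endpoint_set pending;
    std::set_difference(new_eps.begin(), new_eps.end(),
                        old_eps.begin(), old_eps.end(),
                        std::inserter(pending, pending.end()));
    return pending;
}
```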
The add_mapping function is extracted for the
future - we are going to use it to handle read mappings.
Special care is taken when replacing with the same IP.
The coordinator employs the
get_natural_endpoints_without_node_being_replaced function,
which excludes such endpoints from its result. If we compare
the new (merged) and current token_metadata configurations, such
endpoints will also be absent from pending_endpoints since
they exist in both. To address this, we copy the current
token_metadata and remove these endpoints prior to comparison.
This ensures that nodes being replaced are treated
like those being deleted.
token_metadata takes token_metadata_impl as unique_ptr,
so it makes sense to create it that way in the first place
to avoid unnecessary moves.
token_metadata_impl constructor with shallow_copy parameter
was made public for std::make_unique. The effective
accessibility of this constructor hasn't changed though since
shallow_copy remains private.
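
The pattern can be sketched as follows. The class impl_sketch, its
value member and clone() are hypothetical; the sketch only shows a
public constructor guarded by a private tag type, which is what lets
std::make_unique call it:

```cpp
#include <memory>

class impl_sketch {
    struct shallow_copy {};   // private tag: outside code cannot name this type
public:
    impl_sketch() = default;

    // Public so that std::make_unique can call it; the private tag type
    // in the signature keeps it out of the class's ordinary public API.
    impl_sketch(shallow_copy, const impl_sketch& o) : value(o.value) {}

    // Creates the copy directly as a unique_ptr, avoiding an extra move.
    std::unique_ptr<impl_sketch> clone() const {
        return std::make_unique<impl_sketch>(shallow_copy{}, *this);
    }

    int value = 0;
};
```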
this series silences the warnings from GCC 13. some of these changes are considered as critical fixes, and posted separately.
see also #13243
Closes #13723
* github.com:scylladb/scylladb:
cdc: initialize an optional using its value type
compaction: disambiguate type name
db: schema_tables: drop unused variable
reader_concurrency_semaphore: fix signed/unsigned comparision
locator: topology: disambiguate type names
raft: disambiguate promise name in raft::awaited_conf_changes
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of replication strategy, it
just alters the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
in C++20, the compiler generates operator!=() if the corresponding
operator==() is already defined; the language now understands
that the comparison is symmetric in the new standard.
fortunately, our operator!=() is always equivalent to
`! operator==()`, which matches the behavior of the default
generated operator!=(). so, in this change, all `operator!=`
are removed.
in addition to the defaulted operator!=, C++20 also brings us
the defaulted operator==() -- the compiler is able to generate
the operator==() as the member-wise lexicographical comparison.
under some circumstances, this is exactly what we need. so,
in this change, if the operator==() is also implemented as
a lexicographical comparison of all member variables of the
class/struct in question, it is implemented using the default
generated one by removing its body and marking the function as
`default`. moreover, if the class happens to have other comparison
operators which are implemented using lexicographical comparison,
the default generated `operator<=>` is used in place of
the defaulted `operator==`.
sometimes, we fail to mark the operator== with the `const`
specifier. in this change, to fulfil the need of the C++ standard,
and to be more correct, the `const` specifier is added.
also, to generate the defaulted operator==, the operand should
be `const class_name&`, but it is not always the case: in the
class of `version`, we use `version` as the parameter type. to
fulfill the need of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantics of the comparison operator, and is a more idiomatic
way to pass a non-trivial struct as a function parameter.
please note, because in C++20 both operator== and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric form of another variant. if they were
not removed, the compiler would, for instance, find an ambiguous
overloaded operator '=='.
this change is a cleanup to modernize the code base with C++20
features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13687