In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.
The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.
Closes#14901
* github.com:scylladb/scylladb:
test_fencing: add test_fence_hints
test.py: output the skipped tests
test.py: add skip_mode decorator and fixture
test.py: add mode fixture
hints: add debug log for dropped hints
hints: send_one_hint: extend the scope of file_send_gate holder
pylib: add ScyllaMetrics
hints manager: add send_errors counter
token_metadata: add debug logs
fencing: add simple data plane test
random_tables.py: add counter column type
raft topology: don't increment version when transitioning to node_state::normal
We log the new version when the new token
metadata is set.
Also, the log for fence_version is moved
in shared_token_metadata from storage_service
for uniformity.
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
```
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```
Closes#14863
* github.com:scylladb/scylladb:
Kill scylla option to configure number of compaction groups
replica: Wire tablet into compaction group
token_metadata: Add this_host_id to topology config
replica: Switch to chunked_vector for storing compaction groups
replica: Generate group_id for compaction_group on demand
The motivation is that token_metadata::get_my_id() is not available
early in the bootstrap process, as raft topology is pulled later
than new tables are registered and created, and this node is added
to topology even later.
To allow creation of compaction groups to retrieve "my id" from
token metadata early, initialization will now feed local id
into topology config which is immutable for each node anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The use-after-move is not very harmful as it's only used when
handling exception. So user would be left with a bogus message.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#15054
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.
This was lost unintentionally in f3d5df5448
and this commit restores that behavior, and fixes test_remove_endpoint.
Fixesscylladb/scylladb#14896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14897
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.
The migration track is interrupted by pending topology change operations.
The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way algorithm is implemented.
Overload of shards is limited by the fact that load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.
Closes#14851
* github.com:scylladb/scylladb:
tablets: load_balancer: Remove double logging
tests: tablets: Check that load balancing is interrupted by topology change
tests: tablets: Add test for load balancing with active migrations
tablets: Balance tablets concurrently with active migrations
storage_service, tablets: Extract generate_migration_updates()
storage_service, tablets: Move get_leaving_replica() to tablets.cc
locator: tablets: Move std::hash definition earlier
storage_service: Advance tablets independently
topology_coordinator: Fix missed notification on abort
tablets: Add formatter for tablet_migration_info
Currently the node::state is coarse grained
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) vs. due to remove node (where the
node is not used for reading).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by load balancer as if they were
already completed. This allows the algorithm to incrementally make
decision which when executed with active migrations will produce the
desired result.
Overload of shards is limited by the fact that the algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.
The coordinator executes the load balancer on edges of tablet state
machine stransitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way algorithm is implemented.
Identifies tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share global_tablet_id.
Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md
Just a simplification.
Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid, first node inserts
its tokens as normal without going through bootstrap procedure.
when comparing signed and unsigned numbers, the compiler promotes
the signed number to coomon type -- in this case, the unsigned type,
so they can be compared. but sometimes, it matters. and after the
promotion, the comparison yields the wrong result. this can be
manifested using a short sample like:
```
int main(int argc, char **argv) {
int x = -1;
unsigned y = 2;
fmt::print("{}\n", x < y);
return 0;
}
```
this error can be identified by `-Werror=sign-compare`, but before
enabling this compiling option. let's use `std::cmp_*()` to compare
them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.
Closes#14638Closes#14723
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to a
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decomission command
with error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as a value for the single dc in
_dc_endpoints dictionary. As a result,
we got empty nodes list for single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
_rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.
With this fix the test passes both with
--consistent-cluster-management option and
without it.
The specific unit test for this problem was added.
Fixes: #14184Closes#14673
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.
* seastar 99d28ff057...2b7a341210 (23):
> Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes#8828.
> reactor: use structured binding when appropriate
> Simplify payload length and mask parsing.
> memcached: do not used deprecated API
> build: serialize calls to openssl certificate generation
> reactor: epoll backend: initialize _highres_timer_pending
> shared_ptr: deprecate lw_shared_ptr operator=(T&&)
> tests: fail spawn_test if output is empty
> Support specifying the "build root" in configure
> Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
> build: correct the syntax error in comment
> util: print_safe: fix hex print functions
> Add code examples for handling exceptions
> smp: warn if --memory parameter is not supported
> Merge 'gate: track holders' from Benny Halevy
> file: call lambda with std::invoke()
> deleter: Delete move and copy constructors
> file: fix the indent
> file: call close() without the syscall thread
> reactor: use s/::free()/::io_uring_free_probe()/
> Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
> reactor: Don't re-evaliate local reactor for thread_pool
> Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov
Closes#14602
Uses a simple algorihtm for allocating shards which chooses
least-loaded shard on a given node, encapsulated in load_sketch.
Takes load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
For tablets, sharding depends on replication map, so the scope of the
sharder should be effective_replicaion_map rather than the schema
object.
Existing users will be transitioned incrementally in later patches.
on_internal_error is wrong for fence_version
condition violation, since in case of
topology change coordinator migrating to another
node we can have raft_topology_cmd::command::fence
command from the old coordinator running in
parallel with the fence command (or topology version
upgrading raft command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
It's stored outside of topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata,
when an instance is destroyed the
corresponding phased_barrier::operation is
released.
It's stored in as a static column in topology table,
will be updated at various steps of the topology
change state machine.
The initial value is 1, zero means that topology
versions are not yet supported, will be
used in RPC handling.
In get_range_addresses we are iterating
over vnode tokens, don't need to do
binary search for them in tmptr->first_token,
they can be directly used as keys
for _replication_map.
We add the has_pending_ranges function to erm. The
implementation for vnode is similar to that of token_metadata.
For tablets, we add new code that checks if the given endpoint
is contained in tablet_map::_transitions.
The function storage_service::update_pending_ranges is
turned to update_topology_changes_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
We already use the new pending_endpoints from erm though
the get_pending_ranges virtual function, in this commit
we update all the remaining places to use the new
implementation in erm, as well as remove the old implementation
in token_metadata.
In this commit we introduce functions to erm for accessing
pending_endpoints and read_endpoints similar to the
corresponding functions in token_metadata. The only
difference - we no longer need the keyspace_name map.
The functions get_pending_endpoints and get_endpoints_for_reading
are virtual, since they have different implementations
for vnode and for tablets.
The get_pending_endpoints already existed. For tablets it
remained unchanged, while for vnode we just changed
it from calling on token_metadata to using a local field.
We have also removed ks_name from the signature as it's
no longer needed.
For vnodes, the get_endpoints_for_reading also just
employs the local field. In the case of tablets, we currently
return nullptr as the appropriate implementation remains unclear.
In this commit we add logic to calculate pending_endpoints and
read_endpoints, similar to how it was done in update_pending_ranges.
For situations where 'natural_endpoints_depend_on_token'
is false we short-circuit the calculations, breaking out
of the loop after the first iteration. In this case we add a
single item with key=default_replication_map_key
to the replication_map and set pending_endpoints/read_endpoints
key range to the entire set of possible values.
In the loop we iterate over all_tokens, which contains the union of
all boundary tokens, from the old and from the new topology.
In addition to updating pending_endpoints and read_endpoints in the loop,
we remember the new natural endpoints in the replication_map
if the current token is contained in the current set of boundary tokens.
We optimise memory usage of replication_map by
storing endpoints list only once in case of
natural_endpoints_depend_on_token() == false. For simplicity,
this list is stored in the same unordered_map with
special key default_replication_map_key.
We inline both get_natural_endpoints and
for_each_natural_endpoint_until from abstract_replication_strategy
into vnode_erm since now the overrides in local and everywhere
strategies are redundant. The default implementation works
for them as empty sorted_tokens() is not a problem, we
store endpoints with a special key.
Function do_get_natural_endpoints was extracted,
since get_natural_endpoints returns by val,
but for_each_natural_endpoint_until reference in sufficient.
We want to refactor replication_map so that it doesn't
store multiple copies of the same endpoints vector
in case of natural_endpoints_depend_on_token == false.
To preserve get_range_addresses behaviour
we iterate over tm.sorted_tokens() instead of
_replication_map.
It's possible that the callers of this function
are ok with single range in case of
natural_endpoints_depend_on_token == false,
but to restrict the scope of the refactoring we
refrain from going to that direction.
We need to account for the new fields in the clone implementation.
The signature future<erm> erm::clone() const; doesn't work because
the call will be made via foreign_ptr on an instance from another
shard, so we need to use local values for replication_strategy
and token_metadata.
Refactor ~vnode_effective_replication_map, use
our new clear_gently overload for rvalue references.
Add new fields _pending_endpoints and _read_endpoints
to the call.
vnode_efficient_replication_map::clear_gently is removed as
it was not used.
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains functionally
similar _replication_map and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.
At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data add create an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.
The update_topology_change_info implementation is mostly derived from
update_pending_ranges, there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.
The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.