Commit Graph

1387 Commits

Author SHA1 Message Date
Gleb Natapov
0e80c5162a storage_service: avoid unneeded copies in on_change
Move array of strings instead of copying.
2023-05-25 14:51:14 +03:00
Gleb Natapov
3a201c25c8 storage_service: remove check that is always true
The array cannot be empty since we access the first element of the array
before we call this function.
2023-05-25 14:50:23 +03:00
Gleb Natapov
715897ff31 storage_service: rename handle_state_removing to handle_state_removed
The function no longer handles REMOVING_TOKING state so rename the
function and drop no longer needed checks for the non existing state.
2023-05-25 14:48:58 +03:00
Gleb Natapov
4103281648 storage_service: avoid string copy 2023-05-25 14:48:39 +03:00
Gleb Natapov
05aa07835d storage_service: delete code that handled REMOVING_TOKENS state
The state is never advertised so the code is never used.
2023-05-25 14:48:09 +03:00
Tomasz Grabiec
809ddd7f79 Merge 'Move pending_ranges and endpoints_for_reading from token_metadata to erm' from Gusev Petr
This refactoring is a follow-up for https://github.com/scylladb/scylladb/pull/13376, move per keyspace data structures related to topology changes from `token_metadata` to `erm`.

We move `pending_endpoints` and `read_endpoints`, along with their computation logic, from `token_metadata` to `vnode_effective_replication_map`. The `vnode_effective_replication_map` seems more appropriate for them since it contains functionally similar `replication_map` and we will be able to reuse `pending_endpoints/read_endpoints` across keyspaces sharing the same `factory_key`.

At present, `pending_endpoints` and `read_endpoints` are updated in the `update_pending_ranges` function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the `migration_info` for specific keyspaces. In this PR we introduce a new `topology_change_info` structure to hold the first part's data and create an `update_topology_change_info` function to update it. This structure will be used in `vnode_effective_replication_map` to compute `pending_endpoints` and `read_endpoints`. This enables the reuse of `topology_change_info` across all keyspaces, unlike the current `update_pending_ranges` implementation, which is another benefit of this refactoring.

The PR also optimises `replication_map` memory usage for the case `natural_endpoints_depend_on_token == false`. We store endpoints list only once with special key
instead of duplicating them for each `vnode` token.

The original `update_pending_ranges` remains unchanged during the PR commits, and will be removed entirely upon transitioning to the new implementation.

Closes #13715

* github.com:scylladb/scylladb:
  token_metadata_test: add a test for everywhere strategy
  token_metadata_test: check read_endpoints when bootstrapping first node
  token_metadata_test: refactor tests, extract create_erm
  token_metadata: drop has_pending_ranges and migration_info
  effective_replication_map: add has_pending_ranges
  token_metadata: drop update_pending_ranges
  effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading
  token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers
  vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading
  calculate_effective_replication_map: compute pending_endpoints and read_endpoints
  vnode_erm: optimize replication_map
  vnode_erm::get_range_addresses: use sorted_tokens
  abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token
  sequenced_set: add extract_vector method
  effective_replication_map: clone_endpoints_gently -> clone_data_gently
  vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints
  stall_free.hh: add clear_gently for rvalues
  stall_free.hh: relax Container requirement
  token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map
  token_metadata: introduce topology_change_info
  token_metadata: replace set_topology_transition_state with set_read_new
2023-05-22 21:37:06 +02:00
Tomasz Grabiec
9d4bca26cc Merge 'raft topology: implement check_and_repair_cdc_streams API' from Kamil Braun
`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommissioned a node the
current generation has more stream IDs than you need. In that case you can do
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.

It also works when you change number of shards on some node. We don't
automatically introduce a new generation in that case but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.

This PR implements the API on top of raft topology, it was originally
implemented using gossiper.  It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.

Closes #13683

* github.com:scylladb/scylladb:
  docs: update topology-over-raft.md
  test: topology_experimental_raft: test `check_and_repair_cdc` API
  raft topology: implement `check_and_repair_cdc_streams` API
  raft topology: implement global request handling
  raft topology: introduce `prepare_new_cdc_generation_data`
  raft_topology: `get_node_to_work_on_opt`: return guard if no node found
  raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
  raft topology: separate `publish_cdc_generation` state
  raft topology: non-node-specific `exec_global_command`
  raft topology: introduce `start_operation()`
  raft topology: non-node-specific `topology_mutation_builder`
  topology_state_machine: introduce `global_topology_request`
  topology_state_machine: use `uint16_t` for `enum_class`es
  raft topology: make `new_cdc_generation_data_uuid` topology-global
2023-05-22 11:33:58 +02:00
Petr Gusev
5976277c2c token_metadata: drop has_pending_ranges and migration_info
Use the new erm::has_pending_ranges function, drop the
old implementation from token_metadata.
2023-05-21 13:17:42 +04:00
Petr Gusev
8cb709d3d6 token_metadata: drop update_pending_ranges
The function storage_service::update_pending_ranges is
turned to update_topology_changes_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
2023-05-21 13:17:42 +04:00
Petr Gusev
10bf8c7901 token_metadata: introduce topology_change_info
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains functionally
similar _replication_map and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.

At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data add create an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.

The update_topology_change_info implementation is mostly derived from
update_pending_ranges, there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.

The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.
2023-05-19 19:04:43 +04:00
Petr Gusev
51e80691ef token_metadata: replace set_topology_transition_state with set_read_new
This helps isolate topology::transition_state dependencies,
token_metadata doesn't need the entire enum, just  this
boolean flag.
2023-05-19 19:04:43 +04:00
Kamil Braun
13df85ea11 Merge 'Cut feature_service -> system_keyspace dependency' from Pavel Emelyanov
This implicit link it pretty bad, because feature service is a low-level
one which lots of other services depend on. System keyspace is opposite
-- a high-level one that needs e.g. query processor and database to
operate. This inverse dependency is created by the feature service need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).

The dependency can be cut. The only place when enabled features are
committed is when gossiper enables features on join or by receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.

Despite gossiper already has system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.

Fixes: #13837

Closes #13172

* github.com:scylladb/scylladb:
  gossiper: Remove features and sysks from gossiper
  system_keyspace: De-static save_local_supported_features()
  system_keyspace: De-static load_|save_local_enabled_features()
  system_keyspace: Move enable_features_on_startup to feature_service (cont)
  system_keyspace: Move enable_features_on_startup to feature_service
  feature_service: Open-code persist_enabled_feature_info() into enabler
  gms: Move feature enabler to feature_service.cc
  gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
  gms: Persist features explicitly in features enabler
  feature_service: Make persist_enabled_feature_info() return a future
  system_keyspace: De-static load_peer_features()
  gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
  gossiper: Enable features and register enabler from outside
  gms: Add feature_service and system_keyspace to feature_enabler
2023-05-18 18:21:06 +02:00
Gleb Natapov
701d6941a5 storage_proxy: raft topology: use gossiper state to populate peers table
Some state that is used to fill in 'peeers' table is still propagated
over gossiper.  When moving a node into the normal state in raft
topology code use the data from the gossiper to populate peers table because
storage_service::on_change() will not do it in case the node was not in
normal state at the time it was called.

Fixes: #13911

Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
2023-05-18 16:00:29 +02:00
Pavel Emelyanov
5216dcb1b3 Merge 'db/system_keyspace: remove the dependency on storage_proxy' from Botond Dénes
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.

This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.

This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.

Refs: #11870

Closes #13869

* github.com:scylladb/scylladb:
  db/system_keyspace: remove dependency on storage_proxy
  db/system_keyspace: replace storage_proxy::query*() with  replica:: equivalent
  replica: add query.hh
2023-05-18 10:53:27 +03:00
Avi Kivity
d2d53fc1db Merge 'Do not yield while traversing the gossiper endpoint state map' from Benny Halevy
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.

get_endpoints is used here by gossiper and storage_service for iterations that may preempt
instead of iterating direction over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`) so to prevent use-after-free that may potentially happen if the map is rehashed while the function yields causing invalidation of the loop iterators.

Fixes #13899

Closes #13900

* github.com:scylladb/scylladb:
  storage_service: do not preempt while traversing endpoint_state_map
  gossiper: do not preempt while traversing endpoint_state_map
2023-05-16 18:04:35 +03:00
Benny Halevy
1da0b0ff76 storage_service: do not preempt while traversing endpoint_state_map
The map iterators might be invalidated while yielding
on insert if the map is rehashed.
See https://en.cppreference.com/w/cpp/container/unordered_map/insert

Refs #13899

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-16 12:24:44 +03:00
Benny Halevy
502b5522ca storage_service: handle_state_normal: on_internal_error on "owns no tokens"
Although this condition should not happen,
we suspect that certain timing conditions might
lead this state of node in handle_normal_state
(possibly when shutdown) has no tokens.

Currently we call on_internal_error_noexcept, so
if abort_on_internal_error is false, we will just
print an error and continue on with handle_state_normal.

Change that to `on_internal_error` so to throw an
exception in production in this unexpected state.

Refs #13801

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-15 12:49:17 +03:00
Botond Dénes
157fdb2f6d db/system_keyspace: remove dependency on storage_proxy
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
2023-05-12 07:27:55 -04:00
Petr Gusev
08529a1c6c storage_proxy, storage_service: use new read endpoints
We use set_topology_transition_state to set read_new state
in storage_service::topology_state_load
based on _topology_state_machine._topology.tstate.
This triggers update_pending_ranges to compute and store new ranges
for read requests. We use this information in
storage_proxy::get_endpoints_for_reading
when we need to decide which nodes to use for reading.
2023-05-09 18:42:03 +04:00
Kamil Braun
372a06f735 raft topology: implement check_and_repair_cdc_streams API
The original API is gossiper-based. Since we're moving CDC generations
handling to Raft-based topology, we need to implement this API as well.

For now the API creates a new generation unconditionally, in a follow-up
I'll introduce a check to skip the creation if the current generation is
optimal.
2023-05-08 16:49:01 +02:00
Kamil Braun
a09ed01ffa raft topology: implement global request handling
The only possible request for now is creating a new CDC generation.
2023-05-08 16:49:01 +02:00
Kamil Braun
2bd333dd84 raft topology: introduce prepare_new_cdc_generation_data
Refactor the code, taking a bulk of the CDC-specific code used when
there's a bootstrap request to a separate function. We'll use it
elsewhere as well.
2023-05-08 16:48:59 +02:00
Kamil Braun
2863ef3df4 raft_topology: get_node_to_work_on_opt: return guard if no node found
We'll need the guard back.
2023-05-08 16:48:29 +02:00
Kamil Braun
afcf17f168 raft topology: remove node_to_work_on from commit_cdc_generation transition
We don't need it for anything in this state, and this change allows us
to commit CDC generations without transitioning nodes.
2023-05-08 16:47:37 +02:00
Kamil Braun
1b21a3c5ae raft topology: separate publish_cdc_generation state
Previously the generation committed in `commit_cdc_generation` state
would be published by the coordinator in `write_both_read_old` state.
This logic assumed that we only create new CDC generations during node
bootstrap.

We'll allow committing new generations without bootstrap (without any
node transitions in fact), so we need this separate state.

After publishing the generation, we check whether there is a
transitioning node; if so, we'll enter `write_both_read_old` as next
state, otherwise we'll make the topology non-transitioning.
2023-05-08 16:47:24 +02:00
Kamil Braun
6d5b8c1b7c raft topology: non-node-specific exec_global_command
This function broadcasts a command to cluster members. It takes a
`node_to_work_on`. We'll need a version which works in situations where
there is no 'node to work on'.
2023-05-08 16:47:13 +02:00
Kamil Braun
8b5237a058 raft topology: introduce start_operation()
This calls `raft_group0_client::start_operation` and checks if the term
is different from the term that the coordinator was initially created
with; if so, we must no longer continue coordinating the topology.

There was one direct call to `raft_group0_client::start_operation`
without a term check, replace it with the introduced function.
2023-05-08 16:47:13 +02:00
Kamil Braun
90770f712c raft topology: non-node-specific topology_mutation_builder
The existing `topology_mutation_builder` took a `raft::server_id` in its
constructor and immediately created a clustering row in the
`system.topology` mutation that it was building for the given node.
This does not allow building mutations which only affect the static
columns.

Split the class into two:
- `topology_mutation_builder` doesn't take `raft::server_id` in its
  constructor and contains only the methods that are used to set static
  columns. It also has a `with_node` method taking a `raft::server_id`
  which returns a `topology_node_mutation_builder&`.
- `topology_node_mutation_builder` creates the clustering row and allows
  seting its columns.

We'll use `topology_mutation_builder` when we only want to transition
the cluster-global topology state, without affecting any specific nodes'
states.
2023-05-08 16:47:11 +02:00
Kamil Braun
93dcdcd4eb raft topology: make new_cdc_generation_data_uuid topology-global
- make it a static column in `system.topology`
- move it from node-specific `ring_slice` to cluster-global `topology`

We will use it in scenarios where no node is transitioning.

Also make it `std::optional` in topology for consistency with other
fields (previously, the 'no value' state for this field was represented
using default-constructed `utils::UUID`).
2023-05-08 16:46:14 +02:00
Kamil Braun
aba31ad06c storage_service: use seastar::format instead of fmt::format
For some reason Scylla crashes on `aarch64` in release mode when calling
`fmt::format` in `raft_removenode` and `raft_decommission`. E.g. on this
line:
```
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, fmt::format("decomission: request decomission for {}", raft_server.id()));
```

I found this in our configure.py:
```
def get_clang_inline_threshold():
    if args.clang_inline_threshold != -1:
        return args.clang_inline_threshold
    elif platform.machine() == 'aarch64':
        # we see miscompiles with 1200 and above with format("{}", uuid)
        # also coroutine miscompiles with 600
        return 300
    else:
        return 2500
```
but reducing it to `0` didn't help.

I managed to get the following backtrace (with inline threshold 0):
```
void boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) at /usr/include/boost/intrusive/list.hpp:751
 (inlined by) boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear() at /usr/include/boost/intrusive/list.hpp:728
 (inlined by) ~list_impl at /usr/include/boost/intrusive/list.hpp:255
void fmt::v9::detail::buffer<wchar_t>::append<wchar_t>(wchar_t const*, wchar_t const*) at ??:?
void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) at ??:?
fmt::v9::vformat[abi:cxx11](fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<fmt::v9::appender, char> >) at ??:?
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v9::format<utils::tagged_uuid<raft::server_id_tag>&>(fmt::v9::basic_format_string<char, fmt::v9::type_identity<utils::tagged_uuid<raft::server_id_tag>&>::type>, utils::tagged_uuid<raft::server_id_tag>&) at /usr/include/fmt/core.h:3206
 (inlined by) service::storage_service::raft_removenode(utils::tagged_uuid<locator::host_id_tag>) at ./service/storage_service.cc:3572
```

Maybe it's a bug in `fmt` library?

In any case replacing the call with `::format` (i.e. `seastar::format`
from seastar/core/print.hh) helps.

Do it for the entire file for consistency (and avoiding this bug).

Also, for the future, replace `format` calls with `::format` - now it's
the same thing, but the latter won't clash with `std::format` once we
switch to libstdc++13.

Fixes #13707

Closes #13711
2023-05-05 19:23:22 +02:00
Tomasz Grabiec
e385ce8a2b Merge "fix stack use after free during shutdown" from Gleb
storage_service uses raft_group0 but the during shutdown the later is
destroyed before the former is stopped. This series move raft_group0
destruction to be after storage_service is stopped already. For the
move to work some existing dependencies of raft_group0 are dropped
since they do not really needed during the object creation.

Fixes #13522
2023-05-04 15:14:18 +02:00
Gleb Natapov
e9fb885e82 service/raft: raft_group0: drop dependency on cdc::generation_service
raft_group0 does not really depends on cdc::generation_service, it needs
it only transiently, so pass it to appropriate methods of raft_group0
instead of during its creation.
2023-05-04 13:03:07 +03:00
Tomasz Grabiec
aba5667760 Merge 'raft topology: refactor the coordinator to allow non-node specific topology transitions' from Kamil Braun
We change the meaning and name of `replication_state`: previously it was meant
to describe the "state of tokens" of a specific node; now it describes the
topology as a whole - the current step in the 'topology saga'. It was moved
from `ring_slice` into `topology`, renamed into `transition_state`, and the
topology coordinator code was modified to switch on it first instead of node
state - because there may be no single transitioning node, but the topology
itself may be transitioning.

This PR was extracted from #13683, it contains only the part which refactors
the infrastructure to prepare for non-node specific topology transitions.

Closes #13690

* github.com:scylladb/scylladb:
  raft topology: rename `update_replica_state` -> `update_topology_state`
  raft topology: remove `transition_state::normal`
  raft topology: switch on `transition_state` first
  raft topology: `handle_ring_transition`: rename `res` to `exec_command_res`
  raft topology: parse replaced node in `exec_global_command`
  raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`
  storage_service: extract raft topology coordinator fiber to separate class
  raft topology: rename `replication_state` to `transition_state`
  raft topology: make `replication_state` a topology-global state
2023-04-30 10:55:24 +02:00
Asias He
a8040306bb storage_service: Fix removing replace node as pending
Consider

- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2

  ```
  auto host_id = _gossiper.get_host_id(endpoint);
  auto existing = tmptr->get_endpoint_for_host_id(host_id);
  ```

  host_id = new host id
  existing = empty

  As a result, del_replacing_endpoint() will not be called.

This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.

This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
replacing node uses a new host id than the node to be replaced.

To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.

Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3

After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3

Tests: https://github.com/scylladb/scylla-dtest/pull/3126

Closes #13677
2023-04-27 21:03:01 +03:00
Kamil Braun
0bee872fb1 raft topology: rename update_replica_state -> update_topology_state
The new name is more generic and appropriate for topology transitions
which don't affect any specific replica but the entire cluster as a
whole (which we'll introduce later).

Also take `guard` directly instead of `node_to_work_on` in this more
generic function. Since we want `node_to_work_on` to die when we steal
its guard, introduce `take_guard` which takes ownership of the object
and returns the guard.
2023-04-27 15:22:19 +02:00
Kamil Braun
22ab5982e7 raft topology: remove transition_state::normal
What this state really represented is that there is currently no
transition. So remove it and make `transition_state` optional instead.
2023-04-27 15:18:32 +02:00
Kamil Braun
61c4e0ae20 raft topology: switch on transition_state first
Previously the code assumed that there was always a 'node to work on' (a
node which wants to change its state) or there was no work to do at all.
It would find such a node, switch on its state (e.g. check if it's
bootstrapping), and in some states switch on the topology
`transition_state` (e.g. check if it's `write_both_read_old`).

We want to introduce transitions that are not node-specific and can work
even when all nodes are 'normal' (so there's no 'node to work on'). As a
first step, we refactor the code so it switches on `transition_state`
first. In some of these states, like `write_both_read_old`, there must
be a 'node to work on' for the state to make sense; but later in some
states it will be optional (such as `commit_cdc_generation`).
2023-04-27 15:14:59 +02:00
Kamil Braun
a023ca2cf1 raft topology: handle_ring_transition: rename res to exec_command_res
A more descriptive name.
2023-04-27 15:12:12 +02:00
Kamil Braun
4ddfce8213 raft topology: parse replaced node in exec_global_command
Will make following commits easier.
2023-04-27 15:10:49 +02:00
Kamil Braun
bafce8fd28 raft topology: extract cleanup_group0_config_if_needed from get_node_to_work_on 2023-04-27 15:04:36 +02:00
Kamil Braun
98f69f52aa storage_service: extract raft topology coordinator fiber to separate class
The lambdas defined inside the fiber are now methods of this class.

Currently `handle_node_transition` is calling `handle_ring_transition`,
in a later commit we will reverse this: `handle_ring_transition` will
call `handle_node_transition`. We won't have to shuffle the functions
around because they are members of the same class, making the change
easier to review. In general, the code will be easier to maintain in
this new form (no need to deal with so many lambda captures etc.)

Also break up some lines which exceeded the 120 character limit (as per
Seastar coding guidelines).
2023-04-27 15:04:35 +02:00
Kamil Braun
defa63dc20 raft topology: rename replication_state to transition_state
The new name is more generic - it describes the current step of a
'topology saga` (a sequence of steps used to implement a larger topology
operation such as bootstrap).
2023-04-27 11:39:38 +02:00
Kamil Braun
af1ea2bb16 raft topology: make replication_state a topology-global state
Previously it was part of `ring_slice`, belonging to a specific node.
This commit moves it into `topology`, making it a cluster-global
property.

The `replication_state` column in `system.topology` is now `static`.

This will allow us to easily introduce topology transition states that
do not refer to any specific node. `commit_cdc_generation` will be such
a state, allowing us to commit a new CDC generation even though all
nodes are normal (none are transitioning). One could argue that the
other states are conceptually already cluster-global: for example,
`write_both_read_new` doesn't affect only the tokens of a bootstrapping
(or decommissioning etc.) node; it affects replica sets of other tokens
as well (with RFs greater than 1).
2023-04-27 11:39:38 +02:00
Kamil Braun
30cc07b40d Merge 'Introduce tablets' from Tomasz Grabiec
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.

Things achieved in this PR:

  - You can start a cluster and create a keyspace whose tables will use
    tablet-based replication. This is done by setting `initial_tablets`
    option:

    ```
        CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
                        'replication_factor': 3,
                        'initial_tablets': 8};
    ```

    All tables created in such a keyspace will be tablet-based.

    Tablet-based replication is a trait, not a separate replication
    strategy. Tablets don't change the spirit of replication strategy, it
    just alters the way in which data ownership is managed. In theory, we
    could use it for other strategies as well like
    EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
    is augmented to support tablets.

  - You can create and drop tablet-based tables (no DDL language changes)

  - DML / DQL work with tablet-based tables

    Replicas for tablet-based tables are chosen from tablet metadata
    instead of token metadata

Things which are not yet implemented:

  - handling of views, indexes, CDC created on tablet-based tables
  - sharding is done using the old method, it ignores the shard allocated in tablet metadata
  - node operations (topology changes, repair, rebuild) are not handling tablet-based tables
  - not integrated with compaction groups
  - tablet allocator piggy-backs on tokens to choose replicas.
    Eventually we want to allocate based on current load, not statically

Closes #13387

* github.com:scylladb/scylladb:
  test: topology: Introduce test_tablets.py
  raft: Introduce 'raft_server_force_snapshot' error injection
  locator: network_topology_strategy: Support tablet replication
  service: Introduce tablet_allocator
  locator: Introduce tablet_aware_replication_strategy
  locator: Extract maybe_remove_node_being_replaced()
  dht: token_metadata: Introduce get_my_id()
  migration_manager: Send tablet metadata as part of schema pull
  storage_service: Load tablet metadata when reloading topology state
  storage_service: Load tablet metadata on boot and from group0 changes
  db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
  migration_notifier: Introduce before_drop_keyspace()
  migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
  test: perf: Introduce perf-tablets
  test: Introduce tablets_test
  test: lib: Do not override table id in create_table()
  utils, tablets: Introduce external_memory_usage()
  db: tablets: Add printers
  db: tablets: Add persistence layer
  dht: Use last_token_of_compaction_group() in split_token_range_msb()
  locator: Introduce tablet_metadata
  dht: Introduce first_token()
  dht: Introduce next_token()
  storage_proxy: Improve trace-level logging
  locator: token_metadata: Fix confusing comment on ring_range()
  dht, storage_proxy: Abstract token space splitting
  Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
  db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
  db: Introduce get_non_local_vnode_based_strategy_keyspaces()
  service: storage_proxy: Avoid copying keyspace name in write handler
  locator: Introduce per-table replication strategy
  treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
  locator: Introduce effective_replication_map
  locator: Rename effective_replication_map to vnode_effective_replication_map
  locator: effective_replication_map: Abstract get_pending_endpoints()
  db: Propagate feature_service to abstract_replication_strategy::validate_options()
  db: config: Introduce experimental "TABLETS" feature
  db: Log replication strategy for debugging purposes
  db: Log full exception on error in do_parse_schema_tables()
  db: keyspace: Remove non-const replication strategy getter
  config: Reformat
2023-04-27 09:40:18 +02:00
Tomasz Grabiec
ce94a2a5b0 Merge 'Fixes and tests for raft-based topology changes' from Kamil Braun
Fix two issues with the replace operation introduced by recent PRs.

Add a test which performs a sequence of basic topology operations (bootstrap,
decommission, removenode, replace) in a new suite that enables the `raft`
experimental feature (so that the new topology change coordinator code is used).

Fixes: #13651

Closes #13655

* github.com:scylladb/scylladb:
  test: new suite for testing raft-based topology
  test: remove topology_custom/test_custom.py
  raft topology: don't require new CDC generation UUID to always be present
  raft topology: include shard_count/ignore_msb during replace
2023-04-26 11:38:07 +02:00
Pavel Emelyanov
5cbc8fe2f9 system_keyspace: De-static save_local_supported_features()
That's, in fact, an independent change, because feature enabler doesn't
need this method. So this patch is like "while at it" thing, but on the
other hand it ditches one more qctx usage.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 17:04:54 +03:00
Pavel Emelyanov
dcf88b07a4 gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
This will make it possible to move the enabler to feature_service.cc

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:56:07 +03:00
Pavel Emelyanov
1ee04e4934 system_keyspace: De-static load_peer_features()
This makes use of feature_enabler::_sys_ks dependency and gets rid of
one more global qctx usage.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:50:00 +03:00
Pavel Emelyanov
ac60d8afca gossiper: Enable features and register enabler from outside
It's a bit hairy. The maybe_enable_features() is called from two places
-- the feature_enabler upon notifications from gossiper and directory by
gossiper from wait_for_gossip_to_settle().

The _latter_ is called only when the wait_for_gossip_to_settle() is
called for the first time because of the _gossip_settled checks in it.
For the first time this method is called by storage_service when it
tries to join the ring (next it's called from main, but that's not of
interest here).

Next, despite feature_enabler is registered early -- when gossiper
instance is constructed by sharded<gossiper>::start() -- it checks for
the _gossip_settled to be true to take any actions.

Considering both -- calling maybe_enable_features() _and_ registering
enabler after storage_service's call to wait_for_gossip_to_settle()
doesn't break the code logic, but make further patching possible. In
particular, the feature_enabler will move to feature_service not to
pollute gossiper code with anything that's not gossiping.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:42:17 +03:00
Kefu Chai
5804eb6d81 storage_service: specialize fmt::formatter<storage_service::mode>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `storage_service::mode` without the help of `operator<<`.

the corresponding `operator<<()` for `storage_service::mode` is removed
in this change, as all its callers are now using fmtlib for formatting
now.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13640
2023-04-25 14:20:57 +02:00