689 Commits

Author SHA1 Message Date
Petr Gusev
ef534ac876 rebuild_with_repair, replace_with_repair: use new token_metadata
Just mechanical changes to the new token_metadata.

All the boost and topology tests pass with this change.
2023-12-12 23:19:53 +04:00
Petr Gusev
93263bf9e7 bootstrap: use new token_metadata
Just mechanical changes to the new token_metadata.

All the boost and topology tests pass with this change.
2023-12-12 23:19:53 +04:00
Petr Gusev
d9283bd025 tablets: switch to token_metadata2
locator_topology_test, network_topology_strategy_test and
tablets_test are fully switched to the host_id-based token_metadata,
meaning they no longer populate the old token_metadata.

All the boost and topology tests pass with this change.
2023-12-12 23:19:53 +04:00
Petr Gusev
f5038f6c72 calculate_effective_replication_map: use new token_metadata
In this commit we switch the function
calculate_effective_replication_map to use the new
token_metadata. We do this by employing our new helper
calculate_natural_ips function. We can't use this helper for
current_endpoints/target_endpoints though,
since in that case we won't add the IP to the
pending_endpoints in the replace-with-same-ip scenario.

The token_metadata_test is migrated to host_ids in the same
commit to make it pass. Other tests work because they fill
both versions of the token_metadata, but for this test it was
simpler to just migrate it straight away. The test constructs
the old token_metadata over the new token_metadata,
which means only the get_new() method will work on it. That's
why we also need to switch some other functions
(maybe_remove_node_being_replaced, do_get_natural_endpoints,
get_replication_factor) to the new version in the same commit.

All the boost and topology tests pass with this change.
2023-12-12 23:19:53 +04:00
Petr Gusev
fe3c543c4e calculate_natural_endpoints: fix formatting 2023-12-12 23:19:53 +04:00
Petr Gusev
d5b4b02b28 abstract_replication_strategy: calculate_natural_endpoints: make it work with both versions of token_metadata
We've updated all the places where token_metadata
is mutated, and now we can progress to the next stage
of the refactoring - gradually switching the read
code paths.

The calculate_natural_endpoints function
is at the core of all of them. It decides which nodes
the given token should be replicated to for the given
token_metadata. It is used in many different contexts and
we can't switch them all in one commit, so instead we
allow the function to behave in both ways. If
use_host_id parameter is false, the function uses the provided
token_metadata as is and returns endpoint_set as a result.
If it's true, it uses get_new() on the provided token_metadata
and returns host_id_set as a result.
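
The dual behaviour described above can be pictured with a minimal sketch. Everything suffixed with _sketch, the simplified inet_address/host_id aliases and the natural_ep_result variant are assumptions made for illustration; they are not the real Scylla declarations.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <variant>

// Hypothetical stand-ins for the real locator types.
using inet_address = std::string;
using host_id = int;
using endpoint_set = std::set<inet_address>;
using host_id_set = std::set<host_id>;
using natural_ep_result = std::variant<endpoint_set, host_id_set>;

// Hypothetical new, host_id-based metadata.
struct new_token_metadata_sketch {
    host_id_set owners;
};

// Hypothetical old, IP-based metadata that also carries the new version.
struct token_metadata_sketch {
    endpoint_set owners;
    new_token_metadata_sketch new_value;
    const new_token_metadata_sketch& get_new() const { return new_value; }
};

// use_host_id == false: use the metadata as is, return an endpoint_set.
// use_host_id == true: go through get_new(), return a host_id_set.
natural_ep_result calculate_natural_endpoints_sketch(const token_metadata_sketch& tm,
                                                     bool use_host_id) {
    if (use_host_id) {
        return host_id_set(tm.get_new().owners);
    }
    return endpoint_set(tm.owners);
}
```

A caller that has already been migrated would pass true and read the host_id_set alternative; unmigrated callers keep passing false.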

The scope of the whole refactoring is limited to the erm data
structure; its interface will be kept inet_address-based for now.
This means we'll often need to resolve host_ids to inet_address-es
as soon as we get a result from calculate_natural_endpoints.
A new calculate_natural_ips function is added for convenience.
It uses the new token_metadata and immediately resolves
returned host_id-s to inet_address-es.

The auxiliary declarations natural_ep_type, set_type, vector_type,
get_self_id, select_tm are introduced only for the sake of
migration; they will be removed later.
2023-12-12 23:19:53 +04:00
Petr Gusev
e4253776a1 locator::topology: allow being_replaced and replacing nodes to have the same IP
When we're replacing a node with the same IP address, we want
the following behavior:
  * host_id -> IP mapping should work and return the same IP address for two
  different host_ids - old and new.
  * the IP -> host_id mapping should return the host_id of the old (replaced)
  host.
This variant is most convenient for preserving the current behavior
of the code, especially the functions maybe_remove_node_being_replaced,
erm::get_natural_endpoints_without_node_being_replaced,
erm::get_pending_endpoints. The 'being_replaced' node will be properly removed in
maybe_remove_node_being_replaced and 'replacing' node will be added to
the pending_endpoints.
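
A minimal sketch of the two mappings, assuming a simplified topology_sketch class with hypothetical add_node, find_ip and find_host methods (none of these are the real locator::topology interface):

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins for the real types.
using inet_address = std::string;
using host_id = int;

class topology_sketch {
    std::map<host_id, inet_address> _ip_by_host;
    std::map<inet_address, host_id> _host_by_ip;
public:
    // Registers a node. If the IP is already taken (same-IP replace), the
    // existing IP -> host_id entry is kept, so it still names the old node.
    void add_node(host_id id, const inet_address& ip) {
        _ip_by_host[id] = ip;
        _host_by_ip.emplace(ip, id); // emplace does not overwrite an existing key
    }

    std::optional<inet_address> find_ip(host_id id) const {
        auto it = _ip_by_host.find(id);
        if (it == _ip_by_host.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    std::optional<host_id> find_host(const inet_address& ip) const {
        auto it = _host_by_ip.find(ip);
        if (it == _host_by_ip.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};
```

With the old node added first and the replacing node second, both host_ids resolve to the same IP, while the IP resolves to the old host_id.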
2023-12-11 12:51:34 +04:00
Petr Gusev
5a1418fdba token_metadata: get_endpoint_for_host_id -> get_endpoint_for_host_id_if_known
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known form a pair
(the first calls on_internal_error, the second returns null),
but there was only one method for the opposite conversion,
get_endpoint_for_host_id, and it returned null. In this commit we
change it to call on_internal_error if it can't find the argument
and add another method, get_endpoint_for_host_id_if_known,
which returns null in this case.

We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
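
The naming convention can be sketched as follows. This is an illustration only: token_metadata_sketch is a hypothetical class, std::optional stands in for the null return, and throwing std::logic_error stands in for on_internal_error.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real types.
using inet_address = std::string;
using host_id = int;

class token_metadata_sketch {
    std::map<host_id, inet_address> _endpoints;
public:
    void add(host_id id, inet_address ip) { _endpoints[id] = std::move(ip); }

    // "_if_known" variant: a missing host_id is an expected outcome.
    std::optional<inet_address> get_endpoint_for_host_id_if_known(host_id id) const {
        auto it = _endpoints.find(id);
        if (it == _endpoints.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Strict variant: a missing host_id is treated as a bug in the caller.
    // The exception here is only a stand-in for on_internal_error.
    inet_address get_endpoint_for_host_id(host_id id) const {
        auto ep = get_endpoint_for_host_id_if_known(id);
        if (!ep) {
            throw std::logic_error("unknown host_id");
        }
        return *ep;
    }
};
```

Code that parses user-supplied node lists would use the _if_known variant and report a regular error itself.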
2023-12-11 12:51:34 +04:00
Petr Gusev
08b47d645a token_metadata: get_host_id: exception -> on_internal_error
It's a bug to use get_host_id on a non-existent endpoint,
so on_internal_error is more appropriate. Also, it's
easier to debug since it provides a backtrace.

If a missing inet_address is expected, get_host_id_if_known
should be used instead. We update one such case in
storage_service::force_remove_completion. Other
usages of get_host_id are correct.
2023-12-11 12:51:34 +04:00
Petr Gusev
39bbe5f457 token_metadata: add get_all_ips method
This is convenient for migrating code that uses
get_all_endpoints.
2023-12-11 12:51:34 +04:00
Petr Gusev
9edf0709e6 token_metadata: support host_id-based version
In this commit we enhance token_metadata with a pointer to the
new host_id-based generic_token_metadata specialisation (token_metadata2).
The idea is that in the following commits we'll go over all token_metadata
modifications and make the corresponding modifications to its new
host_id-based alternative.

The pointer to token_metadata2 is stored in the
generic_token_metadata::_new_value field. The pointer can be
mutable, immutable, or absent altogether (std::monostate).
It's mutable if this generic_token_metadata owns it, meaning
it was created using the generic_token_metadata(config cfg)
constructor. It's immutable if the
generic_token_metadata(lw_shared_ptr<const token_metadata2> new_value);
constructor was used. This means this old token_metadata is a wrapper for
new token_metadata and we can only use the get_new() method on it. The field
_new_value is empty for the new host_id-based token_metadata version.

The generic_token_metadata(std::unique_ptr<token_metadata_impl<NodeId>> impl, token_metadata2 new_value);
constructor is used for clone methods. We clone both versions,
and we need to pass a cloned token_metadata2 into constructor.

There are two overloads of get_new, for mutable and immutable
generic_token_metadata. Both of them throw an exception if
they can't get the appropriate pointer. There is also a
get_new_strong method, which returns an immutable owning
pointer. This is convenient since a lot of APIs want an
owning pointer. We can't make the get_new/get_new_strong API
simpler and use get_new_strong everywhere since it mutates the
original generic_token_metadata by incrementing the reference
counter and this causes races when it's passed between
shards in replicate_to_all_cores.
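
The three states of _new_value and the two get_new overloads can be sketched roughly like this. It is a simplified illustration: std::shared_ptr stands in for lw_shared_ptr (so it does not reproduce the cross-shard reference counting problem), and the owning_tag constructor is a hypothetical replacement for the config-based one.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>
#include <variant>

// Hypothetical stand-in for the host_id-based token_metadata2.
struct token_metadata2_sketch {
    int version = 0;
};

class generic_token_metadata_sketch {
    using mutable_ptr = std::shared_ptr<token_metadata2_sketch>;
    using const_ptr = std::shared_ptr<const token_metadata2_sketch>;
    // absent (monostate), owned and mutable, or wrapped and immutable
    std::variant<std::monostate, mutable_ptr, const_ptr> _new_value;
public:
    struct owning_tag {};

    // New host_id-based version itself: the field stays empty.
    generic_token_metadata_sketch() = default;

    // Old version that owns its new counterpart: mutable pointer.
    explicit generic_token_metadata_sketch(owning_tag)
        : _new_value(std::in_place_type<mutable_ptr>,
                     std::make_shared<token_metadata2_sketch>()) {}

    // Old version acting as a wrapper: immutable pointer.
    explicit generic_token_metadata_sketch(const_ptr new_value)
        : _new_value(std::in_place_type<const_ptr>, std::move(new_value)) {}

    // Mutable overload: only works if this object owns the new value.
    token_metadata2_sketch* get_new() {
        if (auto* p = std::get_if<mutable_ptr>(&_new_value)) {
            return p->get();
        }
        throw std::runtime_error("no mutable new token_metadata");
    }

    // Immutable overload: works for both owned and wrapped values.
    const token_metadata2_sketch* get_new() const {
        if (auto* p = std::get_if<mutable_ptr>(&_new_value)) {
            return p->get();
        }
        if (auto* p = std::get_if<const_ptr>(&_new_value)) {
            return p->get();
        }
        throw std::runtime_error("no new token_metadata");
    }
};
```

A wrapper built from an immutable pointer can be read through the const overload but not modified through the mutable one.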
2023-12-11 12:51:34 +04:00
Petr Gusev
63f64f3303 token_metadata: make it a template with NodeId=inet_address/host_id
NodeId is used in all internal token_metadata data structures that
previously used inet_address. We choose topology::key_kind based
on the value of the template parameter.

generic_token_metadata::update_topology overload with host_id
parameter is added to make update_topology_change_info work,
it now uses NodeId as a parameter type.

topology::remove_endpoint(host_id) is added to make
generic_token_metadata::remove_endpoint(NodeId) work.

pending_endpoints_for and endpoints_for_reading are just removed - they
are not used and not implemented. The declarations were left by mistake
from a refactoring in which these methods were moved to erm.

generic_token_metadata_base is extracted to contain declarations, common
to both token_metadata versions.

Templates are explicitly instantiated inside token_metadata.cc, since
the implementation part is also a template and it's not exposed to the header.

There are no other behavioral changes in this commit, just syntax
fixes to make token_metadata a template.
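
As a rough illustration of the shape of this change, the sketch below parameterizes a tiny container on NodeId, derives key_kind from the template parameter, and explicitly instantiates both versions. The class body, the host_id struct and the aliases are simplified assumptions, not the real declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the real types.
using inet_address = std::string;
struct host_id {
    int value;
    bool operator<(const host_id& o) const { return value < o.value; }
};

enum class key_kind { inet_address, host_id };

template <typename NodeId>
class generic_token_metadata_sketch {
    std::set<NodeId> _normal_nodes; // previously keyed by inet_address only
public:
    // The topology key kind follows from the template parameter.
    static constexpr key_kind kind =
        std::is_same_v<NodeId, host_id> ? key_kind::host_id : key_kind::inet_address;

    void add(NodeId id) { _normal_nodes.insert(std::move(id)); }
    std::size_t count() const { return _normal_nodes.size(); }
};

// Explicit instantiations, as they would appear in a single .cc file.
template class generic_token_metadata_sketch<inet_address>;
template class generic_token_metadata_sketch<host_id>;

using token_metadata_old_sketch = generic_token_metadata_sketch<inet_address>;
using token_metadata2_sketch = generic_token_metadata_sketch<host_id>;
```

Explicit instantiation keeps the template definitions out of the header while still providing both specialisations to the linker.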
2023-12-11 12:51:34 +04:00
Petr Gusev
c9fbe3d377 locator: make dc_rack_fn a template
In the next commits token_metadata will be
made a template with NodeId=inet_address|host_id
parameter. This parameter will be passed to dc_rack_fn
function, so it also should be made a template.
2023-12-11 12:51:33 +04:00
Piotr Dulikowski
5227b71363 locator/topology: add key_kind parameter
For the host_id-based token_metadata we want host_id
to be the main node key, meaning it should be used
in add_or_update_endpoint to find the node to update.
For the inet_address-based token_metadata version
we want to retain the old behaviour during the transition period.

In this commit we introduce key_kind parameter and use
key_kind::inet_address in all current topology usages.
Later we'll use key_kind::host_id for the new token_metadata.

In the last commits of the series, when the new token_metadata
version is used everywhere, we will remove the key_kind enum.
2023-12-11 12:51:33 +04:00
Petr Gusev
2f137776c3 token_metadata: topology_change_info: change field types to token_metadata_ptr
In subsequent commits we'll need the following api for token_metadata:
  token_metadata(token_metadata2_ptr);
  get_new() -> token_metadata2*
where token_metadata2 is the new version of token_metadata,
based on host_id.

In other words:
* token_metadata knows the new version of itself and returns a pointer
to it through get_new()
* token_metadata can be constructed based solely on the new version,
without its own implementation. In this case the only method we can
use on it is get_new.

This allows passing token_metadata2 to APIs with token_metadata in the method
signature, if these APIs are known to only use the get_new method on the
passed token_metadata.

And back to topology_change_info - if we got it from the new token_metadata
we want to be able to construct token_metadata from token_metadata2 contained
in it, and this requires it to be a ptr, not value.
2023-12-11 12:51:33 +04:00
Petr Gusev
f21f23483c token_metadata: drop unused method get_endpoint_to_token_map_for_reading 2023-12-11 12:51:22 +04:00
Alexander Turetskiy
f30b5473ab cql: Reject empty options while altering a keyspace
Reject ALTER KEYSPACE request for NetworkTopologyStrategy when
replication options are missing.

Also reject CREATE KEYSPACE with no replication factor options.
Cassandra has a default_keyspace_rf configuration that may allow such
CREATE KEYSPACE commands, but Scylla doesn't have this option (refs #16028).

fixes #10036

Closes scylladb/scylladb#16221
2023-12-10 17:44:35 +02:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.

This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In that case it should not be blocked for long, because the streaming process frequently checks whether the guard was left behind and stops if it was.

This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

Also will be used in tests to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
1f57d1ea28 storage_service, api: Add API to migrate a tablet
Will be used in tests, or for hot fixes in production.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
5381792401 tablets: Add per-tablet session id field to tablet metadata
range_streamer will pick it up when creating topology_guard.

It's materialized in memory only for migrating tablets in
tablet_transition_info.
2023-12-06 18:36:17 +01:00
Botond Dénes
d2a88cd8de Merge 'Typos: fix typos in code' from Yaniv Kaul
Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255

Closes scylladb/scylladb#16289

* github.com:scylladb/scylladb:
  Update unified/build_unified.sh
  Update main.cc
  Update dist/common/scripts/scylla-housekeeping
  Typos: fix typos in code
2023-12-06 07:36:41 +02:00
Yaniv Kaul
ae2ab6000a Typos: fix typos in code
Fixes some more typos as found by codespell run on the code.
In this commit, there are more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255
2023-12-05 15:18:11 +02:00
Benny Halevy
4d461fc788 locator: replication strategies: use locator::topology rather than fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
86716b2048 locator: topology: add helpers to retrieve this host_id and address
And respective `is_me()` predicates,
to prepare for getting rid of fb_utilities.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
52412087b7 snitch: pass broadcast_address in snitch_config
To untangle snitch from fb_utilities.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
94fc8e2a9a snitch: add optional get_broadcast_address method
and set broadcast_address / broadcast_rpc_address in main
to remove this dependency of snitch on fb_utilities.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
1d0e71308b locator: ec2_multi_region_snitch: keep local public address as member
To be used in the next patch to retrieve the broadcast_address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
90af71ffa7 ec2_multi_region_snitch: reindent load_config
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
fecb597ad6 ec2_multi_region_snitch: coroutinize load_config
Now that ec2_snitch::load_config is a coroutine
there's no need for a seastar thread here either.

Refs #16241

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
cb7e096a59 ec2_snitch: reindent load_config
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
1c1a048d3f ec2_snitch: coroutinize load_config
Fixes #16241

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:48 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Tomasz Grabiec
b06a0078fb Merge 'Support for sending tablet info to the drivers' from Sylwia Szunejko
There is a need for sending tablet info to the drivers so they can be tablet aware. For the best performance we want to get this info lazily only when it is needed.

The info is sent when the driver asks about data that a specific tablet contains and the request is directed to the wrong node/shard, so that the driver can use that information for every subsequent query. If the query is sent to the wrong node/shard, we want to send the RESULT message with additional information about the tablet (replicas and token range) in custom_payload.

Mechanism for sending custom_payload added.

Sending custom_payload tested using a three-node cluster and cqlsh queries. I used RF=1 so choosing the wrong node was testable.

I also manually tested it with the python-driver and confirmed that the tablet info can be deserialized properly.

Automatic tests added.

Closes scylladb/scylladb#15410

* github.com:scylladb/scylladb:
  docs: add documentation about sending tablet info to protocol extensions
  Add tests for sending tablet info
  cql3: send tablet if wrong node/shard is used during modification statement
  cql3: send tablet if wrong node/shard is used during select statement
  locator: add function to check locality
  locator: add function to check if host is local
  transport: add function to add tablet info to the result_message
  transport: add support for setting custom payload
2023-11-22 17:44:07 +02:00
sylwiaszunejko
954d51389c locator: add function to check locality 2023-11-22 09:23:43 +01:00
sylwiaszunejko
a0c8531875 locator: add function to check if host is local 2023-11-21 15:15:20 +01:00
Kefu Chai
efd65aebb2 build: cmake: add check-header target
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.

we generate a rule for each .hh file to create a corresponding
.cc and then compile it, in order to verify the self-containedness of
that header. so the number of rules is quite large. to avoid the
unnecessary overhead, the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15913
2023-11-13 10:27:06 +02:00
Benny Halevy
a1acf6854b everywhere: reduce dependencies on i_partitioner.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-05 20:47:44 +02:00
Benny Halevy
6de1cc2993 locator: resolve the dependency of token_metadata.hh on token_range_splitter.hh
define token_metadata_ptr in token_metadata_fwd.hh
So that the declaration of `make_splitter` can be moved
to token_range_splitter.hh, where it belongs,
and so token_metadata.hh won't have to include it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-05 20:01:29 +02:00
Tomasz Grabiec
d5539e080d tablets: Implement cleanup step
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.

The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.

This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata, leading to
an abort due to the failed memtable flush.
2023-09-14 12:45:10 +02:00
Tomasz Grabiec
6a62aca3a9 locator: Introduce tablet_metadata_guard
Will be used to synchronize long-running tablet operations with
topology coordinator.

It blocks barriers like erm_ptr, but refreshes if change is
irrelevant, so behaves as if the erm_ptr's scope was narrowed down to
a single tablet.
2023-09-14 12:45:10 +02:00
Tomasz Grabiec
532ec84210 locator, replica: Add a way to wait for table's effective_replication_map change 2023-09-14 12:08:54 +02:00
Benny Halevy
7119c1d8cc token_metadata: update_topology: make endpoint_dc_rack arg optional
It's better to pass a disengaged optional when
the caller doesn't have the information rather than
passing the default dc_rack location so the latter
will never implicitly override a known endpoint dc/rack location.
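
A minimal sketch of the idea, assuming a simplified topology_sketch class; the fallback to a default location for a previously unknown endpoint is an assumption made here for illustration and is not taken from the commit.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real types.
using inet_address = std::string;
struct endpoint_dc_rack {
    std::string dc;
    std::string rack;
};

class topology_sketch {
    std::map<inet_address, endpoint_dc_rack> _nodes;
public:
    // A disengaged optional means "the caller does not know the location".
    void update_topology(const inet_address& ep, std::optional<endpoint_dc_rack> dr) {
        auto it = _nodes.find(ep);
        if (it == _nodes.end()) {
            // Assumption: unknown endpoints start at a default location.
            _nodes.emplace(ep, dr.value_or(endpoint_dc_rack{"default_dc", "default_rack"}));
        } else if (dr) {
            // Only an explicitly provided location replaces a known one.
            it->second = std::move(*dr);
        }
    }

    const endpoint_dc_rack& location(const inet_address& ep) const {
        return _nodes.at(ep);
    }
};
```

Calling update_topology with std::nullopt on an endpoint whose dc/rack is already known leaves that location untouched.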

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15300
2023-09-11 16:16:19 +02:00
Botond Dénes
b062b245ad Merge 'Don't cache dc:rack on system keyspace local cache' from Pavel Emelyanov
The local node's dc:rack pair is cached on system keyspace on start. However, most of the other code doesn't need it, as it gets dc:rack from topology or directly from snitch. There are a few places left that still mess with the sysks cache, but they are easy to patch. So after this patch all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.

Closes #15280

* github.com:scylladb/scylladb:
  system_keyspace: Don't require snitch argument on start
  system_keyspace: Don't cache local dc:rack pair
  system_keyspace: Save local info with explicit location
  storage_service: Get endpoint location from snitch, not system keyspace
  snitch: Introduce and use get_location() method
  repair: Local location variables instead of system keyspace's one
  repair: Use full endpoint location instead of datacenter part
2023-09-11 10:26:26 +03:00
Benny Halevy
574c7e349a locator: topology: is_configured_this_node: delete spurious semicolon
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-06 12:24:09 +03:00
Benny Halevy
115462be17 locator: topology: is_configured_this_node: compare host_id first
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.

Prepare for extending the token_metadata interface
to provide host_id in update_topology.

We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).
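
The comparison order can be sketched as below. The struct names, the use of 0 as a null host_id, and the fallback to the IP address when a host_id is unknown are assumptions for illustration only.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the real types; 0 plays the role of a null host_id.
using inet_address = std::string;
using host_id = unsigned;
constexpr host_id null_host_id = 0;

struct topology_config_sketch {
    host_id this_host_id;
    inet_address this_endpoint;
};

struct node_sketch {
    host_id id;
    inet_address ip;
};

// Compare host_id first, so that a node being replaced with the same IP
// is not mistaken for this node. Assumed fallback: compare the IP only
// when one of the host_ids is unknown.
bool is_configured_this_node_sketch(const topology_config_sketch& cfg, const node_sketch& n) {
    if (cfg.this_host_id != null_host_id && n.id != null_host_id) {
        return cfg.this_host_id == n.id;
    }
    return cfg.this_endpoint == n.ip;
}
```

With this order, a replaced node that shares this node's IP but has a different host_id is correctly reported as a different node.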

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-06 12:24:09 +03:00
Pavel Emelyanov
d2bd203cba snitch: Introduce and use get_location() method
There are some places out there that generate locator::endpoint_dc_rack
pair out of snitch's get_datacenter() and get_rack() calls. Generalize
those with snitch's new method. It will also be used by next patch.
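
In outline, the helper just bundles the two existing getters. The sketch below assumes a simplified snitch_sketch class and endpoint_dc_rack struct rather than the real snitch interface.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical simplified location pair.
struct endpoint_dc_rack {
    std::string dc;
    std::string rack;
};

class snitch_sketch {
    std::string _dc;
    std::string _rack;
public:
    snitch_sketch(std::string dc, std::string rack)
        : _dc(std::move(dc)), _rack(std::move(rack)) {}

    const std::string& get_datacenter() const { return _dc; }
    const std::string& get_rack() const { return _rack; }

    // Replaces ad-hoc {get_datacenter(), get_rack()} pairs at call sites.
    endpoint_dc_rack get_location() const {
        return endpoint_dc_rack{get_datacenter(), get_rack()};
    }
};
```

Call sites that previously built the pair by hand would call get_location() instead.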

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:52:30 +03:00
Benny Halevy
5afc242814 token_metadata: get_endpoint_to_host_id_map_for_reading: just inform that normal node has null host_id
It is too early to require that all nodes in normal state
have a non-null host_id.

The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id
and we patch it in later on.

In the future we should be able to require that
once we identify nodes by host_id over gossiper
and in token_metadata.

Fixes scylladb/scylladb#15181

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15184
2023-08-28 21:40:55 +03:00
Botond Dénes
e7af2a7de8 Merge 'token_metadata::get_endpoint_to_host_id_map_for_reading: restrict to token owners' from Benny Halevy
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.

Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793

Closes #14844

* github.com:scylladb/scylladb:
  api: storage_service: improve description of /storage_service/host_id
  token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
2023-08-23 13:55:14 +03:00
Kamil Braun
cdc3cd2b79 Merge 'raft: add fencing tests' from Petr Gusev
In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.

The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.

Closes #14901

* github.com:scylladb/scylladb:
  test_fencing: add test_fence_hints
  test.py: output the skipped tests
  test.py: add skip_mode decorator and fixture
  test.py: add mode fixture
  hints: add debug log for dropped hints
  hints: send_one_hint: extend the scope of file_send_gate holder
  pylib: add ScyllaMetrics
  hints manager: add send_errors counter
  token_metadata: add debug logs
  fencing: add simple data plane test
  random_tables.py: add counter column type
  raft topology: don't increment version when transitioning to node_state::normal
2023-08-22 16:28:21 +02:00