locator_topology_test, network_topology_strategy_test and
tablets_test are fully switched to the host_id-based token_metadata,
meaning they no longer populate the old token_metadata.
All the boost and topology tests pass with this change.
In this commit we switch the function
calculate_effective_replication_map to use the new
token_metadata. We do this by employing our new helper
calculate_natural_ips function. We can't use this helper for
current_endpoints/target_endpoints though,
since in that case we won't add the IP to the
pending_endpoints in the replace-with-same-ip scenario
The token_metadata_test is migrated to host_ids in the same
commit to make it pass. Other tests work because they fill
both versions of the token_metadata, but for this test it was
simpler to just migrate it straight away. The test constructs
the old token_metadata over the new token_metadata,
this means only the get_new() method will work on it. That's
why we also need to switch some other functions
(maybe_remove_node_being_replaced, do_get_natural_endpoints,
get_replication_factor) to the new version in the same commit.
All the boost and topology tests pass with this change.
We've updated all the places where token_metadata
is mutated, and now we can progress to the next stage
of the refactoring - gradually switching the read
code paths.
The calculate_natural_endpoints function
is at the core of all of them. It decides to what nodes
the given token should be replicated to for the given
token_metadata. It has a lot of usages in various contexts,
we can't switch them all in one commit, so instead we
allowed the function to behave in both ways. If
use_host_id parameter is false, the function uses the provided
token_metadata as is and returns endpoint_set as a result.
If it's true, it uses get_new() on the provided token_metadata
and returns host_id_set as a result.
The scope of the whole refactoring is limited to the erm data
structure, its interface will be kept inet_address based for now.
This means we'll often need to resolve host_ids to inet_address-es
as soon as we got a result from calculated_natural_endpoints.
A new calculate_natural_ips function is added for convenience.
It uses the new token_metadata and immediately resolves
returned host_id-s to inet_address-es.
The auxiliary declarations natural_ep_type, set_type, vector_type,
get_self_id, select_tm are introduced only for the sake of
migration, they will be removed later.
When we're replacing a node with the same IP address, we want
the following behavior:
* host_id -> IP mapping should work and return the same IP address for two
different host_ids - old and new.
* the IP -> host_id mapping should return the host_id of the old (replaced)
host.
This variant is most convenient for preserving the current behavior
of the code, especially the functions maybe_remove_node_being_replaced,
erm::get_natural_endpoints_without_node_being_replaced,
erm::get_pending_endpoints. The 'being_replaced' node will be properly removed in
maybe_remove_node_being_replaced and 'replacing' node will be added to
the pending_endpoints.
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.
We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
It's a bug to use get_host_id on a non-existent endpoint,
so on_internal_error is more appropriate. Also, it's
easier to debug since it provides a backtrace.
If a missing inet_address is expected, get_host_id_if_known
should be used instead. We update one such case in
storage_service::force_remove_completion. Other
usages of get_host_id are correct.
In this commit we enhance token_metadata with a pointer to the
new host_id-based generic_token_metadata specialisation (token_metadata2).
The idea is that in the following commits we'll go over all token_metadata
modifications and make the corresponding modifications to its new
host_id-based alternative.
The pointer to token_metadata2 is stored in the
generic_token_metadata::_new_value field. The pointer can be
mutable, immutable, or absent altogether (std::monostate).
It's mutable if this generic_token_metadata owns it, meaning
it was created using the generic_token_metadata(config cfg)
constructor. It's immutable if the
generic_token_metadata(lw_shared_ptr<const token_metadata2> new_value);
constructor was used. This means this old token_metadata is a wrapper for
new token_metadata and we can only use the get_new() method on it. The field
_new_value is empty for the new host_id-based token_metadata version.
The generic_token_metadata(std::unique_ptr<token_metadata_impl<NodeId>> impl, token_metadata2 new_value);
constructor is used for clone methods. We clone both versions,
and we need to pass a cloned token_metadata2 into constructor.
There are two overloads of get_new, for mutable and immutable
generic_token_metadata. Both of them throws an exception if
they can't get the appropriate pointer. There is also a
get_new_strong method, which returns an immutable owning
pointer. This is convenient since a lot of API's want an
owning pointer. We can't make the get_new/get_new_strong API
simpler and use get_new_strong everywhere since it mutate the
original generic_token_metadata by incrementing the reference
counter and this causes raises when it's passed between
shards in replicate_to_all_cores.
NodeId is used in all internal token_metadata data structures, that
previously used inet_address. We choose topology::key_kind based
on the value of the template parameter.
generic_token_metadata::update_topology overload with host_id
parameter is added to make update_topology_change_info work,
it now uses NodeId as a parameter type.
topology::remove_endpoint(host_id) is added to make
generic_token_metadata::remove_endpoint(NodeId) work.
pending_endpoints_for and endpoints_for_reading are just removed - they
are not used and not implemented. The declarations were left by mistake
from a refactoring in which these methods were moved to erm.
generic_token_metadata_base is extracted to contain declarations, common
to both token_metadata versions.
Templates are explicitly instantiated inside token_metadata.cc, since
implementation part is also a template and it's not exposed to the header.
There are no other behavioral changes in this commit, just syntax
fixes to make token_metadata a template.
In the next commits token_metadata will be
made a template with NodeId=inet_address|host_id
parameter. This parameter will be passed to dc_rack_fn
function, so it also should be made a template.
For the host_id-based token_metadata we want host_id
to be the main node key, meaning it should be used
in add_or_update_endpoint to find the node to update.
For the inet_address-based token_metadata version
we want to retain the old behaviour during transition period.
In this commit we introduce key_kind parameter and use
key_kind::inet_address in all current topology usages.
Later we'll use key_kind::host_id for the new token_metadata.
In the last commits of the series, when the new token_metadata
version is used everywhere, we will remove key_kind enum.
In subsequent commits we'll need the following api for token_metadata:
token_metadata(token_metadata2_ptr);
get_new() -> token_metadata2*
where token_metadata2 is the new version of token_metadata,
based on host_id.
In other words:
* token_metadata knows the new version of itself and returns a pointer
to it through get_new()
* token_metadata can be constructed based solely on the new version,
without its own implementation. In this case the only method we can
use on it is get_new.
This allows to pass token_metadata2 to API's with token_metadata in method
signature, if these APIs are known to only use the get_new method on the
passed token_metadata.
And back to topology_change_info - if we got it from the new token_metadata
we want to be able to construct token_metadata from token_metadata2 contained
in it, and this requires it to be a ptr, not value.
Reject ALTER KEYSPACE request for NetworkTopologyStrategy when
replication options are missed.
Also reject CREATE KEYSPACE with no replication factor options.
Cassandra has a default_keyspace_rf configuration that may allow such
CREATE KEYSPACE commands, but Scylla doesn't have this option (refs #16028).
fixes#10036Closesscylladb/scylladb#16221
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.
Also will be used in tests to ensure tablets stick to expected locations.
Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors.
Refs: https://github.com/scylladb/scylladb/issues/16255Closesscylladb/scylladb#16289
* github.com:scylladb/scylladb:
Update unified/build_unified.sh
Update main.cc
Update dist/common/scripts/scylla-housekeeping
Typos: fix typos in code
and set broadcast_address / broadcast_rpc_address in main
to remove this dependency of snitch on fb_utilities.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that ec2_snitch::load_config is a coroutine
there's no need for a seastar thread here either.
Refs #16241
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
There is a need for sending tablet info to the drivers so they can be tablet aware. For the best performance we want to get this info lazily only when it is needed.
The info is send when driver asks about the information that the specific tablet contains and it is directed to the wrong node/shard so it could use that information for every subsequent query. If we send the query to the wrong node/shard, we want to send the RESULT message with additional information about the tablet (replicas and token range) in custom_payload.
Mechanism for sending custom_payload added.
Sending custom_payload tested using three node cluster and cqlsh queries. I used RF=1 so choosing wrong node was testable.
I also manually tested it with the python-driver and confirmed that the tablet info can be deserialized properly.
Automatic tests added.
Closesscylladb/scylladb#15410
* github.com:scylladb/scylladb:
docs: add documentation about sending tablet info to protocol extensions
Add tests for sending tablet info
cql3: send tablet if wrong node/shard is used during modification statement
cql3: send tablet if wrong node/shard is used during select statement
locator: add function to check locality
locator: add function to check if host is local
transport: add function to add tablet info to the result_message
transport: add support for setting custom payload
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
define token_metadata_ptr in token_metadata_fwd.hh
So that the declaration of `make_splitter` can be moved
to token_range_splitter.hh, where it belongs,
and so token_metadata.hh won't have to include it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.
The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.
This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata. Leading to
abort due to failed memtable flush.
Will be used to synchronize long-running tablet operations with
topology coordinator.
It blocks barriers like erm_ptr, but refreshes if change is
irrelevant, so behaves as if the erm_ptr's scope was narrowed down to
a single tablet.
It's better to pass a disengaged optional when
the caller doesn't have the information rather than
passing the default dc_rack location so the latter
will never implicitly override a known endpoint dc/rack location.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15300
The local node's dc:rack pair is cached on system keyspace on start. However, most of other code don't need it as they get dc:rack from topology or directly from snitch. There are few places left that still mess with sysks cache, but they are easy to patch. So after this patch all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.
Closes#15280
* github.com:scylladb/scylladb:
system_keyspace: Don't require snitch argument on start
system_keyspace: Don't cache local dc:rack pair
system_keyspace: Save local info with explicit location
storage_service: Get endpoint location from snitch, not system keyspace
snitch: Introduce and use get_location() method
repair: Local location variables instead of system keyspace's one
repair: Use full endpoint location instead of datacenter part
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.
Prepare for extending the token_metadata interface
to provide host_id in update_topology.
We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There are some places out there that generate locator::endpoint_dc_rack
pair out of snitch's get_datacenter() and get_rack() calls. Generalize
those with snitch's new method. It will also be used by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is too early to require that all nodes in normal state
have a non-null host_id.
The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id
and we patch it in later on.
In the future we should be able to require that
once we identify nodes by host_id over gossiper
and in token_metadata.
Fixesscylladb/scylladb#15181
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15184
And verify the they returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Closes#14844
* github.com:scylladb/scylladb:
api: storage_service: improve description of /storage_service/host_id
token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.
The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.
Closes#14901
* github.com:scylladb/scylladb:
test_fencing: add test_fence_hints
test.py: output the skipped tests
test.py: add skip_mode decorator and fixture
test.py: add mode fixture
hints: add debug log for dropped hints
hints: send_one_hint: extend the scope of file_send_gate holder
pylib: add ScyllaMetrics
hints manager: add send_errors counter
token_metadata: add debug logs
fencing: add simple data plane test
random_tables.py: add counter column type
raft topology: don't increment version when transitioning to node_state::normal