excise is called from handle_state_left, the endpoint
may have already been removed from tm by then -
test_raft_upgrade_majority_loss fails if we use
unconditional tmptr->get_new()->get_host_id
instead of get_host_id_if_known
For each inet_address-based modification of token_metadata we
make a corresponding host_id-based change in token_metadata->get_new().
The _gossiper.add_saved_endpoint logic is switched to the new token_metadata.
Both versions of the token_metadata need to be updated. For
the new version we provide a dc_rack_fn function which looks
for dc_rack by host_id in topology_state_machine if raft
topology is on. Otherwise, it looks for IP for the given
host_id and falls back to the gossiper-based function
get_dc_rack_for.
With this commit, we begin the next stage of the
refactoring - updating the new version of the token_metadata
in all places where the old version is currently being updated.
In this commit we assign host_id of this node, both in main.cc
and in boost tests.
When we're replacing a node with the same IP address, we want
the following behavior:
* host_id -> IP mapping should work and return the same IP address for two
different host_ids - old and new.
* the IP -> host_id mapping should return the host_id of the old (replaced)
host.
This variant is most convenient for preserving the current behavior
of the code, especially the functions maybe_remove_node_being_replaced,
erm::get_natural_endpoints_without_node_being_replaced,
erm::get_pending_endpoints. The 'being_replaced' node will be properly removed in
maybe_remove_node_being_replaced and 'replacing' node will be added to
the pending_endpoints.
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.
We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
It's a bug to use get_host_id on a non-existent endpoint,
so on_internal_error is more appropriate. Also, it's
easier to debug since it provides a backtrace.
If a missing inet_address is expected, get_host_id_if_known
should be used instead. We update one such case in
storage_service::force_remove_completion. Other
usages of get_host_id are correct.
In this commit we enhance token_metadata with a pointer to the
new host_id-based generic_token_metadata specialisation (token_metadata2).
The idea is that in the following commits we'll go over all token_metadata
modifications and make the corresponding modifications to its new
host_id-based alternative.
The pointer to token_metadata2 is stored in the
generic_token_metadata::_new_value field. The pointer can be
mutable, immutable, or absent altogether (std::monostate).
It's mutable if this generic_token_metadata owns it, meaning
it was created using the generic_token_metadata(config cfg)
constructor. It's immutable if the
generic_token_metadata(lw_shared_ptr<const token_metadata2> new_value);
constructor was used. This means this old token_metadata is a wrapper for
new token_metadata and we can only use the get_new() method on it. The field
_new_value is empty for the new host_id-based token_metadata version.
The generic_token_metadata(std::unique_ptr<token_metadata_impl<NodeId>> impl, token_metadata2 new_value);
constructor is used for clone methods. We clone both versions,
and we need to pass a cloned token_metadata2 into constructor.
There are two overloads of get_new, for mutable and immutable
generic_token_metadata. Both of them throws an exception if
they can't get the appropriate pointer. There is also a
get_new_strong method, which returns an immutable owning
pointer. This is convenient since a lot of API's want an
owning pointer. We can't make the get_new/get_new_strong API
simpler and use get_new_strong everywhere since it mutate the
original generic_token_metadata by incrementing the reference
counter and this causes raises when it's passed between
shards in replicate_to_all_cores.
NodeId is used in all internal token_metadata data structures, that
previously used inet_address. We choose topology::key_kind based
on the value of the template parameter.
generic_token_metadata::update_topology overload with host_id
parameter is added to make update_topology_change_info work,
it now uses NodeId as a parameter type.
topology::remove_endpoint(host_id) is added to make
generic_token_metadata::remove_endpoint(NodeId) work.
pending_endpoints_for and endpoints_for_reading are just removed - they
are not used and not implemented. The declarations were left by mistake
from a refactoring in which these methods were moved to erm.
generic_token_metadata_base is extracted to contain declarations, common
to both token_metadata versions.
Templates are explicitly instantiated inside token_metadata.cc, since
implementation part is also a template and it's not exposed to the header.
There are no other behavioral changes in this commit, just syntax
fixes to make token_metadata a template.
In the next commits token_metadata will be
made a template with NodeId=inet_address|host_id
parameter. This parameter will be passed to dc_rack_fn
function, so it also should be made a template.
For the host_id-based token_metadata we want host_id
to be the main node key, meaning it should be used
in add_or_update_endpoint to find the node to update.
For the inet_address-based token_metadata version
we want to retain the old behaviour during transition period.
In this commit we introduce key_kind parameter and use
key_kind::inet_address in all current topology usages.
Later we'll use key_kind::host_id for the new token_metadata.
In the last commits of the series, when the new token_metadata
version is used everywhere, we will remove key_kind enum.
In subsequent commits we'll need the following api for token_metadata:
token_metadata(token_metadata2_ptr);
get_new() -> token_metadata2*
where token_metadata2 is the new version of token_metadata,
based on host_id.
In other words:
* token_metadata knows the new version of itself and returns a pointer
to it through get_new()
* token_metadata can be constructed based solely on the new version,
without its own implementation. In this case the only method we can
use on it is get_new.
This allows to pass token_metadata2 to API's with token_metadata in method
signature, if these APIs are known to only use the get_new method on the
passed token_metadata.
And back to topology_change_info - if we got it from the new token_metadata
we want to be able to construct token_metadata from token_metadata2 contained
in it, and this requires it to be a ptr, not value.
Reject ALTER KEYSPACE request for NetworkTopologyStrategy when
replication options are missed.
Also reject CREATE KEYSPACE with no replication factor options.
Cassandra has a default_keyspace_rf configuration that may allow such
CREATE KEYSPACE commands, but Scylla doesn't have this option (refs #16028).
fixes#10036Closesscylladb/scylladb#16221
this source file was added in d3d83869. so let's update cmake
as well.
sessions_tests was added in the same commit, so add it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16344
This PR implements the following new nodetool commands:
* decomission
* rebuild
* removenode
* getlogginglevels
* setlogginglevel
* move
* refresh
All commands come with tests and all tests pass with both the new and the current nodetool implementations.
Refs: https://github.com/scylladb/scylladb/issues/15588Closesscylladb/scylladb#16348
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement the refresh command
tools/scylla-nodetool: implement the move command
tools/scylla-nodetool: implement setlogginglevel command
tools/sclla-sstable: implement the getlogginglevels command
tools/scylla-nodetool: implement the removenode command
tools/scylla-nodetool: implement the rebuild command
tools/scylla-nodetool: implement the decommission command
When checking replication strategy options the code assumes (and it's
stated in the preceeding code comment) that all options are replication
factors. Nowadays it's no longer so, the initial_tablets one is not
replication factor and should be skipped
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16335
In the java nodetool, this command ends up calling an API endpoint which
just throws an exception saying moving tokens is not supported. So in
the native implementation we just throw an exception to the same effect
in scylla-nodetool itself.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we
* move the enum definition out of the class and before the
class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().
as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start build with fmt v10.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16204
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
This series adds preparation patches for file stream tablet implementation in enterprise branch. It minimizes the differences between those two branches.
Closesscylladb/scylladb#16297
* github.com:scylladb/scylladb:
messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
compaction_group_for_token: Handle minimum_token and maximum_token token
serializer: Add temporary_buffer support
cql_test_env: Allow messaging_service to start listen
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in https://github.com/scylladb/scylladb/pull/15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Closesscylladb/scylladb#16302
* github.com:scylladb/scylladb:
commitlog: Fix allocation size check to take sector overhead into account.
commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
The test test_many_partitions is very slow, as it tests a slow scan over
a lot of partitions. This was observed to time out on the slower ARM
machines, making the test flaky. To prevent this, create an
extra-patient cql connection with a 10 minutes timeout for the scan
itself.
Fixes: #16145Closesscylladb/scylladb#16303
This commit updates the configuration for
ScyllaDB documentation so that:
- 5.4 is the latest version.
- 5.4 is removed from the list of unstable versions.
It must be merged when ScyllaDB 5.4 is released.
No backport is required.
Closesscylladb/scylladb#16308
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in #15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
The following error was seen:
[shard 0] table - compaction_group_for_token: compaction_group idx=0 range=(minimum
token,-6917529027641081857] does not contain token=minimum token
Since minimum_token or maximum_token will not be inside a token range. Skip
the in token range check.
This is needed for rpc calls to work in the tests. With this patch, by
default, messaging_service does not listen as it was before.
This is useful for file stream for tablet test.
This patch fixes error check and speed up swap allocation.
Following patches are included:
- scylla_swap_setup: run error check before allocating swap
avoid create swapfile before running error check
- scylla_swap_setup: use fallocate on ext4
this inclease swap allocation speed on ext4
Closesscylladb/scylladb#12668
* github.com:scylladb/scylladb:
scylla_swap_setup: use fallocate on ext4
scylla_swap_setup: run error check before allocating swap
The current implementation starts in sstables_manager that gets the deletion function from storage which, in turn, should atomically do sst.unlink() over a list of sstables (s3 driver is still not atomic though #13567).
This PR generalizes the atomic deletion inside sstables_manager method and removes the atomic deletor function that nobody liked when it was introduced (#13562)
Closesscylladb/scylladb#16290
* github.com:scylladb/scylladb:
sstables/storage: Drop atomic deleter
sstables/storage: Reimplement atomic deletion in sstables_manager
sstables/storage: Add prepare/complete skaffold for atomic deletion