Commit Graph

40184 Commits

Author SHA1 Message Date
Petr Gusev
b6fbbe28aa storage_service: topology_state_load: fill new token_metadata
For each inet_address-based modification of token_metadata we
make a corresponding host_id-based change in token_metadata->get_new().

The _gossiper.add_saved_endpoint logic is switched to the new token_metadata.
2023-12-11 12:51:34 +04:00
Piotr Dulikowski
e7e1c4e63c storage_service: adjust update_topology_change_info to update new token_metadata
Both versions of the token_metadata need to be updated. For
the new version we provide a dc_rack_fn function which looks
for dc_rack by host_id in topology_state_machine if raft
topology is on. Otherwise, it looks for IP for the given
host_id and falls back to the gossiper-based function
get_dc_rack_for.
2023-12-11 12:51:34 +04:00
Petr Gusev
66c30e4f8e topology: set self host_id on the new topology
With this commit, we begin the next stage of the
refactoring - updating the new version of the token_metadata
in all places where the old version is currently being updated.

In this commit we assign host_id of this node, both in main.cc
and in boost tests.
2023-12-11 12:51:34 +04:00
Petr Gusev
e4253776a1 locator::topology: allow being_replaced and replacing nodes to have the same IP
When we're replacing a node with the same IP address, we want
the following behavior:
  * host_id -> IP mapping should work and return the same IP address for two
  different host_ids - old and new.
  * the IP -> host_id mapping should return the host_id of the old (replaced)
  host.
This variant is most convenient for preserving the current behavior
of the code, especially the functions maybe_remove_node_being_replaced,
erm::get_natural_endpoints_without_node_being_replaced,
erm::get_pending_endpoints. The 'being_replaced' node will be properly removed in
maybe_remove_node_being_replaced and 'replacing' node will be added to
the pending_endpoints.
2023-12-11 12:51:34 +04:00
Petr Gusev
5a1418fdba token_metadata: get_endpoint_for_host_id -> get_endpoint_for_host_id_if_known
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.

We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
2023-12-11 12:51:34 +04:00
Petr Gusev
08b47d645a token_metadata: get_host_id: exception -> on_internal_error
It's a bug to use get_host_id on a non-existent endpoint,
so on_internal_error is more appropriate. Also, it's
easier to debug since it provides a backtrace.

If a missing inet_address is expected, get_host_id_if_known
should be used instead. We update one such case in
storage_service::force_remove_completion. Other
usages of get_host_id are correct.
2023-12-11 12:51:34 +04:00
Petr Gusev
39bbe5f457 token_metadata: add get_all_ips method
This is convenient for migrating code that uses
get_all_endpoints.
2023-12-11 12:51:34 +04:00
Petr Gusev
9edf0709e6 token_metadata: support host_id-based version
In this commit we enhance token_metadata with a pointer to the
new host_id-based generic_token_metadata specialisation (token_metadata2).
The idea is that in the following commits we'll go over all token_metadata
modifications and make the corresponding modifications to its new
host_id-based alternative.

The pointer to token_metadata2 is stored in the
generic_token_metadata::_new_value field. The pointer can be
mutable, immutable, or absent altogether (std::monostate).
It's mutable if this generic_token_metadata owns it, meaning
it was created using the generic_token_metadata(config cfg)
constructor. It's immutable if the
generic_token_metadata(lw_shared_ptr<const token_metadata2> new_value);
constructor was used. This means this old token_metadata is a wrapper for
new token_metadata and we can only use the get_new() method on it. The field
_new_value is empty for the new host_id-based token_metadata version.

The generic_token_metadata(std::unique_ptr<token_metadata_impl<NodeId>> impl, token_metadata2 new_value);
constructor is used for clone methods. We clone both versions,
and we need to pass a cloned token_metadata2 into constructor.

There are two overloads of get_new, for mutable and immutable
generic_token_metadata. Both of them throws an exception if
they can't get the appropriate pointer. There is also a
get_new_strong method, which returns an immutable owning
pointer. This is convenient since a lot of API's want an
owning pointer. We can't make the get_new/get_new_strong API
simpler and use get_new_strong everywhere since it mutate the
original generic_token_metadata by incrementing the reference
counter and this causes raises when it's passed between
shards in replicate_to_all_cores.
2023-12-11 12:51:34 +04:00
Petr Gusev
63f64f3303 token_metadata: make it a template with NodeId=inet_address/host_id
NodeId is used in all internal token_metadata data structures, that
previously used inet_address. We choose topology::key_kind based
on the value of the template parameter.

generic_token_metadata::update_topology overload with host_id
parameter is added to make update_topology_change_info work,
it now uses NodeId as a parameter type.

topology::remove_endpoint(host_id) is added to make
generic_token_metadata::remove_endpoint(NodeId) work.

pending_endpoints_for and endpoints_for_reading are just removed - they
are not used and not implemented. The declarations were left by mistake
from a refactoring in which these methods were moved to erm.

generic_token_metadata_base is extracted to contain declarations, common
to both token_metadata versions.

Templates are explicitly instantiated inside token_metadata.cc, since
implementation part is also a template and it's not exposed to the header.

There are no other behavioral changes in this commit, just syntax
fixes to make token_metadata a template.
2023-12-11 12:51:34 +04:00
Petr Gusev
c9fbe3d377 locator: make dc_rack_fn a template
In the next commits token_metadata will be
made a template with NodeId=inet_address|host_id
parameter. This parameter will be passed to dc_rack_fn
function, so it also should be made a template.
2023-12-11 12:51:33 +04:00
Piotr Dulikowski
5227b71363 locator/topology: add key_kind parameter
For the host_id-based token_metadata we want host_id
to be the main node key, meaning it should be used
in add_or_update_endpoint to find the node to update.
For the inet_address-based token_metadata version
we want to retain the old behaviour during transition period.

In this commit we introduce key_kind parameter and use
key_kind::inet_address in all current topology usages.
Later we'll use key_kind::host_id for the new token_metadata.

In the last commits of the series, when the new token_metadata
version is used everywhere, we will remove key_kind enum.
2023-12-11 12:51:33 +04:00
Petr Gusev
2f137776c3 token_metadata: topology_change_info: change field types to token_metadata_ptr
In subsequent commits we'll need the following api for token_metadata:
  token_metadata(token_metadata2_ptr);
  get_new() -> token_metadata2*
where token_metadata2 is the new version of token_metadata,
based on host_id.

In other words:
* token_metadata knows the new version of itself and returns a pointer
to it through get_new()
* token_metadata can be constructed based solely on the new version,
without its own implementation. In this case the only method we can
use on it is get_new.

This allows to pass token_metadata2 to API's with token_metadata in method
signature, if these APIs are known to only use the get_new method on the
passed token_metadata.

And back to topology_change_info - if we got it from the new token_metadata
we want to be able to construct token_metadata from token_metadata2 contained
in it, and this requires it to be a ptr, not value.
2023-12-11 12:51:33 +04:00
Petr Gusev
f21f23483c token_metadata: drop unused method get_endpoint_to_token_map_for_reading 2023-12-11 12:51:22 +04:00
Alexander Turetskiy
f30b5473ab cql: Reject empty options while altering a keyspace
Reject ALTER KEYSPACE request for NetworkTopologyStrategy when
replication options are missed.

Also reject CREATE KEYSPACE with no replication factor options.
Cassandra has a default_keyspace_rf configuration that may allow such
CREATE KEYSPACE commands, but Scylla doesn't have this option (refs #16028).

fixes #10036

Closes scylladb/scylladb#16221
2023-12-10 17:44:35 +02:00
Kefu Chai
818343b57d build: build session.cc in CMake building system
this source file was added in d3d83869. so let's update cmake
as well.

sessions_tests was added in the same commit, so add it as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16344
2023-12-09 22:14:47 +02:00
Avi Kivity
d62a5fc60b Merge 'tools/scylla-nodetool: implement additional commands, part 5/N ' from Botond Dénes
This PR implements the following new nodetool commands:
* decomission
* rebuild
* removenode
* getlogginglevels
* setlogginglevel
* move
* refresh

All commands come with tests and all tests pass with both the new and the current nodetool implementations.

Refs: https://github.com/scylladb/scylladb/issues/15588

Closes scylladb/scylladb#16348

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: implement the refresh command
  tools/scylla-nodetool: implement the move command
  tools/scylla-nodetool: implement setlogginglevel command
  tools/sclla-sstable: implement the getlogginglevels command
  tools/scylla-nodetool: implement the removenode command
  tools/scylla-nodetool: implement the rebuild command
  tools/scylla-nodetool: implement the decommission command
2023-12-09 21:47:22 +02:00
Pavel Emelyanov
5e69415387 guardrails: Do not validate initial_tablets as replication factor
When checking replication strategy options the code assumes (and it's
stated in the preceeding code comment) that all options are replication
factors. Nowadays it's no longer so, the initial_tablets one is not
replication factor and should be skipped

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#16335
2023-12-09 15:56:41 +02:00
Botond Dénes
496459165e tools/scylla-nodetool: implement the refresh command 2023-12-08 08:58:16 -05:00
Botond Dénes
ad148a9dbc tools/scylla-nodetool: implement the move command
In the java nodetool, this command ends up calling an API endpoint which
just throws an exception saying moving tokens is not supported. So in
the native implementation we just throw an exception to the same effect
in scylla-nodetool itself.
2023-12-08 08:29:39 -05:00
Botond Dénes
58d3850da1 tools/scylla-nodetool: implement setlogginglevel command 2023-12-08 08:18:56 -05:00
Botond Dénes
3a8590e1af tools/sclla-sstable: implement the getlogginglevels command 2023-12-08 07:32:45 -05:00
Botond Dénes
c35ed794de tools/scylla-nodetool: implement the removenode command 2023-12-08 07:32:31 -05:00
Botond Dénes
9a484cb145 tools/scylla-nodetool: implement the rebuild command 2023-12-08 07:05:30 -05:00
Botond Dénes
ea62f7c848 tools/scylla-nodetool: implement the decommission command 2023-12-08 06:14:36 -05:00
Kefu Chai
893f319004 sstables: add formatter for index_consume_entry_context_state
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we

* move the enum definition out of the class and before the
  class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().

as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start build with fmt v10.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16204
2023-12-08 12:45:38 +02:00
Kurashkin Nikita
c071cd92b5 cql3:statement_restrictions.cc add more conditions to prevent "allow filtering" error to pop up in delete/update statements
Modified Cassandra tests to check for Scylla's error messages
Fixes #12474

Closes scylladb/scylladb#15811
2023-12-07 21:25:18 +02:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.

This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.

This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Avi Kivity
4b1ef00dbb Merge 'File stream for tablet preparation' from Asias He
This series adds preparation patches for file stream tablet implementation in enterprise branch. It minimizes the differences between those two branches.

Closes scylladb/scylladb#16297

* github.com:scylladb/scylladb:
  messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
  compaction_group_for_token: Handle minimum_token and maximum_token token
  serializer: Add temporary_buffer support
  cql_test_env: Allow messaging_service to start listen
2023-12-07 16:26:22 +02:00
Avi Kivity
ed2a9b8750 Merge 'Commitlog: Fix reading/writing position calculations and allocation size checks' from Calle Wilund
Fixes #16298

The adjusted buffer position calculation in buffer_position(), introduced in https://github.com/scylladb/scylladb/pull/15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.

However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.

Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.

However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.

Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.

Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.

Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.

Fixes #16301

The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.

Closes scylladb/scylladb#16302

* github.com:scylladb/scylladb:
  commitlog: Fix allocation size check to take sector overhead into account.
  commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
2023-12-07 12:27:54 +02:00
Botond Dénes
fb9379edf1 test/cql-pytest: test_select_from_mutation_fragments: bump timeout for slow test
The test test_many_partitions is very slow, as it tests a slow scan over
a lot of partitions. This was observed to time out on the slower ARM
machines, making the test flaky. To prevent this, create an
extra-patient cql connection with a 10 minutes timeout for the scan
itself.

Fixes: #16145

Closes scylladb/scylladb#16303
2023-12-07 11:55:53 +02:00
Yaniv Kaul
862909ee4f Typos: fix typos in documentation
Using codespell, went over the docs and fixed some typos.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#16275
2023-12-07 11:10:17 +02:00
Anna Stuchlik
8b01cb7fb8 doc: set 5.4 as the latest stable version
This commit updates the configuration for
ScyllaDB documentation so that:
- 5.4 is the latest version.
- 5.4 is removed from the list of unstable versions.

It must be merged when ScyllaDB 5.4 is released.

No backport is required.

Closes scylladb/scylladb#16308
2023-12-07 10:04:26 +02:00
Calle Wilund
dba39b47bd commitlog: Fix allocation size check to take sector overhead into account.
Fixes #16301

The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
2023-12-07 07:36:27 +00:00
Calle Wilund
0d35c96ef4 commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
Fixes #16298

The adjusted buffer position calculation in buffer_position(), introduced in #15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.

However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.

Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.

However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.

Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.

Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.

Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
2023-12-07 07:36:27 +00:00
Asias He
6beadab9e6 messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
They will be used to implement file stream for tablet in the future. Reserve
the verb ID.
2023-12-07 14:54:12 +08:00
Asias He
67cfa12c7d compaction_group_for_token: Handle minimum_token and maximum_token token
The following error was seen:

[shard 0] table - compaction_group_for_token: compaction_group idx=0 range=(minimum
token,-6917529027641081857] does not contain token=minimum token

Since minimum_token or maximum_token will not be inside a token range. Skip
the in token range check.
2023-12-07 14:54:12 +08:00
Asias He
974b28a750 serializer: Add temporary_buffer support
It will be used by file stream for tablet.
2023-12-07 09:46:37 +08:00
Asias He
faaf58f62c cql_test_env: Allow messaging_service to start listen
This is needed for rpc calls to work in the tests. With this patch, by
default, messaging_service does not listen as it was before.

This is useful for file stream for tablet test.
2023-12-07 09:46:36 +08:00
Avi Kivity
92d61def57 Merge 'scylla_swap_setup: run error check before allocating swap and increase swap allocation speed' from Takuya ASADA
This patch fixes error check and speed up swap allocation.

Following patches are included:
 - scylla_swap_setup: run error check before allocating swap
   avoid create swapfile before running error check
 - scylla_swap_setup: use fallocate on ext4
   this inclease swap allocation speed on ext4

Closes scylladb/scylladb#12668

* github.com:scylladb/scylladb:
  scylla_swap_setup: use fallocate on ext4
  scylla_swap_setup: run error check before allocating swap
2023-12-06 21:40:10 +02:00
Avi Kivity
55dacb8480 Merge 'Generalize atomic sstables deletion' from Pavel Emelyanov
The current implementation starts in sstables_manager that gets the deletion function from storage which, in turn, should atomically do sst.unlink() over a list of sstables (s3 driver is still not atomic though #13567).

This PR generalizes the atomic deletion inside sstables_manager method and removes the atomic deletor function that nobody liked when it was introduced (#13562)

Closes scylladb/scylladb#16290

* github.com:scylladb/scylladb:
  sstables/storage: Drop atomic deleter
  sstables/storage: Reimplement atomic deletion in sstables_manager
  sstables/storage: Add prepare/complete skaffold for atomic deletion
2023-12-06 19:48:07 +02:00
Tomasz Grabiec
7d0f4c10a2 test: tablets: Add test for failed streaming being fenced away 2023-12-06 18:37:01 +01:00
Tomasz Grabiec
083a0279a9 error_injection: Introduce poll_for_message()
To allow more complex waiting, which involves other exit conditions.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
ce0dc9e940 error_injection: Make is_enabled() public 2023-12-06 18:36:17 +01:00
Tomasz Grabiec
733eb21601 api: Add API to kill connection to a particular host
For testing failure scenarios.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
9dac0febce range_streamer: Do not block topology change barriers around streaming
Streaming was keeping effective_replication_map_ptr around the whole
process, which blocks topology change barriers.

This will inhibit progress of tablet load balancer or concurrent
migrations, resulting in worse performance.

Fix by switching to the most recent erm on sharder
calls. multishard_writer calls shard_of() for each new partition.

A better way would be to switch immediately when topology version
changes, but this is left for later.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
c228f2c940 range_streamer, tablets: Do not keep token metadata around streaming
It holds back global token metadata barrier during streaming, which
limits parallelism of load balancing.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
7a59acf248 tablets: Fail gracefully when migrating tablet has no pending replica
Before the patch we SIGSEGV trying to access pending replica in this
case. Fail early instead.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

Also will be used in tests to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
1f57d1ea28 storage_service, api: Add API to migrate a tablet
Will be used in tests, or for hot fixes in production.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
31c995332c storage_service, raft topology: Run streaming under session topology guard
Prevents stale streaming operation from running beyond topology
operation they were started in. After the session field is cleared, or
changed to something else, the old topology_guard used by streaming is
interrupted and fenced and the next barrier will join with any
remaining work.
2023-12-06 18:36:17 +01:00