Tablet keyspaces have per/table range ownership, which cannot currently
be expressed in a DESC CLUSTER statement, which describes range
ownership in the current keyspace (if set). Until we figure out how to
represent range ownership (tablets) of all tables of a keyspace, we
disable range ownership for tablet keyspaces.
Fixes: #16483Closesscylladb/scylladb#16713
This change is intended to remove the dependency to
operator<<(std::ostream&, const std::unordered_set<seastar::sstring>&)
from test/boost/cql_auth_query_test.cc.
It prepares the test for removal of the templated helpers.
Such removal is one of goals of the referenced issue that is linked below.
Refs: #13245
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#16758
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we
* define a formatter for `auth::resource` and friends,
* update their callers of `operator<<` to use `fmt::print()`.
* drop `operator<<`, as they are not used anymore.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16765
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define a formatter for data_value, but its
its operator<<() is preserved as we are still using the generic
homebrew formatter for formatting std::vector, which in turn uses
operator<< of the element type.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16767
because the CMake-generated build.ninja is located under build/,
and it puts the `scylla` executable at build/$CMAKE_BUILD_TYPE/scylla,
instead of at build/$scylla_build_mode/scylla, so let's adapt to this
change accordingly.
we will promote this change to a shared place if we have similar
needs in other tests as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16775
If the TABLETS map is missing in the CREATE KEYSPACE statement the
tablets are anyway enabled if the respective cluster feature is enabled.
To opt-out keyspaces one may use TABLETS = { 'enabled': false } syntax.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will make per-keyspace initial_tables option really
optional and turn tablets ON when the feature is ON. This will break all
other tests' assumptions, that they are testing vnodes replication. So
turn the feature off by default, tests that do need tables will need to
explicitly enable this feature on their own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In its current location it will be started with 3 pre-created scylla
nodes with default features ON. Next patch will exclude `tablets` from
the default list, so the test needs to create servers on its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When started cql_test_env creates a test keyspace. Some tablets test
cases create a table in this keyspace, but misuse the whole feature. The
thing is that while tablets feature is ON in those test cases, the
keyspace itself doesn _not_ have the initial_tables option and thus
tablets are not enabled for the ks' table for real. Currently test cases
work just because this table is only used as a transparent table ID
placeholder. If turning on tablets for the keyspace, several test cases
would get broken for two reasons.
First, the tables map will no longer be empty on test start.
Second, applying changes to tablet metadata may not be visible, becase
test case uses "ranom" timestamp, that can be less that the initial
metadata mutations' timestamp.
This patch fixes all three places:
1. enables tables for the test keyspace
2. removes assumption that the initial metadata is empty
3. uses large enough timestamp for subsequent mutations
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the user can do
CREATE KEYSPACE ... WITH TABLETS = { 'enabled': false }
to turn tablets off. It will be useful in the future to opt-out keyspace
from tablets when they will be turned on by default based on cluster
features only.
Also one can do just
CREATE KEYSPACE ... WITH TABLETS = { 'enabled': true }
and let Scylla select the initial tablets value by its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch changes the syntax of enabling tablets from
CREATE KEYSPACE ... WITH REPLICATION = { ..., 'initial_tablets': <int> }
to be
CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> }
and updates all tests accordingly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If user configured zero initial tablets (spoiler: or this value was set
automagically when enabling tablets begind the scenes) we still need
some value to start with and this patch calculates one.
The math is based on topology and RF so that all shards are covered:
initial_tablets = max(nr_shards_in(dc) / RF_in(dc) for dc in datacenters)
The estimation is done when a table is created, not when the keyspace is
created. For that, the keyspace is configured with zero initial tabled,
and table-creation time zero is converted into auto-estimated value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For correctness sstable cleanup has to run between (some) topology
changes. Sometimes even a failed topology change may require running
the cleanup. The series introduces automatic sstable cleanup step to the
topology change coordinator. Unlike other operations it is not represented
as a global transition state, but done by each node independently which
allows cleanup to run without locking the topology state machine so
tablet code can run in parallel with the cleanup.
It is done by having a cleanup state flag for each node in the
topology. The flag is a tri state: "clean" - the node is clean, "needed"
- cleanup is needed (but not running), "running" - cleanup is running. No
topology operation can proceed if there is a node in "running" state, but
some operation can proceed even if there are nodes in "needed" state. If
the coordinator needs to perform a topology operation that cannot run while
there are nodes that need cleanup the coordinator will start one
automatically and continue only after cleanup completes. There is also a
possibility to kick cleanup manually through the new RAFT API call.
* 'cleanup-needed-v8' of https://github.com/gleb-cloudius/scylla:
test: add test for automatic cleanup procedure
test: add test for topology requests queue management
storage_service: topology coordinator: add error injection point to be able to pause the topology coordinator
storage_service: topology coordinator: add logging to removenode and decommission
storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator
storage_service: topology coordinator: manage cluster cleanup as part of the topology management
storage_service: topology coordinator: provide a version of get_excluded_nodes that does not need node_to_work_on as a parameter
test: use servers_see_each_other when needed
test: add servers_see_each_other helper
storage_service: topology coordinator: make topology coordinator lifecycle subscriber
system_keyspace: raft topology: load ignore nodes parameter together with removenode topology request
storage_service: topology coordinator: introduce sstable cleanup fiber
storage_proxy: allow to wait for all ongoing writes
storage_service: topology coordinator: mark nodes as needing cleanup when required
storage_service: add mark_nodes_as_cleanup_needed function
vnode_effective_replication_map: add get_all_pending_nodes() function
vnode_effective_replication_map: pre calculate dirty endpoints during topology change
raft topology: add cleanup state to the topology state machine
The test runs two bootstraps and checks that there is no cleanup
in between. Then it runs a decommission and checks that cleanup runs
automatically and then it runs one more decommission and checks that no
cleanup runs again. Second part checks manual cleanup triggering. It
adds a node, triggers cleanup through the REST API, checks that is runs,
decommissions a node and check that the cleanup did not run again.
This test creates a 5 node cluster with 2 down nodes (A and B). After
that it creates a queue of 3 topology operation: bootstrap, removenode
A and removenode B with ignore_nodes=A. Check that all operation
manage to complete. Then it downs one node and creates a queue with
two requests: bootstrap and decommission. Since none can proceed both
should be canceled.
Introduce new REST API "/storage_service/cleanup_all"
that, when triggered, instructs the topology coordinator to initiate
cluster wide cleanup on all dirty nodes. It is done by introducing new
global command "global_topology_request::cleanup".
Sometimes it is unsafe to start a new topology operation before cleanup
runs on dirty nodes. This patch detects the situation when the topology
operation to be executed cannot be run safely until all dirty nodes do
cleanup and initiates the cleanup automatically. It also waits for
cleanup to complete before proceeding with the topology operation.
There can be a situation that nodes that needs cleanup dies and will
never clear the flag. In this case if a topology operation that wants to
run next does not have this node in its ignore node list it may stuck
forever. To fix this the patch also introduces the "liveness aware"
request queue management: we do not simple choose _a_ request to run next,
but go over the queue and find requests that can proceed considering
the nodes liveness situation. If there are multiple requests eligible to
run the patch introduces the order based on the operation type: replace,
join, remove, leave, rebuild. The order is such so to not trigger cleanup
needlessly.
* seastar 0ffed835...8b9ae36b (4):
> net/posix: Track ap-server ports conflict
Fixes#16720
> include/seastar/core: do not include unused header
> build: expose flag like -std=c++20 via seastar.pc
> src: include used headers for C++ modules build
Closesscylladb/scylladb#16769
In the next patch we want to abort topology operations if there is no
enough live nodes to perform them. This will break tests that do a
topology operation right after restarting a node since a topology
coordinator may still not see the restarted node as alive. Fix all those
tests to wait between restart and a topology operation until UP state
propagates.
We want to change the coordinator to consider nodes liveness when
processing the topology operation queue. If there is no enough live
nodes to process any of the ops we want to cancel them. For that to work
we need to be able to kick the coordinator if liveness situation
changes.
Introduce a fiber that waits on a topology event and when it sees that
the node it runs on needs to perform sstable cleanup it initiates one
for each non tablet, non local table and resets "cleanup" flag back to
"clean" in the topology.
We want to be able to wait for all writes started through the storage
proxy before a fence is advanced. Add phased_barrier that is entered
on each local write operation before checking the fence to do so. A
write will be either tracked by the phased_barrier or fenced. This will
be needed to wait for all non fenced local writes to complete before
starting a cleanup.
A cleanup needs to run when a node loses an ownership of a range (during
bootstrap) or if a range movement to an normal node failed (removenode,
decommission failure). Mark all dirty node as "cleanup needed" in those cases.
The function creates a mutation that sets cleanup to "needed" for each
normal node that, according to the erm, has data it does not own after
successful or unsuccessful topology operation.
Add a function that returns all nodes that have vnode been moved to them
during a topology change operation. Needed to know which nodes need to
do cleanup in case of failed topology change operation.
Some topology change operations causes some nodes loose ranges. This
information is needed to know which nodes need to do cleanup after
topology operation completes. Pre calculate it during erm creation.
The patch adds cleanup state to the persistent and in memory state and
handles the loading. The state can be "clean" which means no cleanup
needed, "needed" which means the node is dirty and needs to run cleanup
at some point, "running" which means that cleanup is running by the node
right now and when it will be completed the state will be reset to "clean".
This patch reverts commit 10f8f13b90 from
November 2022. That commit added to the "view update generator", the code
which builds view updates for staging sstables, a filter that ignores
ranges that do not belong to this node. However,
1. I believe this filter was never necessary, because the view update
code already silently ignores base updates which do not belong to
this replica (see get_view_natural_endpoint()). After all, the view
update needs to know that this replica is the Nth owner of the base
update to send its update to the Nth view replica, but if no such
N exists, no view update is sent.
2. The code introduced for that filter used a per-keyspace replication
map, which was ok for vnodes but no longer works for tablets, and
causes the operation using it to fail.
3. The filter was used every time the "view update generator" was used,
regardless of whether any cleanup is necessary or not, so every
such operation would fail with tablets. So for example the dtest
test_mvs_populating_from_existing_data fails with tablets:
* This test has view building in parallel with automatic tablet
movement.
* Tablet movement is streaming.
* When streaming happens before view building has finished, the
streamed sstables get "view update generator" run on them.
This causes the problematic code to be called.
Before this patch, the dtest test_mvs_populating_from_existing_data
fails when tablets are enabled. After this patch, it passes.
Fixes#16598
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "view update generator" is responsible for generating view updates
for staging sstables (such as coming from repair). If the processing
fails, the code retries - immediately. If there is some persistent bug,
such as issue #16598, we will have a tight loop of error messages,
potentially a gigabyte of identical messages every second.
In this patch we simply add a sleep of one second after view update
generation fails before retrying. We can still get many identical
error messages if there is some bug, but not more than one per second.
Refs #16598.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The loop in `id2ip` lambda makes problems if we are applying an old raft
log that contains long-gone nodes. In this case, we may never receive
the `IP` for a node and stuck in the loop forever. In this series we
replace the loop with an if - we just don't update the `host_id <-> ip`
mapping in the `token_metadata.topology` if we don't have an `IP` yet.
The PR moves `host_id -> IP` resolution to the data plane, now it
happens each time the IP-based methods of `erm` are called. We need this
because IPs may not be known at the time the erm is built. The overhead
of `raft_address_map` lookup is added to each data plane request, but it
should be negligible. In this PR `erm/resolve_endpoints` continues to
treat missing IP for `host_id` as `internal_error`, but we plan to relax
this in the follow-up (see this PR first comment).
Closesscylladb/scylladb#16639
* github.com:scylladb/scylladb:
raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater
gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
storage_service: topology_state_load: remove IP waiting loop
storage_service: sync_raft_topology_nodes: add target_node parameter
storage_service: sync_raft_topology_nodes: move loops to the end
storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node
storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node
storage_service: sync_raft_topology_nodes: move update_topology up
storage_service: topology_state_load: remove clone_async/clear_gently overhead
storage_service: fix indentation
storage_service: extract sync_raft_topology_nodes
storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata
address_map: move gossiper subscription logic into storage_service
topology_coordinator: exec_global_command: small refactor, use contains + reformat
storage_service: wait_for_ip for new nodes
storage_service.idl.hh: fix raft_topology_cmd.command declaration
erm: for_each_natural_endpoint_until: use is_vnode == true
erm: switch the internal data structures to host_id-s
erm: has_pending_ranges: switch to host_id
When a node changes its IP we need to store the mapping in
system.peers and update token_metadata.topology and erm
in-memory data structures.
The test_change_ip was improved to verify this new
behaviour. Before this patch the test didn't check
that IPs used for data requests are updated on
IP change. In this commit we add the read/write check.
It fails on insert with 'node unavailable'
error without the fix.
The loop makes problems if we are applying an old
raft log that contains long-gone nodes. In this case, we may
never receive the IP for a node and stuck in the loop forever.
The idea of the patch is to replace the loop with an
if - we just don't update the host_id <-> ip mapping
in the token_metadata.topology if we don't have an IP yet.
When we get the mapping later, we'll call
sync_raft_topology_nodes again from
gossiper_state_change_subscriber_proxy.
If it's set, instead of going over all the nodes in raft topology,
the function will update only the specified node. This parameter
will be used in the next commit, in the call to sync_raft_topology_nodes
from gossiper_state_change_subscriber_proxy.