When sending mutation to remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.
Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again, and it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138
The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
See also https://github.com/scylladb/scylladb/pull/15141
Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15142
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixes scylladb/scylladb#15138
Closes#15141
* github.com:scylladb/scylladb:
storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixesscylladb/scylladb#15138
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.
Fixesscylladb/scylladb#15147
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The batchlog mutation is for system.batchlog.
Rather than looking the schema up in multiple places
do that once and keep it in the context object.
It will be used in the next patch to get a respective
effective_replication_map_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In this PR we add proper fencing handling to the `counter_mutation` verb.
As for regular mutations, we do the check twice in `handle_counter_mutation`, before and after applying the mutations. The last is important in case fence was moved while we were handling the request - some post-fence actions might have already happened at this time, so we can't treat the request as successful. For example, if topology change coordinator was switching to `write_both_read_new`, streaming might have already started and missed this update.
In `mutate_counters` we can use a single `fencing_token` for all leaders, since all the erms are processed without yields and should underneath share the same `token_metadata`.
We don't pass fencing token for replication explicitly in `replicate_counter_from_leader` since `mutate_counter_on_leader_and_replicate` doesn't capture erm and if the drain on the coordinator timed out the erm for replication might be different and we should use the corresponding (maybe the new one) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current `token_metadata` version for outgoing requests.
Closes#14564
* github.com:scylladb/scylladb:
counter_mutation: add fencing
encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
counter_mutation: add replica::exception_variant to signature
Add calls to `maybe_yield` in the per-range loops to prevent stalls
if the loop never yields.
Note: originally the stalls were detected in nested calls
to `query_partition_key_range_concurrent` (see #14008).
This series turned the tail-recursion into iteration,
but still the inner loop(s) never yield and do quite
a lot of computations - so they mioght stall when called
with a large number of ranges.
Fixes#14008
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.
The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.
The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.
Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.
Closes#14601
* github.com:scylladb/scylladb:
tests: test_tablets: Add test for bootstraping a node
storage_service: topology_coordinator: Implement tablet migration state machine
tablets: Introduce tablet_mutation_builder
service: tablet_allocator: Introduce tablet load balancer
tablets: Introduce tablet_map::for_each_tablet()
topology: Introduce get_node()
token_metadata: Add non-const getter of tablet_metadata
storage_service: Notify topology state machine after applying schema change
storage_service: Implement stream_tablet RPC
tablets: Introduce global_tablet_id
stream_transfer_task, multishard_writer: Work with table sharder
tablets: Turn tablet_id into a struct
db: Do not create per-keyspace erm for tablet-based tables
tablets: effective_replication_map: Take transition stage into account when computing replicas
tablets: Store "stage" in transition info
doc: Document tablet migration state machine and load balancer
locator: erm: Make get_endpoints_for_reading() always return read replicas
storage_service: topology_coordinator: Sleep on failure between retries
storage_service: topology_coordinator: Simplify coordinator loop
main: Require experimental raft to enable tablets
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from
- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests
All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and use query_processor reference from system_keyspace object in stead of global qctx
Closes#14778
* github.com:scylladb/scylladb:
system_keyspace: Make save_truncation_record() non-static
code: Pass sharded<db::system_keyspace>& to database::truncate()
db: Add sharded<system_keyspace>& to legacy_schema_migrator
Just a simplification.
Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid, first node inserts
its tokens as normal without going through bootstrap procedure.
As for regular mutations, we do the check
twice in handle_counter_mutation, before
and after applying the mutations. The last
is important in case fence was moved while
we were handling the request - some post-fence
actions might have already happened at this
time, so we can't treat the request as successful.
For example, if topology change coordinator was
switching to write_both_read_new, streaming
might have already started and missed this update.
In mutate_counters we can use a single fencing_token
for all leaders, since all the erms are processed
without yields and should underneath share the
same token_metadata.
We don't pass fencing token for replication explicitly in
replicate_counter_from_leader since
mutate_counter_on_leader_and_replicate doesn't capture erm
and if the drain on the coordinator timed out the erm for
replication might be different and we should use the
corresponding (maybe the new one) topology version for
outgoing write replication requests. This delayed
replication is similar to any other background activity
(e.g. writing hints) - it takes the current erm and
the current token_metadata version for outgoing requests.
We will need it in later commit to return
exceptions from handle_counter_mutation.
We also add utils::Tuple concept restriction
for add_replica_exception_to_query_result
since its type parameters are always tuples.
We are going to add fencing for counter mutations,
this means handle_counter_mutation will sometimes throw
stale_topology_exception. RPC doesn't marshall exceptions
transparently, exceptions thrown by server are delivered
to the client as a general remote_verb_error, which is not
very helpful.
The common practice is to embed exceptions into handler
result type. In this commit we use already existing
exception_variant as an exception container. We mark
exception_variant with [[version]] attribute in the idl
file, this should handle the case when the old replica
(without exception_variant in the signature) is replying
to the new one.
In this commit we just pass a fencing_token
through hint_mutation RPC verb.
The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.
Closes#14580
The arguments goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from
- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The paxos_state's .prepare(), .accept(), .learn() and .prune() methods
access system keyspace via its static methods. The only caller of those
(storage_proxy::remote) already has the sharded system k.s. reference
and can pass its .local() one as argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This dependency will be needed to call service::paxos_state:: calls and
all of them are done in storage_proxy::remote() methods only
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We don't want to report aborts in storage_proxy::handle_write, because it
can be only triggered by shutdowns and timeouts.
Before this change, such reports flooded logs when a drained node still
received the write RPCs.
storage_proxy::handle_read now makes sure that abort_requested_exception
is encoded in a way that preserves its type information. This allows
the coordinator to properly deserialize and handle it.
Before this change, if a drained replica was still receiving the read
RPCs, it would flood the coordinator's logs with std::runtime_error
reports.
To properly handle abort_requested_exception thrown from
migration_manager::get_schema_for_read in storage_proxy::handle_read (we
do in the next commit) we have to somehow encode and return it. The
encode_replica_exception_for_rpc function is not suitable for that because
it requires the SourceTuple type (of a value returned by do_query()) which
we don't know when calling get_schema_for_read.
We move the part of encode_replica_exception_for_rpc responsible for
handling exceptions to a new function and rewrite it in a way that doesn't
require the SourceTuple type. As this function fits the name
encode_replica_exception_for_rpc better, we name it this way and rename
the previous encode_replica_exception_for_rpc.
If migration_manager::get_schema_for_write is called after
migration_manager::drain, it throws abort_requested_exception.
This exception is not present in replica::exception_variant, which
means that RPC doesn't preserve information about its type. If it is
thrown on the replica side, it is deserialized as std::runtime_error
on the coordinator. Therefore, abstract_read_resolver::error logs
information about this exception, even though we don't want it (aborts
are triggered on shutdown and timeouts).
To solve this issue, we add abort_requested_exception to
replica::exception_variant and, in the next commits, refactor
storage_proxy::handle_read so that abort_requested_exception thrown in
migration_manager::get_schema_for_write is properly serialized. Thanks
to this change, unchanged abstract_read_resolver::error correctly
handles abort_requested_exception thrown on the replica side by not
reporting it.
before this change, the format string contains two placeholders,
but only one extra argument is passed in. if we actually format
this logging message, fmtlib would throw.
after this change, we pass the exception's error message as yet
another argument.
this logging message is printed with "trace" level, guess that's
why we haven't have the exception thrown by fmtlib.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14628
The current Seastar RPC infrastructure lacks support
for null values in tuples in handler responses.
In this commit we add the make_default_rpc_tuple function,
which solves the problem by returning pointers to
default-constructed values for smart pointer types
rather than nulls.
The problem was introduced in this commit
2d791a5ed4. The
function `encode_replica_exception_for_rpc` used
`default_tuple_maker` callback to create tuples
containing exceptions. Callers returned pointers
to default-constructed values in this callback,
e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`.
The commit changed this to just `SourceTuple{}`,
which means nullptr for pointer types.
Fixes: #14282Closes#14352
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet
After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.
To make the life easier, support was added to obtain table pointer from the schema pointer:
```
schema_ptr s;
s->table().shard_of(...)
```
Closes#13939
* github.com:scylladb/scylladb:
locator: network_topology_startegy: Allocate shards to tablets
locator: Store node shard count in topology
service: topology: Extract topology updating to a lambda
test: Move test_tablets under topology_experimental
sstables: Add trace-level logging related to shard calculation
schema: Catch incorrect uses of schema::get_sharder()
dht: Rename dht::shard_of() to dht::static_shard_of()
treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
storage_proxy: Avoid multishard reader for tablets
storage_proxy: Obtain shard from erm in the read path
db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
forward_service: Use table sharder
alternator: Use table sharder
db: multishard: Obtain sharder from erm
sstable_directory: Improve trace-level logging
db: table: Introduce shard_of() helper
db: Use table sharder in compaction
sstables: Compute sstable shards using sharder from erm when loading
sstables: Generate sharding metadata using sharder from erm when writing
test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
dht: Make split_range_to_single_shard() prepared for tablet sharder
sstables: Move compute_shards_for_this_sstable() to load()
dht: Take sharder externally in splitting functions
locator: Make sharder accessible through effective_replication_map
dht: sharder: Document guarantees about mapping stability
tablets: Implement tablet sharder
tablets: Include pending replica in get_shard()
dht: sharder: Introduce next_shard()
db: token_ring_table: Filter out tablet-based keyspaces
db: schema: Attach table pointer to schema
schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
schema_registry: Make learn(schema_ptr) attach entry to the target schema
test: lib: cql_test_env: Expose feature_service
test: Extract throttle object to separate header
Fixes#11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes#14294
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.
The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
Fixes#14284Closes#14300
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.
Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when query coordinator was isolated from
topology change coordinator and didn't receive
barrier_and_drain.
We are going to add fencing to read RPCs, it would be easier
to do it once for all three of them. This refactoring
enables this since it allows to use
encode_replica_exception_for_rpc for handle_read_digest.
At the call site, we use the version, captured
in erm/token_metadata. In the handler, we use
double checking, apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.
Fencing was also added to mutate_locally calls
on request coordinator, for the case
if this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
A new stale_topology_exception was introduced,
it's raised in apply_fence when an RPC comes
with a stale fencing_token.
An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
These services are now passed during `init_messaging_service`, and
that's when the `remote` object is constructed.
The `remote` object is then destroyed in `uninit_messaging_service`.
Also, `migration_manager*` became `migration_manager&` in
`init_messaging_service`.
Prepare the users of `remote` for the possibility that it's gone.
The `remote()` accessor throws an error if it's gone. Observe that
`remote()` is only used in places where it's verified that we really
want to send a message to a remote node, with a small exception:
`truncate_blocking`, which truncates locally by sending an RPC to
ourselves (and truncate always sends RPC to the whole cluster; we might
want to change this behavior in the future, see #11087). Other places
are easy to check (it's either implementations of `apply_remotely`
which is only called for remote nodes, or there's an `if` that checks
we don't apply the operation to ourselves).
There is one direct access to `_remote` which checks first if `_remote`
is available: `storage_proxy::is_alive`. If `_remote` is unavailable, we
consider nodes other than us dead. Indeed, if `gossiper` is unavailable,
we didn't have a chance to gossip with other nodes and mark them alive.
In `query_partition_key_range_concurrent` there's a calculation of cache
hit rates which requires accessing `gossiper` through `remote`. We want
to support local queries when `remote` is unavailable.
Check if it's a local query and only if not, fetch `gossiper` from `remote`.
We only want to access `remote` when it's necessary - when we're
performing a query that involves remote nodes. We want to support local
queries when `remote` (in particular, `gossiper&`) is unavailable.
Add a helper, `storage_proxy::filter_replicas_for_read`, which will
check if it's a local query and return early in that case without
accessing `remote`.
The function used `gossiper&` to check whether an endpoint is considered
alive. Abstract this out through `noncopyable_function`.
This will allow us to use `endpoint_filter` during local queries when
`remote` (which contains the `gossiper` reference) is unavailable.