Commit Graph

1063 Commits

Author SHA1 Message Date
Benny Halevy
2c54d7a35a view, storage_proxy: carry effective_replication_map along with endpoints
When sending mutation to remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.

Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again, and it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138

The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
See also https://github.com/scylladb/scylladb/pull/15141

Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15142
2023-08-29 09:08:42 +03:00
Botond Dénes
47ce69e9bf Merge 'paxos_response_handler: carry effective replication map' from Benny Halevy
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.

Fixes scylladb/scylladb#15138

Closes #15141

* github.com:scylladb/scylladb:
  storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
  storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
2023-08-28 11:42:38 +03:00
Benny Halevy
4a2e367e92 storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.

Fixes scylladb/scylladb#15138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:45:13 +03:00
Benny Halevy
6af0b281a6 storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.

Fixes scylladb/scylladb#15147

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 10:43:40 +03:00
Benny Halevy
098dd5021a storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context
The batchlog mutation is for system.batchlog.
Rather than looking the schema up in multiple places
do that once and keep it in the context object.

It will be used in the next patch to get a respective
effective_replication_map_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 10:43:23 +03:00
Benny Halevy
27c33015a5 storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
Return error in production rather than crashing
as in https://github.com/scylladb/scylladb/issues/15138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 08:59:38 +03:00
Tomasz Grabiec
0239ba4527 Merge 'fencing: handle counter_mutations' from Gusev Petr
In this PR we add proper fencing handling to the `counter_mutation` verb.

As for regular mutations, we do the check twice in `handle_counter_mutation`, before and after applying the mutations. The last is important in case fence was moved while we were handling the request - some post-fence actions might have already happened at this time, so we can't treat the request as successful. For example, if topology change coordinator was switching to `write_both_read_new`, streaming might have already started and missed this update.

In `mutate_counters` we can use a single `fencing_token` for all leaders, since all the erms are processed without yields and should underneath share the same `token_metadata`.

We don't pass fencing token for replication explicitly in `replicate_counter_from_leader` since `mutate_counter_on_leader_and_replicate` doesn't capture erm and if the drain on the coordinator timed out the erm for replication might be different and we should use the corresponding (maybe the new one) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current `token_metadata` version for outgoing requests.

Closes #14564

* github.com:scylladb/scylladb:
  counter_mutation: add fencing
  encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
  counter_mutation: add replica::exception_variant to signature
2023-08-01 12:41:22 +02:00
Benny Halevy
2902a4136f storage_proxy: query_partition_key_range_concurrent: maybe_yield in loop
Add calls to `maybe_yield` in the per-range loops to prevent stalls
if the loop never yields.

Note: originally the stalls were detected in nested calls
to `query_partition_key_range_concurrent` (see #14008).
This series turned the tail-recursion into iteration,
but still the inner loop(s) never yield and do quite
a lot of computations - so they mioght stall when called
with a large number of ranges.

Fixes #14008

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 09:54:34 +03:00
Benny Halevy
8d5020b8f6 storage_proxy: query_partition_key_range_concurrent: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 09:43:33 +03:00
Benny Halevy
3c122a87b5 storage_proxy: query_partition_key_range_concurrent: turn tail recursion to iteration
Update the function state and loop for the next ranges
instead of nesting it oneself.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 09:43:33 +03:00
Benny Halevy
fd119469d8 storage_proxy: coroutinize query_partition_key_range
Prepare for coroutinizing query_partition_key_range_concurrent.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 08:22:24 +03:00
Avi Kivity
ff1f461a42 Merge 'Introduce tablet load balancer' from Tomasz Grabiec
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.

The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.

The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.

Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.

Closes #14601

* github.com:scylladb/scylladb:
  tests: test_tablets: Add test for bootstraping a node
  storage_service: topology_coordinator: Implement tablet migration state machine
  tablets: Introduce tablet_mutation_builder
  service: tablet_allocator: Introduce tablet load balancer
  tablets: Introduce tablet_map::for_each_tablet()
  topology: Introduce get_node()
  token_metadata: Add non-const getter of tablet_metadata
  storage_service: Notify topology state machine after applying schema change
  storage_service: Implement stream_tablet RPC
  tablets: Introduce global_tablet_id
  stream_transfer_task, multishard_writer: Work with table sharder
  tablets: Turn tablet_id into a struct
  db: Do not create per-keyspace erm for tablet-based tables
  tablets: effective_replication_map: Take transition stage into account when computing replicas
  tablets: Store "stage" in transition info
  doc: Document tablet migration state machine and load balancer
  locator: erm: Make get_endpoints_for_reading() always return read replicas
  storage_service: topology_coordinator: Sleep on failure between retries
  storage_service: topology_coordinator: Simplify coordinator loop
  main: Require experimental raft to enable tablets
2023-07-26 12:30:29 +03:00
Botond Dénes
ad2ddffb22 Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from

- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests

All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and use query_processor reference from system_keyspace object in stead of global qctx

Closes #14778

* github.com:scylladb/scylladb:
  system_keyspace: Make save_truncation_record() non-static
  code: Pass sharded<db::system_keyspace>& to database::truncate()
  db: Add sharded<system_keyspace>& to legacy_schema_migrator
2023-07-26 08:48:49 +03:00
Tomasz Grabiec
6d545b2f9e storage_service: Implement stream_tablet RPC
Performs streaming of data for a single tablet between two tablet
replicas. The node which gets the RPC is the receiving replica.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
7851694eaa locator: erm: Make get_endpoints_for_reading() always return read replicas
Just a simplification.

Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid, first node inserts
its tokens as normal without going through bootstrap procedure.
2023-07-25 21:08:01 +02:00
Petr Gusev
116444a01b counter_mutation: add fencing
As for regular mutations, we do the check
twice in handle_counter_mutation, before
and after applying the mutations. The last
is important in case fence was moved while
we were handling the request - some post-fence
actions might have already happened at this
time, so we can't treat the request as successful.
For example, if topology change coordinator was
switching to write_both_read_new, streaming
might have already started and missed this update.

In mutate_counters we can use a single fencing_token
for all leaders, since all the erms are processed
without yields and should underneath share the
same token_metadata.

We don't pass fencing token for replication explicitly in
replicate_counter_from_leader since
mutate_counter_on_leader_and_replicate doesn't capture erm
and if the drain on the coordinator timed out the erm for
replication might be different and we should use the
corresponding (maybe the new one) topology version for
outgoing write replication requests. This delayed
replication is similar to any other background activity
(e.g. writing hints) - it takes the current erm and
the current token_metadata version for outgoing requests.
2023-07-25 12:10:03 +04:00
Petr Gusev
edbb5cbb5f encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
We will need it in later commit to return
exceptions from handle_counter_mutation.

We also add utils::Tuple concept restriction
for add_replica_exception_to_query_result
since its type parameters are always tuples.
2023-07-25 12:09:21 +04:00
Petr Gusev
f2cbdc7f18 counter_mutation: add replica::exception_variant to signature
We are going to add fencing for counter mutations,
this means handle_counter_mutation will sometimes throw
stale_topology_exception. RPC doesn't marshall exceptions
transparently, exceptions thrown by server are delivered
to the client as a general remote_verb_error, which is not
very helpful.

The common practice is to embed exceptions into handler
result type. In this commit we use already existing
exception_variant as an exception container. We mark
exception_variant with [[version]] attribute in the idl
file, this should handle the case when the old replica
(without exception_variant in the signature) is replying
to the new one.
2023-07-25 12:09:19 +04:00
Petr Gusev
5fb8da4181 hints: add fencing
In this commit we just pass a fencing_token
through hint_mutation RPC verb.

The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.

Closes #14580
2023-07-24 18:12:48 +02:00
Pavel Emelyanov
eaeffcdb81 code: Pass sharded<db::system_keyspace>& to database::truncate()
The arguments goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from

- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:11:59 +03:00
Pavel Emelyanov
d9ba8eb8df service/paxos: Add db::system_keyspace& argument to some methods
The paxos_state's .prepare(), .accept(), .learn() and .prune() methods
access system keyspace via its static methods. The only caller of those
(storage_proxy::remote) already has the sharded system k.s. reference
and can pass its .local() one as argument

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Pavel Emelyanov
b0b91bf5ec proxy/remote: Keep sharded<db::system_keyspace>& dependency
This dependency will be needed to call service::paxos_state:: calls and
all of them are done in storage_proxy::remote() methods only

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 17:36:42 +03:00
Patryk Jędrzejczak
56bd9b5db3 service: storage_proxy: do not report abort requests in handle_write
We don't want to report aborts in storage_proxy::handle_write, because it
can be only triggered by shutdowns and timeouts.

Before this change, such reports flooded logs when a drained node still
received the write RPCs.
2023-07-17 12:27:36 +02:00
Patryk Jędrzejczak
f9db9f5943 service: storage_proxy: encode abort_requested_exception in handle_read
storage_proxy::handle_read now makes sure that abort_requested_exception
is encoded in a way that preserves its type information. This allows
the coordinator to properly deserialize and handle it.

Before this change, if a drained replica was still receiving the read
RPCs, it would flood the coordinator's logs with std::runtime_error
reports.
2023-07-17 12:27:36 +02:00
Patryk Jędrzejczak
68bd0424c2 service: storage_proxy: refactor encode_replica_exception_for_rpc
To properly handle abort_requested_exception thrown from
migration_manager::get_schema_for_read in storage_proxy::handle_read (we
do in the next commit) we have to somehow encode and return it. The
encode_replica_exception_for_rpc function is not suitable for that because
it requires the SourceTuple type (of a value returned by do_query()) which
we don't know when calling get_schema_for_read.

We move the part of encode_replica_exception_for_rpc responsible for
handling exceptions to a new function and rewrite it in a way that doesn't
require the SourceTuple type. As this function fits the name
encode_replica_exception_for_rpc better, we name it this way and rename
the previous encode_replica_exception_for_rpc.
2023-07-17 12:27:33 +02:00
Patryk Jędrzejczak
a21c4abad7 replica: add abort_requested_exception to exception_variant
If migration_manager::get_schema_for_write is called after
migration_manager::drain, it throws abort_requested_exception.
This exception is not present in replica::exception_variant, which
means that RPC doesn't preserve information about its type. If it is
thrown on the replica side, it is deserialized as std::runtime_error
on the coordinator. Therefore, abstract_read_resolver::error logs
information about this exception, even though we don't want it (aborts
are triggered on shutdown and timeouts).

To solve this issue, we add abort_requested_exception to
replica::exception_variant and, in the next commits, refactor
storage_proxy::handle_read so that abort_requested_exception thrown in
migration_manager::get_schema_for_write is properly serialized. Thanks
to this change, unchanged abstract_read_resolver::error correctly
handles abort_requested_exception thrown on the replica side by not
reporting it.
2023-07-13 16:57:10 +02:00
Kefu Chai
5443bf69f7 storage_proxy: print the expected ex.what()
before this change, the format string contains two placeholders,
but only one extra argument is passed in. if we actually format
this logging message, fmtlib would throw.

after this change, we pass the exception's error message as yet
another argument.

this logging message is printed with "trace" level, guess that's
why we haven't have the exception thrown by fmtlib.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14628
2023-07-12 12:34:51 +03:00
Petr Gusev
1e851262f2 storage_proxy: handler responses, use pointers to default constructed values instead of nulls
The current Seastar RPC infrastructure lacks support
for null values in tuples in handler responses.
In this commit we add the make_default_rpc_tuple function,
which solves the problem by returning pointers to
default-constructed values for smart pointer types
rather than nulls.

The problem was introduced in this commit
 2d791a5ed4. The
 function `encode_replica_exception_for_rpc` used
 `default_tuple_maker` callback to create tuples
 containing exceptions. Callers returned pointers
 to default-constructed values in this callback,
 e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`.
 The commit changed this to just `SourceTuple{}`,
 which means nullptr for pointer types.

Fixes: #14282

Closes #14352
2023-06-26 11:10:38 +03:00
Gleb Natapov
94fcba5662 storage_proxy: remove unused variable 2023-06-22 15:26:20 +03:00
Avi Kivity
e233f471b8 Merge 'Respect tablet shard assignment' from Tomasz Grabiec
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet

After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.

To make the life easier, support was added to obtain table pointer from the schema pointer:

```
schema_ptr s;
s->table().shard_of(...)
```

Closes #13939

* github.com:scylladb/scylladb:
  locator: network_topology_startegy: Allocate shards to tablets
  locator: Store node shard count in topology
  service: topology: Extract topology updating to a lambda
  test: Move test_tablets under topology_experimental
  sstables: Add trace-level logging related to shard calculation
  schema: Catch incorrect uses of schema::get_sharder()
  dht: Rename dht::shard_of() to dht::static_shard_of()
  treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
  storage_proxy: Avoid multishard reader for tablets
  storage_proxy: Obtain shard from erm in the read path
  db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
  forward_service: Use table sharder
  alternator: Use table sharder
  db: multishard: Obtain sharder from erm
  sstable_directory: Improve trace-level logging
  db: table: Introduce shard_of() helper
  db: Use table sharder in compaction
  sstables: Compute sstable shards using sharder from erm when loading
  sstables: Generate sharding metadata using sharder from erm when writing
  test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
  dht: Make split_range_to_single_shard() prepared for tablet sharder
  sstables: Move compute_shards_for_this_sstable() to load()
  dht: Take sharder externally in splitting functions
  locator: Make sharder accessible through effective_replication_map
  dht: sharder: Document guarantees about mapping stability
  tablets: Implement tablet sharder
  tablets: Include pending replica in get_shard()
  dht: sharder: Introduce next_shard()
  db: token_ring_table: Filter out tablet-based keyspaces
  db: schema: Attach table pointer to schema
  schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
  schema_registry: Make learn(schema_ptr) attach entry to the target schema
  test: lib: cql_test_env: Expose feature_service
  test: Extract throttle object to separate header
2023-06-21 10:20:41 +03:00
Calle Wilund
f18e967939 storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14294
2023-06-21 10:08:27 +03:00
Tomasz Grabiec
21198e8470 treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
fb0bdcec0c storage_proxy: Avoid multishard reader for tablets
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.

The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
10e05eec66 storage_proxy: Obtain shard from erm in the read path
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e48ec6fed3 db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Kamil Braun
732feca115 storage_proxy: query_partition_key_range_concurrent: don't access empty range
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.

Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.

This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.

The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.

Also change the `assert` to `on_internal_error`.

Fixes #14284

Closes #14300
2023-06-20 07:56:40 +03:00
Petr Gusev
94605e4839 storage_proxy.cc: add fencing to read RPCs
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.

Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when query coordinator was isolated from
topology change coordinator and didn't receive
barrier_and_drain.
2023-06-15 15:52:50 +04:00
Petr Gusev
4004ce1f44 storage_proxy.cc: extract handle_read
We continue the refactoring by introducing
the common implementation for all read methods.
2023-06-15 15:52:50 +04:00
Petr Gusev
2d791a5ed4 storage_proxy.cc: refactor encode_replica_exception_for_rpc
We are going to add fencing to read RPCs, it would be easier
to do it once for all three of them. This refactoring
enables this since it allows to use
encode_replica_exception_for_rpc for handle_read_digest.
2023-06-15 15:52:50 +04:00
Petr Gusev
6b115e902b storage_proxy: fix indentation 2023-06-15 15:52:50 +04:00
Petr Gusev
46f73fcaa6 storage_proxy: add fencing for mutation
At the call site, we use the version, captured
in erm/token_metadata. In the handler, we use
double checking, apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.

Fencing was also added to mutate_locally calls
on request coordinator, for the case
if this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
2023-06-15 15:52:49 +04:00
Petr Gusev
d34da12240 storage_proxy: add fencing_token and related infrastructure
A new stale_topology_exception was introduced,
it's raised in apply_fence when an RPC comes
with a stale fencing_token.

An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
2023-06-15 15:48:00 +04:00
Kamil Braun
a740fbf58a storage_proxy: rename init_messaging_service to start_remote
The function now has more responsibilities than before, rename it
and add a comment to better illustrate this.
2023-06-14 11:41:36 +02:00
Kamil Braun
f26e98c3be storage_proxy: don't pass gossiper& and messaging_service& during initialization
These services are now passed during `init_messaging_service`, and
that's when the `remote` object is constructed.

The `remote` object is then destroyed in `uninit_messaging_service`.

Also, `migration_manager*` became `migration_manager&` in
`init_messaging_service`.
2023-06-14 11:41:36 +02:00
Kamil Braun
10f11b89ea storage_proxy: prepare for missing remote
Prepare the users of `remote` for the possibility that it's gone.

The `remote()` accessor throws an error if it's gone. Observe that
`remote()` is only used in places where it's verified that we really
want to send a message to a remote node, with a small exception:
`truncate_blocking`, which truncates locally by sending an RPC to
ourselves (and truncate always sends RPC to the whole cluster; we might
want to change this behavior in the future, see #11087). Other places
are easy to check (it's either implementations of `apply_remotely`
which is only called for remote nodes, or there's an `if` that checks
we don't apply the operation to ourselves).

There is one direct access to `_remote` which checks first if `_remote`
is available: `storage_proxy::is_alive`. If `_remote` is unavailable, we
consider nodes other than us dead. Indeed, if `gossiper` is unavailable,
we didn't have a chance to gossip with other nodes and mark them alive.
2023-06-14 11:41:36 +02:00
Kamil Braun
8db1d75c6c storage_proxy: don't access remote during local queries in query_partition_key_range_concurrent
In `query_partition_key_range_concurrent` there's a calculation of cache
hit rates which requires accessing `gossiper` through `remote`. We want
to support local queries when `remote` is unavailable.

Check if it's a local query and only if not, fetch `gossiper` from `remote`.
2023-06-14 11:41:36 +02:00
Kamil Braun
ddcbade919 storage_proxy: don't access remote when calculating target replicas for local queries
We only want to access `remote` when it's necessary - when we're
performing a query that involves remote nodes. We want to support local
queries when `remote` (in particular, `gossiper&`) is unavailable.

Add a helper, `storage_proxy::filter_replicas_for_read`, which will
check if it's a local query and return early in that case without
accessing `remote`.
2023-06-14 11:41:34 +02:00
Kamil Braun
ff8d88a228 storage_proxy: introduce const version of remote()
One version is implemented using the other (with `const_cast`) because
some additional safety checks will be added in later commit.
2023-06-13 12:44:03 +02:00
Kamil Braun
c5c78a7922 storage_proxy: endpoint_filter: remove gossiper dependency
The function used `gossiper&` to check whether an endpoint is considered
alive. Abstract this out through `noncopyable_function`.

This will allow us to use `endpoint_filter` during local queries when
`remote` (which contains the `gossiper` reference) is unavailable.
2023-06-12 15:23:48 +02:00
Petr Gusev
3a88c7769f tracing::trace_info: pass by ref
sizeof(std::optional<tracing::trace_info>) == 64 bytes,
so it should be more efficient.
2023-05-30 14:32:10 +04:00