Currently, scans are splitting partition ranges around tokens. This
will have to change with tablets, where we should split at tablet
boundaries.
This patch introduces token_range_splitter which abstracts this
task. It is provided by effective_replication_map implementation.
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
With tablet-based replication strategies it will represent replication
of a single table.
Current vnode_effective_replication_map can be adapted to this interface.
This will allow algorithms like those in storage_proxy to work with
both kinds of replication strategies over a single abstraction.
`seastar::current_backtrace()` can be quite heavey.
When we pass it to a log message in relatively detailed log_level
(debug/trace), we pay the price of `current_backtrace` every time,
but we rarely print the message.
Closes#13527
* github.com:scylladb/scylladb:
locator/topology: call seastar::current_backtrace only when log_level is enabled
schema_tables: call seastar::current_backtrace only when log_level is enabled
Fixes a problem when raft-based topology is enabled, which loads
topology from storage. It starts by clearing topology and then adding
nodes one by one. Before this patch, this violates internal invariant
of topology object which puts the local node as the first node. This
would manifest by triggering an assert in topology::pop_node() which
throws if popping the node at index 0 in order to keep the information
about local node around. This is normally prevented by a check in
topology::remove_node() which avoid calling pop_node() if removing the
local node. But since there is no node which is marked as local, this
check allows the first node to be popped.
To fix the problem I lift the invariant that local node is always in
_nodes. We still have information about local node in config. Instead
of keeping it in _nodes, we recognize it as part of indexing. We also
allow removing the local node like a regular node.
The path which reloads topology works correctly after this, the local
node will be recognized when (if) it is added to the topology.
Fixes#13495Closes#13498
* github.com:scylladb/scylladb:
locator: topology: Fix move assignment
locator: topology: Add printer
tests: topology: Test that topology clearing preserves information about local node
locator: topology: Recognize local node as part of indexing it
locator: topology: Fix get_location(ep) for local node
locator: topology: Fix typo
locator: topology: Preserve config when cloning
this change replaces all occurrences of `boost::lexical_cast<std::string>`
in the source tree with `fmt::to_string()`. for couple reasons:
* `boost::lexical_cast<std::string>` is longer than `fmt::to_string()`,
so the latter is easier to parse and read.
* `boost::lexical_cast<std::string>` creates a stringstream under the
hood, so it can use the `operator<<` to stringify the given object.
but stringstream is known to be less performant than fmtlib.
* we are migrating to fmtlib based formatting, see #13245. so
using `fmt::to_string()` helps us to remove yet another dependency
on `operator<<`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13611
Fixes a problem when raft-based topology is enabled, which loads
topology from storage. It starts by clearing topology and then adding
nodes one by one. Before this patch, this violates internal invariant
of topology object which puts the local node as the first node. This
would manifest by triggering an assert in topology::pop_node() which
throws if popping the node at index 0 in order to keep the information
about local node around. This is normally prevented by a check in
topology::remove_node() which avoid calling pop_node() if removing the
local node. But since there is no node which is marked as local, this
check allows the first node to be popped.
To fix the problem I lift the invariant that local node is always in
_nodes. We still have information about local node in config. Instead
of keeping it in _nodes, we recognize it as part of indexing. We also
allow removing the local node like a regular node.
The path which reloads topology works correctly after this, the local
node will be recognized when (if) it is added to the topology.
Fixes#13495
topology config may designate a different node than
get_broadcast_address() as local node. In particular, some tests don't
designate any node as the local node, which leads to logic errors
where current get_location(ep) for ep which happens to have the
address 127.0.0.1 returns location of the first node in _nodes rather
than ep.
Fix by looking up in _nodes first and fall back to local node if it's
equal to configured local node (if any).
Config is separate from state of the topology (nodes it
contains). Preserving the config will make it easier in later patches
to maintain invariants for cloned instances.
`seastar::current_backtrace()` can be quite heavey.
When we pass it to a log message in relatively detailed log_level
(debug/trace), we pay the price of `current_backtrace` every time,
but we rarely print the message.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Current if index_node throws when trying to
add an already indexed node, pop_node might
unindex the existing node instead of the new one.
Instead, with this change, unindex_node looks up
the node by its pointer and removed it from the
index map only if it's found there so to clean up
safely after index_node throws (at any stage).
Add a unit test to verify that.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, opt_st overrides the internal `changed` flag
by setting it with the opt_st changed status.
Instead, it should use `|=` to keep it true if it is already so.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13502
Don't maintain a "shadow" endpoint_to_host_id_map in token_metadata_impl.
Instead, get the nodes_by_endpoint map from topology
and use it to build the endpoint_to_host_id_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a simple node state model with:
`joining`, `normal`, `leaving`, and `left` states
to help managing nodes during replace
with the the same ip address.
Later on, this could also help prevent nodes
that were decommissioned, removed, or replaced
from rejoining the cluster.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And keep per node information (idx, host_id, endpoint, dc_rack, is_pending)
in node objects, indexed by topology on several indices like:
idx, host_id, endpoint, current/pending, per dc, per dc/rack.
The node index is a shorthand identifier for the node.
node* and index are valid while the respective topology instance is valid.
To be used, the caller must hold on to the topology / token_metadata object
(e.g. via a token_metadata_ptr or effective_replication_map)
Refs #6403
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
topology: add node idx
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define get_location() that gets the location
for the local node, and use either this entry point
or get_location(inet_address) to get the respective
dc or rack.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
storage_service::replicate_to_all_cores has a sophisticated way
to mutate the token_metadata and effective_replication_map
on shard 0 and cloning those to all other shards, applying
the changes only mutate and clone succeeded on all shards
so we don't end up with only some of the shards with the mutated
copy if an error happend mid-way (and then we would need to
roll-back the change for exception safety).
shared_token_metadata::mutate_token_metadata is currently only called from
a unit test that needs to mutate the token metadata only on shard 0,
but a following patch will require doing that on all shards.
This change adds this capbility by enforcing the call to be
on shard 0m mutating the token_metdata into a temporary pending copy
and cloning it on all other shards. Only then, when all shard
succeeded, set the modified token_metadata on all shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the thread_local default_location out of
topology::get_location so it can be used elsewhere.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so to reduce the number of conditional banches (apparently the compiler isn't doing a good enough job at optimizing it in this case)
2. avoid sstring copy in topology::get_{datacenter,rack}
Closes#12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
turns out we are using static variables to register entries in
global registries, and these variables are not directly referenced,
so linker just drops them when linking the executables or shared
libraries. to address this problem, we just link the whole archive.
another option would be create a linker script or pass
--undefined=<symbol> to linker. neither of them is straightforward.
a helper function is introduced to do this, as we cannot use CMake
3.24 as yet.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`effective_replication_map` is not a base class of any other class. so
there is no need to mark any of its member function as `virtual`. this
change should address following waring from Clang:
```
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:205:9: error: delete called on non-final 'locator::effective_replication_map' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete value_ptr;
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:202:9: note: in instantiation of member function 'seastar::internal::lw_shared_ptr_accessors_esft<locator::effective_replication_map>::dispose' requested here
dispose(static_cast<T*>(counter));
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:317:27: note: in instantiation of member function 'seastar::internal::lw_shared_ptr_accessors_esft<locator::effective_replication_map>::dispose' requested here
accessors<T>::dispose(_p);
^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:263:12: note: in instantiation of member function 'seastar::lw_shared_ptr<locator::effective_replication_map>::~lw_shared_ptr' requested here
return make_lw_shared<effective_replication_map>(std::move(rs), std::move(tmptr), std::move(replication_map), replication_factor);
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12992
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This function is called on the fast data path
from storage_proxy when sorting multiple endpoints
by proximity.
This change calculates numeric node diff metrics
based on each address proximity to a given node
(by <dc, rack, same node>) to eliminate logic
branches in the function and reduce its footprint.
based on objdump -d output, compare_endpoints
footprint was reduced by 58.5% (3632 / 8752 bytes)
with clang version 15.0.7 (Fedora 15.0.7-1.fc37)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a regression unit test for topology::compare_endpoint
before it's optimized in the following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, effective_replication_map::do_get_ranges accepts
a functor that traverses the natural endpoints of each token
to decide whether a token range should be returned or not.
This is done by copying the natural endpoints vector for
each token. However, other than special strategies like
everywhere and local, the functor can be called on the
precalculated inet_address_vector_replica_set in the
replication_map and there's no need to copy it for each call.
for_each_natural_endpoint_until passes a reference to the function
down to the abstract replication strategy to let it work either
on the precalculated inet_address_vector_replica_set or
on a ad-hoc vector prepared by the replication strategy.
The function returns stop_iteration::yes when a match or mismatch
are found, or stop_iteration::no while it has no definite result.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12737
get_address_ranges() and get_ranges() perform almost the same computation.
They return the same ranges -- the only difference is that
get_address_ranges() returns them in unspecified order, while get_ranges()
returns them in sorted order. Therefore the result of get_ranges() is also
a valid result for get_address_ranges(), and the two functions can be unified
to avoid code duplication. This patch does just that.
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.
This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.
Refs #10337
Refs #10817
Refs #10836
Refs #10837Fixes#12724
since format_to() is defined included by both fmt and std namepaces,
without specifying which one to use, we'd fail to build with the
standard library which implements std::format_to(). yes, we are
`using namespace std` somewhere.
this change should address the FTBFS with GCC-13.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Most of snitch drivers set _my_dc and _my_rack with direct assignment
thus skipping the sanity checks for dc/rack being empty. On other shards
they call set_my_dc_and_rack() helper which warns the empty value and
replaces it with some defaults.
It's better to use the helper on all shards in order to have the same
dc/rack values everywhere.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12524
Azure metadata API may return empty zone sometimes. If that happens
shard-0 gets empty string as its rack, but propagates UNKNOWN_RACK to
other shards.
Empty zones response should be handled regardless.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12274