Currently, opt_st overrides the internal `changed` flag
by setting it with the opt_st changed status.
Instead, it should use `|=` to keep it true if it is already so.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13502
Don't maintain a "shadow" endpoint_to_host_id_map in token_metadata_impl.
Instead, get the nodes_by_endpoint map from topology
and use it to build the endpoint_to_host_id_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a simple node state model with:
`joining`, `normal`, `leaving`, and `left` states
to help managing nodes during replace
with the the same ip address.
Later on, this could also help prevent nodes
that were decommissioned, removed, or replaced
from rejoining the cluster.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And keep per node information (idx, host_id, endpoint, dc_rack, is_pending)
in node objects, indexed by topology on several indices like:
idx, host_id, endpoint, current/pending, per dc, per dc/rack.
The node index is a shorthand identifier for the node.
node* and index are valid while the respective topology instance is valid.
To be used, the caller must hold on to the topology / token_metadata object
(e.g. via a token_metadata_ptr or effective_replication_map)
Refs #6403
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
topology: add node idx
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define get_location() that gets the location
for the local node, and use either this entry point
or get_location(inet_address) to get the respective
dc or rack.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
storage_service::replicate_to_all_cores has a sophisticated way
to mutate the token_metadata and effective_replication_map
on shard 0 and cloning those to all other shards, applying
the changes only mutate and clone succeeded on all shards
so we don't end up with only some of the shards with the mutated
copy if an error happend mid-way (and then we would need to
roll-back the change for exception safety).
shared_token_metadata::mutate_token_metadata is currently only called from
a unit test that needs to mutate the token metadata only on shard 0,
but a following patch will require doing that on all shards.
This change adds this capbility by enforcing the call to be
on shard 0m mutating the token_metdata into a temporary pending copy
and cloning it on all other shards. Only then, when all shard
succeeded, set the modified token_metadata on all shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the thread_local default_location out of
topology::get_location so it can be used elsewhere.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so to reduce the number of conditional banches (apparently the compiler isn't doing a good enough job at optimizing it in this case)
2. avoid sstring copy in topology::get_{datacenter,rack}
Closes#12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
turns out we are using static variables to register entries in
global registries, and these variables are not directly referenced,
so linker just drops them when linking the executables or shared
libraries. to address this problem, we just link the whole archive.
another option would be create a linker script or pass
--undefined=<symbol> to linker. neither of them is straightforward.
a helper function is introduced to do this, as we cannot use CMake
3.24 as yet.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`effective_replication_map` is not a base class of any other class. so
there is no need to mark any of its member function as `virtual`. this
change should address following waring from Clang:
```
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:205:9: error: delete called on non-final 'locator::effective_replication_map' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete value_ptr;
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:202:9: note: in instantiation of member function 'seastar::internal::lw_shared_ptr_accessors_esft<locator::effective_replication_map>::dispose' requested here
dispose(static_cast<T*>(counter));
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:317:27: note: in instantiation of member function 'seastar::internal::lw_shared_ptr_accessors_esft<locator::effective_replication_map>::dispose' requested here
accessors<T>::dispose(_p);
^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:263:12: note: in instantiation of member function 'seastar::lw_shared_ptr<locator::effective_replication_map>::~lw_shared_ptr' requested here
return make_lw_shared<effective_replication_map>(std::move(rs), std::move(tmptr), std::move(replication_map), replication_factor);
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12992
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This function is called on the fast data path
from storage_proxy when sorting multiple endpoints
by proximity.
This change calculates numeric node diff metrics
based on each address proximity to a given node
(by <dc, rack, same node>) to eliminate logic
branches in the function and reduce its footprint.
based on objdump -d output, compare_endpoints
footprint was reduced by 58.5% (3632 / 8752 bytes)
with clang version 15.0.7 (Fedora 15.0.7-1.fc37)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a regression unit test for topology::compare_endpoint
before it's optimized in the following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, effective_replication_map::do_get_ranges accepts
a functor that traverses the natural endpoints of each token
to decide whether a token range should be returned or not.
This is done by copying the natural endpoints vector for
each token. However, other than special strategies like
everywhere and local, the functor can be called on the
precalculated inet_address_vector_replica_set in the
replication_map and there's no need to copy it for each call.
for_each_natural_endpoint_until passes a reference to the function
down to the abstract replication strategy to let it work either
on the precalculated inet_address_vector_replica_set or
on a ad-hoc vector prepared by the replication strategy.
The function returns stop_iteration::yes when a match or mismatch
are found, or stop_iteration::no while it has no definite result.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12737
get_address_ranges() and get_ranges() perform almost the same computation.
They return the same ranges -- the only difference is that
get_address_ranges() returns them in unspecified order, while get_ranges()
returns them in sorted order. Therefore the result of get_ranges() is also
a valid result for get_address_ranges(), and the two functions can be unified
to avoid code duplication. This patch does just that.
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.
This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.
Refs #10337
Refs #10817
Refs #10836
Refs #10837Fixes#12724
since format_to() is defined included by both fmt and std namepaces,
without specifying which one to use, we'd fail to build with the
standard library which implements std::format_to(). yes, we are
`using namespace std` somewhere.
this change should address the FTBFS with GCC-13.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Most of snitch drivers set _my_dc and _my_rack with direct assignment
thus skipping the sanity checks for dc/rack being empty. On other shards
they call set_my_dc_and_rack() helper which warns the empty value and
replaces it with some defaults.
It's better to use the helper on all shards in order to have the same
dc/rack values everywhere.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12524
Azure metadata API may return empty zone sometimes. If that happens
shard-0 gets empty string as its rack, but propagates UNKNOWN_RACK to
other shards.
Empty zones response should be handled regardless.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12274
Now, with a44ca06906, is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12294
* github.com:scylladb/scylladb:
topology: get rid of pending state
topology: debug log update and remove endpoint
Several snitch drivers make http requests to get
region/dc/zone/rack/whatever from the cloud provider. They blindly rely
on the response being successfull and read response body to parse the
data they need from.
That's not nice, add checks for requests finish with http OK statuses.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12287
Now, with a44ca06906,
is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep
all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`describe_ring()` was implemented as a method of `storage_service`. This
patch extracts it from there to a standalone helper function in
`locator/util.hh`.
Recent changes in topology restricted the get_dc/get_rack calls. Older
code was trying to locate the endpoint in gossiper, then in system
keyspace cache and if the endpoint was not found in both -- returned
"default" location.
New code generates internal error in this case. This approach already
helped to spot several BUGs in code that had been eventually fixed, but
echoes of that change still pop up.
This patch relaxes the "missing endpoint" case by printing a warning in
logs and returning back the "default" location like old code did.
tests: update_cluster_layout_tests.py::*
hintedhandoff_additional_test.py::TestHintedHandoff::test_hintedhandoff_rebalance
bootstrap_test.py::TestBootstrap::test_decommissioned_wiped_node_can_join
bootstrap_test.py::TestBootstrap::test_failed_bootstap_wiped_node_can_join
materialized_views_test.py::TestMaterializedViews::test_decommission_node_during_mv_insert_4_nodes
refs: #11900
refs: #12054fixes: #11870
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12067
Returns an unordered set of datacenter names
to be used by network_topology_replication_strategy
and for ks_prop_defs.
The set is kept in sync with _dc_endpoints.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12023
Since commit a980f94 (token_metadata: impl: keep the set of normal token
owners as a member), we have a set, _normal_token_owners, which contains
all the nodes in the ring.
We can use _normal_token_owners to check if a node is part of the ring
directly instead of going through the _toplogy indirectly.
Fixes#11935
update_normal_tokens is the way to add a new node into the ring. We
should not require a new node to already be in the ring to be able to
add it to the ring. The current code works accidentally because
is_member is checking if a node is in the topology
We should use _topology.has_endpoint to check if a node is part of the
topology explicitly.
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes#11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780Closes#11942
* github.com:scylladb/scylladb:
message: messaging_service: check for known topology before calling is_same_dc/rack
test: reenable test_topology::test_decommission_node_add_column
test/pylib: util: configurable period in wait_for
message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
message: messaging_service: topology independent connection settings for GOSSIP verbs
As described in #11993 per-shard repair_info instances get the effective_replication_map on their own with no centralized synchronization.
This series ensures that the effective replication maps used by repair (and other associated structures like the token metadata and topology) are all in sync with the one used to initiate the repair operation.
While at at, the series includes other cleanups in this area in repair and view that are not fixes as the calls happen in synchronous functions that do not yield.
Fixes#11993Closes#11994
* github.com:scylladb/scylladb:
repair: pass erm down to get_hosts_participating_in_repair and get_neighbors
repair: pass effective_replication_map down to repair_info
repair: coroutinize sync_data_using_repair
repair: futurize do_repair_start
effective_replication_map: add global_effective_replication_map
shared_token_metadata: get_lock is const
repair: sync_data_using_repair: require to run on shard 0
repair: require all node operations to be called on shard 0
repair: repair_info: keep effective_replication_map
repair: do_repair_start: use keyspace erm to get keyspace local ranges
repair: do_repair_start: use keyspace erm for get_primary_ranges
repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc
repair: do_repair_start: check_in_shutdown first
repair: get_db().local() where needed
repair: get topology from erm/token_metdata_ptr
view: get_view_natural_endpoint: get topology from erm
This series moves the topology code from locator/token_metadata.{cc,hh} out to localtor/topology.{cc,hh}
and introduces a shared header file: locator/types.hh contains shared, low level definitions, in anticipation of https://github.com/scylladb/scylladb/pull/11987
While at it, the token_metadata functions are turned into coroutines
and topology copy constructor is deleted. The copy functionality is moved into an async `clone_gently` function that allows yielding while copying the topology.
Closes#12001
* github.com:scylladb/scylladb:
locator: refactor topology out of token_metadata
locator: add types.hh
topology: delete copy constructor
token_metadata: coroutinize clone functions
An unordered_set is more efficient and there is no need
to return an ordered set for this purpose.
This change facilitates a follow-up change of adding
topology::get_datacenters(), returning an unordered_set
of datacenter names.
Refs #11987
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12003
Class to hold a coherent view of a keyspace
effective replication map on all shards.
To be used in a following patch to pass the sharded
keyspace e_r_m:s to repair.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Topology is copied only from token_metadata_impl::clone_only_token_map
which copies the token_metadata_impl with yielding to prevent reactor
stalls. This should apply to topology as well, so
add a clone_gently function for cloning the topology
from token_metadata_impl::clone_only_token_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when topology was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes#11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780
Despite docs discourage from using INADDR_ANY as listen address, this is not disabled in code. Worse -- some snitch drivers may gossip it around as the INTERNAL_IP state. This set prevents this from happening and also adds a sanity check not to use this value if it somehow sneaks in.
Closes#11846
* github.com:scylladb/scylladb:
messaging_service: Deny putting INADD_ANY as preferred ip
messaging_service: Toss preferred ip cache management
gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP
gossiping_property_file_snitch: Make _listen_address optional
To be used for specifying nodes either by their
host_id or ip address and using the token_metadata
to resolve the mapping.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>