When allocating tablets for table for the frist time their initial count
is calculated so that each shard in a cluster gets one tablet. It may
happen that more than one initial tablet per shard is better, e.g. perf
tests typically rely on that.
It's possible to specify the initial tablets count when creating a
keyspace, this number doesn't take the cluster topology into
consideration and may also be not very nice.
As a temporary solution (e.g. for perf tests) we may add a configurable
that scales the initial number of calculated tablets by some factor
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When calculating per-DC tablets the formula is shards_in_dc / rf_in_dc,
but the denominator in it can be configured to be literally zero and the
division doesn't work.
Fix by assuming zero tablets for dcs with zero rf
fixes: #16844
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16861
This patch changes the syntax of enabling tablets from
CREATE KEYSPACE ... WITH REPLICATION = { ..., 'initial_tablets': <int> }
to be
CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> }
and updates all tests accordingly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If user configured zero initial tablets (spoiler: or this value was set
automagically when enabling tablets begind the scenes) we still need
some value to start with and this patch calculates one.
The math is based on topology and RF so that all shards are covered:
initial_tablets = max(nr_shards_in(dc) / RF_in(dc) for dc in datacenters)
The estimation is done when a table is created, not when the keyspace is
created. For that, the keyspace is configured with zero initial tabled,
and table-creation time zero is converted into auto-estimated value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a function that returns all nodes that have vnode been moved to them
during a topology change operation. Needed to know which nodes need to
do cleanup in case of failed topology change operation.
Some topology change operations causes some nodes loose ranges. This
information is needed to know which nodes need to do cleanup after
topology operation completes. Pre calculate it during erm creation.
This is an optimisation - for_each_natural_endpoint_until is
called only for vnode tokens, we don't need to run the
binary search for it in tm.first_token.
Also the function is made private since it's only used
in erm itself.
Before this patch the host_id -> IP mapping was done
in calculate_effective_replication_map. This function
is called from mutate_token_metadata, which means we
have to have an IP for each host_id in topology_state_load,
otherwise we get an error. We are going to remove
the IP waiting loop from topology_state_load, so
we need to get rid of IPs resolution from
calculate_effective_replication_map.
In this patch we move the host_id -> IP resolution to
the data plane. When a write or read request is sent
the target endpoints are requested from erm through
get_natural_endpoints_without_node_being_replaced,
get_pending_endpoints and get_endpoints_for_reading
methods and this is where the IP resolution
will now occur.
In the next patches we are going to change erm data structures
(replication_map and ring_mapping) from IP to host_id. Having
locator::host_id instead of IP in has_pending_ranges arguments
makes this transition easier.
The strategy constructor prints the dc:rf at the end making the sstring
for it by hand. Modern fmt-based logger can format unordered_map-s on
its own. The message would look slightly different though:
Configured datacenter replicas are: foo:1 bar:2
into
Configured datacenter replicas are: {"foo": 1, "bar": 2}
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16443
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.
In particular, thise allows storage_service::on_change
to update_peer_info once for all changed states.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The option is kepd in DDL, but is _not_ stored in
system_schema.keyspaces. Instead, it's removed from the provided options
and kept in scylla_keyspaces table in its own column. All the places
that had optional initial_tablets disengaged now set this value up the
way the find appropriate.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now all the callers have it at hands (spoiler: not yet initialized, but
still) so the params can also have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter method is the one that will need extended params in next
patches. It's called from network_topology_strategy() constructor which
already has params at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous patch added params to r.s. classes' constructors, but callers
don't construct those directly, instead they use the create_r.s.()
wrapper. This patch adds params to the wrapper too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When replication strategy class is created caller parr const reference
on the config options which is, in turn, a map<string, string>. In the
future r.s. classes will need to get "scylla specific" info along with
legacy options and this patch prepares for that by passing more generic
params argument into constructor. Currently the only inhabitant of the
new params is the legacy options.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The `config_options` arg had been used to initialize `_config_options`
field of the base abstract_replication_strategy class, so it's more
idiomatic to use the latter. Also it makes next patches simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we use the format string of
"Can't replace node {} with itself", but fail to include the host id as seastar::format()'s arguments. this fails the compile-time check of fmt, which is yet merged. so, if we really run into this problem, {fmt} would throw before the intended runtime_error is raised -- currently, seastar::log formats the logging messages at runtime, this is not intended.
in this change, we pass `existing_node`, so it can be formatted, and the
intended error message can be printed in log.
Refs 11a4908683
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16422
Make host_id parameter non-optional and
move it to the beginning of the arguments list.
Delete unused overloads of add_or_update_endpoint.
Delete unused overload of token_metadata::update_topology
with inet_address argument.
This used to work before in replace-with-same-ip scenario, but
with host_id-s it's no longer relevant.
base_token_metadata has been removed from topology_change_info
because the conditions needed for its creation
are no longer met.
In this commit we replace token_metadata with token_metadata2
in the erm interface and field types. To accommodate the change
some of strategy-related methods are also updated.
All the boost and topology tests pass with this change.
The token_metadata::get_normal_and_bootstrapping_token_to_endpoint_map
method was used only here. It's inlined in this
commit since it's too specific and incurs the overhead
of creating an intermediate map.
locator_topology_test, network_topology_strategy_test and
tablets_test are fully switched to the host_id-based token_metadata,
meaning they no longer populate the old token_metadata.
All the boost and topology tests pass with this change.
In this commit we switch the function
calculate_effective_replication_map to use the new
token_metadata. We do this by employing our new helper
calculate_natural_ips function. We can't use this helper for
current_endpoints/target_endpoints though,
since in that case we won't add the IP to the
pending_endpoints in the replace-with-same-ip scenario
The token_metadata_test is migrated to host_ids in the same
commit to make it pass. Other tests work because they fill
both versions of the token_metadata, but for this test it was
simpler to just migrate it straight away. The test constructs
the old token_metadata over the new token_metadata,
this means only the get_new() method will work on it. That's
why we also need to switch some other functions
(maybe_remove_node_being_replaced, do_get_natural_endpoints,
get_replication_factor) to the new version in the same commit.
All the boost and topology tests pass with this change.
We've updated all the places where token_metadata
is mutated, and now we can progress to the next stage
of the refactoring - gradually switching the read
code paths.
The calculate_natural_endpoints function
is at the core of all of them. It decides to what nodes
the given token should be replicated to for the given
token_metadata. It has a lot of usages in various contexts,
we can't switch them all in one commit, so instead we
allowed the function to behave in both ways. If
use_host_id parameter is false, the function uses the provided
token_metadata as is and returns endpoint_set as a result.
If it's true, it uses get_new() on the provided token_metadata
and returns host_id_set as a result.
The scope of the whole refactoring is limited to the erm data
structure, its interface will be kept inet_address based for now.
This means we'll often need to resolve host_ids to inet_address-es
as soon as we got a result from calculated_natural_endpoints.
A new calculate_natural_ips function is added for convenience.
It uses the new token_metadata and immediately resolves
returned host_id-s to inet_address-es.
The auxiliary declarations natural_ep_type, set_type, vector_type,
get_self_id, select_tm are introduced only for the sake of
migration, they will be removed later.
When we're replacing a node with the same IP address, we want
the following behavior:
* host_id -> IP mapping should work and return the same IP address for two
different host_ids - old and new.
* the IP -> host_id mapping should return the host_id of the old (replaced)
host.
This variant is most convenient for preserving the current behavior
of the code, especially the functions maybe_remove_node_being_replaced,
erm::get_natural_endpoints_without_node_being_replaced,
erm::get_pending_endpoints. The 'being_replaced' node will be properly removed in
maybe_remove_node_being_replaced and 'replacing' node will be added to
the pending_endpoints.
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.
We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
It's a bug to use get_host_id on a non-existent endpoint,
so on_internal_error is more appropriate. Also, it's
easier to debug since it provides a backtrace.
If a missing inet_address is expected, get_host_id_if_known
should be used instead. We update one such case in
storage_service::force_remove_completion. Other
usages of get_host_id are correct.
In this commit we enhance token_metadata with a pointer to the
new host_id-based generic_token_metadata specialisation (token_metadata2).
The idea is that in the following commits we'll go over all token_metadata
modifications and make the corresponding modifications to its new
host_id-based alternative.
The pointer to token_metadata2 is stored in the
generic_token_metadata::_new_value field. The pointer can be
mutable, immutable, or absent altogether (std::monostate).
It's mutable if this generic_token_metadata owns it, meaning
it was created using the generic_token_metadata(config cfg)
constructor. It's immutable if the
generic_token_metadata(lw_shared_ptr<const token_metadata2> new_value);
constructor was used. This means this old token_metadata is a wrapper for
new token_metadata and we can only use the get_new() method on it. The field
_new_value is empty for the new host_id-based token_metadata version.
The generic_token_metadata(std::unique_ptr<token_metadata_impl<NodeId>> impl, token_metadata2 new_value);
constructor is used for clone methods. We clone both versions,
and we need to pass a cloned token_metadata2 into constructor.
There are two overloads of get_new, for mutable and immutable
generic_token_metadata. Both of them throws an exception if
they can't get the appropriate pointer. There is also a
get_new_strong method, which returns an immutable owning
pointer. This is convenient since a lot of API's want an
owning pointer. We can't make the get_new/get_new_strong API
simpler and use get_new_strong everywhere since it mutate the
original generic_token_metadata by incrementing the reference
counter and this causes raises when it's passed between
shards in replicate_to_all_cores.
NodeId is used in all internal token_metadata data structures, that
previously used inet_address. We choose topology::key_kind based
on the value of the template parameter.
generic_token_metadata::update_topology overload with host_id
parameter is added to make update_topology_change_info work,
it now uses NodeId as a parameter type.
topology::remove_endpoint(host_id) is added to make
generic_token_metadata::remove_endpoint(NodeId) work.
pending_endpoints_for and endpoints_for_reading are just removed - they
are not used and not implemented. The declarations were left by mistake
from a refactoring in which these methods were moved to erm.
generic_token_metadata_base is extracted to contain declarations, common
to both token_metadata versions.
Templates are explicitly instantiated inside token_metadata.cc, since
implementation part is also a template and it's not exposed to the header.
There are no other behavioral changes in this commit, just syntax
fixes to make token_metadata a template.
In the next commits token_metadata will be
made a template with NodeId=inet_address|host_id
parameter. This parameter will be passed to dc_rack_fn
function, so it also should be made a template.
For the host_id-based token_metadata we want host_id
to be the main node key, meaning it should be used
in add_or_update_endpoint to find the node to update.
For the inet_address-based token_metadata version
we want to retain the old behaviour during transition period.
In this commit we introduce key_kind parameter and use
key_kind::inet_address in all current topology usages.
Later we'll use key_kind::host_id for the new token_metadata.
In the last commits of the series, when the new token_metadata
version is used everywhere, we will remove key_kind enum.