tablets, mv: disable self-pairing when tablets are used

A write to a base table can generate one or more writes to a materialized
view. The write to RF base replicas need to cause writes to RF view
replicas. Our MV implementation, based on Cassandra's implementation,
does this via "pairing": Each one of the base replicas involved in this
write sends each view update to exactly one view replica. The function
get_view_natural_endpoint() tells a base replica which of the view
replicas it should send the update to.

The standard pairing is based on the ring order: The first owner of the
base token sends to the first owner of the view token, the second to the
second, and so on. However, the existing code also uses an optimization
we call self-pairing: If a single node is both a base replica and a base
replica, the pairing is modified so this node sends the update to itself.

This patch *disables* the self-pairing optimization in keyspaces that
use tablets:

The self-pairing optimization can cause the pairing to change after
token ranges are moved between nodes, so it can break base-view consistency
in some edge cases, leading to "ghost rows". With tablets, these range
movements become even more frequent - they can happen even if the
cluster doesn't grow.  This is why we want to solve this problem for tablets.

For backward compatibility and to avoid sudden inconsistencies emerging
during upgrades, we decided to continue using the self-pairing optimization
for keyspaces that are *not* using tablets (i.e., using vnoodes).

Currently, we don't introduce a "CREATE MATERIALIZED VIEW" option to
override these defaults - i.e., we don't provide a way to disable
self-pairing with vnodes or to enable them with tablets. We could introduce
such a schema flag later, if we ever want to (and I'm not sure we want to).

It's important to note, that in some cases, this change has implications
on when view updates become synchronous, in the tablets case.
For example:

  * If we have 3 nodes and RF=3, with the self-pairing optimization each
    node is paired with itself, the view update is local, and is
    implicitly synchronous (without requiring a "synchronous_updates"
    flag).
  * In the same setup with tablets, without the self-pairing optimization
    (due to this patch), this is not guaranteed. Some view updates may not
    be synchronous, i.e., the base write will not wait for the view
    write. If the user really wants synchronous updates, they should
    be requested explicitly, with the "synchronous_updates" view option.

Fixes #16260.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#16272
This commit is contained in:
Nadav Har'El
2023-12-04 11:22:04 +02:00
committed by Avi Kivity
parent f483309165
commit 300e549267

View File

@@ -1536,6 +1536,14 @@ bool needs_static_row(const mutation_partition& mp, const std::vector<view_and_b
// of this function is to find, assuming that this node is one of the base
// replicas for a given partition, the paired view replica.
//
// In the past, we used an optimization called "self-pairing" that if a single
// node was both a base replica and a view replica for a write, the pairing is
// modified so that this node would send the update to itself. This self-
// pairing optimization could cause the pairing to change after view ranges
// are moved between nodes, so currently we only use it if
// use_legacy_self_pairing is set to true. When using tablets - where range
// movements are common - it is strongly recommended to set it to false.
//
// If the keyspace's replication strategy is a NetworkTopologyStrategy,
// we pair only nodes in the same datacenter.
// If one of the base replicas also happens to be a view replica, it is
@@ -1550,7 +1558,8 @@ get_view_natural_endpoint(
const locator::effective_replication_map_ptr& view_erm,
bool network_topology,
const dht::token& base_token,
const dht::token& view_token) {
const dht::token& view_token,
bool use_legacy_self_pairing) {
auto& topology = base_erm->get_token_metadata_ptr()->get_topology();
auto my_address = topology.my_address();
auto my_datacenter = topology.get_datacenter();
@@ -1562,6 +1571,7 @@ get_view_natural_endpoint(
}
for (auto&& view_endpoint : view_erm->get_natural_endpoints(view_token)) {
if (use_legacy_self_pairing) {
// If this base replica is also one of the view replicas, we use
// ourselves as the view replica.
if (view_endpoint == my_address) {
@@ -1577,12 +1587,19 @@ get_view_natural_endpoint(
} else if (!network_topology || topology.get_datacenter(view_endpoint) == my_datacenter) {
view_endpoints.push_back(view_endpoint);
}
} else {
if (!network_topology || topology.get_datacenter(view_endpoint) == my_datacenter) {
view_endpoints.push_back(view_endpoint);
}
}
}
assert(base_endpoints.size() == view_endpoints.size());
auto base_it = std::find(base_endpoints.begin(), base_endpoints.end(), my_address);
if (base_it == base_endpoints.end()) {
// This node is not a base replica of this key, so we return empty
// FIXME: This case shouldn't happen, and if it happens, a view update
// would be lost. We should reported or count this case.
return {};
}
return view_endpoints[base_it - base_endpoints.begin()];
@@ -1638,7 +1655,13 @@ future<> view_update_generator::mutate_MV(
auto view_ermp = mut.s->table().get_effective_replication_map();
auto& ks = _proxy.local().local_db().find_keyspace(mut.s->ks_name());
bool network_topology = dynamic_cast<const locator::network_topology_strategy*>(&ks.get_replication_strategy());
auto target_endpoint = get_view_natural_endpoint(base_ermp, view_ermp, network_topology, base_token, view_token);
// We set legacy self-pairing for old vnode-based tables (for backward
// compatibility), and unset it for tablets - where range movements
// are more frequent and backward compatibility is less important.
// TODO: Maybe allow users to set use_legacy_self_pairing explicitly
// on a view, like we have the synchronous_updates_flag.
bool use_legacy_self_pairing = !ks.get_replication_strategy().uses_tablets();
auto target_endpoint = get_view_natural_endpoint(base_ermp, view_ermp, network_topology, base_token, view_token, use_legacy_self_pairing);
auto remote_endpoints = view_ermp->get_pending_endpoints(view_token);
auto sem_units = pending_view_updates.split(mut.fm.representation().size());