tablets, mv: disable self-pairing when tablets are used

A write to a base table can generate one or more writes to a materialized view. The write to RF base replicas need to cause writes to RF view replicas. Our MV implementation, based on Cassandra's implementation, does this via "pairing": Each one of the base replicas involved in this write sends each view update to exactly one view replica. The function get_view_natural_endpoint() tells a base replica which of the view replicas it should send the update to. The standard pairing is based on the ring order: The first owner of the base token sends to the first owner of the view token, the second to the second, and so on. However, the existing code also uses an optimization we call self-pairing: If a single node is both a base replica and a base replica, the pairing is modified so this node sends the update to itself. This patch *disables* the self-pairing optimization in keyspaces that use tablets: The self-pairing optimization can cause the pairing to change after token ranges are moved between nodes, so it can break base-view consistency in some edge cases, leading to "ghost rows". With tablets, these range movements become even more frequent - they can happen even if the cluster doesn't grow. This is why we want to solve this problem for tablets. For backward compatibility and to avoid sudden inconsistencies emerging during upgrades, we decided to continue using the self-pairing optimization for keyspaces that are *not* using tablets (i.e., using vnoodes). Currently, we don't introduce a "CREATE MATERIALIZED VIEW" option to override these defaults - i.e., we don't provide a way to disable self-pairing with vnodes or to enable them with tablets. We could introduce such a schema flag later, if we ever want to (and I'm not sure we want to). It's important to note, that in some cases, this change has implications on when view updates become synchronous, in the tablets case. For example: * If we have 3 nodes and RF=3, with the self-pairing optimization each node is paired with itself, the view update is local, and is implicitly synchronous (without requiring a "synchronous_updates" flag). * In the same setup with tablets, without the self-pairing optimization (due to this patch), this is not guaranteed. Some view updates may not be synchronous, i.e., the base write will not wait for the view write. If the user really wants synchronous updates, they should be requested explicitly, with the "synchronous_updates" view option. Fixes #16260. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16272
2023-12-04 11:22:04 +02:00
parent f483309165
commit 300e549267
1 changed files with 39 additions and 16 deletions
--- a/db/view/view.cc
+++ b/db/view/view.cc
@@ -1536,6 +1536,14 @@ bool needs_static_row(const mutation_partition& mp, const std::vector<view_and_b
 // of this function is to find, assuming that this node is one of the base
 // replicas for a given partition, the paired view replica.
 //
+// In the past, we used an optimization called "self-pairing" that if a single
+// node was both a base replica and a view replica for a write, the pairing is
+// modified so that this node would send the update to itself. This self-
+// pairing optimization could cause the pairing to change after view ranges
+// are moved between nodes, so currently we only use it if
+// use_legacy_self_pairing is set to true. When using tablets - where range
+// movements are common - it is strongly recommended to set it to false.
+//
 // If the keyspace's replication strategy is a NetworkTopologyStrategy,
 // we pair only nodes in the same datacenter.
 // If one of the base replicas also happens to be a view replica, it is
@@ -1550,7 +1558,8 @@ get_view_natural_endpoint(
        const locator::effective_replication_map_ptr& view_erm,
        bool network_topology,
        const dht::token& base_token,
-        const dht::token& view_token) {
+        const dht::token& view_token,
+        bool use_legacy_self_pairing) {
    auto& topology = base_erm->get_token_metadata_ptr()->get_topology();
    auto my_address = topology.my_address();
    auto my_datacenter = topology.get_datacenter();
@@ -1562,6 +1571,7 @@ get_view_natural_endpoint(
    }

    for (auto&& view_endpoint : view_erm->get_natural_endpoints(view_token)) {
+        if (use_legacy_self_pairing) {
            // If this base replica is also one of the view replicas, we use
            // ourselves as the view replica.
            if (view_endpoint == my_address) {
@@ -1577,12 +1587,19 @@ get_view_natural_endpoint(
            } else if (!network_topology || topology.get_datacenter(view_endpoint) == my_datacenter) {
                view_endpoints.push_back(view_endpoint);
            }
+        } else {
+            if (!network_topology || topology.get_datacenter(view_endpoint) == my_datacenter) {
+                view_endpoints.push_back(view_endpoint);
+            }
+        }
    }

    assert(base_endpoints.size() == view_endpoints.size());
    auto base_it = std::find(base_endpoints.begin(), base_endpoints.end(), my_address);
    if (base_it == base_endpoints.end()) {
        // This node is not a base replica of this key, so we return empty
+        // FIXME: This case shouldn't happen, and if it happens, a view update
+        // would be lost. We should reported or count this case.
        return {};
    }
    return view_endpoints[base_it - base_endpoints.begin()];
@@ -1638,7 +1655,13 @@ future<> view_update_generator::mutate_MV(
        auto view_ermp = mut.s->table().get_effective_replication_map();
        auto& ks = _proxy.local().local_db().find_keyspace(mut.s->ks_name());
        bool network_topology = dynamic_cast<const locator::network_topology_strategy*>(&ks.get_replication_strategy());
-        auto target_endpoint = get_view_natural_endpoint(base_ermp, view_ermp, network_topology, base_token, view_token);
+        // We set legacy self-pairing for old vnode-based tables (for backward
+        // compatibility), and unset it for tablets - where range movements
+        // are more frequent and backward compatibility is less important.
+        // TODO: Maybe allow users to set use_legacy_self_pairing explicitly
+        // on a view, like we have the synchronous_updates_flag.
+        bool use_legacy_self_pairing = !ks.get_replication_strategy().uses_tablets();
+        auto target_endpoint = get_view_natural_endpoint(base_ermp, view_ermp, network_topology, base_token, view_token, use_legacy_self_pairing);
        auto remote_endpoints = view_ermp->get_pending_endpoints(view_token);
        auto sem_units = pending_view_updates.split(mut.fm.representation().size());