Merge 'doc: improve the docs for handling failures' from Anna Stuchlik
This PR improves how failure handling is documented and makes it easier for users to find.

- The Handling Failures section is moved from Raft to Troubleshooting.
- Two new topics about failures are added to Troubleshooting, each with a link to the Handling Failures page: Failure to Add, Remove, or Replace a Node, and Failure to Update the Schema.
- A note is added to the add/remove/replace node procedures to indicate that a quorum is required.

See individual commits for more details.

Fixes https://github.com/scylladb/scylladb/issues/13149

Closes scylladb/scylladb#15628

* github.com:scylladb/scylladb:
  doc: add a note about Raft
  doc: add the quorum requirement to procedures
  doc: add more failure info to Troubleshooting
  doc: move Handling Failures to Troubleshooting
@@ -163,7 +163,7 @@ The message suggests the initial course of action:
|
||||
One of the reasons why the procedure may get stuck is a pre-existing problem in schema definitions which causes schema to be unable to synchronize in the cluster. The procedure cannot proceed unless it ensures that schema is synchronized.
|
||||
If **all nodes are alive and the network is healthy**, you performed a rolling restart, but the issue still persists, contact `ScyllaDB support <https://www.scylladb.com/product/support/>`_ for assistance.
|
||||
|
||||
If some nodes are **dead and irrecoverable**, you'll need to perform a manual recovery procedure. Consult :ref:`the section about Raft recovery <recover-raft-procedure>`.
|
||||
If some nodes are **dead and irrecoverable**, you'll need to perform a manual recovery procedure. Consult :ref:`the section about Raft recovery <recovery-procedure>`.
|
||||
|
||||
|
||||
Verifying that Raft is enabled
|
||||
@@ -189,7 +189,7 @@ on every node.
|
||||
|
||||
If the query returns 0 rows, or ``value`` is ``synchronize`` or ``use_pre_raft_procedures``, it means that the cluster is in the middle of the Raft upgrade procedure; consult the :ref:`relevant section <verify-raft-procedure>`.
|
||||
|
||||
If ``value`` is ``recovery``, it means that the cluster is in the middle of the manual recovery procedure. The procedure must be finished. Consult :ref:`the section about Raft recovery <recover-raft-procedure>`.
|
||||
If ``value`` is ``recovery``, it means that the cluster is in the middle of the manual recovery procedure. The procedure must be finished. Consult :ref:`the section about Raft recovery <recovery-procedure>`.
|
||||
|
||||
If ``value`` is anything else, it might mean data corruption or a mistake when performing the manual recovery procedure. The value will be treated as if it was equal to ``recovery`` when the node is restarted.
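The interpretation rules above can be summarized in a small helper. This is only an illustrative sketch: the function name and the returned labels are hypothetical, and the value that indicates a finished upgrade (``use_post_raft_procedures``) is an assumption that does not appear in this excerpt, so it is passed in as a parameter.

```python
from typing import Optional

# States that, per the text above, mean the Raft upgrade is still running.
UPGRADE_IN_PROGRESS = {"synchronize", "use_pre_raft_procedures"}


def classify_upgrade_state(
    value: Optional[str],
    enabled_value: str = "use_post_raft_procedures",  # assumption, not from this excerpt
) -> str:
    """Map the ``value`` column of ``group0_upgrade_state`` to a short label.

    ``value`` is None when the query returned 0 rows.
    """
    if value is None or value in UPGRADE_IN_PROGRESS:
        return "upgrade-in-progress"
    if value == "recovery":
        return "recovery-in-progress"
    if value == enabled_value:
        return "raft-enabled"
    # Anything else: possible corruption or a mistake during manual recovery.
    # The node treats it as ``recovery`` after a restart.
    return "unexpected-treated-as-recovery"
```

The helper only classifies a value you have already read with ``cqlsh``; it does not query the cluster itself.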
|
||||
|
||||
@@ -219,127 +219,8 @@ In summary, Raft makes schema changes safe, but it requires that a quorum of nod
|
||||
|
||||
Handling Failures
|
||||
------------------
|
||||
Raft requires a quorum of nodes in a cluster to be available. If one or more nodes are down, but the quorum is live, reads, writes,
|
||||
and schema updates proceed unaffected.
|
||||
When the node that was down is up again, it first contacts the cluster to fetch the latest schema and then starts serving queries.
|
||||
|
||||
The following examples show the recovery actions depending on the number of nodes and DCs in your cluster.
|
||||
|
||||
Examples
|
||||
=========
|
||||
|
||||
.. list-table:: Cluster A: 1 datacenter, 3 nodes
|
||||
:widths: 20 40 40
|
||||
:header-rows: 1
|
||||
|
||||
* - Failure
|
||||
- Consequence
|
||||
- Action to take
|
||||
* - 1 node
|
||||
- Schema updates are possible and safe.
|
||||
- Try restarting the node. If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
|
||||
* - 2 nodes
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- Restart at least 1 of the 2 nodes that are down to regain quorum. If you can’t recover at least 1 of the 2 nodes, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
|
||||
|
||||
.. list-table:: Cluster B: 2 datacenters, 6 nodes (3 nodes per DC)
|
||||
:widths: 20 40 40
|
||||
:header-rows: 1
|
||||
|
||||
* - Failure
|
||||
- Consequence
|
||||
- Action to take
|
||||
* - 1-2 nodes
|
||||
- Schema updates are possible and safe.
|
||||
- Try restarting the node(s). If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
|
||||
* - 3 nodes
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- Restart 1 of the 3 nodes that are down to regain quorum. If you can’t recover at least 1 of the 3 failed nodes, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
|
||||
* - 1 DC
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- When the DC comes back online, restart the nodes. If the DC fails to come back online and the nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
|
||||
|
||||
|
||||
.. list-table:: Cluster C: 3 datacenters, 9 nodes (3 nodes per DC)
|
||||
:widths: 20 40 40
|
||||
:header-rows: 1
|
||||
|
||||
* - Failure
|
||||
- Consequence
|
||||
- Action to take
|
||||
* - 1-4 nodes
|
||||
- Schema updates are possible and safe.
|
||||
- Try restarting the nodes. If the nodes are dead, :doc:`replace them with new nodes </operating-scylla/procedures/cluster-management/replace-dead-node-or-more/>`.
|
||||
* - 1 DC
|
||||
- Schema updates are possible and safe.
|
||||
- When the DC comes back online, try restarting the nodes in the cluster. If the nodes are dead, :doc:`add 3 new nodes in a new region </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
|
||||
* - 2 DCs
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- When the DCs come back online, restart the nodes. If at least one DC fails to come back online and the nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
|
||||
|
||||
.. _recover-raft-procedure:
|
||||
|
||||
Raft manual recovery procedure
|
||||
==============================
|
||||
|
||||
The manual Raft recovery procedure applies to the following situations:
|
||||
|
||||
* :ref:`The Raft upgrade procedure <verify-raft-procedure>` got stuck because one of your nodes failed in the middle of the procedure and is irrecoverable,
|
||||
* or the cluster was running Raft but a majority of nodes (e.g. 2 out of 3) failed and are irrecoverable. Raft cannot progress unless a majority of nodes is available.
|
||||
|
||||
.. warning::
|
||||
|
||||
Perform the manual recovery procedure **only** if you're dealing with **irrecoverable** nodes. If it is possible to restart your nodes, do that instead of manual recovery.
|
||||
|
||||
.. note::
|
||||
|
||||
Before proceeding, make sure that the irrecoverable nodes are truly dead, and not, for example, temporarily partitioned away due to a network failure. If it is possible for the 'dead' nodes to come back to life, they might communicate and interfere with the recovery procedure and cause unpredictable problems.
|
||||
|
||||
If you have no means of ensuring that these irrecoverable nodes won't come back to life and communicate with the rest of the cluster, set up firewall rules or otherwise isolate your alive nodes to reject any communication attempts from these dead nodes.
|
||||
|
||||
During the manual recovery procedure you'll enter a special ``RECOVERY`` mode, remove all faulty nodes (using the standard :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`), delete the internal Raft data, and restart the cluster. This will cause the cluster to perform the Raft upgrade procedure again, initializing the Raft algorithm from scratch. The manual recovery procedure is applicable both to clusters which were not running Raft in the past and then had Raft enabled, and to clusters which were bootstrapped using Raft.
|
||||
|
||||
.. note::
|
||||
|
||||
Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while some nodes are already dead may lead to unavailability of data queries (assuming that you haven't lost it already). For example, if you're using the standard RF=3, CL=QUORUM setup, and you're recovering from a stuck upgrade procedure because one of your nodes is dead, restarting another node will cause temporary data query unavailability (until the node finishes restarting). Prepare your service for downtime before proceeding.
|
||||
|
||||
#. Perform the following query on **every alive node** in the cluster, using e.g. ``cqlsh``:
|
||||
|
||||
.. code-block:: cql
|
||||
|
||||
cqlsh> UPDATE system.scylla_local SET value = 'recovery' WHERE key = 'group0_upgrade_state';
|
||||
|
||||
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of your alive nodes.
|
||||
|
||||
#. Verify that all the nodes have entered ``RECOVERY`` mode when restarting; look for one of the following messages in their logs:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
group0_client - RECOVERY mode.
|
||||
raft_group0 - setup_group0: Raft RECOVERY mode, skipping group 0 setup.
|
||||
raft_group0_upgrade - RECOVERY mode. Not attempting upgrade.
|
||||
|
||||
#. Remove all your dead nodes using the :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`.
|
||||
|
||||
#. Remove existing Raft cluster data by performing the following queries on **every alive node** in the cluster, using e.g. ``cqlsh``:
|
||||
|
||||
.. code-block:: cql
|
||||
|
||||
cqlsh> TRUNCATE TABLE system.discovery;
|
||||
cqlsh> TRUNCATE TABLE system.group0_history;
|
||||
cqlsh> DELETE value FROM system.scylla_local WHERE key = 'raft_group0_id';
|
||||
|
||||
#. Make sure that schema is synchronized in the cluster by executing :doc:`nodetool describecluster </operating-scylla/nodetool-commands/describecluster>` on each node and verifying that the schema version is the same on all nodes.
|
||||
|
||||
#. We can now leave ``RECOVERY`` mode. On **every alive node**, perform the following query:
|
||||
|
||||
.. code-block:: cql
|
||||
|
||||
cqlsh> DELETE FROM system.scylla_local WHERE key = 'group0_upgrade_state';
|
||||
|
||||
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of your alive nodes.
|
||||
|
||||
#. The Raft upgrade procedure will start anew. :ref:`Verify <verify-raft-procedure>` that it finishes successfully.
|
||||
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>`.
|
||||
|
||||
.. _raft-learn-more:
|
||||
|
||||
|
||||
@@ -8,6 +8,12 @@ CQL stores data in *tables*, whose schema defines the layout of said data in the
|
||||
which is the replication strategy used by the keyspace. An application can have only one keyspace. However, it is also possible to
|
||||
have multiple keyspaces in case your application has different replication requirements.
|
||||
|
||||
.. note::
|
||||
|
||||
Schema updates require at least a quorum of nodes in a cluster to be available.
|
||||
If the quorum is lost, it must be restored before a schema is updated.
|
||||
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
|
||||
|
||||
This section describes the statements used to create, modify, and remove keyspaces and tables.
|
||||
|
||||
:ref:`CREATE KEYSPACE <create-keyspace-statement>`
|
||||
|
||||
@@ -5,6 +5,11 @@ Adding a New Node Into an Existing ScyllaDB Cluster (Out Scale)
|
||||
When you add a new node, other nodes in the cluster stream data to the new node. This operation is called bootstrapping and may
|
||||
be time-consuming, depending on the data size and network bandwidth. If using a :ref:`multi-availability-zone <faq-best-scenario-node-multi-availability-zone>`, make sure they are balanced.
|
||||
|
||||
.. note::
|
||||
|
||||
Adding a new node requires at least a quorum of nodes in a cluster to be available.
|
||||
If the quorum is lost, it must be restored before a new node is added.
|
||||
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
|
||||
|
||||
Prerequisites
|
||||
-------------
|
||||
|
||||
@@ -4,6 +4,12 @@ Remove a Node from a ScyllaDB Cluster (Down Scale)
|
||||
|
||||
You can remove nodes from your cluster to reduce its size.
|
||||
|
||||
.. note::
|
||||
|
||||
Removing a node requires at least a quorum of nodes in a cluster to be available.
|
||||
If the quorum is lost, it must be restored before a node is removed.
|
||||
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
|
||||
|
||||
-----------------------
|
||||
Removing a Running Node
|
||||
-----------------------
|
||||
|
||||
@@ -2,7 +2,11 @@
|
||||
Replace More Than One Dead Node In A ScyllaDB Cluster
|
||||
******************************************************
|
||||
|
||||
Scylla is a fault-tolerant system. A cluster can be available even when more than one node is down.
|
||||
.. note::
|
||||
|
||||
Replacing a node requires at least a quorum of nodes in a cluster to be available.
|
||||
If the quorum is lost, it must be restored before a node is replaced.
|
||||
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
|
||||
|
||||
Prerequisites
|
||||
-------------
|
||||
|
||||
@@ -5,6 +5,12 @@ Replace dead node operation will cause the other nodes in the cluster to stream
|
||||
|
||||
This procedure is for replacing one dead node. To replace more than one dead node, run the full procedure to completion one node at a time.
|
||||
|
||||
.. note::
|
||||
|
||||
Replacing a node requires at least a quorum of nodes in a cluster to be available.
|
||||
If the quorum is lost, it must be restored before a node is replaced.
|
||||
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
|
||||
|
||||
-------------
|
||||
Prerequisites
|
||||
-------------
|
||||
|
||||
@@ -7,6 +7,12 @@ There are two methods to replace a running node in a Scylla cluster.
|
||||
#. `Add a new node to the cluster and then decommission the old node`_
|
||||
#. `Replace a running node - by taking its place in the cluster`_
|
||||
|
||||
.. note::
|
||||
|
||||
Replacing a node requires at least a quorum of nodes in a cluster to be available.
|
||||
If the quorum is lost, it must be restored before a node is replaced.
|
||||
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
|
||||
|
||||
|
||||
Add a new node to the cluster and then decommission the old node
|
||||
=================================================================
|
||||
|
||||
@@ -5,6 +5,8 @@ Cluster and Node
|
||||
:hidden:
|
||||
:maxdepth: 2
|
||||
|
||||
Handling Node Failures </troubleshooting/handling-node-failures>
|
||||
Failure to Add, Remove, or Replace a Node </troubleshooting/failed-add-remove-replace>
|
||||
Failed Decommission Problem </troubleshooting/failed-decommission/>
|
||||
Cluster Timeouts </troubleshooting/timeouts>
|
||||
Node Joined With No Data </troubleshooting/node-joined-without-any-data>
|
||||
@@ -21,6 +23,8 @@ Cluster and Node
|
||||
</div>
|
||||
<div class="medium-9 columns">
|
||||
|
||||
* :doc:`Handling Node Failures </troubleshooting/handling-node-failures>`
|
||||
* :doc:`Failure to Add, Remove, or Replace a Node </troubleshooting/failed-add-remove-replace>`
|
||||
* :doc:`Failed Decommission Problem </troubleshooting/failed-decommission/>`
|
||||
* :doc:`Cluster Timeouts </troubleshooting/timeouts>`
|
||||
* :doc:`Node Joined With No Data </troubleshooting/node-joined-without-any-data>`
|
||||
|
||||
9
docs/troubleshooting/failed-add-remove-replace.rst
Normal file
@@ -0,0 +1,9 @@
|
||||
Failure to Add, Remove, or Replace a Node
|
||||
------------------------------------------------
|
||||
|
||||
ScyllaDB relies on the Raft consensus algorithm, which requires at least a quorum
|
||||
of nodes in a cluster to be available. If some nodes are down and the quorum is
|
||||
lost, adding, removing, and replacing a node fails.
|
||||
|
||||
See :doc:`Handling Node Failures <handling-node-failures>` for information about
|
||||
recovery actions depending on the number of nodes and DCs in your cluster.
|
||||
9
docs/troubleshooting/failed-update-schema.rst
Normal file
@@ -0,0 +1,9 @@
|
||||
Failure to Update the Schema
|
||||
------------------------------------------------
|
||||
|
||||
ScyllaDB relies on the Raft consensus algorithm, which requires at least a quorum
|
||||
of nodes in a cluster to be available. If some nodes are down and the quorum is
|
||||
lost, schema updates fail.
|
||||
|
||||
See :doc:`Handling Node Failures <handling-node-failures>` for information about
|
||||
recovery actions depending on the number of nodes and DCs in your cluster.
|
||||
159
docs/troubleshooting/handling-node-failures.rst
Normal file
@@ -0,0 +1,159 @@
|
||||
Handling Node Failures
|
||||
------------------------
|
||||
|
||||
.. note::
|
||||
|
||||
This page applies to ScyllaDB clusters that use Raft to ensure consistency.
|
||||
You can verify that Raft-based consistent management is enabled for your
|
||||
cluster in the ``scylla.yaml`` file (enabled by default):
|
||||
``consistent_cluster_management: true``
|
||||
|
||||
.. REMOVE IN FUTURE VERSIONS - Remove the above note when Raft is mandatory
|
||||
and default for both new and existing clusters.
|
||||
|
||||
ScyllaDB relies on the Raft consensus algorithm, which requires at least a quorum
|
||||
of nodes in a cluster to be available. If one or more nodes are down, but the quorum
|
||||
is live, reads, writes, and schema updates proceed unaffected. When the node that
|
||||
was down is up again, it first contacts the cluster to fetch the latest schema and
|
||||
then starts serving queries.
|
||||
|
||||
The following examples show the recovery actions when one or more nodes or DCs
|
||||
are down, depending on the number of nodes and DCs in your cluster.
|
||||
|
||||
Examples
|
||||
=========
|
||||
|
||||
.. list-table:: Cluster A: 1 datacenter, 3 nodes
|
||||
:widths: 20 40 40
|
||||
:header-rows: 1
|
||||
|
||||
* - Failure
|
||||
- Consequence
|
||||
- Action to take
|
||||
* - 1 node
|
||||
- Schema updates are possible and safe.
|
||||
- Try restarting the node. If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
|
||||
* - 2 nodes
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- Restart at least 1 of the 2 nodes that are down to regain quorum. If you can’t recover at least 1 of the 2 nodes, consult the :ref:`manual recovery section <recovery-procedure>`.
|
||||
|
||||
.. list-table:: Cluster B: 2 datacenters, 6 nodes (3 nodes per DC)
|
||||
:widths: 20 40 40
|
||||
:header-rows: 1
|
||||
|
||||
* - Failure
|
||||
- Consequence
|
||||
- Action to take
|
||||
* - 1-2 nodes
|
||||
- Schema updates are possible and safe.
|
||||
- Try restarting the node(s). If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
|
||||
* - 3 nodes
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- Restart 1 of the 3 nodes that are down to regain quorum. If you can’t recover at least 1 of the 3 failed nodes, consult the :ref:`manual recovery <recovery-procedure>` section.
|
||||
* - 1 DC
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- When the DC comes back online, restart the nodes. If the DC fails to come back online and the nodes are lost, consult the :ref:`manual recovery <recovery-procedure>` section.
|
||||
|
||||
|
||||
.. list-table:: Cluster C: 3 datacenters, 9 nodes (3 nodes per DC)
|
||||
:widths: 20 40 40
|
||||
:header-rows: 1
|
||||
|
||||
* - Failure
|
||||
- Consequence
|
||||
- Action to take
|
||||
* - 1-4 nodes
|
||||
- Schema updates are possible and safe.
|
||||
- Try restarting the nodes. If the nodes are dead, :doc:`replace them with new nodes </operating-scylla/procedures/cluster-management/replace-dead-node-or-more/>`.
|
||||
* - 1 DC
|
||||
- Schema updates are possible and safe.
|
||||
- When the DC comes back online, try restarting the nodes in the cluster. If the nodes are dead, :doc:`add 3 new nodes in a new region </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
|
||||
* - 2 DCs
|
||||
- Data is available for reads and writes, schema changes are impossible.
|
||||
- When the DCs come back online, restart the nodes. If at least one DC fails to come back online and the nodes are lost, consult the :ref:`manual recovery <recovery-procedure>` section.
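The consequences listed in the three tables follow from simple majority arithmetic. The sketch below is illustrative only: it assumes that every node in the cluster counts toward the quorum and that a quorum is a strict majority of all nodes. The function names are hypothetical and not part of any ScyllaDB tool.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of live nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, failed_nodes: int) -> bool:
    """True if the surviving nodes still form a majority."""
    return total_nodes - failed_nodes >= quorum_size(total_nodes)


if __name__ == "__main__":
    # (cluster label, total nodes, failed nodes) taken from the tables above.
    scenarios = [
        ("Cluster A, 1 node down", 3, 1),
        ("Cluster A, 2 nodes down", 3, 2),
        ("Cluster B, 3 nodes or 1 DC down", 6, 3),
        ("Cluster C, 1 DC down", 9, 3),
        ("Cluster C, 2 DCs down", 9, 6),
    ]
    for label, total, failed in scenarios:
        state = "quorum kept" if has_quorum(total, failed) else "quorum lost"
        print(f"{label}: {state}")
```

Under these assumptions, a 6-node cluster split evenly across 2 DCs loses its quorum as soon as one whole DC is down, whereas a 9-node cluster across 3 DCs does not. This matches the difference between the Cluster B and Cluster C rows.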
|
||||
|
||||
.. _recovery-procedure:
|
||||
|
||||
Manual Recovery Procedure
|
||||
===========================
|
||||
|
||||
You can follow the manual recovery procedure when:
|
||||
|
||||
* The majority of nodes (for example, 2 out of 3) failed and are irrecoverable.
|
||||
* :ref:`The Raft upgrade procedure <verify-raft-procedure>` got stuck because one
|
||||
of the nodes failed in the middle of the procedure and is irrecoverable. This
|
||||
may occur in existing clusters where Raft was manually enabled.
|
||||
See :ref:`Enabling Raft <enabling-raft-existing-cluster>` for details.
|
||||
|
||||
.. warning::
|
||||
|
||||
Perform the manual recovery procedure **only** if you're dealing with
|
||||
**irrecoverable** nodes. If possible, restart your nodes, and use the manual
|
||||
recovery procedure as a last resort.
|
||||
|
||||
.. note::
|
||||
|
||||
Before proceeding, make sure that the irrecoverable nodes are truly dead, and not,
|
||||
for example, temporarily partitioned away due to a network failure. If it is
|
||||
possible for the 'dead' nodes to come back to life, they might communicate and
|
||||
interfere with the recovery procedure and cause unpredictable problems.
|
||||
|
||||
If you have no means of ensuring that these irrecoverable nodes won't come back
|
||||
to life and communicate with the rest of the cluster, set up firewall rules or otherwise
|
||||
isolate your alive nodes to reject any communication attempts from these dead nodes.
|
||||
|
||||
During the manual recovery procedure you'll enter a special ``RECOVERY`` mode, remove
|
||||
all faulty nodes (using the standard :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`),
|
||||
delete the internal Raft data, and restart the cluster. This will cause the cluster to
|
||||
perform the Raft upgrade procedure again, initializing the Raft algorithm from scratch.
|
||||
|
||||
The manual recovery procedure is applicable both to clusters that were not running Raft
|
||||
in the past and then had Raft enabled, and to clusters that were bootstrapped using Raft.
|
||||
|
||||
.. note::
|
||||
|
||||
Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while
|
||||
some nodes are already dead may lead to unavailability of data queries (assuming that
|
||||
you haven't lost it already). For example, if you're using the standard RF=3,
|
||||
CL=QUORUM setup, and you're recovering from a stuck upgrade procedure because one
|
||||
of your nodes is dead, restarting another node will cause temporary data query
|
||||
unavailability (until the node finishes restarting). Prepare your service for
|
||||
downtime before proceeding.
|
||||
|
||||
#. Perform the following query on **every alive node** in the cluster, using e.g. ``cqlsh``:
|
||||
|
||||
.. code-block:: cql
|
||||
|
||||
cqlsh> UPDATE system.scylla_local SET value = 'recovery' WHERE key = 'group0_upgrade_state';
|
||||
|
||||
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of your alive nodes.
|
||||
|
||||
#. Verify that all the nodes have entered ``RECOVERY`` mode when restarting; look for one of the following messages in their logs:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
group0_client - RECOVERY mode.
|
||||
raft_group0 - setup_group0: Raft RECOVERY mode, skipping group 0 setup.
|
||||
raft_group0_upgrade - RECOVERY mode. Not attempting upgrade.
|
||||
|
||||
#. Remove all your dead nodes using the :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`.
|
||||
|
||||
#. Remove existing Raft cluster data by performing the following queries on **every alive node** in the cluster, using e.g. ``cqlsh``:
|
||||
|
||||
.. code-block:: cql
|
||||
|
||||
cqlsh> TRUNCATE TABLE system.discovery;
|
||||
cqlsh> TRUNCATE TABLE system.group0_history;
|
||||
cqlsh> DELETE value FROM system.scylla_local WHERE key = 'raft_group0_id';
|
||||
|
||||
#. Make sure that schema is synchronized in the cluster by executing :doc:`nodetool describecluster </operating-scylla/nodetool-commands/describecluster>` on each node and verifying that the schema version is the same on all nodes.
|
||||
|
||||
#. We can now leave ``RECOVERY`` mode. On **every alive node**, perform the following query:
|
||||
|
||||
.. code-block:: cql
|
||||
|
||||
cqlsh> DELETE FROM system.scylla_local WHERE key = 'group0_upgrade_state';
|
||||
|
||||
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of your alive nodes.
|
||||
|
||||
#. The Raft upgrade procedure will start anew. :ref:`Verify <verify-raft-procedure>` that it finishes successfully.
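The log check in the procedure above (verifying that every node entered ``RECOVERY`` mode) can be scripted once you have collected the log lines of each node. The following is a minimal sketch: the function names and the ``logs_by_node`` mapping are hypothetical, and how you gather the logs (for example, from your logging system) is left to you. Only the three marker strings come from the procedure itself.

```python
from typing import Dict, Iterable, List

# Messages listed in the procedure; any one of them confirms RECOVERY mode.
RECOVERY_MARKERS = (
    "group0_client - RECOVERY mode.",
    "raft_group0 - setup_group0: Raft RECOVERY mode, skipping group 0 setup.",
    "raft_group0_upgrade - RECOVERY mode. Not attempting upgrade.",
)


def entered_recovery_mode(log_lines: Iterable[str]) -> bool:
    """True if at least one log line contains a RECOVERY marker."""
    return any(
        marker in line for line in log_lines for marker in RECOVERY_MARKERS
    )


def nodes_not_in_recovery(logs_by_node: Dict[str, List[str]]) -> List[str]:
    """Return the (sorted) nodes whose logs show no RECOVERY marker."""
    return sorted(
        node
        for node, lines in logs_by_node.items()
        if not entered_recovery_mode(lines)
    )
```

An empty result from ``nodes_not_in_recovery`` suggests that all inspected nodes restarted in ``RECOVERY`` mode; any node it returns should be checked manually before you continue with node removal.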
|
||||
@@ -8,6 +8,7 @@ Data Modeling
|
||||
Scylla Large Partitions Table </troubleshooting/large-partition-table/>
|
||||
Scylla Large Rows and Cells Table </troubleshooting/large-rows-large-cells-tables/>
|
||||
Large Partitions Hunting </troubleshooting/debugging-large-partition/>
|
||||
Failure to Update the Schema </troubleshooting/failed-update-schema>
|
||||
|
||||
.. raw:: html
|
||||
|
||||
@@ -25,6 +26,8 @@ Data Modeling
|
||||
|
||||
* :doc:`Large Partitions Hunting </troubleshooting/debugging-large-partition/>`
|
||||
|
||||
* :doc:`Failure to Update the Schema </troubleshooting/failed-update-schema>`
|
||||
|
||||
`Data Modeling course <https://university.scylladb.com/courses/data-modeling/>`_ on Scylla University
|
||||
|
||||
.. raw:: html
|
||||
|
||||