Commit Graph

657 Commits

Author SHA1 Message Date
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

Also will be used in tests to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
31c995332c storage_service, raft topology: Run streaming under session topology guard
Prevents stale streaming operations from running beyond the topology
operation they were started in. After the session field is cleared, or
changed to something else, the old topology_guard used by streaming is
interrupted and fenced and the next barrier will join with any
remaining work.
2023-12-06 18:36:17 +01:00
Patryk Jędrzejczak
c8ee7d4499 db: make schema commitlog feature mandatory
Using consistent cluster management and not using schema commitlog
ends with a bad configuration throw during bootstrap. Soon, we
will make consistent cluster management mandatory. This forces us
to also make schema commitlog mandatory, which we do in this patch.

A booting node decides to use schema commitlog if at least one of
the two statements below is true:
- the node has `force_schema_commitlog=true` config,
- the node knows that the cluster supports the `SCHEMA_COMMITLOG`
  cluster feature.
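
The rule above is a logical OR of the two conditions. A minimal Python
sketch of this pre-patch behavior follows; the function and parameter names
are illustrative and are not taken from the Scylla source:

```python
def uses_schema_commitlog(force_schema_commitlog: bool,
                          cluster_supports_schema_commitlog: bool) -> bool:
    """Model of the pre-patch rule: either condition alone is sufficient."""
    return force_schema_commitlog or cluster_supports_schema_commitlog
```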

The `SCHEMA_COMMITLOG` cluster feature has been added in version
5.1. This patch is supposed to be a part of version 6.0. We don't
support a direct upgrade from 5.1 to 6.0 because it skips two
versions - 5.2 and 5.4. So, in a supported upgrade we can assume
that the version which we upgrade from has schema commitlog. This
means that we don't need to check the `SCHEMA_COMMITLOG` feature
during an upgrade.

The reasoning above also applies to Scylla Enterprise. Version
2024.2 will be based on 6.0. Probably, we will only support
an upgrade to 2024.2 from 2024.1, which is based on 5.4. But even
if we support an upgrade from 2023.x, this patch won't break
anything because 2023.1 is based on 5.2, which has schema
commitlog. Upgrades from 2022.x definitely won't be supported.

When we populate a new cluster, we can use the
`force_schema_commitlog=true` config to use schema commitlog
unconditionally. Then, the cluster feature check is irrelevant.
This check could fail because we initiate schema commitlog before
we learn about the features. The `force_schema_commitlog=true`
config is especially useful when we want to use consistent cluster
management. Failing feature checks would lead to crashes during
initial bootstraps. Moreover, there is no point in creating a new
cluster with `consistent_cluster_management=true` and
`force_schema_commitlog=false`. It would just cause some initial
bootstraps to fail, and after successful restarts, the result would
be the same as if we used `force_schema_commitlog=true` from the
start.

In conclusion, we can unconditionally use schema commitlog without
any checks in 6.0 because we can always safely upgrade a cluster
and start a new cluster.

Apart from making schema commitlog mandatory, this patch adds two
changes that are its consequences:
- making the unneeded `force_schema_commitlog` config unused,
- deprecating the `SCHEMA_COMMITLOG` feature, which is always
  assumed to be true.

Closes scylladb/scylladb#16254
2023-12-04 21:02:16 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Gleb Natapov
95dd0e453d storage_service: topology coordinator: add rollback_to_normal node state
When a topology coordinator rolls back from an unsuccessful topology operation, it
advances the fence (which is now in the raft state) after moving to normal
state. We do not want this to fail (only a majority of nodes is needed for
it to succeed), but currently it may fail if the coordinator moves
to another node after changing the rollback node's state to normal, but
before updating the fence. To solve that the rollback operation needs to
go through a new rollback_to_normal state that will do the fencing
before moving to normal. This patch introduces that state, but does not use
it yet.
2023-11-23 15:27:28 +02:00
Gleb Natapov
6edbf4b663 storage_service: topology coordinator: put fence version into the raft state
Currently when the coordinator decides to move the fence it issues an
RPC to each node and each node locally advances fence version. This is
fine if there are no failures or failures are handled by retrying
fencing, but if we want to allow topology changes to progress even in
the presence of barrier failures it is easier to store the fence version
in the raft state. The nodes that missed fence rpc may easily catch up
to the latest fence version by simply executing a raft barrier.
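
The catch-up idea can be modeled roughly as follows. This is only a Python
sketch under stated assumptions: `RaftState`, `Node` and `raft_barrier` are
hypothetical stand-ins, and a real barrier involves much more than copying
a number.

```python
from dataclasses import dataclass


@dataclass
class RaftState:
    """Replicated state; hypothetical stand-in for the raft topology state."""
    fence_version: int = 0


@dataclass
class Node:
    """A node's local view of the fence version."""
    fence_version: int = 0

    def raft_barrier(self, state: RaftState) -> None:
        # Catch up with whatever the replicated state currently holds.
        self.fence_version = max(self.fence_version, state.fence_version)


state = RaftState()
nodes = [Node(), Node(), Node()]

# The coordinator advances the fence once, in the replicated state.
state.fence_version = 5

# Two nodes learn about it right away; the third one misses the update.
for node in nodes[:2]:
    node.raft_barrier(state)

# The lagging node catches up later with a single barrier.
nodes[2].raft_barrier(state)
```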
2023-11-19 15:28:08 +02:00
Kamil Braun
f094e23d84 system_keyspace: use system memory for system.raft table
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).

This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.

This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to use system memory.

In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.

Fixes: scylladb/scylladb#15622

Closes scylladb/scylladb#15972
2023-11-08 11:21:14 +02:00
Botond Dénes
76ab66ca1f Merge 'Support state change for S3-backed sstables' from Pavel Emelyanov
The sstable currently can move between normal, staging and quarantine states at runtime. For S3-backed sstables the state change means maintaining the state itself in the ownership table and updating it accordingly.

There's also the upload facility that's implemented as state change too, but this PR doesn't support this part.

fixes: #13017

Closes scylladb/scylladb#15829

* github.com:scylladb/scylladb:
  test: Make test_sstables_excluding_staging_correctness run over s3 too
  sstables,s3: Support state change (without generation change)
  system_keyspace: Add state field to system.sstables
  sstable_directory: Tune up sstables entries processing comment
  system_keyspace: Tune up status change trace message
  sstables: Add state string to state enum class convert
2023-11-07 10:45:41 +02:00
Kamil Braun
0846d324d7 Merge 'rollback topology operation on streaming failure' from Gleb
This patch series adds error handling for streaming failure during
topology operations instead of an infinite retry. If streaming fails the
operation is rolled back: bootstrap/replace nodes move to left and
decommissioned/remove nodes move back to normal state.
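
The rollback rule described above can be summarized as a small lookup
table. The Python sketch below is illustrative only; the operation and state
names are plain strings chosen for this example, not identifiers from the
code.

```python
# Target state of the affected node when streaming fails, per operation.
ROLLBACK_TARGET = {
    "bootstrap": "left",
    "replace": "left",
    "decommission": "normal",
    "remove": "normal",
}


def rollback_state(operation: str) -> str:
    """Return the state a node is moved to when the operation is rolled back."""
    return ROLLBACK_TARGET[operation]
```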

* 'gleb/streaming-failure-rollback-v4' of github.com:scylladb/scylla-dev:
  raft: make sure that all operation forwarded to a leader are completed before destroying raft server
  storage_service: raft topology: remove code duplication from global_tablet_token_metadata_barrier
  tests: add tests for streaming failure in bootstrap/replace/remove/decomission
  test/pylib: do not stop node if decommission failed with an expected error
  storage_service: raft topology: fix typo in "decommission" everywhere
  storage_service: raft topology: add streaming error injection
  storage_service: raft topology: do not increase topology version during CDC repair
  storage_service: raft topology: rollback topology operation on streaming failure.
  storage_service: raft topology: load request parameters in left_token_ring state as well
  storage_service: raft topology: do not report term_changed_error during global_token_metadata_barrier as an error
  storage_service: raft topology: change global_token_metadata_barrier error handling to try/catch
  storage_service: raft topology: make global_token_metadata_barrier node independent
  storage_service: raft topology: split get_excluded_nodes from exec_global_command
  storage_service: raft topology: drop unused include_local and do_retake parameters from exec_global_command which are always true
  storage_service: raft topology: simplify streaming RPC failure handling
2023-11-02 10:15:45 +01:00
Gleb Natapov
0a8c3e5c78 storage_service: raft topology: load request parameters in left_token_ring state as well
Next patch will want to access request parameters in left_token_ring for
failure recovery purposes.
2023-10-25 12:56:27 +03:00
Piotr Dulikowski
63aa9332aa raft topology: assign tokens after join node response rpc
Currently, when the topology coordinator accepts a node, it moves it to
bootstrap state and assigns tokens to it (either new ones during
bootstrap, or the replaced node's tokens). Only then it contacts the
joining node to tell it about the decision and let it perform a read
barrier.

However, this means that the tokens are inserted too early. After
inserting the tokens the cluster is free to route write requests to it,
but it might not have learned about all of the schema yet.

Fix the issue by inserting the tokens later, after completing the join
node response RPC which forces the receiving node to perform a read
barrier.
2023-10-25 11:50:17 +02:00
Piotr Dulikowski
2d161676c7 raft topology: loosen assumptions about transition nodes having tokens
In later commits, tokens for a joining/replacing node will not be
inserted when the node enters `bootstrapping`/`replacing` state but at
some later step of the procedure. Loosen some of the assumptions in
`storage_service::topology_state_load` and
`system_keyspace::load_topology_state` appropriately.
2023-10-25 11:50:17 +02:00
Pavel Emelyanov
d827068d01 sstables,s3: Support state change (without generation change)
Now when the system.sstables has the state field, it can be changed
(UPDATEd). However, when changing the state AND generation, this still
won't work, because generation is the clustering key of the table in
question and cannot be just changed. This, nonetheless, is OK, as
generation changes with state only when moving an sstable from upload
dir into normal/staging and this is a separate issue for S3 (#13018). For
now changing state only is OK.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 19:12:37 +03:00
Pavel Emelyanov
ca5d3d217f system_keyspace: Add state field to system.sstables
The state is one of <empty>(normal)/staging/quarantine. Currently, when
an sstable is moved to a non-normal state, the s3 backend state_change() call
throws, so such sstables do not appear. Next patches are going to
change that and the new field in the system.sstables is needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 19:12:37 +03:00
Pavel Emelyanov
e4162227ff system_keyspace: Tune up status change trace message
A very similar message tracing the state change will appear soon, so it's
good to be able to tell them apart.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 19:12:37 +03:00
Kefu Chai
b36cef6f1a sstable: remove _remote_prefix from s3_storage
since we use the sstable.generation() for the remote prefix of
the key of the object for storing the sstable component, there is
no need to set remote_prefix beforehand.

since `s3_storage::ensure_remote_prefix()` and
`system_keyspace::sstables_registry_lookup_entry()` are not used
anymore, they are removed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 10:08:22 +08:00
Kefu Chai
af8bc8ba63 sstable: switch to uuid identifier for naming S3 sstable objects
before this change, we create a new UUID for a new sstable managed
by the s3_storage, and we use the string representation of UUID
defined by RFC4122 like "0aa490de-7a85-46e2-8f90-38b8f496d53b" for
naming the objects stored on s3_storage. but this representation is
not what we are using for storing sstables on local filesystem when
the option of "uuid_sstable_identifiers_enabled" is enabled. instead,
we are using a base36-based representation which is shorter.

to be consistent with the naming of the sstables created for local
filesystem, and more importantly, to simplify the interaction between
the local copy of sstables and those stored on object storage, we should
use the same string representation of the sstable identifier.

so, in this change:

1. instead of creating a new UUID, just reuse the generation of the
   sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As
   we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined
   by the RFC4122 (implemented by fmt::formatter<utils::UUID>) to the
   base36-based one (implemented by
   fmt::formatter<sstables::generation_type>)
4. enable the `uuid_sstable_identifers` cluster feature if it is
   enabled in the `test_env_config`, so that the sstable manager
   can use uuid-based identifiers when creating a new identifier
   for an sstable.
5. throw if the generation of sstable is not UUID-based when
   accessing / manipulating an sstable with S3 storage backend, as
   the S3 storage backend now relies on this option. otherwise
   we'd have sstables with keys like s3://bucket/number/basename, which
   cannot serve as a unique id for an sstable if the bucket is
   shared across multiple tables.

Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 10:08:22 +08:00
Kamil Braun
c1486fee40 Merge 'commitlog: drop truncation_records after replay' from Petr Gusev
This is a follow-up for #15279 and it fixes two problems.

First, we restore flushes on writes for the tables that were switched to the schema commitlog if `SCHEMA_COMMITLOG` feature is not yet enabled. Otherwise durability is not guaranteed.

Second, we address the problem with truncation records, which could refer to the old commitlog if any of the switched tables were truncated in the past. If the node crashes later, and we replay schema commitlog, we may skip some mutations since their `replay_position`s will be smaller than the `replay_position`s stored for the old commitlog in the `truncated` table.

It turned out that this problem exists even if we don't switch commitlogs for tables. If the node was rebooted the segment ids will start from some small number - they use `steady_clock` which is usually bound to boot time. This means that if the node crashed we may skip the mutations because their RPs will be smaller than the last truncation record RP.

To address this problem we delete truncation records as soon as commitlog is replayed. We also include a test which demonstrates the problem.
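
The failure mode can be illustrated with a small Python model. It assumes
that a replay position is a (segment id, offset) pair compared
lexicographically and that replay skips mutations positioned before the
truncation record; `ReplayPosition` and `replay` are hypothetical names for
this sketch, not the actual `commitlog_replayer` API.

```python
from typing import List, NamedTuple, Optional


class ReplayPosition(NamedTuple):
    segment_id: int
    offset: int


def replay(mutation_rps: List[ReplayPosition],
           truncation_rp: Optional[ReplayPosition]) -> List[ReplayPosition]:
    """Return the replay positions that survive the truncation filter."""
    if truncation_rp is None:
        return list(mutation_rps)
    # Mutations positioned before the truncation record are skipped.
    return [rp for rp in mutation_rps if rp >= truncation_rp]


# Truncation record written before a reboot, when segment ids were large.
stale_truncation = ReplayPosition(segment_id=1000, offset=0)
# Mutations written after the reboot, when segment ids restarted from low values.
fresh_mutations = [ReplayPosition(3, 10), ReplayPosition(3, 20)]

lost = replay(fresh_mutations, stale_truncation)  # stale record hides new data
kept = replay(fresh_mutations, None)              # record dropped after replay
```

With the stale record in place every fresh mutation is filtered out; once
the record is gone they are all replayed.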

Fixes #15354

Closes scylladb/scylladb#15532

* github.com:scylladb/scylladb:
  add test_commitlog
  system.truncated: Remove replay_position data from truncated on start
  main.cc: flush only local memtables when replaying schema commitlog
  main.cc: drop redundant supervisor::notify
  system_keyspace: flush if schema commitlog is not available
2023-10-18 11:14:31 +02:00
Botond Dénes
7f81957437 Merge 'Initialize datadir for system and non-system keyspaces the same way' from Pavel Emelyanov
When populating system keyspace the sstable_directory forgets to create upload/ subdir in the tables' datadir because of the way it's invoked from distributed loader. For non-system keyspaces directories are created in table::init_storage() which is self-contained and just creates the whole layout regardless of what.

This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables.

Test included.

fixes: #15708
closes: scylladb/scylla-manager#3603

Closes scylladb/scylladb#15723

* github.com:scylladb/scylladb:
  test: Add test for datadir/ layout
  sstable_directory: Indentation fix after previous patch
  db,sstables: Move storage init for system keyspace to table creation
2023-10-18 12:12:19 +03:00
Calle Wilund
6fbd210679 system.truncated: Remove replay_position data from truncated on start
Once we've started clean, and all replaying is done, the commitlog
replay positions stored in truncation records are invalid. We should exorcise
them as soon as possible. Note that we cannot remove truncation data
completely though, since the time stamps stored are used by things like
batch log to determine if it should use or discard old batch data.
2023-10-17 18:16:48 +04:00
Petr Gusev
c89ead55ff system_keyspace: flush if schema commitlog is not available
In PR #15279 we removed flushes when writing to a number
of tables from the system keyspace. This was made possible
by switching these tables to the schema commitlog.
Schema commitlog is enabled only when the SCHEMA_COMMITLOG
feature is supported by all nodes in the cluster. Before that
these tables will use the regular commitlog, which is not
durable because it uses db::commitlog::sync_mode::PERIODIC. This
means that we may lose data if a node crashes during upgrade
to the version with schema commitlog.

In this commit we fix this problem by restoring flushes
after writes to the tables if the schema commitlog
is not enabled yet.

The patch also contains a test that demonstrates the
problem. We need flush_schema_tables_after_modification
option since otherwise schema changes are not durable
and node fails after restart.
2023-10-17 18:14:27 +04:00
Tomasz Grabiec
0aef0f900b Merge 'truncation records refactorings' from Petr Gusev
This PR contains several refactorings related to truncation records handling in `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`, it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`

Closes scylladb/scylladb#15583

* github.com:scylladb/scylladb:
  system_keyspace: drop truncation_record
  system_keyspace: remove get_truncated_at method
  table: get_truncation_time: check _truncated_at is initialized
  database: add_column_family: initialize truncation_time for new tables
  database: add_column_family: rename readonly parameter to is_new
  system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
  commitlog_replayer: refactor commitlog_replayer::impl::init
  system_keyspace: drop redundant typedef
  system_keyspace: drop redundant save_truncation_record overload
  table: rename cache_truncation_record -> set_truncation_time
  system_keyspace: get_truncated_position -> get_truncated_positions
2023-10-17 10:55:30 +02:00
Pavel Emelyanov
059d7c795e db,sstables: Move storage init for system keyspace to table creation
User and system keyspaces are created and populated slightly
differently.

System keyspace is created via system_keyspace::make() which eventually
calls add_column_family(). Then it's populated via
init_system_keyspace() which calls sstable_directory::prepare() which,
in turn, optionally creates directories in datadir/ or checks the
directory permissions if it exists.

User keyspaces are created with the help of
add_column_family_and_make_directory() call which calls the
add_column_family() mentioned above _and_ calls table::init_storage() to
create directories. When it's populated with init_non_system_keyspaces()
it also calls sstable_directory::prepare() which notices that the
directory exists and then checks the permissions.

As a result, sstable_directory::prepare() initializes storage for system
keyspace only and there's a BUG (#15708) that the upload/ subdir is not
created.

This patch makes directory creation for _all_ keyspaces go through
table::init_storage(). The change only touches system keyspace by moving
the creation of directories from sstable_directory::prepare() into
system_keyspace::make().

Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-16 16:19:25 +03:00
Avi Kivity
35849fc901 Revert "Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun"
This reverts commit 3d4398d1b2, reversing
changes made to 45dfce6632. The commit
causes some schema changes to be lost due to incorrect timestamps
in some mutations. More information is available in [1].

Reopens: scylladb/scylladb#7620
Reopens: scylladb/scylladb#13957

Fixes scylladb/scylladb#15530.

[1] https://github.com/scylladb/scylladb/pull/15687
2023-10-11 00:32:05 +03:00
Petr Gusev
a6087a10bd system_keyspace: drop truncation_record
This is a refactoring commit without observable
changes in behaviour.

The only usage was in get_truncation_records
method which can be inlined.
2023-10-05 15:19:59 +04:00
Petr Gusev
9d350e7532 system_keyspace: remove get_truncated_at method
The only usage is in batchlog_manager, and it
can be replaced with cf.get_truncation_time().

std::optional<std::reference_wrapper<canonical_mutation>>
is replaced with canonical_mutation* since it is
semantically the same but with less type boilerplate.
2023-10-05 15:19:59 +04:00
Petr Gusev
b70bca71bc system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
load_truncation_times() now works only for
schema tables since the rest is not loaded
until distributed_loader::init_non_system_keyspaces.
An attempt to call cf.set_truncation_time
for non-system table just throws an exception,
which is caught and logged with debug level.
This means that the call cf.get_truncation_time in
paxos_state.cc has never worked as expected.

To fix that we move load_truncation_times()
closer to the point where the tables are loaded.
The function distributed_loader::populate_keyspace is
called for both system and non-system tables. Once
the tables are loaded, we use the 'truncated' table
to initialize _truncated_at field for them.

The truncation_time check for schema tables is also moved
into populate_keyspace since it seems like a more natural
place for it.
2023-10-05 15:19:52 +04:00
Pavel Emelyanov
96651e0ddb sstables: Do not keep directory, keyspace and table names on descriptor
Now no code uses those strings. Even worse -- there are some places that
need to provide some strings but don't have real values at hand, so just
hard-code the empty strings there (because they are really not used).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 12:21:01 +03:00
Petr Gusev
f7d2300cf9 system_keyspace: drop redundant save_truncation_record overload
2023-10-03 17:11:40 +04:00
Petr Gusev
da1e6751e9 table: rename cache_truncation_record -> set_truncation_time
This is a refactoring commit without observable
changes in behaviour.

There is a truncation_record struct, but in this method we
only care about time, so rename it (and other related methods)
appropriately to avoid confusion.
2023-10-03 17:11:35 +04:00
Petr Gusev
1b2e0d0cc9 system_keyspace: get_truncated_position -> get_truncated_positions
This method can return many replay_positions, so
the plural form is more appropriate.
2023-09-28 12:25:40 +04:00
Piotr Dulikowski
caf1d4938e topology_state_machine: add supported_features to replica_state
The `service::topology_features` struct was introduced in #14955. Its
purpose was to make it possible to load cluster features from
`system.topology` before schema commitlog replay. It contains a map from
host ID to supported feature set for every normal node.

In order not to duplicate logic for loading features,
the `service::topology`'s `replica_state`s do not hold a set of
supported features and users are supposed to refer to the features
in `topology_features`, which is a field in the `topology` struct.
However, accessing features is quite awkward now.

This commit adds `supported_features` field back to the `replica_state`
struct and the `load_topology_state` function initializes them properly.
The logic duplication needed to initialize them is quite small and the
drawbacks that come with it are outweighed by the fact that we now can
refer to node's supported features in a more natural way.

The `topology_features` struct is no longer a field of `topology`, but
it still exists for the purpose of the feature check that happens before
commitlog replay.
2023-09-26 15:56:52 +02:00
Tomasz Grabiec
3d4398d1b2 Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620).

If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).

When performing a schema change in Raft RECOVERY mode we also extend schema mutations which forces nodes to revert to the old way of calculating schema versions when necessary.

We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.

Fixes: #7620
Fixes: #13957

Closes scylladb/scylladb#15331

* github.com:scylladb/scylladb:
  test: add test for group 0 schema versioning
  test/pylib: log_browsing: fix type hint
  feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
  schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
  migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
  schema_tables: use schema version from group 0 if present
  migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
  migration_manager: migration_request handler: assume `canonical_mutation` support
  system_keyspace: make `get/set_scylla_local_param` public
  feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
  schema_tables: refactor `scylla_tables(schema_features)`
  migration_manager: add `std::move` to avoid a copy
  schema_tables: remove default value for `reload` in `merge_schema`
  schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
  system_keyspace: fix outdated comment
2023-09-20 10:43:40 +02:00
Michael Huang
62a8a31be7 cdc: use chunked_vector for topology_description entries
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.

Closes scylladb/scylladb#15428
2023-09-18 23:17:01 +03:00
Kamil Braun
bc6f7d1b20 Merge 'raft topology: add garbage collection for internal CDC generations table' from Patryk Jędrzejczak
We add garbage collection for the `CDC_GENERATIONS_V3` table to prevent
it from endlessly growing. This mechanism is especially needed because
we send the entire contents of `CDC_GENERATIONS_V3` as a part of the
group 0 snapshot.

The solution is to keep a clean-up candidate, which is one of the
already published CDC generations. The CDC generation publisher
introduced in #15281 continually uses this candidate to remove all
generations with timestamps not exceeding the candidate's and sets a new
candidate when needed.
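
The removal step can be sketched in Python as below. The sketch assumes
generations are keyed by an integer timestamp and ignores when the publisher
decides to run the clean-up; `remove_obsolete_generations` is a hypothetical
name used only here.

```python
from typing import Dict


def remove_obsolete_generations(generations: Dict[int, str],
                                candidate_ts: int) -> Dict[int, str]:
    """Keep only generations whose timestamp exceeds the candidate's."""
    return {ts: data for ts, data in generations.items() if ts > candidate_ts}


generations = {10: "gen-a", 20: "gen-b", 30: "gen-c"}
# With the generation at timestamp 20 as the candidate, 10 and 20 are removed.
remaining = remove_obsolete_generations(generations, candidate_ts=20)
```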

We also add `test_cdc_generation_clearing.py` that verifies this new
mechanism.

Fixes #15323

Closes scylladb/scylladb#15413

* github.com:scylladb/scylladb:
  test: add test_cdc_generation_clearing
  raft topology: remove obsolete CDC generations
  raft topology: set CDC generation clean-up candidate
  topology_coordinator: refactor publish_oldest_cdc_generation
  system_keyspace: introduce decode_cdc_generation_id
  system_keyspace: add cleanup_candidate to CDC_GENERATIONS_V3
2023-09-18 11:30:10 +02:00
Kamil Braun
7ab7588d59 migration_manager: store group0_schema_version in scylla_local during schema changes
We extend schema mutations with an additional mutation to the
`system.scylla_local` table which:
- in Raft mode, stores a UUID under the `group0_schema_version` key.
- outside Raft mode, stores a tombstone under that key.

As we will see in later commits, nodes will use this after applying
schema mutations. If the key is absent or has a tombstone, they'll
calculate the global schema digest on their own -- using the old way. If
the key is present, they'll take the schema version from there.
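
A rough Python model of this selection logic is shown below. It represents
`system.scylla_local` as a dictionary and a tombstone as `None`; both are
simplifications made for this sketch, and `pick_schema_version` is a
hypothetical name.

```python
import uuid
from typing import Callable, Dict, Optional

# In this model, a value of None stands in for a tombstone.
ScyllaLocal = Dict[str, Optional[uuid.UUID]]


def pick_schema_version(scylla_local: ScyllaLocal,
                        compute_digest: Callable[[], uuid.UUID]) -> uuid.UUID:
    """Prefer the persisted group 0 version; otherwise compute the digest."""
    version = scylla_local.get("group0_schema_version")
    if version is None:
        # Key absent or tombstoned: fall back to the old digest-based way.
        return compute_digest()
    return version
```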

The Raft-mode schema version is equal to the group 0 state ID of this
schema command.

The tombstone is necessary for the case of performing a schema change in
RECOVERY mode. It will force a revert to the old digest-based way.

Note that extending schema mutations with a `system.scylla_local`
mutation is possible thanks to earlier commits which moved
`system.scylla_local` to schema commitlog, so all mutations in the
schema mutations vector still go to the same commitlog domain.
2023-09-15 14:32:45 +02:00
Patryk Jędrzejczak
e375e769b9 raft topology: set CDC generation clean-up candidate
We want to use the clean-up candidates to remove the obsolete CDC
generation data, but first, we need to set a suitable generation as
the candidate when there is no candidate. Since CDC generations must
be published before we remove them, a generation that is being
published is a good candidate.
2023-09-15 09:23:59 +02:00
Patryk Jędrzejczak
c0fd42ead4 system_keyspace: introduce decode_cdc_generation_id
The decode_cdc_generations_ids function allows us to decode
a vector of CDC generation IDs. After adding cleanup_candidate
to CDC_GENERATIONS_V3, we need a similar function that decodes
a single ID.
2023-09-14 12:09:14 +02:00
Patryk Jędrzejczak
6db325fb69 system_keyspace: add cleanup_candidate to CDC_GENERATIONS_V3
In the following commits, we implement garbage collection for
CDC_GENERATIONS_V3. The first step is introducing the clean-up
candidate. It will be continually updated by the CDC generation
publisher and used to remove obsolete data.
2023-09-14 12:09:10 +02:00
Petr Gusev
082cd3bc8e system_keyspace: switch CDC_LOCAL to schema commitlog
2023-09-13 23:17:20 +04:00
Petr Gusev
a683cebb02 system_keyspace: scylla_local: use schema commitlog
We remove flush from set_scylla_local_param_as
since it's now redundant. We add it to
save_local_enabled_features as features need to
be available before schema commitlog replay.

We skip the flush if save_local_enabled_features
is called from topology_state_load when the features
are migrated to system.topology and we don't need
strict durability.
2023-09-13 23:17:20 +04:00
Petr Gusev
beb29f094b system_keyspace: drop load phases
We want to switch system.scylla_local table to the
schema commitlog, but load phases get in the way: the schema
commitlog is initialized after phase1,
so a table which is using it should be moved to phase2,
but system.scylla_local contains features, and we need
them before schema commitlog initialization for the
SCHEMA_COMMITLOG feature.

In this commit we are taking a different approach to
loading system tables. First, we load them all in
one pass in 'readonly' mode. In this mode, the table
cannot be written to and has not yet been assigned
a commit log. To achieve this we've added a _readonly bool field
to the table class; it's initialized to true in the table's
constructor. In addition, we changed the table constructor
to always assign nullptr to commitlog, and we trigger
an internal error if the table.commitlog() property is accessed
while the table is in readonly mode. Then, after
triggering on_system_tables_loaded notifications on
feature_service and sstable_format_selector, we call
system_keyspace::mark_writable and eventually
table::mark_ready_for_writes which selects the
proper commitlog and marks the table as writable.

In sstable_compaction_test we drop several
mark_ready_for_writes calls since they are redundant:
the table has already been made writable in the
env.make_table_for_tests call.

The table::commitlog function either returns the current
commitlog or causes an error if the table is readonly. This
didn't work for virtual tables, since they never called
mark_ready_for_writes. In this commit we add this
call to initialize_virtual_tables.
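
The readonly mechanism described above can be sketched roughly as follows. This is a simplified model, not the actual Scylla code: table_sketch and commitlog_stub are hypothetical stand-ins, and std::logic_error is used here in place of the internal error mentioned in the message.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a commitlog; not Scylla's real type.
struct commitlog_stub {};

// Simplified model of the table behaviour described in the commit message.
class table_sketch {
    bool _readonly = true;                 // every table starts in readonly mode
    commitlog_stub* _commitlog = nullptr;  // the constructor never assigns a commitlog
public:
    bool readonly() const { return _readonly; }

    // Accessing the commitlog while the table is readonly is an error.
    // (Assumption: modelled with an exception instead of an internal error.)
    commitlog_stub* commitlog() const {
        if (_readonly) {
            throw std::logic_error("commitlog() accessed while table is readonly");
        }
        return _commitlog;
    }

    // Selects the commitlog (which may be nullptr) and makes the table writable.
    void mark_ready_for_writes(commitlog_stub* cl) {
        _commitlog = cl;
        _readonly = false;
    }
};
```

In this model a virtual table would call mark_ready_for_writes(nullptr), after which commitlog() returns nullptr instead of raising an error.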
2023-09-13 23:17:20 +04:00
Petr Gusev
0e5f9ae9a4 system_keyspace: switch system.peers to schema commitlog
Also, we remove flushes on writes as durability
is now guaranteed by the commitlog.
2023-09-13 23:17:20 +04:00
Petr Gusev
7881ce1e09 system_keyspace: switch system.local to schema commitlog
The schema commitlog lives only on shard zero,
so we need to turn on the use_null_sharder option.

Also, we remove flushes on writes as durability
is now guaranteed by the commitlog.
2023-09-13 23:17:20 +04:00
Petr Gusev
2a0b228d17 main.cc: inline and split system_keyspace.setup
Our goal is to switch system.local table to schema
commitlog and stop doing flushes when we write to it.
This means it would be incorrect to read from this
table until schema commitlog is replayed.

On the other hand, we need truncation records
to be loaded before we start replaying schema
commitlog, since commitlog_replayer relies on them.

In this commit we inline the system_keyspace::setup
function and split its content into two parts. In
the first part, before schema commitlog replay,
we load truncation records. It's safe to load
them before schema commitlog replay since we intend
to keep the flushes on writes to the system.truncated
table. In the second part, after schema commitlog replay,
we do the rest of the job - build_bootstrap_info and
db::schema_tables::save_system_schema.

We decided to inline this function since there is
very low cohesion between the actions it's performing.
It's just simpler to reason about them individually.
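
The resulting startup order can be sketched as below. This is only an illustration of the ordering stated above: the step functions are hypothetical placeholders that record their names, and the real operations are asynchronous and take arguments omitted here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order in which the startup steps run.
using step_log = std::vector<std::string>;

// Hypothetical placeholders for the real steps; each one only logs its name.
void load_truncation_records_step(step_log& log) { log.push_back("load_truncation_records"); }
void replay_schema_commitlog_step(step_log& log) { log.push_back("replay_schema_commitlog"); }
void build_bootstrap_info_step(step_log& log)    { log.push_back("build_bootstrap_info"); }
void save_system_schema_step(step_log& log)      { log.push_back("save_system_schema"); }

// Runs the steps in the order described in the commit message.
step_log startup_sketch() {
    step_log log;
    // Part one: the replayer relies on truncation records, so load them first.
    load_truncation_records_step(log);
    replay_schema_commitlog_step(log);
    // Part two: only now is it safe to read tables backed by the schema commitlog.
    build_bootstrap_info_step(log);
    save_system_schema_step(log);
    return log;
}
```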
2023-09-13 23:00:15 +04:00
Petr Gusev
f0bc9f2d93 system_keyspace: refactor save_system_schema function
This is a refactoring commit without observable changes
in behaviour.

Previously, there were two related functions in db::schema_tables:
save_system_keyspace_schema(qp) and save_system_schema(qp, ks).
The first called the second passing "system_schema" as
the second argument. Outside of the schema_tables module we
don't need two functions; we just need a way to say
'persist system schema objects in the appropriate tables/keyspaces'.
In this commit we change the function save_system_schema
to have this meaning. Internally it calls save_system_schema_to_keyspace
twice with "system_schema" and "system", since that's what we need
in the single call site of this function in system_keyspace::setup.
In subsequent commits we are going to move this call out of the
system_keyspace::setup.
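
The shape of the refactored function can be sketched like this. It is a simplified, synchronous model: the real functions take a query processor and are asynchronous, and the schema_recorder type is a hypothetical helper added only to make the calls observable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper that records which keyspaces were persisted.
struct schema_recorder {
    std::vector<std::string> keyspaces;
};

// Internal helper: persists system schema objects into one keyspace.
// (Assumption: here it only records the keyspace name.)
void save_system_schema_to_keyspace(schema_recorder& rec, const std::string& ks) {
    rec.keyspaces.push_back(ks);
}

// Single public entry point: persists system schema objects in the
// appropriate keyspaces, as described in the commit message.
void save_system_schema(schema_recorder& rec) {
    save_system_schema_to_keyspace(rec, "system_schema");
    save_system_schema_to_keyspace(rec, "system");
}
```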
2023-09-13 23:00:15 +04:00
Petr Gusev
e395086557 system_keyspace: move initialize_virtual_tables into virtual_tables.hh
This is a readability refactoring commit without observable changes
in behaviour.

initialize_virtual_tables logically belongs to the virtual_tables module,
and moving it there allows us to make other functions in virtual_tables.cc
(register_virtual_tables, install_virtual_readers)
local to the module, which simplifies matters a bit.

all_virtual_tables() is not needed anymore: all the references to
registered virtual tables are now local to the virtual_tables module
and can just use the virtual_tables variable directly.
2023-09-13 23:00:15 +04:00
Petr Gusev
c4787a160b system_keyspace: remove unused parameter 2023-09-13 23:00:15 +04:00
Petr Gusev
a03fbc3781 system_keyspace: set null sharder when configuring schema commitlog
The schema commitlog lives only on shard zero, so it
makes no sense to set use_schema_commitlog
without use_null_sharder.

We also extract the function enable_schema_commitlog which
sets all the needed properties.
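
A minimal sketch of the idea, assuming a plain struct of per-table properties: table_props_sketch is a hypothetical name, and only the two flags named in this message are modelled.

```cpp
#include <cassert>

// Hypothetical model of per-table static properties.
struct table_props_sketch {
    bool use_schema_commitlog = false;
    bool use_null_sharder = false;

    // Sets both flags together so that use_schema_commitlog is never
    // enabled without use_null_sharder.
    void enable_schema_commitlog() {
        use_schema_commitlog = true;
        use_null_sharder = true;
    }
};
```

Routing every caller through one function keeps the two flags from drifting apart.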
2023-09-13 23:00:15 +04:00
Petr Gusev
d32191a353 system_keyspace: rename static variables
The name 'raft_tables' in the set_use_schema_commitlog
initialization was misleading. Other variables have
also been renamed for consistency.
2023-09-13 23:00:15 +04:00