scylla

Author	SHA1	Message	Date
Gleb Natapov	a429018a8a	migration_manager: add wait_for_schema_agreement() function Several subsystems re-implement the same logic for waiting for schema agreement. Provide the function in the migration_manager and use it instead.	2023-05-25 14:44:53 +03:00
Pavel Emelyanov	5aea6938ae	commitlog: Introduce and use commitlog sched group Nowadays all commitlog code runs in whatever sched group it's kicked from. Since IO prio classes are going to be inherited from the current sched group the commitlog IO loops should be moved into commitlog sched group, not inherit a "random" one. There are currently two places that need correct context for IO -- the .cycle() method and segments replenisher. `$ perf-simple-query --write -c2` results --- Before the patch --- 194898.36 tps ( 56.3 allocs/op, 12.7 tasks/op, 54307 insns/op, 0 errors) 199286.23 tps ( 56.2 allocs/op, 12.7 tasks/op, 54375 insns/op, 0 errors) 199815.84 tps ( 56.2 allocs/op, 12.7 tasks/op, 54377 insns/op, 0 errors) 198260.98 tps ( 56.3 allocs/op, 12.7 tasks/op, 54380 insns/op, 0 errors) 198572.86 tps ( 56.2 allocs/op, 12.7 tasks/op, 54371 insns/op, 0 errors) median 198572.86 tps ( 56.2 allocs/op, 12.7 tasks/op, 54371 insns/op, 0 errors) median absolute deviation: 713.36 maximum: 199815.84 minimum: 194898.36 --- After the patch --- 194751.80 tps ( 56.3 allocs/op, 12.7 tasks/op, 54331 insns/op, 0 errors) 199084.70 tps ( 56.2 allocs/op, 12.7 tasks/op, 54389 insns/op, 0 errors) 195551.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54385 insns/op, 0 errors) 197953.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54386 insns/op, 0 errors) 198710.00 tps ( 56.3 allocs/op, 12.7 tasks/op, 54387 insns/op, 0 errors) median 197953.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54386 insns/op, 0 errors) median absolute deviation: 1131.24 maximum: 199084.70 minimum: 194751.80 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14005	2023-05-23 21:25:57 +03:00
Tomasz Grabiec	809ddd7f79	Merge 'Move pending_ranges and endpoints_for_reading from token_metadata to erm' from Gusev Petr This refactoring is a follow-up for https://github.com/scylladb/scylladb/pull/13376, move per keyspace data structures related to topology changes from `token_metadata` to `erm`. We move `pending_endpoints` and `read_endpoints`, along with their computation logic, from `token_metadata` to `vnode_effective_replication_map`. The `vnode_effective_replication_map` seems more appropriate for them since it contains functionally similar `replication_map` and we will be able to reuse `pending_endpoints/read_endpoints` across keyspaces sharing the same `factory_key`. At present, `pending_endpoints` and `read_endpoints` are updated in the `update_pending_ranges` function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the `migration_info` for specific keyspaces. In this PR we introduce a new `topology_change_info` structure to hold the first part's data and create an `update_topology_change_info` function to update it. This structure will be used in `vnode_effective_replication_map` to compute `pending_endpoints` and `read_endpoints`. This enables the reuse of `topology_change_info` across all keyspaces, unlike the current `update_pending_ranges` implementation, which is another benefit of this refactoring. The PR also optimises `replication_map` memory usage for the case `natural_endpoints_depend_on_token == false`. We store endpoints list only once with special key instead of duplicating them for each `vnode` token. The original `update_pending_ranges` remains unchanged during the PR commits, and will be removed entirely upon transitioning to the new implementation. 
Closes #13715 * github.com:scylladb/scylladb: token_metadata_test: add a test for everywhere strategy token_metadata_test: check read_endpoints when bootstrapping first node token_metadata_test: refactor tests, extract create_erm token_metadata: drop has_pending_ranges and migration_info effective_replication_map: add has_pending_ranges token_metadata: drop update_pending_ranges effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading calculate_effective_replication_map: compute pending_endpoints and read_endpoints vnode_erm: optimize replication_map vnode_erm::get_range_addresses: use sorted_tokens abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token sequenced_set: add extract_vector method effective_replication_map: clone_endpoints_gently -> clone_data_gently vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints stall_free.hh: add clear_gently for rvalues stall_free.hh: relax Container requirement token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map token_metadata: introduce topology_change_info token_metadata: replace set_topology_transition_state with set_read_new	2023-05-22 21:37:06 +02:00
Tomasz Grabiec	9d4bca26cc	Merge 'raft topology: implement `check_and_repair_cdc_streams` API' from Kamil Braun `check_and_repair_cdc_streams` is an existing API which you can use when the current CDC generation is suboptimal, e.g. after you decommissioned a node the current generation has more stream IDs than you need. In that case you can do `nodetool checkAndRepairCdcStreams` to create a new generation with fewer streams. It also works when you change number of shards on some node. We don't automatically introduce a new generation in that case but you can use `checkAndRepairCdcStreams` to create a new generation with restored shard-colocation. This PR implements the API on top of raft topology, it was originally implemented using gossiper. It uses the `commit_cdc_generation` topology transition state and a new `publish_cdc_generation` state to create new CDC generations in a cluster without any nodes changing their `node_state`s in the process. Closes #13683 * github.com:scylladb/scylladb: docs: update topology-over-raft.md test: topology_experimental_raft: test `check_and_repair_cdc` API raft topology: implement `check_and_repair_cdc_streams` API raft topology: implement global request handling raft topology: introduce `prepare_new_cdc_generation_data` raft_topology: `get_node_to_work_on_opt`: return guard if no node found raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition raft topology: separate `publish_cdc_generation` state raft topology: non-node-specific `exec_global_command` raft topology: introduce `start_operation()` raft topology: non-node-specific `topology_mutation_builder` topology_state_machine: introduce `global_topology_request` topology_state_machine: use `uint16_t` for `enum_class`es raft topology: make `new_cdc_generation_data_uuid` topology-global	2023-05-22 11:33:58 +02:00
Petr Gusev	87307781c4	effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading We already use the new pending_endpoints from erm through the get_pending_ranges virtual function. In this commit we update all the remaining places to use the new implementation in erm and remove the old implementation in token_metadata.	2023-05-21 13:17:42 +04:00
Kamil Braun	13df85ea11	Merge 'Cut feature_service -> system_keyspace dependency' from Pavel Emelyanov This implicit link is pretty bad, because feature service is a low-level one which lots of other services depend on. System keyspace is the opposite -- a high-level one that needs e.g. query processor and database to operate. This inverse dependency is created by the feature service's need to commit enabled features' names into system keyspace on cluster join. And it uses the qctx thing for that in a best-effort manner (not doing anything if it's null). The dependency can be cut. The only place where enabled features are committed is when gossiper enables features on join or by receiving state changes from other nodes. By that time the sharded<system_keyspace> is up and running and can be used. Although gossiper already has a system keyspace dependency, it's better not to overload it with the need to mess with enabling and persisting features. Instead, the feature_enabler instance is equipped with needed dependencies and takes care of it. Eventually the enabler is also moved to feature_service.cc where it naturally belongs. 
Fixes: #13837 Closes #13172 * github.com:scylladb/scylladb: gossiper: Remove features and sysks from gossiper system_keyspace: De-static save_local_supported_features() system_keyspace: De-static load_\|save_local_enabled_features() system_keyspace: Move enable_features_on_startup to feature_service (cont) system_keyspace: Move enable_features_on_startup to feature_service feature_service: Open-code persist_enabled_feature_info() into enabler gms: Move feature enabler to feature_service.cc gms: Move gossiper::enable_features() to feature_service::enable_features_on_join() gms: Persist features explicitly in features enabler feature_service: Make persist_enabled_feature_info() return a future system_keyspace: De-static load_peer_features() gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features() gossiper: Enable features and register enabler from outside gms: Add feature_service and system_keyspace to feature_enabler	2023-05-18 18:21:06 +02:00
Pavel Emelyanov	5216dcb1b3	Merge 'db/system_keyspace: remove the dependency on storage_proxy' from Botond Dénes The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`. This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups. This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR. Refs: #11870 Closes #13869 * github.com:scylladb/scylladb: db/system_keyspace: remove dependency on storage_proxy db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent replica: add query.hh	2023-05-18 10:53:27 +03:00
Pavel Emelyanov	29fffaa160	schema_tables: Use sharded<database>& variable The auto& db = proxy.local().get_db() is called a few lines above this patch, so the &db can be reused for the invoke_on_all() call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13896	2023-05-16 12:57:47 +03:00
Botond Dénes	0cff0ffa08	Merge 'alternator,config: make alternator_timeout_in_ms live-updateable' from Kefu Chai before this change, alternator_timeout_in_ms is not live-updateable, as after setting executor's default timeout right before creating sharded executor instances, they never get updated with this option anymore. but many users would like to set the driver timers based on server timers. we need to enable them to configure timeout even when the server is still running. in this change, * `alternator_timeout_in_ms` is marked as live-updateable * `executor::_s_default_timeout` is changed to a thread_local variable, so it can be updated by a per-shard updateable_value. and it is now an updateable_value, so its variable name is updated accordingly. this value is set in the ctor of executor, and it is disconnected from the corresponding named_value<> option in the dtor of executor. * alternator_timeout_in_ms is passed to the constructor of executor via sharded_parameter, so `executor::_timeout_in_ms` can be initialized on a per-shard basis * `executor::set_default_timeout()` is dropped, as we already pass the option to executor in its ctor. Fixes #12232 Closes #13300 * github.com:scylladb/scylladb: alternator: split the param list of executor ctor into multi lines alternator,config: make alternator_timeout_in_ms live-updateable	2023-05-15 10:16:29 +03:00
Piotr Dulikowski	760651b4ad	error injection: allow enabling injections via config Currently, error injections can be enabled either through HTTP or CQL. While these mechanisms are effective for injecting errors after a node has already started, they can't be reliably used to trigger failures shortly after node start. In order to support this use case, this commit adds the possibility to enable some error injections via config. A configuration option `error_injections_at_startup` is added. This option uses our existing configuration framework, so it is possible to supply it either via CLI or in the YAML configuration file. - When passed on the command line, the option is parsed as a semicolon-separated list of error injection names that should be enabled. Those error injections are enabled in non-oneshot mode. The CLI option is marked as not used in release mode and does not appear in the option list. Example: --error-injections-at-startup failure_point1;failure_point2 - When provided in YAML config, the option is parsed as a list of items. Each item is either a string or a map of parameters. This method is more flexible as it allows providing parameters for each injection point. At this time, the only benefit is that it allows enabling points in oneshot mode, but more parameters can be added in the future if needed. Explanatory example: error_injections_at_startup: - failure_point1 # enabled in non-oneshot mode - name: failure_point2 # enabled in oneshot mode one_shot: true # due to one_shot optional parameter The primary goal of this feature is to facilitate testing of raft-based cluster features. An error injection will be used to enable an additional feature to simulate node upgrade. Tests: manual Closes #13861	2023-05-15 09:14:07 +03:00
Botond Dénes	157fdb2f6d	db/system_keyspace: remove dependency on storage_proxy The methods that take storage_proxy as argument can now accept a replica::database instead. So update their signatures and update all callers. With that, system_keyspace.* no longer depends on storage_proxy directly.	2023-05-12 07:27:55 -04:00
Botond Dénes	f4f757af23	db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent Use the recently introduced replica side query utility functions to query the content of the system tables. This allows us to cut the dependency of the system keyspace on storage proxy. The methods still take a storage proxy parameter; this will be replaced with replica::database in the next patch. There is still one hidden storage proxy dependency left, via cql3::query_processor. This will be addressed later.	2023-05-12 07:27:55 -04:00
Wojciech Mitros	9ae1b02144	service: revoke permissions on functions when a function/keyspace is dropped Currently, when a user has permissions on a function/all functions in a keyspace, and the function/keyspace is dropped, the user keeps the permissions. As a result, when a new function/keyspace is created with the same name (and signature), they will be able to use it even if no permissions on it are granted to them. Similarly to regular UDFs, the same applies to UDAs. After this patch, the corresponding permissions on functions are dropped when a function/keyspace is dropped. Fixes #13820 Closes #13823	2023-05-10 14:39:42 +03:00
Petr Gusev	052b91fb1f	storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading We are going to use remapped_endpoints_for_reading, we need to make sure we use it in the right place. The get_live_sorted_endpoints function looks like what we need - it's used in all read code paths. From its name, however, this was not obvious. Also, we add the parameter ks_name as we'll need it to pass to remapped_endpoints_for_reading.	2023-05-09 18:42:03 +04:00
Kamil Braun	acfb6bf3ed	topology_state_machine: introduce `global_topology_request` `topology` currently contains the `requests` map, which is suitable for node-specific requests such as "this node wants to join" or "this node must be removed". But for requests for operations that affect the cluster as a whole, a separate request type and field is more appropriate. Introduce one. The enum currently contains the option `new_cdc_generation` for requests to create a new CDC generation in the cluster. We will implement the whole procedure in later commits.	2023-05-08 16:46:14 +02:00
Kamil Braun	93dcdcd4eb	raft topology: make `new_cdc_generation_data_uuid` topology-global - make it a static column in `system.topology` - move it from node-specific `ring_slice` to cluster-global `topology` We will use it in scenarios where no node is transitioning. Also make it `std::optional` in topology for consistency with other fields (previously, the 'no value' state for this field was represented using default-constructed `utils::UUID`).	2023-05-08 16:46:14 +02:00
Kefu Chai	5fa459bd1a	treewide: do not include unused header since #13452, we switched most of the caller sites from std::regex to boost::regex. in this change, all occurrences of `#include <regex>` are dropped unless std::regex is used in the same source file. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13765	2023-05-07 19:01:29 +03:00
Avi Kivity	f125a3e315	Merge 'tree: finish the reader_permit state renames' from Botond Dénes In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR, however, covered only the states themselves and their usages, as well as the documentation in `docs/dev`. This PR is a followup to said PR, completing the name changes: renaming all symbols, names, comments etc, so all is consistent and up-to-date. Closes #13573 * github.com:scylladb/scylladb: reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes reader_concurrency_semaphore: update API w.r.t. recent permit state name changes reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes	2023-05-04 18:29:04 +03:00
Avi Kivity	1d351dde06	Merge 'Make S3 client work with real S3' from Pavel Emelyanov Current S3 client was tested over minio and it takes a few more touches to work with amazon S3. The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn is the essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements, one of them is -- request _body_ can or cannot be included into signature calculations. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. -- request body should be included into signature calculations), which can be a bit troublesome, so instead the PR uses unsigned payload (i.e. -- doesn't include the request body into signature calculation, only necessary headers and query parameters), but thus also needs HTTPS. So what this set does is make the existing S3 client code sign requests. In order to sign the request the code needs to get AWS key and secret (and region) from somewhere and this somewhere is the conf/object_storage.yaml config file. The signature generating code was previously merged (moved from alternator code) and updated to suit S3 client needs. In order to properly support HTTPS the PR adds special connection factory to be used with seastar http client. The factory makes DNS resolving of AWS endpoint names and configures gnutls systemtrust. 
fixes: #13425 Closes #13493 * github.com:scylladb/scylladb: doc: Add a document describing how to configure S3 backend s3/test: Add ability to run boost test over real s3 s3/client: Sign requests if configured s3/client: Add connection factory with DNS resolve and configurable HTTPS s3/client: Keep server port on config s3/client: Construct it with config s3/client: Construct it with sstring endpoint sstables: Make s3_storage with endpoint config sstables_manager: Keep object storage configs onboard code: Introduce conf/object_storage.yaml configuration file	2023-05-04 18:08:54 +03:00
Pavel Emelyanov	2f6aa5b52e	code: Introduce conf/object_storage.yaml configuration file In order to access real S3 bucket, the client should use signed requests over https. Partially this is due to security considerations, partially this is unavoidable, because multipart-uploading is banned for unsigned requests on the S3. Also, signed requests over plain http require signing the payload as well, which is a bit troublesome, so it's better to stick to secure https and keep payload unsigned. To prepare signed requests the code needs to know three things: - aws key - aws secret - aws region name The latter could be derived from the endpoint URL, but it's simpler to configure it explicitly, all the more so there's an option to use S3 URLs without region name in them we could want to use some time. To keep the described configuration the proposed place is the object_storage.yaml file with the format endpoints: - name: a.b.c port: 443 aws_key: 12345 aws_secret: abcdefghijklmnop ... When loaded, the map gets into db::config and later will be propagated down to sstables code (see next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-03 20:19:15 +03:00
Botond Dénes	48b9f31a08	Merge 'db, sstable: use generation_type instead of its value when appropriate' from Kefu Chai in this series, we try to use `generation_type` as a proxy to hide the consumers from its underlying type. this paves the road to the UUID based generation identifier. as by then, we cannot assume the type of the `value()` without asking `generation_type` first. better off leaving all the formatting and conversions to the `generation_type`. also, this series changes the "generation" column of sstable registry table to "uuid", and convert the value of it to the original generation_type when necessary, this paves the road to a world with UUID based generation id. Closes #13652 * github.com:scylladb/scylladb: db: use uuid for the generation column in sstable registry table db, sstable: add operator data_value() for generation_type db, sstable: print generation instead of its value	2023-05-03 09:04:54 +03:00
Kefu Chai	74e9e6dd1a	db: use uuid for the generation column in sstable registry table * change the "generation" column of sstable registry table from bigint to uuid * from helper to convert UUID back to the original generation in the long run, we encourage user to use uuid based generation identifier. but in the transition period, both bigint based and uuid based identifiers are used for the generation. so to cater to both needs, we use a hackish way to store the integer into UUID. to differentiate the was-integer UUID from the genuine UUID, we check the UUID's most_significant_bits. because we only support serializing UUID v1, if the timestamp in the UUID is zero, we assume the UUID was generated from an integer when converting it back to a generation identifier. also, please note, the only use case of using generation as a column is the sstable_registry table, but since its schema is fixed, we cannot store both a bigint and a UUID as the value of its `generation` column, the simpler way forward is to use a single type for the generation. to be more efficient and to preserve the type of the generation, instead of using types like ascii string or bytes, we will always store the generation as a UUID in this table, if the generation's identifier is an int64_t, the value of the integer will be used as the least significant bits of the UUID. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-05-02 19:23:22 +08:00
Kefu Chai	135b4fd434	db: schema_tables: capture reference to temporary value by value `clustering_key_columns()` returns a range view, and `front()` returns the reference to its first element. so we cannot assume the availability of this reference after the expression is evaluated. to address this issue, let's capture the returned range by value, and keep the first element by reference. this also silences warning from GCC-13: ``` /home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference] 3654 \| const column_definition& first_view_ck = v->clustering_key_columns().front(); \| ^~~~~~~~~~~~~ /home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’ 3654 \| const column_definition& first_view_ck = v->clustering_key_columns().front(); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~ ``` Fixes #13720 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13721	2023-05-02 11:42:43 +03:00
Avi Kivity	c9dab3ac81	Merge 'treewide: fix warnings from GCC-13' from Kefu Chai this series silences the warnings from GCC 13. some of these changes are considered as critical fixes, and posted separately. see also #13243 Closes #13723 * github.com:scylladb/scylladb: cdc: initialize an optional using its value type compaction: disambiguate type name db: schema_tables: drop unused variable reader_concurrency_semaphore: fix signed/unsigned comparision locator: topology: disambiguate type names raft: disambiguate promise name in raft::awaited_conf_changes	2023-05-01 22:48:00 +03:00
Tomasz Grabiec	aba5667760	Merge 'raft topology: refactor the coordinator to allow non-node specific topology transitions' from Kamil Braun We change the meaning and name of `replication_state`: previously it was meant to describe the "state of tokens" of a specific node; now it describes the topology as a whole - the current step in the 'topology saga'. It was moved from `ring_slice` into `topology`, renamed into `transition_state`, and the topology coordinator code was modified to switch on it first instead of node state - because there may be no single transitioning node, but the topology itself may be transitioning. This PR was extracted from #13683, it contains only the part which refactors the infrastructure to prepare for non-node specific topology transitions. Closes #13690 * github.com:scylladb/scylladb: raft topology: rename `update_replica_state` -> `update_topology_state` raft topology: remove `transition_state::normal` raft topology: switch on `transition_state` first raft topology: `handle_ring_transition`: rename `res` to `exec_command_res` raft topology: parse replaced node in `exec_global_command` raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on` storage_service: extract raft topology coordinator fiber to separate class raft topology: rename `replication_state` to `transition_state` raft topology: make `replication_state` a topology-global state	2023-04-30 10:55:24 +02:00
Kefu Chai	56511a42d0	db: schema_tables: drop unused variable this also silences the warning from GCC-13: ``` /home/kefu/dev/scylladb/db/schema_tables.cc:1489:10: error: variable ‘ts’ set but not used [-Werror=unused-but-set-variable] 1489 \| auto ts = db_clock::now(); \| ^~ ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-29 17:02:25 +08:00
Kefu Chai	ba8402067f	db, sstable: add operator data_value() for generation_type so we can apply `execute_cql()` on `generation_type` directly without extracting its value using `generation.value()`. this paves the road to adding UUID based generation id to `generation_type`. as by then, we will have both UUID based and integer based `generation_type`, so `generation_type::value()` will not be able to represent its value anymore. and this method will be replaced by `operator data_value()` in this use case. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-28 20:39:12 +08:00
Kefu Chai	ae9aa9c4bd	db, sstable: print generation instead of its value this change prepares for the change to use `variant<UUID, int64_t>` as the value of `generation_type`. as after this change, the "value" of a generation would be a UUID or an integer, and we don't want to expose the variant in generation's public interface. so the `value()` method would be changed or removed by then. this change takes advantage of the fact that the formatter of `generation_type` always prints its value. also, it's better to reuse `generation_type` formatter when appropriate. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-28 20:39:12 +08:00
Kamil Braun	22ab5982e7	raft topology: remove `transition_state::normal` What this state really represented is that there is currently no transition. So remove it and make `transition_state` optional instead.	2023-04-27 15:18:32 +02:00
Kamil Braun	defa63dc20	raft topology: rename `replication_state` to `transition_state` The new name is more generic - it describes the current step of a 'topology saga' (a sequence of steps used to implement a larger topology operation such as bootstrap).	2023-04-27 11:39:38 +02:00
Kamil Braun	af1ea2bb16	raft topology: make `replication_state` a topology-global state Previously it was part of `ring_slice`, belonging to a specific node. This commit moves it into `topology`, making it a cluster-global property. The `replication_state` column in `system.topology` is now `static`. This will allow us to easily introduce topology transition states that do not refer to any specific node. `commit_cdc_generation` will be such a state, allowing us to commit a new CDC generation even though all nodes are normal (none are transitioning). One could argue that the other states are conceptually already cluster-global: for example, `write_both_read_new` doesn't affect only the tokens of a bootstrapping (or decommissioning etc.) node; it affects replica sets of other tokens as well (with RFs greater than 1).	2023-04-27 11:39:38 +02:00
Kamil Braun	30cc07b40d	Merge 'Introduce tablets' from Tomasz Grabiec This PR introduces an experimental feature called "tablets". Tablets are a way to distribute data in the cluster, which is an alternative to the current vnode-based replication. Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. With tablets, the aim is to start from a different side. Divide resources of replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets). This will allow us to balance the load in a more flexible manner, by moving individual tablets around. Also, unlike with vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind raft groups to tablets. Those goals are not yet achieved with this PR, but it lays the ground for this. Things achieved in this PR: - You can start a cluster and create a keyspace whose tables will use tablet-based replication. This is done by setting `initial_tablets` option: ``` CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3, 'initial_tablets': 8}; ``` All tables created in such a keyspace will be tablet-based. Tablet-based replication is a trait, not a separate replication strategy. Tablets don't change the spirit of replication strategy, it just alters the way in which data ownership is managed. In theory, we could use it for other strategies as well like EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy is augmented to support tablets. 
- You can create and drop tablet-based tables (no DDL language changes) - DML / DQL work with tablet-based tables Replicas for tablet-based tables are chosen from tablet metadata instead of token metadata Things which are not yet implemented: - handling of views, indexes, CDC created on tablet-based tables - sharding is done using the old method, it ignores the shard allocated in tablet metadata - node operations (topology changes, repair, rebuild) are not handling tablet-based tables - not integrated with compaction groups - tablet allocator piggy-backs on tokens to choose replicas. Eventually we want to allocate based on current load, not statically Closes #13387 * github.com:scylladb/scylladb: test: topology: Introduce test_tablets.py raft: Introduce 'raft_server_force_snapshot' error injection locator: network_topology_strategy: Support tablet replication service: Introduce tablet_allocator locator: Introduce tablet_aware_replication_strategy locator: Extract maybe_remove_node_being_replaced() dht: token_metadata: Introduce get_my_id() migration_manager: Send tablet metadata as part of schema pull storage_service: Load tablet metadata when reloading topology state storage_service: Load tablet metadata on boot and from group0 changes db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() migration_notifier: Introduce before_drop_keyspace() migration_manager: Make prepare_keyspace_drop_announcement() return a future<> test: perf: Introduce perf-tablets test: Introduce tablets_test test: lib: Do not override table id in create_table() utils, tablets: Introduce external_memory_usage() db: tablets: Add printers db: tablets: Add persistence layer dht: Use last_token_of_compaction_group() in split_token_range_msb() locator: Introduce tablet_metadata dht: Introduce first_token() dht: Introduce next_token() storage_proxy: Improve trace-level logging locator: token_metadata: Fix confusing comment on ring_range() 
dht, storage_proxy: Abstract token space splitting Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries" db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms() db: Introduce get_non_local_vnode_based_strategy_keyspaces() service: storage_proxy: Avoid copying keyspace name in write handler locator: Introduce per-table replication strategy treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type locator: Introduce effective_replication_map locator: Rename effective_replication_map to vnode_effective_replication_map locator: effective_replication_map: Abstract get_pending_endpoints() db: Propagate feature_service to abstract_replication_strategy::validate_options() db: config: Introduce experimental "TABLETS" feature db: Log replication strategy for debugging purposes db: Log full exception on error in do_parse_schema_tables() db: keyspace: Remove non-const replication strategy getter config: Reformat	2023-04-27 09:40:18 +02:00
Kefu Chai	f5b05cf981	treewide: use defaulted operator!=() and operator==() in C++20, the compiler generates operator!=() if the corresponding operator==() is already defined, the language now understands that the comparison is symmetric in the new standard. fortunately, our operator!=() is always equivalent to `! operator==()`, this matches the behavior of the default generated operator!=(). so, in this change, all `operator!=` are removed. in addition to the defaulted operator!=, C++20 also brings to us the defaulted operator==() -- it is able to generate the operator==() as a member-wise lexicographical comparison. under some circumstances, this is exactly what we need. so, in this change, if the operator==() is also implemented as a lexicographical comparison of all member variables of the class/struct in question, it is implemented using the default generated one by removing its body and marking the function as `default`. moreover, if the class happens to have other comparison operators which are implemented using lexicographical comparison, the default generated `operator<=>` is used in place of the defaulted `operator==`. sometimes, we fail to mark the operator== with the `const` specifier, in this change, to fulfil the need of the C++ standard, and to be more correct, the `const` specifier is added. also, to generate the defaulted operator==, the operand should be `const class_name&`, but it is not always the case, in the class of `version`, we use `version` as the parameter type, to fulfill the need of the C++ standard, the parameter type is changed to `const version&` instead. this does not change the semantic of the comparison operator. and is a more idiomatic way to pass non-trivial struct as function parameters. please note, because in C++20, both operator== and operator<=> are symmetric, some of the operators in `multiprecision` are removed. they are the symmetric form of another variant. 
if they were not removed, compiler would, for instance, find ambiguous overloaded operator '=='. this change is a cleanup to modernize the code base with C++20 features. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13687	2023-04-27 10:24:46 +03:00
Tomasz Grabiec	ce94a2a5b0	Merge 'Fixes and tests for raft-based topology changes' from Kamil Braun Fix two issues with the replace operation introduced by recent PRs. Add a test which performs a sequence of basic topology operations (bootstrap, decommission, removenode, replace) in a new suite that enables the `raft` experimental feature (so that the new topology change coordinator code is used). Fixes: #13651 Closes #13655 * github.com:scylladb/scylladb: test: new suite for testing raft-based topology test: remove topology_custom/test_custom.py raft topology: don't require new CDC generation UUID to always be present raft topology: include shard_count/ignore_msb during replace	2023-04-26 11:38:07 +02:00
Pavel Emelyanov	5cbc8fe2f9	system_keyspace: De-static save_local_supported_features() That's, in fact, an independent change, because feature enabler doesn't need this method. So this patch is like "while at it" thing, but on the other hand it ditches one more qctx usage. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 17:04:54 +03:00
Pavel Emelyanov	a5bd6cc832	system_keyspace: De-static load_\|save_local_enabled_features() All callers now have the system keyspace instance at hand. Unfortunately, this de-static doesn't allow more qctx drops, because both methods use set_\|get_scylla_local_param helpers that do use qctx and are still in use by other static methods. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 17:03:09 +03:00
Pavel Emelyanov	9bfbcaa3f6	system_keyspace: Move enable_features_on_startup to feature_service (cont) Now move the code itself. No functional changes here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 17:02:38 +03:00
Pavel Emelyanov	858db9f706	system_keyspace: Move enable_features_on_startup to feature_service This code belongs to feature service, system keyspace shouldn't be aware of any peculiarities of startup features enabling, only loading and saving the feature lists. For now the move happens only in terms of code declarations, the implementation is kept in its old place to reduce the patch churn. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 17:00:30 +03:00
Pavel Emelyanov	1ee04e4934	system_keyspace: De-static load_peer_features() This makes use of feature_enabler::_sys_ks dependency and gets rid of one more global qctx usage. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-25 16:50:00 +03:00
Kamil Braun	3f0498ca53	raft topology: don't require new CDC generation UUID to always be present During node replace we don't introduce a new CDC generation, only during regular bootstrap. Instead of checking that `new_cdc_generation_uuid` must be present whenever there's a topology transition, only check it when we're in `commit_cdc_generation` state.	2023-04-24 14:41:33 +02:00
Tomasz Grabiec	41e69836fd	db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	9d786c1ebc	db: tablets: Add persistence layer	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	9b17ad3771	locator: Introduce per-table replication strategy Will be used by tablet-based replication strategies, for which effective replication map is different per table. Also, this patch adapts existing users of effective replication map to use the per-table effective replication map. For simplicity, every table has an effective replication map, even if the erm is per keyspace. This way the client code can be uniform and doesn't have to check whether replication strategy is per table. Not all users of per-keyspace get_effective_replication_map() are adapted yet to work per-table. Those algorithms will throw an exception when invoked on a keyspace which uses per-table replication strategy.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	9781d3ffc5	db: config: Introduce experimental "TABLETS" feature	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	bf2ce8ff75	config: Reformat	2023-04-24 10:49:36 +02:00
Botond Dénes	2d8d8043be	Merge 'Coroutinize system_keyspace::get_compaction_history' from Pavel Emelyanov Closes #13620 * github.com:scylladb/scylladb: system_keyspace: Fix indentation after previous patch system_keyspace: Coroutinize get_compaction_history()	2023-04-24 09:48:01 +03:00
Botond Dénes	85abece927	Merge 'Restrict logging of current_backtrace to log_level' from Benny Halevy `seastar::current_backtrace()` can be quite heavy. When we pass it to a log message at a relatively detailed log_level (debug/trace), we pay the price of `current_backtrace` every time, but we rarely print the message. Closes #13527 * github.com:scylladb/scylladb: locator/topology: call seastar::current_backtrace only when log_level is enabled schema_tables: call seastar::current_backtrace only when log_level is enabled	2023-04-24 08:50:32 +03:00
Botond Dénes	7f04d8231d	Merge 'gms: define and use generation and version types' from Benny Halevy This series cleans up the generation and value types used in gms / gossiper. Currently we use a blend of int, int32_t, and int64_t around messaging. This change defines gms::generation_type and gms::version_type as int32_t and adds a check in non-release modes that the respective int64 values passed over messaging do not overflow 32 bits. Closes #12966 * github.com:scylladb/scylladb: gossiper: version_generator: add {debug_,}validate_gossip_generation gms: gossip_digest: use generation_type and version_type gms: heart_beat_state: use generation_type and version_type gms: versioned_value: use version_type gms: version_generator: define version_type and generation_type strong types utils: move generation-number to gms utils: add tagged_integer gms: versioned_value: make members private scylla-gdb: add get_gms_versioned_value gms: versioned_value: delete unused compare_to function gms: gossip_digest: delete unused compare_to function	2023-04-24 08:44:48 +03:00
Pavel Emelyanov	5e201b9120	database: Remove compaction_manager.hh inclusion into database.hh The only reason why it's there (right next to compaction_fwd.hh) is because the database::table_truncate_state subclass needs the definition of compaction_manager::compaction_reenabler subclass. However, the former sub is not used outside of database.cc and can be defined in .cc. Keeping it outside of the header allows dropping the compaction_manager.hh from database.hh thus greatly reducing its fanout over the code (from ~180 indirect inclusions down to ~20). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13622	2023-04-23 16:27:11 +03:00
Benny Halevy	2d20ee7d61	gms: version_generator: define version_type and generation_type strong types Derived from utils::tagged_integer, using different tags, the types are incompatible with each other and require explicit typecasting to- and from- their value type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:47:17 +03:00

3092 Commits