this change should silence the warning like
```
/home/kefu/dev/scylladb/repair/repair.cc:222:23: error: comparison of integers of different signs: 'int' and 'size_type' (aka 'unsigned long') [-Werror,-Wsign-compare]
222 | for (int i = 0; i < all.size(); i++) {
| ~ ^ ~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16867
Since a given tablet belongs to a single shard on both repair master and repair
followers, row level repair code needs to be changed to work on a single
shard for a given tablet. In order to tell the repair followers which
shard to work on, a dst_cpu_id value is passed over rpc from the repair
master.
A helper to get the dst shard id on the repair follower.
If the repair master specifies the shard id for the follower, use it.
Otherwise, the follower chooses one itself.
remove the unused #include headers from repair.hh, as they are not
directly used. after this change, task_manager_module.hh fails to
have access to stream_reason, so include it where it is used.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16618
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.
In particular, thise allows storage_service::on_change
to update_peer_info once for all changed states.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
None of the subscribers is doing anything before_change.
This is done before changing `on_change` in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
database::get_token_metadata() is switched to token_metadata2.
get_all_ips method is added to the host_id-based token_metadata, since
its convenient and will be used in several places. It returns all current
nodes converted to inet_address by means of the topology
contained within token_metadata.
hint_sender::can_send: if the node has already left the
cluster we may not find its host_id. This case is handled
in the same way as if it's not a normal token owner - we
simply send a hint to all replicas.
In this commit we replace token_metadata with token_metadata2
in the erm interface and field types. To accommodate the change
some of strategy-related methods are also updated.
All the boost and topology tests pass with this change.
We've updated all the places where token_metadata
is mutated, and now we can progress to the next stage
of the refactoring - gradually switching the read
code paths.
The calculate_natural_endpoints function
is at the core of all of them. It decides to what nodes
the given token should be replicated to for the given
token_metadata. It has a lot of usages in various contexts,
we can't switch them all in one commit, so instead we
allowed the function to behave in both ways. If
use_host_id parameter is false, the function uses the provided
token_metadata as is and returns endpoint_set as a result.
If it's true, it uses get_new() on the provided token_metadata
and returns host_id_set as a result.
The scope of the whole refactoring is limited to the erm data
structure, its interface will be kept inet_address based for now.
This means we'll often need to resolve host_ids to inet_address-es
as soon as we got a result from calculated_natural_endpoints.
A new calculate_natural_ips function is added for convenience.
It uses the new token_metadata and immediately resolves
returned host_id-s to inet_address-es.
The auxiliary declarations natural_ep_type, set_type, vector_type,
get_self_id, select_tm are introduced only for the sake of
migration, they will be removed later.
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors.
Refs: https://github.com/scylladb/scylladb/issues/16255Closesscylladb/scylladb#16289
* github.com:scylladb/scylladb:
Update unified/build_unified.sh
Update main.cc
Update dist/common/scripts/scylla-housekeeping
Typos: fix typos in code
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define a formatter for
row_level_diff_detect_algorithm. but its operator<<() is preserved,
as we are still using our homebrew the generic formatter for
std::vector, and this formatter is still using operator<< for formatting
the elements in the vector.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16248
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
- remove some code that is obsolete in newer Scylla versions,
- fix some minor bugs. These bugs appear to be benign, there are no known issues caused by them, but fixing them is a good idea nevertheless,
- refactor some code for better maintainability.
Parts of this PR were extracted from https://github.com/scylladb/scylladb/pull/15331 (which was merged but later reverted), parts of it are new.
Closesscylladb/scylladb#16162
* github.com:scylladb/scylladb:
test/pylib: log_browsing: fix type hint
migration_manager: take `abort_source&` in get_schema_for_read/write
migration_manager: inline merge_schema_in_background
migration_manager: remove unused merge_schema_from overload
migration_manager: assume `canonical_mutation` support
migration_manager: add `std::move` to avoid a copy
schema_tables: refactor `scylla_tables(schema_features)`
schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
system_keyspace: fix outdated comment
repair: Introduce small table optimization
*) Problem:
We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk though all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.
*) Solution:
To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.
- No token range to repair is needed by the user. It will repair all token
ranges automatically.
- Users only need to send the repair rest api to one of the nodes in the
cluster. It can be any of the nodes in the cluster.
- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.
*) Performance:
1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}
Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```
After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```
The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X
2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}
Same test as above except 5ms delay is added to simulate multiple dc
network latency:
The time to repair is reduced from 333s to 0.2s.
333.26758 s / 0.22625381s = 1472.98
3)
3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency
Before:
```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```
After:
```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```
The time to repair is reduced from 986s (16 minutes) to 1.6s
*) Summary
So, a more than 1000X difference is observed for this common usage of
system table repair procedure.
Fixes#16011
Refs #15159Closesscylladb/scylladb#15974
* github.com:scylladb/scylladb:
repair: Introduce small table optimization
repair: Convert put_row_diff_with_rpc_stream to use coroutine
*) Problem:
We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk though all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.
*) Solution:
To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.
- No token range to repair is needed by the user. It will repair all token
ranges automatically.
- Users only need to send the repair rest api to one of the nodes in the
cluster. It can be any of the nodes in the cluster.
- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.
*) Performance:
1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}
Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```
After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```
The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X
2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}
Same test as above except 5ms delay is added to simulate multiple dc
network latency:
The time to repair is reduced from 333s to 0.2s.
333.26758 s / 0.22625381s = 1472.98
3)
3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency
Before:
```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```
After:
```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```
The time to repair is reduced from 986s (16 minutes) to 1.6s
*) Summary
So, a more than 1000X difference is observed for this common usage of
system table repair procedure.
Fixes#16011
Refs #15159
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
Extract decorated_key.hh and ring_position.hh
out of i_partitioner.hh so they can be included
selectively, since i_partitioner.hh contains too much
bagage that is not always needed in full.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We have observed do_repair_ranges() receiving tens of thousands of
ranges to repairs on occasion. do_repair_ranges() repairs all ranges in
parallel, with parallel_for_each(). This is normally fine, as the lambda
inside parallel_for_each() takes a semaphore and this will result in
limited concurrency.
However, in some instances, it is possible that most of these ranges are
skipped. In this case the lambda will become synchronous, only logging a
message. This can cause stalls beacuse there are no opportunities to
yield. Solve this by adding an explicit yield to prevent this.
Fixes: #14330Closesscylladb/scylladb#15879
Exception thrown from row_level_repair::run does not show the root
cause of a failure making it harder to debug.
Add the internal exception contents to runtime_error message.
After the change the log will mention the real cause (last line), e.g.:
repair - repair[92db0739-584b-4097-b6e2-e71a66e40325]: 33 out of 132 ranges failed,
keyspace=system_distributed, tables={cdc_streams_descriptions_v2, cdc_generation_timestamps,
view_build_status, service_levels}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false,
failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_distributed,
cf=cdc_streams_descriptions_v2, range=(8720988750842579417,+inf))
(while cleaning up after seastar::abort_requested_exception (abort requested))
Closesscylladb/scylladb#15770
If abort is requsted during bootstrap then a node should exit normally.
To achieve so, abort_requested_exception should be thrown as main
handles it gracefully.
In data_sync_repair_task_impl::run exceptions from all shards are
wrapped together into std::runtime_exception and so they aren't
handled as they are supposed to.
Throw abort_requested_exception when shutdown was requested.
Throw abort_requested_exception also if repair::task_manager_module::is_aborted,
so that force_terminate_all_repair_sessions acts the same regardless
the state of the repair.
To maintain consistency do the same for user_requested_repair_task_impl.
Fixes: #15710.
Closesscylladb/scylladb#15722
Before integration with task manager the state of one shard repair
was kept in repair_info. repair_info object was destroyed immediately
after shard repair was finished.
In an integration process repair_info's fields were moved to
shard_repair_task_impl as the two served the similar purposes.
Though, shard_repair_task_impl isn't immediately destoyed, but is
kept in task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which makes the repair state change delayed.
Release shard_repair_task_impl resources immediately after shard
repair is finished.
Fixes: #15505.
Closesscylladb/scylladb#15506
Most of the time only the roots of tasks tree should be non internal.
Change default implementation of is_internal and delete overrides
consistent with it.
Closesscylladb/scylladb#15353
Node operations will be integrated with task manager and so node_ops
directory needs to be created. To have an access to node ops related
classes from task manager and preserve consistent naming, move
the classes to node_ops/node_ops_data.cc.
The local node's dc:rack pair is cached on system keyspace on start. However, most of other code don't need it as they get dc:rack from topology or directly from snitch. There are few places left that still mess with sysks cache, but they are easy to patch. So after this patch all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.
Closes#15280
* github.com:scylladb/scylladb:
system_keyspace: Don't require snitch argument on start
system_keyspace: Don't cache local dc:rack pair
system_keyspace: Save local info with explicit location
storage_service: Get endpoint location from snitch, not system keyspace
snitch: Introduce and use get_location() method
repair: Local location variables instead of system keyspace's one
repair: Use full endpoint location instead of datacenter part
Find progress of repair tasks based on the number of ranges
that have been repaired.
Fixes: [#1156](https://github.com/scylladb/scylla-enterprise/issues/1156).
Closes#14698
* github.com:scylladb/scylladb:
test: repair tasks test
repair: add methods making repair progress more precise
tasks: make progress related methods virtual
repair: add get_progress method to shard_repair_task_impl
repair: add const noexcept qualifiers to shard_repair_task_impl::ranges_size()
repair: log a name of a particular table repair is working on
tasks: delete move and copy constructors from task_manager::task::impl
Previous patch made full endpoint location be available as a local
variable near the places that get this location from the system
keyspace. This patch replaces the sys.ks. calls with the variables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places in repair code that get datacenter from the
topology. Nearby there are calls to update_topology() which, in turn,
needs full location ({dc, rack} pair). This patch makes the former
places obtain full location from topology and get the dc part from it.
This is needed as a preparation to let latter places use that location.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the endpoint_state isn't change in place
we do not need to copy it to each subscriber.
We can rather just pass the lw_shared_ptr holding
a snapshot of it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Override methods returning expected children number and job size
in repair tasks. With them get_progress method would be able to
return more precise progress value.