Commit Graph

454 Commits

Author SHA1 Message Date
Avi Kivity
d237d0a4ea Update seastar submodule
* seastar 71036ebcc0...5b95d1d798 (3):
  > rpc stream: do not abort stream queue if stream connection was closed without error
  > resource: fallback to sysconf when failed to detect memory size from hwloc
  > Merge 'scheduling_group: improve scheduling group creation exception safety' from Michael Litvak

scylla-gdb.py adjusted for scheduling_group_specific data structure
changes in Seastar. As part of that, a gratuitous dereference of
std::unique_ptr, which fails for std::unique_ptr<void*, ...>, was
removed.
2025-02-03 00:10:38 +02:00
Botond Dénes
b70dccb638 sstables: disk_types: disk_set_of_tagged_union: boost::variant -> std::variant
In the spirit of using standard-library types, instead of boost ones
where possible.
Although a disk type, it is serialized/deserialized with custom code, so
the change shouldn't cause any changes in the disk representation.
2025-01-27 09:29:26 -05:00
Botond Dénes
a095b3bb80 scylla-gdb.py: std_variant: fix get()
It calls self.get_with_type() with one too many params.
2025-01-27 09:29:26 -05:00
Piotr Dulikowski
7383013f43 replica/database: add reader concurrency semaphore groups
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.

Each group is statically assigned to some pool of memory on startup and
dynamically distribute this memory between the semaphores, relative to
the number of shares of the corresponding scheduling group.

The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as make
memory allocation more fair between service levels due to the adjusted
number of shares.
2025-01-02 07:13:34 +01:00
Botond Dénes
e942c074f2 compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.

Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
2024-11-03 10:17:11 +02:00
Avi Kivity
b9df3aec12 gdb: avoid @classmethod/@property combinations
The @classmethod/@property combination was deprecated in Python 3.11
and removed[1] in Python 3.13. It's used in scylla-gdb.py, breaking it
with Python 3.13.

To fix, just make all users (size_t and _vptr_type) top-level
functions. The definitions are all identical and don't need to be
in class scope.

[1] https://docs.python.org/3.13/library/functions.html#classmethod

Closes scylladb/scylladb#21349
2024-10-29 19:37:07 +02:00
Wojciech Mitros
242079d70b mv: add a dedicated read concurrency semaphore for view update read before writes
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.

This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.

The second issue fixed by this patch is the concurrency of the view updates that is
currently unlimited. Because of that view updates may take up so much memory that
they we may run out of memory.

This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.

The new semaphore has half the capacity of the regular user read concurrency semahpore
and is currently used only for user writes - is't used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply) and we shouldn't have view updates in system
writes or during compaction.

Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
2024-10-21 11:02:06 +02:00
Botond Dénes
38088daa1f scylla-gdb.py: drop compatibility code for EOL releases
Any release < 6.0 or < 2023.1 is EOL and need not be supported by
scylla-gdb.py anymore. Remove compatibility code for these releases.

Closes scylladb/scylladb#20918
2024-10-03 15:42:08 +03:00
Tomasz Grabiec
8e047e8fff gdb: Add std::set wrapper
Allows accessing std::set fields from gdb, e.g.:

(gdb) python for e in std_set(_promoted_index._blocks): print(e)

Closes scylladb/scylladb#20650
2024-09-20 08:24:15 +03:00
Tomasz Grabiec
e70ce4d6ed gdb: Introduce "scylla sstable-dump-cached-index" command 2024-09-17 14:41:18 +02:00
Tomasz Grabiec
9f0eed263d gdb: Introduce "scylla sstable-promoted-index" command 2024-09-17 14:41:13 +02:00
Tomasz Grabiec
2c463ead59 gdb: Fix range printer for singular ranges
Before, it printed [x, +inf) instead of {x}
2024-09-17 14:30:28 +02:00
Kefu Chai
7dd63c891f scylla-gdb.py: lazy-evaluate the constants
instead of evaluating the constants in-class, accessing them via
a cached class property.

it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available with
a file being debugged. so when gdb fails to load init script:

```
Traceback (most recent call last):
  File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
    class intrusive_slist:
  File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
    size_t = gdb.lookup_type('size_t')
             ^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```

so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.

the reason is that the script access the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.

in this change, we transform all these class variables to cached
property, so that they

* are evaluated on-demand
* are evaluated only once at most

this addresses the pain at the expense of verbosity.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-30 17:05:29 +08:00
Kefu Chai
5cffb23aa3 scylla-gdb.py: use chunked_fifo to represent _sink._pending_io
we switched from `circular_buffer` to `chunked_fifo` to present
`io_sink::_pending_io` in the latest seastar now. to be prepared for
this change, let's

* add `chunked_fifo` class in `scylla-gdb.py`.
* use `circular_buffer` as a fallback of `chunked_fifo`. instead of
  doing this the other way around, we try to send the message that
  the latest seastar uses `chunked_fifo`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20280
2024-08-27 08:44:56 +03:00
Botond Dénes
53a6ec05ed Merge 'replica: remove rwlock for protecting iteration over storage group map' from Raphael "Raph" Carvalho
rwlock was added to protect iterations against concurrent updates to the map.

the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).

the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time).

to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out.

Fixes #18821.

```
WRITE
=====

./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets --write

- BEFORE

65559.52 tps ( 59.6 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   52841 insns/op,   30946 cycles/op,        0 errors)
67408.05 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53018 insns/op,   30874 cycles/op,        0 errors)
67714.72 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53026 insns/op,   30881 cycles/op,        0 errors)
67825.57 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53015 insns/op,   30821 cycles/op,        0 errors)
67810.74 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53009 insns/op,   30828 cycles/op,        0 errors)

         throughput: mean=67263.72 standard-deviation=967.40 median=67714.72 median-absolute-deviation=547.02 maximum=67825.57 minimum=65559.52
instructions_per_op: mean=52981.61 standard-deviation=79.09 median=53014.96 median-absolute-deviation=36.54 maximum=53025.79 minimum=52840.56
  cpu_cycles_per_op: mean=30869.90 standard-deviation=50.23 median=30874.06 median-absolute-deviation=42.11 maximum=30945.94 minimum=30820.89

- AFTER
65448.76 tps ( 59.5 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   52788 insns/op,   31013 cycles/op,        0 errors)
67290.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53025 insns/op,   30950 cycles/op,        0 errors)
67646.81 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53025 insns/op,   30909 cycles/op,        0 errors)
67565.90 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53058 insns/op,   30951 cycles/op,        0 errors)
67537.32 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   52983 insns/op,   30963 cycles/op,        0 errors)

         throughput: mean=67097.93 standard-deviation=931.44 median=67537.32 median-absolute-deviation=467.97 maximum=67646.81 minimum=65448.76
instructions_per_op: mean=52975.85 standard-deviation=108.07 median=53024.55 median-absolute-deviation=49.45 maximum=53057.99 minimum=52788.49
  cpu_cycles_per_op: mean=30957.17 standard-deviation=37.43 median=30951.31 median-absolute-deviation=7.51 maximum=31013.01 minimum=30908.62

READ
=====

./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets

- BEFORE

79423.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41840 insns/op,   26820 cycles/op,        0 errors)
81076.70 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41837 insns/op,   26583 cycles/op,        0 errors)
80927.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41829 insns/op,   26629 cycles/op,        0 errors)
80539.44 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41841 insns/op,   26735 cycles/op,        0 errors)
80793.10 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41864 insns/op,   26662 cycles/op,        0 errors)

         throughput: mean=80551.99 standard-deviation=661.12 median=80793.10 median-absolute-deviation=375.37 maximum=81076.70 minimum=79423.36
instructions_per_op: mean=41842.20 standard-deviation=13.26 median=41840.14 median-absolute-deviation=5.68 maximum=41864.50 minimum=41829.29
  cpu_cycles_per_op: mean=26685.88 standard-deviation=93.31 median=26662.18 median-absolute-deviation=56.47 maximum=26820.08 minimum=26582.68

- AFTER
79464.70 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41799 insns/op,   26761 cycles/op,        0 errors)
80954.58 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41803 insns/op,   26605 cycles/op,        0 errors)
81160.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41811 insns/op,   26555 cycles/op,        0 errors)
81263.10 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41814 insns/op,   26527 cycles/op,        0 errors)
81162.97 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41806 insns/op,   26549 cycles/op,        0 errors)

         throughput: mean=80801.25 standard-deviation=755.54 median=81160.90 median-absolute-deviation=361.72 maximum=81263.10 minimum=79464.70
instructions_per_op: mean=41806.47 standard-deviation=5.85 median=41806.05 median-absolute-deviation=4.05 maximum=41813.86 minimum=41799.36
  cpu_cycles_per_op: mean=26599.22 standard-deviation=94.84 median=26554.54 median-absolute-deviation=50.51 maximum=26761.06 minimum=26527.05
```

Closes scylladb/scylladb#19469

* github.com:scylladb/scylladb:
  replica: remove rwlock for protecting iteration over storage group map
  replica: get rid of fragile compaction group intrusive list
2024-07-12 15:45:36 +03:00
Michał Chojnowski
fdd8b03d4b scylla-gdb.py: add $coro_frame()
Adds a convenience function for inspecting the coroutine frame of a given
seastar task.

Short example of extracting a coroutine argument:

```
(gdb) p *$coro_frame(seastar::local_engine->_current_task)
$1 = {
  __resume_fn = 0x2485f80 <sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&)>,
  ...
  PointerType_7 = 0x601008e67880,
  ...
  __coro_index = 0 '\000'
  ...
(gdb) p $downcast_vptr($->PointerType_7)
  $2 = (schema *) 0x601008e67880
```

Closes scylladb/scylladb#19479
2024-07-10 21:46:27 +03:00
Raphael S. Carvalho
c539b7c861 replica: remove rwlock for protecting iteration over storage group map
rwlock was added to protect iterations against concurrent updates to the map.

the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).

the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).

to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.

Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.

Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.

Fixes #18821.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-09 16:59:24 -03:00
Botond Dénes
9544c364be scylla-gdb.py: introduce scylla large-objects
The equivalent of small-objects, but for large objects (spans).
Allows listing object of a large-class, and therefore investigating a
run-away class, by attempting to identify the owners of the objects in
it.

Written to investigate #16493

Closes scylladb/scylladb#16711
2024-07-09 10:21:09 +03:00
Michał Chojnowski
c7dc3b9b58 scylla-gdb.py: add line information to coroutine names in scylla fiber
For convenience.

Note that this line info only points to the function as a whole, not to the
current suspend point.
I think there's no facility for converting the `__coro_index` to the current suspend point automatically.

Before:

```
(gdb) scylla fiber seastar::local_engine->_current_task
[shard  1] #0  (task*) 0x0000601008e8e970 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is seastar::future<void> sstables::parse<unsigned int, std::pair<sstables::metadata_type, unsigned int> >(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::disk_array<unsigned int, std::pair<sstables::metadata_type, unsigned int> >&) [clone .resume] )
[shard  1] #1  (task*) 0x00006010092acf10 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&) [clone .resume] )
[shard  1] #2  (task*) 0x0000601008e648d0 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::sstable::read_simple<(sstables::component_type)8, sstables::statistics>(sstables::statistics&)::{lambda(sstables::sstable_version_types, seastar::file&&, unsigned long)#1}::operator()(sstables::sstable_version_types, seastar::file&&, unsigned long) const [clone .resume] )
```

After:

```
(gdb) scylla fiber seastar::local_engine->_current_task
[shard  1] #0  (task*) 0x0000601008e8e970 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (sstables::parse<unsigned int, std::pair<sstables::metadata_type, unsigned int> >(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::disk_array<unsigned int, std::pair<sstables::metadata_type, unsigned int> >&) at sstables/sstables.cc:352)
[shard  1] #1  (task*) 0x00006010092acf10 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&) at sstables/sstables.cc:570)
[shard  1] #2  (task*) 0x0000601008e648d0 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (sstables::sstable::read_simple<(sstables::component_type)8, sstables::statistics>(sstables::statistics&)::{lambda(sstables::sstable_version_types, seastar::file&&, unsigned long)#1}::operator()(sstables::sstable_version_types, seastar::file&&, unsigned long) const at sstables/sstables.cc:992)

```

Closes scylladb/scylladb#19478
2024-06-25 13:55:10 +03:00
Avi Kivity
fdc1449392 treewide: rename flat_mutation_reader_v2 to mutation_reader
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:

  e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
  08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"

as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.

The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit

  026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"

In turn, flat_mutation_reader was introduced in 2017 in commit

  748205ca75 "Introduce flat_mutation_reader"

To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.

Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.

Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.

Some notes about the transition:

 - files were also renamed. In one case (flat_mutation_reader_test.cc), the
   rename target already existed, so we rename to
    mutation_reader_another_test.cc.

 - a namespace 'mutation_reader' with two definitions existed (in
   mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
   class. As a result, a few #includes had to be adjusted.

Closes scylladb/scylladb#19356
2024-06-21 07:12:06 +03:00
Michał Chojnowski
c901139d07 scylla-gdb.py: print coroutine names in scylla fiber
Enriches the output of `scylla fiber` with resolved names of coroutine resume functions.

Before:

```
[shard  2] #0  (task*) 0x0000602004c9fbf0 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16
[shard  2] #1  (task*) 0x0000602000344c90 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16
[shard  2] #2  (task*) 0x0000602004b30c50 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16
```

After:

```
[shard  2] #0  (task*) 0x0000602004c9fbf0 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is seastar::future<void> sstables::parse<unsigned int, std::pair<sstables::metadata_type, unsigned int> >(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::disk_array<unsigned int, std::pair<sstables::metadata_type, unsigned int> >&) [clone .resume] )
[shard  2] #1  (task*) 0x0000602000344c90 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&) [clone .resume] )
[shard  2] #2  (task*) 0x0000602004b30c50 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::sstable::read_simple<(sstables::component_type)8, sstables::statistics>(sstables::statistics&)::{lambda(sstables::sstable_version_types, seastar::file&&, unsigned long)#1}::operator()(sstables::sstable_version_types, seastar::file&&, unsigned long) const [clone .resume] )
```

Closes scylladb/scylladb#19091
2024-06-04 22:32:17 +03:00
Marcin Maliszkiewicz
2ab143fb40 db: auth: move auth tables to system keyspace
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.

Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769
2024-05-26 22:30:42 +03:00
Tomasz Grabiec
4d84451cf1 sstables, gdb: Track readers in a linked list
For the purpose of scylla-gdb.py command "scylla
active-sstables". Before the patch, readers were located by scanning
the heap for live objects with vtable pointers corresponding to
readers. It was observed that the test scylla_gdb/test_misc.py::test_active_sstables started failing like this:

  gdb.error: Error occurred in Python: Cannot access memory at address 0x300000000000000

This could be explained by there being a live object on the heap which
used to be a reader but now is a different object, and the _sst field
contains some other data which is not a pointer.

To fix, track readers explicitly in a linked list so that the gdb
script can reliably walk readers.

Fixes #18618.
2024-05-16 00:28:46 +02:00
Lakshmi Narayanan Sreethar
3ef2f79d14 sstable: renamed intrusive list link type
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Kefu Chai
0b0e661a85 build: bring abseil submodule back
because of https://bugzilla.redhat.com/show_bug.cgi?id=2278689,
the rebuilt abseil package provided by fedora has different settings
than the ones if the tree is built with the sanitizer enabled. this
inconsistency leads to a crash.

to address this problem, we have to reinstate the abseil submodule, so
we can built it with the same compiler options with which we build the
tree.

in this change

* Revert "build: drop abseil submodule, replace with distribution abseil"
* update CMake building system with abseil header include settings
* bump up the abseil submodule to the latest LTS branch of abseil:
  lts_2024_01_16
* update scylla-gdb.py to adapt to the new structure of
  flat_hash_map

This reverts commit 8635d24424.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18511
2024-05-05 23:31:09 +03:00
Kefu Chai
3b50c39a83 scylla-gdb: access io_queue::_streams and io_queue::_fgs with static_vector
in seastar's b28342fa5a301de3facf5e83dc691524a6b20604, we switched
* `io_queue::_streams` from
  `boost::container::small_vector<fair_queue, 2>` to
  `boost::container::static_vector<fair_queue, 2>`
* `io_queue::_fgs` from
  `std::vector<std::unique_ptr<fair_group>>` to
  `boost::container::static_vector<fair_group, 2>`

so we need to update the gdb script accordingly to reflect this
change, and to avoid the nested try-except blocks, we switch to
a `while` statement to simplify the code structure.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18165
2024-04-04 11:39:10 +03:00
Kefu Chai
50c6fc1141 scylla-gdb: use current_scheduling_group_ptr instead of task_queue._current
Seastar removed `task_queue::_current` in
258b11220d343d8c7ae1a2ab056fb5e202723cc8 . let's adapt scylla-gdb.py
accordingly. despite that `current_scheduling_group_ptr()` is an internal
API, it's been around for a while, and relatively stable. so let's use
it instead.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17720
2024-03-11 13:13:59 +02:00
Nadav Har'El
a36c8b28dd Merge 'scylla-gdb.py: fixes warnings raised by flake8' from Kefu Chai
this changeset addresses some warnings raised by flake8 in hope to improve the readability of this script in general.

Closes scylladb/scylladb#17668

* github.com:scylladb/scylladb:
  scylla-gdb: s/if not foo is None/if foo is not None/
  scylla-gdb.py: add space after keyword
  scylla-gdb.py: remove extraneous spaces
  scylla-gdb.py: use 2 empty lines between top-level funcs/classes
  scylla-gdb.py: replace <tab> with 4 spaces
  scylla-gdb: fix the indent
2024-03-07 10:41:15 +02:00
Kefu Chai
4f8b618be7 scylla-gdb: s/if not foo is None/if foo is not None/
more readable this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-03-07 13:46:38 +08:00
Kefu Chai
643a6d5bda scylla-gdb.py: add space after keyword
it'd be more pythonic to just put an expression after `assert`,
instead of quoting it with a pair of parenthesis. and there is no need
to add `;` after `break`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-03-07 13:46:38 +08:00
Kefu Chai
8c65f92f1f scylla-gdb.py: remove extraneous spaces
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-03-07 13:46:38 +08:00
Kefu Chai
12c06c39c3 scylla-gdb.py: use 2 empty lines between top-level funcs/classes
and 1 empty line for nested functions/classes, to be more PEP8
compliant. for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-03-07 13:46:38 +08:00
Kefu Chai
8e3b22c76a scylla-gdb.py: replace <tab> with 4 spaces
do not mix tab and spaces for indent

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-03-07 13:46:38 +08:00
Kefu Chai
c4b679fe3b scylla-gdb: fix the indent
indent should be multiple of 4 spaces.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-03-07 13:46:38 +08:00
Avi Kivity
c5f01349b1 Merge 'Add specialized tablet_sstable_set' from Benny Halevy
Make a specialized sstable_set for tablets
via tablet_storage_group_manager::make_sstable_set.

This sstable set takes a snapshot of the storage_groups
(compound) sstable_sets and maps the selected tokens
directly into the tablet compound_sstable_set.

This sstable_set provides much more efficient access
to the table's sstable sets as it takes advantage of the disjointness
of sstable sets between tablets/storage_groups, and making it is cheaper
that rebuilding a complete partitioned_sstable_set from all sstables in the table.

Fixes #16876

Cassandra-stress setup:
```
$ sudo cpupower frequency-set -g userspace
$ build/release/scylla (developer-mode options) --smp=16 --memory=8G --experimental-features=consistent-topology-changes --experimental-features=tablets
cqlsh> CREATE KEYSPACE keyspace1 WITH replication={'class':'NetworkTopologyStrategy', 'replication_factor':1} AND tablets={'initial':2048};
$ ./tools/java/tools/bin/cassandra-stress write no-warmup n=10000000 -pop 'seq=1...10000000' -rate threads=128
$ scylla-api-client system drop_sstable_caches POST
$ ./tools/java/tools/bin/cassandra-stress read no-warmup duration=60s -pop 'dist=uniform(1..10000000)' -rate threads=128
$ scylla-api-client system drop_sstable_caches POST
$ ./tools/java/tools/bin/cassandra-stress mixed no-warmup duration=60s -pop 'dist=uniform(1..10000000)' -rate threads=128
```

Baseline (0a7854ea4d) vs. fix (0c2c00f01b)

Throughput (op/s):
workload | baseline | fix
---------|----------|----------
write | 76,806 | 100,787
read | 34,330 | 106,099
mixed | 32,195 | 79,246

Closes scylladb/scylladb#17149

* github.com:scylladb/scylladb:
  table: tablet_storage_group_manager: make tablet_sstable_set
  storage_group_manager: add make_sstable_set
  tablet_storage_group_manager: handle_tablet_split_completion: pre-calc new_tablet_count
  table: tablet_storage_group_manager: storage_group_of: do not validate in release build mode
  table: move compaction_group_list and storage_group_vector to storage_group_manager
  compaction_group::table_state: get_group_id: become self-sufficient
  compaction_group, table: make_compound_sstable_set: declare as const
  tablet_storage_group_manager: precalculate my_host_id and _tablet_map
  table: coroutinize update_effective_replication_map
2024-03-06 23:59:39 +02:00
Benny Halevy
7f203f0551 table: move compaction_group_list and storage_group_vector to storage_group_manager
So the storage_group_manager can be used later by table_sstable_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-06 10:35:33 +02:00
Marcin Maliszkiewicz
9144d8203b db: add system_auth_v2 keyspace
New keyspace is added similarly as system_schema keyspace,
it's being registred via system_keyspace::make which calls
all_tables to build its schema.

Dummy table 'roles' is added as keyspaces are being currently
registered by walking through their tables. Full table schemas
will be added in subsequent commits.

Change can be observed via cqlsh:

cassandra@cqlsh> describe keyspaces;

system_auth_v2  system_schema       system         system_distributed_everywhere
system_auth     system_distributed  system_traces

cassandra@cqlsh> describe keyspace system_auth_v2;

CREATE KEYSPACE system_auth_v2 WITH replication = {'class': 'LocalStrategy'}  AND durable_writes = true;

CREATE TABLE system_auth_v2.roles (
    role text PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = 'comment'
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 604800
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
2024-03-01 10:40:29 +01:00
Nadav Har'El
b0233c0833 Merge 'interval: rename nonwrapping_interval to interval' from Avi Kivity
Our interval template started life as `range`, and was supported wrapping to follow Cassandra's convention of wrapping around the maximum token.

We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility.

Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping.

We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias.

Closes scylladb/scylladb#17455

* github.com:scylladb/scylladb:
  interval: rename nonwrapping_interval to interval
  interval: rename interval_test to wrapping_interval_test
2024-02-22 14:03:43 +02:00
Kefu Chai
9ee728dab9 scylla-gdb: use raw string when '\' is not used in an escape sequence
when '\' does not start an escape sequence, Python complains at seeing
it. but it continues anyway by considering '\' as a separate char.
but the warning message is still annoying:

```
scylla-gdb.py: 2417: SyntaxWarning: invalid escape sequence '\-'
  branches = (r" |-- ", " \-- ")
```

when sourcing this script.

so, let's mark these strings as raw strings.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17466
2024-02-22 09:03:26 +02:00
Avi Kivity
51df8b9173 interval: rename nonwrapping_interval to interval
Our interval template started life as `range`, and was supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.

We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.

Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.

We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval_designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
2024-02-21 19:43:17 +02:00
Michał Chojnowski
5a3e4a1cc0 utils: managed_bytes: optimize memory usage for small buffers
managed_bytes is implemented as chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.

This is regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.

To correct that, this patch adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one contiguous
fragment -- which only stores the necessary minimum of metadata.
(That is: a pointer to the parent, to facilitate moving the storage during
memory defragmentation).
2024-02-09 20:56:20 +01:00
Lakshmi Narayanan Sreethar
76f0d5e35b reader_permit: store schema_ptr instead of raw schema pointer
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.

Fixes #16180

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#16658
2024-01-11 08:37:56 +02:00
Benny Halevy
cdd5605d81 gms: endpoint_state: change application_state_map to std::unordered_map
State changes are processed as a batch and
there is no reason to maintain them as an ordered map.
Instead, use a std::unordered_map that is more efficient.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Raphael S. Carvalho
15de1cdcbc replica: Introduce concept of storage group
Storage group is the storage of tablets. This new concept is helpful
for tablet splitting, where the storage of tablet will be split
in multiple compaction groups, where each can be compacted
independently.

The reason for not going with arena concept is that it added
complexity, and it felt much more elegant to keep compaction
group unchanged which at the end of the day abstracts the concept
of a set of sstables that can be compacted and operated
independently.

When splitting, the storage group for a tablet may therefore own
multiple compaction groups, left, right, and main, where main
keeps the data that needs splitting. When splitting completes,
only left and right compaction groups will be populated.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-12-17 11:40:09 -03:00
Avi Kivity
2b8392b8b8 Merge 'database, reader_concurrency_semaphore: deduplicate reader_concurrency_semaphore metrics ' from Botond Dénes
Reduce code duplication by defining each metric just once, instead of three times, by having the semaphore register metrics by itself. This also makes the lifecycle of metrics contained in that of the semaphore. This is important on enterprise where semaphores are added and removed, together with service levels.
We don't want all semaphores to export metrics, so a new parameter is introduced and all call-sites make a call whether they opt-in or not.

Fixes: https://github.com/scylladb/scylladb/issues/16402

Closes scylladb/scylladb#16383

* github.com:scylladb/scylladb:
  database, reader_concurrency_sempaphore: deduplicate reader_concurrency_sempaphore metrics
  reader_concurrency_semaphore: add register_metrics constructor parameter
  sstables: name sstables_manager
2023-12-14 18:26:24 +02:00
Avi Kivity
7fce057cda database, reader_concurrency_sempaphore: deduplicate reader_concurrency_sempaphore metrics
reader_concurrency_sempaphore are triplicated: each metrics is registered
for streaming, user, and system classes.

To fix, just move the metrics registration from database to
reader_concurrency_sempaphore, so each reader_concurrency_sempaphore
instantiated will register its metrics (if its creator asked for it).

Adjust the names given to reader_concurrency_sempaphore so we don't
change the labels.

scylla-gdb is adjusted to support the new names.
2023-12-13 09:16:18 -05:00
Yaniv Kaul
0b0a3ee7fc Typos: fix typos in code
Last batch, hopefully, sing codespell, went over the docs and fixed some typos.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#16388
2023-12-13 10:45:21 +02:00
Pavel Emelyanov
0c69a312db Update seastar submodule
* seastar bab1625c...17183ed4 (73):
  > thread_pool: Reference reactor, not point to
  > sstring: inherit publicly from string_view formatter
  > circleci: use conditional steps
  > weak_ptr: include used header
  > build: disable the -Wunused-* warnings for checkheaders
  > resource: move variable into smaller lexical scope
  > resource: use structured binding when appropriate
  > httpd: Added server and client addresses to request structure
  > io_queue: do not dereference moved-away shared pointer
  > treewide: explicitly define ctor and assignment operator
  > memory: use `err` for the error string
  > doc: Add document describing all the math behind IO scheduler
  > io_queue: Add flow-rate based self slowdown backlink
  > io_queue: Make main throttler uncapped
  > io_queue: Add queue-wide metrics
  > io_queue: Introduce "flow monitor"
  > io_queue: Count total number of dispatched and completed requests so far
  > io_queue: Introduce io_group::io_latency_goal()
  > tests: test the vector overload for when_all_succeed
  > core: add a vector overload to when_all_succeed
  > loop: Fix iterator_range_estimate_vector_capacity for random iters
  > loop: Add test for iterator_range_estimate_vector_capacity
  > core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present
  > memory: s/throw()/noexcept/
  > build: enable -Wdeprecated compiler option
  > reactor: mark kernel_completion's dtor protected
  > tests: always wait for promise
  > http, json, net: define-generated copy ctor for polymorphic types
  > treewide: do not define constexpr static out-of-line
  > reactor: do not define dtor of kernel_completion
  > http/exception: stop using dynamic exception specification
  > metrics: replace vector with deque
  > metrics: change metadata vector to deque
  > utils/backtrace.hh: make simple_backtrace formattable
  > reactor: Unfriend disk_config_params
  > reactor: Move add_to_flush_poller() to internal namespace
  > reactor: Unfriend a bunch of sched group template calls
  > rpc_test: Test rpc send glitches
  > net: Implement batch flush support for existing sockets
  > iostream: Configure batch flushes if sink can do it
  > net: Added remote address accessors
  > circleci: update the image to CircleCI "standard" image
  > build: do not add header check target if no headers to check
  > build: pass target name to seastar_check_self_contained
  > build: detect glibc features using CMake
  > build: extract bits checking libc into CheckLibc.cmake
  > http/exception: add formatter for httpd::base_exception
  > http/client: Mark write_body() const
  > http/client: Introduce request::_bytes_written
  > http/client: Mark maybe_wait_for_continue() const
  > http/client: Mark send_request_head() const
  > http/client: Detach setup_request()
  > http/api_docs: copy in api_docs's copy constructor
  > script: do not inherit from object
  > scripts: addr2line: change StdinBacktraceIterator to a function
  > scripts: addr2line: use yield instead defining a class
  > tests: skip tests that require backtrace if execinfo.h is not found
  > backtrace: check for existence of execinfo.h
  > core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used
  > core: add sleep_abortable instantiation for manual_clock
  > tls: Return EPIPE exception when writing to shutdown socket
  > http/client: Don't cache connection if server advertises it
  > http/client: Mark connection as "keep in cache"
  > core: fix strerror_r usage from glibc extension
  > reactor: access sigevent.sigev_notify_thread_id with a macro
  > posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np
  > reactor: replace __mode_t with mode_t
  > reactor: change sys/poll.h to posix poll.h
  > rpc: Add unit test for per-domain metrics
  > rpc: Report client connections metrics
  > rpc: Count dead client stats
  > rpc: Add seastar::rpc::metrics
  > rpc: Make public queues length getters

io-scheduler fixes
refs: #15312
refs: #11805

http client fixes
refs: #13736
refs: #15509

rpc fixes
refs: #15462

Closes scylladb/scylladb#15774
2023-10-19 20:52:37 +03:00
Benny Halevy
d00e49a1bb gossiper: keep and serve shared endpoint_state_ptr in map
This commit changes the interface to
using endpoint_state_ptr = lw_shared_ptr<const endpoint_state>
so that users can get a snapshot of the endpoint_state
that they must not modify in-place anyhow.
While internally, gossiper still has the legacy helpers
to manage the endpoint_state.

Fixes scylladb/scylladb#14799

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 09:34:36 +03:00
Pavel Emelyanov
5c95b1cb7f scylla-gdb: Remove _cost_capacity from fair-group debug
This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb

(see also ae6fdf1599)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15203
2023-08-29 16:13:50 +03:00