* seastar 71036ebcc0...5b95d1d798 (3):
> rpc stream: do not abort stream queue if stream connection was closed without error
> resource: fallback to sysconf when failed to detect memory size from hwloc
> Merge 'scheduling_group: improve scheduling group creation exception safety' from Michael Litvak
scylla-gdb.py adjusted for scheduling_group_specific data structure
changes in Seastar. As part of that, a gratuitous dereference of
std::unique_ptr, which fails for std::unique_ptr<void*, ...>, was
removed.
In the spirit of using standard-library types, instead of boost ones
where possible.
Although a disk type, it is serialized/deserialized with custom code, so
the change shouldn't cause any changes in the disk representation.
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.
Each group is statically assigned a pool of memory on startup and
dynamically distributes this memory between the semaphores, relative to
the number of shares of the corresponding scheduling group.
The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as make
memory allocation more fair between service levels due to the adjusted
number of shares.
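A minimal sketch of share-proportional distribution, for illustration
only. The function name, the service level names and the integer
rounding are assumptions, not the actual implementation:

```python
def distribute_memory(pool_bytes, shares_by_level):
    """Split pool_bytes between service levels in proportion to their shares."""
    total_shares = sum(shares_by_level.values())
    if total_shares == 0:
        return {level: 0 for level in shares_by_level}
    # Integer division may leave a few bytes of the pool unassigned.
    return {
        level: pool_bytes * shares // total_shares
        for level, shares in shares_by_level.items()
    }


# Hypothetical service levels and share counts.
allocation = distribute_memory(1000, {"interactive": 800, "batch": 200})
```

When the shares of a scheduling group change, rerunning the split
yields the new per-semaphore limits.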
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive; that is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.
Code using _tasks has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
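To illustrate why removal becomes O(1), here is a small Python sketch of
an intrusive doubly-linked list. The class names are hypothetical and
this is not the C++ code from the patch; it only shows the idea that
each element carries its own links, so unlinking needs no search:

```python
class Task:
    """A task that embeds its own list links (the 'intrusive' part)."""

    def __init__(self, name):
        self.name = name
        self.prev = None
        self.next = None


class IntrusiveList:
    def __init__(self):
        # A sentinel node avoids special cases for an empty list.
        self.head = Task("<sentinel>")
        self.head.prev = self.head.next = self.head

    def push_back(self, task):
        # No node allocation: the links live inside the task itself.
        last = self.head.prev
        task.prev, task.next = last, self.head
        last.next = task
        self.head.prev = task

    @staticmethod
    def unlink(task):
        # O(1): the task already knows its neighbours.
        task.prev.next = task.next
        task.next.prev = task.prev
        task.prev = task.next = None

    def __iter__(self):
        node = self.head.next
        while node is not self.head:
            yield node
            node = node.next


tasks = IntrusiveList()
a, b, c = Task("a"), Task("b"), Task("c")
for t in (a, b, c):
    tasks.push_back(t)
IntrusiveList.unlink(b)
names = [t.name for t in tasks]
```

Iteration yields the elements themselves rather than pointers to them,
which mirrors the value_type change described above.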
The @classmethod/@property combination was deprecated in Python 3.11
and removed[1] in Python 3.13. It's used in scylla-gdb.py, breaking it
with Python 3.13.
To fix, just make all users (size_t and _vptr_type) top-level
functions. The definitions are all identical and don't need to be
in class scope.
[1] https://docs.python.org/3.13/library/functions.html#classmethod

Closes scylladb/scylladb#21349
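A sketch of the shape of this fix, runnable outside gdb. lookup_type()
here is a stand-in for gdb.lookup_type(), and the caching decorator is
an assumption added for illustration; the source only says the helpers
became top-level functions:

```python
import functools


def lookup_type(name):
    """Stand-in for gdb.lookup_type(); returns a placeholder string."""
    lookup_type.calls += 1
    return f"<type {name}>"


lookup_type.calls = 0


@functools.lru_cache(maxsize=None)
def size_t():
    # One shared top-level helper instead of a per-class
    # @classmethod/@property pair.
    return lookup_type("size_t")


class intrusive_slist:
    def describe(self):
        return size_t()


first = intrusive_slist().describe()
second = intrusive_slist().describe()
```

Both calls return the same value, and the stand-in lookup runs once.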
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.
The second issue fixed by this patch is that the concurrency of view updates is
currently unlimited. Because of that, view updates may take up so much memory that
we may run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads; all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semaphore
and is currently used only for user writes - it's used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply) and we shouldn't have view updates in system
writes or during compaction.
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
Any release < 6.0 or < 2023.1 is EOL and need not be supported by
scylla-gdb.py anymore. Remove compatibility code for these releases.
Closes scylladb/scylladb#20918
instead of evaluating the constants in-class, accessing them via
a cached class property.
it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available with
a file being debugged. so gdb fails to load the init script:
```
Traceback (most recent call last):
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
class intrusive_slist:
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
size_t = gdb.lookup_type('size_t')
^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```
so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.
the reason is that the script accesses the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.
in this change, we transform all these class variables to cached
property, so that they
* are evaluated on-demand
* are evaluated only once at most
this addresses the pain at the expense of verbosity.
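a minimal sketch of the lazy-evaluation idea, runnable outside gdb.
`lookup_type()` is a stand-in for `gdb.lookup_type()`, and
`functools.cached_property` (per instance) is used here as an
approximation of the cached class property in the script:

```python
import functools

# Hypothetical flag standing in for "debug symbols are loaded".
state = {"symbols_loaded": False}


def lookup_type(name):
    """Stand-in for gdb.lookup_type(): fails until symbols are 'loaded'."""
    if not state["symbols_loaded"]:
        raise RuntimeError(f"No type named {name}.")
    return f"<type {name}>"


class intrusive_slist:
    @functools.cached_property
    def size_t(self):
        # Evaluated on first access, not when the class body is read.
        return lookup_type("size_t")


# Defining the class did not call lookup_type(), so "sourcing" succeeded.
lst = intrusive_slist()
state["symbols_loaded"] = True  # analogous to `file path/to/scylla`
result = lst.size_t
```

a plain class attribute `size_t = lookup_type('size_t')` would have
raised while the class body was being executed.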
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we switched from `circular_buffer` to `chunked_fifo` to represent
`io_sink::_pending_io` in the latest seastar now. to be prepared for
this change, let's
* add `chunked_fifo` class in `scylla-gdb.py`.
* use `circular_buffer` as a fallback of `chunked_fifo`. instead of
doing this the other way around, we try to send the message that
the latest seastar uses `chunked_fifo`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#20280
Adds a convenience function for inspecting the coroutine frame of a given
seastar task.
Short example of extracting a coroutine argument:
```
(gdb) p *$coro_frame(seastar::local_engine->_current_task)
$1 = {
__resume_fn = 0x2485f80 <sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&)>,
...
PointerType_7 = 0x601008e67880,
...
__coro_index = 0 '\000'
...
(gdb) p $downcast_vptr($->PointerType_7)
$2 = (schema *) 0x601008e67880
```
Closes scylladb/scylladb#19479
rwlock was added to protect iterations against concurrent updates to the map.
the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).
the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).
to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.
Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.
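A simplified, synchronous sketch of the iteration pattern described
above. The class and function names are hypothetical, and the real gate
also waits for in-flight holders on close, which is omitted here:

```python
class StorageGroup:
    def __init__(self, name):
        self.name = name
        self.gate_closed = False  # set by cleanup when the tablet migrates out


def flush_all(groups_by_id):
    """Visit every group that is still open, using a snapshot of the map."""
    flushed = []
    # Copying the values means concurrent inserts or erases on the map
    # cannot invalidate this iteration.
    for group in list(groups_by_id.values()):
        if group.gate_closed:
            continue  # group is being migrated out; skip it
        flushed.append(group.name)
    return flushed


groups = {0: StorageGroup("g0"), 1: StorageGroup("g1")}
groups[1].gate_closed = True
flushed = flush_all(groups)
```

Copying the map is cheap because it is expected to hold at most about
100 elements.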
Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.
Fixes #18821.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The equivalent of small-objects, but for large objects (spans).
Allows listing objects of a large-class, and therefore investigating a
run-away class, by attempting to identify the owners of the objects in
it.
Written to investigate #16493

Closes scylladb/scylladb#16711
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader"
To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains since we still use the original
for compatibility, sometimes.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents were folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closes scylladb/scylladb#19356
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769
For the purpose of scylla-gdb.py command "scylla
active-sstables". Before the patch, readers were located by scanning
the heap for live objects with vtable pointers corresponding to
readers. It was observed that the test scylla_gdb/test_misc.py::test_active_sstables started failing like this:
gdb.error: Error occurred in Python: Cannot access memory at address 0x300000000000000
This could be explained by there being a live object on the heap which
used to be a reader but now is a different object, and the _sst field
contains some other data which is not a pointer.
To fix, track readers explicitly in a linked list so that the gdb
script can reliably walk readers.
Fixes #18618.
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
because of https://bugzilla.redhat.com/show_bug.cgi?id=2278689,
the rebuilt abseil package provided by fedora has different settings
than the ones if the tree is built with the sanitizer enabled. this
inconsistency leads to a crash.
to address this problem, we have to reinstate the abseil submodule, so
we can build it with the same compiler options with which we build the
tree.
in this change
* Revert "build: drop abseil submodule, replace with distribution abseil"
* update CMake building system with abseil header include settings
* bump up the abseil submodule to the latest LTS branch of abseil:
lts_2024_01_16
* update scylla-gdb.py to adapt to the new structure of
flat_hash_map
This reverts commit 8635d24424.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18511
in seastar's b28342fa5a301de3facf5e83dc691524a6b20604, we switched
* `io_queue::_streams` from
`boost::container::small_vector<fair_queue, 2>` to
`boost::container::static_vector<fair_queue, 2>`
* `io_queue::_fgs` from
`std::vector<std::unique_ptr<fair_group>>` to
`boost::container::static_vector<fair_group, 2>`
so we need to update the gdb script accordingly to reflect this
change, and to avoid the nested try-except blocks, we switch to
a `while` statement to simplify the code structure.
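a sketch of the loop shape, not the script's actual code. the helper
name and the dictionary-based "layouts" are hypothetical; the point is
that each candidate accessor is tried in turn instead of nesting one
try-except per historical layout:

```python
import operator


def first_working(candidates, value):
    """Try each accessor in order; return the first result that does not raise."""
    remaining = list(candidates)
    while remaining:
        accessor = remaining.pop(0)
        try:
            return accessor(value)
        except (KeyError, TypeError):
            continue  # this layout did not match, try the next one
    raise LookupError("no accessor matched the value's layout")


# Newest layout first, older layouts as fallbacks.
new_layout = operator.itemgetter("static_vector")
old_layout = operator.itemgetter("small_vector")
streams = first_working([new_layout, old_layout], {"small_vector": [1, 2]})
```

adding support for another layout then means adding one more entry to
the candidate list.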
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18165
Seastar removed `task_queue::_current` in
258b11220d343d8c7ae1a2ab056fb5e202723cc8 . let's adapt scylla-gdb.py
accordingly. despite that `current_scheduling_group_ptr()` is an internal
API, it's been around for a while, and relatively stable. so let's use
it instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17720
this changeset addresses some warnings raised by flake8 in hope to improve the readability of this script in general.
Closes scylladb/scylladb#17668
* github.com:scylladb/scylladb:
scylla-gdb: s/if not foo is None/if foo is not None/
scylla-gdb.py: add space after keyword
scylla-gdb.py: remove extraneous spaces
scylla-gdb.py: use 2 empty lines between top-level funcs/classes
scylla-gdb.py: replace <tab> with 4 spaces
scylla-gdb: fix the indent
it'd be more pythonic to just put an expression after `assert`,
instead of wrapping it in a pair of parentheses. and there is no need
to add `;` after `break`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Make a specialized sstable_set for tablets
via tablet_storage_group_manager::make_sstable_set.
This sstable set takes a snapshot of the storage_groups
(compound) sstable_sets and maps the selected tokens
directly into the tablet compound_sstable_set.
This sstable_set provides much more efficient access
to the table's sstable sets as it takes advantage of the disjointness
of sstable sets between tablets/storage_groups, and making it is cheaper
than rebuilding a complete partitioned_sstable_set from all sstables in the table.
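A rough sketch of how disjointness helps selection, under assumed
names and data. Tablet i is assumed to own the tokens up to and
including last_tokens[i], and tokens are assumed to be within range;
this is not the tablet_sstable_set implementation:

```python
import bisect

# Hypothetical layout: tablet i owns tokens up to and including last_tokens[i].
last_tokens = [99, 199, 299]
# One (compound) sstable set per tablet, taken as a snapshot.
sstable_sets = [["sst-a"], ["sst-b", "sst-c"], ["sst-d"]]


def select(token):
    """Return the sstables of the single tablet that owns the token."""
    tablet_id = bisect.bisect_left(last_tokens, token)
    return sstable_sets[tablet_id]


selected = select(150)
```

Because tablets do not overlap, a lookup only has to consult one
tablet's set rather than a set built from every sstable in the table.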
Fixes #16876
Cassandra-stress setup:
```
$ sudo cpupower frequency-set -g userspace
$ build/release/scylla (developer-mode options) --smp=16 --memory=8G --experimental-features=consistent-topology-changes --experimental-features=tablets
cqlsh> CREATE KEYSPACE keyspace1 WITH replication={'class':'NetworkTopologyStrategy', 'replication_factor':1} AND tablets={'initial':2048};
$ ./tools/java/tools/bin/cassandra-stress write no-warmup n=10000000 -pop 'seq=1...10000000' -rate threads=128
$ scylla-api-client system drop_sstable_caches POST
$ ./tools/java/tools/bin/cassandra-stress read no-warmup duration=60s -pop 'dist=uniform(1..10000000)' -rate threads=128
$ scylla-api-client system drop_sstable_caches POST
$ ./tools/java/tools/bin/cassandra-stress mixed no-warmup duration=60s -pop 'dist=uniform(1..10000000)' -rate threads=128
```
Baseline (0a7854ea4d) vs. fix (0c2c00f01b)
Throughput (op/s):
workload | baseline | fix
---------|----------|----------
write | 76,806 | 100,787
read | 34,330 | 106,099
mixed | 32,195 | 79,246
Closes scylladb/scylladb#17149
* github.com:scylladb/scylladb:
table: tablet_storage_group_manager: make tablet_sstable_set
storage_group_manager: add make_sstable_set
tablet_storage_group_manager: handle_tablet_split_completion: pre-calc new_tablet_count
table: tablet_storage_group_manager: storage_group_of: do not validate in release build mode
table: move compaction_group_list and storage_group_vector to storage_group_manager
compaction_group::table_state: get_group_id: become self-sufficient
compaction_group, table: make_compound_sstable_set: declare as const
tablet_storage_group_manager: precalculate my_host_id and _tablet_map
table: coroutinize update_effective_replication_map
New keyspace is added similarly as system_schema keyspace,
it's being registered via system_keyspace::make which calls
all_tables to build its schema.
Dummy table 'roles' is added as keyspaces are being currently
registered by walking through their tables. Full table schemas
will be added in subsequent commits.
Change can be observed via cqlsh:
cassandra@cqlsh> describe keyspaces;
system_auth_v2 system_schema system system_distributed_everywhere
system_auth system_distributed system_traces
cassandra@cqlsh> describe keyspace system_auth_v2;
CREATE KEYSPACE system_auth_v2 WITH replication = {'class': 'LocalStrategy'} AND durable_writes = true;
CREATE TABLE system_auth_v2.roles (
role text PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = 'comment'
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 604800
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Our interval template started life as `range`, and supported wrapping to follow Cassandra's convention of wrapping around the maximum token.
We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility.
Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping.
We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval designation. We just rename nonwrapping_interval to `interval` and remove the type alias.
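To illustrate the difference between the two kinds of interval, here is a small sketch. The function names are hypothetical and both bounds are assumed inclusive for simplicity; the real template also supports exclusive and unbounded ends:

```python
def contains_nonwrapping(start, end, x):
    """Non-wrapping interval: requires start <= end."""
    return start <= x <= end


def contains_wrapping(start, end, x):
    """Wrapping interval: start > end means 'wrap around the maximum'."""
    if start <= end:
        return start <= x <= end
    # Wrapped: covers [start, max] and [min, end].
    return x >= start or x <= end


inside_tail = contains_wrapping(90, 10, 95)
inside_head = contains_wrapping(90, 10, 5)
outside = contains_wrapping(90, 10, 50)
```

A non-wrapping interval never needs the second branch, which is why it is the simpler default.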
Closes scylladb/scylladb#17455
* github.com:scylladb/scylladb:
interval: rename nonwrapping_interval to interval
interval: rename interval_test to wrapping_interval_test
when '\' does not start an escape sequence, Python complains at seeing
it. but it continues anyway by considering '\' as a separate char.
but the warning message is still annoying:
```
scylla-gdb.py: 2417: SyntaxWarning: invalid escape sequence '\-'
branches = (r" |-- ", " \-- ")
```
when sourcing this script.
so, let's mark these strings as raw strings.
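a short illustration of why the raw-string prefix is enough. the
`branches` tuple mirrors the line quoted in the warning above:

```python
plain = " \\-- "  # backslash escaped explicitly
raw = r" \-- "    # raw string: backslash kept as-is, no warning

# Both spellings produce the same five characters.
same = plain == raw

branches = (r" |-- ", r" \-- ")
```

the strings are unchanged at runtime, only the warning goes away.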
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17466
Our interval template started life as `range`, and supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.
We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.
Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.
We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
managed_bytes is implemented as chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.
This is regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.
To correct that, this patch adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one contiguous
fragment -- which only stores the necessary minimum of metadata.
(That is: a pointer to the parent, to facilitate moving the storage during
memory defragmentation).
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes #16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#16658
State changes are processed as a batch and
there is no reason to maintain them as an ordered map.
Instead, use a std::unordered_map that is more efficient.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Storage group is the storage of tablets. This new concept is helpful
for tablet splitting, where the storage of tablet will be split
in multiple compaction groups, where each can be compacted
independently.
The reason for not going with arena concept is that it added
complexity, and it felt much more elegant to keep compaction
group unchanged which at the end of the day abstracts the concept
of a set of sstables that can be compacted and operated
independently.
When splitting, the storage group for a tablet may therefore own
multiple compaction groups, left, right, and main, where main
keeps the data that needs splitting. When splitting completes,
only left and right compaction groups will be populated.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reduce code duplication by defining each metric just once, instead of three times, by having the semaphore register metrics by itself. This also makes the lifecycle of metrics contained in that of the semaphore. This is important on enterprise where semaphores are added and removed, together with service levels.
We don't want all semaphores to export metrics, so a new parameter is introduced and all call-sites make a call whether they opt-in or not.
Fixes: https://github.com/scylladb/scylladb/issues/16402

Closes scylladb/scylladb#16383
* github.com:scylladb/scylladb:
database, reader_concurrency_sempaphore: deduplicate reader_concurrency_sempaphore metrics
reader_concurrency_semaphore: add register_metrics constructor parameter
sstables: name sstables_manager
reader_concurrency_semaphore metrics are triplicated: each metric is registered
for streaming, user, and system classes.
To fix, just move the metrics registration from database to
reader_concurrency_semaphore, so each reader_concurrency_semaphore
instantiated will register its metrics (if its creator asked for it).
Adjust the names given to reader_concurrency_semaphore so we don't
change the labels.
scylla-gdb is adjusted to support the new names.
* seastar bab1625c...17183ed4 (73):
> thread_pool: Reference reactor, not point to
> sstring: inherit publicly from string_view formatter
> circleci: use conditional steps
> weak_ptr: include used header
> build: disable the -Wunused-* warnings for checkheaders
> resource: move variable into smaller lexical scope
> resource: use structured binding when appropriate
> httpd: Added server and client addresses to request structure
> io_queue: do not dereference moved-away shared pointer
> treewide: explicitly define ctor and assignment operator
> memory: use `err` for the error string
> doc: Add document describing all the math behind IO scheduler
> io_queue: Add flow-rate based self slowdown backlink
> io_queue: Make main throttler uncapped
> io_queue: Add queue-wide metrics
> io_queue: Introduce "flow monitor"
> io_queue: Count total number of dispatched and completed requests so far
> io_queue: Introduce io_group::io_latency_goal()
> tests: test the vector overload for when_all_succeed
> core: add a vector overload to when_all_succeed
> loop: Fix iterator_range_estimate_vector_capacity for random iters
> loop: Add test for iterator_range_estimate_vector_capacity
> core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present
> memory: s/throw()/noexcept/
> build: enable -Wdeprecated compiler option
> reactor: mark kernel_completion's dtor protected
> tests: always wait for promise
> http, json, net: define-generated copy ctor for polymorphic types
> treewide: do not define constexpr static out-of-line
> reactor: do not define dtor of kernel_completion
> http/exception: stop using dynamic exception specification
> metrics: replace vector with deque
> metrics: change metadata vector to deque
> utils/backtrace.hh: make simple_backtrace formattable
> reactor: Unfriend disk_config_params
> reactor: Move add_to_flush_poller() to internal namespace
> reactor: Unfriend a bunch of sched group template calls
> rpc_test: Test rpc send glitches
> net: Implement batch flush support for existing sockets
> iostream: Configure batch flushes if sink can do it
> net: Added remote address accessors
> circleci: update the image to CircleCI "standard" image
> build: do not add header check target if no headers to check
> build: pass target name to seastar_check_self_contained
> build: detect glibc features using CMake
> build: extract bits checking libc into CheckLibc.cmake
> http/exception: add formatter for httpd::base_exception
> http/client: Mark write_body() const
> http/client: Introduce request::_bytes_written
> http/client: Mark maybe_wait_for_continue() const
> http/client: Mark send_request_head() const
> http/client: Detach setup_request()
> http/api_docs: copy in api_docs's copy constructor
> script: do not inherit from object
> scripts: addr2line: change StdinBacktraceIterator to a function
> scripts: addr2line: use yield instead defining a class
> tests: skip tests that require backtrace if execinfo.h is not found
> backtrace: check for existence of execinfo.h
> core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used
> core: add sleep_abortable instantiation for manual_clock
> tls: Return EPIPE exception when writing to shutdown socket
> http/client: Don't cache connection if server advertises it
> http/client: Mark connection as "keep in cache"
> core: fix strerror_r usage from glibc extension
> reactor: access sigevent.sigev_notify_thread_id with a macro
> posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np
> reactor: replace __mode_t with mode_t
> reactor: change sys/poll.h to posix poll.h
> rpc: Add unit test for per-domain metrics
> rpc: Report client connections metrics
> rpc: Count dead client stats
> rpc: Add seastar::rpc::metrics
> rpc: Make public queues length getters
io-scheduler fixes
refs: #15312
refs: #11805
http client fixes
refs: #13736
refs: #15509
rpc fixes
refs: #15462

Closes scylladb/scylladb#15774
This commit changes the interface to
using endpoint_state_ptr = lw_shared_ptr<const endpoint_state>
so that users can get a snapshot of the endpoint_state
that they must not modify in place.
Internally, gossiper still has the legacy helpers
to manage the endpoint_state.
Fixes scylladb/scylladb#14799
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb
(see also ae6fdf1599)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15203