Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Currently, when TWCS reshape finds a bucket containing more than 32
files, it will blindly resize that bucket to 32.
That's very bad because it doesn't take into consideration that
compaction efficiency depends on relative sizes of files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.
To solve this problem, STCS reshape logic will now be reused in
each time bucket. So only similar-sized files are compacted together
and the time bucket will be considered reshaped once its size tiers
are properly compacted, according to the reshape mode.
Fixes#9938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
If seal_snapshot fails we currently do not signal
the manifest_write semaphore and shards waiting for
it will be blocked forever.
Also, call manifest_write.wait in a `finally` clause
rather than in a `then` clause, even though
`my_work` future never fails at the moment,
to make this future proof.
Fixes#9936
Test: database_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117181733.3706764-1-bhalevy@scylladb.com>
"
Said method should take care of checking that parsing stopped in a valid
state. This patch-set expands the existing but very lacking
implementation by improving the existing error message and adding an
additional check for prematurely exiting the parser in the middle of
parsing an index entry, something we've seen recently in #9446.
To help in debugging such issues, some additional information is added
to the trace messages.
The series also fixes a bug in the error handling code of the partition
index cache.
Refs: #9446
Tests: unit(dev)
"
* 'index-reader-better-verify-end-state/v2.1' of https://github.com/denesb/scylla:
sstables/index_reader: process_state(): add additional information to trace logging
sstables/index_reader: verify_end_state(): add check for premature EOS
sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception
sstables/index_reader: add const sstable& to index_consume_entry_context
sstables/index_reader: remove unused members from index_consume_entry_context
bytes_on_disk is intended to reflect the bytes allocated for the
sstable files on disk.
Accumulating the files logical size, as done today, causes a
discrepancy between information retrieved over the
storage_service/sstables_info api, like nodetool status or nodetool
cfstats and command line tools like df -H /var/lib/scylla.
Fixes#9941
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220118070208.3963076-1-bhalevy@scylladb.com>
Errors during parsing are usually reported via malformed sstable
exception to signify their gravity of potentially being caused by
corrupt sstables. This patch converts the exception thrown in
`index_consume_entry_context::verify_end_state()`.
While at it the error message is improved as well. It currently suggests
that parsing was ended prematurely because data ran out, while in fact
the condition under which this error is thrown is the opposite: parsing
ended but there is unconsumed data left. The current state is also added
to the error message.
"
Currently commands are regular switches. This has several disadvantages:
* CLI programs nowadays use the command-based UX, so our tools are
awkward to use to anybody used to that;
* They don't stand out from regular options;
* They are parsed at the same time as regular options, so all options
have to be dumped to a single description;
This series migrates the tools to the command based CLI. E.g. instead of
scylla sstable --validate --merge /path/to/sst1 /path/to/sst2
we now have:
scylla sstable validate --merge /path/to/sst1 /path/to/sst2
Which makes it much clearer that "validate" is the command and "merge"
is an option. And it just looks better.
Internally the command is parsed and popped from argv manually just as
we do with the tool name in scylla main(). This means we know the
command before even building the
boost::program_options::options_description representation and thus
before creating the seastar::app_template instance. Consequently we can
tailor the options registered and the --help content (the application
description) to the command run.
So now "scylla sstable --help" prints only a general description of the
tool and a list of the supported operations. Invoking "scylla sstable
{operation} --help" will print a detailed description of the operation
along with its specific options. This greatly improves the documentation
and the usability of the tool.
"
Refs #9882
* 'tools-command-oriented-cli/v1' of https://github.com/denesb/scylla:
tools/scylla-sstable: update general description
tools/scylla-sstable: proper operation-specific --help
tools/scylla-sstable: proper operation-specific options
tools/scylla-sstable: s/dump/dump-data/
tools/utils: remove now unused get_selected_operation() overload
tools: take operations (commands) as positional arguments
tools/utils: add positional-argument based overload of get_selected_operation()
tools: remove obsolete FIXMEs
If the compared mutations have binary keys, `colordiff` will declare the
file as binary and will refuse to compare them, beyond a very unhelpful
"binary files differ" summary. Add "-a" to the command line to force a
treating all files as text.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220117131347.106585-1-bdenes@scylladb.com>
Thanks to an older boost, there is an ambiguity in
name resolution between
boost::placeholders and std::placeholders.
Message-Id: <20220117094837.653145-2-kostja@scylladb.com>
The create_keyspace_if_not_exists_impl() gets global instance of
storage proxy, but its only caller (controller) already have it
and can pass via argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220117104226.22833-1-xemul@scylladb.com>
Currently, we first delete all existing token mappings
for the endpoint from _token_to_endpoint_map and then
we add all updated token mappings for it and set should_sort_tokens
if the token is newly inserted, but since we removed all
existing mappings for the endpoint unconditionally, we
will sort the tokens even if the token existed and
its ownership did not change.
This is worthwhile since there are scenarios where
none of the token ownership change. Searching and
erasing tokens from the tokens unordered_set runs
at constant time on average so doing it for n tokens
is O(n), while sorting the tokens is O(n*log(n)).
Test: unit(dev)
DTest: replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap(dev,debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117101242.122512-2-bhalevy@scylladb.com>
It's currently used only by unit tests
and it is dangerous to use on a populated token_metadata
as update_normal_tokens assumes that the set of tokens
owned by the given endpoint is compelte, i.e. previous
tokens owned by the endpoint are no longer owned by it,
but the single-token update_normal_token interface
seems commulative (and has no documentation whatsoever).
It is better to remove this interface and calculate a
complete map of endpoint->tokens from the tests.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117101242.122512-1-bhalevy@scylladb.com>
Add the missing partition-key validation in INSERT JSON statements.
Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).
Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.
The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.
Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.
This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.
Reported by @FortTell
Fixes#9853.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
build_progress_virtual_reader is a virtual reader that trims off
the last clustering key column from an underlying base table. It
is here converted to flat_mutation_reader_v2.
Because range_tombstone_change uses position_in_partition, not
clustering_key_prefix, we need a new adjust_ckey() overload.
Note the transformation is likely incorrect. When trimming the
last clustering key column, an inclusive bound changes should
change to exclusive. However, the original code did not do this,
so we don't fix it here. It's immaterial anyway since the base
table doesn't include range tombstones.
Test: unit (dev) (which has a test for this reader)
Closes#9913
Some of the CQL tests translated from Cassandra into the test/cql-pytest
framework used the flush() function to force a flush to sstables -
presumably because this exercised yet another code path, or because it
reproduced bugs that Cassandra once had that were only visible when
reading from sstables - not from memtables.
Until now, this flush() function was stubbed and did nothing.
But we do have in test/cql-pytest a flush() implementation in
nodetool.py - which uses the REST API if possible and if not (e.g., when
running against Cassandra) uses the external "nodetool" command.
So in this patch flush() starts to use nodetool.flush() instead of
doing nothing.
The tests continue to pass as before after this patch, and there is no
noticable slowdown (the flush does take time, but the few times it's
done is negligible in these tests).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220117073112.83994-1-nyh@scylladb.com>
"
As a prerequisite the mutation fragment stream validator is converted to
v2 as well (but it still supports v1). We get one step closer to
eliminate conversions altogether from compaction.cc.
Tests: unit(dev)
"
* 'scrub-v2/v1' of https://github.com/denesb/scylla:
mutation_writer: remove v1 version segregate_by_partition()
compaction/compaction: remove v1 version of validate and scrub reader factory methods
tools/scylla-sstable: migrate to v2
test/boost/sstable_compaction_test: migrate validation tests to v2
test/boost/sstable_compaction_test: migrate scrub tests to v2
test/lib/simple_schema: add v2 of make_row() and make_static_row()
compaction: use v2 version of mutation_writer::segregate_by_partition()
mutation_writer: add v2 version of segregate_by_partition()
compaction: migrate scrub and validate to v2
mutation_fragment_stream_validator: migrate validator to v2
"
Repair obtains a permit for each repair-meta instance it creates. This
permit is supposed to track all resources consumed by that repair as
well as ensure concurrency limit is respected. However when the
non-local reader path is used (shard config of master != shard config of
follower), a second permit will be obtained -- for the shard reader of
the multishard reader. This creates a situation where the repair-meta's
permit can block the shard permit, creating a deadlock situation.
This patch solves this by dropping the count resource on the
repair-meta's permit when a non-local reader path is executed -- that is
a multishard reader is created.
Fixes: #9751
"
* 'repair-double-permit-block/v4' of https://github.com/denesb/scylla:
repair: make sure there is one permit per repair with count res
reader_permit: add release_base_resource()
Now that issues #7586 and #9487 were fixed, reverse queries - even in
long partitions - work well, we can drop the claim in
alternator/docs/compatibility.md that reverse queries are buggy for
large partitions.
We can also remove the "xfail" mark from the tes that checks this
feature, as it now passes.
Refs #7586
Refs #9487
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#9831
The series to move storage_proxy verbs to the IDL added not features to
the IDL compiler, but was lacking a documentation. This patch documents
the features.
dirty_memory_manager monitors memory and triggers memtable flushing if
there is too much pressure. If bad_alloc happens during the flush, it
may break the loop and flushes won't be triggered automatically, leading
to blocked writes as memory won't be automatically released.
The solution is to add exception handling to the loop, so that the inner
part always returns a non-exceptional future (meaning the loop will
break only on node shutdown).
try/catch is used around on_internal_error instead of
on_internal_error_noexcept, as the latter doesn't have a version that
accepts an exception pointer. To get the exception message from
std::exception_ptr a rethrow is needed anyway, so this was a simpler
approach.
Fixes: #4174
Message-Id: <20220114082452.89189-1-mikolaj.sieluzycki@scylladb.com>
"
SSTables created by repair will potentially not conform to the
compaction strategy
layout goal. If node shuts down before off-strategy has a chance to
reshape those files, node will be forced to reshape them on restart.
That
causes unexpected downtime. Turns out we can skip reshape of those files
on boot, and allow them to be reshaped after node becomes online, as if
the node never went down. Those files will go through same procedure as
files created by repair-based ops. They will be placed in maintenance
set,
and be reshaped iteratively until ready for integration into the main
set.
"
Fixes#9895.
tests: UNIT(dev).
* 'postpone_reshape_on_repair_originated_files' of https://github.com/raphaelsc/scylla:
distributed_loader: postpone reshape of repair-originated sstables
sstables: Introduce filter for sstable_directory::reshape
table: add fast path when offstrategy is not needed
sstables: add constant for repair origin
"
The main motivation for the set is to expell query_processor.proxy().local_db()
calls from cql3/statements code. The only places that still use q.p. like this
are those calling client_state::has_..._access() checkers. Those checks can
go with the data_dictionary which is already available on the query processor.
This is the continuation of the 9643f84d ("Eliminate direct storage_proxy usage
from cql3 statements") patch set.
As a side effect the validation/ code, that's called from has_..._access checks,
is also converted to use data_dictionary.
tests: unit(dev, debug)
"
* 'br-cql3-dictionary' of https://github.com/xemul/scylla:
validation: Make validate_column_family use data_dictionary::database
client_state: Make has_access use data_dictionary::database
client_state: Make has_schema_access use data_dictionary::database
client_state: Make has_column_family_access use data_dictionary::database
client_state: Make has_keyspace_access use data_dictionary::database
Off-strategy compaction works by iteratively reshaping the maintenance set
until it's ready for integration into the main set. As repair-based ops
produces disjoint sstables only, off-strategy compaction can complete
the reshape in a single round.
But if reshape ends up requiring more than one round, space requirement
for off-strategy to succeed can be high. That's because we're only
deleting input SSTables on completion. SSTables from maintenance set
can be only deleted on completion as we can only merge maintenance
set into main one once we're done reshaping[1]. But a SSTable that was
created by a reshape and later used as a input in another reshape can
be deleted immediately as its existence is not needed anywhere.
[1] We don't update maintenance set after each reshape round, because that
would mess with its disjointness. We also don't iteratively merge
maintenance set into main set, as the data produced by a single round
is potentially not ready for integration into main set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220111202950.111456-1-raphaelsc@scylladb.com>
"
The series moves raft verbs to the IDL and also fix some verbs to be one
way like they were intended to be.
"
* 'gleb/raft-idl' of github.com:scylladb/scylla-dev:
raft service: make one way raft messages truly one way
raft: move raft verbs to the IDL
raft: split idl to rpc and storage
idl-compiler: always produce const variant of serializers
raft: simplify raft idl definitions
And instantly convert the validate_keyspace() as it's not called
from anywhere but the validate_column_family().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This db argument is only needed to be pushed into
cdc::is_log_for_some_table() helper. All callers already have
the d._d.::database at hands and convert it into .real_database()
call-time, so this patch effectively generalizes those calls to
the .real_database().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now called with d._d.::database converted to .real_database()
right in the argument passing, so this change can be treated as
the generalization of that .real_database() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Straightforward replacement. Internals of the has_column_family_access()
temporarily get .real_database(), but it will be changed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Straightforward replacement. Internals of the has_keyspace_access()
temporarily get .real_database(), but it will be changed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just a facade using converters behind the scenes. The actual segregator
is not worth migrating to v2 while mutation and the flushing readers
don't have a v2 versions. Still, migrating all users to a v2 API allows
the conversion to happen at a single point where more work is necessary,
instead of scattered around all the users.
We leave the v1 version in place to aid incremental migration to the v2
one.
Add support for validating v2 streams while still keeping the v1
support. Since the underlying logic is largely independent of the format
version, this is simple to do and will allow incremental migration of
users.
As requested from field engineering, add a way to disable
the optimized TWCS query algorithm (use regular query path)
just in case a bug or a performance regression shows up in
production.
To disable the optimized query path, add
'enable_optimized_twcs_queries': 'false' to compaction strategy options,
e.g.
```
alter table ks.t with compaction =
{'class': 'TimeWindowCompactionStrategy',
'enable_optimized_twcs_queries': 'false'};
```
Setting the `enable_optimized_twcs_queries` key to anything other than
`'false'` (note: a boolean `false` expands to a string `'false'`) or
skipping it (re)enables the optimized query path.
Note: the flag can be set in a cluster in the middle of upgrade. Nodes
which do not understand it simply ignore it, but they do store it in
their schema tables (they store the entire `compaction` map). After
these nodes are upgraded, they will understand the flag and act
accordingly.
Note: in the situation above, some nodes may use the optimized path and
some may use the regular path. This may happen also in a fully upgraded
cluster when compaction options are changed concurrently to reads;
there is a short period of time where the schema change propagates and
some nodes got the flag but some didn't.
These should not be a problem since the optimization does not change the
returned read results (unless there is a bug).
Generally, the flag is not intended for normal use, but for field
engineers to disable it in case of a serious problem.
Ref #6418.
Closes#9900