Merged patch series by Piotr Sarna:
This series introduces the concept of "computed" column, which represents
values not provided directly by the user, but computed on the fly -
possibly using other column values. It will be used in the future to
implement map value indexing, collection indexing, etc. Right now the only
use is the token column for secondary indexes - which is a column computed
from the base partition key value.
After this series, another one that depends on it and adds map value
indexing will be pushed.
Tests: unit(dev)
Piotr Sarna (14):
schema: add computed info to column definition
schema: add implementation of computing token column
schema: allow marking columns as computed in schema builder
service: add computed columns feature
view: check for computed columns in view
view: remove unused token_for function
database: add fixing previous secondary index schemas
tests: disable computed columns feature in schema change test
tests: add schema change test regeneration comment
db: add system_schema.computed_columns
docs: init system_schema_keyspace.md with column computations
tests: generate new test case for schema change + computed cols
index: mark token column as 'computed' when creating mv
tests: add checking computed columns in SI
column_computation.hh | 63 ++++++++
db/schema_features.hh | 4 +-
db/schema_tables.hh | 4 +
idl/frozen_schema.idl.hh | 1 +
schema.hh | 40 +++++
schema_builder.hh | 4 +-
schema_mutations.hh | 18 ++-
service/storage_service.hh | 8 +
view_info.hh | 2 -
database.cc | 6 +-
db/schema_tables.cc | 146 ++++++++++++++++--
db/view/view.cc | 46 +++---
index/secondary_index_manager.cc | 2 +-
schema.cc | 58 ++++++-
schema_mutations.cc | 14 +-
service/storage_service.cc | 5 +
tests/schema_change_test.cc | 63 ++++++--
tests/secondary_index_test.cc | 28 ++++
docs/system_schema_keyspace.md | 40 +++++
plus about 200 new test sstable files
The original "test_schema_digest_does_not_change" test case ensures
that schema digests will match for older nodes that do not support
all the features yet (including computed columns).
The additional case uses sstables generated after computed columns
are allowed, in order to make sure that the digest computed
including computed columns does not change spuriously as well.
Schema change test might need regenerating every time a system table
is added. In order to save future developer's time on debugging this
test, a short description of that requirement is added.
In order to make sure that old schema digest is not recomputed
and can be verified - computed columns feature is initially disabled
in schema_change_test.
The reason for that is as follows: running CQL test env assumes that
we are running the newest cluster with all features enabled. However,
the mere existence of some features might influence digest calculation.
So, in order for the existing test to work correctly, it should have
exactly the same set of cluster supported features as it had during
its creation. It used to be "all features", but now it's "all features
except computed columns". One can think of that as running a cluster
with some nodes not yet knowing what computed columns are, so they
are not taken into account when computing digests.
Additionally, a separate test case that takes computed column digest
into account will be generated and added in this series.
If a schema was created before computed columns were implemented,
its token column may not have been marked as computed.
To remedy this, if no computed column is found, the schema
will be recreated.
The code will work correctly even without this patch in order to support
upgrading from legacy versions, but it's still important: it transforms
token columns from the legacy format to new computed format, which will
eventually (after a few release cycles) allow dropping the support for
legacy format altogether.
Computed columns feature should be checked before creating
index schemas the new way - by adding computed column names
to system_schema.computed_columns.
Some columns may represent not user-provided values, but ones computed
from other columns. Currently an example is token column used in secondary
indexes to provide proper ordering. In order to avoid hardcoding special
cases in execution stage, optional additional information for computed
columns is stored in column definition.
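
As an illustration only, the idea of attaching optional computation info to a column definition could be sketched as below. The types are simplified, hypothetical stand-ins (not the actual classes from column_computation.hh), and std::hash replaces the real partitioner.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for a partition key.
using partition_key = std::string;

// Interface for deriving a column value from the base partition key.
struct column_computation {
    virtual ~column_computation() = default;
    virtual std::int64_t compute_value(const partition_key& key) const = 0;
};

// Stand-in for the token computation; std::hash is NOT the real partitioner.
struct token_column_computation final : column_computation {
    std::int64_t compute_value(const partition_key& key) const override {
        return static_cast<std::int64_t>(std::hash<std::string>{}(key));
    }
};

struct column_definition {
    std::string name;
    // Null for regular columns, set for computed ones.
    std::shared_ptr<const column_computation> computation;

    bool is_computed() const { return computation != nullptr; }
};
```

With this shape, the execution stage can ask a column whether it is computed instead of special-casing column names.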
streaming_reader_lifecycle_policy::create_reader() was ignoring the
partition_slice passed to it and always creating the reader for the
full slice.
That's wrong because create_reader() is called when recreating a
reader after it's evicted. If the reader stopped in the middle of
partition we need to start from that point. Otherwise, fragments in
the mutation stream will appear duplicated or out of order, violating
assumptions of the consumers.
This was observed to result in repair writing incorrect sstables with
duplicated clustering rows, which results in
malformed_sstable_exception on read from those sstables.
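
The effect can be reproduced with a toy model. Everything below is hypothetical (a partition is a sorted list of non-negative integer keys, a slice is just a start key) and only illustrates why a recreated reader must honor the slice it is given.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical slice: the first clustering key still to be emitted.
// Keys are assumed to be non-negative in this model.
struct partition_slice {
    int start = 0;
};

class reader {
    const std::vector<int>& _rows;
    std::size_t _pos = 0;
public:
    reader(const std::vector<int>& rows, partition_slice slice) : _rows(rows) {
        // Skip everything that lies before the slice start.
        while (_pos < _rows.size() && _rows[_pos] < slice.start) {
            ++_pos;
        }
    }
    bool at_end() const { return _pos == _rows.size(); }
    int next() { return _rows[_pos++]; }
};

// Reads `rows`, evicting and recreating the reader after `evict_after` rows.
// With honor_slice == false the recreated reader restarts from the full
// slice, which models the bug described above.
std::vector<int> read_with_eviction(const std::vector<int>& rows,
                                    std::size_t evict_after, bool honor_slice) {
    std::vector<int> out;
    reader first(rows, partition_slice{});
    while (!first.at_end() && out.size() < evict_after) {
        out.push_back(first.next());
    }
    partition_slice resume = honor_slice && !out.empty()
        ? partition_slice{out.back() + 1} : partition_slice{};
    reader recreated(rows, resume);
    while (!recreated.at_end()) {
        out.push_back(recreated.next());
    }
    return out;
}
```

In this model, ignoring the slice makes the rows read before eviction appear a second time in the output stream.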
Fixes #4659.
In v2:
- Added an overload without partition_slice to avoid changing existing users which never slice
Tests:
- unit (dev)
- manual (3 node ccm + repair)
Backport: 3.1
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1563451506-8871-1-git-send-email-tgrabiec@scylladb.com>
Given a list of ranges to stream, stream_transfer_task will create a
reader with the ranges and create an RPC stream connection on all the shards.
When the user provides ranges to repair with the -st and -et options, e.g.,
using scylla-manager, such ranges can belong to only one shard, and repair
will pass such ranges to streaming.
As a result, only one shard will have data to send while the RPC stream
connections are created on all the shards, which can cause the kernel
to run out of ports on some systems.
To mitigate the problem, do not open the connection if the ranges do not
belong to the shard at all.
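
A rough sketch of the check, under loudly stated assumptions: the token space is a tiny hypothetical [0, 1024) interval, each shard owns one contiguous slice of it, and the function names are made up for illustration. The real sharding function is different.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, tiny token space split into contiguous per-shard slices.
constexpr std::uint64_t token_space = 1024;

struct token_range {
    std::uint64_t first;
    std::uint64_t last;  // inclusive
};

unsigned shard_of(std::uint64_t token, unsigned shard_count) {
    return static_cast<unsigned>(token * shard_count / token_space);
}

bool range_touches_shard(const token_range& r, unsigned shard, unsigned shard_count) {
    return shard_of(r.first, shard_count) <= shard
        && shard <= shard_of(r.last, shard_count);
}

// Returns only the shards that own part of at least one range; the other
// shards have nothing to send and need no connection.
std::vector<unsigned> shards_to_connect(const std::vector<token_range>& ranges,
                                        unsigned shard_count) {
    std::vector<unsigned> result;
    for (unsigned shard = 0; shard < shard_count; ++shard) {
        for (const auto& r : ranges) {
            if (range_touches_shard(r, shard, shard_count)) {
                result.push_back(shard);
                break;
            }
        }
    }
    return result;
}
```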
Refs: #4708
"
Fix another source of flakiness in mutation_reader_test. This one is caused by storage_service_for_tests lacking a config::broadcast_to_all_shards() call, triggering an invalid memory access (or SEGFAULT) when run on more than one shard.
Refs: #4695
"
* 'fix_storage_service_for_tests' of https://github.com/denesb/scylla:
tests: storage_service_for_tests: broadcast config to all shards
tests: move storage_service_for_tests impl to test_services.cc
The function announce_column_family_drop() drops (deletes) a base table
and all the materialized-views used for its secondary indexes, but not
other materialized views - if there are any, the operation refuses to
continue. This is exactly what CQL's "DROP TABLE" needs, because it is
not allowed to drop a table before manually dropping its views.
But there is no inherent reason why we can't support an operation
to delete a table and *all* its views - not just those related to indexes.
This patch adds such an option to announce_column_family_drop().
This option is not used by the existing CQL layer, but can be used
by other code automating operations programmatically without CQL.
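
The behavior of such an option might look roughly like the sketch below. The `drop_views` enum, the `plan_drop` helper and the simplified `table` and `view` structs are hypothetical names chosen for illustration, not the signature added by this patch.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

enum class drop_views { no, yes };  // hypothetical option name

struct view {
    std::string name;
    bool backs_index;  // true for views backing a secondary index
};

struct table {
    std::string name;
    std::vector<view> views;
};

// Returns the names of everything that would be dropped, views first.
// By default a non-index view makes the operation refuse to continue.
std::vector<std::string> plan_drop(const table& t, drop_views opt = drop_views::no) {
    std::vector<std::string> to_drop;
    for (const auto& v : t.views) {
        if (!v.backs_index && opt == drop_views::no) {
            throw std::invalid_argument(
                "cannot drop table with materialized views: " + v.name);
        }
        to_drop.push_back(v.name);
    }
    to_drop.push_back(t.name);
    return to_drop;
}
```

Keeping the refusing behavior as the default leaves existing callers such as "DROP TABLE" unchanged.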
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716150559.11806-1-nyh@scylladb.com>
Currently scylla-server.service uses DefaultTimeoutStopSec = 90; if Scylla
is not able to shut down cleanly within 90 seconds, we may have data corruption on the node.
Since we already set TimeoutStartSec = 900, we can use TimeoutSec to set both
TimeoutStartSec and TimeoutStopSec to 900.
See #4700
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190717095416.10652-1-syuu@scylladb.com>
All statement objects which derive from cf_statement, including
drop_index_statement, have a column_family() returning the name of the
column family involved in this statement. For most statements this is
known at the time of construction, because it is part of the statement,
but for "DROP INDEX", the user doesn't specify the table's name - just
the index name. So we need to override column_family() to find the
table name.
The existing implementation assert()ed that we can always find such
a table, but this is not true - for example, in a DROP INDEX with
"IF EXISTS", it is perfectly fine for no such table to exist. In this
case we don't want a crash, and not even an exception - it's fine that
we just return an empty table name.
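
A minimal sketch of the lookup, assuming a hypothetical index-name to table-name map in place of the real schema objects:

```cpp
#include <map>
#include <string>

// Hypothetical index-name -> table-name map standing in for the schema.
using index_map = std::map<std::string, std::string>;

std::string column_family_for_index(const index_map& indexes,
                                    const std::string& index_name) {
    auto it = indexes.find(index_name);
    // No assert here: a missing index is a valid state for
    // DROP INDEX IF EXISTS, so an empty table name is returned instead.
    return it == indexes.end() ? std::string() : it->second;
}
```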
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716180104.15985-1-nyh@scylladb.com>
"
disable_sstable_write needs to acquire _sstable_deletion_sem to properly synchronize
with background deletions done by on_compaction_completion to ensure no sstables will
be created or deleted during reshuffle_sstables after
storage_service::load_new_sstables disables sstable writes.
Fixes #4622
Test: unit(dev), nodetool_additional_test.py migration_test.py
"
* 'scylla-4622-fix-disable-sstable-write' of https://github.com/bhalevy/scylla:
table: document _sstables_lock/_sstable_deletion_sem locking order
table: disable_sstable_write: acquire _sstable_deletion_sem
table: uninline enable_sstable_write
table: reshuffle_sstables: add log message
If a node is a seed node, it cannot be started with the
replace-address-first-boot or replace-address flag.
The issue is that as a seed node it will generate new tokens instead of
replacing the existing one the user expects it to replace when supplying
the flags.
This patch will throw a bad_configuration_error exception
in this case.
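
The check amounts to something like the following sketch. The `node_config` struct and `validate_replace_options` function are hypothetical, and `bad_configuration_error` is redefined locally as a stand-in so the example is self-contained.

```cpp
#include <stdexcept>
#include <string>

// Local stand-in for the exception type named above.
struct bad_configuration_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical, reduced view of the relevant configuration.
struct node_config {
    bool is_seed = false;
    std::string replace_address;
    std::string replace_address_first_boot;
};

void validate_replace_options(const node_config& cfg) {
    const bool replacing = !cfg.replace_address.empty()
                        || !cfg.replace_address_first_boot.empty();
    if (cfg.is_seed && replacing) {
        throw bad_configuration_error(
            "a seed node cannot be started with a replace-address option");
    }
}
```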
Fixes #3889
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fixes #4717
Bug in ipv6 support series caused inet_address serialization
to include an additional "size" parameter in the address chunk.
Message-Id: <20190716134254.20708-1-calle@scylladb.com>
Due to recent changes to the config subsystem, configuration has to be
broadcast to all shards if one wishes to use it on them. The
`storage_service_for_tests` has a `sharded<gms::gossiper>` member, which
reads config values on initialization on each shard, causing a crash as
the configuration was initialized only on shard 0. Add a call to
`config::broadcast_to_all_shards()` to ensure all shards have access to
valid config values.
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds an assertion to make sure we fail early in such cases.
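
As a sketch of the idea, with a deliberately simplified, hypothetical `row` that stores integer cells in a vector sorted by column id:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using column_id = unsigned;

// Hypothetical, simplified row: cells are kept sorted by column id.
class row {
    std::vector<std::pair<column_id, int>> _cells;
public:
    void append_cell(column_id id, int value) {
        // Precondition: id must be larger than every id already present.
        // Failing here is preferable to silently corrupting the row.
        assert(_cells.empty() || _cells.back().first < id);
        _cells.emplace_back(id, value);
    }
    std::size_t size() const { return _cells.size(); }
};
```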
Currently the
test_multishard_combining_reader_non_strictly_monotonic_positions is
flaky. The test is somewhat unconventional, in that it doesn't use the
same instance of data as the input to the test and as its expected
output, instead it invokes the method which generates this data
(`make_fragments_with_non_monotonic_positions()`) twice, first to
generate the input, and second to generate the expected output. This
means that the test is prone to any deviation in the data generated by
said method. One such deviation, discovered recently, is that the method
doesn't explicitly specify the deletion time of the generated range
tombstones. This results in this deletion time sometimes differing
between the test input and the expected output. Solve by explicitly
passing the same deletion time to all created range tombstones.
Refs: #4695
Fixes a segfault when querying for an empty keyspace.
Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.
Fixes #4689.
Start n1, n2
Create ks with rf = 2
Run repair on n2
Stop n2 in the middle of repair
n1 will notice n2 is DOWN, gossip handler will remove repair instance
with n2 which calls remove_repair_meta().
Inside remove_repair_meta(), we have:
```
1 return parallel_for_each(*repair_metas, [repair_metas] (auto& rm) {
2 return rm->stop();
3 }).then([repair_metas, from] {
4 rlogger.debug("Removed all repair_meta for single node {}", from);
5 });
```
Since 3.1, we start 16 repair instances in parallel which will create 16
readers. The reader semaphore count is 10.
At line 2, it calls
```
6 future<> stop() {
7 auto gate_future = _gate.close();
8 auto writer_future = _repair_writer.wait_for_writer_done();
9 return when_all_succeed(std::move(gate_future), std::move(writer_future));
10 }
```
The gate protects the reader while it reads data from disk:
```
11 with_gate(_gate, [] {
12 read_rows_from_disk
13 return _repair_reader.read_mutation_fragment() --> calls reader() to read data
14 })
```
So line 7 won't return until all the 16 readers return from the call of
reader().
The problem is, the reader won't release the reader semaphore until the
reader is destroyed!
So, even if 10 out of the 16 readers have finished reading, they won't
release the semaphore. As a result, the stop() hangs forever.
As a short-term fix, we can delete the reader, i.e., drop the
repair_meta object once it is stopped.
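
The resource-lifetime problem can be modeled without any asynchrony. In the hypothetical sketch below a plain integer plays the role of the reader semaphore and `reader_permit` holds one unit for its whole lifetime; the names are illustrative only.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Holds one semaphore unit from construction until destruction,
// regardless of whether the owner has finished reading.
class reader_permit {
    int* _available;
public:
    explicit reader_permit(int& available) : _available(&available) {
        --*_available;
    }
    ~reader_permit() { ++*_available; }
    reader_permit(const reader_permit&) = delete;
    reader_permit& operator=(const reader_permit&) = delete;
};

// Non-blocking admission: returns nullptr where the real reader would wait.
std::unique_ptr<reader_permit> try_admit(int& available) {
    if (available == 0) {
        return nullptr;
    }
    return std::make_unique<reader_permit>(available);
}
```

With 10 units and 16 requests, six requests cannot be admitted until some of the admitted permits are destroyed, which is why dropping the stopped objects unblocks the rest.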
Refs: #4693
Fixes #4713
Modifying config files to use sharded storage misses the fact
that extensions are allowed to add non-member config fields to
the main configuration, typically from "extra" config_file
objects.
Unless those "extra" files are broadcast when the main file is broadcast,
the values will not be readable from other shards.
This patch propagates the broadcast to all other config files
whose entries are in the top level object. This ensures we
always keep data up to date on config reload.
Message-Id: <20190715135851.19948-1-calle@scylladb.com>
configure.py currently takes some time to write build.ninja. If the user
interrupts (e.g., control-C) configure.py, it can leave behind a partial
or even empty build.ninja file. This is most frustrating when the user
didn't explicitly run "configure.py", but rather just ran "ninja" and
ninja decided to run configure.py, and after interrupting it the user
cannot run "ninja" again because build.ninja is gone. Another result of
losing build.ninja is that the user now needs to remember which parameters
to run "configure.py", because the old ones stored in build.ninja were lost.
The solution in this patch is simple: We write the new build.ninja contents
into a temporary file, not directly into build.ninja. Then, only when the
entire file has been successfully written, do we rename the temporary file
to its intended name - build.ninja.
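
The write-then-rename technique itself is language independent. The sketch below shows it in C++ with std::filesystem rather than in configure.py's Python; the function name and the ".tmp" suffix are assumptions made for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Writes `contents` to `target` without ever exposing a partial file:
// the data goes to a sibling temporary file, which is renamed into place
// only after the write has completed.
void write_file_atomically(const fs::path& target, const std::string& contents) {
    fs::path tmp = target;
    tmp += ".tmp";  // hypothetical temporary name
    bool ok = false;
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out << contents;
        out.flush();
        ok = static_cast<bool>(out);
    }  // the stream is closed here, before the rename
    if (!ok) {
        fs::remove(tmp);
        throw std::runtime_error("failed to write " + tmp.string());
    }
    // An interrupted run leaves at most a stale temporary file behind;
    // the previous target stays intact until this point.
    fs::rename(tmp, target);
}
```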
Fixes #4706
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190715122129.16033-1-nyh@scylladb.com>
When scylla is started for the first time with PasswordAuthenticator
enabled, a record of the default superuser may be created in the
table with can_login and is_superuser set to null. This happens
because the module in charge of creating the row is the role manager,
while the module in charge of setting the default salted password
hash is the password authenticator. The two modules are started
together. If the password authenticator finishes its initialization
first, then until the role manager completes its initialization the
row contains those null columns, and any login attempt in this period
will cause a memory access violation, since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and authorizer only after the role manager
has finished its initialization.
Tests:
1. Unit tests (release)
2. Auth and cqlsh auth related dtests.
Fixes #4226
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
In the scylla-debuginfo package, we have /usr/lib/debug/opt/scylladb/libreloc/libthread_db-1.0.so-666.development-0.20190711.73a1978fb.el7.x86_64.debug
but we do not actually have libthread_db.so.1 in /opt/scylladb/libreloc,
since it does not appear in the ldd output for the scylla binary.
To debug threads, we need to add the library to the relocatable package manually.
Fixes #4673
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190711111058.7454-1-syuu@scylladb.com>
The resource manager is used to manage common resources between
various hints managers. In-flight hints used to be one of the shared
resources, but it proves to cause starvation, when one manager eats
the whole limit - which may be especially painful if the background
materialized views hints manager starves the regular hints manager,
which can in turn start failing user writes because of admission control.
This patch makes the limit per-manager again,
which effectively reverts the limit to its original behavior.
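
A toy model of the per-manager limit, with a hypothetical `hints_manager` class that tracks in-flight bytes against its own budget:

```cpp
#include <cstddef>

// Hypothetical manager with its own in-flight budget. Saturating one
// manager's budget cannot block a different manager.
class hints_manager {
    std::size_t _limit;
    std::size_t _in_flight = 0;
public:
    explicit hints_manager(std::size_t limit) : _limit(limit) {}

    // Returns false when sending would exceed this manager's own limit.
    bool try_send(std::size_t bytes) {
        if (_in_flight + bytes > _limit) {
            return false;
        }
        _in_flight += bytes;
        return true;
    }

    // Called once a hint has been sent; bytes must have been admitted before.
    void on_sent(std::size_t bytes) { _in_flight -= bytes; }
};
```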
Fixes #4483
Message-Id: <8498768e8bccbfa238e6a021f51ec0fa0bf3f7f9.1559649491.git.sarna@scylladb.com>
We were missing calls to underlying_type in a few locations and so the
insert would think the given literal was invalid and the select would
refuse to fetch a UDT field.
Fixes #4672
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190708200516.59841-1-espindola@scylladb.com>
Queries to system.size_estimates table which are not single partition queries
caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer.
This happened because multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for size_estimates_mutation_reader.
This commit fixes the issue and closes #4689.
Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.
Refactor tests to use seastar::thread.
`disable_sstable_write` needs to acquire `_sstable_deletion_sem`
to properly synchronize with background deletions done by
`on_compaction_completion` to ensure no sstables will be created
or deleted during `reshuffle_sstables` after
`storage_service::load_new_sstables` disables sstable writes.
Fixes #4622
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This fixes a possible cause of #4614.
From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.
My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.
That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.
This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.
If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.
With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.
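
The pattern can be sketched as follows. The classes are hypothetical reductions (the real writer is far larger) and the `close_calls` counter exists only to make the behavior observable.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical index writer whose close() may throw.
struct index_writer {
    bool fail_on_close = false;
    int* close_calls = nullptr;  // counts close() attempts, for illustration

    void close() {
        if (close_calls) {
            ++*close_calls;
        }
        if (fail_on_close) {
            throw std::runtime_error("close failed");
        }
    }
};

class writer {
    std::unique_ptr<index_writer> _index_writer;
public:
    explicit writer(std::unique_ptr<index_writer> w)
        : _index_writer(std::move(w)) {}

    void close_index() {
        // Take ownership first: if close() throws, _index_writer is
        // already empty, so the destructor will not close it again.
        auto w = std::move(_index_writer);
        if (w) {
            w->close();
        }
    }

    ~writer() {
        if (_index_writer) {
            try {
                _index_writer->close();
            } catch (...) {
                // Destructors must not throw.
            }
        }
    }
};
```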
Fixes #4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>
In debug mode the LSA needs objects to be 8-byte aligned in order to
maximise coverage from the AddressSanitizer.
Usually `close_active()` creates a dummy object that covers the end of
the segment being closed. However, if the last real object ends in the
last eight bytes of the segment then that dummy won't be created because
of the alignment requirements. This broke exit conditions on loops
trying to read all objects in the segment and caused them to attempt to
dereference address at the end of the segment. This patch fixes that.
Fixes #4653.
"
If the user creates a keyspace with the 'SimpleStrategy' replication class
in a multi-datacenter environment, they will receive a warning in the CQL shell
and in the server logs.
Resolves #4481 and #4651.
"
* 'multidc' of https://github.com/kbr-/scylla:
Warn user about using SimpleStrategy with Multi DC deployment
Add warning support to the CQL binary protocol implementation