"The patch fixes a few issues caused by generalizing the ami scripts. The
scylla_bootparam_setup requires invocation with ami flag. The
scylla_install is missing some steps executed by the scylla-ami.sh."
scylla-ami.sh moved some AMI-specific files. These parts were
dropped when converging scylla-ami into scylla_install. Fix that.
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
The script imports the /etc/sysconfig/scylla-server for configuration
settings (NR_PAGES). The /etc/sysconfig/scylla-server includes an AMI
param which is a string value and called as a last step in
scylla_install (after scylla_bootparam_setup has been initiated).
The AMI variable is set up in scylla_install and is used in multiple
scripts. To resolve the conflict, move the import of
/etc/sysconfig/scylla-server to after the AMI variable has been compared.
Fixes: #744
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
After upgrading an AMI and trying to stop and start a machine,
/var/lib/scylla/coredump is not created. Create the directory if it does
not exist prior to generating a core dump.
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Commit 2ba4910 ("main: verify that the NOFILE rlimit is sufficient")
added a recommendation to set NOFILE rlimit to 200k. Update our release
binaries to do the same.
"This series solves an issue with the load broadcaster that reports negative
values due to an integer wrap around. While fixing this issue an additional
change was made so that the load_map would return doubles and not a formatted
string. This is a better API, safer and better documented."
Since commit 16596385ee, long_token() is already checking
t.is_minimum(), so the comment which explains why it does not (for
performance) is no longer relevant. And we no longer need to check
t._kind before calling long_token (the check we do here is the same
as is_minimum).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Check the list of column families passed as an option to repair, to
provide the user with a more meaningful exception when a nonexistent
column family is passed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This was a plain bug - ranges_opt is supposed to parse the option into
the vector "var", but took the vector by value.
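The by-value mistake is easy to reproduce in isolation. A minimal sketch,
assuming a comma-separated option string; split_by_value() and
split_by_reference() are hypothetical names and are not taken from the
repair code:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Buggy variant: 'out' is a copy, so the caller's vector is never filled.
inline void split_by_value(const std::string& opt, std::vector<std::string> out) {
    std::istringstream in(opt);
    std::string item;
    while (std::getline(in, item, ',')) {
        out.push_back(item);
    }
}

// Fixed variant: 'out' refers to the caller's vector.
inline void split_by_reference(const std::string& opt, std::vector<std::string>& out) {
    std::istringstream in(opt);
    std::string item;
    while (std::getline(in, item, ',')) {
        out.push_back(item);
    }
}
```

Both functions compile and run without warnings, which is why this kind of
bug is only noticed when the parsed option turns out to be empty.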
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Support the "columnFamilies" parameter of repair, allowing to repair
only some of the column families of a keyspace, instead of all of them.
For example, using a command like "nodetool repair keyspace cf1 cf2".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The script scylla_coredump_setup was introduced in
9b4d0592, and added to the scylla rpm spec file, as a
post script. However, calling yum when there's one
yum instance installing scylla server will cause a deadlock,
since yum waits for the yum lock to be released, and the
original yum process waits for the script to end.
So let's remove this from the script. Debian shouldn't be
affected, since it was never added to the debian build
rules (to the best of my knowledge, after analyzing 9b4d0592),
hence I did not remove it. It should cause the same problem
with apt-get in case it was used.
CC: Takuya ASADA <syuu@scylladb.com>
[ penberg: Rebase and drop statement about 'abrt' package not in Fedora. ]
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
A default value was not set for the "incremental" and "parallelism"
repair parameters, so Scylla can wrongly decide that they have an
unsupported value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The repair API used to have an undocumented parameter list similar to
origin.
This patch changes the way repair gets its parameters.
Instead of a single undocumented string, it now lists all the different
optional parameters in the swagger file and accepts them explicitly.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This change sets the http server to start as the first step in the boot
order.
It is helpful if some other step takes a long time or gets stuck.
Fixes #725
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Actually check that a snapshot directory with a given tag
exists instead of just checking that a 'snapshot' directory
exists.
Fixes issue #689
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In origin the storage_service reports the load map as a formatted string.
As an API, a better option is to report the load map as doubles and let
the JMX proxy do the formatting.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
map_reduce0 converts the result value to the type of the init value. In
load_broadcaster the init value 0 is of type int.
This results in an int wrap-around and negative results.
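The same effect can be shown with std::accumulate, which also takes its
result type from the init argument. This is only an illustrative sketch:
total_load() is a hypothetical helper and is not the map_reduce0 code.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// The accumulator has the type of 'init', just like the init value of a
// map-reduce helper decides the type of the reduced result.
template <typename Init>
Init total_load(const std::vector<std::int64_t>& per_shard, Init init) {
    return std::accumulate(per_shard.begin(), per_shard.end(), init);
}
```

Calling total_load(loads, 0) accumulates into an int and cannot represent a
sum above 2^31 - 1, while total_load(loads, std::int64_t{0}) or
total_load(loads, 0.0) keeps the full value.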
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Add partial support for the "incremental" option (only support the
"false" setting, i.e., not incremental repair) and the "parallelism"
option (the choice of sequential or parallel repair is ignored - we
always use our own technique).
This is needed because scylla-jmx passes these options by default
(e.g., "incremental=false" is passed to say this is *not* incremental
repair, and we just need to allow this and ignore it).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When throwing an "unsupported repair options" exception to the caller
(such as "nodetool repair"), also list which options were not recognized.
Additionally, list the options when logging the repair operation.
This patch includes an operator<< implementation for pretty-printing an
std::unordered_map. We may want to move it later to a more central
location - even Seastar (like we have a pretty-printer for std::vector
in core/sstring.hh).
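A pretty-printer of this kind could look roughly as follows. This is a
sketch only: the "{key=value, ...}" output format and the describe_options()
helper are assumptions made for illustration, not the format used by the
patch.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Prints an unordered_map as {k1=v1, k2=v2}. Iteration order is unspecified.
template <typename K, typename V>
std::ostream& operator<<(std::ostream& os, const std::unordered_map<K, V>& m) {
    os << "{";
    bool first = true;
    for (const auto& kv : m) {
        if (!first) {
            os << ", ";
        }
        first = false;
        os << kv.first << "=" << kv.second;
    }
    return os << "}";
}

// Hypothetical helper building the text of an "unsupported options" error.
inline std::string describe_options(
        const std::unordered_map<std::string, std::string>& options) {
    std::ostringstream out;
    out << "unsupported repair options: " << options;
    return out.str();
}
```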
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
max_purgeable was being incorrectly calculated because the code
that creates vector of uncompacted sstables was wrong.
This value is used to determine whether or not a tombstone can
be purged.
Operator < is supposed to be used instead in the callback passed
as the third parameter to boost::set_difference.
This fix is a step towards closing the issue #676.
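For reference, a set difference needs a less-than style comparator that
matches the order in which both inputs are sorted. The sketch below uses
std::set_difference rather than boost::set_difference to stay
self-contained; table_ref and uncompacted() are hypothetical stand-ins, not
the compaction code.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for an sstable, identified by its generation.
struct table_ref {
    long generation;
};

// Returns the elements of 'all' that are not in 'compacting'.
// Both inputs must already be sorted by generation.
inline std::vector<table_ref> uncompacted(const std::vector<table_ref>& all,
                                          const std::vector<table_ref>& compacting) {
    std::vector<table_ref> result;
    std::set_difference(all.begin(), all.end(),
                        compacting.begin(), compacting.end(),
                        std::back_inserter(result),
                        [](const table_ref& a, const table_ref& b) {
                            // Strict ordering: "a comes before b".
                            return a.generation < b.generation;
                        });
    return result;
}
```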
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The start_native_transport() function in storage_service expects the
'enabled' option to be defined. If the option is not defined, it means
that encryption is implicitly disabled.
Fixes #718.
"Adds support for TLS/SSL encrypted (and cert verified)
connections for message service
* Modify config option to match "native" style certificate management
* Add SSL options to messaging service and generate SSL server/client
endpoints when required
* Add config option handling to init/main"
* Massage user options in main
* Use them in storage_service, and if needed, load certificates etc
and pass to transport/cql server.
Conflicts:
service/storage_service.cc
The optional credentials argument determines whether an SSL or a normal
server socket is created.
Note: This does not follow the pattern of "socket as argument", simply
because this is a distributed object, so only trivial or immutable
objects should be passed to it.
Describe scylla version of option.
Note, for test usage, the below should be workable:
server_encryption_options:
    internode_encryption: all
    certificate: seastar/tests/test.crt
    truststore: seastar/tests/catest.pem
    keyfile: seastar/tests/test.key
Since the seastar test suite contains a snakeoil cert + trust
combo
* Accept port + credentials + option for what to encrypt
* If set, enable an SSL listener at ssl_port
* Check outgoing connections by IP to determine if
they should go to SSL/normal endpoint
Requires seastar RPC patch
Note: currently, the connections created by messaging service
do _not_ do certificate name verification. While DNS lookup
is probably not that expensive here, I am not 100% sure it is
the desired behaviour.
Normal trust is however verified.
* Mark option used
* Make sub-options adapted to seastar-tls useable values (i.e. x509)
Syntax is now:
server_encryption_options:
    internode_encryption: <none, all, dc, rack>
    certificate: <path-to-PEM-x509-cert> (default conf/scylla.crt)
    keyfile: <path-to-PEM-x509-key> (default conf/scylla.key)
    truststore: <path-to-PEM-trust-store-file> (default empty, use system trust)
"When a node gains or regains responsibility for certain token ranges, streaming
will be performed. Upon receiving the stream data, the row cache
is invalidated for that range.
Refs #484."
The describe_ring method in storage_service did not report the start and
end tokens.
Also for rpc addresses that are not the local address, it returned the
value representation (including the version) and not just the address.
Fixes #695
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Use steady_clock instead of high_resolution_clock where a monotonic
clock is required. high_resolution_clock is essentially a
system_clock (wall clock) and therefore must not be assumed to be monotonic,
since the wall clock may move backwards due to time/date adjustments.
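A minimal sketch of interval measurement with a monotonic clock; time_it()
is a hypothetical helper used only for illustration. Whether
high_resolution_clock is steady is implementation-defined and can be checked
through its is_steady member.

```cpp
#include <chrono>

// The standard guarantees that steady_clock never goes backwards.
static_assert(std::chrono::steady_clock::is_steady,
              "steady_clock must be monotonic");

// Measures how long 'work' takes, unaffected by wall clock adjustments.
template <typename Func>
std::chrono::nanoseconds time_it(Func&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```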
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Use steady_clock instead of high_resolution_clock where a monotonic
clock is required. high_resolution_clock is essentially a
system_clock (wall clock) and therefore must not be assumed to be monotonic,
since the wall clock may move backwards due to time/date adjustments.
Fixes issue #638
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
"Fixes stream_session hangs:
1) if the sending node is gone, the receiving peer will wait forever
2) if the node which should send COMPLETE_MESSAGE to the peer node is gone,
the peer node will wait forever"
This patch fixes a bug where the *first* run of "nodetool repair" always
returned immediately, instead of waiting for the repair to complete.
Repair operations are asynchronous: Starting a repair returns a numeric
id, which can then be used to query for the repair's completion, and this
is what "nodetool repair" does (through our JMX layer). We started with
the repair ID "0", the next one is "1", and so on.
The problem is that "nodetool repair", when it sees 0 being returned,
treats it not as a regular repair ID, but rather as an answer that
there is nothing to repair - printing a message to that effect and *not*
waiting for the repair (which was correctly started) to complete.
The trivial fix is to start our repair IDs at 1, instead of 0.
We currently do not return 0 in any case (we don't know there is nothing
to repair before we actually start the work, and parameter errors
cause an exception, not a return of 0).
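The idea can be sketched as an ID source that never hands out 0.
repair_id_source is a hypothetical class for illustration and does not
mirror the actual repair tracker.

```cpp
// Hypothetical ID source. 0 is reserved because "nodetool repair"
// interprets it as "nothing to repair".
class repair_id_source {
    int _next = 1;  // first ID handed out is 1, not 0
public:
    int next() { return _next++; }
};
```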
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If we can't open the file, we will fail with a mysterious error. It is a customary
scenario, though, since people who are unaware of or have just forgotten about seastar's
restriction of direct io access may put those files in tmpfs and other mount points.
We have a direct_io check that is designed exactly for this purpose, so as to give
the user a better error message. This patch makes use of it.
Fixes #644
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It is hard-coded as 30 seconds at the moment.
Usage:
$ scylla --ring-delay-ms 5000
Time a node waits to hear from other nodes before joining the ring in
milliseconds.
Same as -Dcassandra.ring_delay_ms in cassandra.
The midpoint() algorithm to find a token between two tokens doesn't
work correctly in case of wraparound. The code tried to handle this
case, but did it wrong. So this patch fixes the midpoint() algorithm,
and adds clearer comments about why the fixed algorithm is correct.
This patch also modifies two midpoint() tests in partitioner_test,
which were incorrect - they verified that midpoint() returns some expected
values, but the expected values were wrong!
We also add to the test a more fundamental test of midpoint() correctness,
which doesn't check the midpoint against a known value (which is easy to
get wrong, as indeed happened); rather, we simply check that the midpoint
is really inside the range (according to the token ordering operator).
This simple test failed with the old implementation of midpoint() and
passes with the new one.
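The "is the midpoint inside the range" property can be illustrated on a
simplified ring. The sketch assumes tokens are plain unsigned 64-bit values
on a 2^64 ring, which is not how dht::token is represented; ring_midpoint()
and in_ring_range() are hypothetical names.

```cpp
#include <cstdint>

using token = std::uint64_t;  // simplified: tokens live on a 2^64 ring

// Midpoint of the range [left, right], walking clockwise from left.
// Unsigned arithmetic wraps modulo 2^64, so a range that crosses the
// end of the ring needs no special case.
inline token ring_midpoint(token left, token right) {
    const token distance = right - left;  // clockwise distance
    return left + distance / 2;
}

// True if t lies in [left, right] when walking clockwise from left.
inline bool in_ring_range(token left, token right, token t) {
    return (t - left) <= (right - left);
}
```

A naive average of the two bounds lands on the opposite side of the ring for
a wrapping range, which is exactly what the range-membership check catches.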
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The problem is that we set the session state to WAIT_COMPLETE in
send_complete_message's continuation. The peer node might send
COMPLETE_MESSAGE before we run the continuation, in which case we set the
wrong status in COMPLETE_MESSAGE's handler and will not close the session.
Before:
GOT STREAM_MUTATION_DONE
receive task_completed
SEND COMPLETE_MESSAGE to 127.0.0.2:0
GOT COMPLETE_MESSAGE, from=127.0.0.2, connecting=127.0.0.3, dst_cpu_id=0
complete: PREPARING -> WAIT_COMPLETE
GOT COMPLETE_MESSAGE Reply
maybe_completed: WAIT_COMPLETE -> WAIT_COMPLETE
After:
GOT STREAM_MUTATION_DONE
receive task_completed
maybe_completed: PREPARING -> WAIT_COMPLETE
SEND COMPLETE_MESSAGE to 127.0.0.2:0
GOT COMPLETE_MESSAGE, from=127.0.0.2, connecting=127.0.0.3, dst_cpu_id=0
complete: WAIT_COMPLETE -> COMPLETE
Session with 127.0.0.2 is complete
If the session is idle for 10 minutes, close the session. This can
detect the following hangs:
1) if the sending node is gone, the receiving peer will wait forever
2) if the node which should send COMPLETE_MESSAGE to the peer node is
gone, the peer node will wait forever
Fixes simple_kill_streaming_node_while_bootstrapping_test.
Get the from address from cinfo. It is needed to figure out which stream
session this mutation belongs to, since we need to update the keep
alive timer for this stream session.
Currently, if the node is actually down, although the streaming_timeout
is 10 seconds, the sending of the verb will return rpc_closed error
immediately, so we give up in 20 * 5 = 100 seconds. After this change,
we give up in 10 * 30 = 300 seconds at least, and 10 * (30 + 30) = 600
seconds at most.
It is a one-way message at the moment. If a COMPLETE_MESSAGE is lost, no
one will close the session. The first step to fix the issue is to try to
retransmit the message.
$NAME is the full name of the distribution, which is too long for a script.
$ID is the shortened one, which is more useful.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
The helper functions for summing statistics over the column families are
template functions that infer the return type according to the type of the
Init param.
In the API the return value should be int64_t; passing an int would
cause a number wrap-around.
A partial output from nodetool cfstats after the fix:
nodetool cfstats keyspace1
Keyspace: keyspace1
Read Count: 0
Read Latency: NaN ms.
Write Count: 4050000
Write Latency: 0.009178098765432099 ms.
Pending Flushes: 0
Table: standard1
SSTable count: 12
Space used (live): 1118617445
Space used (total): 23336562465
Fixes #682
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
From Calle:
Fixes #589
Query should not return dangling static row in partition without any
regular/ck columns if a CK restriction is applied.
Refs #650
Fixes bug in CK range code for paging, and removes CK use for tables with no
clustering -> way simpler code. Also removed lots of workaround code no longer
required.
Note that this patch set does not fully fix #650/paging since bug #663 causes
duplicate rows. Still almost there though.
Boost::date_time doesn't accept some of the date and time formats that
origin does (e.g. 2013-9-22 or 2013-009-22).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Refs #640
* Remove use of cluster key range for tables without CK
Checking CK existence once and using the info allows us to remove some
stupid complexity in checking for "last key" match
* With fix for #589 we can also remove some superfluous code to
compensate for that issue, and make "partition end" simpler
* Remove extra row in CK case. Not needed anymore
End result is that pager now more or less only relies on adapted query
ranges.
timestamp_from_string() is used by both timestamp and date types, so it
is better to move the try { } catch { } to the function itself instead
of expecting its callers to catch exceptions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
On AMI, scylla-server fails to systemctl restart because scylla_prepare tries to mount /var/lib/scylla even if it's already mounted.
This patch fixes the issue.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
* seastar b44d729...51154f7 (6):
> semaphore: add with_semaphore()
> scripts: posix_net_conf.sh: don't transform wide CPU mask
> resource: fix build for systems without HWLOC
> build: link libasan before all other libraries
> Use sys_membarrier() when available
> build: add missing library (boost_filesystem)
Fixes #589
If we got no rows, but have live static columns, we should only
give them back IFF we did not have any CK restrictions.
If ck:s exist, and we have a restriction on them, we either have matching
rows, or return nothing, since cql does not allow "is null".
This patch introduces a test for reading keys from a single sstable with
the range beginning and end being the keys present in the index summary.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When choosing a relevant range of buckets it wasn't taken into account
whether the range bounds are inclusive or not. That may have resulted in
more buckets being read than necessary, which was a condition not
expected by the code responsible for looking for relevant keys inside
the buckets.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If a sstable doesn't belong to current shard, mark_for_deletion
should be called for the deletion manager to still work.
It doesn't mean that the sstable will be deleted, but that the
sstable is not relevant to the current shard, thus it can be
deleted by the deletion manager in the future.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When a node gains or regains responsibility for certain token ranges,
streaming is performed. Upon receiving the stream data, the
row cache is invalidated for that range.
Refs #484.
From Paweł:
"This series fixes sstables::key_reader not respecting range inclusiveness
if the bounds were the keys that were present in the index summary.
Fixes #663."
This patch introduces a test for reading keys from a single sstable with
the range beginning and end being the keys present in the index summary.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When choosing a relevant range of buckets it wasn't taken into account
whether the range bounds are inclusive or not. That may have resulted in
more buckets being read than necessary, which was a condition not
expected by the code responsible for looking for relevant keys inside
the buckets.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This series adds support for nodetool command 'drain'. The general idea
of this command is to close all connections (both with clients and other
nodes) and flush all memtables to disk.
Fixes #662."
"Merge AMI scripts to dist/common/scripts, make it usable on non-AMI
environments. Provides a script to do all settings automatically, which
can be run as a one-liner like this:
curl http://url_to_scylla_install | sudo bash -s -- -d /dev/xvdb,/dev/xvdc -n eth0 -l ./
Also enables coredump and saves it to /var/lib/scylla/coredump"
When the server is shutting down a flag _stopping is set and listeners
are aborted using abort_accept(), which causes accept() calls to return
failed futures. However, accept handler just checks that the flag
_stopping is set and returns which causes a failed future to be
destroyed and a warning is printed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The underlying data source for cache should not be the same memtable
which is later used to update the cache from. This fixes the following
assertion failure:
row_cache_test_g: utils/logalloc.hh:289: decltype(auto) logalloc::allocating_section::operator()(logalloc::region&, Func&&) [with Func = memtable::make_reader(schema_ptr, const partition_range&)::<lambda()>]: Assertion `r.reclaiming_enabled()' failed.
The problem is that when memtable is merged into cache their regions
are also merged, so locking cache's region locks the memtable region
as well.
Currently sourcing for the second time causes an exception from
pretty printer registration:
Traceback (most recent call last):
File "./scylla-gdb.py", line 41, in <module>
gdb.printing.register_pretty_printer(gdb.current_objfile(), build_pretty_printer())
File "/usr/share/gdb/python/gdb/printing.py", line 152, in register_pretty_printer
printer.name)
RuntimeError: pretty-printer already registered: scylla
Fixes the case where background activity needed to complete CL=ONE writes
is queued up in the storage proxy, and the client adds new work faster
than it can be cleared.
The last two loops were incorrectly inside the first one. That's a
bug because a new sstable may be emplaced more than once in the
sstable list, which can cause several problems. mark_for_deletion
may also be called more than once for compacted sstables, however,
it is idempotent.
Found this issue while auditing the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"* seastar 294ea30...b44d729 (5):
> Merge "Properly distribute IO queues" from Glauber
> reactor: allow more poll time in virtualized environments
> reactor: fix idle-poll limit
> reactor: use a vector of unique_ptr for the IO queues
> io queues: make the queues really part of the reactor"
With a consistency level less than ALL, mutation processing can move to
the background (meaning the client was answered, but there is still work to
do on behalf of the request). If the background request completion rate
is lower than the incoming request rate, background requests will accumulate
and eventually exhaust all memory resources. This patch's aim is
to prevent this situation by monitoring how much memory all current
background requests take and, when some threshold is passed, to stop moving
requests to the background (by not replying to a client until either memory
consumption moves below the threshold or the request is fully completed).
There are two main points where each background mutation consumes memory:
holding the frozen mutation until the operation is complete (in order to hint it
if it does not) and on the rpc queue to each replica, where it sits until it's
sent out on the wire. The patch accounts for both of those separately
and limits the former to 10% of total memory and the latter to 6M.
Why 6M? The best answer I can give is why not :) But on a more serious
note, the number should be small enough so that all the data can be
sent out in a reasonable amount of time, and one shard is not capable of
achieving even close to full bandwidth, so empirical evidence shows 6M
to be a good number.
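The throttling decision described above can be sketched as simple byte
accounting. background_write_budget is a hypothetical class written for
illustration; it is not the storage proxy implementation, and the real patch
tracks two separate budgets.

```cpp
#include <cstddef>

// Hypothetical accounting of memory held by background writes.
class background_write_budget {
    std::size_t _limit;
    std::size_t _in_use = 0;
public:
    explicit background_write_budget(std::size_t limit) : _limit(limit) {}

    // Called when a write starts holding memory.
    void add(std::size_t bytes) { _in_use += bytes; }

    // Called when a write completes; clamps at zero.
    void release(std::size_t bytes) {
        _in_use -= (bytes > _in_use ? _in_use : bytes);
    }

    // A write may be answered early (moved to the background) only
    // while the memory in use is below the limit.
    bool may_reply_early() const { return _in_use < _limit; }
};
```

Under these assumptions, one instance would be sized at 10% of total memory
for held mutations and another at 6M for the rpc queue.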
I am sure it's a compiler issue but I am not ready to give up and
upgrade just yet:
sstables/compaction.cc:307:55: error: converting to ‘std::unordered_map<int, long int>’ from initializer list would use explicit constructor ‘std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::unordered_map(std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type, const hasher&, const key_equal&, const allocator_type&) [with _Key = int; _Tp = long int; _Hash = std::hash<int>; _Pred = std::equal_to<int>; _Alloc = std::allocator<std::pair<const int, long int> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type = long unsigned int; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hasher = std::hash<int>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::key_equal = std::equal_to<int>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::allocator_type = std::allocator<std::pair<const int, long int> >]’
stats->start_size, stats->end_size, {});
Test was failing because _qp (distributed<cql3::query_processor>) was stopped
before _db (distributed<database>).
Compaction manager is member of database, and when database is stopped,
compaction manager is also stopped. After a2fb0ec9a, compaction updates the
system table compaction history, and that requires a working query context.
We cannot simply move _qp->stop() to after _db->stop() because the former
relies on migration_manager and storage_proxy. So the most obvious fix is to
clean the global variable that stores query context after _qp was stopped.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The previous patch added message_service read()/write() support for all
types which know how to serialize themselves through our "old" serialization
API (serialize()/deserialize()/serialized_size()).
So we no longer need the almost 200 lines of repetitive code in
messaging_service.{cc,hh} which defined these read/write templates
separately for a dozen different types using their *serialize() methods.
We also no longer need the helper functions read_gms()/write_gms(), which
are basically the same code as that in the template functions added in the
previous patch.
Compilation is not significantly slowed down by this patch, because it
merely replaces a dozen templates by one template that covers them all -
it does not add new template complexity, and these templates are anyway
instantiated only in messaging_service.cc (other code only calls specific
functions defined in messaging_service.cc, and does not use these templates).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, messaging_service only supports sending types for which a read/
write function has been explicitly implemented in messaging_service.hh/cc.
Some types already have serialization/deserialization methods inside them,
and those could have been used for the serialization without having to write
new functions for each of these types. Many of these types were already
supported explicitly in messaging_service.{cc,hh}, but some were forgotten -
for example, dht::token.
So this patch adds a default implementation of messaging_service write()/read()
which will work for any type which has these serialization methods.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* seastar 5b9e3da...294ea30 (9):
> Merge "IO queues" from Glauber
> reactor: increment check_direct_io_support to also deal with files
> Merge "SSL/TLS initial certificate validation" from Calle
> tutorial.md: remove inaccurate statements about x86
> build: verify that the installed compiler is up to date
> build: complain if fossil version of gnutls is installed
> build: fix debian naming of gnutls-devel package
> build: add configure-time check for gnutls-devel
> tutorial.md: introduction to asynchrnous programming
send_to_live_endpoints() is never waited upon; it does its job in the
background. This patch formalizes that by changing the return value to void
and also refactoring the code so that the frozen_mutation shared pointer is not
held longer than it should be: currently it is held until send_mutation()
completes, but since send_mutation() does not use frozen_mutation
asynchronously this is not necessary.
Replace db_clock::now_in_usec() and db_clock::now() * 1000 accesses
where the intent is to create a new auto-generated cell timestamp with
a call to new_timestamp(). Now the knowledge of how to create timestamps
is in a single place.
"get_compactions returns progress information for each compaction
running in the system. It can be accessed using swagger UI.
'nodetool compactionstats' is not working yet because of some
pending work in the nodetool side."
Apparently, the link hook copy constructor is a no-op and the move constructor
doesn't exist, so the code is correct, but that explicit move makes the code
needlessly confusing.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
That's important for compaction stats API that will need stats
data of each ongoing compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This list will store compaction_stats for each ongoing compaction.
That's why register and deregister methods are provided.
This change is important for compaction stats API that needs data
of each ongoing compaction, such as progress, ks, cf, etc.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This patchset will make Scylla update the system table
COMPACTION_HISTORY whenever a compaction job finishes.
Functions were added to both update and retrieve the
content of this system table. Compaction history API
is also enabled in this series."
When compaction job finishes, call function to update the system
table COMPACTION_HISTORY. That's also needed for the compaction
history API.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This method is intended to return content of the system table
COMPACTION_HISTORY as a vector of compaction_history_entry.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If a sstable doesn't belong to current shard, mark_for_deletion
should be called for the deletion manager to still work.
It doesn't mean that the sstable will be deleted, but that the
sstable is not relevant to the current shard, thus it can be
deleted by the deletion manager in the future.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There is a check whose intent was to detect wrap around during walk of
the ring tokens by comparing the split point with minimum token, which
is supposed to be inserted by the ring iterator. It assumed that when
we encounter it, the range is a wrap around. It doesn't hold when
minimum token is part of the token metadata or set of tokens is empty.
In such case, a full range would be split into 3 overlapping full
ranges. The fix is to drop the assumption and instead ensure that
ranges do not wrap around by unwrapping them if necessary.
Fixes #655.
The default move assignment operator calls boost::intrusive::set's move
assignment operator, which leaks, because it does not believe it owns
the data.
Fix by providing a custom implementation.
All components of a prefixable compound type are preceded by their
length, which makes them not byte-order comparable.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Frozen collection type names must be wrapped in FrozenType so that we
are able to store the types correctly in system tables.
This fixes #646 and fixes #580.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
The test_assignement() function is invoked via the Cassandra unit tests
so we might as well implement it.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
It is 30 seconds instead of 5 seconds by default, to align with c*.
Please note, after this a node will take at least 30 seconds to complete
a bootstrap.
Originally, the large allocation test case attempted to allocate an object
as big as half of the space used by the lsa. That failed when the test
was executed with a lower amount of memory available, mainly due to the
memory fragmentation caused by previous test cases.
This patch reduces the size of the large allocation to 3/8 of the
total space used by the lsa, which is still a lot but seems to make the
test pass even with as little memory as 64MB per shard.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If we get a core dump from a user, it is important to be able to
identify its version. Copy the release string into the heap (which is
copied into the core dump), so we can search for it using the "strings"
or "ident" commands.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
This fixes compile error:
In function `logalloc::segment_zone::segment_zone()':
/home/lmr/Code/scylla/utils/logalloc.cc:412: undefined reference to `logalloc::segment_zone::minimum_size'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
blob_storage is defined with attribute packed, which makes its alignment
requirement equal to 1. This means that its members may be unaligned.
GCC is obviously aware of that and will generate appropriate code
(and not generate ubsan checks). However, there are a few places where
members of blob_storage are accessed via pointers; these have to be
wrapped by unaligned_cast<> to let the compiler know that the location
pointed to may not be aligned properly.
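The underlying problem can be shown without seastar. The sketch below swaps
in std::memcpy, a portable technique for reading a possibly unaligned value,
in place of unaligned_cast<>; packed_header and load_u32() are hypothetical
and are not part of blob_storage.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical packed struct; __attribute__((packed)) is a GCC/Clang
// extension that drops the alignment requirement to 1.
struct __attribute__((packed)) packed_header {
    std::uint8_t tag;
    std::uint32_t size;  // sits at offset 1, so it is not 4-byte aligned
};

// Reads a possibly unaligned 32-bit value without dereferencing a
// std::uint32_t*, which would assume 4-byte alignment.
inline std::uint32_t load_u32(const void* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}
```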
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
From Paweł:
This series fixes support for clustering keys whose trailing components
are null. The solution is to use clustering_key_prefix instead of
clustering_key everywhere.
Fixes #515.
Schemas using compact storage can have clustering keys with the trailing
components not set, effectively being clustering key prefixes
instead of full clustering keys.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In case of non-compound dense tables the column name is just the value
of the clustering key (which has only one component). Current code just
casts clustering_key to bytes_view which works because there is no
additional metadata in single element clustering keys.
However, that may change when the internal representation of clustering
key is changed so explicitly extract the proper component.
This change will become necessary when clustering_key is replaced by
clustering_key_prefix.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In case of schemas that use compact storage it is possible that trailing
components of clustering keys are not set.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When this tool was written, we were still using /var/lib/cassandra as a default
location. We should update it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar 5dc22fa...c5e595b (3):
> memory: be less strict about NUMA bindings
> reactor: let the resource code specify the default memory reserve
> resource: reserve even more memory when hwloc is compiled in
Fixes #642
"This series attempts to make LSA more friendly for large (i.e. bigger
than LSA segment) allocations. It is achieved by introducing segment
zones – large, contiguous areas of segments and using them to allocate
segments instead of calling malloc() directly.
Zones can be shrunk when needed to reclaim memory and segments can be
migrated either to reduce the number of zones or to defragment one in order
to be able to shrink it. LSA tries to keep all segments at the lower
addresses and reclaims memory starting from the zones in the highest
parts of the address space."
Also:
[PATCH scylla v1 0/7] gossip mark node down fix + cleanup
[PATCH scylla v1 0/2] Refuse decommissioned node to rejoin
[PATCH scylla] storage_service: Fix added node not showing up in nodetool in status joining
When replacing a node, we might ignore the tokens so that the token set is
empty. In this case, we will have
std::unordered_map<inet_address, std::unordered_set<token>> = {ip, {}}
passed to token_metadata::update_normal_tokens(std::unordered_map<inet_address,
std::unordered_set<token>>& endpoint_tokens)
and hit the assert
assert(!tokens.empty());
1) Start node 1, node 2, node 3
2) Stop node 3
3) Start node 4 to replace node 3
4) Kill node 4 (removal of node 3 in system.peers is not flushed to disk)
5) Start node 4 (will load node 3's token and host_id info in bootup)
This causes the
"Token .* changing ownership from 127.0.0.3 to 127.0.0.4"
messages to be printed again in step 5), which is not expected and fails the dtest
FAIL: replace_first_boot_test (replace_address_test.TestReplaceAddress)
----------------------------------------------------------------------
Traceback (most recent call last):
File "scylla-dtest/replace_address_test.py",
line 220, in replace_first_boot_test
self.assertEqual(len(movedTokensList), numNodes)
AssertionError: 512 != 256
In commit 56df32ba56 ("gossip: Mark node as dead even if already left"),
a node liveness check was missed.
Fix it up.
Before: (mark a node down multiple times)
[Tue Dec 8 12:16:33 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:33 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:34 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:34 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:35 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:35 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:36 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
After: (mark a node down only one time)
[Tue Dec 8 12:28:36 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:28:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
The only reason we needed it is to make
_application_state[key] = value
work.
With the current default constructor, we increase the version number
needlessly. To fix and to be safe, remove the default constructor
completely.
Backport: CASSANDRA-8801
a53a6ce Decommissioned nodes will not rejoin the cluster.
Tested with:
topology_test.py:TestTopology.decommissioned_node_cant_rejoin_test
The get_token_endpoint API should return a map of tokens to endpoints,
including the bootstrapping ones.
Use get_local_storage_service().get_token_to_endpoint_map() for it.
$ nodetool -p 7100 status
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 127.0.0.1 12645 256 ? eac5b6cf-5fda-4447-8104-a7bf3b773aba rack1
UN 127.0.0.2 12635 256 ? 2ad1b7df-c8ad-4cbc-b1f1-059121d2f0c7 rack1
UN 127.0.0.3 12624 256 ? 61f82ea7-637d-4083-acc9-567e0c01b490 rack1
UJ 127.0.0.4 ? 256 ? ced2725e-a5a4-4ac3-86de-e1c66cecfb8d rack1
Fixes #617
Originally, lsa allocated each segment independently, which could result
in high memory fragmentation. As a result, many compaction and eviction
passes might be needed to release a sufficiently big contiguous memory
block.
These problems are solved by the introduction of segment zones, contiguous
groups of segments. All segments are allocated from zones and the
algorithm tries to keep the number of zones to a minimum. Moreover,
segments can be migrated between zones or inside a zone in order to deal
with fragmentation inside a zone.
Segment zones can be shrunk but cannot grow. The segment pool keeps a tree
containing all zones ordered by their base addresses. This tree is used
only by the memory reclaimer. There is also a list of zones that have
at least one free segment, which is used during allocation.
Segment allocation has no preference as to which segment (and zone)
to choose. Each zone contains a free list of unused segments. If there
are no zones with free segments, a new one is created.
Segment reclamation migrates segments from the zones higher in memory
to the ones at lower addresses. The remaining zones are shrunk until the
requested number of segments is reclaimed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
A dynamic bitset implementation that provides functions to search for
both set and cleared bits in both directions.
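As a rough illustration of the interface described above, here is a minimal Python sketch. The class and method names are hypothetical and are not taken from the patch. It uses a simple linear scan, whereas a real implementation would presumably work a machine word at a time.

```python
class DynamicBitset:
    """Illustrative bitset; all names here are hypothetical, not the patch's API."""

    npos = -1  # returned when no matching bit is found

    def __init__(self, size):
        self._size = size
        self._bits = 0  # a Python int used as the bit storage

    def _check(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)

    def set(self, i):
        self._check(i)
        self._bits |= 1 << i

    def clear(self, i):
        self._check(i)
        self._bits &= ~(1 << i)

    def test(self, i):
        self._check(i)
        return (self._bits >> i) & 1 == 1

    def _scan(self, indices, want):
        # Return the first index in `indices` whose bit equals `want`.
        for i in indices:
            if self.test(i) == want:
                return i
        return self.npos

    # Forward searches.
    def find_first_set(self):
        return self._scan(range(self._size), True)

    def find_next_set(self, i):
        return self._scan(range(i + 1, self._size), True)

    def find_first_clear(self):
        return self._scan(range(self._size), False)

    def find_next_clear(self, i):
        return self._scan(range(i + 1, self._size), False)

    # Backward searches.
    def find_last_set(self):
        return self._scan(range(self._size - 1, -1, -1), True)

    def find_previous_set(self, i):
        return self._scan(range(i - 1, -1, -1), True)

    def find_last_clear(self):
        return self._scan(range(self._size - 1, -1, -1), False)

    def find_previous_clear(self, i):
        return self._scan(range(i - 1, -1, -1), False)


bitset = DynamicBitset(10)
bitset.set(2)
bitset.set(7)
```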
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently the test case "Testing reading when memory can't be reclaimed."
assumes that the allocation section used by row cache upon entering
will require more free memory than there is available (including evictable).
However, the reserves used by allocation section are adjusted
dynamically and depend solely on previous events. In other words there
is no guarantee that the reserve would be increased so much that the
allocation will fail.
The problem is solved by adding another allocation that is guaranteed
to be bigger than all evictable and free memory.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Scattering of blobs from Avi:
This patchset converts the stack to scatter managed_bytes in lsa memory,
allowing large blobs (and collections) to be stored in memtable and cache.
Outside memtable/cache, they are still stored sequentially, but it is assumed
that the number of transient objects is bounded.
The approach taken here is to scatter managed_bytes data in multiple
blob_storage objects, but to linearize them back when accessing (for
example, to merge cells). This allows simple access through the normal
bytes_view. It causes an extra two copies, but copying a megabyte twice
is cheap compared to accessing a megabyte's worth of small cells, so
per-byte throughput is increased.
Testing shows that lsa large object space is kept at zero, but throughput
is bad because Scylla easily overwhelms the disk with large blobs; we'll
need Glauber's throttling patches or a really fast disk to see good
throughput with this.
Add linearize() and unlinearize() methods that allow making an
atomic_cell_or_collection object temporarily contiguous, so we can examine
it as a bytes_view.
Instead of allocating a single blob_storage, chain multiple blob_storage
objects in a list, each limited not to exceed the allocation_strategy's
max_preferred_allocation_size. This allows lsa to allocate each blob_storage
object as an lsa managed object that can be migrated in memory.
Also provide linearize()/scatter() methods that can be used to temporarily
consolidate the storage into a single blob_storage. This makes the data
contiguous, so we can use a regular bytes_view to examine it.
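The chunking idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the chunk limit below is a made-up stand-in for the allocation_strategy's max_preferred_allocation_size, and plain byte strings stand in for blob_storage objects.

```python
# Hypothetical limit, deliberately tiny so the split is visible.
MAX_PREFERRED_ALLOCATION_SIZE = 4


def scatter(data, max_chunk=MAX_PREFERRED_ALLOCATION_SIZE):
    """Split data into a chain of chunks, none longer than max_chunk."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [data[i:i + max_chunk] for i in range(0, len(data), max_chunk)]


def linearize(chunks):
    """Consolidate the chain into one contiguous buffer for examination."""
    return b"".join(chunks)


chunks = scatter(b"0123456789")
contiguous = linearize(chunks)
```

Each chunk is small enough to be treated as an independently movable object, while linearize() gives back the contiguous view that callers need.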
This adds the implementation for the index_summary_off_heap_memory for a
single column family and for all of them.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Similar to origin's off-heap memory, memory_footprint is the size of the
queues multiplied by the structure size.
memory_footprint is used by the API to report the memory that is taken
by the summary.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
If there is no snapshot directory for the specific column family,
get_snapshot_details should return an empty map.
This patch checks that the directory exists before trying to iterate over
it.
Fixes #619
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Scylla changes:
sstable.cc: Remove file_exists() function which conflicts with seastar's
Amnon Heiman (2):
reactor: Add file_exists method
Add a wrapper for file_exists
Avi Kivity (2):
Merge "Introduce shared_future" from Tomasz
Merge "scripts: a few fixes in posix_net_conf.sh" from Vlad
Gleb Natapov (3):
rpc: not stop client in error state
avoid allocation in parallel_for_each is there is nothing to do
memory: fix size_to_idx calculation
Nadav Har'El (1):
test: fix use-after-free in timertest
Paweł Dziepak (1):
memory: use size instead of old_size to shrink memory block
Tomasz Grabiec (7):
file: Mark move constructor as noexcept
core: future: Add static asserts about type's noexcept guarantees
core: future: Drop now redundant move_noexcept flag
core: future_state: Make state getters non-destructive for non-rvalue-refs
core: future: Make get_available_state() noexcept
core: Introduce shared_future
Make json_return_type movable
Vlad Zolotarov (8):
scripts: posix_net_conf.sh: ban NIC IRQs from being moved by irqbalance
scripts: posix_net_conf.sh: exclude CPU0 siblings from RPS
scripts: posix_net_conf.sh: Configure XPS
scripts: posix_net_conf.sh: Add a new mode for MQ NICs
scripts: posix_net_conf.sh: increase some backlog sizes
core: to_sstring(): cleanup
core: to_sstring_strintf(): always use %g(or %lg) format for floating point values
core: prevent explicit calls for to_sstring_sprintf()
In a recent discussion with the XFS developers, Dave Chinner recommended
that we *not* use discard, but rather issue fstrims explicitly. In machines
like Amazon's c3-class, the situation is made worse by the fact that discard
is not supported by the disk. Contrary to my intuition, adding the discard
mount option in such a situation is *not* a nop and will just create load
for no reason.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Objects extending json_base are not movable, so we won't be able to
pass them via future<>, which will assert that types are nothrow move
constructible.
This problem only affects httpd::utils_json::histogram, which is used
in map-reduce. This patch changes the aggregation to work on domain
value (utils::ihistogram) instead of json objects.
Our premier allocation_strategy, lsa, prefers to limit allocations below
a tenth of the segment size so they can be moved around; larger allocations
are pinned and can cause memory fragmentation.
Provide an API so that objects can query for this preferred size limit.
For now, lsa is not updated to expose its own limit; this will be done
after the full stack is updated to make use of the limit, or intermediate
steps will not work correctly.
The config file expresses this number in MB, while total_memory() gives us
a quantity in bytes. This causes the commitlog not to flush until we reach
really sky-high numbers.
While we need this fix for the short term before we cook another release,
I will note that for the mid/long term, it would be really helpful to stop
representing memory amounts as integers, and use an explicit C++ type for
those. That would have prevented this bug.
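To illustrate the suggestion, here is a minimal Python sketch of an explicit size type. The names ByteSize and should_flush are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption. The point is only that a comparison between a raw integer and a typed quantity is rejected instead of silently mixing units.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ByteSize:
    """A memory amount that always carries its unit (hypothetical type)."""
    value: int  # always in bytes

    @classmethod
    def from_bytes(cls, n):
        return cls(n)

    @classmethod
    def from_mb(cls, mb):
        # Assumption: 1 MB is taken to mean 1024 * 1024 bytes here.
        return cls(mb * 1024 * 1024)


def should_flush(used, limit):
    """Compare two sizes; raw integers are rejected to avoid unit mix-ups."""
    if not isinstance(used, ByteSize) or not isinstance(limit, ByteSize):
        raise TypeError("memory amounts must be ByteSize, not raw integers")
    return used >= limit


limit = ByteSize.from_mb(8)  # the config file value is in MB
flush_now = should_flush(ByteSize.from_bytes(9 * 1024 * 1024), limit)
flush_early = should_flush(ByteSize.from_bytes(8), limit)

try:
    should_flush(8, limit)  # a bare number with no unit
    raw_int_rejected = False
except TypeError:
    raw_int_rejected = True
```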
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Print a map in the form of [(]{ key0 : value0 }[, { keyN : valueN }]*[)]
The map is printed inside () brackets if it's frozen.
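A small Python sketch of the format described above, for illustration only. The function name format_map is hypothetical, and the handling of an empty map is an assumption, since the grammar above always shows at least one entry.

```python
def format_map(entries, frozen=False):
    """Render (key, value) pairs as '{ k : v }, { k : v }'.

    A frozen map is wrapped in parentheses. An empty map yields an empty
    body, which is an assumption not covered by the format above.
    """
    body = ", ".join("{ %s : %s }" % (key, value) for key, value in entries)
    return "(" + body + ")" if frozen else body


plain = format_map([("a", 1), ("b", 2)])
frozen = format_map([("a", 1), ("b", 2)], frozen=True)
```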
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In origin, there are two APIs to get information about the currently
running compactions. Both APIs do the string formatting.
This patch changes the API to have a single API, get_compaction, that
returns a list of summary objects.
The jmx would do the string formatting for the two APIs.
This change gives a better API experience, as it's better documented and
would make it easier to support future format changes in origin.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
That's what we're trying to standardize on.
This patch also fixes an issue with the current query::result::serialize()
not being const-qualified, because it modifies the
buffer. messaging_service did a const cast to work around this, which
is not safe.
This patch adds the implementation of get_version.
After this patch the following URL will be available:
messaging_service/version?addr=127.0.0.1
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"This series allows the compaction manager to be used by the nodetool as a stub implementation.
It has two changes:
* Add to the compaction manager API a method that returns a compaction info
object
* Stub all the compaction methods so that they will create an unimplemented
warning but will not fail; the API implementation will be reverted when the
work on compaction is completed."
This patch fixes the following cql_query_test failure.
cql_query_test: scylla/seastar/core/sharded.hh:439:
Service& seastar::sharded<Service>::local() [with Service =
gms::gossiper]: Assertion `local_is_initialized()' failed.
The problem is that in gossiper::stop() we call gossip::add_local_application_state(),
which will in turn call gms::get_local_gossiper(). In seastar::sharded::stop
_instances[engine().cpu_id()].service = nullptr;
return inst->stop().then([this, inst] {
return _instances[engine().cpu_id()].freed.get_future();
});
We set the _instances to nullptr before we call the stop method, so
local_is_initialized asserts when we try to access get_local_gossiper
again.
To fix, we make the stopping of gossiper explicit. In the shutdown
procedure, we call stop_gossiping() explicitly.
This has two more advantages:
1) The api to stop gossip is now calling the stop_gossiping() instead of
sharing the seastar::sharded's stop method.
2) We can now get rid of the _handler seastar::sharded helper.
The add interface of the estimated histogram is confusing as it is not
clear what units are used.
This patch removes the general add method and replaces it with add_nano,
which adds nanoseconds, and an add that takes a duration.
To be compatible with origin, nanosecond values are translated to
microseconds.
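The unit handling can be sketched as follows in Python. This is a simplified illustration, not the real estimated histogram: it keeps a plain list instead of buckets, the class name is hypothetical, and truncating (rather than rounding) when converting to microseconds is an assumption.

```python
from datetime import timedelta


class EstimatedHistogramSketch:
    """Records samples in microseconds; buckets are omitted for brevity."""

    def __init__(self):
        self.samples_us = []

    def add_nano(self, nanos):
        # Nanoseconds are translated to microseconds. Truncation is an
        # assumption of this sketch.
        self.samples_us.append(nanos // 1000)

    def add(self, duration):
        # A typed duration removes any doubt about the unit being passed.
        micros = duration // timedelta(microseconds=1)
        self.add_nano(micros * 1000)


hist = EstimatedHistogramSketch()
hist.add_nano(1500)
hist.add(timedelta(milliseconds=2))
```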
This patch adds a started counter, which is used to track the number of
operations that were started.
This counter serves two purposes: it is a better indication of when to
sample the data, and it is used to indicate how many operations are
pending.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the column family API that returns the snapshot size.
The changes in the swagger definition file follow origin so the same API
will be used for the metric and the column_family.
The implementation is based on the get_snapshot_details in the
column_family.
This fix:
425
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Backport: CASSANDRA-10330
ae4cd69 Print versions for gossip states in gossipinfo
For instance, the version for each state, which can be useful for
diagnosing the reason for any missing states. Also instead of just
omitting the TOKENS state, let's indicate whether the state was actually
present or not.
With
Node 1 (Seed node, Port 7000 is opened, 10.184.9.144)
Node 2 (Port 7000 is opened, 10.184.9.145)
Node 3 (Port 7000 is blocked by firewall)
On Node 3, we saw the following error which was very confusing: Node 3
saw Node 1 and Node 2 but it complained it cannot contact any seeds.
The message "Node 10.184.9.144 is now part of the cluster" and friends
are actually messages printed during the gossip shadow round where Node
3 connects to Node 1's port 7000 and Node 1 returns all info it knows to
Node 3, so that Node 3 knows Node 1 and Node 2 and we see the "Node
10.184.9.144/145 is now part of the cluster" message.
However, during the normal gossip round, Node 3 will not mark Node 1 and
Node 2 UP until the Seed node initiates a gossip round to Node 3, (note
port 7000 on node 3 is blocked in this case). So Node 3 will not mark
Node 1 and Node 2 UP and we see the "Unable to contact any seeds" error.
[shard 0] storage_service - Loading persisted ring state
[shard 0] gossip - Node 10.184.9.144 is now part of the cluster
[shard 0] gossip - inet_address 10.184.9.144 is now UP
[shard 0] gossip - Node 10.184.9.145 is now part of the cluster
[shard 0] gossip - inet_address 10.184.9.145 is now UP
[shard 0] storage_service - Starting up server gossip
scylla_run[12479]: Start gossiper service ...
[shard 0] storage_service - JOINING: waiting for ring information
[shard 0] storage_service - JOINING: schema complete, ready to bootstrap
[shard 0] storage_service - JOINING: waiting for pending range calculation
[shard 0] storage_service - JOINING: calculation complete, ready to bootstrap
[shard 0] storage_service - JOINING: getting bootstrap token
[shard 0] storage_service - JOINING: sleeping 5000 ms for pending range setup
scylla_run[12479]: Exiting on unhandled exception of type 'std::runtime_error': Unable to contact any seeds!
Backported: CASSANDRA-8336 and CASSANDRA-9871
84b2846 remove redundant state
b2c62bb Add shutdown gossip state to prevent timeouts during rolling restarts
8f9ca07 Cannot replace token does not exist - DN node removed as Fat Client
Fixes:
When X is shut down, X sends a SHUTDOWN message to both Y and Z, but for
some reason only Y receives the message and Z does not.
If Z has a higher gossip version for X than Y has for
X, Z will initiate a gossip with Y and Y will mark X alive again.
X ------> Y
\ /
\ /
Z
Fixes: #593
"Changes the parser/replayer to treat data corruption as non-fatal,
skipping as little as possible to get the most data out of a segment,
but keeping track of, and reporting, the amount corrupted.
Replayer handles this and reports any non-fatal errors on replay finish.
Also added tests for corruption cases.
This patch series contains a cleanup-patch for commitlog_tests that was
previously submitted, but got lost."
If something bad happens between write request handler creation and
request execution, the request handler has to be destroyed. Currently
the code tries to do that explicitly in all places where a request may be
abandoned, but it misses some (at least one). This patch replaces this
by introducing a unique_response_handler object that will remove the handler
automatically if the request is not executed for some reason.
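The ownership pattern can be illustrated with a Python context manager. This is only a sketch of the idea: the registry dictionary and the release() method are hypothetical stand-ins, and the real object is a C++ type relying on its destructor.

```python
class UniqueResponseHandler:
    """Removes a registered handler on exit unless ownership was released."""

    def __init__(self, registry, handler_id):
        self._registry = registry
        self._handler_id = handler_id
        self._released = False

    def release(self):
        # The request is being executed; the handler must stay registered.
        self._released = True
        return self._handler_id

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if not self._released:
            self._registry.pop(self._handler_id, None)
        return False  # never swallow the exception


registry = {1: "handler-1", 2: "handler-2"}

# Normal path: the request is executed, so the handler is kept.
with UniqueResponseHandler(registry, 1) as guard:
    guard.release()

# Failure path: something goes wrong before execution.
try:
    with UniqueResponseHandler(registry, 2):
        raise RuntimeError("request abandoned")
except RuntimeError:
    pass
```

With this shape there is no need to remember a cleanup call at every place where a request may be abandoned.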
Rename antlr3-tool to antlr3 (same as distribution package), and use distribution version if it's available
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
"Before this change, populations could race with update from flushed
memtable, which might result in cache being populated with older
data. Populations started before the flush are not considering the
memtable nor its sstable.
The fix employed here is to make update wait for populations which
were started before the flushed memtable's sstable was added to the
underlying data source. All populations started after that are
guaranteed to see the new data. The update() call will wait only for
current populating reads to complete, it will not wait for readers to
get advanced by the consumer for instance."
To avoid a race where natural endpoint was updated to contain node A,
but A was not yet removed from pending endpoints.
This fixes the root cause of commit d9d8f87c1 (storage_proxy: filter out
natural endpoints from pending endpoint). This patch alone fixes #539,
but we still want commit d9d8f87c1 to be safe.
When other bootstrapping/leaving/moving nodes are found during
bootstrap, instead of throwing immediately, sleep and try again for one
minute, hoping other nodes will finish the operation soon.
Since we are retrying using shadow gossip round more than once, we need
to put the gossip state back to shadow round after each shadow round, to
make the shadow round work correctly.
This is useful when starting an empty cluster for testing. E.g,
$ scylla --listen-address 127.0.0.1
$ sleep 3
$ scylla --listen-address 127.0.0.2
$ sleep 3
$ scylla --listen-address 127.0.0.3
Without this patch, node 3 will hit the check.
TIME STATUS
-----------------------
Node 1:
32:00 Starts
32:00 In NORMAL status
Node 2:
32:03 Starts
32:04 In BOOT status
32:10 In NORMAL status
Node 3:
32:06 Starts
32:06 Found node 2 in BOOT status, hit the check, sleep and try again
32:11 Found node 2 in NORMAL status, can keep going now
32:12 In BOOT status
32:18 In NORMAL status
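The retry behaviour described in this commit can be sketched in Python as below. The function and parameter names are hypothetical. The one-minute limit comes from the description above, while the 5 second interval is only an assumption suggested by the timeline.

```python
import time


def wait_while_pending_ops(has_pending_ops, timeout=60.0, interval=5.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll until no other node is bootstrapping, leaving or moving.

    Returns True once the cluster is quiet, or False if other nodes are
    still busy after `timeout` seconds. `sleep` and `clock` are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while has_pending_ops():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True


# Simulated run: node 2 is seen in BOOT status twice, then in NORMAL status.
fake_now = [0.0]
observations = iter([True, True, False])
cluster_quiet = wait_while_pending_ops(
    lambda: next(observations),
    sleep=lambda seconds: fake_now.__setitem__(0, fake_now[0] + seconds),
    clock=lambda: fake_now[0],
)
```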
When other bootstrapping/leaving/moving nodes are found during
bootstrap, instead of throwing immediately, sleep and try again for one
minute, hoping other nodes will finish the operation soon.
This is useful when starting an empty cluster for testing. E.g,
$ scylla --listen-address 127.0.0.1
$ scylla --listen-address 127.0.0.2
$ scylla --listen-address 127.0.0.3
Without this patch, node 3 will hit the check.
TIME STATUS
-----------------------
Node 1:
25:19 Starts
25:20 In NORMAL status
Node 2:
25:19 Starts
25:23 In BOOT status
25:28 In NORMAL status
Node 3:
25:19 Starts
25:24 Found node 2 in BOOT status, hit the check, sleep and try again
25:29 Found node 2 in NORMAL status, can keep going now
25:29 In BOOT status
25:34 In NORMAL status
Before this change, populations could race with update from flushed
memtable, which might result in cache being populated with older
data. Populations started before the flush are not considering the
memtable nor its sstable.
The fix employed here is to make update wait for populations which
were started before the flushed memtable's sstable was added to the
underlying data source. All populations started after that are
guaranteed to see the new data.
The text data type is no longer present in CQL binary protocol v3 and
later. We don't need it for encoding earlier versions either because
it's an alias for varchar which is present in all CQL binary protocol
versions.
Fixes #526.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
This patch plus Pekka's previous commit 3c72ea9f96
"gms: Fix gossiper::handle_major_state_change() restart logic"
fix CASSANDRA-7816.
Backported from:
def4835 Add missing follow on fix for 7816 only applied to
cassandra-2.1 branch in 763130bdbde2f4cec2e8973bcd5203caf51cc89f
763130b Followup commit for 7816
2199a87 Fix duplicate up/down messages sent to native clients
Tested by:
pushed_notifications_test.py:TestPushedNotifications.restart_node_test
CQL 3.2.1 introduces a "TRUNCATE TABLE X" alias for "TRUNCATE X":
4e3555c1d9
Fix our CQL grammar to also support that.
Please note that we don't bump up advertised CQL version yet because our
cqlsh clients won't be able to connect by default until we upgrade them
to C* 2.1.10 or later.
Fixes #576
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
The FIXMEs are no longer valid, we load schema on bootstrap and don't
support hot-plugging of column families via file system (nor does
Cassandra).
Handling of missing tables matches Cassandra 2.1: applies log
the error and continue, while queries propagate it.
If a request comes after the natural endpoints were updated to contain node A,
but before A was removed from the pending endpoints, A will be in both and
the write request logic cannot handle this properly. Filter nodes which are
already in the natural endpoints out of the pending endpoints to fix this.
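The filtering step amounts to a set difference that preserves the order of the pending list. A minimal Python sketch follows; the function name and the plain string addresses are illustrative only.

```python
def filter_pending_endpoints(natural, pending):
    """Drop pending endpoints that already appear among the natural ones."""
    natural_set = set(natural)
    return [endpoint for endpoint in pending if endpoint not in natural_set]


natural = ["127.0.0.1", "127.0.0.2"]
pending = ["127.0.0.2", "127.0.0.3"]  # 127.0.0.2 is in both lists
filtered = filter_pending_endpoints(natural, pending)
```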
Fixes #539.
boost::heap::binomial_heap allocates a helper object in push() and,
therefore, may throw an exception. This shouldn't happen during
compaction.
The solution is to reserve space for this helper object in
segment_descriptor and use a custom allocator with
boost::heap::binomial_heap.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
LSA memory reclaimer logic assumes that the amount of memory used by LSA
equals: segments_in_use * segment_size. However, LSA is also responsible
for eviction of large objects which do not affect the used segment count,
e.g. region with no used segments may still use a lot of memory for
large objects. The solution is to switch from measuring memory in used
segments to used bytes count that includes also large objects.
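A tiny Python sketch of the difference between the two measures. The segment size and the object sizes below are made-up numbers used only for illustration.

```python
SEGMENT_SIZE = 256 * 1024  # hypothetical segment size in bytes


def used_by_segment_count(segments_in_use):
    """Old estimate: only whole segments are counted."""
    return segments_in_use * SEGMENT_SIZE


def used_bytes(segments_in_use, large_object_sizes):
    """New measure: segments plus large objects allocated outside them."""
    return used_by_segment_count(segments_in_use) + sum(large_object_sizes)


# A region with no used segments that still holds two large objects.
large_objects = [1024 * 1024, 2 * 1024 * 1024]
old_estimate = used_by_segment_count(0)
new_estimate = used_bytes(0, large_objects)
```

With the old estimate such a region looks empty to the reclaimer, even though it holds memory that could be evicted.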
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Since this won't check disk types, it may re-initialize RAID on EBS when the first block was lost.
But in such a condition, re-initializing RAID is probably the only choice we have, so this is fine.
Fixes #364.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
With this patch, start two nodes
node 1:
scylla --rpc-address 127.0.0.1 --broadcast-rpc-address 127.0.0.11
node 2:
scylla --rpc-address 127.0.0.2 --broadcast-rpc-address 127.0.0.12
On node 1:
cqlsh> SELECT rpc_address from system.peers;
rpc_address
-------------
127.0.0.12
which means clients should use this address to connect to node 2 for the cql and
thrift protocols.
It is the same as
-Dcassandra.consistent.rangemovement
in cassandra.
Use it as:
$ scylla --consistent-rangemovement 0
or
$ scylla --consistent-rangemovement 1
Messaging service closes the connection in the rpc call continuation on
closed_error, but the code runs for each outstanding rpc call on the
connection, so the first continuation may destroy a genuinely closed connection,
then the connection is reopened and the next continuation that handles the previous
error kills the now perfectly healthy connection. Fix this by closing the
connection only in error state.
From Avi:
Origin supports a notion of empty values for non-container types; these
are serialized as zero-length blobs. They are mostly useless and only
retained for compatibility.
The implementation here introduces a wrapper maybe_empty<T>, similar to
optional<T> but oriented towards usually-nonempty usage with implicit
conversion.
There is more work needed for full empty support: fixing up deserializers to
create empty values instead of nulls, and splitting up data_value into
data_value and a data_value_nonnull for the cases that require it.
(I chose maybe_empty<> rather than using optional<data_value> for nullable
data_value both because it requires fewer changes, and because
optional<data_value> introduces a lot of control flow when moving or copying,
which would be mostly useless in most cases).
This cleanup patch got lost in git-space some time ago. It is however sorely
needed...
* Use cleaner wrapper for creating temp dir + commit log, avoiding
having to clear and clean in every test, etc.
* Remove assertions based on file system checks, since these are not
valid due to both the async nature of the CL, and more to the point,
because of pre-allocation of files and file blocks. Use CL
counters/methods instead
* Fix some race conditions to ensure tests are safe(r)
* Speed up some tests
Discern fatal and non-fatal exceptions, and handle data corruption
by adding to stats and reporting it, but continue processing.
Note that "invalid_argument", i.e. attempting to replay origin/old
segments, is still considered fatal, as it is probably better to
signal this strongly to the user/admin.
Parser object now attempts to skip past/terminate parsing on corrupted
entries/chunks (as detected by invalid sizes/crc:s). The amount of data
skipped is kept track of (as well as we can estimate - pre-allocation
makes it tricky), and at the end of parsing/reporting, IFF errors
occurred, an exception detailing the failures is thrown (since
subscription has little mechanism to deal with this otherwise).
Thus a caller can decide how to deal with data corruption, but will be
given as many entries as possible.
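The overall control flow can be sketched in Python as follows. The on-disk framing used here (a 4-byte length and a 4-byte CRC32 before each payload) is invented for the example and is not the commitlog segment format. The exception name is hypothetical as well.

```python
import struct
import zlib

HEADER = struct.Struct(">II")  # hypothetical framing: payload size, CRC32


class SegmentDataCorruption(Exception):
    def __init__(self, corrupt_bytes):
        super().__init__("skipped %d corrupted bytes" % corrupt_bytes)
        self.corrupt_bytes = corrupt_bytes


def encode_entry(payload):
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(segment, on_entry):
    """Deliver every valid entry, then report how much data was skipped."""
    pos = 0
    corrupt = 0
    while pos < len(segment):
        remaining = len(segment) - pos
        if remaining < HEADER.size:
            corrupt += remaining  # truncated header: stop parsing
            break
        size, crc = HEADER.unpack_from(segment, pos)
        end = pos + HEADER.size + size
        if end > len(segment):
            corrupt += remaining  # invalid size: stop parsing
            break
        payload = segment[pos + HEADER.size:end]
        if zlib.crc32(payload) == crc:
            on_entry(payload)
        else:
            corrupt += end - pos  # bad CRC: skip just this entry
        pos = end
    if corrupt:
        raise SegmentDataCorruption(corrupt)


damaged = bytearray(encode_entry(b"beta"))
damaged[-1] ^= 0xFF  # flip bits in the payload so the CRC no longer matches
segment = encode_entry(b"alpha") + bytes(damaged) + encode_entry(b"gamma")

delivered = []
try:
    replay(segment, delivered.append)
    skipped = 0
except SegmentDataCorruption as error:
    skipped = error.corrupt_bytes
```

The caller still receives the entries before and after the damaged one, and decides what to do with the reported amount of skipped data.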
An empty serialized representation means an empty value, not NULL.
Fix up the confusion by converting incorrect make_null() calls to a new
make_empty(), and removing make_null() in empty-capable types like
bytes_type.
Collections don't support empty serialized representations, so remove
the call there.
Parameter evaluation order is unspecified, so it's possible that the
move of 'schema' into lambda captures would happen before construction of
mutation.
"To speed up boot, parallelism was introduced to our code that loads
sstables from a column family, a function was implemented to read
the minimum from a sstable to determine whether it belongs to the
current shard, and buffer size in read simple is dynamically chosen
based on the size of the file and dma alignment.
The latter is important because filter file can be considerably
large when the respective sstable (data file) is very large.
Before this patchset, scylla took about 5 minutes to boot with a
data directory of 660GB. After this patchset, scylla took about 20
seconds to boot with the same data directory."
Avi says:
"A small buffer size will hurt if we read a large file, but
a large buffer size won't hurt if we read a small file, since
we close it immediately."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, we only determine if a sstable belongs to current shard
after loading some of its components into memory. For example,
filter may be considerably big and its content is irrelevant to
decide if a sstable should be included in a given shard.
Start using the functions previously introduced to optimize the
sstable loading process. add_sstable no longer checks if a sstable
is relevant to the current shard.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Boot may be slow because the function that loads sstables does so
serially instead of in parallel. In the callback supplied to
lister::scan_dir, let's push the future returned by probe_file
(the function that loads an sstable) into a vector of futures and wait
for all of them at the end.
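The "collect the futures, wait at the end" shape can be sketched in Python with a thread pool. This is only an analogy: seastar futures are not threads, and the probe_file below is a hypothetical stand-in that does no real I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def probe_file(name):
    """Hypothetical stand-in for the function that loads one sstable."""
    return ("loaded", name)


def load_all(names, max_workers=4):
    """Start loading every file, then wait for all of them at the end."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Equivalent of pushing each returned future into a vector.
        futures = [pool.submit(probe_file, name) for name in names]
        # Wait for every future; results keep the submission order.
        return [future.result() for future in futures]


results = load_all(["sstable-1", "sstable-2", "sstable-3"])
```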
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We cannot share some dependency package names between 14.04 and 15.10, so we need to add ifdefs.
Not tested on other versions of Ubuntu.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Origin supports (https://issues.apache.org/jira/browse/CASSANDRA-5648) "empty"
values even for non-container types such as int. Use maybe_empty<> to
encapsulate abstract_type::native_type, adding an empty flag if needed.
Similar to optional<>, with the following differences:
- decays back to the encapsulated type, with an emptiness check;
this reflects the expectation that the value will rarely be empty
- avoids conditionals during copy/move (and requires a default constructor),
again with the same expectation.
When we start sending mutations for cf_id to a remote node, the remote node
might not have the cf_id anymore, for instance because the cf was
dropped.
We should not fail the streaming if this happens: since the cf does not
exist anymore, there is no point in streaming it.
Fixes #566
When a new node joins a cluster, it will start a gossip round with the seed
node. However, within this round, the seed node will not tell the new
node anything it knows about other nodes in the cluster, because the
digest in the gossip SYN message contains only the new node itself and
no other nodes. The seed node will pick randomly from the live nodes,
including the newly added node, in do_gossip_to_live_member to start a
gossip round. If the new node is "lucky", the seed node will talk to it very
soon and tell it all the information it knows about the cluster, thus the
new node will mark the seed node alive and think it has seen the seed
node. If there is a considerably large number of live nodes, it might take a
long time before the seed node picks the new node and talks to it.
In bootstrap code, storage_service::bootstrap checks if we see any nodes
after sleep of RING_DELAY milliseconds and throw "Unable to contact any
seeds!" if not, thus the node will fail to bootstrap.
To help the seed node talk to new node faster, we favor new node in
do_gossip_to_live_member.
In origin, get_all_endpoint_states performs all the information
formatting and returns a string.
This is not a good API approach. This patch replaces the implementation
so that the API returns an array of values and the JMX does the
formatting.
This is a better API and will make it simpler to stay in
sync with origin output in the future.
This patch is part of #508
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fixes #551.
Change the mountpoint to /var/lib/scylla and copy conf/ onto it.
Note: conf/ needs to be replaced with a symlink to /etc/scylla when the new rpm is uploaded to the yum repository.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Signed-off-by: Pekka Enberg <penberg@iki.fi>
If we get a partition with no row data, but statics, we should treat this as
a row (include it in the count), but also make sure we skip to the next
partition if our page ends here.
The "end partition" with zero rows but static data can also happen if we
happen to resume paging by giving a column range excluding all data. In this
case we should _not_ include it, since we have already provided the
data in question in the previous page.
Fixes #556
1.) Should not reset to input query state if run repeatedly
2.) And if run repeatedly without input state, likewise keep
the internal one active
Fixes #560
"To keep compatibility with scylla-tools-java, it links /etc/scylla to /var/lib/scylla/conf.
The problem with this patchset is that I added SCYLLA_HOME and SCYLLA_CONF to /etc/sysconfig/scylla-server.
However, since the file is marked as a config file, it won't be upgraded automatically.
If the user doesn't upgrade the file manually, scylla-server is still able to run with /var/lib/scylla/conf because we have the symlink, but it never switches to /etc/scylla."
While the objects above max_manage_object_size aren't stored in the
LSA segments, they are still considered to belong to the LSA
region and are evictable using that region's evictor.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This series adds the natural_endpoints API. It adds the implementation to the storage_service and to the storage_service API.
After this series the nodetool command getendpoints should work.
example:
$ bin/nodetool getendpoints keyspace1 standard1 0x5032394c323239385030127.0.0.2
127.0.0.2"
This patch adds the API for timeout messages and dropped messages.
For dropped messages, origin has two APIs, one for messages and one for
commands.
Dropped messages returns the number of messages per verb, so our API was
renamed to reflect that.
For dropped messages (command), we currently do not have the logic of
throwing away messages before sending, so the API will always return 0.
The total timeout API was removed and will be done on the jmx proxy
level.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
If listen_address is different than broadcast_address, we should use
broadcast_address for the seeds list. Check and ask user to fix the
configuration, e.g.,
$ scylla --rpc-address 127.0.0.1 --listen-address 127.0.0.1 --broadcast-address 192.168.1.100 --seed-provider-parameters seeds=127.0.0.1
Use broadcast_address instead of listen_address for seeds list: seeds={127.0.0.1}, listen_address=127.0.0.1, broadcast_address=192.168.1.100
Exiting on unhandled exception of type 'std::runtime_error': Use broadcast_address for seeds list
The write handler keeps track of all endpoints that have not yet acked the
mutation verb. It uses the broadcast address as an endpoint id, but if the
local address is different from the broadcast address, acknowledgements for
local endpoints will come from a different address, so the socket address
cannot be used as an acknowledgement source. Origin solves this by sending
"from" in each message, which looks like overhead. Solve this instead by
providing the endpoint's broadcast address in rpc client_info and using that.
The restart logic is wrong because C* had a bug in
bf599fb5b062cbcc652da78b7d699e7a01b949ad that they fixed later, and we
translated the broken version. We must check if there is an existing
endpoint state and call on_restart() hooks on that, not the newly
available endpoint state.
Spotted while inspecting the code.
Acked-by: Asias He <asias@scylladb.com>
From Avi:
Memtables do not use an allocating_section to guard against allocation
failure, and hence can fail an allocation. Reproducible by changing
perf_mutation to use an allocating type (bytes_type with a nontrivial
size) and making the loop longer.
Fix by using an allocating_section.
Recently, I have introduced cf_stats into the database, propagating all the way
back to the column family. The problem, however, is that some tests create a
column family config themselves instead of going through make_column_family.
That is ultimately ok if those tests are not expected to flush memtables. But
if they are, the cf_stats pointer will be null and we will crash. Although
there are many solutions to this, the one that is in tune with our current
practices is to have the test that requires it provide an empty cf_stats storage
area that can be written to. That's already how we handle the disk directory and
other things like compaction properties.
With this patch, test.py passes again.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch substitutes int64_t for uint32_t as the type for
commitlog_total_space_in_mb. Moving to 64 bits is not strictly needed, since
even a signed 32-bit type would allow us to easily handle 2TB. But since we
store that in the commitlog as a 64-bit value, let's match it.
Moving from unsigned to signed, however, allows us to represent negative
numbers. With that in place, we can change the semantics of the value
slightly, so as to allow a negative number to mean "all memory".
The reason behind this is that the default value, "8GB", is an artifact of the
JVM. We don't need that, and in a many-shards configuration, each shard flushes
the commitlog way too often, since 8GB / many_shards = small_number.
8GB also happens to be a popular heap size for C* in the JVM. For us, we would
like to equate that (at least) with the amount of memory. The problem is how to
do that without introducing new options or changing the semantics of existing
options too radically.
The proposed solution will allow us to still parse C* yaml files, since those
will always have positive numbers, while introducing our own defaults.
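The intended semantics can be sketched as follows. Both helper names are hypothetical and only illustrate the rule stated above: a negative setting means "all memory", and any other value is taken literally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: turns the configured value into an effective limit
// in megabytes. A negative setting means "use all memory"; a non-negative
// one (as in any Cassandra yaml file) is taken literally.
std::int64_t effective_commitlog_space_mb(std::int64_t configured_mb,
                                          std::int64_t total_memory_mb) {
    return configured_mb < 0 ? total_memory_mb : configured_mb;
}

// Per-shard share of the limit, showing why a fixed 8GB becomes small
// once it is divided among many shards.
std::int64_t per_shard_space_mb(std::int64_t effective_mb, unsigned shards) {
    return effective_mb / static_cast<std::int64_t>(shards);
}
```

With 64 shards, for example, an 8192 MB limit leaves each shard only 128 MB of commitlog before it has to flush.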
Signed-off-by: Glauber Costa <glommer@scylladb.com>
The Debian package system has two types of package, 'native' and 'non-native'.
A 'native' package is one made just for Debian: its source tar.gz contains the debian/ directory, and it has no debian.tar.gz.
A 'non-native' package has an orig.tar.gz, which is the upstream source code tarball, plus a debian.tar.gz, which contains the debian/ directory.
Scylla is 'native' now but should be 'non-native' since it is not just for Debian, so move debian/ to dist/ubuntu/, make orig.tar.gz using git-archive-all, copy dist/ubuntu/debian/ to debian/, then generate debian.tar.gz.
atomic_cell will soon become type-aware, so add helpers to class operation
that can supply the type, as it is available in operation::column.type.
(the type will be used in following patches)
schema_tables manages some boolean columns stored in system tables; it
dynamically creates them from C++ values. But as we lacked bool->data_value
conversion, the C++ value was converted to an int32_type. Somehow this didn't
cause any problems, but with some pending patches I have, it does.
Add a bool->data_value converting constructor to fix this.
Since bytes is a very generic value that is returned from many calls,
it is easy to pass it by mistake to a function expecting a data_value,
and to get a wrong result. It is impossible for the data_value constructor
to know if the argument is a genuine bytes variable, a data_value of another
type, but serialized, or some other serialized data type.
To prevent misuse, make the data_value(bytes) constructor
(and the complementary data_value(optional<bytes>)) explicit.
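The effect of the explicit keyword can be illustrated with a reduced stand-in. The types below are simplified assumptions (bytes is aliased to std::string, and data_value keeps only the one constructor), not the real Scylla classes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

using bytes = std::string;   // stand-in for Scylla's bytes type

// Simplified stand-in for data_value, reduced to the one constructor
// under discussion.
struct data_value {
    bytes raw;
    explicit data_value(bytes b) : raw(std::move(b)) {}
};

std::size_t serialized_size(const data_value& v) { return v.raw.size(); }

// An implicit bytes -> data_value conversion is rejected at compile time...
static_assert(!std::is_convertible<bytes, data_value>::value,
              "bytes must not convert implicitly");
// ...while spelling the conversion out is still allowed.
static_assert(std::is_constructible<data_value, bytes>::value,
              "explicit construction must work");
```

A call such as serialized_size(some_bytes) no longer compiles, so the caller has to write serialized_size(data_value(some_bytes)) and thereby state that the bytes really are meant as a blob value.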
When do_stop_native_transport exits, cserver is destroyed which can
happen before cserver->stop(). Fix by capturing cserver in
cserver->stop()'s continuation to extend its lifetime. The same for
thrift server.
scylla: scylla/seastar/core/sharded.hh:327: seastar::sharded<Service>::~sharded()
[with Service = transport::cql_server]: Assertion `_instances.empty()' failed.
When analyzing a recent performance issue, I found it helpful to keep track of
the number of memtables that are currently in flight, as well as how much memory
they are consuming in the system.
Although those are memtable statistics, I am grouping them under the "cf_stats"
structure: since the column family is a central piece of the puzzle, it is reasonable
to assume that a lot of metrics about it would be potentially welcome in the future.
Note that we don't want to reuse the "stats" structure in the column family: for one,
the fields do not always map precisely (pending flushes, for instance, only tracks explicit
flushes), and also the stats structure is a lot more complex than we need.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
* seastar 5c10d3e...20bf03b (5):
> do not re-throw exception to get to an exception pointer
> Adding timeout counter to the rpc
> configure.py: support for pkg-config before release 0.28
> future: don't forget to warn about ignored exception
> tutorial: continue network API section
Found by debug build
==10190==ERROR: AddressSanitizer: new-delete-type-mismatch on 0x602000084430 in thread T0:
object passed to delete has wrong type:
size of the allocated type: 16 bytes;
size of the deallocated type: 8 bytes.
#0 0x7fe244add512 in operator delete(void*, unsigned long) (/lib64/libasan.so.2+0x9a512)
#1 0x3c674fe in std::default_delete<dht::range_streamer::i_source_filter>::operator()(dht::range_streamer::i_source_filter*)
const /usr/include/c++/5.1.1/bits/unique_ptr.h:76
#2 0x3c60584 in std::unique_ptr<dht::range_streamer::i_source_filter, std::default_delete<dht::range_streamer::i_source_filter> >::~unique_ptr()
/usr/include/c++/5.1.1/bits/unique_ptr.h:236
#3 0x3c7ac22 in void __gnu_cxx::new_allocator<std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> > >::destroy<std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> > >(std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> >*) /usr/include/c++/5.1.1/ext/new_allocator.h:124
...
Fixes #549.
Being clinically absent-minded, aggregate query support (i.e. count(...))
was left out of the "paging" change set.
This adds repeated paged querying to do aggregate queries (similar to
origin). Uses "batched" paging.
Until the compaction manager API is ready, its failing command
causes problems with nodetool-related tests.
This patch stubs the compaction manager logic so it will not fail.
It will be replaced by an actual implementation when the equivalent
code in compaction is ready.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a compaction info object and an API that returns it.
It will be mapped to the JMX getCompactions that returns a map.
The use of an object is more RESTful and will be better documented in
the swagger definition file.
For compatibility reasons, compaction_strategy should accept both the
strategy class name and the full class name that includes the package name.
In origin the returned name depends on the configuration. We cannot mimic
that as we are using an enum for the type.
So currently the returned class name remains the class itself; we can
consider changing it in the future.
If the name is org.apache.cassandra.db.compaction.Name then it will be
compared as Name.
The error message was modified to report the name it was given.
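The name comparison described above can be sketched like this. The helper name short_class_name is hypothetical; it only shows one way to reduce a fully qualified name to the bare class name before comparing:

```cpp
#include <cassert>
#include <string>

// Returns the part after the last '.', so that a fully qualified
// Java-style class name and the bare class name compare equal.
std::string short_class_name(const std::string& name) {
    const auto pos = name.rfind('.');
    return pos == std::string::npos ? name : name.substr(pos + 1);
}
```

A name without any package prefix is returned unchanged, so both spellings end up going through the same lookup.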
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fixes #545
"Slight file format change for commitlog segments, now incluing
a scylla "marker". Allows for fast-fail if trying to load an
Origin segment.
WARNING: This changes the file format, and there is no good way for me to
check if a CL is "old" scylla, or Origin (since "version" is the same). So
either "old" scylla files also fail, or we never fail (until later, and
worse). Thus, if upgrading from an older version to this patch, make sure to
have cleaned out all commit logs first."
Fixes #355
"Implements query paging similar to origin. If driver sets a "page size" in
a query, and we cannot know that we will not exceed this limit in a single
query, the query is performed using a "pager" object, which, using modified
partition ranges and query limits, keeps track of returned rows to "page"
through the results.
Implementation structure sort of mimics the origin design, even though it
is maybe a little bit overkill for us (currently). On the other hand, it
does not really hurt.
This implementation is tested using the "paging_test" subset in dtest.
It passes all tests except:
* test_paging_using_secondary_indexes
* test_paging_using_secondary_indexes_with_static_cols
* test_failure_threshold_deletions
The two first because we don't have secondary indexes yet, the latter
because the test depends on "tombstone_failure_threshold" in origin.
Potential todo: Currently the pager object does not shortcut result
building fully when page limit is exceeded. Could save a little work
here, but probably not very significant."
Allows us to fail fast if someone tries to replay an Origin commit log.
WARNING: This changes the file format, and there is no good way for me to
check if a CL is "old" scylla, or Origin (since "version" is the same). So
either "old" scylla files also fail, or we never fail (until later, and
worse). Thus, if upgrading from an older version to this patch, likewise make
sure to have cleaned out all commit logs first.
* Static query method to determine if paging might be required
(very conservative; almost all queries will be paged, methinks).
* Static factory method for pager
* Actual pager implementation
Pager object uses three variables to keep track of paging state:
1.) Last partition key - partition key of the last partition processed
-> next partition to start processing
2.) Last clustering key, i.e. row offset within the last partition,
i.e. how far we got last time
3.) Max remaining - max rows to process further, i.e. initial limit -
processed so far
Partition ranges are modified/removed so that we begin with "Last key",
if present. (Or end with, in the case of reversed processing)
A counting visitor then keeps count of rows to include in processing.
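The three state variables could be modelled roughly as below. This is a simplified sketch: the struct layout, the use of plain strings for keys, and the helper names are assumptions made for illustration and do not mirror the actual pager classes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

struct paging_state {
    bool has_position = false;        // false until the first page is served
    std::string last_partition_key;   // partition to resume from
    std::string last_clustering_key;  // row offset inside that partition
    std::uint32_t max_remaining = 0;  // initial limit minus rows served so far
};

paging_state start_paging(std::uint32_t limit) {
    paging_state s;
    s.max_remaining = limit;
    return s;
}

// Records that a page ended at (pk, ck) after returning `rows` rows.
void page_served(paging_state& s, const std::string& pk,
                 const std::string& ck, std::uint32_t rows) {
    s.has_position = true;
    s.last_partition_key = pk;
    s.last_clustering_key = ck;
    s.max_remaining -= std::min(rows, s.max_remaining);
}

// Number of rows the next page may return, given the driver's page size.
std::uint32_t next_page_limit(const paging_state& s, std::uint32_t page_size) {
    return std::min(page_size, s.max_remaining);
}
```

Clamping the subtraction keeps max_remaining from wrapping around if a page ever reports more rows than were still allowed.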
Basic interface for paging control objects.
We probably do not need virtual behaviour for paging, but on the other
hand it does not really cost much, and it keeps a nice symmetry with
origin.
Allows for having more than one clustering row range set, depending on
PK queried (although right now limited to one - which happens to be exactly
the number multiplexing paging needs... What a coincidence...)
Encapsulates the row_ranges member in a query function, and if needed holds
ranges outside the default one in an extra object.
Query result::builder::add_partition now fetches the correct row range for
the partition, and this is the range used in subsequent iteration.
Note: the serialized blob format differs from origin's, due to Scylla's
different internal architecture. I.e. we query actual rows.
But drivers etc ignore the content of the blob, it is opaque.
Currently, there are multiple places where we can close a session, which makes
the close code path hard to follow. Remove the call to maybe_completed
in follower_start_sent to simplify closing a bit.
- stream_session::follower_start_sent -> maybe_completed()
- stream_session::receive_task_completed -> maybe_completed()
- stream_session::transfer_task_completed -> maybe_completed()
- on receive of the COMPLETE_MESSAGE -> complete()
After running nodetool decommission on node 127.0.0.2, I saw on node 127.0.0.1:
DEBUG [shard 0] gossip - failure_detector: Forcing conviction of 127.0.0.1
TRACE [shard 0] gossip - convict ep=127.0.0.1, phi=8, is_alive=1, is_dead_state=0
TRACE [shard 0] gossip - marking as down 127.0.0.1
INFO [shard 0] gossip - inet_address 127.0.0.1 is now DOWN
DEBUG [shard 0] storage_service - on_dead endpoint=127.0.0.1
This is wrong since the argument for send_gossip_shutdown should be the
node being shut down instead of the live node.
Since the introduction of sets::element_discarder sets::discarder is
always given a set, never a single value.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently sets::discarder is used by both the set difference and the
single-element removal operations. To distinguish between them, the discarder
checks whether the provided value is a set or something else. This won't
work, however, if a set of frozen sets is created.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing sstable may
have succeeded, but moving memtable into the cache may have failed.
update_cache() destroys memtable even if it fails, but error handler
is not aware of it (it does not even distinguish whether error happened
during sstable creation or moving into cache) and when it tells caller
to retry it retries with already destroyed memtable. Fix it by ignoring
moving to cache errors.
This reverts commit fff37d15cd.
Says Tomek (and the comment in the code):
"update_cache() must be called before unlinking the memtable because cache + memtable at any time is supposed to be authoritative source of data for contained partitions. If there is a cache hit in cache, sstables won't be checked. If we unlink the memtable before cache is updated, it's possible that a query will miss data which was in that unlinked memtable, if it hits in the cache (with an old value)."
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing sstable may
have succeeded, but moving memtable into the cache may have failed.
update_cache() destroys memtable even if it fails, but error handler
is not aware of it (it does not even distinguish whether error happened
during sstable creation or moving into cache) and when it tells caller
to retry it retries with already destroyed memtable. Fix it by ignoring
moving to cache errors.
nodetool decommission hangs forever due to a recursive lock.
decommission()
with api lock
shutdown_client_servers()
with api lock
stop_rpc_server()
with api lock
stop_native_transport()
Fix it by calling helpers for stop_rpc_server and stop_native_transport
without the lock.
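The shape of the fix can be sketched with a plain std::mutex standing in for the API lock. The class and its members are hypothetical; only the pattern of a lock-taking entry point plus a lock-free helper is taken from the description above.

```cpp
#include <cassert>
#include <mutex>

class service {
    std::mutex _api_lock;
    bool _rpc_running = true;

    // Helper: assumes the caller already holds _api_lock.
    void do_stop_rpc_server() { _rpc_running = false; }
public:
    // Public entry point: takes the lock, then delegates to the helper.
    void stop_rpc_server() {
        std::lock_guard<std::mutex> guard(_api_lock);
        do_stop_rpc_server();
    }
    // Already holds the lock, so it must call the helper directly;
    // calling stop_rpc_server() here would try to lock _api_lock twice.
    void decommission() {
        std::lock_guard<std::mutex> guard(_api_lock);
        do_stop_rpc_server();
    }
    bool rpc_running() const { return _rpc_running; }
};
```

Locking a std::mutex twice from the same thread is undefined behaviour and typically hangs, which is the same failure mode as the recursive lock described above.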
std::set_difference requires the container to be sorted which is not
true here, use remove_if.
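A small sketch of the remove_if approach on an unsorted container. The function name and the use of int elements are illustrative assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Removes from `items` every element that also appears in `to_remove`.
// Neither input needs to be sorted, unlike with std::set_difference.
void remove_all(std::vector<int>& items, const std::vector<int>& to_remove) {
    auto listed = [&to_remove](int x) {
        return std::find(to_remove.begin(), to_remove.end(), x) != to_remove.end();
    };
    items.erase(std::remove_if(items.begin(), items.end(), listed), items.end());
}
```

The linear search makes this O(n*m), which is acceptable for short lists; the relative order of the surviving elements is preserved.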
Do not use assert, use throw instead so that we can recover from this
error.
Currently the error handling code is attached to the future returned by
when_all(), which is never an exceptional one, but it may hold an exceptional
future as its first element. Move the error handling close to where the error
it tries to catch is generated instead.
Let's move the code that prints that a compaction succeeded only
after the code that catches exception on either read or write
fibers. Let's also get rid of done and use repeat instead in
the read fiber.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If a write timeout and the last acknowledgement needed for CL happen
simultaneously, _ready can be set to be exceptional by the timeout handler, but
since removal of the response handler happens in a continuation, it may be
reordered with the last ack processing, and there _ready will be set again,
which will trigger an assert. Fix it by removing the handler immediately;
there is no need to wait for the continuation. It makes the code simpler too.
get_cm_stats gets a pointer to a field in the stats object. It
should capture it by value, or a segmentation fault may occur when the
caller goes out of scope.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Currently, we don't let the user know even what is the filename that failed.
That information should be included in the message.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This assert (in write fiber) would fail if read fiber failed
because the variable done will not be set to true.
The use of assert is very bad, because it prevents scylla
from proceeding, which is possible.
To solve it, let's trigger an exception if done is not true.
We do have code that will wait for both read and write fibers,
and catch exceptions, if any.
Closes #523.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since 4641dfff24, query_state keeps a
copy of client_state, not a reference. Therefore _cl is no longer
updated by queries using _qp. Fix by using the client_state from _qp.
Fixes #525.
All responses sent from the server have the protocol version set to
connection::_version, which is set to the version used by the client
in its first message. However, if the protocol version used by the
client is unsupported or invalid, the server should use the latest
version it recognizes.
This solves a problem with version negotiation with the Java driver. The
driver first sends a request in the latest version it recognizes; if
that fails, it retries with the version that the server used in the error
message. If that fails as well, it gives up. However, since Scylla always
responds with the same version that the client used, the negotiation
always fails if the client supports more protocol versions than the
server.
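The rule for choosing the response version can be sketched as follows. The supported range (1 to 3) and all names are assumptions made for illustration, not values taken from the server code:

```cpp
#include <cassert>
#include <cstdint>

// Assumed range of versions this hypothetical server understands.
constexpr std::uint8_t min_supported_version = 1;
constexpr std::uint8_t max_supported_version = 3;

bool is_supported(std::uint8_t v) {
    return v >= min_supported_version && v <= max_supported_version;
}

// Version to stamp on a response: echo the client's version when the
// server understands it, otherwise answer in the latest version the
// server knows so that the driver can retry with that one.
std::uint8_t response_version(std::uint8_t client_version) {
    return is_supported(client_version) ? client_version : max_supported_version;
}
```

With this rule a driver that opens with version 4 gets its error in version 3 and can downgrade on the retry, instead of receiving an error stamped with a version the server cannot actually speak.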
Refs #317.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Get initial tokens specified by the initial_token in scylla.conf.
E.g.,
--initial-token "-1112521204969569328,1117992399013959838"
--initial-token "1117992399013959838"
Multiple tokens can be given, separated by commas.
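Parsing such an option value could look roughly like this. It is a sketch that assumes tokens are signed 64-bit integers (as the examples above suggest); the function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Splits a comma-separated token list into signed 64-bit values.
// std::stoll throws std::invalid_argument on a malformed entry and
// std::out_of_range on a value that does not fit.
std::vector<std::int64_t> parse_initial_tokens(const std::string& option) {
    std::vector<std::int64_t> tokens;
    std::istringstream in(option);
    std::string item;
    while (std::getline(in, item, ',')) {
        tokens.push_back(std::stoll(item));
    }
    return tokens;
}
```

An empty option string yields an empty vector, which a caller could treat as "no initial tokens configured".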
"This series adds the missing functionality that the nodetool describering would work.
It imports the missing functionality from origin.
After this patch the API:
GET /storage_service/describe_ring/{keyspace}
will be available"
This patch changes the API to support describe ring instead of the describe
ring jmx, which will be implemented in the jmx server.
The API will return a list of objects instead of a string.
An additional API was added as the equivalent of the jmx call with an
empty param.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the following methods implementation:
getRpcaddress
getRangeToAddressMap
getRangeToAddressMapInLocalDC
describeRing
getAllRanges
Those methods are used as part of the describe_ring method
implementation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The storage server uses the token_range in origin to return information
about the ring.
This imports the structures. The functionality in origin is redundant in
this case and was not imported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Use all the disks except the one for rootfs for RAID0 which stores
scylla data. If only one disk is available warn the user since currently
our AMI's rootfs is not XFS.
[fedora@ip-172-31-39-189 ~]$ cat WARN.TXT
WARN: Scylla is not using XFS to store data. Performance will suffer.
Tested on AWS with 1 disk, 2 disks, and 7 disks.
(cherry picked from commit 49d6cba471)
Mistakenly not included in the yum repository for the AMI patchset, but it's needed.
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
(cherry picked from commit 8587c4d6b3)
The nodetool cleanup command is used in many of the tests. Because the
API call is not implemented, it causes the tests to fail.
This is a workaround until cleanup is implemented: the method
returns successfully.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Normally an API call that is not implemented should fail. There are
cases where, as a workaround, an API call is stubbed; in those cases a warning
is added to indicate that the API is not implemented.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch does the following:
It adds a getter for the completed response messages (i.e. the total
messages that were sent by the server).
It replaces the return mapping for the statistics to use the key, value
notation that is used on the jmx side.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This adds the read repair statistics to the storage_proxy stats and adds
code to its implementation that increments the counter values.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The API needs to get the stats from the rpc server, that is hidden from the
messaging service API.
This patch adds a foreach function that goes over all the server stats
without exposing the server implementation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"The main objective of the series is to introduce statistics about ongoing
read/writes and especially those that are done in the background (acknowledged,
but uncompleted), but it contains some cleanups as well."
Add statistics for ongoing reads and ongoing background reads. A read is
a background one if it was acknowledged, but there is still work to do to
complete it.
"Commit 4cd9c4c0c5441cf55e280c6f2f2e5529426b9c98 introduced a minor
issue: a wrong snitch instance may be used when updating a Gossiper state
(if I/O CPU is different from CPU0).
In order to fix this issue a local snitch instance on CPU0 should be used,
just like a Gossiper local instance.
We have to move some interfaces to i_endpoint_snitch
from being private in a gossiping_property_file_snitch in order to be
able to access it using snitch_ptr handle."
Don't ignore yet another returned future in reload_configuration().
Since commit 5e8037b50a
storage_service::gossip_snitch_info() returns a future.
This patch takes this into account.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When we access a gossiper instance we use a _gossip_started
state of a snitch, which is set in a gossiper_starting() method.
gossiper_starting() method however is invoked by a gossiper on CPU0
only therefore the _gossip_started snitch state will be set for an
instance on CPU0 only.
Therefore instead of synchronizing the _gossip_started state between
all shards we just have to make sure we check it on the right CPU,
which is CPU0.
This patch fixes this issue.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Adjust the interface and distribution of prefer_local parameter read
from a snitch property file with the rest of similar parameters (e.g. dc and rack):
they are read and their values are distributed (copied) across all shards'
instances.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Make reload_gossiper_state() be a virtual method
of a base class in order to allow calling it using a snitch_ptr
handle.
A base class already has a ton of virtual methods so no harm is
done performance-wise. Using virtual methods instead of doing
dynamic_cast results in a much cleaner code however.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Move the member and add an access method.
This is needed in order to be able to access this state using
snitch_ptr handle.
This also allows to get rid of ec2_multi_region_snitch::_helper_added
member since it duplicates _gossip_started semantics.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
* seastar 9ae6407...258daf9 (6):
> rpc server: Add pending and sent messages to server
> scripts: posix_net_conf.sh: Use a generic logic for RPS configuring
> scripts: posix_net_conf.sh: allow passing a NIC name as a parameter
> doc: link to the tutorial
> tutorial: begin documenting the network API
> slab: remove bogus uintptr_t definition
"In 5e8037b50a (gossip: Futurize
add_local_application_state()) , we futurized add_local_application_state.
However, not all of the callers are futurized. Fix it up."
"- Fix snitch names from EC2XXX to Ec2XXX to align with configuration.
- Copy cassandra-rackdc.properties file to /var/lib/scylla/conf
- Set SCYLLA_HOME before booting process"
During testing build, the debugging statement at the end
of the function body (after return statements) causes compilation to
fail due to the flag -Werror=return-type:
service/storage_service.cc: In member function ‘future<> service::storage_service::clear_snapshot(sstring, std::vector<basic_sstring<char, unsigned int, 15u> >)’:
service/storage_service.cc:1358:1: error: control reaches end of non-void function [-Werror=return-type]
Which traces back to 21f84d77. Let's attach a then_wrapped()
clause to parallel_for_each() adding the debug message as
suggested by Avi.
CC: Glauber Costa <glommer@scylladb.com>
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
We are ignoring the future returned by seastar::async. Futurize it so
caller can wait for the application state to be actually applied.
In addition, dropping the unused add_local_application_states function.
We use boost::any to convert to and from database values (stored in
serialized form) and native C++ values. boost::any captures information
about the data type (how to copy/move/delete etc.) and stores it inside
the boost::any instance. We later retrieve the real value using
boost::any_cast.
However, data_value (which has a boost::any member) already has type
information as a data_type instance. By teaching data_type instances about
the corresponding native type, we can eliminate the use of boost::any.
While boost::any is evil and eliminating it improves efficiency somewhat,
the real goal is growing native type support in data_type. We will use that
later to store native types in the cache, enabling O(log n) access to
collections, O(1) access to tuples, and more efficient large blob support.
"gossiping_property_file_snitch checks its property
file (cassandra-rackdc.properties) for changes every minute and
if there were changes it re-registers the helper and initiates
re-read of the new DC and Rack values in the corresponding places.
Therefore we need the ability to unregister/register the corresponding subscriber
at the same time when a subscriber list is possibly iterated by
some other asynchronous context on the current CPU.
The current gossiper implementation assumes that subscribers list may not be
changed from the context different from the one that iterates on their list.
So, this had to be fixed.
An update_endpoint(ep) interface was also missing from the locator::topology
class, along with the corresponding token_metadata::update_topology(ep) wrapper.
Also there were some bugs in the gossiping_property_file::reload_configuration()
method."
In hindsight, it doesn't make much sense to print an
empty string, so let's only print stdout if it is
neither None nor empty.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Do not hold the api lock while streaming the data since it might take a
long time, so we need to reconcile other operations while we are in the
middle of rebuild.
"description":"If the value is the string 'true' with any capitalization, repair only the first range returned by the partitioner.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"parallelism",
"description":"Repair parallelism, can be 0 (sequential), 1 (parallel) or 2 (datacenter-aware).",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"incremental",
"description":"If the value is the string 'true' with any capitalization, perform incremental repair.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"jobThreads",
"description":"An integer specifying the parallelism on each node.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ranges",
"description":"An explicit list of ranges to repair, overriding the default choice. Each range is expressed as token1:token2, and multiple ranges can be given as a comma separated list.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"columnFamilies",
"description":"Which column families to repair in the given keyspace. Multiple column families can be named separated by commas. If this option is missing, all column families in the keyspace are repaired.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"dataCenters",
"description":"Which data centers are to participate in this repair. Multiple data centers can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"hosts",
"description":"Which hosts are to participate in this repair. Multiple hosts can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"trace",
"description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1945,6 +2028,20 @@
}
}
},
"double_mapper":{
"id":"double_mapper",
"description":"A key value mapping between a string and a double",
"properties":{
"key":{
"type":"string",
"description":"The key"
},
"value":{
"type":"double",
"description":"The value"
}
}
},
"maplist_mapper":{
"id":"maplist_mapper",
"description":"A key value mapping, where key and value are list",
"Total space used for commitlogs. If the used space goes above this value, Cassandra rounds up to the next nearest segment multiple and flushes memtables to disk for the oldest commitlog segments, removing those log segments. This reduces the amount of data to replay on startup, and prevents infrequently-updated tables from indefinitely keeping commitlog segments. A small total commitlog space tends to cause more flush activity on less-active tables.\n" \
"Log WARN on any batch size exceeding this value in kilobytes. Caution should be taken on increasing the size of this threshold as it can lead to node instability." \
"The IP address a node tells other nodes in the cluster to contact it by. It allows public and private address to be different. For example, use the broadcast_address parameter in topologies where not all nodes have access to other nodes by their private IP addresses.\n" \
"If your Cassandra cluster is deployed across multiple Amazon EC2 regions and you use the EC2MultiRegionSnitch , set the broadcast_address to public IP address of the node and the listen_address to the private IP." \
) \
val(initial_token,sstring,/* N/A */,Unused, \
val(initial_token,sstring,/* N/A */,Used, \
"Used in the single-node-per-token architecture, where a node owns exactly one contiguous range in the ring space. Setting this property overrides num_tokens.\n" \
"If you are not using vnodes, or have num_tokens set to 1 or unspecified (#num_tokens), you should always specify this parameter when setting up a production cluster for the first time and when adding capacity. For more information, see this parameter in the Cassandra 1.1 Node and Cluster Configuration documentation.\n" \
"This parameter can be used with num_tokens (vnodes ) in special cases such as Restoring from a snapshot." \
"RPC address to broadcast to drivers and other Cassandra nodes. This cannot be set to 0.0.0.0. If blank, it is set to the value of the rpc_address or rpc_interface. If rpc_address or rpc_interface is set to 0.0.0.0, this property must be set.\n" \
"Refresh interval for permissions cache (if enabled). After this interval, cache entries become eligible for refresh. On next access, an async reload is scheduled and the old value is returned until it completes. If permissions_validity_in_ms is non-zero, then this property must be non-zero." \
"Enable or disable inter-node encryption. You must also generate keys and provide the appropriate key and trust store locations and passwords. No custom encryption options are currently enabled. The available options are:\n" \
"\n" \
"internode_encryption : (Default: none ) Enable or disable encryption of inter-node communication using the TLS_RSA_WITH_AES_128_CBC_SHA cipher suite for authentication, key exchange, and encryption of data transfers. The available inter-node options are:\n" \
@@ -690,20 +674,9 @@ public:
"\tnone : No encryption.\n" \
"\tdc : Encrypt the traffic between the data centers (server only).\n" \
"\track : Encrypt the traffic between the racks (server only).\n" \
"\tkeystore : (Default: conf/.keystore ) The location of a Java keystore (JKS) suitable for use with Java Secure Socket Extension (JSSE), which is the Java version of the Secure Sockets Layer (SSL), and Transport Layer Security (TLS) protocols. The keystore contains the private key used to encrypt outgoing messages.\n" \
"\tkeystore_password : (Default: cassandra ) Password for the keystore.\n" \
"\ttruststore : (Default: conf/.truststore ) Location of the truststore containing the trusted certificate for authenticating remote servers.\n" \
"\ttruststore_password : (Default: cassandra ) Password for the truststore.\n" \
"\n" \
"The passwords used in these options must match the passwords used when generating the keystore and truststore. For instructions on generating these files, see Creating a Keystore to Use with JSSE.\n" \
"certificate : (Default: conf/scylla.crt) The location of a PEM-encoded x509 certificate used to identify and encrypt the internode communication.\n" \
val(api_ui_dir,sstring,"swagger-ui/dist/",Used,"The directory location of the API GUI") \
val(api_doc_dir,sstring,"api/api-doc/",Used,"The API definition file directory") \
val(load_balance,sstring,"none",Used,"CQL request load balancing: 'none' or 'round-robin'") \
val(consistent_rangemovement,bool,true,Used,"When set to true, range movements will be consistent. It means: 1) it will refuse to bootstrap a new node if other bootstrapping/leaving/moving nodes are detected. 2) data will be streamed to a new node only from the node which is no longer responsible for the token range. Same as -Dcassandra.consistent.rangemovement in cassandra") \
val(join_ring,bool,true,Used,"When set to true, a node will join the token ring. When set to false, a node will not join the token ring. User can use nodetool join to initiate ring joining later. Same as -Dcassandra.join_ring in cassandra.") \
val(load_ring_state,bool,true,Used,"When set to true, load tokens and host_ids previously saved. Same as -Dcassandra.load_ring_state in cassandra.") \
val(replace_node,sstring,"",Used,"The UUID of the node to replace. Same as -Dcassandra.replace_node in cassandra.") \
val(replace_token,sstring,"",Used,"The tokens of the node to replace. Same as -Dcassandra.replace_token in cassandra.") \
val(replace_address,sstring,"",Used,"The listen_address or broadcast_address of the dead node to replace. Same as -Dcassandra.replace_address.") \
val(replace_address_first_boot,sstring,"",Used,"Like replace_address option, but if the node has been bootstrapped successfully it will be ignored. Same as -Dcassandra.replace_address_first_boot.") \
val(override_decommission,bool,false,Used,"Set true to force a decommissioned node to join the cluster") \
val(ring_delay_ms,uint32_t,30*1000,Used,"Time a node waits to hear from other nodes before joining the ring in milliseconds. Same as -Dcassandra.ring_delay_ms in cassandra.") \
val(developer_mode,bool,false,Used,"Relax environment checks. Setting to true can reduce performance and reliability significantly.") \
throw std::runtime_error("num_tokens must be >= 1");
}
// if (numTokens == 1)
// logger.warn("Picking random token for a single vnode. You should probably add more vnodes; failing that, you should probably specify the token manually");
if (num_tokens == 1) {
logger.warn("Picking random token for a single vnode. You should probably add more vnodes; failing that, you should probably specify the token manually");