Re: Large size KS management

2018-04-20 Thread Oleksandr Shulgin
On Fri, Apr 20, 2018 at 4:08 AM, Aiman Parvaiz  wrote:

> Hi all
>
> I have been given a 15 nodes C* 2.2.8 cluster to manage which has a large
> size KS (~800GB).
>

Is this per node or in total?


> Given the size of the KS most of the management tasks like repair take a
> long time to complete and disk space management is becoming tricky from the
> systems perspective.
>

Please quantify "long".  We had a 12-node 2.1 cluster with ~60 TB total,
and one repair of the full ring (using cassandra-reaper) was taking about
3-4 weeks(!).  We were using the default number of vnodes, 256.
Now we have migrated to 30 nodes on Cassandra 3.0, with only 16 vnodes,
and the same full repair takes under 5 days.

--
Alex


Re: Configuration parameter to reject incremental repair?

2018-08-20 Thread Oleksandr Shulgin
On Mon, Aug 13, 2018 at 1:31 PM kurt greaves  wrote:

> No flag currently exists. Probably a good idea considering the serious
> issues with incremental repairs since forever, and the change of defaults
> since 3.0.
>

Hi Kurt,

Did you mean since 2.2 (when incremental became the default)?  Or was
there more to it that I'm not aware of?

Thanks,
--
Alex


Re: Adding new datacenter to the cluster

2018-08-20 Thread Oleksandr Shulgin
On Mon, Aug 13, 2018 at 3:50 PM Vitali Dyachuk  wrote:

> Hello,
> I'm going to follow this documentation to add a new datacenter to the C*
> cluster
>
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
>
> The main step is to run nodetool rebuild which will sync data to the new
> datacenter,
> this will load cluster badly since the main keyspace size is 2TB.
> 1) What are the best practicies to add a new datacenter with a lot of data?
>

Hi,

If you are worried about overloading the source DC during rebuild, you can
try starting rebuild on one node at a time in the target DC.  Better
options exist for throttling, see below.


> 2) How is it possible to stop rebuild?
>

You can stop rebuild on a single node by restarting the Cassandra server
process.  Rebuild can be resumed by running `nodetool rebuild ...` again.


> 3) What are the throttling possibilities
>

nodetool setstreamingthroughput
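
For example, to throttle streaming to 100 Mb/s on a node (the value here is
purely illustrative; setting it to 0 disables throttling):

  nodetool setstreamingthroughput 100
  nodetool getstreamingthroughput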

Cheers,
--
Alex


Re: Extending Cassandra on AWS from single Region to Multi-Region

2018-08-20 Thread Oleksandr Shulgin
On Thu, Aug 9, 2018 at 3:46 AM srinivasarao daruna 
wrote:

> Hi All,
>
> We have built Cassandra on AWS EC2 instances. Initially when creating
> cluster we have not considered multi-region deployment and we have used AWS
> EC2Snitch.
>
> We have used EBS Volumes to save our data and each of those disks were
> filled around 350G.
> We want to extend it to Multi Region and wanted to know the better
> approach and recommendations to achieve this process.
>
> I agree that we have made a mistake by not using EC2MultiRegionSnitch, but
> its past now and if anyone faced or implemented similar thing i would like
> to get some guidance.
>
> Any help would be very much appreciated.
>

Hello,

As we did this successfully in the past, here are some notes from the field:

- configure the client applications to use address translation specific to
EC2 setup:
https://docs.datastax.com/en/developer/java-driver/3.3/manual/address_resolution/#ec2-multi-region

- either specify the 'datacenter' name the client should consider as local
in the DCAwareRoundRobinPolicy(), or provide private IP addresses of
the local DC as contact points.  This should ensure that the clients don't
try to connect to the new DC, which doesn't have the data yet.

- review the consistency levels the client uses: use LOCAL_ONE and
LOCAL_QUORUM instead of ONE/QUORUM for reads and writes, use EACH_QUORUM
for writes when you want to ensure stronger consistency cross-region.

- switching from plain EC2Snitch to EC2MultiRegionSnitch will change each
node's broadcast address to its public IP.  Make sure that other nodes (in
the same region and the remote region) can connect on the public IP.

Hope this helps,
--
Alex


Re: [Cassandra] nodetool compactionstats not showing pending task.

2018-08-22 Thread Oleksandr Shulgin
On Fri, May 5, 2017 at 1:20 PM Alain RODRIGUEZ  wrote:

> Sorry to hear the restart did not help.
>

Hi,

We have been hitting the same issue for a few weeks on version 3.0.16.
Normally, restarting an affected node helps, but this is something we would
like to avoid doing.

What makes it worse for us is that Cassandra Reaper stops scheduling new
repair jobs if it sees that the node has more than 20 pending compaction
tasks.  We could bump this threshold, but in general the estimate could be
more accurate (or the actual tasks should be started in a timely manner).

>> Maybe try to monitor through JMX with
>> 'org.apache.cassandra.db:type=CompactionManager',
>> attribute 'Compactions' or 'CompactionsSummary'
>
>
> What is this attribute showing?
>

For example, I have right now a node showing "pending tasks: 16" and no
compaction running.  Here is the JMX output (well, via Jolokia):

"Compactions": [],
"CoreCompactorThreads": 1,
"CompactionSummary": [],
"MaximumCompactorThreads": 1,
"CoreValidationThreads": 1,
"MaximumValidatorThreads": 2147483647,

Here is the Apache Cassandra Jira:
> https://issues.apache.org/jira/browse/CASSANDRA. You search here
> 
>  (
> https://issues.apache.org/jira/browse/CASSANDRA-12529?jql=project%20%3D%20CASSANDRA%20AND%20text%20~%20%22pending%20compactions%22%20ORDER%20BY%20created%20DESC),
> for example.
>

I believe this is a separate issue.  There, some actual compaction tasks
are running, but not making progress.  And we never TRUNCATEd our tables,
certainly not recently.

Any more pointers on how to debug this?

Regards,
--
Alex


Re: Repairs are slow after upgrade to 3.11.3

2018-08-29 Thread Oleksandr Shulgin
On Wed, Aug 29, 2018 at 3:06 AM Maxim Parkachov 
wrote:

> couple of days ago I have upgraded Cassandra from 3.11.2 to 3.11.3 and I
> see that repair time is practically doubled. Does someone else experience
> the same regression ?
>

We upgraded from 3.0.16 to 3.0.17 two days ago and we see the same
symptom.  We are using Cassandra Reaper, and the average time to repair one
segment has increased from 5-6 to 10-12 minutes.

--
Alex


Re: Recommended num_tokens setting for small cluster

2018-08-29 Thread Oleksandr Shulgin
On Thu, Aug 30, 2018 at 12:05 AM kurt greaves  wrote:

> For 10 nodes you probably want to use between 32 and 64. Make sure you use
> the token allocation algorithm by specifying allocate_tokens_for_keyspace
>

We are using 16 tokens with 30 nodes on Cassandra 3.0.  And yes, we have
used the allocate_tokens_for_keyspace option to achieve better load
distribution than with the random allocation (which is the default).
Currently we see disk usage between 1.5 and 1.7TB per node, which is an
acceptable variance for us.
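
For reference, the relevant cassandra.yaml settings on our nodes look
roughly like this (the keyspace name is a placeholder):

  num_tokens: 16
  allocate_tokens_for_keyspace: my_keyspace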

If you're using DSE, you're luckier, because it's easier to bootstrap a new
DC with the smart token allocation algorithm.  Simply because the parameter
you need to specify does not depend on any keyspace being replicated to
the new nodes: you just specify the target replication factor to optimize
for.

Cheers,
--
Alex


Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-03 Thread Oleksandr Shulgin
On Mon, Sep 3, 2018 at 10:41 AM onmstester onmstester 
wrote:

> I'm going to add more 6 nodes to my cluster (already has 4 nodesand RF=2)
> using  GossipingPropertyFileSnitch, and *NetworkTopologyStrategy and
> default num_tokens = 256.*
> It recommended to join nodes one by one, although there is < 200GB on each
> node, i will do so.
> In the document mentioned that i should run nodetool cleanup after joining
> a new node:
>  *Run* *nodetool cleanup* *on the source node and on neighboring nodes
> that shared the same subrange after the new node is up and running. Failure
> to run this command after adding a node causes Cassandra to include the old
> data to rebalance the load on that node*
> It also mentioned that
>
> *Cleanup can be safely postponed for low-usage hours.*
> Should i run nodetool cleanup on each node, after adding every node?
> (considering that cleanup too should be done one-by-one , it would be a lot
> of tasks to do! ) is it possible to run clean-up once (after all new nodes
> joined the cluster) on all the nodes?
>

Hi,

It makes a lot of sense to run cleanup once after you have added all the
new nodes.

> I also don't understand the part for:
> allocate_tokens_for_local_replication_factor
> ,
> i didn't change num_tokes:256 and anything related to vnode config in yaml
> conf and load already distributed evenly (is this a good approach and good
> num_tokens, while i'm using nodes with same spec?), so should i consider
> this config ( allocate_tokens_for_local_replication_factor
> )
> while adding new node having a single keyspace with RF=2?
>
I would not recommend touching these while adding nodes to an existing
ring.  You might want to have another look if you add a new DC.  Then pick
a smaller number of vnodes and use the smart allocation option.

Cheers,
--
Alex


Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-03 Thread Oleksandr Shulgin
On Mon, Sep 3, 2018 at 12:19 PM onmstester onmstester 
wrote:

> What i have understood from this part of document is that, when i already
> have node A,B and C in cluster  there would be some old data on A,B,C after
> new node D joined the cluster completely which is data streamed to D, then
> if i add node E to the cluster immediately, the old data on A,B,C would be
> also moved between nodes everytime?
>

Potentially, when you add node E it takes ownership of some of the data
that D has.  So you have to run cleanup on all nodes (except the very last
one you add) in the end.  It still makes sense to do this once, not after
every single node you add.

--
Alex


Re: nodetool cleanup - compaction remaining time

2018-09-06 Thread Oleksandr Shulgin
On Thu, Sep 6, 2018 at 11:50 AM Alain RODRIGUEZ  wrote:

>
> Be aware that this behavior happens when the compaction throughput is set
> to *0 *(unthrottled/unlimited). I believe the estimate uses the speed
> limit for calculation (which is often very much wrong anyway).
>

As far as I can remember, if you have unthrottled compaction, then the
message is different: it says "n/a".  The all-zeroes output is what you
usually see when you only have Validation compactions, and apparently
Cleanup works the same way, at least in the 2.1 version.

https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/tools/nodetool/CompactionStats.java#L102

Actually, if you look closely, it's obvious that only real Compaction tasks
count toward remainingBytes, so Validation/Cleanup/Upgrade tasks don't
count.  The reason must be that only actual compaction is affected by the
throttling parameter.  Is that assumption correct?

In any case it would make more sense to measure the actual throughput to
provide an accurate estimate.  Not sure if there is a JIRA issue for that
already.

--
Alex


Re: [EXTERNAL] Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-08 Thread Oleksandr Shulgin
On Sat, 8 Sep 2018, 14:47 Jonathan Haddad,  wrote:

> 256 tokens is a pretty terrible default setting especially post 3.0.  I
> recommend folks use 4 tokens for new clusters,
>

I wonder: why not set it all the way down to 1 then?  What's the key
difference once you have so few vnodes?

> with some caveats.
>

And those are?

> When you fire up a cluster, there's no way to make the initial tokens be
> distributed evenly, you'll get random ones.  You'll want to set them
> explicitly using:
>
> python -c 'print( [str(((2**64 / 4) * i) - 2**63) for i in range(4)])'
>
>
> After you fire up the first seed, create a keyspace using RF=3 (or
> whatever you're planning on using) and set allocate_tokens_for_keyspace to
> that keyspace in your config, and join the rest of the nodes.  That gives
> even distribution.
>

Do you possibly know if the DSE-style option which doesn't require a
keyspace to be there also works to allocate evenly distributed tokens for
the very first seed node?

Thanks,
--
Alex


Re: [EXTERNAL] Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-08 Thread Oleksandr Shulgin
On Sat, 8 Sep 2018, 19:00 Jeff Jirsa,  wrote:

> Virtual nodes accomplish two primary goals
>
> 1) it makes it easier to gradually add/remove capacity to your cluster by
> distributing the new host capacity around the ring in smaller increments
>
> 2) it increases the number of sources for streaming, which speeds up
> bootstrap and decommission
>
> Whether or not either of these actually is true depends on a number of
> factors, like your cluster size (for #1) and your replication factor (for
> #2). If you have 4 hosts and 4 tokens per host and add a 5th host, you’ll
> probably add a neighbor near each existing host (#1) and stream from every
> other host (#2), so that’s great. If you have 20 hosts and add a new host
> with 4 tokens, most of your existing ranges won’t change at all - you’re
> nominally adding 5% of your cluster capacity but you won’t see a 5%
> improvement because you don’t have enough tokens to move 5% of your ranges.
> If you had 32 tokens, you’d probably actually see that 5% improvement,
> because you’d likely add a new range near each of the existing ranges.
>

Jeff,

I'm a bit lost here: are you referring to streaming speed improvement or
cluster capacity increase?

> Going down to 1 token would mean you’d probably need to manually move
> tokens after each bootstrap to rebalance, which is fine, it just takes more
> operator awareness.
>

Right.  This is the old story from before vnodes: you can only scale out
and keep a balanced cluster if you double the number of nodes.  Or you can
move the tokens.

What's not clear to me is why 4 tokens (as opposed to only 1) should be
enough for adding a small number of nodes and keeping the balance.

Assuming we have 3 racks, we would add 3 nodes at a time for scaling out.
With 4 tokens we split only 12 ranges across the ring this way.  I would
think it depends on the current cluster size, but empirically the load skew
at first gets worse (for middle-sized clusters) and then probably cancels
out for bigger sizes.  Has anyone tried to do the actual math for this?

> I don’t know how DSE calculates which replication factor to use for their
> token allocation logic, maybe they guess or take the highest or something.
> Cassandra doesn’t - we require you to be explicit, but we could probably do
> better here.
>

I believe that DSE also doesn't calculate it: you specify the RF to
optimize for in the config. At least their config parameter is called
allocate_tokens_for_local_replication_factor:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configVnodes.html

That being said, I have never used DSE, hence my question.

Cheers,
--
Alex


Drop TTLd rows: upgradesstables -a or scrub?

2018-09-10 Thread Oleksandr Shulgin
Hello,

We have some tables with a significant amount of TTLd rows that have
expired by now (and more than gc_grace_seconds have passed since the TTL).
We stopped writing more data to these tables quite a while ago, so
background compaction isn't running.  The compaction strategy is the
default SizeTiered one.

Now we would like to get rid of all the droppable tombstones in these
tables.  What would be the approach that puts the least stress on the
cluster?

We've considered a few, but the most promising ones seem to be these two:
`nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra
version 3.0.

Now, this docs page recommends using upgradesstables wherever possible:
https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
What is the reason behind it?

From the source code I can see that Scrubber is the class which is going to
drop the tombstones (and report the total number in the logs):
https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308

I couldn't find similar handling in the upgradesstables code path.  Is the
assumption correct that this one will not drop the tombstones as a side
effect of rewriting the files?

Any drawbacks of using scrub for this task?

Thanks,
-- 
Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
Services | Zalando SE | Tel: +49 176 127-59-707


Re: Drop TTLd rows: upgradesstables -a or scrub?

2018-09-10 Thread Oleksandr Shulgin
On Mon, 10 Sep 2018, 19:29 Charulata Sharma (charshar),
 wrote:

> Scrub takes a very long time and does not remove the tombstones.
>
Charu,

Why is that if the documentation clearly says it does?

> should do garbage cleaning. It immediately removes the tombstones.
>
If you mean 'nodetool garbagecollect' - that command is not available in
the version we are using. It only became available in 3.10.

--
Alex

>
> *From: *Oleksandr Shulgin 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, September 10, 2018 at 6:53 AM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Drop TTLd rows: upgradesstables -a or scrub?
>
>
>
> Hello,
>
>
>
> We have some tables with significant amount of TTLd rows that have expired
> by now (and more gc_grace_seconds have passed since the TTL).  We have
> stopped writing more data to these tables quite a while ago, so background
> compaction isn't running.  The compaction strategy is the default
> SizeTiered one.
>
>
>
> Now we would like to get rid of all the droppable tombstones in these
> tables.  What would be the approach that puts the least stress on the
> cluster?
>
>
>
> We've considered a few, but the most promising ones seem to be these two:
> `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra
> version 3.0.
>
>
>
> Now, this docs page recommends to use upgradesstables wherever possible:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
>
> What is the reason behind it?
>
>
>
> From source code I can see that Scrubber the class which is going to drop
> the tombstones (and report the total number in the logs):
> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308
>
>
>
> I couldn't find similar handling in the upgradesstables code path.  Is the
> assumption correct that this one will not drop the tombstone as a side
> effect of rewriting the files?
>
>
>
> Any drawbacks of using scrub for this task?
>
>
>
> Thanks,
> --
>
> Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
> Services | Zalando SE | Tel: +49 176 127-59-707
>
>
>


Re: Drop TTLd rows: upgradesstables -a or scrub?

2018-09-10 Thread Oleksandr Shulgin
On Mon, 10 Sep 2018, 19:40 Jeff Jirsa,  wrote:

> I think it's important to describe exactly what's going on for people who
> just read the list but who don't have context. This blog does a really good
> job:
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
> , but briefly:
>
> - When a TTL expires, we treat it as a tombstone, because it may have been
> written ON TOP of another piece of live data, so we need to get that
> deletion marker to all hosts, just like a manual explicit delete
> - Tombstones in sstable A may shadow data in sstable B, so doing anything
> on just one sstable MAY NOT remove the tombstone - we can't get rid of the
> tombstone if sstable A overlaps another sstable with the same partition
> (which we identify via bloom filter) that has any data with a lower
> timestamp (we don't check the sstable for a shadowed value, we just look at
> the minimum live timestamp of the table)
>
> "nodetool garbagecollect" looks for sstables that overlap (partition keys)
> and combine them together, which makes tombstones past GCGS purgable and
> should remove them (and data shadowed by them).
>
> If you're on a version without nodetool garbagecollection, you can
> approximate it using user defined compaction (
> http://thelastpickle.com/blog/2016/10/18/user-defined-compaction.html ) -
> it's a JMX endpoint that let's you tell cassandra to compact one or more
> sstables together based on parameters you choose. This is somewhat like
> upgradesstables or scrub, but you can combine sstables as well. If you
> choose candidates intelligently (notably, oldest sstables first, or
> sstables you know overlap), you can likely manually clean things up pretty
> quickly. At one point, I had a jar that would do single sstable at a time,
> oldest sstable first, and it pretty much worked for this purpose most of
> the time.
>
> If you have room, a "nodetool compact" on stcs will also work, but it'll
> give you one huge sstable, which will be unfortunate long term (probably
> less of a problem if you're no longer writing to this table).
>

That's a really nice refresher, thanks Jeff!

From the nature of the data at hand, and because of the SizeTiered
compaction, I would expect that more or less all SSTables do overlap with
each other.

Even if we were able to identify the overlapping ones (how?), I expect
that we would have to do an equivalent of a major compaction, but (maybe)
in multiple stages.  Not sure that's really worth the trouble for us.
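
A rough way to check which data files could overlap (just a sketch, with
illustrative paths and file names) is to compare the min/max timestamps
reported by sstablemetadata:

  for f in /var/lib/cassandra/data/my_ks/my_table-*/mc-*-big-Data.db; do
    echo "$f"
    sstablemetadata "$f" | egrep 'Minimum timestamp|Maximum timestamp'
  done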

Thanks,
--
Alex

On Mon, Sep 10, 2018 at 10:29 AM Charulata Sharma (charshar)
>  wrote:
>
>> Scrub takes a very long time and does not remove the tombstones. You
>> should do garbage cleaning. It immediately removes the tombstones.
>>
>>
>>
>> Thaks,
>>
>> Charu
>>
>>
>>
>> *From: *Oleksandr Shulgin 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Monday, September 10, 2018 at 6:53 AM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Drop TTLd rows: upgradesstables -a or scrub?
>>
>>
>>
>> Hello,
>>
>>
>>
>> We have some tables with significant amount of TTLd rows that have
>> expired by now (and more gc_grace_seconds have passed since the TTL).  We
>> have stopped writing more data to these tables quite a while ago, so
>> background compaction isn't running.  The compaction strategy is the
>> default SizeTiered one.
>>
>>
>>
>> Now we would like to get rid of all the droppable tombstones in these
>> tables.  What would be the approach that puts the least stress on the
>> cluster?
>>
>>
>>
>> We've considered a few, but the most promising ones seem to be these two:
>> `nodetool scrub` or `nodetool upgradesstables -a`.  We are using Cassandra
>> version 3.0.
>>
>>
>>
>> Now, this docs page recommends to use upgradesstables wherever possible:
>> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsScrub.html
>>
>> What is the reason behind it?
>>
>>
>>
>> From source code I can see that Scrubber the class which is going to drop
>> the tombstones (and report the total number in the logs):
>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/Scrubber.java#L308
>>
>>
>>
>> I couldn't find similar handling in the upgradesstables code path.  Is
>> the assumption correct that this one will not drop the tombstone as a side
>> effect of rewriting the files?
>>
>>
>>
>> Any drawbacks of using scrub for this task?
>>
>>
>>
>> Thanks,
>> --
>>
>> Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
>> Services | Zalando SE | Tel: +49 176 127-59-707
>>
>>
>>
>


Re: Drop TTLd rows: upgradesstables -a or scrub?

2018-09-11 Thread Oleksandr Shulgin
On Mon, Sep 10, 2018 at 10:03 PM Jeff Jirsa  wrote:

> How much free space do you have, and how big is the table?
>

So there are 2 tables, one is around 120GB and the other is around 250GB on
every node.  On the node with the most free disk space we still have around
500GB available and on the node with the least free space: 300GB.

So if I understand it correctly, we could still do major compaction while
keeping STCS and we should not hit 100% disk space, if we first compact one
of the tables, and then the other (we expect quite some free space to
become available due to all those TTL tombstones being removed in the
process).
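
A sketch of that sequence (keyspace and table names are placeholders):

  nodetool compact my_keyspace table_a
  # wait for it to finish and for disk space to be reclaimed, then:
  nodetool compact my_keyspace table_b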

Is there any real drawback of having a single big SSTable in our case,
where we are never going to append more data to the table?

Switching to LCS is another option.
>

Hm, this is an interesting idea.  The expectation should be that even if we
don't remove 100% of the tombstones, we should be able to get rid of 90% of
them on the highest level, right?  And if we had less space available,
using LCS could make progress by re-organizing the partitions in smaller
increments, so we could still do it even with less free space than the
smallest table?

Cheers,
--
Alex


Re: Drop TTLd rows: upgradesstables -a or scrub?

2018-09-11 Thread Oleksandr Shulgin
On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> As far as I remember, in newer Cassandra versions, with STCS, nodetool
> compact offers a ‘-s’ command-line option to split the output into files
> with 50%, 25% … in size, thus in this case, not a single largish SSTable
> anymore. By default, without -s, it is a single SSTable though.
>

Thanks Thomas, I've also spotted the option while testing this approach.  I
understand that doing major compactions is generally not recommended, but
do you see any real drawback of having a single SSTable file in case we
stopped writing new data to the table?
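
For the record, the invocation with split output that we were testing looks
like this (names are placeholders):

  nodetool compact -s my_keyspace my_table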

--
Alex


Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

2018-09-11 Thread Oleksandr Shulgin
On Tue, Sep 11, 2018 at 9:47 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> As far as I remember, in newer Cassandra versions, with STCS, nodetool
>> compact offers a ‘-s’ command-line option to split the output into files
>> with 50%, 25% … in size, thus in this case, not a single largish SSTable
>> anymore. By default, without -s, it is a single SSTable though.
>>
>
> Thanks Thomas, I've also spotted the option while testing this approach.
> I understand that doing major compactions is generally not recommended, but
> do you see any real drawback of having a single SSTable file in case we
> stopped writing new data to the table?
>

A related question is: given that we are not writing new data to these
tables, it would make sense to exclude them from the routine repair
regardless of the option we use in the end to remove the tombstones.

However, I've just checked the timestamps of the SSTable files on one of
the nodes and to my surprise I can find some files written only a few weeks
ago (most of the files are half a year old, which is expected because that
was the time we were adding this DC).  But we've stopped writing to the
tables about a year ago and we repair the cluster every week.

What could explain that we suddenly see these new SSTable files?  They
shouldn't be there even due to overstreaming, because one would need to
find some differences in the Merkle tree in the first place, but I don't
see how that could actually happen in our case.

Any ideas?

Thanks,
--
Alex


Re: Drop TTLd rows: upgradesstables -a or scrub?

2018-09-11 Thread Oleksandr Shulgin
On Tue, Sep 11, 2018 at 9:47 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> As far as I remember, in newer Cassandra versions, with STCS, nodetool
>> compact offers a ‘-s’ command-line option to split the output into files
>> with 50%, 25% … in size, thus in this case, not a single largish SSTable
>> anymore. By default, without -s, it is a single SSTable though.
>>
>
> Thanks Thomas, I've also spotted the option while testing this approach.
>

Yet another surprising aspect of using `nodetool compact` is that it
triggers major compaction on *all* nodes in the cluster at the same time.
I don't see where this is documented and this was contrary to my
expectation.  Does this behavior make sense to anyone?  Is this a bug?  The
version is 3.0.

--
Alex


Re: Drop TTLd rows: upgradesstables -a or scrub?

2018-09-11 Thread Oleksandr Shulgin
On Tue, Sep 11, 2018 at 11:07 AM Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

>
> a single (largish) SSTable or any other SSTable for a table, which does
> not get any writes (with e.g. deletes) anymore, will most likely not be
> part of an automatic minor compaction anymore, thus may stay forever on
> disk, if I don’t miss anything crucial here.
>

I would also expect that, but that's totally fine for us.


> Might be different though, if you are entirely writing TTL-based, cause
> single SSTable based automatic tombstone compaction may kick in here, but
> I’m not really experienced with that.
>

Yes, we were writing with a TTL of 2 years to these tables, and in about 1
year from now 100% of the data in them will have expired.  We would be able
to simply truncate them at that point.

Now that you mention single-SSTable tombstone compaction again, I don't
think this is happening in our case.  For example, on one of the nodes I
see the estimated droppable tombstones ratio ranging from 0.24 to slightly
over 1 (1.09).  Yet, apparently no single-SSTable compaction was triggered,
because the data files are all 6 months old now.  We are using all the
default settings for tombstone_threshold, tombstone_compaction_interval
and unchecked_tombstone_compaction.

Does this mean that all these SSTable files do indeed overlap and, because
we don't allow unchecked_tombstone_compaction, no actual compaction is
triggered?
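
For reference, the ratios above come from sstablemetadata, e.g. (the file
name is illustrative):

  sstablemetadata mc-74-big-Data.db | grep -i droppable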

> We had been suffering a lot with storing timeseries data with STCS and disk
> capacity to have the cluster working smoothly and automatic minor
> compactions kicking out aged timeseries data according to our retention
> policies in the business logic. TWCS is unfortunately not an option for us.
> So, we did run major compactions every X weeks to reclaim disk space, thus
> from an operational perspective, by far not nice. Thus, finally decided to
> change STCS min_threshold from default 4 to 2, to let minor compactions
> kick in more frequently. We can live with the additional IO/CPU this is
> causing, thus is our current approach to disk space and sizing issues we
> had in the past.
>

For our new generation of tables we have switched to TWCS; that's the
reason we no longer write to those old tables, which are still using STCS.

Cheers,
--
Alex


Re: Drop TTLd rows: upgradesstables -a or scrub?

2018-09-11 Thread Oleksandr Shulgin
On Tue, Sep 11, 2018 at 10:04 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

>
> Yet another surprising aspect of using `nodetool compact` is that it
> triggers major compaction on *all* nodes in the cluster at the same time.
> I don't see where this is documented and this was contrary to my
> expectation.  Does this behavior make sense to anyone?  Is this a bug?  The
> version is 3.0.
>

Whoops, taking back this one.  It was me who triggered the compaction on
all nodes at the same time.  Trying to do too many things at the same time.
:(

--
Alex


Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

2018-09-11 Thread Oleksandr Shulgin
On Tue, 11 Sep 2018, 19:26 Jeff Jirsa,  wrote:

> Repair or read-repair
>

Jeff,

Could you be more specific please?

Why would any data be streamed in if there is (as far as I can see) no
possibility for the nodes to have any inconsistency?

--
Alex

On Tue, Sep 11, 2018 at 12:58 AM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Tue, Sep 11, 2018 at 9:47 AM Oleksandr Shulgin <
>> oleksandr.shul...@zalando.de> wrote:
>>
>>> On Tue, Sep 11, 2018 at 9:31 AM Steinmaurer, Thomas <
>>> thomas.steinmau...@dynatrace.com> wrote:
>>>
>>>> As far as I remember, in newer Cassandra versions, with STCS, nodetool
>>>> compact offers a ‘-s’ command-line option to split the output into files
>>>> with 50%, 25% … in size, thus in this case, not a single largish SSTable
>>>> anymore. By default, without -s, it is a single SSTable though.
>>>>
>>>
>>> Thanks Thomas, I've also spotted the option while testing this
>>> approach.  I understand that doing major compactions is generally not
>>> recommended, but do you see any real drawback of having a single SSTable
>>> file in case we stopped writing new data to the table?
>>>
>>
>> A related question is: given that we are not writing new data to these
>> tables, it would make sense to exclude them from the routine repair
>> regardless of the option we use in the end to remove the tombstones.
>>
>> However, I've just checked the timestamps of the SSTable files on one of
>> the nodes and to my surprise I can find some files written only a few weeks
>> ago (most of the files are half a year ago, which is expected because it
>> was the time we were adding this DC).  But we've stopped writing to the
>> tables about a year ago and we repair the cluster very week.
>>
>> What could explain that we suddenly see these new SSTable files?  They
>> shouldn't be there even due to overstreaming, because one would need to
>> find some differences in the Merkle tree in the first place, but I don't
>> see how that could actually happen in our case.
>>
>> Any ideas?
>>
>> Thanks,
>> --
>> Alex
>>
>>


Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

2018-09-17 Thread Oleksandr Shulgin
On Tue, Sep 11, 2018 at 8:10 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, 11 Sep 2018, 19:26 Jeff Jirsa,  wrote:
>
>> Repair or read-repair
>>
>
> Could you be more specific please?
>
> Why any data would be streamed in if there is no (as far as I can see)
> possibilities for the nodes to have inconsistency?
>

Again, given that the tables are not updated anymore from the application
and we have repaired them successfully multiple times already, how can it
be that any inconsistency would be found by read-repair or normal repair?

We have seen this on a number of nodes, including SSTables written at a
time when there was guaranteed to be no repair running.

Regards,
--
Alex


Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

2018-09-17 Thread Oleksandr Shulgin
On Mon, Sep 17, 2018 at 4:04 PM Jeff Jirsa  wrote:

> Again, given that the tables are not updated anymore from the application
> and we have repaired them successfully multiple times already, how can it
> be that any inconsistency would be found by read-repair or normal repair?
>
> We have seen this on a number of nodes, including SSTables written at the
> time there was guaranteed no repair running.
>
> Not obvious to me where the sstable is coming from - you’d have to look in
> the logs. If it’s read repair, it’ll be created during a memtable flush. If
> it’s nodetool repair, it’ll be streamed in. It could also be compaction
> (especially tombstone compaction), in which case it’ll be in the compaction
> logs and it’ll have an sstable ancestor in the metadata.
>

Jeff,

Thanks for your reply!  Indeed it could be coming from single-SSTable
compaction, which I didn't think about.  By any chance, could looking into
the compaction_history table be useful to trace it down?

--
Alex


Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?)

2018-09-18 Thread Oleksandr Shulgin
On Mon, Sep 17, 2018 at 4:41 PM Jeff Jirsa  wrote:

> Marcus’ idea of row lifting seems more likely, since you’re using STCS -
> it’s an optimization to “lift” expensive reads into a single sstable for
> future reads (if a read touches more than - I think - 4? sstables, we copy
> it back into the memtable so it’s flushed into a single sstable), so if you
> have STCS and you’re still doing reads, it could definitely be that.
>

A-ha, that's eye-opening: it could definitely be that.  Thanks for the
explanation!

--
Alex


Major compaction ignoring one SSTable? (was Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?))

2018-09-18 Thread Oleksandr Shulgin
On Mon, Sep 17, 2018 at 4:29 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

>
> Thanks for your reply!  Indeed it could be coming from single-SSTable
> compaction, this I didn't think about.  By any chance looking into
> compaction_history table could be useful to trace it down?
>

Hello,

Yet another unexpected thing we are seeing is that after a major compaction
completed on one of the nodes there are two SSTables instead of only one
(time is UTC):

-rw-r--r-- 1 999 root  99G Sep 18 00:13 mc-583-big-Data.db
-rw-r--r-- 1 999 root  70G Mar  8  2018 mc-74-big-Data.db

The more recent one must be the result of the major compaction on this
table, but why was the other one from March not included?

The logs don't help to understand the reason, and from compaction history
on this node the following record seems to be the only trace:

@ Row 1
---+--
 id| b6feb180-bad7-11e8-9f42-f1a67c22839a
 bytes_in  | 223804299627
 bytes_out | 105322622473
 columnfamily_name | XXX
 compacted_at  | 2018-09-18 00:13:48+
 keyspace_name | YYY
 rows_merged   | {1: 31321943, 2: 11722759, 3: 382232, 4: 23405, 5:
2250, 6: 134}

This also doesn't tell us a lot.

This has happened only on one node out of 10 where the same command was
used to start major compaction on this table.

Any ideas what could be the reason?

For now we have just started major compaction again to ensure these last
two SSTables are compacted together, but we would really like to understand
the reason for this behavior.

Regards,
--
Alex


Re: Major compaction ignoring one SSTable? (was Re: Fresh SSTable files (due to repair?) in a static table (was Re: Drop TTLd rows: upgradesstables -a or scrub?))

2018-09-18 Thread Oleksandr Shulgin
On Tue, Sep 18, 2018 at 10:38 AM Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

>
> any indications in Cassandra log about insufficient disk space during
> compactions?
>

Bingo!  The following was logged around the time compaction was started
(and I only looked around when it was finishing):

Not enough space for compaction, 284674.12MB estimated.  Reducing scope.

That still leaves the question of why the estimate doesn't take into
account the tombstones which will be dropped in the process.  The result
actually takes only slightly more than 100GB in the end, as seen on the
other nodes.

Thanks, Thomas!
--
Alex


TWCS + subrange repair = excessive re-compaction?

2018-09-24 Thread Oleksandr Shulgin
Hello,

Our setup is as follows:

Apache Cassandra: 3.0.17
Cassandra Reaper: 1.3.0-BETA-20180830
Compaction: {
   'class': 'TimeWindowCompactionStrategy',
   'compaction_window_size': '30',
   'compaction_window_unit': 'DAYS'
 }

We have two column families which differ only in the way data is written:
one is always written with a TTL (of 2 years), the other without a TTL.
The data is time-series-like, append-only, no explicit updates or deletes.
The data goes back as far as ~15 months.

We have scheduled a non-incremental repair using Cassandra Reaper to run
every week.

Now we are observing an unexpected effect such that often *all* of the
SSTable files on disk are modified (touched by repair) for both of the TTLd
and non-TTLd tables.

This is not expected, since the old files from past months have been
repeatedly repaired a number of times already.

If it is an effect caused by over-streaming, why does Cassandra find any
differences in the files from past months in the first place?  We expect
that after a file from 2 months ago (or earlier) has been fully repaired
once, there is no possibility for any more differences to be discovered.

Is this not a reasonable assumption?

Regards,
-- 
Alex


Re: TWCS + subrange repair = excessive re-compaction?

2018-09-24 Thread Oleksandr Shulgin
On Mon, Sep 24, 2018 at 10:50 AM Jeff Jirsa  wrote:

> Do your partitions span time windows?


Yes.

--
Alex


Re: TWCS + subrange repair = excessive re-compaction?

2018-09-24 Thread Oleksandr Shulgin
On Mon, 24 Sep 2018, 13:08 Jeff Jirsa,  wrote:

> The data structure used to know if data needs to be streamed (the merkle
> tree) is only granular to - at best - a token, so even with subrange repair
> if a byte is off, it’ll stream the whole partition, including parts of old
> repaired sstables
>
> Incremental repair is smart enough not to diff or stream already repaired
> data, the but the matrix of which versions allow subrange AND incremental
> repair isn’t something I’ve memorized (I know it behaves the way you’d hope
> in trunk/4.0 after Cassandra-9143)
>

Cool, thanks for explaining, Jeff!

--
Alex


Odd CPU utilization spikes on 1 node out of 30 during repair

2018-09-26 Thread Oleksandr Shulgin
Hello,

On our production cluster of 30 Apache Cassandra 3.0.17 nodes we have
observed that only one node started to show about 2 times the CPU
utilization as compared to the rest (see screenshot): up to 30% vs. ~15% on
average for the other nodes.

This started more or less immediately after repair was started (using
Cassandra Reaper, parallel, non-incremental) and lasted up until we
restarted this node.  After the restart the CPU use is in line with the
rest of the nodes.

All other metrics that we are monitoring for these nodes were in line with
the rest of the cluster.

The logs on the node don't show anything odd: no extra warn/error/info
messages, and no more minor or major GC runs compared to other nodes during
the time we were observing this behavior.

What could be the reason for this behavior?  How should we debug it if it
happens next time, instead of just restarting?

Cheers,
--
Alex


Re: Odd CPU utilization spikes on 1 node out of 30 during repair

2018-09-26 Thread Oleksandr Shulgin
On Wed, Sep 26, 2018 at 1:07 PM Anup Shirolkar <
anup.shirol...@instaclustr.com> wrote:

>
> Looking at information you have provided, the increased CPU utilisation
> could be because of repair running on the node.
> Repairs are resource intensive operations.
>
> Restarting the node should have halted repair operation getting the CPU
> back to normal.
>

The repair was running on all nodes at the same time, yet only one node had
CPU significantly different from the rest of the nodes.
As I've mentioned, we are running non-incremental parallel repair using
Cassandra Reaper.
After the node was restarted, new repair tasks were given to it by Reaper
and it was doing repair as previously, but this time without exposing the
odd behavior.

In some cases, repairs trigger additional operations e.g. compactions,
> anti-compactions
> These operations could cause extra CPU utilisation.
> What is the compaction strategy used on majority of keyspaces ?
>

For the 2 tables involved in this regular repair we are using
TimeWindowCompactionStrategy with time windows of 30 days.

Talking about CPU utilisation *percentage*, although it has doubled but the
> increase is 15%.
> It would be interesting to know the number of CPU cores on these nodes to
> judge the absolute increase in CPU utilisation.
>

All nodes are using the same hardware on AWS EC2: r4.xlarge, they have 4
vCPUs.

You should try to find the root cause behind the behaviour and decide
> course of action.
>

Sure, that's why I was asking for ideas how to find the root cause. :-)

Effective use monitoring, logs can help you identify the root cause.
>

As I've mentioned, we do have monitoring and I've checked the logs, but
that didn't help to identify the issue so far.

Regards,
--
Alex


Re: Odd CPU utilization spikes on 1 node out of 30 during repair

2018-09-27 Thread Oleksandr Shulgin
On Thu, Sep 27, 2018 at 2:24 AM Anup Shirolkar <
anup.shirol...@instaclustr.com> wrote:

>
> Most of the things look ok from your setup.
>
> You can enable Debug logs for repair duration.
> This will help identify if you are hitting a bug or other cause of unusual
> behaviour.
>
> Just a remote possibility, do you have other things running on nodes
> besides Cassandra.
> Do they consume additional CPU at times.
> You can check per process CPU consumption to keep an eye on non-Cassandra
> processes.
>

That's a good point.  These instances are dedicated to running Cassandra,
so we didn't think to check whether any other processes might be the
cause...  There are, of course, some additional processes (like metrics
exporters and log shipping agents), but they normally do not contribute to
CPU utilization in any visible amount.

Cheers,
--
Alex


Re: Re: how to configure the Token Allocation Algorithm

2018-10-01 Thread Oleksandr Shulgin
On Mon, Oct 1, 2018 at 12:18 PM onmstester onmstester 
wrote:

>
> What if instead of running that python and having one node with non-vnode
> config, i remove the first seed node and re-add it after cluster was fully
> up ? so the token ranges of first seed node would also be assigned by
> Allocation Alg
>

I think this is tricky, because the random allocation of the very first
tokens from the first seed affects the choice of tokens made by the
algorithm on the rest of the nodes: it basically tries to divide the token
ranges in more or less equal parts.  If your very first 8 tokens resulted
in really bad balance, you are not going to remove that imbalance by
removing the node; it would still have a lasting effect on the rest of
your cluster.

--
Alex


Re: TWCS: Repair create new buckets with old data

2018-10-19 Thread Oleksandr Shulgin
On Fri, Oct 19, 2018 at 10:23 AM Jeff Jirsa  wrote:

> It depends on your yaml settings - in newer versions you can have
> cassandra only purge repaired tombstones (and ttl’d data is a tombstone)
>

Interesting.  Which setting is that?  Is it 4.0 or 3.x -- I couldn't find
anything similar in the 3.x yaml example...

--
Alex


Re: TWCS: Repair create new buckets with old data

2018-10-19 Thread Oleksandr Shulgin
On Fri, Oct 19, 2018 at 11:04 AM Jeff Jirsa  wrote:

>
> I’m mobile and can’t check but it’s this JIRA
>
> https://issues.apache.org/jira/browse/CASSANDRA-6434
>
> And it may be a table level prop, I suppose. Again, I’m not in a position
> to confirm.
>

Indeed, it's called only_purge_repaired_tombstones and is a column family
option for STCS, so it should also work for TWCS, if I understand it
correctly.  It has been available since 3.0 and is off by default.
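
If I read it correctly, enabling it would look roughly like this (keyspace
and table names are placeholders; note that ALTER TABLE replaces the whole
compaction map, so the existing options have to be repeated):

  cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '30',
    'only_purge_repaired_tombstones': 'true'};"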

--
Alex


Re: snapshot strategy?

2018-11-02 Thread Oleksandr Shulgin
On Fri, Nov 2, 2018 at 5:15 PM Lou DeGenaro  wrote:

> I'm looking to hear how others are coping with snapshots.
>
> According to the doc:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupDeleteSnapshot.html
>
> *When taking a snapshot, previous snapshot files are not automatically
> deleted. You should remove old snapshots that are no longer needed.*
>
> *The nodetool clearsnapshot
> 
> command removes all existing snapshot files from the snapshot directory of
> each keyspace. You should make it part of your back-up process to clear old
> snapshots before taking a new one.*
>
> But if you delete first, then there is a window of time when no snapshot
> exists until the new one is created.  And with a single snapshot there is
> no recovery further back than it.
>
You can also delete a specific snapshot by passing its name to the
clearsnapshot command.  For example, you could use the snapshot date as
part of the name.  This also prevents removing snapshots which were taken
for reasons other than backup, like the automatic snapshots created by
running TRUNCATE or DROP commands, or any other snapshots which might have
been created manually by the operators.
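
A minimal sketch of that naming scheme (tag and keyspace names are
placeholders):

  nodetool snapshot -t backup-20181102 my_keyspace
  # ...later, once the older snapshot is no longer needed:
  nodetool clearsnapshot -t backup-20181026 my_keyspace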

Regards,
--
Alex


Re: upgrading from 2.x TWCS to 3.x TWCS

2018-11-04 Thread Oleksandr Shulgin
On Sat, Nov 3, 2018 at 1:13 AM Brian Spindler 
wrote:

> That wasn't horrible at all.  After testing, provided all goes well I can
> submit this back to the main TWCS repo if you think it's worth it.
>
> Either way do you mind just reviewing briefly for obvious mistakes?
>
>
> https://github.com/bspindler/twcs/commit/7ba388dbf41b1c9dc1b70661ad69273b258139da
>

Almost a year ago we were migrating from 2.1 to 3.0 and we figured out
that Jeff's master branch didn't compile with 3.0, but the change to
get it running was really minimal:
https://github.com/a1exsh/twcs/commit/10ee91c6f409aa249c8d439f7670d8b997ab0869

So we built that jar, added it to the packaged 3.0 and we were good to go.
You might want to consider migrating in two steps: 2.1 -> 3.0, ALTER
TABLE, upgradesstables, 3.0 -> 3.1.

And huge thanks to Jeff for coming up with TWCS in the first place! :-)

Cheers,
--
Alex


Re: Jepsen testing

2018-11-09 Thread Oleksandr Shulgin
On Thu, Nov 8, 2018 at 10:42 PM Yuji Ito  wrote:

>
> We are working on Jepsen testing for Cassandra.
> https://github.com/scalar-labs/jepsen/tree/cassandra/cassandra
>
> As you may know, Jepsen is a framework for distributed systems
> verification.
> It can inject network failure and so on and check data consistency.
> https://github.com/jepsen-io/jepsen
>
> Our tests are based on riptano's great work.
> https://github.com/riptano/jepsen/tree/cassandra/cassandra
>
> I refined it for the latest Jepsen and removed some tests.
> Next, I'll fix clock-drift tests.
>
> I would like to get your feedback.
>

Cool stuff!  Do you have jepsen tests as part of regular testing in
scalardb?  How long does it take to run all of them on average?

I wonder if Apache Cassandra would be willing to include this as part of
regular testing drill as well.

Cheers,
--
Alex


Re: system_auth keyspace replication factor

2018-11-26 Thread Oleksandr Shulgin
On Fri, Nov 23, 2018 at 5:38 PM Vitali Dyachuk  wrote:

>
> We have recently met a problem when we added 60 nodes in 1 region to the
> cluster
> and set an RF=60 for the system_auth ks, following this documentation
> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>

Sadly, this recommendation is out of date / incorrect.  For `system_auth`
we are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no
issues.
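
For example, with that formula a DC of 60 nodes gets RF=5.  A sketch of
applying it (the DC name is a placeholder), followed by a repair of the
keyspace:

  cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
    {'class': 'NetworkTopologyStrategy', 'eu-central': 5};"
  nodetool repair --full system_auth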

Is there a chance to correct the documentation @datastax?

Regards,
--
Alex


Re: Problem with restoring a snapshot using sstableloader

2018-11-30 Thread Oleksandr Shulgin
On Fri, Nov 30, 2018 at 5:13 PM Oliver Herrmann 
wrote:

>
> I'm always getting the message "Skipping file mc-11-big-Data.db: table
> snapshots.table3 doesn't exist". I also tried to rename the snapshots
> folder into the keyspace name (cass_testapp) but then I get the message
> "Skipping file mc-11-big-Data.db: table snap1.snap1. doesn't exist".
>

Hi,

I imagine moving the files from the snapshot directory to the data
directory and then running `nodetool refresh` is the supported way.  Why
use sstableloader for that?
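
Roughly like this, assuming the default data directory layout and a single
matching table directory (paths, keyspace, table and snapshot names are
illustrative):

  cp /var/lib/cassandra/data/cass_testapp/table3-*/snapshots/snap1/* \
     /var/lib/cassandra/data/cass_testapp/table3-*/
  nodetool refresh cass_testapp table3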

--
Alex


Re: Problem with restoring a snapshot using sstableloader

2018-12-01 Thread Oleksandr Shulgin
On Fri, 30 Nov 2018, 17:54 Oliver Herrmann wrote:

> When using nodetool refresh I must have write access to the data folder
> and I have to do it on every node. In our production environment the user
> that would do the restore does not have write access to the data folder.
>

OK, I'm not entirely sure that's a reasonable setup, but do you imply that
with sstableloader you don't need to process every snapshot taken, i.e. to
also visit every node?  That would only be true if your replication factor
were equal to the number of nodes, IMO.

--
Alex


Re: Problem with restoring a snapshot using sstableloader

2018-12-03 Thread Oleksandr Shulgin
On Mon, Dec 3, 2018 at 4:24 PM Oliver Herrmann 
wrote:

>
> You are right. The number of nodes in our cluster is equal to the
> replication factor. For that reason I think it should be sufficient to call
> sstableloader only from one node.
>

The next question is then: do you care about consistency of data restored
from one snapshot?  Is the snapshot taken after repair?  Do you still write
to those tables?

In other words, your data will be consistent after restoring from one
node's snapshot only if you were writing with consistency level ALL (or
equal to your replication factor and, transitively, to the number of nodes).

-- 
Oleksandr "Alex" Shulgin | Senior Software Engineer | Team Flux | Data
Services | Zalando SE | Tel: +49 176 127-59-707


Sporadic high IO bandwidth and Linux OOM killer

2018-12-05 Thread Oleksandr Shulgin
Hello,

We are running the following setup on AWS EC2:

Host system (AWS AMI): Ubuntu 14.04.4 LTS,
Linux  4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5
08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Cassandra process runs inside a docker container.
Docker image is based on Ubuntu 18.04.1 LTS.

Apache Cassandra 3.0.17, installed from .deb packages.

$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

We have a total of 36 nodes.  All are r4.large instances, they have 2 vCPUs
and ~15 GB RAM.
On each instance we have:
- 2TB gp2 SSD EBS volume for data and commit log,
- 8GB gp2 SSD EBS for system (root volume).

Non-default settings in cassandra.yaml:
num_tokens: 16
memtable_flush_writers: 1
concurrent_compactors: 1
snitch: Ec2Snitch

JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
Garbage collection: CMS with default settings.

We repair once a week using Cassandra Reaper: parallel, intensity 1, 64
segments per node.  The issue also happens outside of repair time.

The symptoms:


Sporadically a node becomes unavailable for a period of time between a few
minutes and a few hours.  According to our analysis, and as pointed out by
the AWS support team, the unavailability is caused by exceptionally high
read bandwidth on the *root* EBS volume.  I repeat, on the root volume,
*not* on the data/commitlog volume.  Basically, the amount of IO exceeds
the instance's bandwidth (~52MB/s) and all other network communication
becomes impossible.

The root volume contains operating system, docker container with OpenJDK
and Cassandra binaries, and the logs.

Most of the time, whenever this happens it is too late to SSH into the
instance to troubleshoot: it becomes completely unavailable within a very
short period of time.
Rebooting the affected instance helps to bring it back to life.

Starting from the middle of last week we have seen this problem repeatedly
1-3 times a day, affecting different instances in a seemingly random
fashion.  Most of the time it affects only one instance, but we've had one
incident when 9 nodes (3 from each of the 3 Availability Zones) were down
at the same time due to this exact issue.

Actually, we've had the same issue previously on the same Cassandra cluster
around 3 months ago (beginning to mid of September 2018).  At that time we
were running on m4.xlarge instances (these have 4 vCPUs and 16GB RAM).

As a mitigation measure we have migrated away from those to r4.2xlarge.
Then we didn't observe any issues for a few weeks, so we have scaled down
two times: to r4.xlarge and then to r4.large.  The last migration was
completed before Nov 13th.  No changes to the cluster or application
happened since that time.

Now, after some weeks the issue appears again...

When we are not fast enough to react and reboot the affected instance, we
can see that ultimately the Linux OOM killer kicks in and kills the java
process running Cassandra.  After that the instance becomes available
almost immediately.  This allows us to rule out other processes running in
the background as potential offenders.

We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB
and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via
Jolokia).  At the same time, Committed_AS on each host is constantly around
11-12GB, as reported by atop(1) and prometheus.

We are running atop with a sampling interval of 60 seconds.  After the fact
we observe that the java process is the one responsible for most of the
disk activity during the unavailability period.  We also see kswapd0 high
on the list from time to time, which always has 0K reads, but non-zero
write bandwidth.  There is no swap space defined on these instances, so it
is not really clear why kswapd appears at the top of the list at all
(measurement error?).

We also attempted to troubleshoot by running jstack, jmap and pmap against
Cassandra process in background every few minutes.  The idea was to compare
dumps taken before and during unavailability, but that didn't lead to any
findings so far.  Ultimately we had to stop doing this, once we've seen
that jmap can also become stuck burning CPU cycles.  Now the output of jmap
is not that useful, but we fear that jstack might also expose the same
behavior.  So we wanted to avoid making the issue worse than it currently
is and disabled this debug sampling.
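
For reference, the sampling loop was roughly of the following shape (an
illustrative sketch only, not our exact script; the paths, the interval and
the way the PID is looked up are assumptions):

while true; do
    ts=$(date +%Y%m%dT%H%M%S)
    pid=$(pgrep -f CassandraDaemon | head -n 1)
    # thread dump, heap summary and memory map of the Cassandra process
    jstack "$pid"     > /var/tmp/cassandra-debug/jstack.$ts 2>&1
    jmap -heap "$pid" > /var/tmp/cassandra-debug/jmap.$ts   2>&1
    pmap -x "$pid"    > /var/tmp/cassandra-debug/pmap.$ts   2>&1
    sleep 300
done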

Now to my questions:

1. Is there anything in Cassandra or in the JVM that could explain suddenly
reading from a non-data volume at such a high rate, for prolonged periods of
time, as described above?

2. Why does the JVM heap utilization never reach the 8GB that we provide to
it?

3. Why is the committed virtual memory so much bigger than the sum of the
heap and off-heap memory reported by JMX?  To what can this difference be
attributed?  I've just visited a node at random and collected "off heap
memory used" numbers reported by nodetool cfstats, and

Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-05 Thread Oleksandr Shulgin
On Wed, 5 Dec 2018, 19:53 Jonathan Haddad wrote:

> Seeing high kswapd usage means there's a lot of churn in the page cache.
> It doesn't mean you're using swap, it means the box is spending time
> clearing pages out of the page cache to make room for the stuff you're
> reading now.
>

Jon,

Thanks for your thoughts!

> machines don't have enough memory - they are way undersized for a
> production workload.
>

Well, they had been doing fine since around February this year.  The issue
started to appear out of the blue.

Things that make it worse:
> * high readahead (use 8kb on ssd)
> * high compression chunk length when reading small rows / partitions.
> Nobody specifies this, 64KB by default is awful.  I almost always switch to
> 4KB-16KB here but on these boxes you're kind of screwed since you're
> already basically out of memory.
>

That's interesting, even though from my understanding Cassandra is mostly
doing sequential IO.  But I'm not sure this is really relevant to the issue
at hand, as reading is done from the root device.

What could it be reading from there?  After the JVM has started up and the
config file is parsed, I really don't see why it should read anything
additionally.  Or am I missing something?

To make it clear: normally the root EBS on these nodes is doing at most 10
reads per second. When the issue starts, reads per second jump to hundreds
within a few minutes (sometimes there's a preceding period of slow build-up,
but in the end it's really exponential).

I'd never put Cassandra in production with less than 30GB ram and 8 cores
> per box.
>

We had to tweak the heap size once we started to run repairs, because the
default heuristic aimed too low for us.  Otherwise, as I've said, we've seen
zero problems with our workload.
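
For completeness, the override itself boils down to two lines in
cassandra-env.sh (a sketch matching the -Xms8G/-Xmx8G/-Xmn800M flags quoted
at the top of the thread; the exact file location depends on the packaging):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"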

Cheers,
--
Alex


Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-05 Thread Oleksandr Shulgin
On Wed, 5 Dec 2018, 19:34 Riccardo Ferrari wrote:

> Hi Alex,
>
> I saw that behaviour in the past.
>

Riccardo,

Thank you for the reply!

Do you refer to the kswapd issue only, or have you observed more problems
that match the behavior I have described?

I can tell you the kswapd0 usage is connected to the `disk_access_mode`
> property. On 64bit systems defaults to mmap.
>

Hm, that's interesting, I will double-check.

That also explains why your virtual memory is so high (it somehow matches
> the node load, right?).
>

Not sure what you mean by "load" here.  We have a bit less than 1.5TB per
node on average.

Regards,
--
Alex


Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-06 Thread Oleksandr Shulgin
On Thu, Dec 6, 2018 at 11:14 AM Riccardo Ferrari  wrote:

>
> I had a few instances in the past that were showing that unresponsiveness
> behaviour. Back then I saw with iotop/htop/dstat ... the system was stuck
> on a single thread processing (full throttle) for seconds. According to
> iotop that was the kswapd0 process. That system was an ubuntu 16.04
> actually "Ubuntu 16.04.4 LTS".
>

Riccardo,

Did you by chance also observe Linux OOM?  How long did the
unresponsiveness last in your case?

> From there I started to dig what kswap process was involved in a system
> with no swap and found that is used for mmapping. This erratic (allow me to
> say erratic) behaviour was not showing up when I was on 3.0.6 but started
> to right after upgrading to 3.0.17.
>
> By "load" I refer to the load as reported by the `nodetool status`. On my
> systems, when disk_access_mode is auto (read mmap), it is the sum of the
> node load plus the jmv heap size. Of course this is just what I noted on my
> systems not really sure if that should be the case on yours too.
>

I've checked and indeed we are using disk_access_mode=auto (well,
implicitly, because it's not even part of the config file anymore):
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.
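
In case someone wants to check the same on their side: the determined mode
is logged at startup, and disk_access_mode can also be set explicitly in
cassandra.yaml even though it is not listed in the shipped config (sketch;
the log path is an assumption and we have not tried the override ourselves):

$ grep -i diskaccessmode /var/log/cassandra/system.log

# possible explicit setting in cassandra.yaml; valid values include
# auto, mmap, mmap_index_only and standard:
disk_access_mode: mmap_index_only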

I hope someone with more experience than me will add a comment about your
> settings. Reading the configuration file, writers and compactors should be
> 2 at minimum. I can confirm when I tried in the past to change the
> concurrent_compactors to 1 I had really bad things happenings (high system
> load, high message drop rate, ...)
>

As I've mentioned, we did not observe any other issues with the current
setup: system load is reasonable, no dropped messages, no significant
number of hints, request latencies are OK, and no big backlog of pending
compactions.  Also during repair everything looks fine.

I have the "feeling", when running on constrained hardware the underlaying
> kernel optimization is a must. I agree with Jonathan H. that you should
> think about increasing the instance size, CPU and memory mathters a lot.
>

How did you solve your issue in the end?  You didn't roll back to 3.0.6?
Did you tune kernel parameters?  Which ones?

Thank you!
--
Alex


Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-07 Thread Oleksandr Shulgin
On Thu, Dec 6, 2018 at 3:39 PM Riccardo Ferrari  wrote:

> To be honest I've never seen the OOM in action on those instances. My Xmx
> was 8GB just like yours and that let me think you have some process that is
> competing for memory, is it? Do you have any cron, any backup, anything
> that can trick the OOMKiller ?
>

Riccardo,

As I've mentioned previously, apart from the Docker container running
Cassandra on the JVM, there is a small number of housekeeping processes,
namely cron to trigger log rotation, a log-shipping agent, a node metrics
exporter (Prometheus) and some other small things.  None of those come close
to Cassandra in their memory requirements, and they are routinely pretty low
in the memory usage reports from atop and similar tools.  The overhead of
these seems to be minimal.

My unresponsiveness was seconds long. This is/was bad becasue gossip
> protocol was going crazy by marking nodes down and all the consequences
> this can lead in distributed system, think about hints, dynamic snitch, and
> whatever depends on node availability ...
> Can you share some number about your `tpstats` or system load in general?
>

Here's some pretty typical tpstats output from one of the nodes:

Pool Name                       Active  Pending   Completed  Blocked  All time blocked
MutationStage                        0        0   319319724        0                 0
ViewMutationStage                    0        0           0        0                 0
ReadStage                            0        0    80006984        0                 0
RequestResponseStage                 0        0   258548356        0                 0
ReadRepairStage                      0        0     2707455        0                 0
CounterMutationStage                 0        0           0        0                 0
MiscStage                            0        0           0        0                 0
CompactionExecutor                      1551552918                 0                 0
MemtableReclaimMemory                0        0        4042        0                 0
PendingRangeCalculator               0        0         111        0                 0
GossipStage                          0        0     6343859        0                 0
SecondaryIndexManagement             0        0           0        0                 0
HintsDispatcher                      0        0         226        0                 0
MigrationStage                       0        0           0        0                 0
MemtablePostFlush                    0        0        4046        0                 0
ValidationExecutor                   1        1        1510        0                 0
Sampler                              0        0           0        0                 0
MemtableFlushWriter                  0        0        4042        0                 0
InternalResponseStage                0        0        5890        0                 0
AntiEntropyStage                     0        0        5532        0                 0
CacheCleanupExecutor                 0        0           0        0                 0
Repair#250                           1        1           1        0                 0
Native-Transport-Requests            2        0   260447405        0                18

Message type   Dropped
READ 0
RANGE_SLICE  0
_TRACE   0
HINT 0
MUTATION 1
COUNTER_MUTATION 0
BATCH_STORE  0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR  0

Speaking of CPU utilization, it is consistently within 30-60% on all nodes
(and even less at night).


> On the tuning side I just went through the following article:
> https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
>
> No rollbacks, just moving forward! Right now we are upgrading the instance
> size to something more recent than m1.xlarge (for many different reasons,
> including security, ECU and network). Nevertheless it might be a good idea
> to upgrade to the 3.X branch to leverage on better off-heap memory
> management.
>

One thing we have noticed very recently is that our nodes are indeed
running low on memory.  It even seems now that the IO is a side effect of
impending OOM, not the other way round, as we had thought initially.

After a fresh JVM start the memory allocation looks roughly like this:

             total       used       free     shared    buffers     cached
Mem:           14G        14G       173M       1.1M        12M       3.2G
-/+ buffers/cache:         11G       3.4G
Swap:           0B         0B         0B

Then, within a number of days, the allocated disk cache shrinks all the way
down to unreasonable numbers like only 150M.  At the same time "free" stays
at the original level and "used" grows all the way up to 14G.  Shortly
after that the node becomes unavailable because of the IO and ultimately
after some time the JVM gets killed.

Re: AWS r5.xlarge vs i3.xlarge

2018-12-10 Thread Oleksandr Shulgin
On Mon, Dec 10, 2018 at 12:20 PM Riccardo Ferrari 
wrote:

> I am wondering what instance type is best for a small cassandra cluster on
> AWS.
>

Define "small" :-D


> Actually I'd like to compare, or have your opinion about the following
> instances:
>
> - r5d.xlarge (4vCPU, 19ecu, 32GB ram and 1 NVMe instance store of
> 150GB)
>   - Need to attach a 600/900GB EBS
> - i3.xlarge (4vCPU, 13ecu, 30.5GB ram and 950GB NVMe instance
> store)
>
> Both have up to 10Gb networking.
> I see AWS mark i3 as the NoSQL DB instances nevertheless r5d seems bit
> better CPU wise. Putting a decently sized gp2 EBS I should have enough IOPS
> especially we think to put commitlog and such on the 150GB NVMe storage.
> About the workload: mostly TWCS inserts and upserts on LCS.
>

So there are a number of trade-offs:

1. With EBS you have more flexibility when it comes to scaling compute
power: you don't have to rebuild data directory from scratch.  At the same
time, EBS performance can be limited by the volume itself (it depends on
volume type *and* size), and it can also be limited by instance type.  You
might not be able to reach max throughput of a big volume with a small
instance attached.

2. I didn't try to run Cassandra with i2 or i3 instances.  These are
optimized for a lot of random IO, though with Cassandra what you should be
seeing is mostly sequential IO, so I'm not sure you're going to utilize the
NVMes fully.  Some AWS features, like auto-recovery, only work with
instances using EBS-backed storage exclusively.

Cheers,
--
Alex


Re: AWS r5.xlarge vs i3.xlarge

2018-12-10 Thread Oleksandr Shulgin
On Mon, Dec 10, 2018 at 3:23 PM Riccardo Ferrari  wrote:

>
> By "small" I mean that currently I have a 6x m1.xlarge instances running
> Cassandra 3.0.17. Total amount of data is around 1.5TB spread across a
> couple of keyspaces with RF:3.
>
> Over time few things happened/became clear including:
>
>- increase amount of ingested data
>- m1.xlarge instances are somehow outdated. We noted that one of them
>is under performing compared to the others. Networking is not always
>stable/reliable and so on
>- Upgrading from 3.0.6 to 3.0.17 emphasized the need of better
>hardware even more (in my opinion).
>
> Starting from here I believe that i3/r5d are already a much better option
> to what we have with a comparable price.
>
> About the EBS: Yes, I am aware its performance is related to its size (and
> type) That is the reason why I was looking into a 600/900GB drive that
> already a much better option compared to our raid0 of spinning disks. Both
> i3 and r5d are EBS optimized
>

True, but pay attention to the fine print (from
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html#ebs-optimization-support
):

* These instance types can support maximum performance for 30 minutes at
> least once every 24 hours...


So if you check the *baseline* performance of r5d.xlarge (which also holds
for i3.xlarge) you will see up to 106.25 MB/s throughput and up to 6000
IOPS.  That's already a lot, but you should still consider that to have a
complete picture.

--
Alex


Re: Re: How to gracefully decommission a highly loaded node?

2018-12-17 Thread Oleksandr Shulgin
On Mon, Dec 17, 2018 at 11:44 AM Riccardo Ferrari 
wrote:

> I am having "the same" issue.
> One of my nodes seems to have some hardware struggle, out of 6 nodes (same
> instance size) this one is likely to be marked down, it's constantly
> compacting, high system load, it's just a big pain.
>
> My idea was to add nodes and decommission all the one running on old
> hardware (m1.xlarge), however this very specific "bad" node is causing
> trouble to the whole cluster and decided to decommission it first.
>
> The node is simply stuck in "LEAVING" - Not sending any stream. I already
> have disabled binary and autocompactions and tried to restart the
> decommission process couple of times with no luck.
> Any suggestions?
> assassinate vs removenode?
> Any tuning that could help?
>

If it's stuck that badly, then I would consider the node lost and just do a
node replacement (replace_address).  I hope it's not too late, given that
you already started to decommission?

Cheers,
--
Alex


Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-28 Thread Oleksandr Shulgin
On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

>
> After a fresh JVM start the memory allocation looks roughly like this:
>
>              total       used       free     shared    buffers     cached
> Mem:           14G        14G       173M       1.1M        12M       3.2G
> -/+ buffers/cache:         11G       3.4G
> Swap:           0B         0B         0B
>
> Then, within a number of days, the allocated disk cache shrinks all the
> way down to unreasonable numbers like only 150M.  At the same time "free"
> stays at the original level and "used" grows all the way up to 14G.
> Shortly after that the node becomes unavailable because of the IO and
> ultimately after some time the JVM gets killed.
>
> Most importantly, the resident size of JVM process stays at around 11-12G
> all the time, like it was shortly after the start.  How can we find where
> the rest of the memory gets allocated?  Is it just some sort of malloc
> fragmentation?
>

For the ones following along at home, here's what we ended up with so far:

0. Switched to the next bigger EC2 instance type, r4.xlarge, and the
symptoms are gone.  Our bill is dominated by the price of EBS storage, so
this is much less than a 2x increase in total.

1. We've noticed that increased memory usage correlates with the number of
SSTables on disk.  When the number of files on disk decreases, available
memory increases.  This leads us to think that extra memory allocation is
indeed due to the use of mmap.  It is not clear how we could account for that.

2. Improved our monitoring to include number of files (via total - free
inodes).
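
The check itself is trivial, conceptually just (sketch, assuming the data
volume is mounted at /var/lib/cassandra):

$ df -i /var/lib/cassandra
# used inodes = total - free; we watch how that number moves together
# with the available memory on the node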

Given the cluster's resource utilization, it still feels like r4.large
would be a good fit, if only we could figure out those few "missing" GB of
RAM. ;-)

Cheers!
--
Alex


Re: How seed nodes are working and how to upgrade/replace them?

2019-01-07 Thread Oleksandr Shulgin
On Mon, Jan 7, 2019 at 3:37 PM Jonathan Ballet  wrote:

>
> I'm working on how we could improve the upgrades of our servers and how to
> replace them completely (new instance with a new IP address).
> What I would like to do is to replace the machines holding our current
> seeds (#1 and #2 at the moment) in a rolling upgrade fashion, on a regular
> basis:
>
> * Is it possible to "promote" any non-seed node as a seed node?
>
> * Is it possible to "promote" a new seed node without having to restart
> all the nodes?
>   In essence, in my example that would be:
>
>   - decide that #2 and #3 will be the new seed nodes
>   - update all the configuration files of all the nodes to write the IP
> addresses of #2 and #3
>   - DON'T restart any node - the new seed configuration will be picked up
> only if the Cassandra process restarts
>

You can provide a custom implementation of the seed provider protocol:
org.apache.cassandra.locator.SeedProvider

We were exploring that approach a few years ago with etcd, which I think
provides capabilities similar to that of Consul:
https://github.com/a1exsh/cassandra-etcd-seed-provider/blob/master/src/main/java/org/zalando/cassandra/locator/EtcdSeedProvider.java

We are not using this anymore, but for other reasons (namely, we were too
optimistic about putting a Cassandra cluster into an AWS Auto Scaling group).
The SeedProvider itself seemed to have worked as we expected.
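
Wiring in a custom provider is then just a matter of pointing cassandra.yaml
at the class (sketch; the class name is taken from the repository linked
above, while the parameter name below is made up for illustration, as each
provider defines its own options):

seed_provider:
    - class_name: org.zalando.cassandra.locator.EtcdSeedProvider
      parameters:
          - endpoints: "http://etcd.example.com:2379"

The jar containing the provider class has to be on the server's classpath,
of course.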

Hope this helps,
--
Alex


How to upgrade logback dependency

2019-02-12 Thread Oleksandr Shulgin
Hi,

The latest release notes for all versions mention that logback < 1.2.0 is
subject to CVE-2017-5929 and that the logback version is not upgraded.
E.g:
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-3.0.18

Indeed, when installing 3.0.18 from the deb package I still see the older
version:

# ls -l /usr/share/cassandra/lib/logback*
-rw-r--r-- 1 root root 280926 Feb  1 18:37
/usr/share/cassandra/lib/logback-classic-1.1.3.jar
-rw-r--r-- 1 root root 455041 Feb  1 18:37
/usr/share/cassandra/lib/logback-core-1.1.3.jar

Given that I can install a newer logback version, for example, using apt-get
install liblogback (which currently pulls 1.2.3), how do I make sure
Cassandra uses the newer one?

Should I put the newer jars on the CLASSPATH before starting the server?
Examining /usr/share/cassandra/cassandra.in.sh suggests that this is likely
to do the trick, but is this the way to go or is there a better way?
Didn't find this documented anywhere.
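
Concretely, I was thinking of something along these lines (untested sketch;
the /usr/share/java paths are my assumption about where the Debian logback
package puts its jars):

# prepend the newer jars so they take precedence over the bundled 1.1.3 ones,
# e.g. by editing /usr/share/cassandra/cassandra.in.sh:
CLASSPATH="/usr/share/java/logback-classic.jar:/usr/share/java/logback-core.jar:$CLASSPATH"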

Regards,
-- 
Alex


Re: How to upgrade logback dependency

2019-02-13 Thread Oleksandr Shulgin
On Tue, Feb 12, 2019 at 7:02 PM Michael Shuler 
wrote:

> If you are not using the logback SocketServer and ServerSocketReceiver
> components, the CVE doesn't affect your server with logback 1.1.3.
>

So the idea is that as long as logback.xml doesn't configure any of the
above, we are fine with the current logback version?

Thanks,
--
Alex


Re: forgot to run nodetool cleanup

2019-02-13 Thread Oleksandr Shulgin
On Wed, Feb 13, 2019 at 5:31 AM Jeff Jirsa  wrote:

> The most likely result of not running cleanup is wasted disk space.
>
> The second most likely result is resurrecting deleted data if you do a
> second range movement (expansion, shrink, etc).
>
> If this is bad for you, you should run cleanup now. For many use cases,
> it’s a nonissue.
>
> If you know you’re going to add more hosts, be very sure you run cleanup
> before you do so.
>

Jeff,

Could you please expand a little?  Do you mean that adding new hosts can
lead to deleted data resurrection if cleanup isn't done prior to that?

I would only expect this to be a potential problem if one removes nodes,
since then range ownership can expand, but not with adding nodes, as then
ownership can only shrink.  Or am I missing something bigger?

--
Alex


Re: forgot to run nodetool cleanup

2019-02-13 Thread Oleksandr Shulgin
On Wed, Feb 13, 2019 at 4:40 PM Jeff Jirsa  wrote:

> Some people who add new hosts rebalance the ring afterward - that
> rebalancing can look a lot like a shrink.
>

You mean by moving the tokens?  That's only possible if one is not using
vnodes, correct?

I also believe, but don’t have time to prove, that enough new hosts can
> eventually give you a range back (moving it all the way around the ring) -
> less likely but probably possible.
>
> Easiest to just assume that any range movement may resurrect data if you
> haven’t run cleanup.
>

Does this mean that it is recommended to run cleanup on all hosts after
every single node is added?  We currently do this after every 3 or 6 nodes (1
or 2 new per rack), to minimize the number of times we have to rewrite the
sstable files.  Arguably, we don't do explicit deletes, the data is only
expiring due to TTL, so this should not be a problem for us, but in general?

--
Alex


Re: forgot to run nodetool cleanup

2019-02-14 Thread Oleksandr Shulgin
On Wed, Feb 13, 2019 at 6:47 PM Jeff Jirsa  wrote:

> Depending on how bad data resurrection is, you should run it for any host
> that loses a range. In vnodes, that's usually all hosts.
>
> Cleanup with LCS is very cheap. Cleanup with STCS/TWCS is a bit more work.
>

Wait, doesn't cleanup just rewrite every SSTable one by one?  Why would
compaction strategy matter?  Do you mean that after cleanup STCS may pick
some resulting tables to re-compact them due to the min/max size
difference, which would not be the case with LCS?


> If you're just TTL'ing all data, it may not be worth the effort.
>

Indeed, but in our case the main reason to scale out is that the nodes are
running out of disk space, so we really want to get rid of the extra copies.

--
Alex


Re: forgot to run nodetool cleanup

2019-02-14 Thread Oleksandr Shulgin
On Thu, Feb 14, 2019 at 4:39 PM Jeff Jirsa  wrote:
>
> Wait, doesn't cleanup just rewrite every SSTable one by one?  Why would
> compaction strategy matter?  Do you mean that after cleanup STCS may pick
> some resulting tables to re-compact them due to the min/max size
> difference, which would not be the case with LCS?
>
>
> LCS has smaller, non-overlapping files. The upleveling process and
> non-overlapping part makes it very likely (but not guaranteed) that within
> a level, only 2 sstables will overlap a losing range.
>
> Since cleanup only rewrites files if they’re out of range, LCS probably
> only has 5 (levels) * 2 (lower and upper) * number of ranges sstables that
> are going to get rewritten, where TWCS / STCS is probably going to rewrite
> all of them.

Thanks for the explanation!

Still with the default number of vnodes, there is probably not much of a
difference as even a single additional node will touch a lot of ranges?

--
Alex


Re: Question on changing node IP address

2019-02-26 Thread Oleksandr Shulgin
On Tue, Feb 26, 2019 at 9:39 AM wxn...@zjqunshuo.com 
wrote:

>
> I'm running 2.2.8 with vnodes and I'm planning to change node IP address.
> My procedure is:
> Turn down one node, setting auto_bootstrap to false in yaml file, then
> bring it up with -Dcassandra.replace_address. Repeat the procedure one by
> one for the other nodes.
>
> I care about streaming because the data is very large and if there is
> streaming, it will take a long time. When the node with new IP be brought
> up, will it take over the token range it has before? I expect no token
> range reassignment and no streaming. Am I right?
>
> Any thing I need care about when making IP address change?
>

Changing the IP address of a node does not require special considerations.
After restart with the new address the server will notice it and log a
warning, but it will keep token ownership as long as it keeps the old host
id (meaning it must use the same data directory as before restart).

At the same time, *do not* use the replace_address option: it assumes empty
data directory and will try to stream data from other replicas into the
node.
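
The restart sequence is then just the usual one (a minimal sketch, assuming
a packaged install; service names and config paths may differ):

$ nodetool drain
$ sudo service cassandra stop
# change the machine's IP address; update listen_address / rpc_address
# (and the broadcast_* settings, if set explicitly) in cassandra.yaml
$ sudo service cassandra start
$ nodetool status    # the node shows the new address but the same Host ID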

--
Alex


Re: [EXTERNAL] Re: Question on changing node IP address

2019-02-26 Thread Oleksandr Shulgin
On Tue, Feb 26, 2019 at 3:26 PM Durity, Sean R 
wrote:

> This has not been my experience. Changing IP address is one of the worst
> admin tasks for Cassandra. System.peers and other information on each node
> is stored by ip address. And gossip is really good at sending around the
> old information mixed with new…
>

Hm, on which version was that?  I might be biased, not having worked with
anything but 3.0 recently.

--
Alex


Re: Question on changing node IP address

2019-02-26 Thread Oleksandr Shulgin
On Wed, Feb 27, 2019 at 4:15 AM wxn...@zjqunshuo.com 
wrote:

> >After restart with the new address the server will notice it and log a
> warning, but it will keep token ownership as long as it keeps the old host
> id (meaning it must use the same data directory as before restart).
>
> Based on my understanding, token range is bound to host id. As long as
> host id doesn't change, everything is ok. Besides data directory, any other
> thing can lead to host id change? And how is host id calculated? For
> example, if I upgrade Cassandra binary to a new version, after restart,
> will host id change?
>

I believe the host id is generated once when the new node is initialized and
never changes afterwards, even through major upgrades.  It is stored in the
system keyspace in the data directory, and is stable across restarts.
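
You can easily verify that before and after a restart or an upgrade (sketch):

$ nodetool info | grep '^ID'
$ cqlsh -e "SELECT host_id FROM system.local"
# both should keep reporting the same UUID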

--
Alex


Re: [EXTERNAL] Re: Question on changing node IP address

2019-02-27 Thread Oleksandr Shulgin
On Wed, Feb 27, 2019 at 3:11 PM Durity, Sean R 
wrote:

> We use the PropertyFileSnitch precisely because it is the same on every
> node. If each node has to have a different file (for GPFS) – deployment is
> more complicated. (And for any automated configuration you would have a
> list of hosts and DC/rack information to compile anyway)
>
>
>
> I do put UNKNOWN as the default DC so that any missed node easily appears
> in its own unused DC.
>

Alright, it obviously makes a lot of difference which snitch to use.  We
are deploying to EC2, so we are using the EC2 snitches at all times.  I
guess some complexity is hidden from us by these custom implementations.

At the same time, we do try to assign IP addresses in a predictable manner
when deploying a new cluster, in order to fix the list of seed nodes in
advance (we wouldn't care about the rest of nodes).

So I think, for the original question: be careful when changing the IP
address of seed nodes.  You probably want to start with non-seeds and
promote some of them to seeds before you start changing the IP addresses of
the old seeds.

--
Alex


Re: About using Ec2MultiRegionSnitch

2019-03-06 Thread Oleksandr Shulgin
On Tue, Mar 5, 2019 at 2:24 PM Jeff Jirsa  wrote:

> Ec2 multi should work fine in one region, but consider using
> GossipingPropertyFileSnitch if there’s even a chance you’ll want something
> other than AWS regions as dc names - multicloud, hybrid, analytics DCs, etc
>

For the record, DC names can be adjusted separately by using the
cassandra-rackdc.properties file, without moving away from the EC2 snitches.
This doesn't give you full control, but it is good enough for setting up an
analytical DC or for cross-DC migrations while staying within the same AWS
region.
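
For example, for an analytical ring in the same region you would put
something like this into cassandra-rackdc.properties on its nodes (sketch;
the suffix itself is arbitrary):

dc_suffix=_analytics

The suffix gets appended to the EC2-derived DC name, so those nodes show up
as a separate datacenter, while the rack (availability zone) assignment
stays automatic.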

--
Alex


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-26 Thread Oleksandr Shulgin
On Mon, Mar 25, 2019 at 11:13 PM Carl Mueller
 wrote:

>
> Since the internal IPs are given when the client app connects to the
> cluster, the client app cannot communicate with other nodes in other
> datacenters.
>

Why should it?  The client should only connect to its local data center and
leave communication with remote DCs to the query coordinator.


> They seem to be able to communicate within its own datacenter of the
> initial connection.
>

Did you configure address translation on the client?  See:
https://docs.datastax.com/en/developer/java-driver/3.0/manual/address_resolution/#ec2-multi-region

It appears we fixed this by manually updating the system.peers table's
> rpc_address column back to the public IP. This appears to survive a restart
> of the cassandra nodes without being switched back to private IPs.
>

I don't think updating system tables is a supported solution.  I'm
surprised that it doesn't even give you an error.

Our cassandra.yaml (these parameters are the same in our confs for 2.1 and
> 2.2) has:
>
> listen_address: internal aws vpc ip
> rpc_address: 0.0.0.0
> broadcast_rpc_address: internal aws vpc ip
>

It is not straightforward to find the docs for version 2.x anymore, but at
least for 3.0 it is documented that you should set broadcast_rpc_address to
the public IP:
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archSnitchEC2MultiRegion.html

Regards,
--
Alex


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-26 Thread Oleksandr Shulgin
On Tue, Mar 26, 2019 at 5:49 PM Carl Mueller
 wrote:

> Looking at the code it appears it shouldn't matter what we set the yaml
> params to. The Ec2MultiRegionSnitch should be using the aws metadata
> 169.254.169.254 to pick up the internal/external ips as needed.
>

This is somehow my expectation as well, so maybe the docs are just outdated.

I think I'll just have to dig in to the code differences between 2.1 and
> 2.2. We don't want to specify the glboal IP in any of the yaml fields
> because the global IP for the instance changes if we do an aws instance
> restart. Don't want yaml editing to be a part of the instance restart
> process.
>

We did solve this in the past by using Elastic IPs: is there anything that
prevents you from using those?

--
Alex


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-27 Thread Oleksandr Shulgin
On Tue, Mar 26, 2019 at 10:28 PM Carl Mueller
 wrote:

> - the AWS people say EIPs are a PITA.
>

Why?


> - if we hardcode the global IPs in the yaml, then yaml editing is required
> for the occasional hard instance reboot in aws and its attendant global ip
> reassignment
> - if we try leaving broadcast_rpc_address blank, null , or commented out
> with rpc_address set to 0.0.0.0 then cassandra refuses to start
>

Yeah, that's not nice.

> - if we take out rpc_address and broadcast_rpc_address, then cqlsh doesn't
> work with localhost anymore and that fucks up some of our cluster
> management tooling
>
> - we kind of are being lazy and just want what worked in 2.1 to work in 2.2
>

Makes total sense to me.

> I'll try to track down where cassandra startup is complaining to us about
> rpc_address: 0.0.0.0 and broadcast_rpc_address being blank/null/commented
> out. That section of code may need an exception for EC2MRS.
>

It sounds like this check is done before instantiating the snitch, and it
should be the other way round, so that the snitch has a chance to adjust
the configuration before it is checked for correctness.  Do you have the
exact error message with which it complains?

--
Alex


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-28 Thread Oleksandr Shulgin
On Wed, Mar 27, 2019 at 6:36 PM Carl Mueller
 wrote:

>
> EIPs per the aws experts cost money,
>

From what I know, they only cost you when you're not using them.  This page
shows that you are also charged if you remap them too often (more than 100
times a month), which I didn't realize:
https://aws.amazon.com/ec2/pricing/on-demand/#Elastic_IP_Addresses

are limited in resources (we have a lot of VMs) and cause a lot of
> headaches in our autoscaling / infrastructure as code systems.
>

But you are not trying to autoscale Cassandra, are you?  CloudFormation has
decent support for EIPs, e.g. you can allocate them by declaring them as
resources and then resolve the address to inject configuration parameters
into the application if needed.

> We are probably going to just have a VM startup script for now that
> automatically updates the yaml on instance restart. It seems to be the
> least-sucky approach at this point.

This is what we do for our Docker-based setup.  I think we were just
following the documentation, though your reading of the code suggests that
this shouldn't be required.

Regards,
--
Alex


How to install an older minor release?

2019-04-02 Thread Oleksandr Shulgin
Hello,

We've just noticed that we cannot install older minor releases of Apache
Cassandra from Debian packages, as described on this page:
http://cassandra.apache.org/download/

Previously we were doing the following at the last step: apt-get install
cassandra=3.0.17

Today it fails with error:
E: Version '3.0.17' for 'cassandra' was not found

And `apt-get show cassandra` reports only one version available, the latest
released one: 3.0.18
The packages for the older versions are still in the pool:
http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/

Was it always the case that only the latest version is available to be
installed directly with apt or did something change recently?
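
A workaround I can think of is to fetch the .deb directly from the pool and
install it with dpkg (sketch; I have not verified the exact file name, and
this obviously bypasses apt's dependency handling):

$ wget http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/cassandra_3.0.17_all.deb
$ sudo dpkg -i cassandra_3.0.17_all.deb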

Regards,
-- 
Alex


Re: Procedures for moving part of a C* cluster to a different datacenter

2019-04-03 Thread Oleksandr Shulgin
On Wed, Apr 3, 2019 at 12:28 AM Saleil Bhat (BLOOMBERG/ 731 LEX) <
sbha...@bloomberg.net> wrote:

>
> The standard procedure for doing this seems to be add a 3rd datacenter to
> the cluster, stream data to the new datacenter via nodetool rebuild, then
> decommission the old datacenter. A more detailed review of this procedure
> can be found here:
> http://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>
> However, I see two problems with the above protocol. First, it requires
> changes on the application layer because of the datacenter name change;
> e.g. all applications referring to the datacenter ‘Orlando’ will now have
> to be changed to refer to ‘Tampa’.
>

Alternatively, you may omit DC specification in the client and provide
internal network addresses as the contact points.

As such, I was wondering what peoples’ thoughts were on the following
> alternative procedure:
> 1) Kill one node in the old datacenter
> 2) Add a new node in the new datacenter but indicate that it is to REPLACE
> the one just shutdown; this node will bootstrap, and all the data which it
> is supposed to be responsible for will be streamed to it
>

I don't think this is going to work.  First, I believe streaming for
bootstrap or for replacing a node is DC-local, so the first node won't have
any peers to stream from.  Even if it could stream from the remote DC, this
single node will own 100% of the ring and will most likely die of the load
well before it finishes streaming.

Regards,
-- 
Alex


Re: Procedures for moving part of a C* cluster to a different datacenter

2019-04-03 Thread Oleksandr Shulgin
On Wed, Apr 3, 2019 at 4:37 PM Saleil Bhat (BLOOMBERG/ 731 LEX) <
sbha...@bloomberg.net> wrote:

>
> Thanks for the reply! One clarification: the replacement node WOULD be
> DC-local as far as Cassandra is is concerned; it would just be in a
> different physical DC. Using the Orlando -> Tampa example, suppose my DC
> was named 'floridaDC' in Cassandra. Then I would just kill a node in
> Orlando, and start a new one in Tampa with the same DC name, 'floridaDC'.
> So from Cassandra's perspective, the replacement node is in the same
> datacenter as the old one was. It will be responsible for the same tokens
> as the old Orlando node, and bootstrap accordingly.
>
> Would this work?
>

Ah, this is a different story.  Assuming you can figure out connectivity
between the locations and assign the rack for the replacement node
properly, I don't see why this shouldn't work.

At the same time, if you really care about data consistency, you will have
to run more repairs than with the documented procedure of adding/removing a
virtual DC.  Replacing a node does not work exactly like bootstrap does, so
after the streaming has finished you should repair the newly started node.
And I guess you really should run it after replacing every single node, not
after replacing all nodes.
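
In practice I would expect that to be a full repair run on the replaced node
right after it joins, something like (sketch; without -pr, so that all the
ranges the node is a replica for get covered):

$ nodetool repair -full <keyspace>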

--
Alex


Re: New user on Ubuntu 18.04 laptop, nodetest status throws NullPointerException

2019-04-03 Thread Oleksandr Shulgin
On Wed, Apr 3, 2019 at 4:23 PM David Taylor  wrote:

>
> $ nodetest status
> error: null
> -- StackTrace --
> java.lang.NullPointerException
> at
> org.apache.cassandra.config.DatabaseDescriptor.getDiskFailurePolicy(DatabaseDescriptor.java:1892)
>

Could it be that your user doesn't have permissions to read the config file
in /etc?

--
Alex


Re: Recover lost node from backup or evict/re-add?

2019-06-13 Thread Oleksandr Shulgin
On Wed, Jun 12, 2019 at 4:02 PM Jeff Jirsa  wrote:

> To avoid violating consistency guarantees, you have to repair the replicas
> while the lost node is down
>

How do you suggest triggering it?  Potentially replicas of the primary
range for the down node are all over the local DC, so I would go with
triggering a full cluster repair with Cassandra Reaper.  But isn't it going
to fail because of the down node?

It is also documented (I believe) that one should repair the node after it
finishes the "replace address" procedure.  So should one repair before and
after?

--
Alex


Re: very slow repair

2019-06-13 Thread Oleksandr Shulgin
On Thu, Jun 13, 2019 at 10:36 AM R. T. 
wrote:

>
> Well, actually by running cfstats I can see that the totaldiskspaceused is
> about ~ 1.2 TB per node in the DC1 and ~ 1 TB per node in DC2. DC2 was off
> for a while thats why there is a difference in space.
>
> I am using Cassandra 3.0.6 and
> my stream_throughput_outbound_megabits_per_sec is the default setting so
> according to my version is (200 Mbps or 25 MB/s)
>

And what about the other setting, compaction_throughput_mb_per_sec?  It is
also highly relevant for repair performance, as streamed-in files need to be
compacted with the existing files on the nodes.  In our experience, a change
in the compaction throughput limit is almost linearly reflected in the
repair run time.

The default 16 MB/s is too limiting for any production-grade setup, I
believe.  We go as high as 90 MB/s on AWS EBS gp2 data volumes.  But don't
take that as gospel: I'd suggest you start increasing the setting (e.g. by
doubling it) and observe how it affects repair performance (and client
latencies).
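
The nice part is that the setting can be changed on a live node without a
restart, which makes the experiment cheap (sketch):

$ nodetool getcompactionthroughput
Current compaction throughput: 16 MB/s
$ nodetool setcompactionthroughput 32
# repeat with 64, 90, ... while watching repair progress and client
# latencies; to make the change permanent, also adjust
# compaction_throughput_mb_per_sec in cassandra.yaml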

Have you tried with "parallel" instead of "DC parallel" mode?  The latter
one is really poorly named and it actually means something else, as neatly
highlighted in this SO answer: https://dba.stackexchange.com/a/175028

Last, but not least: are you using the default number of vnodes, 256?  The
overhead of a large number of vnodes (times the number of nodes) can be
quite significant.  We've seen major improvements in repair runtime after
switching from 256 to 16 vnodes on Cassandra version 3.0.

Cheers,
--
Alex


Re: Speed up compaction

2019-06-13 Thread Oleksandr Shulgin
On Thu, Jun 13, 2019 at 11:28 AM Léo FERLIN SUTTON
 wrote:

>
> ## Cassandra configuration :
> 4 concurrent_compactors
> Current compaction throughput: 150 MB/s
> Concurrent reads/write are both set to 128.
>
> I have also temporarily stopped every repair operations.
>
> Any ideas about how I can speed this up ?
>

Hi,

What is the compaction strategy used by this column family?

Do you observe this behavior on one of the nodes only?  Have you tried
cancelling this compaction to see whether a new one is started and makes
better progress?  Can you try restarting the affected node?

Regards,
--
Alex


Re: Speed up compaction

2019-06-13 Thread Oleksandr Shulgin
On Thu, Jun 13, 2019 at 2:07 PM Léo FERLIN SUTTON
 wrote:

>
>  Overall we are talking about a 1.08TB table, using LCS.
>
> SSTable count: 1047
>> SSTables in each level: [15/4, 10, 103/100, 918, 0, 0, 0, 0, 0]
>
> SSTable Compression Ratio: 0.5192269874287099
>
> Number of partitions (estimate): 7282253587
>
>
> We have recently (about a month ago) deleted about 25% of the data in that
> table.
>
> Letting Cassandra reclaim the disk space on it's own (via regular
> compactions) was too slow for us, so we wanted to force a compaction on the
> table to reclaim the disk space faster.
>

To be clear, that compaction task is running the major compaction for this
column family?  I have no experience with the Leveled compaction strategy,
so I'm not really sure what behavior to expect from it.  I can imagine that
with that many SSTables a major compaction incurs quite some overhead, as it
does more than the ordinary merge-sort I would expect from Size-Tiered.

--
Alex


Re: very slow repair

2019-06-13 Thread Oleksandr Shulgin
On Thu, Jun 13, 2019 at 2:09 PM Léo FERLIN SUTTON
 wrote:

> Last, but not least: are you using the default number of vnodes, 256?  The
>> overhead of large number of vnodes (times the number of nodes), can be
>> quite significant.  We've seen major improvements in repair runtime after
>> switching from 256 to 16 vnodes on Cassandra version 3.0.
>
>
> Is there a recommended procedure to switch the amount of vnodes ?
>

Yes.  One should deploy a new virtual DC with the desired configuration and
rebuild from the original one, then decommission the old virtual DC.

With the smaller number of vnodes you should use the
allocate_tokens_for_keyspace configuration parameter to ensure uniform load
distribution.  The caveat is that the nodes allocate tokens before they
bootstrap, so the very first nodes will not have keyspace information
available.  This can be worked around, though it is not trivial.  See this
thread for our past experience:
https://lists.apache.org/thread.html/396f2d20397c36b9cff88a0c2c5523154d420ece24a4dafc9fde3d1f@%3Cuser.cassandra.apache.org%3E
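
For the nodes of the new virtual DC the relevant cassandra.yaml bits are
roughly (sketch; the keyspace name is just an example):

num_tokens: 16
allocate_tokens_for_keyspace: my_keyspace

As noted above, the very first nodes of the new DC will not yet know about
that keyspace when they allocate their tokens, which is the part that needs
the workaround described in the linked thread.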

--
Alex


Re: Recover lost node from backup or evict/re-add?

2019-06-13 Thread Oleksandr Shulgin
On Thu, Jun 13, 2019 at 3:16 PM Jeff Jirsa  wrote:

> On Jun 13, 2019, at 2:52 AM, Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
> On Wed, Jun 12, 2019 at 4:02 PM Jeff Jirsa  wrote:
>
> To avoid violating consistency guarantees, you have to repair the replicas
>> while the lost node is down
>>
>
> How do you suggest to trigger it?  Potentially replicas of the primary
> range for the down node are all over the local DC, so I would go with
> triggering a full cluster repair with Cassandra Reaper.  But isn't it going
> to fail because of the down node?
>
> Im not sure there’s an easy and obvious path here - this is something TLP
> may want to enhance reaper to help with.
>
> You have to specify the ranges with -st/-et, and you have to tell it to
> ignore the down host with -hosts. With vnodes you’re right that this may be
> lots and lots of ranges all over the ring.
>
> There’s a patch proposed (maybe committed in 4.0) that makes this a
> nonissue by allowing bootstrap to stream one repaired set and all of the
> unrepaired replica data (which is probably very small if you’re running IR
> regularly), which accomplished the same thing.
>

Ouch, it really hurts to learn this. :(

> It is also documented (I believe) that one should repair the node after it
> finishes the "replace address" procedure.  So should one repair before and
> after?
>
> You do not need to repair after the bootstrap if you repair before. If the
> docs say that, they’re wrong. The joining host gets writes during bootstrap
> and consistency levels are altered during bootstrap to account for the
> joining host.
>

This is what I had in mind (what makes replacement different from actual
bootstrap of a new node):
http://cassandra.apache.org/doc/latest/operating/topo_changes.html?highlight=replace%20address#replacing-a-dead-node


Note

If any of the following cases apply, you MUST run repair to make the replaced
node consistent again, since it missed ongoing writes during/prior to
bootstrapping. The *replacement* timeframe refers to the period from when
the node initially dies to when a new node completes the replacement
process.


   1. The node is down for longer than max_hint_window_in_ms before being
      replaced.
   2. You are replacing using the same IP address as the dead node and
      replacement takes longer than max_hint_window_in_ms.


I would imagine that any production-size instance would take way longer to
replace than the default max hint window (which is 3 hours, AFAIK).  I
didn't remember the same-IP restriction, but that I would also expect to be
the most common setup.

--
Alex


Re: Recover lost node from backup or evict/re-add?

2019-06-14 Thread Oleksandr Shulgin
On Thu, Jun 13, 2019 at 3:41 PM Jeff Jirsa  wrote:

>
> Bootstrapping a new node does not require repairs at all.
>

Was my understanding as well.

Replacing a node only requires repairs to guarantee consistency to avoid
> violating quorum because streaming for bootstrap only streams from one
> replica
>
> Think this way:
>
> Host 1, 2, 3 in a replica set
> You write value A to some key
> It lands on hosts 1 and 3. Host 2 was being restarted or something
> Host 2 comes back up
> Host 3 fails
>
> If you replace 3 with 3’ -
> 3’ may stream from host 1 and now you’ve got a quorum of replicas with A
> 3’ may stream from host 2, and now you’ve got a quorum of replicas without
> A. This is illegal.
>
> This is just a statistics game - do you have hosts missing writes? If so,
> are hints delivering them when those hosts come back? What’s the cost of
> violating consistency in that second scenario to you?
>
> If you’re running something where correctness really really really
> matters, you must repair first. If you’re actually running a truly eventual
> consistency use case and reading stale writes is fine, you probably won’t
> ever notice.
>

Alright, this makes it much more clear, thank you.

In any case these docs are weird and wrong - joining nodes get writes in
> all versions of Cassandra for the past few years (at least 2.0+), so the
> docs really need to be fixed.
>

:(

--
Alex


Re: How can I check cassandra cluster has a real working function of high availability?

2019-06-17 Thread Oleksandr Shulgin
On Sat, Jun 15, 2019 at 4:31 PM Nimbus Lin  wrote:

> Dear cassandra's pioneers:
> I am a 5-year newbie; it is only now that I have time to use
> cassandra, but I can't check cassandra's high availability when I stop a
> seed node or a non-seed DN, as I would with CGE or Greenplum.
> Would someone tell me how to check cassandra's high
> availability? Even if I change the consistency level from one to local_one,
> the cqlsh select always returns a NoHostAvailable error.
>
>  By the way, would you like to answer me other two questions:
> 2nd question: although cassandra's consistency is a per-operation
> setting, isn't there a whole all operations' consistency setting method?
> 3rd question: how can I see a cassandra cluster's running variables, like
> mysql's show global variables? such as hidden variable of  auto_bootstrap?
>

Hi,

For the purpose of serving client requests, all nodes are equal -- seed or
not.  So it shouldn't matter which node you are stopping (or making it
unavailable for the rest of the cluster using other means).

In order to test it with cqlsh you should ensure that the replication
factor of the keyspace you're testing with is sufficient.  Given the
NoHostAvailable exception that you are experiencing at consistency level
ONE (or LOCAL_ONE), my guess is that you are testing with a keyspace with
replication factor 1 and the node which is unavailable happens to be
responsible for the particular partition.

For your second question: it depends on the client (or "client driver") you
are using.  In cqlsh you can set the consistency level that will be applied
to all subsequent queries using the "CONSISTENCY ..." command.  I think the
Java driver has an option to set the default consistency level, as well as
an option to set the consistency level per query.  Most likely
this is also true for Python and other drivers.

And for the third question: I'm not aware of a CQL or nodetool command that
would fulfill the need.  Most likely it is possible to learn (and update)
most of the configuration parameters using JMX, e.g. with JConsole:
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsMonitoring.html#opsMonitoringJconsole

Cheers,
--
Alex


Re: Cassandra migration from 1.25 to 3.x

2019-06-17 Thread Oleksandr Shulgin
On Mon, Jun 17, 2019 at 9:30 AM Anurag Sharma 
wrote:

>
> We are upgrading Cassandra from 1.25 to 3.X. Just curious if there is any
> recommended open source utility for the same.
>

Hi,

The "recommended  open source utility" is the Apache Cassandra itself. ;-)

Given the huge difference between the major versions, though, you will need
a decent amount of planning and preparation to successfully complete such a
migration.  Most likely you will want to do it in small steps, first
upgrading to the latest minor version in the 1.x series, then making a jump
to 2.x, then to 3.0, and only then to 3.x if you really mean to.  On each
upgrade step, be sure to examine the release notes carefully to understand
if there is any impact for your cluster and/or client applications.  Do
have a test system with preferably identical setup and configuration and
execute the upgrade steps there first to verify your expectations.
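
The per-node routine at every such step is the usual rolling-upgrade dance,
roughly (sketch; package and service names depend on your installation):

$ nodetool drain                  # flush memtables, stop accepting writes
$ sudo service cassandra stop
$ sudo apt-get install cassandra=<target version>   # or your equivalent
# review and merge cassandra.yaml / cassandra-env.sh changes for the new version
$ sudo service cassandra start
$ nodetool upgradesstables        # rewrite sstables into the new format

Upgrade one node at a time and check the logs and nodetool status before
moving on to the next one.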

Good luck!
--
Alex


Re: Compaction resume after node restart and upgrade

2019-06-18 Thread Oleksandr Shulgin
On Tue, Jun 18, 2019 at 3:08 AM Jeff Jirsa  wrote:

> Yes  - the incomplete sstable will be deleted during startup (in 3.0 and
> newer there’s a transaction log of each compaction in progress - that gets
> cleaned during the startup process)
>

Wait, does that mean that in pre-3.0 versions one can get incomplete
SSTables lying around and not cleaned up automatically?

--
Alex


Re: Cassandra Tombstone

2019-06-18 Thread Oleksandr Shulgin
On Tue, Jun 18, 2019 at 8:06 AM ANEESH KUMAR K.M  wrote:

>
> I am using Cassandra cluster with 3 nodes which is hosted on AWS. Also we
> have NodeJS web Application which is on AWS ELB. Now the issue is that,
> when I add 2 or more servers (nodeJS) in AWS ELB then the delete queries
> are not working on Cassandra.
>

Please provide a more concrete description than "not working".  Do you get
an error?  Which one?  Does it "not work" silently, i.e. without an error,
so that you don't observe the expected effect?  What does the delete query
look like, what effect do you expect, and what do you observe instead?

--
Alex


Re: Upgrade sstables vs major compaction

2019-06-24 Thread Oleksandr Shulgin
On Fri, Jun 21, 2019 at 7:02 PM Nitan Kainth  wrote:

>
> we upgraded binaries from 3.0 to 4.0.
>

Where did you get the binaries for 4.0?  It is not released officially yet,
so I guess you were using SVN trunk?  Or was there a pre-release?

we run major compaction periodically for some valid reasons.
>
> Now, we are considering running major compaction instead of
> upgradesstables. Repair is disabled, cluster have normal reads/writes (few
> hundered/second).
>
> Plan is to run major compaction to clear tombstones and reduce data
> volume. Once major compaction is done, we will run upgradesstable for any
> leftover sstables if any.
>
> Can you please advise major compaction could cause issues instead of
> upgradesstables? Or any other feedback in this action plan?
>

Sounds legit.

-- 
Alex


Re: Ec2MultiRegionSnitch difficulties (3.11.2)

2019-06-27 Thread Oleksandr Shulgin
On Fri, Jun 28, 2019 at 3:14 AM Voytek Jarnot 
wrote:

> Curious if anyone could shed some light on this. Trying to set up a
> 4-node, one DC (for now, same region, same AZ, same VPC, etc) cluster in
> AWS.
>
> All nodes have the following config (everything else basically standard):
> cassandra.yaml:
>   listen_address: NODE?_PRIVATE_IP
>   seeds: "NODE1_ELASTIC_IP"
>   endpoint_snitch: Ec2MultiRegionSnitch
> cassandra-rackdc.properties:
>   empty except prefer_local=true
>
> I've tried setting
>   broadcast_address: NODE?_ELASTIC_IP
> But that didn't help - and it seems redundant, as it appears that that's
> what the Ec2MultiRegionSnitch does anyway.
>

Hi,

It is quite confusing, as I remember hitting this issue before.  You need
to set the following:

broadcast_rpc_address: NODE?_ELASTIC_IP

Even though listen_address and broadcast_address should be set by the EC2
snitch automatically, the above parameter is not, and this is the one that
the nodes are using to talk to each other.
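
So, in terms of your placeholders, the per-node settings would look like
this (sketch; rpc_address 0.0.0.0 assumes you want to listen for clients on
all interfaces):

listen_address: NODE?_PRIVATE_IP
rpc_address: 0.0.0.0
broadcast_rpc_address: NODE?_ELASTIC_IP
endpoint_snitch: Ec2MultiRegionSnitch
# broadcast_address is derived by the snitch from the instance metadata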

Cheers,
--
Alex


Re: Restore from EBS onto different cluster

2019-06-28 Thread Oleksandr Shulgin
On Fri, Jun 28, 2019 at 8:37 AM Ayub M  wrote:

> Hello, I have a cluster with 3 nodes - say cluster1 on AWS EC2 instances.
> The cluster is up and running, took snapshot of the keyspaces volume.
>
> Now I want to restore few tables/keyspaces from the snapshot volumes, so I
> created another cluster say cluster2 and attached the snapshot volumes on
> to the new cluster's ec2 nodes. Cluster2 is not starting bcz the system
> keyspace in the snapshot taken was having cluster name as cluster1 and the
> cluster on which it is being restored is cluster2. How do I do a restore in
> this case? I do not want to do any modifications to the existing cluster.
>

Hi,

I would try to use the same cluster name, just to be able to restore it, and
ensure that the nodes of cluster2 cannot talk to cluster1, for example by
means of Security Groups.

Also when I do restore do I need to think about the token ranges of the old
> and new cluster's mapping?
>

Absolutely.  For a successful restore you must ensure that you restore a
snapshot from a rack (Availability Zone, if you're using the EC2Snitch) into
the same rack in the new cluster.

Regards,
--
Alex


Re: Securing cluster communication

2019-06-28 Thread Oleksandr Shulgin
On Fri, Jun 28, 2019 at 3:57 PM Marc Richter  wrote:

>
> How is this dealt with in Cassandra? Is setting up firewalls the only
> way to allow only some nodes to connect to the ports 7000/7001?
>

Hi,

You can set

server_encryption_options:
    internode_encryption: all
    ...

and distribute the same key/trust-store on each node of the same cluster.
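
A slightly more complete sketch of that section (the paths and passwords are
placeholders; the remaining options can stay at their defaults):

server_encryption_options:
    internode_encryption: all
    keystore: /etc/cassandra/server-keystore.jks
    keystore_password: <keystore password>
    truststore: /etc/cassandra/server-truststore.jks
    truststore_password: <truststore password>
    # require_client_auth: true   # optionally require mutual authentication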

Cheers,
--
Alex


Re: Running Node Repair After Changing RF or Replication Strategy for a Keyspace

2019-06-28 Thread Oleksandr Shulgin
On Fri, Jun 28, 2019 at 11:29 PM Jeff Jirsa  wrote:

>  you often have to run repair after each increment - going from 3 -> 5
> means 3 -> 4, repair, 4 -> 5 - just going 3 -> 5 will violate consistency
> guarantees, and is technically unsafe.
>

Jeff,

How is going from 3 -> 4 *not violating* consistency guarantees already?
Are you assuming quorum writes and reads and a perfectly repaired keyspace?

Regards,
--
Alex


Re: Running Node Repair After Changing RF or Replication Strategy for a Keyspace

2019-06-30 Thread Oleksandr Shulgin
On Sat, Jun 29, 2019 at 5:49 AM Jeff Jirsa  wrote:

> If you’re at RF= 3 and read/write at quorum, you’ll have full visibility
> of all data if you switch to RF=4 and continue reading at quorum because
> quorum of 4 is 3, so you’re guaranteed to overlap with at least one of the
> two nodes that got all earlier writes
>
> Going from 3 to 4 to 5 requires a repair after 4.
>

Understood, thanks for detailing it.

At the same time, is it ever practical to use RF > 3?  Is it practical to
switch to 5 if you already have 3?

I imagine this question pops up more often in the context of switching from
RF < 3 to RF = 3, as well as switching from non-NTS to NTS, in which case it
is indeed quite troublesome, as you have pointed out.

--
Alex


Re: how to change a write's and a read's consistency level separately in cqlsh?

2019-07-01 Thread Oleksandr Shulgin
On Sat, Jun 29, 2019 at 6:19 AM Nimbus Lin  wrote:

>
>   On the 2nd question, would you like to tell me how to change a
> write's and a read's consistency level separately in cqlsh?
>

Not that I know of any special syntax for that, but you may add an explicit
"CONSISTENCY <level>" command before every command in your script, if you
like.
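
For example, in a cqlsh script (sketch; ks.tbl is a made-up table, and each
CONSISTENCY setting simply applies to every statement that follows it):

CONSISTENCY QUORUM;
INSERT INTO ks.tbl (id, value) VALUES (1, 'a');

CONSISTENCY LOCAL_ONE;
SELECT value FROM ks.tbl WHERE id = 1;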

> Otherwise, how does the documented R+W > replication-factor rule work to
> guarantee a strongly consistent write and read?
>

The most common setup, AFAIK, is to use a replication factor of 3 and set
consistency levels for both Reads and Writes to one of the Quorum levels:
e.g. both LOCAL_QUORUM, or LOCAL_QUORUM for the Reads and EACH_QUORUM for
the Writes, etc.  Since the quorum of 3 nodes is 2, you arrive at 2 + 2 >
3, which is what you are after.

--
Alex


Re: Splitting 2-datacenter cluster into two clusters

2019-07-11 Thread Oleksandr Shulgin
On Thu, Jul 11, 2019 at 5:04 PM Voytek Jarnot 
wrote:

> My google-fu is failing me this morning. I'm looking for any tips on
> splitting a 2 DC cluster into two separate clusters. I see a lot of docs
> about decomissioning a datacenter, but not much in the way of disconnecting
> datacenters into individual clusters, but keeping each one as-is data-wise
> (aside from replication factor, of course).
>
> Our setup is simple: two DCs (dc1 and dc2), two seed nodes (both in dc1;
> yes, I know not the recommended config), one keyspace (besides the system
> ones) replicated in both DCs. I'm trying to end up with two clusters with 1
> DC in each.
>

The first step that comes to my mind would be this:

cql in dc1=> ALTER KEYSPACE data WITH ... {'dc1': RF};

cql in dc2=> ALTER KEYSPACE data WITH ... {'dc2': RF};
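
Spelled out, assuming NetworkTopologyStrategy and RF 3 in each DC (adjust to
your actual settings), the first of those would be:

ALTER KEYSPACE data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

and the mirror statement with 'dc2' on the other side.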

Then you probably want to update the seeds list of each DC to contain only
nodes from the local DC.  The next step is tricky, but it feels like you
will need to isolate the networks, so that nodes in DC1 see all nodes from
DC2 as DOWN and vice versa, and then assassinate the nodes from the remote
DC.

Testing in a lab first is probably a good idea. ;-)

-- 
Alex


Re: TWCS generates large numbers of sstables on only some nodes

2019-07-15 Thread Oleksandr Shulgin
On Mon, Jul 15, 2019 at 6:20 PM Carl Mueller
 wrote:

> Related to our overstreaming, we have a cluster of about 25 nodes, with
> most at about 1000 sstable files (Data + others).
>
> And about four that are at 20,000 - 30,000 sstable files (Data+Index+etc).
>
> We have vertically scaled the outlier machines and turned off compaction
> throttling thinking it was compaction that couldn't keep up. That
> stabilized the growth, but the sstable count is not going down.
>
> The TWCS code seems to highly bias towards "recent" tables for compaction.
> We figured we'd boost the throughput/compactors and that would solve the
> more recent ones, and the older ones would fall off. But the number of
> sstables has remained high on a daily basis on the couple "bad nodes".
>
> Is this simply a lack of sufficient compaction throughput? Is there
> something in TWCS that would force frequent flushing more than normal?
>

What does nodetool compactionstats say about pending compaction tasks on
the affected nodes with the high number of files?
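For example (-H just makes the sizes human-readable):

  nodetool compactionstats -H

A "pending tasks" figure that stays high or keeps growing there would point
at compaction not keeping up, rather than at unusually frequent flushing.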

Regards,
-- 
Alex


Re: TWCS generates large numbers of sstables on only some nodes

2019-07-16 Thread Oleksandr Shulgin
On Tue, Jul 16, 2019 at 5:54 PM Carl Mueller
 wrote:

> stays consistently in the 40-60 range, but only recent tables are being
> compacted.
>

I would be alarmed at this point.  It definitely feels like compaction is
not aggressive enough: can you relax the throttling, or afford to have more
concurrent compactors, if only on these two nodes?
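A minimal way to lift the throttle at runtime on just the bad nodes (0 means
unthrottled; the number of concurrent compactors itself is the
concurrent_compactors setting in cassandra.yaml and, as far as I remember,
needs a restart to change on this version):

  nodetool getcompactionthroughput
  nodetool setcompactionthroughput 0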

--
Alex


Re: Jmx metrics shows node down

2019-07-29 Thread Oleksandr Shulgin
On Mon, Jul 29, 2019 at 1:21 PM Rahul Reddy 
wrote:

>
> Decommissioned 2 nodes from the cluster; nodetool status doesn't list the
> nodes, as expected, but JMX metrics still show those 2 nodes as down.
> Nodetool gossip shows the 2 nodes in LEFT state. Why does my JMX still
> show those nodes as down even after 24 hours? Cassandra version 3.11.3.
>

AFAIK, the nodes are not removed from gossip for 72 hours by default.


> Anything else need to be done?
>

Wait another 48 hours? ;-)

--
Alex


Re: Assassinate or decommission?

2019-07-30 Thread Oleksandr Shulgin
On Tue, Jul 30, 2019 at 12:11 PM Rhys Campbell
 wrote:

> Are you sure it says to use assassinate as the first resort? Definitely
> not the case.
>

It does.  I think the reason is that it says you should run a full repair
first and, before that, stop writing to the DC being decommissioned.  This
ensures that you do not lose any writes that made it only to the DC being
decommissioned, and it means there is no need to stream the data out of the
nodes being removed.
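Roughly, the order of operations as I read the docs (the keyspace name is a
placeholder):

  # after the clients have stopped writing to the DC being removed
  nodetool repair --full my_keyspace

  # then, for every node of that DC, from a node that stays in the cluster
  nodetool assassinate <ip-of-node-being-removed>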

--
Alex


Re: RMI TCP Connection threads

2019-07-30 Thread Oleksandr Shulgin
On Tue, Jul 30, 2019 at 12:34 PM Vlad  wrote:

> Restarting Cassandra helped.
>

But for how long?..

--
Alex


Re: Repair / compaction for 6 nodes, 2 DC cluster

2019-07-31 Thread Oleksandr Shulgin
On Wed, Jul 31, 2019 at 7:10 AM Martin Xue  wrote:

> Hello,
>
> Good day. This is Martin.
>
> Can someone help me with the following query regarding Cassandra repair
> and compaction?
>

Martin,

This blog post from The Last Pickle provides an in-depth explanation as
well as some practical advice:
https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html


The status hasn't changed since the time of writing, as far as I'm aware.
Nonetheless, you might want to upgrade to the latest released version in
the 3.0 series, which is 3.0.18: http://cassandra.apache.org/download/
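In case it helps: on 2.2+/3.0 incremental repair is the default, so if you
decide to stay with full repairs you have to ask for them explicitly, e.g.
(keyspace name is a placeholder):

  nodetool repair --full -pr my_keyspace

run on every node in turn, or let cassandra-reaper orchestrate the full
repair for you.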

Regards,
--
Alex


Re: What happened about one node in cluster is down?

2019-08-02 Thread Oleksandr Shulgin
>
> > 3. what is the best strategy under this senario?
>
> Go to RF=3 or read and write at quorum so you’re doing 3/4 instead of 2/2


Jeff, did you mean "2/3 vs. 2/2"?

-Alex


Re: Cassandra read requests not getting timeout

2019-08-05 Thread Oleksandr Shulgin
On Mon, Aug 5, 2019 at 8:50 AM nokia ceph  wrote:

> Hi Community,
>
> I am using Cassandra 3.0.13, a 5-node cluster with simple topology.
> Following are the timeout parameters in the yaml file:
>
> # grep timeout /etc/cassandra/conf/cassandra.yaml
> cas_contention_timeout_in_ms: 1000
> counter_write_request_timeout_in_ms: 5000
> cross_node_timeout: false
> range_request_timeout_in_ms: 1
> read_request_timeout_in_ms: 1
> request_timeout_in_ms: 1
> truncate_request_timeout_in_ms: 6
> write_request_timeout_in_ms: 2000
>
> I'm trying a Cassandra query using cqlsh and it is not timing out.
>
> #time cqlsh 10.50.11.11 -e "CONSISTENCY QUORUM; select
> asset_name,profile_name,job_index,active,last_valid_op,last_valid_op_ts,status,status_description,live_depth,asset_type,dest_path,source_docroot_name,source_asset_name,start_time,end_time,iptv,drm,geo,last_gc
> from cdvr.jobs where model_type ='asset' AND docroot_name='vx030'
>  LIMIT 10 ALLOW FILTERING;"
> Consistency level set to QUORUM.
> ()
> ()
> (79024 rows)
>
> real16m30.488s
> user0m39.761s
> sys 0m3.896s
>
> The query took 16.5 minutes to display the output, but my
> read_request_timeout is 10 seconds. Why didn't the query time out after
> 10 s?
>

Hi Renoy,

Have you tried the same query with TRACING enabled beforehand?
https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlshTracing.html

It doesn't sound too likely that the client alone needed 16 minutes to
display the result set, but any time spent on the client side is definitely
not included in the request timeout from the server's point of view.
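For example (I shortened the column list; the query itself stays the same):

  cqlsh> TRACING ON;
  cqlsh> CONSISTENCY QUORUM;
  cqlsh> SELECT asset_name, status FROM cdvr.jobs
         WHERE model_type = 'asset' AND docroot_name = 'vx030'
         LIMIT 10 ALLOW FILTERING;

The trace printed after the rows shows where the time is spent, per replica
and per step, and whether the coordinator is simply paging through many
internal requests which are each well under the 10 s limit.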

Cheers,
--
Alex


Re: To Repair or Not to Repair

2019-08-13 Thread Oleksandr Shulgin
On Thu, Mar 14, 2019 at 9:55 PM Jonathan Haddad  wrote:

> My coworker Alex (from The Last Pickle) wrote an in depth blog post on
> TWCS.  We recommend not running repair on tables that use TWCS.
>
> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>

Hi,

I was wondering about this again, as I've noticed one of the nodes in our
cluster accumulating ten times the number of files compared to the average
across the rest of the cluster.  The files are all coming from a table with
TWCS, and repair (running with Reaper) is ongoing.  The sudden growth
started around 24 hours ago, when the affected node was restarted due to a
failing AWS EC2 system check.  Now I'm wondering again whether we should be
running those repairs at all. ;-)

In the Summary of the blog post linked above, the following is written:

It is advised to disable read repair on TWCS tables, and use an aggressive
tombstone purging strategy as digest mismatches during reads will still
trigger read repairs.


Was it meant to read "disable anti-entropy repair" instead?  I find it
confusing otherwise.
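For reference, this is what I understand by disabling read repair at the
table level (option names as they exist in 3.x; the keyspace/table name is
only a placeholder here):

  ALTER TABLE my_ks.my_twcs_table
    WITH read_repair_chance = 0.0
     AND dclocal_read_repair_chance = 0.0;

which is something quite different from not scheduling anti-entropy (Reaper)
repairs for the table, hence the question.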

Regards,
--
Alex


  1   2   3   >