Re: gossipinfo contains two nodes dead for more than two years

2019-08-28 Thread Vincent Rischmann
Yep, they're not visible in either ring or status.

On Wed, Aug 28, 2019, at 17:08, Jeff Jirsa wrote:
> Based on what you've posted, I assume the instances are not visible in 
> `nodetool ring` or `nodetool status`, and the only reason you know they're 
> still in gossipinfo is you see them in the logs? If that's the case, then 
> yes, I would do `nodetool assassinate`.
> 
> 
> 
> On Wed, Aug 28, 2019 at 7:33 AM Vincent Rischmann  
> wrote:
>> Hi,
>> 
>> while replacing a node in a cluster I saw this log:
>> 
>>  2019-08-27 16:35:31,439 Gossiper.java:995 - InetAddress /10.15.53.27 is now 
>> DOWN
>> 
>> it caught my attention because that ip address doesn't exist anymore in the 
>> cluster and it hasn't for a long time.
>> 
>> After some reading I ran `nodetool gossipinfo` and I saw these entries which 
>> are nodes that don't exist anymore:
>> 
>>  /10.15.53.27
>>  generation:1503480618
>>  heartbeat:26970
>>  STATUS:2:hibernate,true
>>  LOAD:26810:6.17363354147E11
>>  SCHEMA:101:d21b1e47-f226-3417-8de7-5802518ae824
>>  DC:10:DC1
>>  RACK:12:RAC1
>>  RELEASE_VERSION:6:2.1.18
>>  INTERNAL_IP:8:10.15.53.27
>>  RPC_ADDRESS:5:10.15.53.27
>>  SEVERITY:26972:0.0
>>  NET_VERSION:3:8
>>  HOST_ID:4:2488fccc-108a-4a9d-ad43-5e8b8b6ee17b
>>  TOKENS:1:
>>  /10.5.1.16
>>  generation:1503636779
>>  heartbeat:324
>>  STATUS:2:hibernate,true
>>  LOAD:204:2.601990697532E12
>>  SCHEMA:14:d21b1e47-f226-3417-8de7-5802518ae824
>>  DC:10:DC1
>>  RACK:12:RAC1
>>  RELEASE_VERSION:6:2.1.18
>>  INTERNAL_IP:8:10.5.1.16
>>  RPC_ADDRESS:5:10.5.1.16
>>  SEVERITY:326:0.0
>>  NET_VERSION:3:8
>>  HOST_ID:4:2488fccc-108a-4a9d-ad43-5e8b8b6ee17b
>>  TOKENS:1:
>> 
>> the generations are:
>> 
>> - Wed, 23 Aug 2017 09:30:18 GMT
>> - Fri, 25 Aug 2017 04:52:59 GMT
>> 
>> I don't remember what we did at that time, but it looks like we botched
>> something, maybe while joining a node.
>> 
>> After reading https://thelastpickle.com/blog/2018/09/18/assassinate.html I'm 
>> thinking of doing the following:
>> 
>> * nodetool removenode 10.15.53.27
>> * if it doesn't work for some reason: nodetool assassinate 10.15.53.27
>> 
>> Since those nodes have been long dead and don't appear in system.peers, I
>> don't anticipate any problems, but I'd like some confirmation that this
>> can't break my cluster.
>> 
>> Thanks !

gossipinfo contains two nodes dead for more than two years

2019-08-28 Thread Vincent Rischmann
Hi,

while replacing a node in a cluster I saw this log:

 2019-08-27 16:35:31,439 Gossiper.java:995 - InetAddress /10.15.53.27 is now 
DOWN

it caught my attention because that ip address doesn't exist anymore in the 
cluster and it hasn't for a long time.

After some reading I ran `nodetool gossipinfo` and I saw these entries which 
are nodes that don't exist anymore:

 /10.15.53.27
 generation:1503480618
 heartbeat:26970
 STATUS:2:hibernate,true
 LOAD:26810:6.17363354147E11
 SCHEMA:101:d21b1e47-f226-3417-8de7-5802518ae824
 DC:10:DC1
 RACK:12:RAC1
 RELEASE_VERSION:6:2.1.18
 INTERNAL_IP:8:10.15.53.27
 RPC_ADDRESS:5:10.15.53.27
 SEVERITY:26972:0.0
 NET_VERSION:3:8
 HOST_ID:4:2488fccc-108a-4a9d-ad43-5e8b8b6ee17b
 TOKENS:1:
 /10.5.1.16
 generation:1503636779
 heartbeat:324
 STATUS:2:hibernate,true
 LOAD:204:2.601990697532E12
 SCHEMA:14:d21b1e47-f226-3417-8de7-5802518ae824
 DC:10:DC1
 RACK:12:RAC1
 RELEASE_VERSION:6:2.1.18
 INTERNAL_IP:8:10.5.1.16
 RPC_ADDRESS:5:10.5.1.16
 SEVERITY:326:0.0
 NET_VERSION:3:8
 HOST_ID:4:2488fccc-108a-4a9d-ad43-5e8b8b6ee17b
 TOKENS:1:

the generations are:

- Wed, 23 Aug 2017 09:30:18 GMT
- Fri, 25 Aug 2017 04:52:59 GMT

I don't remember what we did at that time, but it looks like we botched
something, maybe while joining a node.

After reading https://thelastpickle.com/blog/2018/09/18/assassinate.html I'm 
thinking of doing the following:

* nodetool removenode 10.15.53.27
* if it doesn't work for some reason: nodetool assassinate 10.15.53.27
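
For what it's worth, that plan in command form might look like this (a sketch; note
that `nodetool removenode` takes the node's host ID, shown in the gossipinfo output
above, while `nodetool assassinate` takes the IP address):

    # remove by host ID first
    nodetool removenode 2488fccc-108a-4a9d-ad43-5e8b8b6ee17b
    # if it stalls, check progress and force completion
    nodetool removenode status
    nodetool removenode force
    # last resort, by IP address
    nodetool assassinate 10.15.53.27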

Since those nodes have been long dead and don't appear in system.peers, I
don't anticipate any problems, but I'd like some confirmation that this
can't break my cluster.

Thanks !

Interrogation about expected performance

2017-09-22 Thread Vincent Rischmann
Hello,

we recently added a new 5-node cluster used only for a single service,
and right now it's not even read from, we're just loading data into it.
Each node is identical: 32 GiB of RAM, a 4-core Xeon E5-1630, 2 SSDs in
RAID 0, Cassandra v3.11.
We have two tables with roughly this schema:

CREATE TABLE by_app(
   app_id text,
   partition int,
   install_id uuid,
   counts map,
   PRIMARY KEY ((app_id, partition), install_id)
);

CREATE TABLE by_install_id(
   app_id text,
   install_id uuid,
   counts map,
   PRIMARY KEY ((install_id))
);

We're processing events, and each event triggers a write to both of
these tables.
Right now, according to my metrics, we can't quite get above 200k
writes/sec (around 100k/s per table). I'm wondering if these numbers
seem reasonable or if they're low.

I'm considering changing the data model to not have a map anymore, but
that would make the selection more complicated, so before doing that I'd
like to have your opinions.
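
For illustration only, one hypothetical map-free shape would move the map key into
the clustering key; the table name, column names and value type below are invented,
since the original map's type parameters were stripped by the archive:

    CREATE TABLE by_app_v2 (        -- hypothetical name
        app_id text,
        partition int,
        install_id uuid,
        count_name text,            -- was the map key
        count_value bigint,         -- was the map value; actual type unknown
        PRIMARY KEY ((app_id, partition), install_id, count_name)
    );

The trade-off is the one mentioned above: reading all counters for an install
becomes a slice over count_name instead of a single map read.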
Thanks.


Re: Regular dropped READ messages

2017-06-06 Thread Vincent Rischmann
Thanks Alexander for the help, lots of good info in there.

I'll try to switch back to CMS and see how it fares.
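
For reference, the settings Alexander suggests below (16GB heap / 6GB new gen /
MaxTenuringThreshold = 5) would translate to roughly this in cassandra-env.sh.
A sketch to adapt rather than a drop-in config; apart from the sizes and the
tenuring threshold these are the stock CMS flags shipped with Cassandra:

    MAX_HEAP_SIZE="16G"
    HEAP_NEWSIZE="6G"
    # CMS/ParNew instead of the current G1 flags
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=5"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"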


On Tue, Jun 6, 2017, at 05:06 PM, Alexander Dejanovski wrote:
> Hi Vincent,
> 
> it is very clear, thanks for all the info.
> 
> I would not stick with G1 in your case, as it requires much more heap
> to perform correctly (>24GB).
> CMS/ParNew should be much more efficient here and I would go with some
> settings I usually apply on big workloads : 16GB heap / 6GB new gen /
> MaxTenuringThreshold = 5
> 
> Large partitions are indeed putting pressure on your heap and
> tombstones as well.
> One of your queries is particularly caveated : SELECT app,
> platform, slug, partition, user_id, attributes, state, timezone,
> version FROM table WHERE app = ? AND platform = ? AND slug = ? AND
> partition = ? LIMIT ?
> Although you're using the LIMIT clause, it will read the whole
> partition, merge it in memory and only then will it apply the LIMIT.
> Check this blog post for more detailed info :
> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
> This can lead you to read the whole 450MB and all the tombstones even
> though you're only targeting a few rows in the partition.
> Large partitions are also creating heap pressure during compactions,
> which will issue warnings in the logs (look for "large partition").
> 
> You should remove the delete/insert logged batch as it will spread
> over multiple partitions, which is bad for many reasons. It gives you
> no real atomicity, but just the guarantee that if one query succeeds,
> then the rest of the queries will eventually succeed (and that could
> possibly take some time, leaving the cluster in an inconsistent state
> in the meantime). Logged batches have a lot of overheads, one of them
> being a write of the queries to the batchlog table, which will be
> replicated to 2 other nodes, and then deleted after the batch has
> completed.
> You'd better turn those into async queries with an external retry
> mechanism.
> 
> Tuning the GC should help coping with your data modeling issues.
> 
> For safety reasons, only change the GC settings for one canary
> node, observe and compare its behavior over a full day. If the
> results are satisfying, generalize to the rest of the cluster. You
> need to experience peak load to make sure the new settings are
> fixing your issues.
> 
> Cheers,
> 
> 
> 
> On Tue, Jun 6, 2017 at 4:22 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> Hi Alexander.
>> 
>> Yeah, the minor GCs I see are usually around 300ms but sometimes
>> jumping to 1s or even more.
>> 
>> Hardware specs are:
>>   - 8 core CPUs
>>   - 32 GB of RAM
>>   - 4 SSDs in hardware RAID 0, around 3TB of space per node
>> 
>> GC settings: -Xmx12G -Xms12G -XX:+UseG1GC
>> -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200
>> -XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=8
>> -XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled
>> 
>> According to the graphs, there is approximately one Young GC every
>> 10s or so, and almost no Full GCs (for example the last one was 2h45
>> after the previous one).
>> 
>> Computed from the log files, the average Young GC seems to be around
>> 280ms and the max is 2.5s. The average Full GC seems to be around
>> 4.6s and the max is 5.3s.
>> I only computed this on one node but the problem occurs on every node
>> as far as I can see.
>> 
>> I'm open to tuning the GC, I stuck with the defaults (that I think I
>> saw in the cassandra conf, I'm not sure).
>> 
>> The number of SSTables looks ok, p75 is at 4 (as is the max for that
>> matter). Partition size is a problem, yeah: this particular table
>> from which we read a lot has a max partition size of 450 MB. I've
>> known about this problem for a long time actually, we already did a
>> bunch of work reducing partition sizes, I think a year ago, but this
>> particular table is tricky to change.
>> 
>> One thing to note about this table is that we do a ton of DELETEs
>> regularly (that we can't really stop doing without completely
>> redesigning the table), so we have a ton of tombstones too. We have a
>> lot of warnings about the tombstone threshold when we do our selects
>> (things like "Read 2001 live and 2528 tombstone cells"). I suppose
>> this could be a factor?
>> 
>> Each query reads from a single partition key, yes, but as said we
>> issue a lot of them at the same time.
>> 
>> The table looks like this (simplified):
>> 
>

Re: Regular dropped READ messages

2017-06-06 Thread Vincent Rischmann
Hi Alexander.

Yeah, the minor GCs I see are usually around 300ms but sometimes jumping
to 1s or even more.

Hardware specs are:
  - 8 core CPUs
  - 32 GB of RAM
  - 4 SSDs in hardware RAID 0, around 3TB of space per node

GC settings: -Xmx12G -Xms12G -XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=8
-XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled

According to the graphs, there is approximately one Young GC every 10s
or so, and almost no Full GCs (for example the last one was 2h45 after
the previous one).

Computed from the log files, the average Young GC seems to be around
280ms and the max is 2.5s. The average Full GC seems to be around 4.6s
and the max is 5.3s.
I only computed this on one node but the problem occurs on every node as
far as I can see.

I'm open to tuning the GC, I stuck with the defaults (that I think I saw
in the cassandra conf, I'm not sure).

The number of SSTables looks ok, p75 is at 4 (as is the max for that
matter). Partition size is a problem, yeah: this particular table from
which we read a lot has a max partition size of 450 MB. I've known about
this problem for a long time actually, we already did a bunch of work
reducing partition sizes, I think a year ago, but this particular table
is tricky to change.

One thing to note about this table is that we do a ton of DELETEs
regularly (that we can't really stop doing without completely redesigning
the table), so we have a ton of tombstones too. We have a lot of
warnings about the tombstone threshold when we do our selects (things
like "Read 2001 live and 2528 tombstone cells"). I suppose this could be
a factor?

Each query reads from a single partition key, yes, but as said we issue a
lot of them at the same time.
The table looks like this (simplified):

CREATE TABLE table (
    app text,
    platform text,
    slug text,
    partition int,
    user_id text,
    attributes blob,
    state int,
    timezone text,
    version int,
    PRIMARY KEY ((app, platform, slug, partition), user_id)
) WITH CLUSTERING ORDER BY (user_id ASC)

And the main queries are:

SELECT app, platform, slug, partition, user_id, attributes, state,
timezone, version
FROM table
WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?

SELECT app, platform, slug, partition, user_id, attributes, state,
timezone, version
FROM table
WHERE app = ? AND platform = ? AND slug = ? AND partition = ?
AND user_id >= ? LIMIT ?
partition is basically an integer that goes from 0 to 15, and we always
select the 16 partitions in parallel.

Note that we write constantly to this table, to update some fields and to
insert the user into the new "slug" (a slug is an amalgamation of
different parameters like state, timezone etc. that allows us to
efficiently query all users from a particular "app" with a given "slug".
At least that's the idea; as seen here it causes us some trouble).

And yes, we do use batches to write this data. This is how we process
each user update:
  - SELECT from a "master" slug to get the fields we need
  - from that, compute a list of slugs the user had and a list of slugs
    the user should have (for example if he changes timezone we have to
    update the slug)
  - delete the user from the slug he shouldn't be in and insert the user
    where he should be.
The last part, the delete/insert, is done in a logged batch.
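
For reference, that last step is essentially a multi-partition logged batch along
these lines (a sketch only, using the simplified schema above):

    BEGIN BATCH
        -- remove the user from the slug he should no longer be in
        DELETE FROM table
        WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id = ?;
        -- insert him into the slug he should now be in
        INSERT INTO table (app, platform, slug, partition, user_id, attributes,
                           state, timezone, version)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);
    APPLY BATCH;

Because the two statements target different (app, platform, slug, partition) keys,
the batch spans two partitions, which is the overhead discussed elsewhere in this
thread.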

I hope it's relatively clear.

On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
> Hi Vincent,
> 
> dropped messages are indeed common in case of long GC pauses.
> Having 4s to 6s pauses is not normal and is the sign of an unhealthy
> cluster. Minor GCs are usually faster but you can have long ones too.
> 
> If you can share your hardware specs along with your current GC
> settings (CMS or G1, heap size, young gen size) and a distribution of
> GC pauses (rate of minor GCs, average and max duration of GCs) we
> could try to help you tune your heap settings.
> You can activate full GC logging which could help in fine tuning
> MaxTenuringThreshold and survivor space sizing.
> 
> You should also check for max partition sizes and number of SSTables
> accessed per read. Run nodetool cfstats/cfhistograms on your tables to
> get both. p75 should be less or equal to 4 in number of SSTables and
> you shouldn't have partitions over... let's say 300 MBs. Partitions >
> 1GB are a critical problem to address.
> 
> Other things to consider are :
> Do you read from a single partition for each query ?
> Do you use collections that could spread over many SSTables ?
> Do you use batches for writes (although your problem doesn't seem to
> be write related) ?
> Can you share the queries from your scheduled selects and the
> data model ?
> 
> Cheers,
> 
> 
> On Tue, Jun 6, 2017 at 2:33 PM Vincent Risch

Regular dropped READ messages

2017-06-06 Thread Vincent Rischmann
Hi,

we have a cluster of 11 nodes running Cassandra 2.2.9 where we regularly
get READ messages dropped:
> READ messages were dropped in last 5000 ms: 974 for internal timeout
> and 0 for cross node timeout
Looking at the logs, some are logged at the same time as Old Gen GCs.
These GCs all take around 4 to 6s to run. To me, it's "normal" that
these could cause reads to be dropped. However, we also have reads
dropped without Old Gen GCs occurring, only Young Gen.
I'm wondering if anyone has a good way of determining what the _root_
cause could be. Up until now, the only way we managed to decrease load
on our cluster was by guessing some stuff, trying it out and being
lucky, essentially. I'd love a way to make sure what the problem is
before tackling it. Doing schema changes is not a problem, but changing
stuff blindly is not super efficient :)
What I do see in the logs is that these happen almost exclusively when
we do a lot of SELECTs. The times logged almost always correspond to
times when our scheduled SELECTs are happening. That narrows the scope
a little, but still.

Anyway, I'd appreciate any information about troubleshooting this
scenario. Thanks.


New command line client for cassandra-reaper

2017-03-03 Thread Vincent Rischmann
Hi,



I'm using cassandra-reaper
(https://github.com/thelastpickle/cassandra-reaper) to manage repairs of
my Cassandra clusters, probably like a bunch of other people.


When I started using it (it was still the version from the spotify
repository) the UI didn't work well, and the Python cli client was slow
to use because we had to use Docker to run it.


It was frustrating for me so over a couple of weeks I wrote
https://github.com/vrischmann/happyreaper which is another CLI client.


It doesn't do much more than spreaper (there are some client-side
filters that spreaper doesn't have I think), the main benefit is that
it's a self-contained binary without needing a Python environment.


I'm announcing it here in case it's of interest to anyone. If anyone has
feedback feel free to share.


Vincent.

 


Re: Which compaction strategy when modeling a dumb set

2017-02-27 Thread Vincent Rischmann
No I don't store events in Cassandra.



The real thing I'm doing is counting stuff: each event has a type, a user
associated with it, and some other metadata. When I process an event I
need to increment those counters only if the event hasn't already been
processed. Our input event stream is Kafka and it's not uncommon that we
get the same event twice, due to our client apps not being reliable.


Right now I haven't found a good solution to this that doesn't involve a
read before write, but I'd love to hear your suggestions




On Mon, Feb 27, 2017, at 12:01 PM, Vladimir Yudovin wrote:

> Do you also store events in Cassandra? If yes, why not to add
> "processed" flag to existing table(s), and fetch non-processed events
> with single SELECT?
> 

> Best regards, Vladimir Yudovin, 

> *Winguzone[1] - Cloud Cassandra Hosting*

> 

> 

>  On Fri, 24 Feb 2017 06:24:09 -0500 *Vincent Rischmann
> <m...@vrischmann.me>* wrote 
> 

>> Hello,

>> 

>> I'm using a table like this:

>> 

>>CREATE TABLE myset (id uuid PRIMARY KEY)

>> 

>> which is basically a set I use for deduplication, id is a unique id
>> for an event, when I process the event I insert the id, and before
>> processing I check if it has already been processed for
>> deduplication.
>> 

>> It works well enough, but I'm wondering which compaction strategy I
>> should use. I expect maybe 1% or less of events will end up
>> duplicated (thus not generating an insert), so the workload will
>> probably be 50% writes 50% read.
>> 

>> Is LCS a good strategy here or should I stick with STCS ?

> 




Links:

  1. https://winguzone.com?from=list


Which compaction strategy when modeling a dumb set

2017-02-24 Thread Vincent Rischmann
Hello,



I'm using a table like this:



   CREATE TABLE myset (id uuid PRIMARY KEY)



which is basically a set I use for deduplication: id is a unique id for
an event. When I process the event I insert the id, and before
processing I check if it has already been processed.
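
In other words, the dedup check is a read-before-write along these lines (a sketch;
the bind value is the event id):

    SELECT id FROM myset WHERE id = ?;   -- a returned row means the event was already processed
    INSERT INTO myset (id) VALUES (?);   -- otherwise record it and process the event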


It works well enough, but I'm wondering which compaction strategy I
should use. I expect maybe 1% or less of events will end up duplicated
(thus not generating an insert), so the workload will probably be 50%
writes 50% read.


Is LCS a good strategy here or should I stick with STCS ?


Re: One thread pool per repair in nodetool tpstats

2017-02-21 Thread Vincent Rischmann
Ok, thanks Matija.





On Tue, Feb 21, 2017, at 11:43 AM, Matija Gobec wrote:

> They appear for each repair run and disappear when repair run
> finishes.
> 

> On Tue, Feb 21, 2017 at 11:14 AM, Vincent Rischmann
> <m...@vrischmann.me> wrote:

>> Hi,

>> 

>> I upgraded to Cassandra 2.2.8 and noticed something weird in nodetool
>> tpstats:
>> 

>> Pool Name                 Active   Pending   Completed   Blocked   All time blocked
>> MutationStage                  0         0   116265693         0                  0
>> ReadStage                      1         0    56132474         0                  0
>> RequestResponseStage           0         0   163640931         0                  0
>> ReadRepairStage                0         0     3152856         0                  0
>> CounterMutationStage           0         0      630690         0                  0
>> Repair#26                      1         4           1         0                  0
>> Repair#48                      1         2           3         0                  0
>> HintedHandoff                  1         1        1198         0                  0
>> MiscStage                      0         0           0         0                  0
>> CompactionExecutor             0         0      111438         0                  0
>> Repair#45                      1         4           1         0                  0
>> MemtableReclaimMemory          0         0        3399         0                  0
>> Repair#30                      1         4           1         0                  0
>> PendingRangeCalculator         0         0          37         0                  0
>> Repair#61                      1         4           1         0
>> 

>> There are multiple "pools" named Repair# which weren't there with
>> Cassandra 2.1.16. These appear in the JMX metrics too.
>> 

>> Do they go away eventually? Because this is making the tpstats
>> output harder to read, in my opinion.


One thread pool per repair in nodetool tpstats

2017-02-21 Thread Vincent Rischmann
Hi,



I upgraded to Cassandra 2.2.8 and noticed something weird in
nodetool tpstats:


Pool Name                 Active   Pending   Completed   Blocked   All time blocked
MutationStage                  0         0   116265693         0                  0
ReadStage                      1         0    56132474         0                  0
RequestResponseStage           0         0   163640931         0                  0
ReadRepairStage                0         0     3152856         0                  0
CounterMutationStage           0         0      630690         0                  0
Repair#26                      1         4           1         0                  0
Repair#48                      1         2           3         0                  0
HintedHandoff                  1         1        1198         0                  0
MiscStage                      0         0           0         0                  0
CompactionExecutor             0         0      111438         0                  0
Repair#45                      1         4           1         0                  0
MemtableReclaimMemory          0         0        3399         0                  0
Repair#30                      1         4           1         0                  0
PendingRangeCalculator         0         0          37         0                  0
Repair#61                      1         4           1         0


There are multiple "pools" named Repair# which weren't there with
Cassandra 2.1.16. These appear in the JMX metrics too.


Do they go away eventually? Because this is making the tpstats output
harder to read, in my opinion.


Re: Out of memory and/or OOM kill on a cluster

2016-11-22 Thread Vincent Rischmann
Thanks for the detailed answer Alexander.



We'll look into your suggestions, it's definitely helpful. We have plans
to reduce tombstones and remove the table with the big partitions,
hopefully after we've done that the cluster will be stable again.
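
For reference, the aggressive tombstone pruning suggested below comes down to
compaction sub-properties on the affected table. A sketch only: the table name is
taken from the tombstone warnings quoted elsewhere in this thread, and the values
shown are the defaults, which would need real tuning:

    ALTER TABLE foo.install_info WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'unchecked_tombstone_compaction': 'true',
        'tombstone_threshold': '0.2',
        'tombstone_compaction_interval': '86400'
    };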


Thanks again.





On Tue, Nov 22, 2016, at 09:03 AM, Alexander Dejanovski wrote:

> Hi Vincent, 

> 

> Here are a few pointers for disabling swap : 

> - 
> https://docs.datastax.com/en/cassandra/2.0/cassandra/install/installRecommendSettings.html
> - 
> http://stackoverflow.com/questions/22988824/why-swap-needs-to-be-turned-off-in-datastax-cassandra
> 

> Tombstones are definitely the kind of object that can clutter your
> heap, lead to frequent GC pauses and could be part of why you run into
> OOM from time to time. I cannot answer for sure though as it is a bit
> more complex than that actually.
> You do not have crazy high GC pauses, although a 5s pause should not
> happen on a healthy cluster.
> 

> Getting back to big partitions, I've had the case in production where
> a multi GB partition was filling a 26GB G1 heap when being compacted.
> Eventually, the old gen took all the available space in the heap,
> leaving no room for the young gen, but it actually never OOMed. To be
> honest, I would have preferred an OOM to the inefficient 50s GC pauses
> we've had because such a slow node can (and did) affect the whole
> cluster.
> 

> I think you may have a combination of things happening here and you
> should work on improving them all :
> - spot precisely which are your big partitions to understand why you
>   have some (data modeling issue or data source bad behavior) : look
>   for "large partition" warnings in the cassandra logs, it will give
>   you the partition key
> - try to reduce the number of tombstones you're reading by changing
>   your queries or data model, or maybe by setting up an aggressive
>   tombstone pruning strategy :
>   
> http://cassandra.apache.org/doc/latest/operating/compaction.html?highlight=unchecked_tombstone_compaction#common-options
> You could benefit from setting unchecked_tombstone_compaction to true
> and tuning both tombstone_threshold and tombstone_compaction_interval
> - Follow recommended production settings and fully disable swap from
>   your Cassandra nodes
> 

> You might want to scale down from the 20GB heap as the OOM Killer will
> stop your process either way, and it might allow you to have an
> analyzable heap dump. Such a heap dump could tell us if there are lots
> of tombstones there when the JVM dies.
> 

> I hope that's helpful as there is no easy answer here, and the problem
> should be narrowed down by fixing all potential causes.
> 

> Cheers,

> 

> 

> 

> 

> On Mon, Nov 21, 2016 at 5:10 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:

>> Thanks for your answer Alexander.

>> 

>> We're writing constantly to the table, we estimate it's something
>> like 1.5k to 2k writes per second. Some of these requests update a
>> bunch of fields, some update fields + append something to a set.
>> We don't read constantly from it but when we do it's a lot of read,
>> up to 20k reads per second sometimes.
>> For this particular keyspace everything is using the size tiered
>> compaction strategy.
>> 

>>  - Every node is a physical server, has a 8-Core CPU, 32GB of ram and
>>3TB of SSD.
>>  - Java version is 1.8.0_101 for all nodes except one which is using
>>1.8.0_111 (only for about a week I think, before that it used
>>1.8.0_101 too).
>>  - We're using the G1 GC. I looked at the 19th and on that day we
>>had:
>>   - 1505 GCs

>>   - 2 Old Gen GCs which took around 5s each

>>   - the rest are New Gen GCs, with only 1 other 1s. There's 15 to 20
>> GCs which took between 0.4 and 0.7s. The rest is between 250ms
>> and 400ms approximately.
>> Sometimes, there are 3/4/5 GCs in a row in like 2 seconds, each
>> taking between 250ms to 400ms, but it's kinda rare from what I
>> can see.
>>  - So regarding GC logs, I have them enabled, I've got a bunch of
>>gc.log.X files in /var/log/cassandra, but somehow I can't find any
>>log files for certain periods. On one node which crashed this
>>morning I lost like a week of GC logs, no idea what is happening
>>there...
>>  - I'll just put a couple of warnings here, there are around 9k just
>>for today.
>> 

>> WARN  [SharedPool-Worker-8] 2016-11-21 17:02:00,497
>> SliceQueryFilter.java:320 - Read 2001 live and 11129 tombstone cells
>> in foo.install_info for key: foo@IOS:7 (see
>> tombstone_warn_threshold). 2000 column

Re: Out of memory and/or OOM kill on a cluster

2016-11-21 Thread Vincent Rischmann
Thanks for your answer Alexander.



We're writing constantly to the table, we estimate it's something like
1.5k to 2k writes per second. Some of these requests update a bunch of
fields, some update fields + append something to a set.
We don't read constantly from it, but when we do it's a lot of reads, up
to 20k reads per second sometimes.
For this particular keyspace everything is using the size tiered
compaction strategy.


 - Every node is a physical server with an 8-core CPU, 32GB of RAM and
   3TB of SSD.
 - Java version is 1.8.0_101 for all nodes except one which is using
   1.8.0_111 (only for about a week I think, before that it used
   1.8.0_101 too).
 - We're using the G1 GC. I looked at the 19th and on that day we had:

  - 1505 GCs

  - 2 Old Gen GCs which took around 5s each

  - the rest are New Gen GCs, with only 1 over 1s. There are 15 to 20 GCs
which took between 0.4 and 0.7s. The rest is between 250ms and 400ms
approximately.
Sometimes, there are 3/4/5 GCs in a row in like 2 seconds, each taking
between 250ms to 400ms, but it's kinda rare from what I can see.
 - So regarding GC logs, I have them enabled, I've got a bunch of
   gc.log.X files in /var/log/cassandra, but somehow I can't find any
   log files for certain periods. On one node which crashed this morning
   I lost like a week of GC logs, no idea what is happening there...
 - I'll just put a couple of warnings here, there are around 9k just
   for today.


WARN  [SharedPool-Worker-8] 2016-11-21 17:02:00,497
SliceQueryFilter.java:320 - Read 2001 live and 11129 tombstone cells in
foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000
columns were requested, slices=[-]
WARN  [SharedPool-Worker-1] 2016-11-21 17:02:02,559
SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in
foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000
columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-
]
WARN  [SharedPool-Worker-2] 2016-11-21 17:02:05,286
SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in
foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000
columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-
]
WARN  [SharedPool-Worker-11] 2016-11-21 17:02:08,860
SliceQueryFilter.java:320 - Read 2001 live and 19966 tombstone cells in
foo.install_info for key: foo@IOS:10 (see tombstone_warn_threshold).
2000 columns were requested, slices=[-]


So, we're guessing this is bad since it's warning us, however does this
have a significant impact on the heap / GC ? I don't really know.


- cfstats tells me this:



Average live cells per slice (last five minutes): 1458.029594846951

Maximum live cells per slice (last five minutes): 2001.0

Average tombstones per slice (last five minutes): 1108.2466913854232

Maximum tombstones per slice (last five minutes): 22602.0



- regarding swap, it's not disabled anywhere, I must say we never really
  thought about it. Does it provide a significant benefit ?


Thanks for your help, really appreciated !



On Mon, Nov 21, 2016, at 04:13 PM, Alexander Dejanovski wrote:

> Vincent,

> 

> only the 2.68GB partition is out of bounds here, all the others
> (<256MB) shouldn't be much of a problem.
> It could put pressure on your heap if it is often read and/or
> compacted.
> But to answer your question about the 1% harming the cluster, a few
> big partitions can definitely be a big problem depending on your
> access patterns.
> Which compaction strategy are you using on this table ?

> 

> Could you provide/check the following things on a node that crashed
> recently :
>  * Hardware specifications (how many cores ? how much RAM ? Bare metal
>or VMs ?)
>  * Java version
>  * GC pauses throughout a day (grep GCInspector
>/var/log/cassandra/system.log) : check if you have many pauses that
>take more than 1 second
>  * GC logs at the time of a crash (if you don't produce any, you
>should activate them in cassandra-env.sh)
>  * Tombstone warnings in the logs and high number of tombstone read in
>cfstats
>  * Make sure swap is disabled
> 

> Cheers,

> 

> 

> On Mon, Nov 21, 2016 at 2:57 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:

>> @Vladimir

>> 

>> We tried with 12Gb and 16Gb, the problem appeared eventually too.

>> In this particular cluster we have 143 tables across 2 keyspaces.

>> 

>> @Alexander

>> 

>> We have one table with a max partition of 2.68GB, one of 256 MB, a
>> bunch with the size varying between 10MB to 100MB ~. Then there's the
>> rest with the max lower than 10MB.
>> 

>> On the biggest, the 99% is around 60MB, 98% around 25MB, 95%
>> around 5.5MB.
>> On the one with max of 256MB, the 99% is around 4.6MB, 98%
>> around 2MB.
>> 

>> Cou

Re: Out of memory and/or OOM kill on a cluster

2016-11-21 Thread Vincent Rischmann
@Vladimir



We tried with 12Gb and 16Gb, the problem appeared eventually too.

In this particular cluster we have 143 tables across 2 keyspaces.



@Alexander



We have one table with a max partition of 2.68GB, one of 256 MB, a bunch
with the size varying between 10MB to 100MB ~. Then there's the rest
with the max lower than 10MB.


On the biggest, the 99% is around 60MB, 98% around 25MB, 95%
around 5.5MB.
On the one with max of 256MB, the 99% is around 4.6MB, 98% around 2MB.



Could the 1% here really have that much impact ? We do write a lot to
the biggest table and read quite often too, however I have no way to
know if that big partition is ever read.
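
One cheap signal (it won't show reads, but it does show when compaction rewrites
that partition and prints its key) is to grep for the large-partition warning
mentioned elsewhere in this thread — a sketch, the exact wording and log path can
vary between versions:

    grep -i "large partition" /var/log/cassandra/system.log*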




On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:

> Hi Vincent,

> 

> one of the usual causes of OOMs is very large partitions.

> Could you check your nodetool cfstats output in search of large
> partitions ? If you find one (or more), run nodetool cfhistograms on
> those tables to get a view of the partition sizes distribution.
> 

> Thanks

> 

> On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin
> <vla...@winguzone.com> wrote:

>> Did you try any value in the range 8-20 (e.g. 60-70% of physical
>> memory).
>> Also how many tables do you have across all keyspaces? Each table can
>> consume minimum 1M of Java heap.
>> 

>> Best regards, Vladimir Yudovin, 

>> *Winguzone[1] - Hosted Cloud Cassandra Launch your cluster in
>> minutes.*
>> 

>> 

>>  On Mon, 21 Nov 2016 05:13:12 -0500*Vincent Rischmann
>> <m...@vrischmann.me>* wrote 
>> 

>>> Hello,

>>> 

>>> we have a 8 node Cassandra 2.1.15 cluster at work which is giving us
>>> a lot of trouble lately.
>>> 

>>> The problem is simple: nodes regularly die because of an out of
>>> memory exception or the Linux OOM killer decides to kill the
>>> process.
>>> For a couple of weeks now we increased the heap to 20Gb hoping it
>>> would solve the out of memory errors, but in fact it didn't; instead
>>> of getting out of memory exception the OOM killer killed the JVM.
>>> 

>>> We reduced the heap on some nodes to 8Gb to see if it would work
>>> better, but some nodes crashed again with out of memory exception.
>>> 

>>> I suspect some of our tables are badly modelled, which would cause
>>> Cassandra to allocate a lot of data, however I don't how to prove
>>> that and/or find which table is bad, and which query is responsible.
>>> 

>>> I tried looking at metrics in JMX, and tried profiling using mission
>>> control but it didn't really help; it's possible I missed it because
>>> I have no idea what to look for exactly.
>>> 

>>> Anyone have some advice for troubleshooting this ?

>>> 

>>> Thanks.

> -- 

> -

> Alexander Dejanovski

> France

> @alexanderdeja

> 

> Consultant

> Apache Cassandra Consulting

> http://www.thelastpickle.com[2]




Links:

  1. https://winguzone.com?from=list
  2. http://www.thelastpickle.com/


Out of memory and/or OOM kill on a cluster

2016-11-21 Thread Vincent Rischmann
Hello,



we have an 8-node Cassandra 2.1.15 cluster at work which is giving us a
lot of trouble lately.


The problem is simple: nodes regularly die because of an out of memory
exception or the Linux OOM killer decides to kill the process.
A couple of weeks ago we increased the heap to 20GB hoping it would
solve the out of memory errors, but in fact it didn't; instead of
getting an out of memory exception, the OOM killer killed the JVM.


We reduced the heap on some nodes to 8Gb to see if it would work better,
but some nodes crashed again with out of memory exception.


 I suspect some of our tables are badly modelled, which would cause
 Cassandra to allocate a lot of data, however I don't know how to prove
 that and/or find which table is bad, and which query is responsible.


I tried looking at metrics in JMX, and tried profiling using mission
control but it didn't really help; it's possible I missed it because I
have no idea what to look for exactly.
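
One thing that makes this kind of investigation easier (and ties into the later
suggestion in this thread about getting an analyzable heap dump) is making sure
the JVM writes a heap dump when it dies with an OutOfMemoryError. A sketch of the
standard HotSpot flags; cassandra-env.sh usually sets the first one already, and
the path is only an example:

    JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
    JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/var/lib/cassandra/java_heapdump.hprof"

Note this only covers the Java-level out of memory case; when the Linux OOM killer
kills the process there is no dump to analyze.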


Anyone have some advice for troubleshooting this ?



Thanks.


Re: Tools to manage repairs

2016-10-28 Thread Vincent Rischmann
Well I only asked that because I wanted to make sure that we're not
doing it wrong, because that's actually how we query stuff: we always
provide a cluster key or a range of cluster keys.

But yes, I understand that compactions may suffer and/or there may be
hidden bottlenecks because of big partitions, so it's definitely good to
know, and I'll definitely work on reducing partition sizes.

On Fri, Oct 28, 2016, at 06:32 PM, Edward Capriolo wrote:
>
>
> On Fri, Oct 28, 2016 at 11:21 AM, Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> Doesn't paging help with this ? Also if we select a range via the
>> cluster key we're never really selecting the full partition. Or is
>> that wrong ?
>>
>>
>> On Fri, Oct 28, 2016, at 05:00 PM, Edward Capriolo wrote:
>>> Big partitions are an anti-pattern here is why:
>>>
>>> First Cassandra is not an analytic datastore. Sure it has some UDFs
>>> and aggregate UDFs, but the true purpose of the data store is to
>>> satisfy point reads. Operations have strict timeouts:
>>>
>>> # How long the coordinator should wait for read operations to
>>> # complete
>>> read_request_timeout_in_ms: 5000
>>>
>>> # How long the coordinator should wait for seq or index scans to
>>> # complete
>>> range_request_timeout_in_ms: 10000
>>>
>>> This means you need to be able to satisfy the operation in 5
>>> seconds. Which is not only the "think time" for 1 server, but if you
>>> are doing a quorum the operation has to complete and compare on 2 or
>>> more servers. Beyond these cutoffs are thread pools which fill up
>>> and start dropping requests once full.
>>>
>>> Something has to give, either functionality or physics. Particularly
>>> the physics of aggregating an ever-growing data set across N
>>> replicas in less than 5 seconds.  How many 2ms point reads will be
>>> blocked by 50 ms queries etc.
>>>
>>> I do not see the technical limitations of big partitions on disk is
>>> the only hurdle to climb here.
>>>
>>>
>>> On Fri, Oct 28, 2016 at 10:39 AM, Alexander Dejanovski
>>> <a...@thelastpickle.com> wrote:
>>>> Hi Eric,
>>>>
>>>> that would be https://issues.apache.org/jira/browse/CASSANDRA-9754
>>>> by Michael Kjellman and
>>>> https://issues.apache.org/jira/browse/CASSANDRA-11206 by Robert
>>>> Stupp.
>>>> If you haven't seen it yet, Robert's summit talk on big partitions
>>>> is totally worth it :
>>>> Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
>>>> Slides :
>>>> http://www.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> On Fri, Oct 28, 2016 at 4:09 PM Eric Evans
>>>> <john.eric.ev...@gmail.com> wrote:
>>>>> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
>>>>> <a...@thelastpickle.com> wrote:
>>>>> > A few patches are pushing the limits of partition sizes so we
>>>>> > may soon be
>>>>> > more comfortable with big partitions.
>>>>>
>>>>> You don't happen to have Jira links to these handy, do you?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>  Eric Evans john.eric.ev...@gmail.com
>>>>>
>>>>
>>>>
>>>> --
>>>> -
>>>> Alexander Dejanovski
>>>> France
>>>> @alexanderdeja
>>>>
>>>> Consultant
>>>> Apache Cassandra Consulting
>>>> http://www.thelastpickle.com[1]
>>>>
>>>>
>>
>
> "Doesn't paging help with this ? Also if we select a range via the
> cluster key we're never really selecting the full partition. Or is
> that wrong ?"
>
> What I am suggesting is that the data store has had this practical
> limitation on size of partition since inception. As a result the
> common use case is not to use it in such a way. For example, the
> compaction manager may not be optimized for this cases, queries
> running across large partitions may cause more contention or lots of
> young gen garbage , queries running across large partitions may occupy
> the slots of the read stage etc.
>
>
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201602.mbox/%3CCAJjpQyTS2eaCcRBVa=zmm-hcbx5nf4ovc1enw+sffgwvngo...@mail.gmail.com%3E
>
> I think there is possibly some more "little details" to discover. Not
> in a bad thing. I just do not think it you can hand-waive like a
> specific thing someone is working on now or paging solves it. If it
> was that easy it would be solved by now :)
>


Links:

  1. http://www.thelastpickle.com/


Re: Tools to manage repairs

2016-10-28 Thread Vincent Rischmann
Doesn't paging help with this ? Also if we select a range via the
cluster key we're never really selecting the full partition. Or is
that wrong ?


On Fri, Oct 28, 2016, at 05:00 PM, Edward Capriolo wrote:
> Big partitions are an anti-pattern here is why:
>
> First Cassandra is not an analytic datastore. Sure it has some UDFs
> and aggregate UDFs, but the true purpose of the data store is to
> satisfy point reads. Operations have strict timeouts:
>
> # How long the coordinator should wait for read operations to complete
> read_request_timeout_in_ms: 5000
>
> # How long the coordinator should wait for seq or index scans to
> # complete
> range_request_timeout_in_ms: 10000
>
> This means you need to be able to satisfy the operation in 5 seconds.
> Which is not only the "think time" for 1 server, but if you are doing
> a quorum the operation has to complete and compare on 2 or more
> servers. Beyond these cutoffs are thread pools which fill up and start
> dropping requests once full.
>
> Something has to give, either functionality or physics. Particularly
> the physics of aggregating an ever-growing data set across N replicas
> in less than 5 seconds.  How many 2ms point reads will be blocked by
> 50 ms queries etc.
>
> I do not see the technical limitations of big partitions on disk is
> the only hurdle to climb here.
>
>
> On Fri, Oct 28, 2016 at 10:39 AM, Alexander Dejanovski
>  wrote:
>> Hi Eric,
>>
>> that would be
>> https://issues.apache.org/jira/browse/CASSANDRA-9754 by
>> Michael Kjellman and
>> https://issues.apache.org/jira/browse/CASSANDRA-11206 by
>> Robert Stupp.
>> If you haven't seen it yet, Robert's summit talk on big partitions is
>> totally worth it :
>> Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
>> Slides :
>> http://www.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016
>>
>> Cheers,
>>
>>
>> On Fri, Oct 28, 2016 at 4:09 PM Eric Evans
>>  wrote:
>>> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
>>>   wrote:
>>>  > A few patches are pushing the limits of partition sizes so we may
>>>  > soon be
>>>  > more comfortable with big partitions.
>>>
>>>  You don't happen to have Jira links to these handy, do you?
>>>
>>>
>>>  --
>>>  Eric Evans john.eric.ev...@gmail.com
>>
>> --
>> -
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com[1]
>>


Links:

  1. http://www.thelastpickle.com/


Re: Tools to manage repairs

2016-10-27 Thread Vincent Rischmann
Yeah that particular table is badly designed, I intend to fix it, when
the roadmap allows us to do it :)
What is the recommended maximum partition size ?

Thanks for all the information.


On Thu, Oct 27, 2016, at 08:14 PM, Alexander Dejanovski wrote:
> 3.3GB is already too high, and it's surely not good for well-performing
>   compactions. Still, I know changing a data model is no easy thing to
>   do, but you should try to do something here.
> Anticompaction is a special type of compaction and if an sstable is
> being anticompacted, then any attempt to run validation compaction on
> it will fail, telling you that you cannot have an sstable being part
> of 2 repair sessions at the same time, so incremental repair must be
> run one node at a time, waiting for anticompactions to end before
> moving from one node to the other.
> Be mindful of running incremental repair on a regular basis once you
> started as you'll have two separate pools of sstables (repaired and
> unrepaired) that won't get compacted together, which could be a
> problem if you want tombstones to be purged efficiently.
> Cheers,
>
> Le jeu. 27 oct. 2016 17:57, Vincent Rischmann <m...@vrischmann.me>
> a écrit :
>> Ok, I think we'll give incremental repairs a try on a limited number
>> of CFs first and then if it goes well we'll progressively switch more
>> CFs to incremental.
>>
>> I'm not sure I understand the problem with anticompaction and
>> validation running concurrently. As far as I can tell, right now when
>> a CF is repaired (either via reaper, or via nodetool) there may be
>> compactions running at the same time. In fact, it happens very often.
>> Is it a problem ?
>>
>> As far as big partitions, the biggest one we have is around 3.3Gb.
>> Some less big partitions are around 500Mb and less.
>>
>>
>> On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:
>>> Oh right, that's what they advise :)
>>> I'd say that you should skip the full repair phase in the migration
>>> procedure as that will obviously fail, and just mark all sstables as
>>> repaired (skip 1, 2 and 6).
>>> Anyway you can't do better, so take a leap of faith there.
>>>
>>> Intensity is already very low and 1 segments is a whole lot for
>>> 9 nodes, you should not need that many.
>>>
>>> You can definitely pick which CF you'll run incremental repair on,
>>> and still run full repair on the rest.
>>> If you pick our Reaper fork, watch out for schema changes that add
>>> incremental repair fields, and I do not advise to run incremental
>>> repair without it, otherwise you might have issues with
>>> anticompaction and validation compactions running concurrently from
>>> time to time.
>>>
>>> One last thing : can you check if you have particularly big
>>> partitions in the CFs that fail to get repaired ? You can run
>>> nodetool cfhistograms to check that.
>>>
>>> Cheers,
>>>
>>>
>>>
>>> On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <m...@vrischmann.me>
>>> wrote:
>>>> Thanks for the response.
>>>>
>>>> We do break up repairs between tables, we also tried our best to
>>>> have no overlap between repair runs. Each repair has 1 segments
>>>> (purely arbitrary number, seemed to help at the time). Some runs
>>>> have an intensity of 0.4, some have as low as 0.05.
>>>>
>>>> Still, sometimes one particular app (which does a lot of
>>>> read/modify/write batches in quorum) gets slowed down to the point
>>>> we have to stop the repair run.
>>>>
>>>> But more annoyingly, since 2 to 3 weeks as I said, it looks like
>>>> runs don't progress after some time. Every time I restart reaper,
>>>> it starts to repair correctly again, up until it gets stuck. I have
>>>> no idea why that happens now, but it means I have to baby sit
>>>> reaper, and it's becoming annoying.
>>>>
>>>> Thanks for the suggestion about incremental repairs. It would
>>>> probably be a good thing but it's a little challenging to setup I
>>>> think. Right now running a full repair of all keyspaces (via
>>>> nodetool repair) is going to take a lot of time, probably like 5
>>>> days or more. We were never able to run one to completion. I'm not
>>>> sure it's a good idea to disable autocompaction for that long.
>>>>
>>>> But maybe I'm wrong. Is it possible to use incremental repair

Re: Tools to manage repairs

2016-10-27 Thread Vincent Rischmann
Ok, I think we'll give incremental repairs a try on a limited number of
CFs first and then if it goes well we'll progressively switch more CFs
to incremental.
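
For reference, the "mark all sstables as repaired" step discussed below is done with
the sstablerepairedset tool while the node is stopped. A rough sketch only — the
data path and keyspace are placeholders, and the exact procedure is in the
migration doc linked later in this thread:

    # with cassandra stopped on the node
    find /var/lib/cassandra/data/my_keyspace -iname "*Data.db*" > sstables.txt
    sstablerepairedset --really-set --is-repaired -f sstables.txt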

I'm not sure I understand the problem with anticompaction and
validation running concurrently. As far as I can tell, right now when a
CF is repaired (either via reaper, or via nodetool) there may be
compactions running at the same time. In fact, it happens very often.
Is it a problem ?

As for big partitions, the biggest one we have is around 3.3GB. Some
less big ones are around 500MB and less.


On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:
> Oh right, that's what they advise :)
> I'd say that you should skip the full repair phase in the migration
> procedure as that will obviously fail, and just mark all sstables as
> repaired (skip 1, 2 and 6).
> Anyway you can't do better, so take a leap of faith there.
>
> Intensity is already very low and 1 segments is a whole lot for 9
> nodes, you should not need that many.
>
> You can definitely pick which CF you'll run incremental repair on, and
> still run full repair on the rest.
> If you pick our Reaper fork, watch out for schema changes that add
> incremental repair fields, and I do not advise to run incremental
> repair without it, otherwise you might have issues with anticompaction
> and validation compactions running concurrently from time to time.
>
> One last thing : can you check if you have particularly big partitions
> in the CFs that fail to get repaired ? You can run nodetool
> cfhistograms to check that.
>
> Cheers,
>
>
>
> On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> Thanks for the response.
>>
>> We do break up repairs between tables, we also tried our best to have
>> no overlap between repair runs. Each repair has 1 segments
>> (purely arbitrary number, seemed to help at the time). Some runs have
>> an intensity of 0.4, some have as low as 0.05.
>>
>> Still, sometimes one particular app (which does a lot of
>> read/modify/write batches in quorum) gets slowed down to the point we
>> have to stop the repair run.
>>
>> But more annoyingly, since 2 to 3 weeks as I said, it looks like runs
>> don't progress after some time. Every time I restart reaper, it
>> starts to repair correctly again, up until it gets stuck. I have no
>> idea why that happens now, but it means I have to baby sit reaper,
>> and it's becoming annoying.
>>
>> Thanks for the suggestion about incremental repairs. It would
>> probably be a good thing but it's a little challenging to setup I
>> think. Right now running a full repair of all keyspaces (via nodetool
>> repair) is going to take a lot of time, probably like 5 days or more.
>> We were never able to run one to completion. I'm not sure it's a good
>> idea to disable autocompaction for that long.
>>
>> But maybe I'm wrong. Is it possible to use incremental repairs on
>> some column family only ?
>>
>>
>> On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
>>> Hi Vincent,
>>>
>>> most people handle repair with :
>>> - pain (by hand running nodetool commands)
>>> - cassandra range repair :
>>>   https://github.com/BrianGallew/cassandra_range_repair
>>> - Spotify Reaper
>>> - and OpsCenter repair service for DSE users
>>>
>>> Reaper is a good option I think and you should stick to it. If it
>>> cannot do the job here then no other tool will.
>>>
>>> You have several options from here :
>>>  * Try to break up your repair table by table and see which ones
>>>actually get stuck
>>>  * Check your logs for any repair/streaming error
>>>  * Avoid repairing everything :
>>>* you may have expendable tables
>>>* you may have TTLed only tables with no deletes, accessed with
>>>  QUORUM CL only
>>>  * You can try to relieve repair pressure in Reaper by lowering
>>>repair intensity (on the tables that get stuck)
>>>  * You can try adding steps to your repair process by putting a
>>>higher segment count in reaper (on the tables that get stuck)
>>>  * And lastly, you can turn to incremental repair. As you're
>>>familiar with Reaper already, you might want to take a look at
>>>our Reaper fork that handles incremental repair :
>>>https://github.com/thelastpickle/cassandra-reaper If you go down
>>>that way, make sure you first mark all sstables as repaired
>>>before you run your first incremental repair, otherwise you'll

Re: Tools to manage repairs

2016-10-27 Thread Vincent Rischmann
Thanks for the response.

We do break up repairs between tables, we also tried our best to have no
overlap between repair runs. Each repair has 1 segments (purely
arbitrary number, seemed to help at the time). Some runs have an
intensity of 0.4, some have as low as 0.05.

Still, sometimes one particular app (which does a lot of
read/modify/write batches in quorum) gets slowed down to the point we
have to stop the repair run.

But more annoyingly, since 2 to 3 weeks as I said, it looks like runs
don't progress after some time. Every time I restart reaper, it starts
to repair correctly again, up until it gets stuck. I have no idea why
that happens now, but it means I have to baby sit reaper, and it's
becoming annoying.

Thanks for the suggestion about incremental repairs. It would probably
be a good thing but it's a little challenging to setup I think. Right
now running a full repair of all keyspaces (via nodetool repair) is
going to take a lot of time, probably like 5 days or more. We were never
able to run one to completion. I'm not sure it's a good idea to disable
autocompaction for that long.

But maybe I'm wrong. Is it possible to use incremental repairs on some
column family only ?


On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
> Hi Vincent,
>
> most people handle repair with :
> - pain (by hand running nodetool commands)
> - cassandra range repair :
>   https://github.com/BrianGallew/cassandra_range_repair
> - Spotify Reaper
> - and OpsCenter repair service for DSE users
>
> Reaper is a good option I think and you should stick to it. If it
> cannot do the job here then no other tool will.
>
> You have several options from here :
>  * Try to break up your repair table by table and see which ones
>actually get stuck
>  * Check your logs for any repair/streaming error
>  * Avoid repairing everything :
>* you may have expendable tables
>* you may have TTLed only tables with no deletes, accessed with
>  QUORUM CL only
>  * You can try to relieve repair pressure in Reaper by lowering repair
>intensity (on the tables that get stuck)
>  * You can try adding steps to your repair process by putting a higher
>segment count in reaper (on the tables that get stuck)
>  * And lastly, you can turn to incremental repair. As you're familiar
>with Reaper already, you might want to take a look at our Reaper
>fork that handles incremental repair :
>https://github.com/thelastpickle/cassandra-reaper If you go down
>that way, make sure you first mark all sstables as repaired before
>you run your first incremental repair, otherwise you'll end up in
>anticompaction hell (bad bad place) :
>
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>Even if people say that's not necessary anymore, it'll save you
>from a very bad first experience with incremental repair.
>Furthermore, make sure you run repair daily after your first inc
>    repair run, in order to work on small sized repairs.
>
> Cheers,
>
>
> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> Hi,
>>
>> we have two Cassandra 2.1.15 clusters at work and are having some
>> trouble with repairs.
>>
>> Each cluster has 9 nodes, and the amount of data is not gigantic but
>> some column families have 300+Gb of data.
>> We tried to use `nodetool repair` for these tables but at the time we
>> tested it, it made the whole cluster load too much and it impacted
>> our production apps.
>>
>> Next we saw https://github.com/spotify/cassandra-reaper , tried it
>> and had some success until recently. Since 2 to 3 weeks it never
>> completes a repair run, deadlocking itself somehow.
>>
>> I know DSE includes a repair service but I'm wondering how do other
>> Cassandra users manage repairs ?
>>
>> Vincent.
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com[1]


Links:

  1. http://www.thelastpickle.com/


Tools to manage repairs

2016-10-27 Thread Vincent Rischmann
Hi,

we have two Cassandra 2.1.15 clusters at work and are having some
trouble with repairs.

Each cluster has 9 nodes, and the amount of data is not gigantic, but
some column families have 300+GB of data.
We tried to use `nodetool repair` for these tables but at the time we
tested it, it made the whole cluster load too much and it impacted our
production apps.
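
For context, the table-by-table approach that comes up in the replies looks roughly
like this with plain nodetool (a sketch; keyspace and table are placeholders, and
-pr restricts the run to the node's primary ranges, so it has to be run on every
node to cover the whole ring):

    nodetool repair -pr my_keyspace my_table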

Next we saw https://github.com/spotify/cassandra-reaper , tried it and
had some success until recently. Since 2 to 3 weeks it never completes a
repair run, deadlocking itself somehow.

I know DSE includes a repair service, but I'm wondering: how do other
Cassandra users manage repairs?

Vincent.