If your partitions go above ~1GB, the primary symptom you'll see is a LOT of garbage created on reads (CASSANDRA-9754 details this).
As redesigning a data model is often expensive (engineering time, reloading data, etc.), one workaround is to tune your JVM to better handle situations where you create a lot of trash. One method that can help is to use a much larger eden size than the default – up to 50% of your total heap size. For example, if you were using an 8G heap and 2G eden, going to 3G or 4G eden (new heap size in cassandra-env.sh) MAY work better for you if you're reading from large partitions (it can also crash your server in some cases, so TEST IT IN A LAB FIRST).

- Jeff

From: Alexander Dejanovski <a...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, October 27, 2016 at 2:13 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Tools to manage repairs

The "official" recommendation would be 100MB, but it's hard to give a precise answer. Keeping it under a GB seems like a good target. A few patches are pushing the limits of partition sizes, so we may soon be more comfortable with big partitions.

Cheers

On Thu, Oct 27, 2016 at 21:28, Vincent Rischmann <m...@vrischmann.me> wrote:

Yeah, that particular table is badly designed; I intend to fix it when the roadmap allows us to :)

What is the recommended maximum partition size?

Thanks for all the information.

On Thu, Oct 27, 2016, at 08:14 PM, Alexander Dejanovski wrote:

3.3GB is already too high, and it surely doesn't help compactions perform well. I know changing a data model is no easy thing to do, but you should try to do something here.

Anticompaction is a special type of compaction, and if an sstable is being anticompacted, any attempt to run a validation compaction on it will fail, telling you that an sstable cannot be part of two repair sessions at the same time. Incremental repair must therefore be run one node at a time, waiting for anticompactions to end before moving from one node to the next.
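Jeff's eden-size suggestion earlier in the thread would translate to something like the following in cassandra-env.sh. These are example values only, not a recommendation, and as he says: test it in a lab first.

```shell
# cassandra-env.sh -- example values only, TEST IN A LAB FIRST.
# Total heap stays at 8G; eden (the "new" generation) is raised
# from ~2G to 4G, i.e. 50% of the heap, to absorb read garbage
# from large partitions.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="4G"
```

With CMS (the 2.1-era default collector), a larger eden means young collections happen less often but each one scans more; whether that nets out positive depends on your read patterns, hence the lab warning.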
Be mindful of running incremental repair on a regular basis once you've started, as you'll have two separate pools of sstables (repaired and unrepaired) that won't get compacted together, which could be a problem if you want tombstones to be purged efficiently.

Cheers,

On Thu, Oct 27, 2016 at 17:57, Vincent Rischmann <m...@vrischmann.me> wrote:

Ok, I think we'll give incremental repairs a try on a limited number of CFs first, and if it goes well we'll progressively switch more CFs to incremental.

I'm not sure I understand the problem with anticompaction and validation running concurrently. As far as I can tell, right now when a CF is repaired (either via Reaper or via nodetool) there may be compactions running at the same time. In fact, it happens very often. Is that a problem?

As for big partitions, the biggest one we have is around 3.3GB. Some less big partitions are around 500MB and less.

On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:

Oh right, that's what they advise :)

I'd say that you should skip the full repair phase in the migration procedure, as that will obviously fail, and just mark all sstables as repaired (skip steps 1, 2 and 6). You can't do better anyway, so take a leap of faith there.

Intensity is already very low, and 10000 segments is a whole lot for 9 nodes; you should not need that many.

You can definitely pick which CFs you'll run incremental repair on, and still run full repair on the rest. If you pick our Reaper fork, watch out for the schema changes that add the incremental repair fields, and I do not advise running incremental repair without it, otherwise you might have issues with anticompaction and validation compactions running concurrently from time to time.

One last thing: can you check if you have particularly big partitions in the CFs that fail to get repaired? You can run nodetool cfhistograms to check that.

Cheers,

On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <m...@vrischmann.me> wrote:

Thanks for the response.
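(Checking partition sizes the way Alexander suggests above would look like the following on 2.1; keyspace and table names are placeholders.)

```shell
# Per-table percentiles, including a "Partition Size" column in bytes:
nodetool cfhistograms my_keyspace my_table

# cfstats also reports "Compacted partition maximum bytes" per table:
nodetool cfstats my_keyspace.my_table
```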
We do break up repairs between tables, and we also tried our best to have no overlap between repair runs. Each repair has 10000 segments (a purely arbitrary number that seemed to help at the time). Some runs have an intensity of 0.4, some as low as 0.05. Still, sometimes one particular app (which does a lot of read/modify/write batches at QUORUM) gets slowed down to the point where we have to stop the repair run.

But more annoyingly, for the last 2 to 3 weeks, as I said, runs stop progressing after some time. Every time I restart Reaper, it starts repairing correctly again, up until it gets stuck. I have no idea why that happens now, but it means I have to babysit Reaper, and it's becoming annoying.

Thanks for the suggestion about incremental repairs. It would probably be a good thing, but I think it's a little challenging to set up. Right now, running a full repair of all keyspaces (via nodetool repair) would take a lot of time, probably 5 days or more; we were never able to run one to completion. I'm not sure it's a good idea to disable autocompaction for that long, but maybe I'm wrong. Is it possible to use incremental repairs on some column families only?

On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:

Hi Vincent,

most people handle repair with:
- pain (running nodetool commands by hand)
- cassandra_range_repair: https://github.com/BrianGallew/cassandra_range_repair
- Spotify Reaper
- and the OpsCenter repair service, for DSE users

Reaper is a good option, I think, and you should stick to it. If it cannot do the job here then no other tool will.
You have several options from here:
- Try to break up your repairs table by table and see which ones actually get stuck.
- Check your logs for any repair/streaming errors.
- Avoid repairing everything: you may have expendable tables, or TTL-only tables with no deletes that are accessed with QUORUM CL only.
- Try to relieve repair pressure in Reaper by lowering the repair intensity (on the tables that get stuck).
- Try adding steps to your repair process by using a higher segment count in Reaper (on the tables that get stuck).
- And lastly, you can turn to incremental repair.

As you're familiar with Reaper already, you might want to take a look at our Reaper fork that handles incremental repair: https://github.com/thelastpickle/cassandra-reaper

If you go down that way, make sure you first mark all sstables as repaired before you run your first incremental repair, otherwise you'll end up in anticompaction hell (a bad, bad place): https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
Even if people say that's not necessary anymore, it will save you from a very bad first experience with incremental repair.

Furthermore, make sure you run repair daily after your first incremental repair run, in order to keep each repair small.

Cheers,

On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <m...@vrischmann.me> wrote:

Hi,

we have two Cassandra 2.1.15 clusters at work and are having some trouble with repairs. Each cluster has 9 nodes, and the amount of data is not gigantic, but some column families have 300+GB of data.

We tried to use `nodetool repair` for these tables, but at the time we tested it, it loaded the whole cluster too much and impacted our production apps. Next we found https://github.com/spotify/cassandra-reaper , tried it, and had some success until recently. For the last 2 to 3 weeks it has never completed a repair run, somehow deadlocking itself.
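A rough sketch of the "mark all sstables as repaired" step from the migration procedure linked above, plus a per-table incremental repair afterwards. Paths, keyspace, and table names are placeholders; flag names are from the 2.1-era tools, so verify them against your version, and note that sstablerepairedset must be run while the node is stopped.

```shell
# With the node STOPPED: collect the table's sstables and mark them
# as repaired (placeholder data path; adapt keyspace/table).
find /var/lib/cassandra/data/my_keyspace/my_table-*/ -name "*Data.db" \
  > /tmp/sstables.txt
sstablerepairedset --really-set --is-repaired -f /tmp/sstables.txt

# After restarting the node, subsequent incremental repairs can be
# run per table (incremental repair is parallel in 2.1):
nodetool repair -par -inc my_keyspace my_table
```

This matches Alexander's advice to skip the initial full-repair steps of the documented procedure and go straight to marking sstables, accepting the small window of unrepaired data that implies.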
I know DSE includes a repair service, but I'm wondering: how do other Cassandra users manage repairs?

Vincent.

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com