I have plans to do so in the near-ish future.  People keep adding things to my 
to-do list, and I don’t have something on my to-do list yet saying “stop people 
from adding things to my to-do list”.  😉

Assuming I get to that point, if I answer something and I think something I 
wrote is relevant, I’ll point to it for those who want more details.  In email 
discussion threads it is sometimes more helpful to keep things a bit 
abbreviated.  Not everybody needs the details, and many people have more 
context than I do and can fill in the backstory on their own.

R

From: Sergio <lapostadiser...@gmail.com>
Date: Wednesday, January 22, 2020 at 4:46 PM
To: Reid Pinchback <rpinchb...@tripadvisor.com>
Cc: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Thanks for the explanation. It deserves a blog post.

Sergio

On Wed, Jan 22, 2020, 1:22 PM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
The reaper logs will say if nodes are being skipped.  The web UI isn’t that 
good at making it apparent.  You can sometimes tell it is likely happening when 
you see time gaps between parts of the repair.  That covers nodes skipped 
because of a timeout, but not only those.  The gaps are mostly controlled by 
the combined effect of segmentCountPerNode, repairIntensity, and 
hangingRepairTimeoutMins.  The last of the three is the most obvious influence 
on timeouts, but the other two have some impact on the work attempted and on 
the size of the time gaps.  However, the C* version also has some bearing, as 
it influences how hard it is to process the data needed for repairs.
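
(For reference, those three knobs live in Reaper’s cassandra-reaper.yaml.  The 
values below are just the illustrative defaults I recall from recent Reaper 
versions, so check your own install before copying anything:

# cassandra-reaper.yaml (values illustrative, not a recommendation)
segmentCountPerNode: 64
repairIntensity: 0.9
hangingRepairTimeoutMins: 30

The skip/postpone messages come from Reaper’s segment runner, and the exact 
wording varies by version, so search the reaper log for postpone/skip-style 
messages rather than one fixed string.)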

The more subtle aspect of node skipping isn’t the hanging repairs.  When repair 
of a token range is first attempted, Reaper uses JMX to ask C* if a repair is 
already underway.  The way it asks is very simplistic, so it doesn’t mean a 
repair is underway for that particular token range.  It just means something 
looking like a repair is going on.  Basically it just asks “hey is there a 
thread with the right magic naming pattern?”  The problem, I think, is that 
repair activity triggered by reads and writes on inconsistent data shows up as 
these kinds of threads too.  If you have a C*-unfriendly usage pattern (where 
you write and then very soon read back), then logically you’d expect this to 
happen quite a lot.
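
If you want a rough way to eyeball whether something repair-like is active on 
a node, plain nodetool gives you a hint.  To be clear, this is not the exact 
JMX check Reaper performs, just a sanity check:

nodetool compactionstats    # validation compactions triggered by repair show up here
nodetool netstats           # repair streaming sessions show up here

Neither is conclusive on its own, but together they give you a feel for how 
much repair-ish work a node is doing.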

I’m not an expert on the internals since I’m not one of the C* contributors, 
but having stared at that part of the source quite a bit this year, that’s my 
take on what can happen.  And if I’m correct, that’s not a thing you can tune 
for. It is a consequence of C*-unfriendly usage patterns.

Bottom line though is that tuning repairs is only something you do if you find 
that repairs are taking longer than makes sense to you.  It’s totally separate 
from the notion that you should be able to run reaper-controlled repairs at 
least 2x per gc grace seconds.  That’s just a case of making some observations 
on the arithmetic of time intervals.


From: Sergio <lapostadiser...@gmail.com>
Date: Wednesday, January 22, 2020 at 4:08 PM
To: Reid Pinchback <rpinchb...@tripadvisor.com>
Cc: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Thank you very much for your extended response.
Should I look for some particular message in the logs to detect such behavior?
How do you tune it?

Thanks,

Sergio

On Wed, Jan 22, 2020, 12:59 PM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
Kinda. It isn’t that you have to repair twice per se; it’s that being able to 
run repairs at least twice before GC grace seconds elapse means there is no 
chance of a tombstone escaping repair at least once before you hit your GC 
grace seconds.

Imagine a tombstone being created on the very first node that Reaper looked at 
in a repair cycle, but one second after Reaper completed repair of that 
particular token range.  Repairs will be complete, but that particular 
tombstone just missed being part of the effort.

Now your next repair run happens.  What if Reaper doesn’t look at that same 
node first?  It is easy to have happen, as there is a bunch of logic related to 
detection of existing repairs or things taking too long.  So the box that was 
“the first node” in that first repair run, through bad luck gets kicked down to 
later in the second run.  I’ve seen nodes get skipped multiple times (you can 
tune to reduce that, but bottom line… it happens).  So, bad luck you’ve got.  
Eventually the node does get repaired, and the aging tombstone finally gets 
removed.  All fine and dandy…

Provided that the second repair run got to that point BEFORE you hit your GC 
grace seconds.

That’s why you need enough time to run it twice.  Because you need enough time 
to catch the oldest possible tombstone, even if it is dealt with at the very 
end of a repair run.  Yes, it sounds like a bit of a degenerate case, but if 
you are writing a lot of data, the probability of never hitting the degenerate 
case becomes vanishingly small.

R


From: Sergio <lapostadiser...@gmail.com>
Date: Wednesday, January 22, 2020 at 1:41 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Reid Pinchback 
<rpinchb...@tripadvisor.com>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

I was wondering if I should always complete 2 repair cycles with reaper even 
if one repair cycle finishes in 7 hours.

Currently I have around 200GB of column family data to be repaired; I was 
scheduling one repair a week and not seeing too much stress on my 8-node 
cluster with i3.xlarge nodes.

Thanks,

Sergio

On Wed, Jan 22, 2020 at 8:28 AM Sergio <lapostadiser...@gmail.com> wrote:
Thank you very much! Yes I am using reaper!

Best,

Sergio

On Wed, Jan 22, 2020, 8:00 AM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
Sergio, if you’re looking for a new repair frequency because of the change and 
you are using reaper, then I’d go for repair_freq <= gc_grace / 2.
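
To put numbers on that (using the 8-day figure from further down this thread):

  gc_grace_seconds = 8 days = 8 * 86400 = 691200 seconds
  repair_freq     <= gc_grace / 2 = 4 days (345600 seconds)

So with reaper you’d want every full repair run, including any time lost to 
skipped or postponed segments, to land within roughly 4 days.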

Just serendipity with a conversation I was having at work this morning.  When 
you actually watch the reaper logs, you can see situations where unlucky 
timing with skipped nodes can make the time to remove a tombstone be up to 2 x 
repair_run_time.

If you aren’t using reaper, your mileage will vary, particularly if your 
repairs are consistent in their ordering across nodes.  Reaper can be 
moderately non-deterministic, hence the need to be sure you can complete at 
least two repair runs.

R

From: Sergio <lapostadiser...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 21, 2020 at 7:13 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Thank you very much for your response.
The considerations mentioned are the ones that I was expecting.
I believe that I am good to go.
I just wanted to make sure that there was no need to run any extra command 
besides that one.

Best,

Sergio

On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa <jji...@gmail.com> wrote:
Note that if you're actually running repairs within 5 days and you adjust this 
to 8, you may stream a bunch of tombstones across in that 5-8 day window, which 
can increase disk usage / compaction (because as you pass 5 days, one replica 
may gc away the tombstones while the others may not, since the tombstones 
shadow data, so you'll re-stream the tombstone to the other replicas).

On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims <elli...@backblaze.com> wrote:
In addition to extra space, queries can potentially be more expensive because 
more dead rows and tombstones will need to be scanned.  How much of a 
difference this makes will depend drastically on the schema and access pattern, 
but I wouldn't expect going from 5 days to 8 to be very noticeable.

On Tue, Jan 21, 2020 at 2:14 PM Sergio <lapostadiser...@gmail.com> wrote:
https://stackoverflow.com/a/22030790

For CQLSH

alter table <table_name> with GC_GRACE_SECONDS = <seconds>;
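
For the 8-day value being discussed here (8 * 86400 = 691200 seconds), that 
would look like the following, where my_keyspace.my_table is just a placeholder:

alter table my_keyspace.my_table with GC_GRACE_SECONDS = 691200;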



On Tue, Jan 21, 2020 at 1:12 PM Sergio <lapostadiser...@gmail.com> wrote:
Hi guys!

I just wanted to confirm with you before doing such an operation. I expect the 
space to increase, but nothing more than that. I just need to perform:

UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days
Is it correct?

Thanks,

Sergio
