Thanks for the explanation. It deserves a blog post.

Sergio
On Wed, Jan 22, 2020, 1:22 PM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:

The reaper logs will say if nodes are being skipped. The web UI isn't that good at making it apparent. You can sometimes tell it is likely happening when you see time gaps between parts of the repair. That happens when nodes are skipped because of a timeout, but not only then. The gaps are mostly controlled by the combined results of segmentCountPerNode, repairIntensity, and hangingRepairTimeoutMins. The last of those three is the most obvious influence on timeouts, but the other two have some impact on the work attempted and the size of the time gaps. The C* version also has some bearing, as it influences how hard it is to process the data needed for repairs.

The more subtle aspect of node skipping isn't the hanging repairs. When repair of a token range is first attempted, Reaper uses JMX to ask C* if a repair is already underway. The way it asks is very simplistic, so a "yes" doesn't mean a repair is underway for that particular token range; it just means something that looks like a repair is going on. Basically it asks "hey, is there a thread with the right magic naming pattern?" The problem, I think, is that when repair activity is triggered by reads and writes of inconsistent data, I believe it shows up as these kinds of threads too. If you have a C*-unfriendly usage pattern (where you write and then very soon read back), then logically you'd expect this to happen quite a lot.

I'm not an expert on the internals since I'm not one of the C* contributors, but having stared at that part of the source quite a bit this year, that's my take on what can happen. And if I'm correct, that's not a thing you can tune for. It is a consequence of C*-unfriendly usage patterns.

Bottom line though is that tuning repairs is only something you do if you find that repairs are taking longer than makes sense to you. It's totally separate from the notion that you should be able to run reaper-controlled repairs at least 2x per gc_grace_seconds. That's just a case of making some observations on the arithmetic of time intervals.
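A minimal sketch of acting on "the reaper logs will say if nodes are being skipped": the snippet below just scans a Reaper log for messages that suggest segments were postponed, skipped, or timed out. The log path is a placeholder and the exact message wording differs between Reaper versions, so treat the keywords as assumptions to adjust.

    # Sketch: scan a Reaper log for hints that segments were postponed or skipped.
    # The path is a placeholder; the keywords are assumptions, since the exact
    # wording of these messages differs between Reaper versions.
    import re

    LOG_PATH = "/var/log/cassandra-reaper/reaper.log"
    PATTERN = re.compile(r"postpon|skip|timed out", re.IGNORECASE)

    with open(LOG_PATH) as log:
        for line in log:
            if PATTERN.search(line):
                print(line.rstrip())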
From: Sergio <lapostadiser...@gmail.com>
Date: Wednesday, January 22, 2020 at 4:08 PM
To: Reid Pinchback <rpinchb...@tripadvisor.com>
Cc: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

Thank you very much for your extended response.

Should I look for some particular message in the log to detect such behavior?

How do you tune it?

Thanks,

Sergio

On Wed, Jan 22, 2020, 12:59 PM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:

Kinda. It isn't that you have to repair twice per se, just that being able to run repairs at least twice before gc_grace_seconds elapses means every tombstone is sure to be covered by at least one repair before you hit your GC grace seconds.

Imagine a tombstone being created on the very first node that Reaper looked at in a repair cycle, but one second after Reaper completed repair of that particular token range. Repairs will be complete, but that particular tombstone just missed being part of the effort.

Now your next repair run happens. What if Reaper doesn't look at that same node first? It is easy for that to happen, as there is a bunch of logic related to detecting existing repairs or things taking too long. So the box that was "the first node" in that first repair run, through bad luck, gets kicked down to later in the second run. I've seen nodes get skipped multiple times (you can tune to reduce that, but bottom line… it happens). So, bad luck you've got. Eventually the node does get repaired, and the aging tombstone finally gets removed. All fine and dandy…

Provided that the second repair run got to that point BEFORE you hit your GC grace seconds.

That's why you need enough time to run it twice. You need enough time to catch the oldest possible tombstone, even if it is dealt with at the very end of a repair run. Yes, it sounds like a bit of a degenerate case, but if you are writing a lot of data, the probability of these degenerate cases never becoming real cases is vanishingly small.

R

From: Sergio <lapostadiser...@gmail.com>
Date: Wednesday, January 22, 2020 at 1:41 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Reid Pinchback <rpinchb...@tripadvisor.com>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

I was wondering if I should always complete 2 repair cycles with Reaper even if one repair cycle finishes in 7 hours.

Currently, I have around 200GB in column family data size to be repaired. I was scheduling one repair a week and was not seeing too much stress on my 8-node cluster of i3.xlarge nodes.

Thanks,

Sergio

On Wed, Jan 22, 2020 at 08:28, Sergio <lapostadiser...@gmail.com> wrote:

Thank you very much! Yes, I am using Reaper!

Best,

Sergio

On Wed, Jan 22, 2020, 8:00 AM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:

Sergio, if you're looking for a new frequency for your repairs because of the change, and you are using Reaper, then I'd go for repair_freq <= gc_grace / 2.

Just serendipity with a conversation I was having at work this morning: when you actually watch the Reaper logs, you can see situations where unlucky timing with skipped nodes can make the time to remove a tombstone be up to 2 x repair_run_time.

If you aren't using Reaper, your mileage will vary, particularly if your repairs are consistent in their ordering across nodes. Reaper can be moderately non-deterministic, hence the need to be sure you can complete at least two repair runs.

R

From: Sergio <lapostadiser...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 21, 2020 at 7:13 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

Thank you very much for your response.

The considerations mentioned are the ones that I was expecting.

I believe that I am good to go.

I just wanted to make sure that there was no need to run any other extra command besides that one.

Best,

Sergio
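As a rough worked example of Reid's repair_freq <= gc_grace / 2 rule of thumb, a small sketch (the 7-hour figure is Sergio's observed run time from above; everything else follows from an 8-day gc_grace_seconds):

    # Sketch of the scheduling arithmetic for gc_grace_seconds = 8 days.
    DAY = 24 * 3600

    gc_grace_seconds = 8 * DAY                        # 691200 seconds
    max_repair_interval = gc_grace_seconds / 2        # repair at least twice per gc_grace
    repair_run_seconds = 7 * 3600                     # Sergio's observed full-run duration

    print(max_repair_interval / DAY)                  # 4.0 -> start a run at least every 4 days
    print(repair_run_seconds < max_repair_interval)   # True: a 7-hour run fits comfortably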
On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa <jji...@gmail.com> wrote:

Note that if you're actually running repairs within 5 days and you adjust this to 8, you may stream a bunch of tombstones across in that 5-8 day window, which can increase disk usage / compaction (because as you pass 5 days, one replica may gc away the tombstones while the others may not, because the tombstones shadow data, so you'll re-stream the tombstone to the other replicas).

On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims <elli...@backblaze.com> wrote:

In addition to extra space, queries can potentially be more expensive because more dead rows and tombstones will need to be scanned. How much of a difference this makes will depend drastically on the schema and access pattern, but I wouldn't expect going from 5 days to 8 to be very noticeable.

On Tue, Jan 21, 2020 at 2:14 PM Sergio <lapostadiser...@gmail.com> wrote:

https://stackoverflow.com/a/22030790

For cqlsh:

alter table <table_name> with GC_GRACE_SECONDS = <seconds>;

On Tue, Jan 21, 2020 at 13:12, Sergio <lapostadiser...@gmail.com> wrote:

Hi guys!

I just wanted to confirm with you before doing such an operation. I expect the space to increase, but nothing more than that. I need to perform just:

UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; // 8 days

Is it correct?

Thanks,

Sergio
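For completeness, a sketch of applying and verifying the change end to end. The UPDATE COLUMN FAMILY form above is old cassandra-cli syntax; in CQL it is the ALTER TABLE statement already quoted in the thread, with the value written without the comma (691200). Host, keyspace, and table names below are placeholders, and the verification query assumes Cassandra 3.x+ (system_schema) and the DataStax Python driver.

    # Sketch: set gc_grace_seconds to 8 days and read it back.
    # Placeholders: contact point, keyspace, and table names.
    from cassandra.cluster import Cluster

    KEYSPACE, TABLE = "my_keyspace", "my_table"

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # 8 days expressed in seconds (691200); CQL does not accept "691,200"
    session.execute(
        f"ALTER TABLE {KEYSPACE}.{TABLE} WITH gc_grace_seconds = {8 * 24 * 3600}"
    )

    # Read the setting back from the schema tables (Cassandra 3.x+)
    row = session.execute(
        "SELECT gc_grace_seconds FROM system_schema.tables "
        "WHERE keyspace_name = %s AND table_name = %s",
        (KEYSPACE, TABLE),
    ).one()
    print(row.gc_grace_seconds)  # expect 691200

    cluster.shutdown()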