Thanks Alexander,

After a rolling restart the blocked repair job stopped and I was able to run repair again.

Regards,
Robert
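A minimal sketch of the per-node rolling-restart step discussed in this thread; the systemd unit name "cassandra", the use of sudo, and the grep patterns are assumptions, not details from the thread:

    nodetool drain                      # flush memtables, stop accepting writes
    sudo systemctl restart cassandra    # restart the daemon
    # wait until this node is back Up/Normal before touching the next one
    until nodetool status | grep -q "^UN *$(hostname -i)"; do sleep 10; done
    # the stuck repair sessions should now be gone:
    nodetool tpstats | grep -i repair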
On Wed, Sep 28, 2016 at 6:46 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

Robert,

You can restart them in any order; that doesn't make a difference AFAIK.

Cheers

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On Wed, Sep 28, 2016 at 5:10 PM, Robert Sicoie <robert.sic...@gmail.com> wrote:

Thanks Alexander,

Yes, with tpstats I can see the hanging active repair(s) (output attached). One node has 31 pending repairs; the others have fewer (at least 12). Is there any recommendation for the restart order? The ones with fewer pending repairs first, perhaps?

Thanks,
Robert

On Wed, Sep 28, 2016 at 5:35 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

They will show up in nodetool compactionstats: https://issues.apache.org/jira/browse/CASSANDRA-9098

Did you check nodetool tpstats to see whether you still have a running repair session? Just to make sure (and if you can actually do it), do a rolling restart of the cluster and try again. Repair sessions can get sticky sometimes.

On Wed, Sep 28, 2016 at 4:23 PM, Robert Sicoie <robert.sic...@gmail.com> wrote:

I am using nodetool compactionstats to check for pending compactions, and it shows 0 pending on all nodes seconds before I run nodetool repair. I am also monitoring PendingCompactions over JMX.

Is there any other way I can find out whether an anticompaction is running on any node?

Thanks a lot,
Robert

On Wed, Sep 28, 2016 at 4:44 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

Robert,

you need to make sure you have no repair session currently running on your cluster, and no anticompaction. I'd recommend doing a rolling restart to be sure all running repairs are stopped, then starting the process again, node by node, checking that no anticompaction is running before moving from one node to the next.

Please do not use the -pr switch. It is both useless (with incremental repair, token ranges are repaired only once, whatever the replication factor) and harmful, as not all anticompactions will be executed: you'll still have SSTables marked as unrepaired even if the process has run entirely without error.

Let us know how that goes.

Cheers,

On Wed, Sep 28, 2016 at 2:57 PM, Robert Sicoie <robert.sic...@gmail.com> wrote:

Thanks Alexander,

Now I have started to run the repair with the -pr arg and with keyspace and table args. Still, I got

    ERROR [RepairJobTask:1] 2016-09-28 11:34:38,288 RepairRunnable.java:246 - Repair session 89af4d10-856f-11e6-b28f-df99132d7979 for range [(8323429577695061526,8326640819362122791], ..., (4212695343340915405,4229348077081465596]]] Validation failed in /10.45.113.88

for one of the tables. 10.45.113.88 is the IP of the machine I am running nodetool on. I'm wondering if this is normal...

Thanks,
Robert
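The pre-check Alexander describes above can be scripted. A minimal sketch, assuming five illustrative host names and that nodetool can reach each node's JMX port (both assumptions):

    # Anticompactions show up in compactionstats (CASSANDRA-9098);
    # active repair threads show up in tpstats. Host names are placeholders.
    for host in node1 node2 node3 node4 node5; do
      echo "== $host =="
      nodetool -h "$host" compactionstats | grep -iE 'anticompaction|validation'
      nodetool -h "$host" tpstats | grep -iE 'repair|antientropy|validation'
    done

Only start repair on the next node once neither command matches anything. Per the advice above, the repair itself would then be a plain "nodetool repair <keyspace> <table>" with no -pr, incremental repair being the default in 3.0.x.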
On Wed, Sep 28, 2016 at 11:53 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

Hi,

nodetool scrub won't help here. What you're experiencing is most likely that one SSTable is going through anticompaction, and then another node asks for a Merkle tree that involves it. For understandable reasons, an SSTable cannot be anticompacted and validation compacted at the same time.

The solution here is to adjust the repair pressure on your cluster so that anticompaction can end before you run repair on another node. You may have a lot of anticompaction to do if you had high volumes of unrepaired data, which can take a long time depending on several factors.

You can tune your repair process to make sure no anticompaction is running before launching a new session on another node, or you can try my Reaper fork that handles incremental repair: https://github.com/adejanovski/cassandra-reaper/tree/inc-repair-support-with-ui
I may have to add a few checks in order to avoid all collisions between anticompactions and new sessions, but it should be helpful if you struggle with incremental repair.

In any case, check that your nodes are no longer anticompacting before trying to run a new repair session on a node.

Cheers,
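A sketch of what "check that no anticompaction is running before launching a new session" could look like when serializing incremental repair across the cluster; the host list and the keyspace name "my_keyspace" are hypothetical:

    hosts="node1 node2 node3 node4 node5"
    # true (exit 0) while any node still has an anticompaction running
    anticompacting() {
      for h in $hosts; do
        nodetool -h "$h" compactionstats | grep -qi anticompaction && return 0
      done
      return 1
    }
    for host in $hosts; do
      nodetool -h "$host" repair my_keyspace     # incremental by default on 3.0
      # anticompaction also runs on the replicas, hence polling every node
      while anticompacting; do sleep 60; done
    done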
On Wed, Sep 28, 2016 at 10:31 AM, Robert Sicoie <robert.sic...@gmail.com> wrote:

Hi guys,

I have a cluster of 5 nodes, cassandra 3.0.5. I was running nodetool repair over the last few days, one node at a time, when I first encountered this exception:

    ERROR [ValidationExecutor:11] 2016-09-27 16:12:20,409 CassandraDaemon.java:195 - Exception in thread Thread[ValidationExecutor:11,1,main]
    java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
        at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1194) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1084) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:80) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:714) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

On some of the other boxes I see this:

    Caused by: org.apache.cassandra.exceptions.RepairException: [repair #9dd21ab0-83f4-11e6-b28f-df99132d7979 on notes/operator_source_mv, [(-7505573573695693981,-7495786486761919991],
    ....
    (-8483612809930827919,-8480482504800860871]]] Validation failed in /10.45.113.67
        at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:408) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
    ERROR [RepairJobTask:3] 2016-09-26 16:39:33,096 CassandraDaemon.java:195 - Exception in thread Thread[RepairJobTask:3,5,RMI Runtime]
    java.lang.AssertionError: java.lang.InterruptedException
        at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:172) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:761) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:729) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.repair.ValidationTask.run(ValidationTask.java:56) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]
    Caused by: java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) ~[na:1.8.0_60]
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) ~[na:1.8.0_60]
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) ~[na:1.8.0_60]
        at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
        ... 6 common frames omitted
Now if I run nodetool repair I get the

    java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables

exception. What do you suggest? Would nodetool scrub or sstablescrub help in this case, or would it just make things worse?

Thanks,

Robert
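As a follow-up to the point above about SSTables being left marked unrepaired: one way to audit the repaired state after an incremental repair run is sstablemetadata. A sketch, with a hypothetical data path and keyspace name:

    # "Repaired at: 0" means the SSTable is still unrepaired; a non-zero
    # timestamp means incremental repair has marked it repaired.
    for f in /var/lib/cassandra/data/my_keyspace/*/*-Data.db; do
      echo "$f: $(sstablemetadata "$f" | grep 'Repaired at')"
    done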