Re: How to get rid of "Cannot start multiple repair sessions over the same sstables" exception

Alexander Dejanovski Wed, 28 Sep 2016 07:37:00 -0700

Anticompactions will show up in nodetool compactionstats:
https://issues.apache.org/jira/browse/CASSANDRA-9098


Did you check nodetool tpstats to make sure no repair session is still
running?
Just to be safe (and if you can actually do it), do a rolling restart of the
cluster and try again. Repair sessions can get sticky sometimes.
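
A rough sketch of how the "no anticompaction before moving to the next node"
check could be scripted. It rests on assumptions that are not confirmed in this
thread: that nodetool is on the PATH, that `-h <host>` reaches each node, and
that an anticompaction row in the compactionstats output contains the word
"anticompaction". The function names are made up for illustration.

```python
import subprocess
import time


def has_anticompaction(compactionstats_output: str) -> bool:
    """Return True if any output line mentions anticompaction.

    Assumption: anticompaction rows contain that word (case-insensitive).
    """
    return any("anticompaction" in line.lower()
               for line in compactionstats_output.splitlines())


def compactionstats(host: str) -> str:
    """Run `nodetool -h <host> compactionstats` and return its stdout."""
    result = subprocess.run(
        ["nodetool", "-h", host, "compactionstats"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def wait_for_anticompaction(host, poll_seconds=30,
                            fetch=compactionstats, sleep=time.sleep):
    """Poll one node until no anticompaction is listed.

    Returns the number of polls that were needed (at least 1).
    """
    polls = 1
    while has_anticompaction(fetch(host)):
        sleep(poll_seconds)
        polls += 1
    return polls
```

Calling wait_for_anticompaction on each node in turn, before launching repair
on the next one, would approximate the manual procedure. The fetch and sleep
parameters exist only so the loop can be exercised without a live cluster.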

On Wed, Sep 28, 2016 at 4:23 PM Robert Sicoie <[email protected]>
wrote:

> I am using nodetool compactionstats to check for pending compactions, and
> it shows 0 pending on all nodes seconds before I run nodetool repair.
> I am also monitoring PendingCompactions over JMX.
>
> Is there any other way to find out whether an anticompaction is running
> on any node?
>
> Thanks a lot,
> Robert
>
> Robert Sicoie
>
> On Wed, Sep 28, 2016 at 4:44 PM, Alexander Dejanovski <
> [email protected]> wrote:
>
>> Robert,
>>
>> you need to make sure that no repair session and no anticompaction is
>> currently running on your cluster.
>> I'd recommend doing a rolling restart to be sure all running repairs are
>> stopped, then starting the process again, node by node, checking that no
>> anticompaction is running before moving from one node to the next.
>>
>> Please do not use the -pr switch, as it is both useless (token ranges are
>> repaired only once with incremental repair, whatever the replication
>> factor) and harmful, as not all anticompactions will be executed (you'll
>> still have sstables marked as unrepaired even if the process has run to
>> completion with no error).
>>
>> Let us know how that goes.
>>
>> Cheers,
>>
>> On Wed, Sep 28, 2016 at 2:57 PM Robert Sicoie <[email protected]>
>> wrote:
>>
>>> Thanks Alexander,
>>>
>>> I have now started running the repair with the -pr argument and with
>>> keyspace and table arguments.
>>> Still, I got the "ERROR [RepairJobTask:1] 2016-09-28 11:34:38,288
>>> RepairRunnable.java:246 - Repair session
>>> 89af4d10-856f-11e6-b28f-df99132d7979 for range
>>> [(8323429577695061526,8326640819362122791],
>>> ..., (4212695343340915405,4229348077081465596]]] Validation failed in
>>> /10.45.113.88"
>>>
>>> for one of the tables. 10.45.113.88 is the ip of the machine I am
>>> running the nodetool on.
>>> I'm wondering if this is normal...
>>>
>>> Thanks,
>>> Robert
>>>
>>>
>>>
>>>
>>> Robert Sicoie
>>>
>>> On Wed, Sep 28, 2016 at 11:53 AM, Alexander Dejanovski <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> nodetool scrub won't help here, as what you're experiencing is most
>>>> likely that one SSTable is going through anticompaction, and then another
>>>> node is asking for a Merkle tree that involves it.
>>>> For understandable reasons, an SSTable cannot be anticompacted and
>>>> validation compacted at the same time.
>>>>
>>>> The solution here is to adjust the repair pressure on your cluster so
>>>> that anticompaction can end before you run repair on another node.
>>>> You may have a lot of anticompaction to do if you had high volumes of
>>>> unrepaired data, which can take a long time depending on several factors.
>>>>
>>>> You can tune your repair process to make sure no anticompaction is
>>>> running before launching a new session on another node or you can try my
>>>> Reaper fork that handles incremental repair :
>>>> https://github.com/adejanovski/cassandra-reaper/tree/inc-repair-support-with-ui
>>>> I may have to add a few checks in order to avoid all collisions between
>>>> anticompactions and new sessions, but it should be helpful if you struggle
>>>> with incremental repair.
>>>>
>>>> In any case, check if your nodes are still anticompacting before trying
>>>> to run a new repair session on a node.
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> On Wed, Sep 28, 2016 at 10:31 AM Robert Sicoie <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have a cluster of 5 nodes running Cassandra 3.0.5.
>>>>> I was running nodetool repair over the last few days, one node at a
>>>>> time, when I first encountered this exception:
>>>>>
>>>>> ERROR [ValidationExecutor:11] 2016-09-27 16:12:20,409 CassandraDaemon.java:195 - Exception in thread Thread[ValidationExecutor:11,1,main]
>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1194) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1084) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:80) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:714) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>>
>>>>> On some of the other boxes I see this:
>>>>>
>>>>>
>>>>> Caused by: org.apache.cassandra.exceptions.RepairException: [repair #9dd21ab0-83f4-11e6-b28f-df99132d7979 on notes/operator_source_mv, [(-7505573573695693981,-7495786486761919991],
>>>>> ....
>>>>>  (-8483612809930827919,-8480482504800860871]]] Validation failed in /10.45.113.67
>>>>>     at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:408) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>> ERROR [RepairJobTask:3] 2016-09-26 16:39:33,096 CassandraDaemon.java:195 - Exception in thread Thread[RepairJobTask:3,5,RMI Runtime]
>>>>> java.lang.AssertionError: java.lang.InterruptedException
>>>>>     at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:172) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:761) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:729) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at org.apache.cassandra.repair.ValidationTask.run(ValidationTask.java:56) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_60]
>>>>>     at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]
>>>>> Caused by: java.lang.InterruptedException: null
>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) ~[na:1.8.0_60]
>>>>>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) ~[na:1.8.0_60]
>>>>>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) ~[na:1.8.0_60]
>>>>>     at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>     ... 6 common frames omitted
>>>>>
>>>>>
>>>>> Now if I run nodetool repair I get the
>>>>>
>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>
>>>>> exception.
>>>>> What do you suggest? Would nodetool scrub or sstablescrub help in this
>>>>> case, or would it just make things worse?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Robert
>>>>>
>>>> --
>>>> -----------------
>>>> Alexander Dejanovski
>>>> France
>>>> @alexanderdeja
>>>>
>>>> Consultant
>>>> Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
