[ https://issues.apache.org/jira/browse/CASSANDRA-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333412#comment-15333412 ]

Heiko Sommer edited comment on CASSANDRA-10389 at 6/16/16 9:09 AM:
-------------------------------------------------------------------

I'm getting the same problem with Cassandra 2.2.5 on a cluster of 6 nodes with RF=2. 
As a workaround, I must restart all nodes before running a repair. 

I am certainly not starting multiple repairs simultaneously. Here is what happened 
the last time I tried: the previous incremental repair ("{{nodetool 
repair --partitioner-range -- mykeyspace}}"), started on a single node after a 
rolling cluster restart, finished cleanly, with the expected number of "Session 
completed successfully" logs. No further repair or anticompaction tasks were 
running, and the cluster was stable. I then restarted C* on 4 nodes but left it 
running on the other 2. On one of the restarted nodes I ran an incremental repair 
again, this time also with the "{{--sequential}}" option. 
On the repairing node I get failure logs such as
{noformat}
java.lang.RuntimeException: Could not create snapshot at /10.195.62.171
        at 
org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:79)
 ~[apache-cassandra-2.2.5.jar:2.2.5]
ERROR [Repair#1:16] 2016-06-16 07:10:29,239 CassandraDaemon.java:185 - 
Exception in thread Thread[Repair#1:16,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: 
java.lang.RuntimeException: Could not create snapshot at /10.195.62.171
        at 
com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1387)
 ~[guava-16.0.jar:na]
{noformat}
while on the failing target nodes (those that were not restarted before the 
repair) I get logs such as
{noformat}
ERROR [AntiEntropyStage:1] 2016-06-16 07:10:29,237 
RepairMessageVerbHandler.java:108 - Cannot start multiple repair sessions over 
the same sstables
{noformat}

Before that, I had also tried a full repair, and the problem appears to be the 
same for full and incremental repairs. 
Since I can reproduce the issue, I would be glad to provide more logs or do 
some experimenting if that would help resolve it. 
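For reference, the reproduction sequence can be sketched as a shell script. The host names ({{node1}}..{{node6}}) and the ssh/systemctl restart mechanism are hypothetical placeholders for however the cluster is actually managed; {{DRY_RUN=1}} (the default) only prints each command instead of executing it:

```shell
#!/bin/sh
# Sketch of the reproduction sequence described above. Host names and the
# ssh/systemctl restart mechanism are hypothetical placeholders;
# DRY_RUN=1 (the default) only prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Rolling restart of all 6 nodes, after which an incremental repair
#    started on a single node completes cleanly.
for host in node1 node2 node3 node4 node5 node6; do
  run ssh "$host" sudo systemctl restart cassandra
done
run nodetool repair --partitioner-range -- mykeyspace

# 2. Restart only 4 of the 6 nodes, leaving 2 running.
for host in node1 node2 node3 node4; do
  run ssh "$host" sudo systemctl restart cassandra
done

# 3. A repair started from one of the restarted nodes now fails; the two
#    nodes that were not restarted log "Cannot start multiple repair
#    sessions over the same sstables".
run nodetool repair --partitioner-range --sequential -- mykeyspace
```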



> Repair session exception Validation failed
> ------------------------------------------
>
>                 Key: CASSANDRA-10389
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10389
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 8, Java 1.8.0_60, Cassandra 2.2.1 (datastax 
> compilation)
>            Reporter: Jędrzej Sieracki
>             Fix For: 2.2.x
>
>
> I'm running a repair on a ring of nodes that was recently extended from 3 to 
> 13 nodes. The extension was done two days ago, and the repair was attempted 
> yesterday.
> {quote}
> [2015-09-22 11:55:55,266] Starting repair command #9, repairing keyspace 
> perspectiv with repair options (parallelism: parallel, primary range: false, 
> incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], 
> hosts: [], # of ranges: 517)
> [2015-09-22 11:55:58,043] Repair session 1f7c50c0-6110-11e5-b992-9f13fa8664c8 
> for range (-5927186132136652665,-5917344746039874798] failed with error 
> [repair #1f7c50c0-6110-11e5-b992-9f13fa8664c8 on 
> perspectiv/stock_increment_agg, (-5927186132136652665,-5917344746039874798]] 
> Validation failed in cblade1.XXX/XXX (progress: 0%)
> {quote}
> BTW, I am ignoring the LEAK errors for now; they are outside the scope of 
> the main issue:
> {quote}
> ERROR [Reference-Reaper:1] 2015-09-22 11:58:27,843 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@4d25ad8f) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@896826067:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-73-big
>  was not released before the reference was garbage collected
> {quote}
> I scrubbed the sstable that failed validation on cblade1 with nodetool scrub 
> perspectiv stock_increment_agg:
> {quote}
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:31,615 OutputHandler.java:42 
> - Scrubbing 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
>  (345466609 bytes)
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:31,615 OutputHandler.java:42 
> - Scrubbing 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
>  (60496378 bytes)
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@4ca8951e) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@114161559:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-48-big
>  was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@eeb6383) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1612685364:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big
>  was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@1de90543) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@2058626950:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-49-big
>  was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@15616385) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1386628428:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-47-big
>  was not released before the reference was garbage collected
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:35,098 OutputHandler.java:42 
> - Scrub of 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
>  complete: 51397 rows in new sstable and 0 empty (tombstoned) rows dropped
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:47,605 OutputHandler.java:42 
> - Scrub of 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
>  complete: 292600 rows in new sstable and 0 empty (tombstoned) rows dropped
> {quote}
> Now, after scrubbing, another repair was attempted; it finished, but with 
> many errors from other nodes:
> {quote}
> [2015-09-22 12:01:18,020] Repair session db476b51-6110-11e5-b992-9f13fa8664c8 
> for range (5019296454787813261,5021512586040808168] failed with error [repair 
> #db476b51-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, 
> (5019296454787813261,5021512586040808168]] Validation failed in /10.YYY 
> (progress: 91%)
> [2015-09-22 12:01:18,079] Repair session db482ea1-6110-11e5-b992-9f13fa8664c8 
> for range (-3660233266780784242,-3638577078894365342] failed with error 
> [repair #db482ea1-6110-11e5-b992-9f13fa8664c8 on 
> perspectiv/stock_increment_agg, (-3660233266780784242,-3638577078894365342]] 
> Validation failed in /10.XXX (progress: 92%)
> [2015-09-22 12:01:18,276] Repair session db4a0361-6110-11e5-b992-9f13fa8664c8 
> for range (9158857758535272856,9167427882441871745] failed with error [repair 
> #db4a0361-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, 
> (9158857758535272856,9167427882441871745]] Validation failed in /10.YYY 
> (progress: 95%)
> {quote}
> After scrubbing stock_increment_agg on all nodes, just to be sure, the repair 
> still failed, this time with the following exception:
> {quote}
> INFO  [Repair#16:50] 2015-09-22 12:08:47,471 RepairJob.java:181 - [repair 
> #ea123bf3-6111-11e5-b992-9f13fa8664c8] Requesting merkle trees for 
> stock_increment_agg (to [/10.60.77.202, cblade1.XXX/XXX])
> ERROR [RepairJobTask:1] 2015-09-22 12:08:47,471 RepairSession.java:290 - 
> [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8] Session completed with the 
> following error
> org.apache.cassandra.exceptions.RepairException: [repair 
> #ea123bf0-6111-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, 
> (355657753119264326,366309649129068298]] Validation failed in cblade1.
>         at 
> org.apache.cassandra.repair.ValidationTask.treeReceived(ValidationTask.java:64)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:399)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:158)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) 
> ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_60]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_60]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)