[ https://issues.apache.org/jira/browse/CASSANDRA-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333412#comment-15333412 ]
Heiko Sommer edited comment on CASSANDRA-10389 at 6/16/16 9:09 AM:
-------------------------------------------------------------------

I'm getting the same problem with Cassandra 2.2.5 on a cluster of 6 nodes with RF=2. As a workaround I must restart all nodes before running a repair. I am certainly not starting multiple repairs simultaneously.

Here is what happened the last time I tried it: The previous incremental repair ({{nodetool repair --partitioner-range -- mykeyspace}}), started on a single node after a rolling cluster restart, finished cleanly, with the expected number of "Session completed successfully" log messages. There were no repair or anticompaction tasks still running; the cluster was stable. I then restarted C* on 4 nodes, but left it running on the other 2. On one of the restarted nodes I ran an incremental repair again, this time also with the {{--sequential}} option.

On the repairing node I get failure logs such as

{noformat}
java.lang.RuntimeException: Could not create snapshot at /10.195.62.171
	at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:79) ~[apache-cassandra-2.2.5.jar:2.2.5]
ERROR [Repair#1:16] 2016-06-16 07:10:29,239 CassandraDaemon.java:185 - Exception in thread Thread[Repair#1:16,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Could not create snapshot at /10.195.62.171
	at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1387) ~[guava-16.0.jar:na]
{noformat}

while on the failing target nodes (those that were not restarted before the repair) I get logs such as

{noformat}
ERROR [AntiEntropyStage:1] 2016-06-16 07:10:29,237 RepairMessageVerbHandler.java:108 - Cannot start multiple repair sessions over the same sstables
{noformat}

Before that I also tried a full repair, and got the impression that the problem is the same for full and incremental repairs.
As I can reproduce the issue, I would be glad to provide you with more logs or some experimenting if that would help resolve the issue.
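For reference, the rolling-restart workaround described above can be sketched as a shell sequence. This is only an illustration of the steps, not a tested script: the node list, the {{cassandra}} service name, the SSH access, and the {{mykeyspace}} keyspace are placeholders that must be adapted to the actual cluster.

```shell
#!/bin/sh
# Sketch of the workaround: restart every node in turn, wait for the
# cluster to settle, then run a single repair from one node only.
# NODES, the service name, and the keyspace name are assumptions.
NODES="node1 node2 node3 node4 node5 node6"

for n in $NODES; do
    # Flush memtables and stop accepting traffic before restarting.
    ssh "$n" nodetool drain
    ssh "$n" sudo service cassandra restart
    # Coarse readiness check: wait until status output again lists
    # Up/Normal ("UN") nodes before moving on to the next restart.
    until ssh "$n" nodetool status | grep -q '^UN'; do
        sleep 10
    done
done

# Run the repair from a single node, never several nodes in parallel.
ssh node1 nodetool repair --partitioner-range -- mykeyspace
```

Only after such a full rolling restart does the repair complete without the "Cannot start multiple repair sessions over the same sstables" errors on the remote nodes.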
> Repair session exception Validation failed
> ------------------------------------------
>
>                 Key: CASSANDRA-10389
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10389
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 8, Java 1.8.0_60, Cassandra 2.2.1 (datastax compilation)
>            Reporter: Jędrzej Sieracki
>             Fix For: 2.2.x
>
> I'm running a repair on a ring of nodes that was recently extended from 3 to 13 nodes. The extension was done two days ago; the repair was attempted yesterday.
> {quote}
> [2015-09-22 11:55:55,266] Starting repair command #9, repairing keyspace perspectiv with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 517)
> [2015-09-22 11:55:58,043] Repair session 1f7c50c0-6110-11e5-b992-9f13fa8664c8 for range (-5927186132136652665,-5917344746039874798] failed with error [repair #1f7c50c0-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (-5927186132136652665,-5917344746039874798]] Validation failed in cblade1.XXX/XXX (progress: 0%)
> {quote}
> BTW, I am ignoring the LEAK errors for now; they are outside the scope of the main issue:
> {quote}
> ERROR [Reference-Reaper:1] 2015-09-22 11:58:27,843 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@4d25ad8f) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@896826067:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-73-big was not released before the reference was garbage collected
> {quote}
> I scrubbed the sstable with the failed validation on cblade1 with nodetool scrub perspectiv stock_increment_agg:
> {quote}
> INFO [CompactionExecutor:1704] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db') (345466609 bytes)
> INFO [CompactionExecutor:1703] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db') (60496378 bytes)
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@4ca8951e) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@114161559:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-48-big was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@eeb6383) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1612685364:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@1de90543) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@2058626950:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-49-big was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@15616385) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1386628428:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-47-big was not released before the reference was garbage collected
> INFO [CompactionExecutor:1703] 2015-09-22 12:05:35,098 OutputHandler.java:42 - Scrub of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db') complete: 51397 rows in new sstable and 0 empty (tombstoned) rows dropped
> INFO [CompactionExecutor:1704] 2015-09-22 12:05:47,605 OutputHandler.java:42 - Scrub of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db') complete: 292600 rows in new sstable and 0 empty (tombstoned) rows dropped
> {quote}
> Now, after scrubbing, another repair was attempted; it did finish, but with lots of errors from other nodes:
> {quote}
> [2015-09-22 12:01:18,020] Repair session db476b51-6110-11e5-b992-9f13fa8664c8 for range (5019296454787813261,5021512586040808168] failed with error [repair #db476b51-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (5019296454787813261,5021512586040808168]] Validation failed in /10.YYY (progress: 91%)
> [2015-09-22 12:01:18,079] Repair session db482ea1-6110-11e5-b992-9f13fa8664c8 for range (-3660233266780784242,-3638577078894365342] failed with error [repair #db482ea1-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (-3660233266780784242,-3638577078894365342]] Validation failed in /10.XXX (progress: 92%)
> [2015-09-22 12:01:18,276] Repair session db4a0361-6110-11e5-b992-9f13fa8664c8 for range (9158857758535272856,9167427882441871745] failed with error [repair #db4a0361-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (9158857758535272856,9167427882441871745]] Validation failed in /10.YYY (progress: 95%)
> {quote}
> After scrubbing stock_increment_agg on all nodes, just to be sure, the repair still failed, this time with the following exception:
> {quote}
> INFO [Repair#16:50] 2015-09-22 12:08:47,471 RepairJob.java:181 - [repair #ea123bf3-6111-11e5-b992-9f13fa8664c8] Requesting merkle trees for stock_increment_agg (to [/10.60.77.202, cblade1.XXX/XXX])
> ERROR [RepairJobTask:1] 2015-09-22 12:08:47,471 RepairSession.java:290 - [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8] Session completed with the following error
> org.apache.cassandra.exceptions.RepairException: [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, (355657753119264326,366309649129068298]] Validation failed in cblade1.
> 	at org.apache.cassandra.repair.ValidationTask.treeReceived(ValidationTask.java:64) ~[apache-cassandra-2.2.1.jar:2.2.1]
> 	at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[apache-cassandra-2.2.1.jar:2.2.1]
> 	at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:399) ~[apache-cassandra-2.2.1.jar:2.2.1]
> 	at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:158) ~[apache-cassandra-2.2.1.jar:2.2.1]
> 	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-2.2.1.jar:2.2.1]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)