[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284346#comment-14284346 ]
Alan Boudreault commented on CASSANDRA-8316:
[~krummas] [~yukim], is it planned to include this one in 2.1.3 release?

Did not get positive replies from all endpoints error on incremental repair
--
Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, 8316-v2.patch, 8316-v3.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh

Hi,
I've got an issue with incremental repairs on our production 15 nodes 2.1.2 (new cluster, not yet loaded, RF=3)
After having successfully performed an incremental repair (-par -inc) on 3 nodes, I started receiving Repair failed with error Did not get positive replies from all endpoints. from nodetool on all remaining nodes :
[2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges for keyspace (seq=false, full=false)
[2014-11-14 09:12:47,919] Repair failed with error Did not get positive replies from all endpoints.
All the nodes are up and running and the local system log shows that the repair commands got started and that's it.
I've also noticed that soon after the repair, several nodes started having more cpu load indefinitely without any particular reason (no tasks / queries, nothing in the logs). I then restarted C* on these nodes and retried the repair on several nodes, which were successful until facing the issue again.
I tried to repro on our 3 nodes preproduction cluster without success
It looks like I'm not the only one having this issue: http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
Any idea?
Thanks
Loic
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284349#comment-14284349 ]
Marcus Eriksson commented on CASSANDRA-8316:
oops, yes, will get it committed tomorrow
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273795#comment-14273795 ]
Yuki Morishita commented on CASSANDRA-8316:
LGTM. +1.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267561#comment-14267561 ]
Marcus Eriksson commented on CASSANDRA-8316:
bq. B does not mark sstables as repaired for just receiving prepare message, doesn't it?
No, but it keeps the sstables in a set to make sure that we don't start multiple repairs including the same sstables. Repairing them twice would be pretty pointless, as the sstables will be gone after anticompaction and can't be marked.
I also think it is 'good enough' for now to let it fail and let users clean up manually (since incremental repairs are not the default in 2.1).
[~yukim] could you review the patch as well? Pushed rebased here: https://github.com/krummas/cassandra/commits/marcuse/8316
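The guard described in this comment (a replica remembers which sstables belong to an in-flight repair and refuses a second repair that overlaps them) can be sketched roughly as below. This is a simplified Python model for illustration only, not Cassandra's Java implementation; the names `RepairGuard`, `prepare`, `finish` and `AlreadyRepairingError` are hypothetical.

```python
class AlreadyRepairingError(Exception):
    """Raised when a requested sstable already belongs to an active session."""


class RepairGuard:
    """Hypothetical per-replica bookkeeping of sstables under repair."""

    def __init__(self):
        # session id -> frozenset of sstable names claimed by that session
        self._sessions = {}

    def prepare(self, session_id, sstables):
        wanted = frozenset(sstables)
        for claimed in self._sessions.values():
            overlap = wanted & claimed
            if overlap:
                # Refuse the whole request if even one sstable is already claimed.
                raise AlreadyRepairingError(
                    "Already repairing %s, can not continue." % sorted(overlap)[0]
                )
        self._sessions[session_id] = wanted

    def finish(self, session_id):
        # Release the claim; unknown session ids are ignored.
        self._sessions.pop(session_id, None)
```

In this model, a session that is prepared but never finished keeps its claim forever, which is the stuck state discussed in this thread: every later `prepare` over the same sstables is refused until someone calls `finish` for the stale session.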
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266614#comment-14266614 ]
Yuki Morishita commented on CASSANDRA-8316:
bq. 4. B finishes preparing and marks a bunch of sstables as being repaired
B does not mark sstables as repaired for just receiving prepare message, doesn't it?
I understand that the current issue is that a prepared repair session is left on the replica nodes when preparing times out on the coordinator. (In that case, the user can work around it by calling forceTerminateRepairSession manually.)
I prefer sending a cancel message, though adding a new message may be difficult in a minor release. We also have to make sure the message won't get dropped, since AntiEntropyStage may still be busy preparing when the cancel message arrives.
Alternatively, I think the right solution for automatically removing leftover sessions is to track repair status as we do in CASSANDRA-5839 and use that to determine which prepared sessions can be removed.
Either way, I think we can move this to be resolved in 3.0, unless I have missed the severity of the issue.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251628#comment-14251628 ]
Marcus Eriksson commented on CASSANDRA-8316:
To summarize this:
1. we had a bug in compaction marking that could make a node end up in an infinite loop, fixed in branch linked above
2. we allowed multiple repairs over the same sstables, fixed
3. we had a situation where we didn't remove the parent repair sessions, fixed
And, to describe the final problem:
# Node A sends a PrepareMessage to overloaded Node B
# B starts preparing
# A times out waiting for B to prepare
# B finishes preparing and marks a bunch of sstables as being repaired
# User retries the repair on node A
# B gets the new PrepareMessage but sees that the sstables it wants to repair are already being repaired, and refuses to start
One solution could be to have A send out a cancel message, another solution could be to have B remove any parent repair sessions after 5 (or something) minutes if it hasn't received a validation message before that.
Need [~yukim] input.
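The second option floated in this comment (the replica drops a parent repair session if no validation message arrives within about five minutes) might look something like the sketch below. It is a toy Python model under stated assumptions, not the patch attached to the ticket: the class `ParentRepairSessions`, its method names, and the injectable `clock` parameter are all hypothetical, and the 300-second default simply mirrors the "5 (or something) minutes" mentioned above.

```python
import time


class ParentRepairSessions:
    """Hypothetical replica-side registry of prepared parent repair sessions."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        # session id -> [time the prepare was handled, validation seen?]
        self._sessions = {}

    def on_prepare(self, session_id):
        self._sessions[session_id] = [self._clock(), False]

    def on_validation(self, session_id):
        # A validation message proves the coordinator is still driving the repair.
        if session_id in self._sessions:
            self._sessions[session_id][1] = True

    def expire_stale(self):
        """Remove sessions that were prepared but never validated within the TTL."""
        now = self._clock()
        stale = [
            sid
            for sid, (prepared_at, validated) in self._sessions.items()
            if not validated and now - prepared_at >= self._ttl
        ]
        for sid in stale:
            del self._sessions[sid]
        return stale

    def active(self):
        return set(self._sessions)
```

A periodic task calling `expire_stale()` would cover the case where the coordinator timed out and gave up, at the cost of picking a timeout that is long enough for a slow but healthy prepare phase. The cancel-message alternative avoids that guess but needs a new message type.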
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249690#comment-14249690 ]
Marcus Eriksson commented on CASSANDRA-8316:
I think we are simply timing out the Prepare message when TRACE is enabled (I can't even start an 8-node cluster with TRACE on).
One solution could be to increase the timeout, but we use the same timeout for snapshot creation, and that would be just as likely to fail on a heavily loaded cluster. wdyt [~yukim]?
Also note that in your test you repair all ranges, meaning that when you repair node5, for example, you actually include node3,4,5,6,7, so you can't repair any of those at the same time.
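The point about node5 pulling in node3 through node7 follows from replica placement, and a small calculation makes it concrete. The sketch below assumes a simplified ring: one token per node, nodes numbered 1..n in ring order, and each range replicated on its owner plus the next RF-1 nodes clockwise. Those assumptions and the function names are illustrative only and are not taken from the ticket.

```python
def repair_participants(node, n_nodes, rf):
    """Nodes touched when `node` repairs every range it stores (1-based ring positions)."""
    involved = set()
    for back in range(rf):
        # `node` stores the ranges primarily owned by itself and its rf-1 predecessors.
        owner = (node - 1 - back) % n_nodes + 1
        for step in range(rf):
            # Each of those ranges is also replicated on the owner's rf-1 successors.
            involved.add((owner - 1 + step) % n_nodes + 1)
    return involved


def can_run_concurrently(a, b, n_nodes, rf):
    """True if full repairs started on nodes a and b share no participant."""
    return repair_participants(a, n_nodes, rf).isdisjoint(
        repair_participants(b, n_nodes, rf)
    )
```

Under these assumptions, `repair_participants(5, 15, 3)` is `{3, 4, 5, 6, 7}`, matching the example in the comment. On a 15-node ring with RF=3, repairs started on node5 and node10 would not overlap, while node5 and node7 would.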
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1424#comment-1424 ]
Alan Boudreault commented on CASSANDRA-8316:
[~krummas] I use my !test.sh! script, but with n=100. You can see that my 3 repairs are sent in parallel. The first 2 failed quickly with the *no positive replies* error, and the last one ran for a while and then failed. At this point the cluster cannot be repaired anymore.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246673#comment-14246673 ]
Marcus Eriksson commented on CASSANDRA-8316:
[~aboudreault] ok, I suspect there could be a race when you start repairs at exactly the same time over the same sstables.
Could you try upping the cluster to 9 nodes and avoiding starting repairs over the same sstables?
I'll check the above issue.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246704#comment-14246704 ]
Alan Boudreault commented on CASSANDRA-8316:
Note that I just tried to run my test without sending my 3 repair commands to the background, so the repairs are executed one after another. Same results.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246899#comment-14246899 ]
Alan Boudreault commented on CASSANDRA-8316:
[~krummas] Is it worth running the test you asked for earlier, given that we got the same result without pushing the repairs to the background? I suppose that if it happens with a single inc repair command too, something is wrong internally with inc repair, and it is not necessarily a race condition between different repair commands running on the same sstable.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246358#comment-14246358 ]
Marcus Eriksson commented on CASSANDRA-8316:
[~aboudreault] how do you reproduce this? The "Repair failed with error Already repairing" error is supposed to happen if you start several repairs over the same sstables.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244294#comment-14244294 ]
Alan Boudreault commented on CASSANDRA-8316:
Thanks [~krummas], I will run my tests today if possible, otherwise during the weekend and get back to you.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241147#comment-14241147 ]

Alan Boudreault commented on CASSANDRA-8316:

[~krummas] [~yukim] After some test runs, I can no longer reproduce the high CPU utilization issue. However, I still see the error message. I've also noticed an important difference in behavior with and without the patch.

WITHOUT the patch: I can re-run the incremental repairs. I might get the error message again on the node that initially failed, but things are OK once the endpoints that initially failed have been repaired.

WITH the patch: I cannot run incremental repairs anymore, even after a restart. This is what I get when trying to run the repairs on my nodes:

{code}
aboudreault@kovarro:~/dev/cstar/8316$ ccm node1 nodetool -- repair -par -inc
[2014-12-10 09:00:42,767] Starting repair command #1, repairing 3 ranges for keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-10 09:00:48,045] Repair session ee2a78c0-8074-11e4-9b59-bbfe19a8e904 for range (4611686018427387904,6917529027641081856] finished
[2014-12-10 09:00:48,046] Repair session ef77e050-8074-11e4-9b59-bbfe19a8e904 for range (2305843009213693952,4611686018427387904] finished
[2014-12-10 09:00:48,048] Repair session f06107d0-8074-11e4-9b59-bbfe19a8e904 for range (6917529027641081856,-9223372036854775808] finished
[2014-12-10 09:00:48,078] Repair command #1 finished
[2014-12-10 09:00:48,088] Nothing to repair for keyspace 'system'
[2014-12-10 09:00:48,104] Starting repair command #2, repairing 2 ranges for keyspace system_traces (parallelism=PARALLEL, full=false)
[2014-12-10 09:00:58,916] Repair failed with error Did not get positive replies from all endpoints. List of failed endpoint(s): [127.0.0.2]
aboudreault@kovarro:~/dev/cstar/8316$ ccm node2 nodetool -- repair -par -inc
[2014-12-10 09:01:07,233] Starting repair command #1, repairing 3 ranges for keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-10 09:01:07,239] Repair failed with error Already repairing SSTableReader(path='/home/aboudreault/.ccm/local/node2/data/r1/Standard1-c38dd6f0807111e494d8bbfe19a8e904/r1-Standard1-ka-5-Data.db'), can not continue.
[2014-12-10 09:01:07,247] Nothing to repair for keyspace 'system'
[2014-12-10 09:01:07,252] Starting repair command #2, repairing 2 ranges for keyspace system_traces (parallelism=PARALLEL, full=false)
[2014-12-10 09:01:07,254] Repair failed with error null
{code}

Does this help?

Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241153#comment-14241153 ]

Marcus Eriksson commented on CASSANDRA-8316:

Yep, will have a look. It seems that we don't clear out the repair session in this failure mode.

Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
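For readers following the thread, here is a minimal sketch of the kind of bug being described: state claimed for a repair that is not released when the repair fails, so the next attempt is rejected with an "Already repairing" error. This is not Cassandra code. The class `RepairSessionCleanupSketch`, the `repairing` set, and the `repair` method are hypothetical names chosen for illustration, and releasing the claim in a `finally` block is only one plausible way to handle it, not necessarily what the eventual patch does.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class RepairSessionCleanupSketch {
    // SSTables currently claimed by a repair (hypothetical bookkeeping, not Cassandra's).
    private final Set<String> repairing = ConcurrentHashMap.newKeySet();

    // Claims the SSTable, runs the repair work, and always releases the claim.
    boolean repair(String sstable, Runnable work) {
        if (!repairing.add(sstable)) {
            throw new IllegalStateException("Already repairing " + sstable + ", can not continue.");
        }
        try {
            work.run();
            return true;
        } catch (RuntimeException e) {
            return false; // the repair failed, e.g. a replica did not reply
        } finally {
            repairing.remove(sstable); // release the claim on success and on failure
        }
    }

    public static void main(String[] args) {
        RepairSessionCleanupSketch node = new RepairSessionCleanupSketch();
        boolean first = node.repair("r1-Standard1-ka-5", () -> { throw new RuntimeException("no reply"); });
        boolean second = node.repair("r1-Standard1-ka-5", () -> { });
        System.out.println("first=" + first + " second=" + second);
    }
}
```

Without the `finally` block, the failed first attempt would leave the SSTable in the set and the second attempt would throw instead of running.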
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241502#comment-14241502 ]

Alan Boudreault commented on CASSANDRA-8316:

Just pointing this out in case the ticket is related: CASSANDRA-8291

Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239473#comment-14239473 ]

Yuki Morishita commented on CASSANDRA-8316:

Can you get a YourKit snapshot for the patched 2.1? It may show something.

Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239522#comment-14239522 ]

Alan Boudreault commented on CASSANDRA-8316:

[~yukim] Yes, will do today during my bootcamp breaks. One thought: if the patch at least avoids the high CPU utilization on failure, it might be an acceptable fix. Will get back to you ASAP.

Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234581#comment-14234581 ]

Alan Boudreault commented on CASSANDRA-8316:

[~krummas] [~yukim] As mentioned on IRC, the patch doesn't seem to fix the issue in cassandra-2.1. In trunk (3.0), the following commit fixed the "Did not get positive replies" error: https://github.com/apache/cassandra/commit/06f626acd27b051222616c0c91f7dd8d556b8d45. However, that commit is already in the cassandra-2.1 branch, and there are many additional major repair-related changes in 3.0. Any suggestions at this point?

Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231530#comment-14231530 ]

Alan Boudreault commented on CASSANDRA-8316:

[~yukim] Thank you for taking a look at the snapshot. That's great if the fix is only a matter of thread dispatching! I'll wait for a patch from you (or [~krummas]?) and will re-run all of my tests. Let me know if I can do anything else to help.

Assignee: Alan Boudreault
Fix For: 2.1.3
Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229587#comment-14229587 ]

Loic Lambiel commented on CASSANDRA-8316:

Hi guys,

Any chance of getting this issue fixed for 2.1.3? On our side, we hit this issue on almost all incremental repairs.

Assignee: Alan Boudreault
Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230269#comment-14230269 ]

Alan Boudreault commented on CASSANDRA-8316:

This issue does indeed seem to be critical. I reproduced it on all incremental repairs while working on CASSANDRA-8366 and have seen very inconsistent results in storage size.

Assignee: Alan Boudreault
Fix For: 2.1.3
Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230299#comment-14230299 ]

Marcus Eriksson commented on CASSANDRA-8316:

The storage size issue is probably CASSANDRA-8386 and CASSANDRA-8267.

Assignee: Alan Boudreault
Fix For: 2.1.3
Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230776#comment-14230776 ]

Yuki Morishita commented on CASSANDRA-8316:

Looking at Alan's YourKit snapshot, I noticed that anti-compaction (actually, the markCompacting loop before anti-compaction) is long-running on AntiEntropyStage. This causes the prepare phase of a subsequent incremental repair to fail, because the single-threaded AntiEntropyStage is still occupied. We should just dispatch to another thread from AntiEntropyStage so that nothing blocks there.

Assignee: Alan Boudreault
Fix For: 2.1.3
Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
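The blocking behavior described in the comment above can be illustrated with plain `java.util.concurrent` executors. The following is only a sketch under stated assumptions, not Cassandra code: a single-threaded `ExecutorService` stands in for AntiEntropyStage, a sleeping task stands in for the long-running markCompacting/anti-compaction work, and `StageDispatchSketch`, `prepareAnswered`, and the `worker` pool are hypothetical names. It shows that a "prepare" request queued behind the long task times out, while handing the long task off to another thread lets the request be answered promptly.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class StageDispatchSketch {
    // Sleeps for the given time; stands in for long-running work.
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if a "prepare" task submitted after a long-running task
    // is answered by the single-threaded stage within timeoutMillis.
    public static boolean prepareAnswered(boolean offload, long longTaskMillis, long timeoutMillis) {
        ExecutorService stage = Executors.newSingleThreadExecutor();  // stand-in for the single-threaded stage
        ExecutorService worker = Executors.newSingleThreadExecutor(); // hypothetical separate pool
        try {
            Runnable longTask = () -> pause(longTaskMillis);
            if (offload) {
                // The stage only hands the work off, so its thread is free again immediately.
                stage.submit(() -> { worker.submit(longTask); });
            } else {
                // The long task occupies the stage's only thread.
                stage.submit(longTask);
            }
            Future<Boolean> prepare = stage.submit(() -> true);
            return prepare.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return false; // no positive reply in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            stage.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked stage answered: " + prepareAnswered(false, 1000, 200));
        System.out.println("with offload answered:  " + prepareAnswered(true, 1000, 200));
    }
}
```

The timeout in the sketch plays the role of the coordinator waiting for positive replies to its prepare message; the durations are arbitrary values picked for the demonstration.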
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212045#comment-14212045 ]

Marcus Eriksson commented on CASSANDRA-8316:

Any exceptions on any of the other nodes?
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212055#comment-14212055 ]

Loic Lambiel commented on CASSANDRA-8316:

Nope, I noticed nothing special on the other nodes (except for the load on a few nodes).
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212063#comment-14212063 ]

Marcus Eriksson commented on CASSANDRA-8316:

[~enigmacurry] Can your team reproduce this? Running with -inc and -par on a ~15 node cluster with vnodes.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212103#comment-14212103 ]

Loic Lambiel commented on CASSANDRA-8316:

I forgot to mention that I'm using LCS, just in case.

Assignee: Ryan McGuire