[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2015-01-20 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284346#comment-14284346
 ] 

Alan Boudreault commented on CASSANDRA-8316:


[~krummas] [~yukim], is it planned to include this one in 2.1.3 release?

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 8316-v3.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2015-01-20 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284349#comment-14284349
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


oops, yes, will get it committed tomorrow

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 8316-v3.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2015-01-12 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273795#comment-14273795
 ] 

Yuki Morishita commented on CASSANDRA-8316:
---

LGTM. +1.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 8316-v3.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2015-01-07 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267561#comment-14267561
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


bq. B does not mark sstables as repaired for just receiving prepare message, 
doesn't it?

no - but it keeps the sstables in a set to make sure that we don't start 
multiple repairs including the same sstables - this would be pretty pointless 
as the sstables will be gone after anticompaction and can't be marked

I also think it is 'good enough' for now to let it fail and let users clean up 
manually (since incremental repairs are not default in 2.1)

[~yukim] could you review the patch as well? Pushed rebased here: 
https://github.com/krummas/cassandra/commits/marcuse/8316

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2015-01-06 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266614#comment-14266614
 ] 

Yuki Morishita commented on CASSANDRA-8316:
---

bq. 4. B finishes preparing and marks a bunch of sstables as being repaired

B does not mark sstables as repaired for just receiving prepare message, 
doesn't it?

I understand that the current issue we have is prepared repair session is left 
on replica nodes when preparing timed out on coordinator.
(In that case, user can work around by doing forceTerminateRepairSession 
manually.)

I prefer sending cancel message, though adding new message may be difficult in 
minor release. Also we have to make sure message won't get dropped since 
AntiEntropyStage may be still busy preparing when cancel message arrives.

Alternatively, I think the right solution to automatically remove left sessions 
is to track repair status as we do in CASSANDRA-5839 and use that to determine 
which prepared session can be removed.

Either way, I think we can move this to resolve in 3.0 if I didn't miss the 
severity of the issue.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-18 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251628#comment-14251628
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


To summarize this;
1. we had a bug in compaction marking that could make a node end up in an 
infinite loop, fixed in branch linked above
2. we allowed multiple repairs over the same sstables, fixed
3. we had a situation where we didn't remove the parent repair sessions, fixed

And, to describe the final problem:

# Node A sends a PrepareMessage to overloaded Node B
# B starts preparing
# A times out waiting for B to prepare
# B finishes preparing and marks a bunch of sstables as being repaired
# User retries the repair on node A
# B gets the new PrepareMessage but sees that the sstables it wants to repair 
are already being repaired, and refuses to start

One solution could be to have A send out a cancel message, another solution 
could be to have B remove any parent repair sessions after 5 (or something) 
minutes if it hasn't received a validation message before that. Need [~yukim] 
input.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-17 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249690#comment-14249690
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


I think we are simply timing out the Prepare message when TRACE is enabled (I 
can't even start a 8 node cluster with TRACE on)

One solution could be to increase the timeout, but we use the same timeout for 
snapshot creation and that would be just as likely to fail on a heavily loaded 
cluster, wdyt [~yukim]?

Also, note, that in your test you repair all ranges, meaning, when you repair 
node5 for example, you actually include node3,4,5,6,7, so you can't repair any 
of those at the same time


  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-15 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1424#comment-1424
 ] 

Alan Boudreault commented on CASSANDRA-8316:


[~krummas] I use my !test.sh! script, but with n=100. You can see that my 3 
repairs are sent in parallel. The first 2 failed quickly with *no positive 
replies* error and the last one run for while then failed. At this point the 
cluster cannot be repaired anymore.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-15 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246673#comment-14246673
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


[~aboudreault] ok, I suspect there could be a race when you start repairs at 
exactly the same time over the same sstables

Could you try upping the cluster to 9 nodes and avoiding starting repairs over 
the same sstables? I'll check the above issue

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-15 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246704#comment-14246704
 ] 

Alan Boudreault commented on CASSANDRA-8316:


Note that I just tried to run my test without sending my 3 repair commands in 
background, so the repair are executed one after one, Same results.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-15 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246899#comment-14246899
 ] 

Alan Boudreault commented on CASSANDRA-8316:


[~krummas] Is it worth running the test you ask earlier, if we got the same 
result without pushing repairs as background task? I suppose that if it happens 
with a single inc repair command too... it's something wrong internally with 
inc repair, and not necessarily a race condition of running different repair 
commands on the same sstable.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-14 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246358#comment-14246358
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


[~aboudreault] how do you reproduce this?

The Repair failed with error Already repairing-error is supposed to happen if 
you start several repairs over the same sstables

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, 
 CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-12 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244294#comment-14244294
 ] 

Alan Boudreault commented on CASSANDRA-8316:


Thanks [~krummas], I will run my tests today if possible, otherwise during the 
weekend and get back to you.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 8316-v2.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-10 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241147#comment-14241147
 ] 

Alan Boudreault commented on CASSANDRA-8316:


[~krummas] [~yukim] With some test runs, I cannot see the high CPU utilization 
issue again. However, I still see the error message. Also I've noticed an 
important changes between with and without the patch. 

WITHOUT the patch: I can re-run the increment repairs. I might get again the 
error message on the node that initially failed, but things will get OK after 
the initial endpoints that failed are repaired.

WITH the patch: I cannot do an incremental repairs anymore, even after a 
restart. This is what I get trying to run the repairs on my node:

{code}
aboudreault@kovarro:~/dev/cstar/8316$ ccm node1 nodetool -- repair -par -inc
[2014-12-10 09:00:42,767] Starting repair command #1, repairing 3 ranges for 
keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-10 09:00:48,045] Repair session ee2a78c0-8074-11e4-9b59-bbfe19a8e904 
for range (4611686018427387904,6917529027641081856] finished
[2014-12-10 09:00:48,046] Repair session ef77e050-8074-11e4-9b59-bbfe19a8e904 
for range (2305843009213693952,4611686018427387904] finished
[2014-12-10 09:00:48,048] Repair session f06107d0-8074-11e4-9b59-bbfe19a8e904 
for range (6917529027641081856,-9223372036854775808] finished
[2014-12-10 09:00:48,078] Repair command #1 finished
[2014-12-10 09:00:48,088] Nothing to repair for keyspace 'system'
[2014-12-10 09:00:48,104] Starting repair command #2, repairing 2 ranges for 
keyspace system_traces (parallelism=PARALLEL, full=false)
[2014-12-10 09:00:58,916] Repair failed with error Did not get positive replies 
from all endpoints. List of failed endpoint(s): [127.0.0.2]
aboudreault@kovarro:~/dev/cstar/8316$ ccm node2 nodetool -- repair -par -inc
[2014-12-10 09:01:07,233] Starting repair command #1, repairing 3 ranges for 
keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-10 09:01:07,239] Repair failed with error Already repairing 
SSTableReader(path='/home/aboudreault/.ccm/local/node2/data/r1/Standard1-c38dd6f0807111e494d8bbfe19a8e904/r1-Standard1-ka-5-Data.db'),
 can not continue.
[2014-12-10 09:01:07,247] Nothing to repair for keyspace 'system'
[2014-12-10 09:01:07,252] Starting repair command #2, repairing 2 ranges for 
keyspace system_traces (parallelism=PARALLEL, full=false)
[2014-12-10 09:01:07,254] Repair failed with error null
{code}

Does this help?


  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-10 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241153#comment-14241153
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


yep, will have a look, seems that we don't clear out the repair session on this 
failure mode

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-10 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241502#comment-14241502
 ] 

Alan Boudreault commented on CASSANDRA-8316:


Just pointing in case this ticket is related: CASSANDRA-8291

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-09 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239473#comment-14239473
 ] 

Yuki Morishita commented on CASSANDRA-8316:
---

Can you get yourkit snapshot for the patced 2.1?
It may show something.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-09 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239522#comment-14239522
 ] 

Alan Boudreault commented on CASSANDRA-8316:


[~yukim] Yes, will do today during my bootcamp breaks. Something that I'm 
thinking is - if the patch at least avoid the high CPU utilization on failure, 
it might bean acceptable fix. Will get back to you asap

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-04 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234581#comment-14234581
 ] 

Alan Boudreault commented on CASSANDRA-8316:


[~krummas] [~yukim] As mentionned on IRC, the patch dont' to fix the issue in 
cassandra-2.1. In trunk (3.0), the following commit fixed the did not get 
positive error:

https://github.com/apache/cassandra/commit/06f626acd27b051222616c0c91f7dd8d556b8d45

but this one is already in branch cassandra-2.1 and there are many additional 
major changes related to repair in 3.0. 

Any suggestion at this point ?

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
 Fix For: 2.1.3

 Attachments: 0001-patch.patch, 
 CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-02 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231530#comment-14231530
 ] 

Alan Boudreault commented on CASSANDRA-8316:


[~yukim] Thank you for taking a look at the snapshot. That's great if the patch 
is only a thread dispatching! I'll wait a patch from you, (or [~krummas] ?) And 
will re-run my entire tests. Let me know if I can do anything else to help.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Alan Boudreault
 Fix For: 2.1.3

 Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-01 Thread Loic Lambiel (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229587#comment-14229587
 ] 

Loic Lambiel commented on CASSANDRA-8316:
-

Hi guys,

Any chance to get this issue fixed for 2.1.3 ? On our side we face this issue 
on almost all incremental repairs

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Alan Boudreault
 Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-01 Thread Alan Boudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230269#comment-14230269
 ] 

Alan Boudreault commented on CASSANDRA-8316:


This issue seems to be effectively critical. I reproduced the issues on all inc 
repairs while working on CASSANDRA-8366 and have seen very inconsistent results 
in storage size.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Alan Boudreault
 Fix For: 2.1.3

 Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-01 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230299#comment-14230299
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


Storage size issue is probably CASSANDRA-8386 and CASSANDRA-8267

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Alan Boudreault
 Fix For: 2.1.3

 Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-01 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230776#comment-14230776
 ] 

Yuki Morishita commented on CASSANDRA-8316:
---

Looking at Alan's yourkit snapshot, I noticed that anti compaction (actually, 
markCompacting loop before anti-compaction) is long-running on AntiEntropyStage.
This causes prepare phase from subsequent incremental repair to fail because 
single-threaded AntiEntropyStage is occupied still.

We should just dispatch to another thread from AntiEntropyStage so that nothing 
will block there.

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Alan Boudreault
 Fix For: 2.1.3

 Attachments: CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh


 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-11-14 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212045#comment-14212045
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


any exceptions on any other nodes?

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel

 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-11-14 Thread Loic Lambiel (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212055#comment-14212055
 ] 

Loic Lambiel commented on CASSANDRA-8316:
-

Nope, nothing special noticed on other nodes (except load on few nodes)

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel

 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-11-14 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212063#comment-14212063
 ] 

Marcus Eriksson commented on CASSANDRA-8316:


[~enigmacurry] can your team reproduce? running with -inc and -par on a ~15 
node cluster with vnodes

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel

 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-11-14 Thread Loic Lambiel (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212103#comment-14212103
 ] 

Loic Lambiel commented on CASSANDRA-8316:
-

I forgot to mention that I'm using LCS, in case of

  Did not get positive replies from all endpoints error on incremental repair
 --

 Key: CASSANDRA-8316
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Ryan McGuire

 Hi,
 I've got an issue with incremental repairs on our production 15 nodes 2.1.2 
 (new cluster, not yet loaded, RF=3)
 After having successfully performed an incremental repair (-par -inc) on 3 
 nodes, I started receiving Repair failed with error Did not get positive 
 replies from all endpoints. from nodetool on all remaining nodes :
 [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges 
 for keyspace  (seq=false, full=false)
 [2014-11-14 09:12:47,919] Repair failed with error Did not get positive 
 replies from all endpoints.
 All the nodes are up and running and the local system log shows that the 
 repair commands got started and that's it.
 I've also noticed that soon after the repair, several nodes started having 
 more cpu load indefinitely without any particular reason (no tasks / queries, 
 nothing in the logs). I then restarted C* on these nodes and retried the 
 repair on several nodes, which were successful until facing the issue again.
 I tried to repro on our 3 nodes preproduction cluster without success
 It looks like I'm not the only one having this issue: 
 http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
 Any idea?
 Thanks
 Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)