[jira] [Comment Edited] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1407#comment-1407 ]

Yuki Morishita edited comment on CASSANDRA-7560 at 7/24/14 4:18 PM:

From the jstack logs, it looks like the repair session on the coordinator node is waiting for validations (merkle trees), but none of the logs show ValidationExecutor running. By default, repair takes a snapshot before validating, so it is possible that snapshotting is taking longer on a replica node.

One possible hang point is a snapshot timeout: the coordinator waits for the snapshot response for rpc_timeout milliseconds, and after that the response handler can be removed. -This is addressed in CASSANDRA-6747, and fixed for 2.1.0.-

edit: actually, that does not solve the problem; we need to handle the timeouts described here.

You can try temporarily setting rpc_timeout longer and see if that solves the problem.

was (Author: yukim):
From the jstack logs, it looks like repair session on coordinator node is waiting for validations (merkle trees), but none of the logs show ValidationExecutor running. By default, repair takes snapshot before validating, so it is possible that snapshotting is taking longer on replica node.

One possible 'hang' point is snapshot time out. Coordinator waits snapshot response for rpc_timeout millisec, and after that, response handler can be removed. This is addressed in CASSANDRA-6747, and fixed for 2.1.0.

You can try temporarily set rpc_timeout longer and see if that solves the problem.

'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
----------------------------------------------------------------------

                 Key: CASSANDRA-7560
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Vladimir Avram
         Attachments: cassandra_daemon.log, cassandra_daemon_rep1.log, cassandra_daemon_rep2.log, nodetool_command.log

Running {{nodetool repair -pr}} will sometimes hang on one of the resulting AntiEntropySessions.

The system logs will show the repair command starting:
{noformat}
INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) Starting repair command #1, repairing 256 ranges for keyspace x
{noformat}

You can then see a few AntiEntropySessions completing with:
{noformat}
INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed successfully
{noformat}

Finally, we reach an AntiEntropySession that hangs just before requesting the merkle trees for the next column family in line for repair. We first see the previous CF finish, and then the whole repair session hangs with no visible progress or errors on this or any of the related nodes:
{noformat}
INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
{noformat}

Notes:
* Single-DC, 6-node cluster with an average load of 86 GB per node.
* This appears to be random; it does not always happen on the same CF or in the same session.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
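The hang mechanism Yuki describes above — the coordinator blocks waiting for a validation (merkle tree) response whose handler may already have been removed after rpc_timeout — can be sketched as follows. This is a minimal illustration, not Cassandra's actual repair code; the class and method names are hypothetical:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the hang: the coordinator blocks on a latch that is
// only counted down when the replica's validation response arrives. If the
// response handler was dropped on timeout, nothing ever counts the latch down,
// so an unbounded await() hangs the session forever.
public class RepairHangSketch {

    // Simulates waiting for a validation response that never arrives.
    // A bounded wait lets the session fail instead of hanging indefinitely,
    // which is the kind of timeout handling the comment says repair needs.
    static boolean awaitValidation(long timeoutMillis) {
        CountDownLatch responseReceived = new CountDownLatch(1);
        // In the buggy scenario, no one ever calls responseReceived.countDown().
        try {
            return responseReceived.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Returns false (timed out) after ~100 ms rather than blocking forever.
        boolean gotResponse = awaitValidation(100);
        System.out.println("validation response received: " + gotResponse);
    }
}
```

An unbounded {{responseReceived.await()}} in the same scenario would never return, which matches the indefinitely hanging AntiEntropySession reported here.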
[jira] [Comment Edited] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1407#comment-1407 ]

Yuki Morishita edited comment on CASSANDRA-7560 at 7/24/14 4:33 PM:

From the jstack logs, it looks like the repair session on the coordinator node is waiting for validations (merkle trees), but none of the logs show ValidationExecutor running. By default, repair takes a snapshot before validating, so it is possible that snapshotting is taking longer on a replica node.

One possible hang point is a snapshot timeout: the coordinator waits for the snapshot response for rpc_timeout milliseconds, and after that the response handler can be removed. -This is addressed in CASSANDRA-6747, and fixed for 2.1.0.-

-edit: actually, it is not solving the problem. we need to handle timeouts described here.-

edit 2: CASSANDRA-6747 handles the timeout as well, but the reason we put it in 2.1.0 is that it needed a protocol change. It is possible that we can backport only the timeout part.

You can try temporarily setting rpc_timeout longer and see if that solves the problem.

was (Author: yukim):
From the jstack logs, it looks like repair session on coordinator node is waiting for validations (merkle trees), but none of the logs show ValidationExecutor running. By default, repair takes snapshot before validating, so it is possible that snapshotting is taking longer on replica node.

One possible 'hang' point is snapshot time out. Coordinator waits snapshot response for rpc_timeout millisec, and after that, response handler can be removed. -This is addressed in CASSANDRA-6747, and fixed for 2.1.0.-

edit: actually, it is not solving the problem. we need to handle timeouts described here.

You can try temporarily set rpc_timeout longer and see if that solves the problem.
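The workaround suggested above — temporarily lengthening rpc_timeout — is set in cassandra.yaml. A hedged sketch follows; the exact setting name depends on the Cassandra version, and 60000 ms is an illustrative value, not a recommendation:

```yaml
# cassandra.yaml -- check your version's option names and defaults.

# Releases before 1.2 expose a single timeout:
# rpc_timeout_in_ms: 60000

# From 1.2 on, the timeouts were split; the general setting, which covers
# miscellaneous messages such as snapshot requests, is:
request_timeout_in_ms: 60000
```

A rolling restart is needed for the change to take effect, and the value should be restored once the repair problem is diagnosed.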
[jira] [Comment Edited] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067338#comment-14067338 ]

Vladimir Avram edited comment on CASSANDRA-7560 at 7/19/14 1:56 AM:

There is also a stalled AntiEntropySession on rep1.

was (Author: vladmore):
There is also a stalled AntiEntropySession on this node.