[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2013-03-04 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593176#comment-13593176 ]

Hudson commented on HBASE-6060:
---

Integrated in HBase-0.94-security-on-Hadoop-23 #12 (See 
[https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/12/])
HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey 
and Rajeshbabu) (Revision 1445350)

 Result = FAILURE
tedyu : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java


 Regions's in OPENING state from failed regionservers takes a long time to 
 recover
 -

 Key: HBASE-6060
 URL: https://issues.apache.org/jira/browse/HBASE-6060
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Reporter: Enis Soztutar
Assignee: Jimmy Xiang
 Fix For: 0.95.0

 Attachments: 6060-94-v3.patch, 6060-94-v4_1.patch, 
 6060-94-v4_1.patch, 6060-94-v4.patch, 6060_alternative_suggestion.txt, 
 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 
 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 6060-trunk_2.patch, 
 6060-trunk_3.patch, 6060-trunk.patch, 6060-trunk.patch, HBASE-6060-92.patch, 
 HBASE-6060-94.patch, HBASE-6060_latest.patch, HBASE-6060_latest.patch, 
 HBASE-6060_latest.patch, HBASE-6060-trunk_4.patch, HBASE-6060_trunk_5.patch, 
 trunk-6060.patch, trunk-6060_v2.patch, trunk-6060_v3.3.patch


 We have seen a pattern in tests where regions are stuck in the OPENING state 
 for a very long time when the region server that is opening the region fails. 
 My understanding of the process: 
  
  - The master calls the RS to open the region. If the RS is offline, a new plan 
 is generated (a new RS is chosen). RegionState is set to PENDING_OPEN (only in 
 master memory; zk still shows OFFLINE). See HRegionServer.openRegion(), 
 HMaster.assign().
  - The RegionServer starts opening the region and changes the state in the 
 znode. But that znode is not ephemeral (see ZkAssign).
  - The RS transitions the zk node from OFFLINE to OPENING. See 
 OpenRegionHandler.process().
  - The RS then opens the region and changes the znode from OPENING to OPENED.
  - When the RS is killed between the OPENING and OPENED states, zk shows the 
 OPENING state, and the master just waits for the RS to change the region state; 
 but since the RS is down, that won't happen. 
  - There is an AssignmentManager.TimeoutMonitor, which guards exactly against 
 these kinds of conditions. It periodically checks (every 10 sec by default) the 
 regions in transition to see whether they have timed out 
 (hbase.master.assignment.timeoutmonitor.timeout). The default timeout is 30 min, 
 which explains what you and I are seeing. 
  - ServerShutdownHandler in the master does not reassign regions in OPENING 
 state, although it handles other states. 
 Lowering that threshold in the configuration is one option, but I still think 
 we can do better. 
 Will investigate more. 
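
As a minimal illustration of the configuration workaround mentioned above (not the eventual fix), the sketch below lowers the TimeoutMonitor threshold programmatically. The 60000 ms value and the class name are hypothetical examples; in practice this key would normally be set in hbase-site.xml on the master.

{code}
// Minimal sketch, assuming an HBase 0.9x-era client classpath: lower the
// assignment timeout so the TimeoutMonitor retries stuck OPENING regions sooner
// than the 30-minute default described above. The 60000 ms value is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class AssignmentTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 60000);
    // Default shown for comparison: 1800000 ms (30 minutes).
    System.out.println("timeout = "
        + conf.getInt("hbase.master.assignment.timeoutmonitor.timeout", 1800000)
        + " ms");
  }
}
{code}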

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2013-02-12 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577006#comment-13577006 ]

Hudson commented on HBASE-6060:
---

Integrated in HBase-0.94 #839 (See 
[https://builds.apache.org/job/HBase-0.94/839/])
HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey 
and Rajeshbabu) (Revision 1445350)

 Result = FAILURE
tedyu : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2013-02-12 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577168#comment-13577168 ]

Hudson commented on HBASE-6060:
---

Integrated in HBase-0.94-security #110 (See 
[https://builds.apache.org/job/HBase-0.94-security/110/])
HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey 
and Rajeshbabu) (Revision 1445350)

 Result = SUCCESS
tedyu : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-11-30 Thread Sergey Shelukhin (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507760#comment-13507760 ]

Sergey Shelukhin commented on HBASE-6060:
-

I can easily(?) repro this on 0.94.



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-11-30 Thread Sergey Shelukhin (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507766#comment-13507766 ]

Sergey Shelukhin commented on HBASE-6060:
-

oh, didn't read all the comments, nevermind



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-11-01 Thread Jimmy Xiang (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488817#comment-13488817 ]

Jimmy Xiang commented on HBASE-6060:


@Rajesh, are you ok with the latest patch on RB for trunk?

@Lars, I am fine with letting it go with 0.92 and 0.94. My patch won't work 
well in 0.92/0.94 since it depends on other patches which are too big to 
backport.



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-11-01 Thread rajeshbabu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488858#comment-13488858 ]

rajeshbabu commented on HBASE-6060:
---

@Jimmy
It's good. +1



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-11-01 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488930#comment-13488930 ]

Hadoop QA commented on HBASE-6060:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12551735/trunk-6060_v3.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
85 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3210//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3210//console

This message is automatically generated.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-11-01 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489080#comment-13489080 ]

Hudson commented on HBASE-6060:
---

Integrated in HBase-TRUNK #3506 (See 
[https://builds.apache.org/job/HBase-TRUNK/3506/])
HBASE-6060 Regions's in OPENING state from failed regionservers takes a 
long time to recover (Revision 1404759)

 Result = FAILURE
jxiang : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionState.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-11-01 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489243#comment-13489243 ]

Hudson commented on HBASE-6060:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #247 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/247/])
HBASE-6060 Regions's in OPENING state from failed regionservers takes a 
long time to recover (Revision 1404759)

 Result = FAILURE
jxiang : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionState.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-31 Thread Lars Hofhansl (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488488#comment-13488488 ]

Lars Hofhansl commented on HBASE-6060:
--

What's the feeling w.r.t. 0.92 and 0.94?



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-29 Thread Jimmy Xiang (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486333#comment-13486333 ]

Jimmy Xiang commented on HBASE-6060:


Good review.  Thanks.  I posted a new patch on RB: 
https://reviews.apache.org/r/7767/

In the new patch, I reorganized the SSH code a little bit and handled regions 
in OPEN state too.
It also handles regions that are in OPENING/PENDING_OPEN/OFFLINE state but were 
open on this region server before that; this could happen when the cluster 
starts up.
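
To make the intent concrete, here is a minimal sketch of that idea with simplified stand-in types; the class and method names below are hypothetical and do not come from the actual patch on RB.

{code}
// Hypothetical, simplified sketch (not the real ServerShutdownHandler): when a
// server dies, queue for reassignment both the regions whose in-flight transition
// (OPENING/PENDING_OPEN/OFFLINE, or OPEN) points at the dead server and the
// regions that were simply open there, e.g. right after cluster startup.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ShutdownReassignSketch {
  enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, CLOSING, PENDING_CLOSE, SPLITTING }

  static final class RegionState {
    final String region; final State state; final String server;
    RegionState(String region, State state, String server) {
      this.region = region; this.state = state; this.server = server;
    }
  }

  static List<String> regionsToReassign(Map<String, RegionState> regionsInTransition,
      Map<String, String> openRegionLocations, String deadServer) {
    List<String> toAssign = new ArrayList<String>();
    for (RegionState rs : regionsInTransition.values()) {
      boolean onDeadServer = deadServer.equals(rs.server);
      boolean reassignable = rs.state == State.OPENING || rs.state == State.PENDING_OPEN
          || rs.state == State.OFFLINE || rs.state == State.OPEN;
      if (onDeadServer && reassignable) {
        toAssign.add(rs.region);  // do not leave these to the TimeoutMonitor
      }
    }
    // Regions not in transition whose last known location was the dead server.
    for (Map.Entry<String, String> e : openRegionLocations.entrySet()) {
      if (deadServer.equals(e.getValue()) && !regionsInTransition.containsKey(e.getKey())) {
        toAssign.add(e.getKey());
      }
    }
    return toAssign;
  }
}
{code}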



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-28 Thread rajeshbabu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485607#comment-13485607 ]

rajeshbabu commented on HBASE-6060:
---

@Jimmy,
The patch looks good: it properly reassigns the regions transitioning (which are 
in PENDING_OPEN or OPENING state) to the dead server by the TM, and it also 
avoids double assignments.
But the regions which are in OPEN state may not be assigned.
{code}
  RegionState rit = services.getAssignmentManager()
      .getRegionStates().getRegionTransitionState(e.getKey());
  if (processDeadRegion(e.getKey(), e.getValue(),
      this.services.getAssignmentManager(),
      this.server.getCatalogTracker())) {
    ServerName addressFromAM = this.services.getAssignmentManager()
        .getRegionStates().getRegionServerOfRegion(e.getKey());
    if (rit != null && !rit.isClosing() && !rit.isPendingClose()
        && !rit.isSplitting()) {
      // Skip regions that were in transition unless CLOSING or
      // PENDING_CLOSE
      LOG.info("Skip assigning region " + rit.toString());
{code}

If the region state is OPEN and the server in the RIT is the same as the dead 
server, we can add it to toAssignRegions so it gets assigned.

- Before forcing the region state to offline we check the current state, and if 
it is anything other than CLOSED or OFFLINE (forceNewPlan is true in the case of 
an assign call from the TM) we call unassign on the region.
If the server in the RIT is dead we delete the znode if it is in 
M_ZK_REGION_CLOSING or RS_ZK_REGION_CLOSED, but in the PENDING_OPEN, OPENING, 
and OPEN cases the znode state won't be either of those. Maybe we can avoid the 
two network calls to ZooKeeper here by adding a region state check.
{code}
if (!serverManager.isServerOnline(server)) {
  // delete the node. if no node exists need not bother.
  deleteClosingOrClosedNode(region);
  regionOffline(region);
  return;
}
{code}
Please correct me if wrong. Great work Jimmy.
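
For illustration only, a hedged sketch of these two suggestions with hypothetical stand-in names (not the actual AssignmentManager/ServerShutdownHandler code):

{code}
// Hypothetical sketch of the two suggestions above, with simplified stand-in types;
// the real logic lives in ServerShutdownHandler/AssignmentManager.
public class ReviewSuggestionsSketch {
  enum State { PENDING_OPEN, OPENING, OPEN, CLOSING, PENDING_CLOSE, SPLITTING }

  static final class RegionState {
    final State state; final String server;
    RegionState(State state, String server) { this.state = state; this.server = server; }
  }

  // (1) An OPEN region whose RIT entry still points at the dead server should also
  //     be added to toAssignRegions instead of being skipped.
  static boolean shouldReassign(RegionState rit, String deadServer) {
    if (rit == null) return false;
    boolean closingSide = rit.state == State.CLOSING || rit.state == State.PENDING_CLOSE
        || rit.state == State.SPLITTING;
    return !closingSide && deadServer.equals(rit.server);
  }

  // (2) The closing/closed znode delete is only useful for closing-side states;
  //     a PENDING_OPEN/OPENING/OPEN znode can never be in M_ZK_REGION_CLOSING or
  //     RS_ZK_REGION_CLOSED, so checking the region state first avoids the two
  //     ZooKeeper round trips before calling regionOffline().
  static boolean needsClosingOrClosedNodeDelete(RegionState rit) {
    return rit != null && (rit.state == State.CLOSING || rit.state == State.PENDING_CLOSE);
  }
}
{code}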



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-28 Thread ramkrishna.s.vasudevan (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485830#comment-13485830 ]

ramkrishna.s.vasudevan commented on HBASE-6060:
---

Good review, Rajesh.
Yes, the OPEN case seems to be handled.  Also, we have this code guarding
{code}
else if (addressFromAM != null
    && !addressFromAM.equals(this.serverName)) {
  LOG.debug("Skip assigning region "
      + e.getKey().getRegionNameAsString()
      + " because it has been opened in "
      + addressFromAM.getServerName());
{code}
against the case where the server name changed just after OPEN was processed.  
This should be fine.



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-27 Thread Jimmy Xiang (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485515#comment-13485515 ]

Jimmy Xiang commented on HBASE-6060:


Added a test, and uploaded v2: trunk-6060_v2.patch.



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-27 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485523#comment-13485523 ]

Hadoop QA commented on HBASE-6060:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12551096/trunk-6060_v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
85 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor
  org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3165//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3165//console

This message is automatically generated.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-26 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13485092#comment-13485092
 ] 

Jimmy Xiang commented on HBASE-6060:


I think we can let it go for 0.94, since the timeout monitor can handle it and 
there is no better way to fix it there: the region state in 0.94 is not that 
reliable.

For 0.96, this case is not covered yet; it still relies on the timeout monitor. 
Let me cook up a patch for 0.96 now.



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-26 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13485173#comment-13485173
 ] 

Jimmy Xiang commented on HBASE-6060:


I uploaded a simple patch for 0.96: trunk-6060.patch. Could you please review?



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-26 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13485188#comment-13485188
 ] 

stack commented on HBASE-6060:
--

How does this patch address this issue, Jimmy?

Way back, in an earlier incarnation of this patch, I was trying to harden who 
owns the region.  We had gray areas where a region could be PENDING_OPEN and, on 
failure, it was unclear whether the region was owned by the master or by the 
regionserver, even though the repair differed depending on who had control of 
the region.  Is fixing region states for another JIRA?



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13485189#comment-13485189
 ] 

Ted Yu commented on HBASE-6060:
---

{code}
-this.stamp.set(System.currentTimeMillis());
+setTimestamp(System.currentTimeMillis());
{code}
EnvironmentEdgeManager should be used.

It would be nice to include test(s) from Rajesh's patch(es).
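
For reference, a minimal sketch of that suggestion; the class and field below 
are illustrative stand-ins, not the actual code touched by the diff above:
{code}
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

// Illustrative stand-in for the timestamp holder changed in the diff above.
public class TimestampSketch {
  private final AtomicLong stamp = new AtomicLong();

  public void setTimestamp(long ts) {
    stamp.set(ts);
  }

  public void touch() {
    // Read the clock through EnvironmentEdgeManager rather than calling
    // System.currentTimeMillis() directly, so tests can inject a controlled clock.
    setTimestamp(EnvironmentEdgeManager.currentTimeMillis());
  }
}
{code}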



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13485206#comment-13485206
 ] 

Hadoop QA commented on HBASE-6060:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12551016/trunk-6060.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

-1 javadoc.  The javadoc tool appears to have generated 85 warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3158//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3158//console

This message is automatically generated.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-26 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13485209#comment-13485209
 ] 

Jimmy Xiang commented on HBASE-6060:


@Stack, here is my understanding of the problem. The master calls an rs to open a 
region. Now, in master memory, the region is in pending_open state with this 
rs' server name. Then the rs dies.  When SSH starts, it goes to meta to find all 
the regions on this rs, subtracts those regions already in transition, and then 
assigns the remaining regions. If the pending_open region (it could be opening 
too, depending on timing) was on this region server before, SSH will take care 
of it.  Otherwise, if it was on a different region server, SSH will not pick it 
up (roughly as sketched below).  In this patch, I just time out the region 
transition so that the timeout monitor can change the state and re-assign it, 
instead of waiting for a long time (now 20 minutes by default).

I'd like to make sure the region states in master memory are reliable.  
Otherwise, they are not of much use. So I think the master always has region 
control.
In 0.96, I think the region states are very reliable now. Of course, there could 
be bugs I am not aware of yet.

@Ted, good point.  I will include the test. For EnvironmentEdgeManager, I will 
leave it to another jira.
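
A schematic sketch of the filtering described above; the types and method below 
are simplified placeholders, not the actual ServerShutdownHandler code:
{code}
import java.util.HashSet;
import java.util.Set;

// Schematic only: SSH takes the dead server's regions as listed in meta,
// subtracts those already in transition, and assigns the rest. A region that
// is pending_open/opening for the dead server but was hosted on a different
// server before is not listed in meta under the dead server, so it is never
// picked up here and is left to the timeout monitor.
public class ShutdownHandlerSketch {
  static Set<String> regionsToAssign(Set<String> regionsOnDeadServerInMeta,
                                     Set<String> regionsInTransition) {
    Set<String> toAssign = new HashSet<String>(regionsOnDeadServerInMeta);
    toAssign.removeAll(regionsInTransition);
    return toAssign;
  }

  public static void main(String[] args) {
    Set<String> inMeta = new HashSet<String>();
    inMeta.add("regionA");
    inMeta.add("regionB");
    Set<String> inTransition = new HashSet<String>();
    inTransition.add("regionB");   // already in transition: skipped
    inTransition.add("regionC");   // opening for the dead server, but not listed
                                   // in meta under it: never seen here at all
    System.out.println(regionsToAssign(inMeta, inTransition)); // prints [regionA]
  }
}
{code}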




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-10-25 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13484706#comment-13484706
 ] 

Lars Hofhansl commented on HBASE-6060:
--

Looking at the patches and trying to read through the comments here, I have to 
admit... I have NO idea what is going on.
I think in 0.96 Jimmy is ripping apart AM, which would hopefully cover this.

Should we attempt this at all on 0.94, or just let it go? Please let me know.




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-09-06 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13449825#comment-13449825
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
Does Jon's recent mail thread on rewriting the AM altogether imply that the 
scenario mentioned here will also be covered?



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-09-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13449830#comment-13449830
 ] 

stack commented on HBASE-6060:
--

@Ram He's not at 'total rewrite' yet (smile).  What do you think of that thread, 
Ram?



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-09-06 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13449903#comment-13449903
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack, Jon's comments make sense.  A lot of thought is needed, though.
A better FSM may be needed.  We can surely try it out.  



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-30 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445496#comment-13445496
 ] 

Lars Hofhansl commented on HBASE-6060:
--

The last patch on RB is 7 weeks old. Is that still the one to review?



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-30 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445497#comment-13445497
 ] 

Lars Hofhansl commented on HBASE-6060:
--

Ah, never mind; there is a new review request now.



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-30 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445620#comment-13445620
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Lars
The latest one on the RB is https://reviews.apache.org/r/6636/.  It was created 
2 weeks back, and 2 diffs have already been added to it.
Anyway, this is for trunk.  If Stack and Jimmy are ok with this, I can surely 
give you a patch for 0.94.  



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-28 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443255#comment-13443255
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

Please review the latest patch in the RB and provide your comments.




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-16 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13436101#comment-13436101
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

Sorry for not updating the test case TestOpenRegionHandler in both test cases; I 
had done it in only one test case.  Sorry about that.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13436159#comment-13436159
 ] 

Hadoop QA commented on HBASE-6060:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12541239/HBASE-6060_latest.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 21 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 8 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.master.TestDistributedLogSplitting
  org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2598//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2598//console

This message is automatically generated.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-15 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13435258#comment-13435258
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

New review board request created. https://reviews.apache.org/r/6636/





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13435353#comment-13435353
 ] 

Hadoop QA commented on HBASE-6060:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12541080/HBASE-6060_latest.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 21 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 8 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
       org.apache.hadoop.hbase.master.TestRollingRestart
       org.apache.hadoop.hbase.master.TestDistributedLogSplitting
       org.apache.hadoop.hbase.regionserver.handler.TestOpenRegionHandler

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2589//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2589//console

This message is automatically generated.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13434459#comment-13434459
 ] 

Hadoop QA commented on HBASE-6060:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12540918/HBASE-6060_latest.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 15 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 8 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
       org.apache.hadoop.hbase.regionserver.handler.TestOpenRegionHandler
       org.apache.hadoop.hbase.regionserver.handler.TestCloseRegionHandler
       org.apache.hadoop.hbase.master.TestDistributedLogSplitting
       org.apache.hadoop.hbase.regionserver.TestSplitLogWorker

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2568//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2568//console

This message is automatically generated.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-14 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13434821#comment-13434821
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

Some test cases are failing.  Let me check and update the patch.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-08-12 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432906#comment-13432906
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

I am working on an updated patch. I should be able to submit it by today or 
tomorrow.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-29 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424684#comment-13424684
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Jimmy/@Stack
If you can give me a couple of days, I can give a patch. Actually, after 
discussing with Rajesh, patch 5 has some things to be taken from Stack's 
patch up in RB, and Stack's patch is not the final one as per the review 
comments over in RB.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-28 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424368#comment-13424368
 ] 

Jimmy Xiang commented on HBASE-6060:


Patch 5 looks good to me. In addition to the regions assigned to the dead 
server, it also re-assigns the regions in transition whose destination server 
is the dead server.

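For readers skimming the thread, a minimal sketch of the reassignment rule described above; the class and method names are simplified assumptions for illustration, not the code from patch 5.

// Hypothetical sketch: on server shutdown, re-assign not only the regions the
// dead server was hosting, but also the regions in transition whose region
// plan targeted the dead server. Without the second set, an OPENING znode
// written by the dead server is never updated and the region stays stuck
// until the timeout monitor fires.
import java.util.ArrayList;
import java.util.List;

class DeadServerReassignmentSketch {

  static class RegionPlan {
    final String regionName;
    final String destinationServer;
    RegionPlan(String regionName, String destinationServer) {
      this.regionName = regionName;
      this.destinationServer = destinationServer;
    }
  }

  /** Regions the dead server was already serving (e.g. read from META). */
  List<String> regionsOnServer(String deadServer) {
    return new ArrayList<>(); // placeholder
  }

  /** Region plans for regions currently in transition (PENDING_OPEN/OPENING). */
  List<RegionPlan> regionPlansInTransition() {
    return new ArrayList<>(); // placeholder
  }

  List<String> regionsToReassign(String deadServer) {
    List<String> toReassign = new ArrayList<>(regionsOnServer(deadServer));
    for (RegionPlan plan : regionPlansInTransition()) {
      if (deadServer.equals(plan.destinationServer)) {
        toReassign.add(plan.regionName);
      }
    }
    return toReassign;
  }
}
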
[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-27 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424020#comment-13424020
 ] 

Jimmy Xiang commented on HBASE-6060:


@Rajesh, could you update your patch since HBASE-6272 is in, and put the diff 
on RB?

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-27 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424142#comment-13424142
 ] 

stack commented on HBASE-6060:
--

@Jimmy I did the last patch.  I think you should review it for 'merit'.  If you 
think it's worth it, I'll update it to fit hbase-6272.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-27 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424209#comment-13424209
 ] 

Jimmy Xiang commented on HBASE-6060:


@Stack, no problem, doesn't have to.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409359#comment-13409359
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
Some comments on RB.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409639#comment-13409639
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
OK with the approach, Stack. A few cases may still be hiding out there :)
Nice of you, Stack.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409640#comment-13409640
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

Do we need to test the perf because now the transition from OFFLINE to OPENING 
is done in the main thread?

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409749#comment-13409749
 ] 

stack commented on HBASE-6060:
--

bq. Do we need to test the perf because now the transition from OFFLINE to 
OPENING is done in the main thread?

Yeah, it's going to take longer for the rpc to complete though we are doing the 
same amount of 'work' opening a region.  I'm not worried about the single 
region open case.  I'd think we will add a few ticks.  But bulk opens will be 
interesting.  I've not made the change there, yet.  These try to be async so 
maybe it'll be ok there... will see.

What do you think?

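To make the trade-off under discussion concrete, here is a rough sketch of an openRegion RPC that performs the OFFLINE to OPENING znode transition in the calling (RPC) thread and leaves the rest of the open to an async handler. The interfaces and method names are illustrative assumptions, not the real HBase API.

// Sketch only: the znode transition happens in the RPC thread, so the master
// learns from the RPC result whether this server really claimed the region,
// at the cost of a slightly longer RPC.
class OpenRegionRpcSketch {

  interface ZkTransitions {
    // Returns false if the znode is no longer in OFFLINE (stale request).
    boolean transitionOfflineToOpening(String regionName) throws Exception;
  }

  interface HandlerPool {
    void submit(Runnable openHandler);
  }

  void openRegion(String regionName, ZkTransitions zk, HandlerPool pool)
      throws Exception {
    // Done synchronously in the RPC thread (the change being discussed).
    if (!zk.transitionOfflineToOpening(regionName)) {
      throw new Exception("Region " + regionName + " is not in OFFLINE state");
    }
    // The heavy lifting (opening stores, OPENING -> OPENED) still happens
    // asynchronously in the handler thread.
    pool.submit(() -> openRegionInternal(regionName));
  }

  void openRegionInternal(String regionName) {
    // open the region, then transition OPENING -> OPENED (omitted)
  }
}

For a single region open the extra znode write is the only added latency; for bulk assigns the same synchronous transition would happen once per region per RPC, which is the part flagged above as worth measuring.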


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-08 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408861#comment-13408861
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
I am still reviewing this.  Tomorrow I will give all my comments once I reach 
the office.  I just need to look at a few things carefully. Thanks, Stack.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-07 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408601#comment-13408601
 ] 

stack commented on HBASE-6060:
--

v3 should be good.  I posted it up on RB.  Addresses Ted and Ram's issues.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407748#comment-13407748
 ] 

stack commented on HBASE-6060:
--

@Lars Yes.

@Ted My fault.  See https://reviews.apache.org/r/5796/ instead.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-06 Thread rajeshbabu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408209#comment-13408209
 ] 

rajeshbabu commented on HBASE-6060:
---

@Stack
In HRegionServer.openRegion(), if the zk transition from OFFLINE to OPENING 
fails, we need to remove the region from regionsInTransitionInRS of that RS, 
as OpenRegionHandler does. Otherwise we end up with 
RegionAlreadyInTransitionException if we call assign for the same region 
again (in bulk assign we will call assign in case of FAILED_OPENING).

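A minimal sketch of the cleanup being asked for here, assuming a simplified regionsInTransitionInRS map and made-up method names rather than the actual HRegionServer code.

// Sketch only: if the RPC-side OFFLINE -> OPENING transition fails, drop the
// region from the server's regions-in-transition bookkeeping so a later
// assign of the same region is not rejected as already in transition.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class OpenRegionCleanupSketch {

  // true = opening, false = closing (mirrors the idea of the real map only).
  private final Map<String, Boolean> regionsInTransitionInRS =
      new ConcurrentHashMap<>();

  boolean transitionOfflineToOpening(String regionName) {
    return false; // placeholder: pretend the znode transition failed
  }

  void openRegion(String regionName) throws Exception {
    Boolean previous = regionsInTransitionInRS.putIfAbsent(regionName, true);
    if (previous != null) {
      throw new Exception("Region already in transition on this server: " + regionName);
    }
    boolean transitioned = false;
    try {
      transitioned = transitionOfflineToOpening(regionName);
      if (!transitioned) {
        throw new Exception("Failed OFFLINE -> OPENING for " + regionName);
      }
      // hand the region off to the open handler here (omitted)
    } finally {
      if (!transitioned) {
        // Without this removal a retried assign for the same region would see
        // the stale entry and fail, which is the problem described above.
        regionsInTransitionInRS.remove(regionName);
      }
    }
  }
}
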


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408242#comment-13408242
 ] 

stack commented on HBASE-6060:
--

@Ram Good one.  Will refresh the patch up in RB.  Otherwise, do you think this 
approach will work?  What are your thoughts on the bulk assign problem?  Or on the 
variety of ways in which we can force a region OFFLINE?

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408322#comment-13408322
 ] 

stack commented on HBASE-6060:
--

I added an update up on RB to address Ted's and Ram's comments.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-06 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408388#comment-13408388
 ] 

Zhihong Ted Yu commented on HBASE-6060:
---

There seem to be compilation errors for Stack's patch v2:
{code}
[ERROR] 
/Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java:[101,0]
 org.apache.hadoop.hbase.master.MockRegionServer is not abstract and does not 
override abstract method 
containsKeyInRegionsInTransition(org.apache.hadoop.hbase.HRegionInfo) in 
org.apache.hadoop.hbase.regionserver.RegionServerServices
[ERROR] 
[ERROR] 
/Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java:[326,2]
 method does not override or implement a method from a supertype
[ERROR] 
[ERROR] 
/Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java:[45,7]
 org.apache.hadoop.hbase.util.MockRegionServerServices is not abstract and does 
not override abstract method 
containsKeyInRegionsInTransition(org.apache.hadoop.hbase.HRegionInfo) in 
org.apache.hadoop.hbase.regionserver.RegionServerServices
[ERROR] 
[ERROR] 
/Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java:[92,2]
 method does not override or implement a method from a supertype
{code}
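
Presumably the mocks just need to implement the new RegionServerServices method named 
above; a minimal sketch (the mock's in-memory map and its name are assumptions, and the 
boolean return type is inferred from the method name):

{code}
// Hypothetical addition to MockRegionServerServices / MockRegionServer: satisfy
// the new RegionServerServices method the compiler is complaining about.
@Override
public boolean containsKeyInRegionsInTransition(final HRegionInfo hri) {
  // 'rit' is an assumed in-memory map of encoded region name -> HRegionInfo
  // that the mock keeps for regions in transition.
  return this.rit.containsKey(hri.getEncodedName());
}
{code}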

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408404#comment-13408404
 ] 

stack commented on HBASE-6060:
--

I can fix that.

But I just thought of something: the way I'm doing the test of a region in 
transition over on the RS is wrong (the stuff I added to address Ram's 
feedback); I can actually bump aside an ongoing RIT.  Will fix in the morning.  
Please do not review v2 yet, Ram.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-05 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407482#comment-13407482
 ] 

stack commented on HBASE-6060:
--

@Ram Yes.

I put something for discussion up in RB: https://reviews.apache.org/r/5781/  If 
you and/or Rajesh get a chance, see what you think.

There are still a million holes as I see it, in particular around bulk assign, 
etc. (See the TODO list.)

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-05 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407545#comment-13407545
 ] 

Zhihong Ted Yu commented on HBASE-6060:
---

I only see the Review button on the above URL - maybe it is my Review Board account 
that is malfunctioning.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-05 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407620#comment-13407620
 ] 

Lars Hofhansl commented on HBASE-6060:
--

It seems we are a bit away from a fix. Can this go into 0.94.2?

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-26 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13401542#comment-13401542
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
So, for the latest version of the patch, can we move the step where the node is 
changed from OFFLINE to OPENING, and let the remaining part stay in the 
OpenRegionHandler?

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-18 Thread chunhui shen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13393695#comment-13393695
 ] 

chunhui shen commented on HBASE-6060:
-

The latest patch, HBASE-6060_trunk_5.patch, is easy to understand. Nice!
But I have one doubt:
{code}
 // We have sent the open rpc to the region server.  We hold this state until regionserver
+  // assumes control of the region by setting znode to OPENING from OFFLINE (This is an
+  // in-memory state only with no corresponding znode state
+  PENDING_OPEN,
+
+  // Set by zk callback when regionserver moves znode from OFFLINE to OPENING.  Regionserver now
+  // has control of the region.
+  OPENING,
{code}

Since we change the sense of PENDING_OPEN and OPENING,
should we make the corresponding change?

e.g.
doing state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
plan.getDestination());
after
doing serverManager.sendRegionOpen()

Pardon me if I am wrong, thanks.
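
Concretely, the reordering would look roughly like this in AssignmentManager.assign() 
(illustrative only; the argument lists and the surrounding retry and error handling are 
abbreviated, and versionOfOfflineNode is an assumed parameter name):

{code}
// Sketch of the reordering being suggested (not the committed patch):
// issue the open RPC first, then mark the region PENDING_OPEN in master memory.
serverManager.sendRegionOpen(plan.getDestination(), state.getRegion(),
    versionOfOfflineNode);
// Only once the open RPC has actually been sent does the master record
// PENDING_OPEN for this region.
state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
    plan.getDestination());
{code}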

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-18 Thread rajeshbabu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395828#comment-13395828
 ] 

rajeshbabu commented on HBASE-6060:
---

@chunhui
Sorry, I forgot to add them in HBASE-6060_trunk_5.patch, but they are available in 
HBASE-6060_trunk_4.patch.
There is still one open problem with this solution. You can find it here:
https://issues.apache.org/jira/browse/HBASE-6060?focusedCommentId=13293382page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13293382

Thanks.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13295311#comment-13295311
 ] 

stack commented on HBASE-6060:
--

bq. But I agree two systems, AM and SSH, take decisions based on that..

Sorry, I don't follow.  Are you saying we should rely on RegionState, or on 
RegionPlan?  (The plan is what we 'want' to happen; RegionState should be where 
regions actually 'are'. I'd think that we'd want to deal with the current state 
when making a decision, rather than the planned state?)

bq. One more thing on RegionState in RIT is that it is reactive. Every time 
assignment starts, RegionState in RIT goes through a set of steps, and maybe that 
is why we are not sure what step the RIT is in and who made that change.

In SSH, we need the answer to one basic question only: who owns the 
region, master or regionserver?  Unless you have a better idea, I think setting 
the znode to OPENING before returning from the open rpc is necessary to plug the 
gray area you fellas identified.

bq. We also thought of one approach: can we remove the retry logic 
itself in assign?

It seems like the retry has been there a long time (originally we just recursed, 
calling assign on exception, but in HBASE-3263 we changed it to a bounded loop). 
 It seems like the retry is ok since we'll try a different server if we fail on 
the first plan?  We also like this single-assign method because it is more rigorous 
about state management than bulk-assign.  It's only a problem if there is a concurrent 
assign when we are unsure who is responsible for the region.
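
As a rough illustration (assumed names only, not the committed change; both helpers 
below are hypothetical), setting the znode to OPENING before returning from the open 
rpc could look like this on the regionserver side:

{code}
// Sketch only: claim the region in ZK synchronously, inside the open RPC, so
// that once openRegion() returns, ZK already shows this regionserver as the
// owner and SSH can reassign the region if the RS dies afterwards.
public RegionOpeningState openRegion(final HRegionInfo region) throws IOException {
  checkOpen();
  this.regionsInTransitionInRS.putIfAbsent(region.getEncodedNameAsBytes(), true);
  if (!transitionZkNodeFromOfflineToOpening(region)) {   // hypothetical helper
    this.regionsInTransitionInRS.remove(region.getEncodedNameAsBytes());
    return RegionOpeningState.FAILED_OPENING;
  }
  // The heavy lifting (opening stores, moving the znode OPENING -> OPENED)
  // stays asynchronous in OpenRegionHandler.
  submitOpenRegionHandler(region);                        // hypothetical helper
  return RegionOpeningState.OPENED;
}
{code}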



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293382#comment-13293382
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

In the latest patch there is one more problem.  Uploaded it just to sync up on 
whether this is what you are trying, Stack.

Now, after calling openRegion, suppose the OpenRegionHandler is spawned and just after 
that the RS goes down, and PENDING_OPEN has still not been updated by the 
time SSH starts processing. Then, as per the updated latest patch, we will not be 
able to call assign.  No testcase runs were done with this patch; it needs modifications.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293390#comment-13293390
 ] 

stack commented on HBASE-6060:
--

bq. Now after calling openRegion, if the OpenRegionHandler is spawned and just 
after that the RS goes down, and still the PENDING_OPEN is not yet updated by 
the time SSH starts processing, then as per the updated latest patch we will 
not be able to call assign.

Can you explain more, Ram?  The open rpc has been sent (successfully), the RS 
goes down, but we have not set it to PENDING_OPEN -- and at this time the SSH 
runs?  Is that what you are suggesting?  So, it's the window between the open rpc 
return and the setting of PENDING_OPEN -- in this window, if the SSH runs to 
completion we'll have an unassigned region, one that will have to be picked up 
by the timeout monitor?  Is that what you are suggesting?

Thanks Ram (Sorry that this stuff is taking so long...)

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293449#comment-13293449
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
Absolutely correct.  :) Maybe you have some other idea for that?

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293551#comment-13293551
 ] 

Hadoop QA commented on HBASE-6060:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12531775/HBASE-6060-trunk_4.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 5 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
  org.apache.hadoop.hbase.util.TestFSUtils
  org.apache.hadoop.hbase.catalog.TestMetaReaderEditor

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2145//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2145//console

This message is automatically generated.

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293706#comment-13293706
 ] 

stack commented on HBASE-6060:
--

@Ram So, the rpc returns successfully.  The very next step is to set 
RegionState to PENDING_OPEN.  Somehow, in between these two points, a timeout 
of the RS znode will expire, a verification of ROOT and META will take place, a 
full scan of META (a new RPC) will happen, and all before we get to setting the 
PENDING_OPEN state?  Seems unlikely.  If it ever happens, I suppose this would 
be one for the timeout monitor to clean up?  Or do you have a scenario where 
this case is more likely?

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293767#comment-13293767
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

This is the open point that we see in moving PENDING_OPEN after the rpc.  As 
far as we brainstormed, we did not find any other gaps in this.  But our initial 
patch that shares state across AM and SSH solves all of these problems. :) 
Do you have a new patch, Stack? Maybe we need to work on testcases if the latest 
patch is fine with you. Thanks a lot for spending your time on this defect with 
your feedback and nice new ideas. 

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread rajeshbabu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293849#comment-13293849
 ] 

rajeshbabu commented on HBASE-6060:
---

Patch with some test cases and corrections. TestAssignmentManager passes.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293883#comment-13293883
 ] 

Hadoop QA commented on HBASE-6060:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12531867/HBASE-6060_trunk_5.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 5 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.regionserver.TestAtomicOperation
  org.apache.hadoop.hbase.master.TestDistributedLogSplitting
  org.apache.hadoop.hbase.master.TestSplitLogManager

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2154//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2154//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2154//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2154//console

This message is automatically generated.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293973#comment-13293973
 ] 

stack commented on HBASE-6060:
--

@Ram Thinking on it, the way to plug the hole you and Rajesh have identified is 
by having the RS update the znode from OFFLINE to OPENING before returning from 
the open rpc call.

The way I see it, we've found problems in our RegionState.  As it is now, we 
cannot ask RegionState to reliably tell us the state of a region.  It's so bad, 
you fellas went and made a solution figuring where a region is at using another 
system altogether, the RegionPlan state, which strikes me as a little odd; 
because the system we should be relying on -- RegionState -- is broken, you 
fellas look at the ghost traces of region moves, the impression left in 
RegionPlan.  Is this a fair characterization?

Don't you think we should fix RegionState while we're in here?

I've been working on a patch but was going to do it piecemeal, fix hbase-6199 
first?
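
For illustration, a minimal sketch of what moving the OFFLINE to OPENING 
transition into the open rpc path could look like on the regionserver side. 
This is not from any attached patch; the exact ZKAssign call, the 
OpenRegionHandler constructor and the surrounding HRegionServer.openRegion() 
shape are assumptions made only for this sketch.

{code}
// Hypothetical sketch inside HRegionServer.openRegion().
public RegionOpeningState openRegion(HRegionInfo region, int versionOfOfflineNode)
    throws IOException {
  try {
    // Move the znode from OFFLINE to OPENING while still inside the rpc, so
    // ownership is visible in zk even if this RS dies right after returning.
    // (Assumed helper/signature; the real ZKAssign method may differ per branch.)
    ZKAssign.transitionNodeOpening(getZooKeeper(), region, getServerName());
  } catch (KeeperException ke) {
    throw new IOException("OFFLINE -> OPENING transition failed for "
        + region.getEncodedName(), ke);
  }
  // Only after the znode says OPENING is the handler queued; it will later
  // open the region and move the znode from OPENING to OPENED.
  this.service.submit(new OpenRegionHandler(this, this, region));
  return RegionOpeningState.OPENED;
}
{code}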





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-12 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13294154#comment-13294154
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

bq. Is this a fair characterization?
RegionState is something which the master uses to see the current progress of 
the assignment. But RegionPlan is the one thing that tells me where the 
assignment is actually going on.
If the master wants to start a new assignment it is again the RegionPlan that 
gets changed.  So we thought of using that as an indicator. But I agree that 
two systems, AM and SSH, take decisions based on it.

One more thing on RegionState in RIT is that it is reactive.  It gets updated 
based on changes in the RS. But RegionPlan is the one decided by the master.
Every time an assignment starts, RegionState in RIT goes through a set of steps, 
and maybe that is why we are not sure what step the RIT is in and who made that 
change.

We also thought of one approach: can we remove the retry logic itself in 
assign?  Always use SSH to trigger assignment, as in TRUNK the expiry of a 
server is very much spontaneous now.  Maybe there are some gaps which we have 
not yet found.

I should accept one thing: I did not try to change the RegionState in RIT and 
was looking to use the RegionPlan only.

bq. RS update the znode to OPENING from OFFLINE before returning from the open 
rpc call.
Then this step has to be moved from OpenRegionHandler?







[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-11 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292835#comment-13292835
 ] 

stack commented on HBASE-6060:
--

What do I have to change in that section of code?  It seems fine as is.  Or are 
you thinking we need to allow it to be in the OFFLINE state?

Yes, agree that before setting PENDING_OPEN, we need to ensure it has not moved 
to OPENING.  That should be easy enough (in fact, it's not possible now that I 
think about it, since there's a lock on RegionState that surrounds the 
single-assign -- need to make sure handleRegion respects it).

What do you think of this approach Ram?  The one where we rely on RegionState 
rather than on RegionPlan?

I won't have the issue you see in Rajesh's patch, right?  I don't intend to 
collect regions from outstanding regionplans.  If the RS goes down before 
sending the RPC, then the single-assign will fail and retry but it won't be 
included in the SSH set? 







[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-11 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292883#comment-13292883
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

bq. I don't intend to collect regions from outstanding regionplans
I think we need this (like the one in earlier patches). Otherwise, if the region 
has been moved to OPENING and the destination RS goes down, how do we process 
it?  It will not be in META either, since it is still not opened on the 
destination RS.  So a logic similar to Rajesh's patch will be needed, but that, 
along with your suggestion, needs some minor tweaks.  
@Rajesh
Correct me if I am wrong.  This is what we discussed today? 
@Stack
We had a patch but could not upload it from the office due to some other tasks. 
Maybe we can upload one by tomorrow morning once we reach the office. 
Please correct me if I am wrong. (We discussed a lot of scenarios today as well, 
so I may be missing some things here.)







[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-11 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292884#comment-13292884
 ] 

stack commented on HBASE-6060:
--

bq. Or else if the region has been moved to OPENING and the destination RS goes 
down how do we try to process them.

The RegionState will have the server that reported the region in OPENING.  We 
don't need to get it from RegionPlan (in AM#processServerShutdown, as part of 
the run through all RITs, we need to look for RegionStates that are against the 
server that has just died).

Looking forward to your patch tomorrow.  I'll upload one too (working on tests 
today).
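
As a rough sketch of that idea (the field and accessor names below are 
illustrative assumptions, not the actual AssignmentManager members):

{code}
// Illustrative only: inside AM#processServerShutdown, collect regions whose
// in-memory RegionState was last reported against the server that just died.
List<HRegionInfo> ritsOnDeadServer = new ArrayList<HRegionInfo>();
synchronized (this.regionsInTransition) {
  for (RegionState rs : this.regionsInTransition.values()) {
    // Assumes RegionState carries the server it was last updated against.
    if (deadServerName.equals(rs.getServerName())) {
      ritsOnDeadServer.add(rs.getRegion());
    }
  }
}
// Hand these to the ServerShutdownHandler bulk assign instead of consulting
// RegionPlan.
{code}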





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread rajeshbabu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292474#comment-13292474
 ] 

rajeshbabu commented on HBASE-6060:
---

@Ram 
Attached a patch. Just a suggestion to skip the regions in OFFLINE/PENDING_OPEN 
when the RS went down before the request was sent.

But we need to handle this case:

Let's suppose the region server went down after spawning OpenRegionHandler and 
before transitioning to OPENING; then it is SSH's responsibility to assign the 
regions in OFFLINE/PENDING_OPEN.

In this patch we skip regions to be assigned in case the RS went down before 
the request:
{code}
  if ((region.getState() == RegionState.State.OFFLINE)
      && (region.getState() == RegionState.State.PENDING_OPEN)) {
    regionPlans.remove(region.getRegion());
  }
{code}






[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292475#comment-13292475
 ] 

stack commented on HBASE-6060:
--

bq. But the gain with bulk assignment is its faster which we wanted in 
HBASE-5914.

Yes.  We are keeping bulk assigns; it's just that the suggested patch favors the 
single-assign, letting it do retries, whereas the original patch wants to do it 
all via bulk assign, canceling any ongoing single-assigns.

bq. Only after the RS changes to OPENING the znode of the regions belongs to 
the RS.

Yes, this is the case for znode states.  The problem is the PENDING_OPEN state, 
an in-memory state kept in RegionState.  It exists from the sending of the open 
rpc to the regionserver until the OPENING callback comes into the master saying 
the regionserver has moved the znode from OFFLINE to OPENING.  This is the 
messy state.

bq.  Infact this behaviour of using the owner of the node is the logic behind 
how we do processRIT on a master restart.

The context is different I think; there is no ongoing retry of single-assign 
here.  It's a cleaner case.

bq. But what if the RS goes down before sending the RPC call to RS and that is 
where the retry logic comes, which i feel is really needed.

Do you mean the master sending the RPC?
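
To keep the window being discussed straight, a compact restatement of the state 
transitions above (this is a summary of the thread, not code from any patch):

{code}
// znode in zk:        OFFLINE --------------------------------> OPENING --> OPENED
// master RegionState: OFFLINE --> PENDING_OPEN (in-memory only) --> OPENING --> OPEN
//                                 ^
//                                 today this is set before the open rpc is sent;
//                                 the murky window is between that update and the
//                                 OPENING callback from zk.
{code}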





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292476#comment-13292476
 ] 

stack commented on HBASE-6060:
--

@rajeshbabu So, you are allowing the single-assign to continue its retries?  
Regions that are OFFLINE or PENDING_OPEN are not bulk assigned? (Yes, there is 
still the case of the RS going down before it moves the region to OPENING.)





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292479#comment-13292479
 ] 

Hadoop QA commented on HBASE-6060:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12531590/6060_suggestion_toassign_rs_wentdown_beforerequest.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 5 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2134//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2134//console

This message is automatically generated.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292530#comment-13292530
 ] 

Zhihong Ted Yu commented on HBASE-6060:
---

For 6060_suggestion_toassign_rs_wentdown_beforerequest.patch:

Can you give the following variable a better name?
{code}
+Set<HRegionInfo> regionPlans = new ConcurrentSkipListSet<HRegionInfo>();
{code}
The set doesn't hold region plans. The following javadoc needs to be adjusted 
accordingly.
{code}
+   * @return Pair that has all regionplans that pertain to this dead server 
and a list that has
{code}
{code}
+  if ((region.getState() == RegionState.State.OFFLINE)
+      && (region.getState() == RegionState.State.PENDING_OPEN)) {
{code}
A region cannot be in both states at the same time. '||' should be used instead 
of '&&'.
{code}
+deadRegions = new TreeSet<HRegionInfo>(assignedRegions);
{code}
Since deadRegions above is populated in a different code block from 
the following:
{code}
   if (deadRegions.remove(region.getRegion())) {
{code}
Running testSSHWhenSourceRSandDestRSInRegionPlanGoneDown (from v3) would lead 
to NPE w.r.t. deadRegions

After fixing the above, testSSHWhenSourceRSandDestRSInRegionPlanGoneDown still 
fails.
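
For clarity, the condition being pointed at would read as follows after the 
fix; the renamed set is a hypothetical name chosen only for this snippet and is 
assumed to have been populated earlier with the dead server's regions:

{code}
// A region can only be in one state at a time, so the checks must be OR'ed,
// and the set is renamed to reflect that it holds regions rather than plans.
if (region.getState() == RegionState.State.OFFLINE
    || region.getState() == RegionState.State.PENDING_OPEN) {
  regionsOnDeadServer.remove(region.getRegion());
}
{code}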





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292549#comment-13292549
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Ted
Myself or Rajesh, either of us will investigate the test failures tomorrow once 
we reach the office.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292589#comment-13292589
 ] 

stack commented on HBASE-6060:
--

Rajesh says:
bq. Lets suppose Region server went down after spawning OpenRegionHandler and 
before transitioning to OPENING then its SSH responsibility to assign regions 
in OFFLINE/PENDING_OPEN.

I say:
bq. Shouldn't the region belong to the master or the regionserver, without a 
gray area in between, while PENDING_OPEN is going on?

Ram says:

bq. Stack, in this case first the region belongs to master.  Only after the RS 
changes to OPENING the znode of the regions belongs to the RS.

So, Rajesh identifies a hole, I claim the hole is murky and unspecific, and Ram 
claims there is no hole really.

Below I argue that there is a hole and that a small change cleans up the 
RegionState states, making SSH processing cleaner.

Ram, what you say is true if you are looking at znode states only.  If you are 
looking at RegionState, the in-memory reflection of what a region's state is 
according to the master, then what PENDING_OPEN covers -- a state that does not 
have a corresponding znode state -- is unclear.  

I want to rely on what's in RegionState to figure out what SSH should process 
(Rajesh's latest patch seems to want to walk this path too).

PENDING_OPEN currently spans the master sending the open rpc.  It is set before 
we do the rpc invocation, so if the regionserver goes down while a region's 
state is PENDING_OPEN, should it be handled by SSH or will it get retried by 
the single-assign method?  I can't tell for sure.  If the regionserver went 
down while the rpc was outstanding, the single-assign will retry.  It will 
actually set the RegionState back to OFFLINE temporarily -- which makes it even 
harder to figure out what's going on if looking from another thread.  
PENDING_OPEN as it is today is worse than useless.

How about this, Ram and Rajesh:

{code}
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
@@ -1715,14 +1715,18 @@ public class AssignmentManager extends ZooKeeperListener {
       try {
         LOG.debug("Assigning region " + state.getRegion().getRegionNameAsString() +
           " to " + plan.getDestination().toString());
-        // Transition RegionState to PENDING_OPEN
-        state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
-          plan.getDestination());
         // Send OPEN RPC. This can fail if the server on other end is is not up.
         // Pass the version that was obtained while setting the node to OFFLINE.
         RegionOpeningState regionOpenState = serverManager.sendRegionOpen(plan
           .getDestination(), state.getRegion(), versionOfOfflineNode);
-        if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
+        if (regionOpenState.equals(RegionOpeningState.OPENED)) {
+          // Transition RegionState to PENDING_OPEN. It covers the period between the
+          // send of the rpc and our getting the callback setting the region state to
+          // OPENING.  This is in-memory only change.  Out in zk the znode is OFFLINE
+          // and we are waiting on regionserver to assume ownership by moving it to OPENING.
+          state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
+            plan.getDestination());
+        } else if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
           // Remove region from in-memory transition and unassigned node from ZK
           // While trying to enable the table the regions of the table were
           // already enabled.
{code}

Here we set the region to PENDING_OPEN AFTER we send the open rpc.  Now we know 
that a region that is PENDING_OPEN will not be retried by the single-assign, and 
the state is clear; it's that period after the open rpc but before we get the 
znode callback which sets the RegionState to OPENING.

Over in SSH, I can safely add PENDING_OPEN regions to the set of those to bulk 
assign if they are against the dead server currently being processed.

What do you fellas think?

I need to look at OFFLINE states too, to see if they will always get retried by 
the single-assign.  If so, we can leave those out of the SSH recovery.
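
A rough sketch of the SSH side of that proposal (illustrative names only, not 
the committed change; the snapshot of the RIT map and the server accessor on 
RegionState are assumptions):

{code}
// With PENDING_OPEN now set only after a successful open rpc, any region whose
// RegionState is PENDING_OPEN or OPENING against the dead server can safely be
// added to the ServerShutdownHandler bulk assign.
for (RegionState rit : regionsInTransitionCopy) {   // assumed snapshot of the RIT map
  // Assumes the RegionState records which server it was last updated against.
  if (deadServerName.equals(rit.getServerName())
      && (rit.isPendingOpen() || rit.isOpening())) {
    toAssign.add(rit.getRegion());
  }
}
{code}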


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-10 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292646#comment-13292646
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
A few comments.
First of all, we need to make some more changes if we want to make the above 
change work with handleRegion:
{code}
  // Should see OPENING after we have asked it to OPEN or additional
  // times after already being in state of OPENING
  if (regionState == null ||
      (!regionState.isPendingOpen() && !regionState.isOpening())) {
    LOG.warn("Received OPENING for region " +
      prettyPrintedRegionName +
      " from server " + data.getOrigin() + " but region was in " +
      " the state " + regionState + " and not " +
      "in expected PENDING_OPEN or OPENING states");
{code}
Second, you mean the above change of setting PENDING_OPEN after the RPC call, 
along with the changes in the earlier patches like having the new items 
'ritintersection' and 'outstandingRegionPlans'?
One concern here is: if, before we update the state to PENDING_OPEN on the 
master side, the OpenRegionHandler has already transitioned the node to 
OPENING, then the in-memory state will also be changing to OPENING once we 
correct the code I mentioned above.  In that case the OPENING state will be 
overwritten with PENDING_OPEN?  So we may need to add a check to see whether a 
change has already happened in the RegionState.


The problem with Rajesh's patch 
(6060_suggestion_toassign_rs_wentdown_beforerequest.patch) is:
{code}
+  if ((region.getState() == RegionState.State.OFFLINE)
+      && (region.getState() == RegionState.State.PENDING_OPEN)) {
+    regionPlans.remove(region.getRegion());
+  }
{code}
What if the RS went down before sending the RPC?  The SSH collected the 
RegionPlan, but before it could remove the collected RegionPlan in the above 
code, the master may have completed the retry assignment and the 
OpenRegionHandler may have changed the node to OPENING; since the state is now 
OPENING, we will again try to assign through SSH.

The above problem can also happen with the change that you mentioned, i.e. 
moving PENDING_OPEN to after the RPC.

So can we take a copy of RIT before forming the RegionPlan and work based on 
that?  Will update the change that we are suggesting in some time.
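
One way to express the check being asked for, again only as a sketch layered on 
top of the diff above (the locking detail and the isOffline/isPendingOpen guard 
are assumptions made for the sketch):

{code}
// Before recording PENDING_OPEN once the open rpc returns, make sure the znode
// callback has not already advanced the in-memory state to OPENING or beyond.
synchronized (state) {
  if (state.isOffline() || state.isPendingOpen()) {
    state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
        plan.getDestination());
  }
  // else: handleRegion() already moved it to OPENING; do not overwrite it.
}
{code}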






[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-09 Thread rajeshbabu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292461#comment-13292461
 ] 

rajeshbabu commented on HBASE-6060:
---

@Stack

bq. It just treats OPENING regions that are on the dead server as it does 
CLOSING, CLOSED, etc., so it gets assigned as part of the SSH bulk assign.
We need to consider regions in the OPEN state also (along with OPENING) for 
assignment.
bq. Should not cause a HBASE-5816 because that was our including in the bulk 
assign regions that could have been in OFFLINE/PENDING_OPEN RegionState; i.e. 
regions being handled by the 'normal', single assign.

Do you mean to say we should skip assignment of the regions in OFFLINE or 
PENDING_OPEN while processing server shutdown, because they will be assigned 
as part of the retry?
If this is the case there is one problem:
- Suppose the region server went down after spawning the OpenRegionHandler but 
before transitioning to OPENING; then it is SSH's responsibility to assign the 
regions in OFFLINE/PENDING_OPEN.

Alternative 2 itself may not solve HBASE-6060 because
{code}
  hris = MetaReader.getServerUserRegions(this.server.getCatalogTracker(),
      this.serverName);
{code}
while opening a region, if the destination server details are not yet updated 
in META then the region won't be present in hris.  In that case we are not 
calling assign for the regions which are opening (region info not present in 
hris) on the region server, and we end up waiting for the timeout monitor to 
trigger the assignment.

Please correct me if I am wrong.
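One way to cover the gap described above, purely as a sketch (hris, 
regionsInTransitionOnDeadServer and toAssign are illustrative names, not the 
actual patch), would be to supplement the META scan with the RIT entries whose 
plan pointed at the dead server:
{code}
// Sketch: regions that were still OPENING on the dead server may not yet be
// in META, so the MetaReader scan alone misses them.
List<HRegionInfo> toAssign = new ArrayList<HRegionInfo>(hris.keySet());
for (RegionState rit : regionsInTransitionOnDeadServer) {
  if (!toAssign.contains(rit.getRegion())) {
    // OPENING region whose location was never written to META.
    toAssign.add(rit.getRegion());
  }
}
{code}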





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292470#comment-13292470
 ] 

stack commented on HBASE-6060:
--

bq. We need to consider regions in OPEN state also(along with OPENING) for 
assignment.

Yes.  Makes sense.

bq. you mean to say skip assignment of the regions in OFFLINE or PENDING_OPEN 
while processing server shutdown, because they will be assigned as part of 
retry?

Yes.

bq. Lets suppose Region server went down after spawning OpenRegionHandler and 
before transitioning to OPENING then its SSH responsibility to assign regions 
in OFFLINE/PENDING_OPEN.

This is a good one.  I'd argue this is a hole in our transition states.  The 
region is OFFLINE but the regionserver is supposed to have control.  I'd 
suggest that a regionserver not return from open region until it has moved the 
znode from PENDING_OPEN to OPENING (need to look more into what would be 
involved in making this happen -- and what to do on bulk assign, though it 
seems we drop a bunch of the handling already when bulk assigning).  Shouldn't 
the region belong to either the master or the regionserver, without a gray 
area in between while PENDING_OPEN is going on?
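As a rough sketch of that suggestion (illustrative only; transitionZnodeToOpening 
and submitOpenHandler are assumed helpers, not the actual 
HRegionServer/OpenRegionHandler code), the handoff could look like:
{code}
public RegionOpeningState openRegion(HRegionInfo region) throws IOException {
  // Do the znode transition to OPENING synchronously, inside the open RPC, so
  // there is no window where neither master nor regionserver owns the region.
  int versionOfOpeningNode = transitionZnodeToOpening(region);  // assumed helper
  if (versionOfOpeningNode == -1) {
    // Could not claim the znode; let the master re-plan the assignment.
    return RegionOpeningState.FAILED_OPENING;
  }
  // The heavyweight part of the open still runs in a handler thread, but the
  // ownership handoff has already happened before this RPC returns.
  submitOpenHandler(region, versionOfOpeningNode);  // assumed helper
  return RegionOpeningState.OPENED;
}
{code}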

bq. during opening region if destination server details not updated in META 
then the region wont present in hris...

This is also a good one.  I found it earlier this afternoon making your tests 
work w/ the alternate patch (There is another 'hole' that I found in the case 
where there are no regions on the server that died; in this case we'll skip out 
early and not find associated OPENING or OPEN regions -- my patch is much 
larger than four lines now (smile)).

What do you think of this alternate solution Rajeshbabu?  Assuming I close the 
above holes, do you think it is easier to reason about?  We don't have to keep 
up state distributed across AM and SSH (I think it is clearer what's going on), 
we don't have to use a special RegionPlan to signal the single region assign 
method to give up on its assign (I'd think we'd want regions assigned by the 
single-assign call rather than the bulk assign since the single assign does 
more checking), and we are not built atop the current state of regionplans, 
which seems a little bit of an indirect foundation given we are trying to keep 
up RegionState in regionsInTransition.

Thanks for the thorough review.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292473#comment-13292473
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Rajesh
Thanks for taking care of the reviews.
@Stack
bq.I'd think we'd want regions assigned by the single-assign call rather than 
the bulk assign since the single assign does more checking)
Yes, I agree.  But the gain with bulk assignment is that it's faster, which is 
what we wanted in HBASE-5914.
bq.Shouldn't region belong to master or regionserver without a gray area 
in-between while PENDING_OPEN is going on?
Stack, in this case the region first belongs to the master.  Only after the RS 
changes the znode to OPENING does the region belong to the RS.  This still 
happens as per the current code.  In fact, this behaviour of using the owner of 
the node is the logic behind how we do processRIT on a master restart.
{code}
else if (data.getOrigin() != null &&
    !serverManager.isServerOnline(data.getOrigin())) {
  // to handle cases where offline node is created but sendRegionOpen
  // RPC is not yet sent
  addToRITandCallClose(regionInfo, RegionState.State.OFFLINE, data);
{code}
But what if the RS goes down before we send the RPC call to it?  That is where 
the retry logic comes in, which I feel is really needed.






[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291562#comment-13291562
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

bq.a Map of dead servers to their current set of region plans and 
regions-in-transition.
We are only maintaining the regions and the RITs, not the regionPlans.
bq.Does this sound right?
Yes.  Exactly.  All of this decision is made based on the current regionPlan.
bq.Is this patch even handling the root cause of this JIRA: i.e. dealing w/ 
OPENING znodes that were made against the dead server?
Yes.  
bq.In AM#processServerShutdown, we iterate the list of regions we get back from 
AM#this.servers and from this set we will remove RIT. But won't this original 
set of regions be missing regions that are OPENING or even OPENED but not yet 
handled because its only after the OPENED region has updated its znode and the 
znode has been removed by the OPEN handler do we add a region to 
AM#this.servers? Or is the OPENING region handled elsewhere?
To answer this question, for the first part: say region R1 was initially on RS 
A and is now getting moved to RS B.  If RS A goes down, the existing SSH flow 
for RS A will get the regions from AM#this.servers and remove the same from 
this.regions.  Because something is already in RIT it will not assign the 
region, so the assignment happens smoothly to RS B.

Now take the case where RS B goes down.  In the existing code we never assigned 
those regions and left it to the TimeoutMonitor to assign them.  That is where 
this patch comes into play and tries to assign the region based on the 
regionPlan.  When SSH sees that the dead server is the destination of the 
regionPlan it will add the region to the new data structure and, based on that, 
it will start the assignment.  Surely this region will not be in 
AM#this.servers, as RS B is not the owner of it.

To answer the 2nd question, about OPENED regions: again this should not be a 
problem.  There are 2 cases here.
1. OpenedRegionHandler has completed. - This means that this.servers is already 
updated, and the region plan is also cleared.  Now when SSH starts it will see 
the region in AM#this.servers and there is nothing in RIT, so SSH will scan 
META, find this region, and go and assign it to a new RS.  (The patch is not 
needed here.)
2. OpenedRegionHandler has not yet completed. - Here there may be the case that 
the OpenedRegionHandler has not yet done its work, so the regionPlan is still 
available but this.servers and this.regions are not yet updated.  Now as per 
the patch we know that
{code}
  if (rit != null && !rit.isClosing() && !rit.isPendingClose()
      && !rit.isSplitting()
+     && !regionsFromRegionPlansForServer.contains(rit.getRegion())) {
{code}
The new data structure contains that region and hence we do not skip the 
assignment and go ahead with the assignment.  This region got added here 
because we have it in a RegionPlan with the destination being the dead server.
bq.we can do to better encapsulate this new state management? 
Yes we will try to do that.

bq.Can't we just keep list of dead servers and any attempt at getting a region 
plan related to the dead server or server being processed as dead is put off 
because SSH will get to it?
We are already doing this.  But we cannot guarantee at what point the RS went 
down: was it after forming the regionPlan or before forming the regionPlan?  If 
it is before forming the regionPlan, we are already excluding the dead servers.

Please correct me if I am wrong and if you have any doubts.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291565#comment-13291565
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

bq. This patch will not fix a concurrent balance that goes to move a region and 
a concurrent user-orchestrated move?
Currently it does not handle this.  But if we are sure that the region 
assignments go smoothly after this patch, we can surely have a mechanism to 
intimate the user that some plan is already in progress and wait for that to 
complete before we initiate the user-triggered move or assign.
@Enis
bq.very nervous about all the state combinations distributed between zk / meta 
/ rs-memory and master-memory.
Can you tell me any problems that you foresee?  Happy to address those here 
too.  In fact we have not introduced any new states; we are just using the maps 
and zookeeper states available.  The RegionPlan state we introduced here is 
something internal to tell the difference, which we will address in our next 
patch as per Stack's suggestion.
Your comments and feedback are always welcome.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291567#comment-13291567
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack 
Our comments crossed.  Once I refreshed I saw you have attached a patch.  Let 
me take a look at it.  Thanks :)





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291568#comment-13291568
 ] 

stack commented on HBASE-6060:
--

@Enis Yeah, it's a big problem.

Ram and Rajeshbabu, I just uploaded a spin on your patch.  It's a bit more basic.

It gets rid of REGION_PLAN_ALREADY_INUSE and returns null out of getRegionPlan 
instead.

If getRegionPlan returns null, we update the 'no servers' flag if there are no 
servers to assign to; otherwise we presume the region assignment is being done 
elsewhere (in SSH or in the TimeoutMonitor).

In AM, I changed the processServerShutdown to instead be 
processServerShutdownStart with a corresponding processServerShutdownFinish.  
When the first is called, we add the servername being processed to a list.  
When the latter is called, we remove it.  Getting a regionplan, we now exclude 
servers being processed from the set of assignable servers.  This simplifies 
getRegionPlan.

processServerShutdownStart returns the Set of outstandingRegionPlans -- plans 
that were related to the server that is being processed by shutdown handler -- 
and the intersection of regions-in-transition with the set of regions the 
server was carrying.  The patch then does as you were doing minus the removing 
of region from the Set that was kept back in AM.
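Roughly, the start/finish pairing described above could look like the following 
sketch (field and helper names are illustrative, not the uploaded patch):
{code}
private final Set<ServerName> serversBeingProcessed =
    Collections.synchronizedSet(new HashSet<ServerName>());

/** Called by ServerShutdownHandler before it starts reassigning. */
public Pair<Set<RegionPlan>, Set<RegionState>> processServerShutdownStart(
    final ServerName sn) {
  this.serversBeingProcessed.add(sn);
  // Hand back the outstanding plans targeting the dead server plus the
  // intersection of regions-in-transition with the regions it was carrying.
  return snapshotPlansAndRitFor(sn);  // assumed helper
}

/** Called on the tail of ServerShutdownHandler, success or failure. */
public void processServerShutdownFinish(final ServerName sn) {
  this.serversBeingProcessed.remove(sn);
}

// In getRegionPlan, candidate destinations simply exclude servers that are
// currently being processed:
//   candidates.removeAll(this.serversBeingProcessed);
{code}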

I've not verified this passes tests yet.  It still needs more work. Just 
putting it up as more feedback on your posted patch.

Regarding Enis's comment, I tried poking around some and it's hard to follow 
state transitions across memory, zk, and then across retries and force 
repossessions.  Any ideas for how we'd simplify all of this?  I started to pull 
out a RegionsInTransition class.  It would be a 'database' backed by zk.  There 
would be no more in-memory state, just what's out in zk.  You'd go to this new 
RIT db to get current state and to ask it to make transitions.  Just a thought.  
I haven't dug in enough.
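A very rough sketch of what that zk-backed RIT 'database' might expose (all 
names invented for illustration, not a committed class):
{code}
interface RegionsInTransitionStore {
  /** Current state of a region, read straight from its unassigned znode. */
  RegionState getState(String encodedRegionName) throws KeeperException;

  /**
   * Attempt an atomic transition (e.g. OFFLINE to OPENING).  Returns false if
   * the znode is not in the expected state or its version changed underneath
   * us, in which case the caller must give up its claim on the region.
   */
  boolean transition(String encodedRegionName, RegionState.State expected,
      RegionState.State target) throws KeeperException;

  /** Everything currently in transition, for SSH and the TimeoutMonitor. */
  List<RegionState> listInTransition() throws KeeperException;
}
{code}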





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291575#comment-13291575
 ] 

stack commented on HBASE-6060:
--

bq. We are only maintaining regions and the RITS. not the regionplans

Yes.  I should have said that.

Thanks for the explanation of OPENING.  I now understand.  Do you think we 
could get your explanation into comments in the code in any way to help those 
that come after?

bq. ... But we cannot guarantee at what point the RS had gone down, is it after 
forming the regionPlan or before forming the regionplan.  If it is before 
forming the regionplan we are already excluding the dead servers.

Hmm... ok.  That means that in the patch I uploaded, the new Set of servers 
being processed by shutdown handler is redundant with the deadServers Set that 
the ServerManager is keeping up.  I can simplify my patch further (if you think 
the way getRegionPlan works in my patch is ok).







[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291755#comment-13291755
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
Just to reiterate, this JIRA is a backport of HBASE-5396.  The current patch 
that you gave is similar to HBASE-5396 and it will solve the problem of 
assigning a region that had started opening on an RS when that RS went down.
But if we don't have any common sharing between AM and SSH:
{code}
RegionPlan regionPlan = getRegionPlan(regionState, sn, true);
{code}
In the first assign call we form a regionPlan.  Say we got the plan and issued 
sendRegionOpen(), and by that time the RS had gone down.  Now as per your patch 
(HBASE-5396) the SSH will trigger and it will start a new assignment.  At the 
same time
{code}
LOG.warn("Failed assignment of " +
  state.getRegion().getRegionNameAsString() + " to " +
  plan.getDestination() + ", trying to assign elsewhere instead; " +
  "retry=" + i, t);
// Clean out plan we failed execute and one that doesn't look like it'll
// succeed anyways; we need a new plan!
// Transition back to OFFLINE
state.update(RegionState.State.OFFLINE);
// Force a new plan and reassign.  Will return null if no servers.
RegionPlan newPlan = getRegionPlan(state, plan.getDestination(), true);
if (isNullPlan(newPlan, state, plan.getDestination())) {
  // Whether no servers to assign too or region assignment is being handled elsewhere,
  // skip out of this retry loop.
  break;
{code}
Excluding this server, we will call one more assign because of the retry logic.  
And that is where there is a chance of hitting the master abort problem.  In 
the current patch that you gave, getRegionPlan will create the above problem.

This is where we need to have a common structure and also a new state for the 
region plan to say whether or not to make another assign call through the AM.

{code}
if (!hijack && !state.isClosed() && !state.isOffline()) {
  String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
  this.master.abort(msg, new IllegalStateException(msg));
  return -1;
}
{code}
Now, the problem of the master abort was reported in HBASE-5816 (because 
HBASE-5396 is committed to 0.90).  What we are doing now is solving this as a 
whole so that we don't end up in any double assignment or any problem with 
master abort.
One more problem in the patch is that 'outstandingRegionPlans' should be 
cleared here:
{code}
} else if (addressFromAM != null && !addressFromAM.equals(this.serverName)) {
+  LOG.debug("Skip assigning region " + e.getKey().getRegionNameAsString()
+      + " because it has been opened in " + addressFromAM.getServerName());
+}
{code}
Stack, I feel that sharing something common is necessary to avoid this problem.  
Just to add: in HBASE-6147 I have given one patch which, when combined with the 
patch here, will solve many other issues also.

Do let me know if you still feel there are some more doubts on this.




[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291835#comment-13291835
 ] 

stack commented on HBASE-6060:
--

bq. Excluding this server we will call one more assign because of retry logic.  
And that is where there is a chance of getting the master abort problem.  In 
the current patch that u gave getRegionPlan will create the above problem.

Because we will create a new plan, and though the new plan will not land on the 
server that is being processed by the shutdown handler, the assign trying to 
update the znode will bump into the IllegalStateException because SSH may be 
trying to assign at the same time?  If so, that is true.  That is a problem.

But Ram, the problem with your patch is that getRegionPlan is being called with 
forceNewPlan=true and you are NOT forcing a new plan, right?  That is going to 
confuse no end.  We should break up getRegionPlan into region plan getting and 
then another method deciding whether to proceed with the assign?
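Something like the following split, sketched here with invented names 
(serversBeingProcessed is just an assumed set of servers SSH is working on), is 
the shape I have in mind:
{code}
// getRegionPlan stays a pure "pick a destination" call; a separate predicate
// answers "should this caller proceed with the assign at all?".
boolean shouldProceedWithAssign(final RegionState state, final RegionPlan plan) {
  if (plan == null) {
    // No servers to assign to; let the TimeoutMonitor retry later.
    return false;
  }
  if (this.serversBeingProcessed.contains(plan.getDestination())) {
    // SSH owns regions headed to this server; back off and let it assign.
    return false;
  }
  return true;
}
{code}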

bq. One more problem in patch is 'outstandingRegionPlans' should be cleared 
here...

Yes.  That is to prevent it being reassigned by SSH.  Good one.

I appreciate your bringing me along.  Will look more when I get to work; will 
look at HBASE-6147 too.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292114#comment-13292114
 ] 

stack commented on HBASE-6060:
--

Just some notes after looking more at this stuff.

We have two different assigns; one that assigns a single region and another 
that assigns multiple regions in one go.  The first takes out a lock on the 
RegionState before it runs and is careful about using existing assignment plans 
if they exist.  It'll prevent concurrent assigns, but it does not prevent 
HBASE-5816, where two assigns of the same region around the same time will 
cause a master exit because the second assign will find zk in an illegal state.

The second assign does not respect existing plans, nor does it synchronize on 
RegionState.  The second assign type is used for bulk assigning.  SSH now does 
a bulk assign, since HBASE-5914 (a recent change), once it has figured out the 
regions that need assigning.  The patch in here is careful about making sure 
the bulk assign is favored.  It tries to undo any ongoing assigns that are 
happening via the first method.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-07 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291416#comment-13291416
 ] 

stack commented on HBASE-6060:
--

I think I understand this patch now.

In AM, we introduce a new Map, a Map of dead servers to their current set of 
region plans and regions-in-transition.

This Map is populated in AM when SSH calls AM to say process a server that just 
died (processServerShutdown).

If we try to assign a region in another thread at this time, a check by AM into 
this new Map will cause us to return a 'DON'T DO ANYTHING' flag because it's 
being done elsewhere (in SSH eventually -- and we return a flag rather than 
null because null triggers a different kind of special handling).

If SSH is running, it is what is meant to prevail assigning regions rather than 
any balance or user assignment that may have been made concurrently.

When it finishes, SSH clears the dead server from the AM Map.

Does this sound right?

I have some feedback on details of this patch but I would first like to get 
high-level comments out of the way.

1. Is this patch even handling the root cause of this JIRA: i.e. dealing w/ 
OPENING znodes that were made against the dead server?  In 
AM#processServerShutdown, we iterate the list of regions we get back from 
AM#this.servers and from this set we will remove RIT.  But won't this original 
set of regions be missing regions that are OPENING or even OPENED but not yet 
handled because its only after the OPENED region has updated its znode and the 
znode has been removed by the OPEN handler do we add a region to 
AM#this.servers?  Or is the OPENING region handled elsewhere?

2. This patch distributes state and the updating of that state across AM and 
SSH.  It makes it tough to follow what is going on (at least, it makes it tough 
for lesser mortals like myself to figure it out).  The new Map is populated as 
a side-effect of the call to AM#processServerShutdown.  The dead server is 
cleared explicitly from the AM Map on the tail of SSH on its way out (reading 
SSH only, you'd be clueless why this was being done).  During the processing of 
SSH, we are updating a data structure that is held by this Map over in AM.  Is 
there anything we can do to better encapsulate this new state management?  What 
if we changed the method in AM from processServerShutdown to 
startProcessingServerShutdown and closed it off with a 
finishProcessingServerShutdown?  Do we have to keep around a list of 
regionplans?  Can't we just keep a list of dead servers, and any attempt at 
getting a region plan related to the dead server (or a server being processed 
as dead) is put off because SSH will get to it?

Here are some minor items while they are on my mind:

1. On REGION_PLAN_ALREADY_INUSE, why not return null, and if we get a null back 
from getRegionPlan in assign, check whether 
serverManager.createDestinationServersList(serverToExclude) is empty; if it is, 
update the timeout monitor (see the sketch after this list).  This'd be cleaner 
than adding this override of the RegionState data structure to do more than 
just carry a plan?
2. On the tail of SSH, we do 
this.services.getAssignmentManager().removeDeadServerRegions(this.serverName);  
Just before that we clear regionsFromRegionPlansForServer and 
regionsInTransition.  The clearings are unnecessary?
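For minor item 1 above, the caller-side check could be sketched roughly like 
this (illustrative only, not the actual assign code):
{code}
RegionPlan plan = getRegionPlan(state, serverToExclude, forceNewPlan);
if (plan == null) {
  List<ServerName> destinations =
      this.serverManager.createDestinationServersList(serverToExclude);
  if (destinations == null || destinations.isEmpty()) {
    // Genuinely no servers to assign to: log it and let the timeout monitor retry.
    LOG.warn("No available servers to assign " +
        state.getRegion().getRegionNameAsString());
  }
  // Either way this thread stops here; SSH or a later retry owns the assign.
  return;
}
{code}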

Good stuff Ram and Rajeshbabu.  I think this patch can work after a little 
cleanup.  I'll have some more feedback this evening.  Thought I'd send this in 
the meantime so you have something to chew on when you get up this morning.



[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-07 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291426#comment-13291426
 ] 

stack commented on HBASE-6060:
--

My other high level comment is:

3. This patch will not fix a concurrent balance that goes to move a region and 
a concurrent user-orchestrated move?





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-07 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291488#comment-13291488
 ] 

Enis Soztutar commented on HBASE-6060:
--

This issue, and the other related issues Ram has recently fixed, make me very 
nervous about all the state combinations distributed between zk / meta / 
rs-memory and master-memory.  After this is done, do you think we can come up 
with a simpler design?  I do not have any particular idea, so I am just 
spitballing here.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-06 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289974#comment-13289974
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

bq.Are you thinking that the patches posted earlier address this issue and 
hbase-5816?
Yes.  It addresses HBASE-5396, HBASE-6060 and HBASE-5816.
bq.If we are to continue to have multiple threads doing assign, then we need to 
have each claim ownership of the region first by writing zk and then verifying 
the seqid each time they move the region state. If they fail to grab the znode, 
then someone else is operating on it and they should give up.
This is already happening as per the current code.  Before the master tries to 
write the data to the znode there is a version check.  Even when the RS tries 
to update the znode there is a version check.  But on top of this there are 
also some in-memory states in the master which get updated by various threads, 
and that will also lead to some abnormalities like
{code}
if (!hijack && !state.isClosed() && !state.isOffline()) {
  String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
  this.master.abort(msg, new IllegalStateException(msg));
  return -1;
}
{code}
leading to issues like HBASE-5816. One more thing: in the patches attached here, 
the logic of manipulating the region plan happens inside the 
synchronized(regionPlans) block, which helps ensure that only one assigning 
thread succeeds, either AM or SSH.
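
For readers following along, here is a minimal sketch of the kind of 
version-checked znode update being described, written directly against the 
ZooKeeper client API; the class and method names are illustrative, not the 
actual ZKAssign code:
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Minimal sketch of a version-checked znode transition, assuming an already
// connected ZooKeeper handle. Paths and payloads are illustrative only.
public class VersionedTransitionSketch {
  /**
   * Attempts to move an unassigned znode to a new state. Succeeds only if no
   * one else has written the znode since we read it; a concurrent writer
   * bumps the version and we back off.
   */
  static boolean tryTransition(ZooKeeper zk, String znode, byte[] newState)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    zk.getData(znode, false, stat);                    // read current version
    try {
      zk.setData(znode, newState, stat.getVersion());  // CAS on that version
      return true;
    } catch (KeeperException.BadVersionException e) {
      // Someone else transitioned the node first; let them own it.
      return false;
    }
  }
}
{code}
A BadVersionException plays the role of "failing to grab the znode" in the 
quoted suggestion; the in-memory master states mentioned above have no 
comparable guard.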

[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289979#comment-13289979
 ] 

stack commented on HBASE-6060:
--

On the znode check, yes, the rs is doing the checks before it does any 
transition. This is not the case in the master when SSH or TimeoutMonitor 
assume they own a region (else we would not be running into 
IllegalStateExceptions).

Agreed, the updating of in-memory states also needs review.

Synchronizing on regionPlans while making the change will shut out other 
threads while the regionPlan is being manipulated, but after we leave the 
synchronization, is having a region plan named REGION_PLAN_ALREADY_INUSE the 
only way we have of blocking another thread trying to operate on a region? Can 
we not check RIT, or can we not look at the RegionPlan to see if it is in the 
process of being assigned (we could add another field to RegionPlan recording 
where it is in the assignment process)?
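
As a purely hypothetical illustration of that "extra field on RegionPlan" idea 
(none of these names exist in HBase; this is a sketch, not the proposal 
itself):
{code}
// Hypothetical sketch only: an explicit progress marker on a RegionPlan-like
// class, so another thread can see whether a plan is already being acted on.
enum AssignmentPhase { CREATED, OFFLINED_IN_ZK, RPC_SENT, OPENING, DONE }

class RegionPlanWithPhase {
  private AssignmentPhase phase = AssignmentPhase.CREATED;

  /** Atomically claim the next phase; returns false if another thread moved it. */
  synchronized boolean advance(AssignmentPhase expected, AssignmentPhase next) {
    if (phase != expected) {
      return false;   // someone else owns this plan right now
    }
    phase = next;
    return true;
  }

  synchronized AssignmentPhase getPhase() {
    return phase;
  }
}
{code}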

I'll re-review your last patch in the morning with the above in mind. Thanks 
for bringing me along.


[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-06 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289984#comment-13289984
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

Thanks Stack.
Just to give more details so that you can review with these points in mind as 
well. When SSH starts processing:
{code}
this.services.getAssignmentManager().
  processServerShutdown(this.serverName);
{code}
{code}
 synchronized (this.regionPlans) {
+  this.deadServerRegionsFromRegionPlan.put(sn, regionsOnDeadServer);
   for (Iterator<Map.Entry<String, RegionPlan>> i =
       this.regionPlans.entrySet().iterator(); i.hasNext();) {
     Map.Entry<String, RegionPlan> e = i.next();
@@ -3210,9 +3265,11 @@
     // The name will be null if the region is planned for a random assign.
     if (otherSn != null && otherSn.equals(sn)) {
       // Use iterator's remove else we'll get CME
+      regionsFromRegionPlansForServer.add(e.getValue().getRegionInfo());
       i.remove();
     }
   }
+  regionsOnDeadServer.setRegionsFromRegionPlansForServer(regionsFromRegionPlansForServer);
{code}
If SSH runs first and sees that a region plan's destination was the server that 
went down, then as per the patch we add that region plan into the new data 
structure. While this runs, getRegionPlan is blocked because the block is 
synchronized. When getRegionPlan then proceeds, we again synchronize on 
regionPlans:
{code}
synchronized (this.regionPlans) {
  ...
+  if (serverToExclude != null) {
+    RegionsOnDeadServer regionsOnDeadServer = this.deadServerRegionsFromRegionPlan
+        .get(serverToExclude);
+    if (regionsOnDeadServer != null
+        && regionsOnDeadServer.getRegionsFromRegionPlansForServer()
+            .contains(state.getRegion())) {
+      if (newPlan) {
+        this.regionPlans.remove(randomPlan.getRegionName());
+        LOG.info("Server shutdown handler already in progress for the region "
+            + randomPlan.getRegionName());
+        randomPlan = RegionPlan.REGION_PLAN_ALREADY_INUSE;
+      } else {
+        this.regionPlans.remove(existingPlan.getRegionName());
+        LOG.info("Server shutdown handler already in progress for the region "
+            + existingPlan.getRegionName());
+        existingPlan = RegionPlan.REGION_PLAN_ALREADY_INUSE;
+      }
+    }
+  }
{code}
In that case getRegionPlan sees that SSH has already populated the data 
structure and marks the plan as not to be used, so the assign does not happen. 
If instead getRegionPlan happens first, it will create a new region plan and 
add it to regionPlans, and its destination certainly will not be the server 
that went down, because that server is no longer in the online server list. 
When SSH's processServerShutdown then starts and iterates the region plans, it 
will not find the dead server as a destination, so SSH need not bother about 
it.

bq. (we could add another field to RegionPlan with where it is in the 
assignment process?)
In our first patch we used a boolean flag to say whether to use the plan or 
not. Later we changed it to special states.
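
A rough sketch of that sentinel-plan approach; only the constant name comes 
from the patch, everything else here is illustrative:
{code}
// Hedged sketch of the "special state" idea: a shared sentinel instance
// instead of a boolean "in use" flag. Only REGION_PLAN_ALREADY_INUSE is from
// the patch; the surrounding class and the caller-side check are invented.
class SentinelPlanSketch {
  static final SentinelPlanSketch REGION_PLAN_ALREADY_INUSE = new SentinelPlanSketch();
  // normal plan fields (region, source, destination) elided
}

// Caller side (illustrative): getRegionPlan returns the sentinel when SSH
// already owns the region, and the assign path simply backs off.
// if (plan == SentinelPlanSketch.REGION_PLAN_ALREADY_INUSE) {
//   return; // SSH is reassigning this region; skip this assign attempt
// }
{code}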








[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-06-06 Thread chunhui shen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289988#comment-13289988
 ] 

chunhui shen commented on HBASE-6060:
-

@ram
Thanks for explaining the details about double assignment.

Would it be possible to have only one thread doing the assign/unassign work? I 
mean, we would allow only one thread to handle RIT, and other threads would 
only add regions to RIT.
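
A minimal, purely illustrative sketch of that single-owner model (class and 
method names are invented here, not HBase APIs):
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of "one thread owns RIT": producers only enqueue region
// names, and a single worker performs every state transition, so there is no
// multi-threaded mutation of assignment state to reconcile.
public class SingleAssignerSketch {
  private final BlockingQueue<String> regionsInTransition =
      new LinkedBlockingQueue<String>();

  /** Any thread (SSH, timeout monitor, client request) may call this. */
  public void submit(String encodedRegionName) {
    regionsInTransition.add(encodedRegionName);
  }

  /** Only one thread ever runs this loop and touches assignment state. */
  public void runAssignLoop() throws InterruptedException {
    while (true) {
      String region = regionsInTransition.take();
      assignOrUnassign(region);   // placeholder for the actual transition work
    }
  }

  private void assignOrUnassign(String region) {
    // would do the ZK / meta / in-memory transitions here, single-threaded
  }
}
{code}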






