[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover
[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292646#comment-13292646 ] ramkrishna.s.vasudevan commented on HBASE-6060: --- @Stack A few comments. First of all, we need some more changes if we want to make the above change in handleRegion:
{code}
// Should see OPENING after we have asked it to OPEN or additional
// times after already being in state of OPENING
if (regionState == null ||
    (!regionState.isPendingOpen() && !regionState.isOpening())) {
  LOG.warn("Received OPENING for region " + prettyPrintedRegionName +
      " from server " + data.getOrigin() + " but region was in " +
      " the state " + regionState + " and not " +
      "in expected PENDING_OPEN or OPENING states");
}
{code}
Second: do you mean the above change of setting PENDING_OPEN after the RPC call, together with the changes in the earlier patches, i.e. the new items 'ritintersection' and 'outstandingRegionPlans'? One concern here: if the OpenRegionHandler has really transitioned the znode to OPENING before we update the state to PENDING_OPEN on the master side, then the in-memory state will also change to OPENING once we correct the code I mentioned above. In that case the OPENING state would be rewritten to PENDING_OPEN, so we may need to add a check for whether the RegionState has already changed. The problem with Rajesh's patch (6060_suggestion_toassign_rs_wentdown_beforerequest.patch) is:
{code}
+    if ((region.getState() == RegionState.State.OFFLINE)
+        && (region.getState() == RegionState.State.PENDING_OPEN)) {
+      regionPlans.remove(region.getRegion());
+    }
{code}
What if the RS went down before sending the RPC? The SSH collected the region plan, but if, before it could remove the collected plan in the code above, the master completed the retry assignment and the OpenRegionHandler changed the state to OPENING, then we will try to assign again through SSH.
The above problem can also happen with the change you mentioned, i.e. moving PENDING_OPEN to after the RPC. So can we take a copy of RIT before forming the RegionPlan and work based on that? Will update the change that we are suggesting in some time. > Regions's in OPENING state from failed regionservers takes a long time to > recover > - > > Key: HBASE-6060 > URL: https://issues.apache.org/jira/browse/HBASE-6060 > Project: HBase > Issue Type: Bug > Components: master, regionserver >Reporter: Enis Soztutar >Assignee: rajeshbabu > Fix For: 0.96.0, 0.94.1, 0.92.3 > > Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, > 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, > 6060-trunk_3.patch, 6060_alternative_suggestion.txt, > 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, > 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, > HBASE-6060-92.patch, HBASE-6060-94.patch > > > we have seen a pattern in tests, that the regions are stuck in OPENING state > for a very long time when the region server who is opening the region fails. > My understanding of the process: > > - master calls rs to open the region. If rs is offline, a new plan is > generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in > master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), > HMaster.assign() > - RegionServer, starts opening a region, changes the state in znode. But > that znode is not ephemeral. (see ZkAssign) > - Rs transitions zk node from OFFLINE to OPENING. See > OpenRegionHandler.process() > - rs then opens the region, and changes znode from OPENING to OPENED > - when rs is killed between OPENING and OPENED states, then zk shows OPENING > state, and the master just waits for rs to change the region state, but since > rs is down, that wont happen. > - There is a AssignmentManager.TimeoutMonitor, which does exactly guard > against these kind of conditions.
It periodically checks (every 10 sec by > default) the regions in transition to see whether they timedout > (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, > which explains what you and I are seeing. > - ServerShutdownHandler in Master does not reassign regions in OPENING > state, although it handles other states. > Lowering that threshold from the configuration is one option, but still I > think we can do better. > Will investigate more. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
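The overwrite Ram is worried about (a late in-memory PENDING_OPEN update clobbering an OPENING state that the ZK callback has already applied) can be guarded with a compare-and-set on the in-memory state. A minimal sketch with hypothetical names, not the actual RegionState API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the master's in-memory region state: transitions
// only move "forward", so a late PENDING_OPEN write cannot clobber an OPENING
// state that the ZK callback has already applied.
public class RegionStateGuard {
  public enum State { OFFLINE, PENDING_OPEN, OPENING, OPENED }

  private final AtomicReference<State> state =
      new AtomicReference<>(State.OFFLINE);

  /** Master side: move to PENDING_OPEN only if the region is still OFFLINE. */
  public boolean markPendingOpen() {
    return state.compareAndSet(State.OFFLINE, State.PENDING_OPEN);
  }

  /** ZK callback side: OPENING wins over OFFLINE and PENDING_OPEN. */
  public void markOpening() {
    State s = state.get();
    while (s == State.OFFLINE || s == State.PENDING_OPEN) {
      if (state.compareAndSet(s, State.OPENING)) {
        return;
      }
      s = state.get();
    }
  }

  public State get() {
    return state.get();
  }
}
```

With a guard like this, if the OpenRegionHandler's callback has already moved the state to OPENING, the master's markPendingOpen() simply fails instead of rewriting the state, which is the extra check Ram suggests.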
[jira] [Updated] (HBASE-6195) Increment data will lost when the memstore flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Shi updated HBASE-6195: Attachment: HBASE-6195-trunk-V2.patch The previous patch didn't move the 'now' variable inside the lock; this version fixes that. > Increment data will lost when the memstore flushed > -- > > Key: HBASE-6195 > URL: https://issues.apache.org/jira/browse/HBASE-6195 > Project: HBase > Issue Type: Bug > Components: regionserver >Reporter: Xing Shi > Attachments: HBASE-6195-trunk-V2.patch, HBASE-6195-trunk.patch > > > There are two problems in increment() now: > First: > I see that the timestamp(the variable now) in HRegion's Increment() is > generated before got the rowLock, so when there are multi-thread increment > the same row, although it generate earlier, it may got the lock later. > Because increment just store one version, so till now, the result will still > be right. > When the region is flushing, these increment will read the kv from snapshot > and memstore with whose timestamp is larger, and write it back to memstore. > If the snapshot's timestamp larger than the memstore, the increment will got > the old data and then do the increment, it's wrong. > Secondly: > Also there is a risk in increment. Because it writes the memstore first and > then HLog, so if it writes HLog failed, the client will also read the > incremented value. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6195) Increment data will lost when the memstore flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Shi updated HBASE-6195: Attachment: HBASE-6195-trunk.patch The patch. > Increment data will lost when the memstore flushed > -- > > Key: HBASE-6195 > URL: https://issues.apache.org/jira/browse/HBASE-6195 > Project: HBase > Issue Type: Bug > Components: regionserver >Reporter: Xing Shi > Attachments: HBASE-6195-trunk.patch > > > There are two problems in increment() now: > First: > I see that the timestamp(the variable now) in HRegion's Increment() is > generated before got the rowLock, so when there are multi-thread increment > the same row, although it generate earlier, it may got the lock later. > Because increment just store one version, so till now, the result will still > be right. > When the region is flushing, these increment will read the kv from snapshot > and memstore with whose timestamp is larger, and write it back to memstore. > If the snapshot's timestamp larger than the memstore, the increment will got > the old data and then do the increment, it's wrong. > Secondly: > Also there is a risk in increment. Because it writes the memstore first and > then HLog, so if it writes HLog failed, the client will also read the > incremented value. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6195) Increment data will lost when the memstore flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292626#comment-13292626 ] Xing Shi commented on HBASE-6195: - Oh sorry, there is no delete in my test case. > Increment data will lost when the memstore flushed > -- > > Key: HBASE-6195 > URL: https://issues.apache.org/jira/browse/HBASE-6195 > Project: HBase > Issue Type: Bug > Components: regionserver >Reporter: Xing Shi > > There are two problems in increment() now: > First: > I see that the timestamp(the variable now) in HRegion's Increment() is > generated before got the rowLock, so when there are multi-thread increment > the same row, although it generate earlier, it may got the lock later. > Because increment just store one version, so till now, the result will still > be right. > When the region is flushing, these increment will read the kv from snapshot > and memstore with whose timestamp is larger, and write it back to memstore. > If the snapshot's timestamp larger than the memstore, the increment will got > the old data and then do the increment, it's wrong. > Secondly: > Also there is a risk in increment. Because it writes the memstore first and > then HLog, so if it writes HLog failed, the client will also read the > incremented value. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6195) Increment data will lost when the memstore flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292625#comment-13292625 ] Xing Shi commented on HBASE-6195: - Here is the data. I delete the row first, then use 2000 threads to increment one row, each incrementing 1000 times; after all threads are done, I read the incremented row's value. I repeat this 11 times:
{code}
for i in `seq 0 10`
do
  /home/shubao.sx/hadoop-0.20.2-cdh3u3/bin/hadoop --config /home/shubao.sx/0.90-hadoop-config \
    jar /home/shubao.sx/inc-no-delete/inc.jar com.taobao.hbase.MultiThreadsIncrement \
    --threadNum 2000 --inc 1000 > /home/shubao.sx/inc-no-delete/inc.$i.log
done
{code}
and the results:
{code}
inc.0.log : return 199838
inc.1.log : return 399729
inc.2.log : return 599579
inc.3.log : return 799441
inc.4.log : return 999305
inc.5.log : return 1199173
inc.6.log : return 1399037
inc.7.log : return 1598939
inc.8.log : return 1798804
inc.9.log : return 1998708
inc.10.log : return 2198637
{code}
Because I set the HLog parameters hbase.regionserver.logroll.multiplier=0.005 and hbase.regionserver.maxlogs=3, the memstore flush occurs often. > Increment data will lost when the memstore flushed > -- > > Key: HBASE-6195 > URL: https://issues.apache.org/jira/browse/HBASE-6195 > Project: HBase > Issue Type: Bug > Components: regionserver >Reporter: Xing Shi > > There are two problems in increment() now: > First: > I see that the timestamp(the variable now) in HRegion's Increment() is > generated before got the rowLock, so when there are multi-thread increment > the same row, although it generate earlier, it may got the lock later. > Because increment just store one version, so till now, the result will still > be right. > When the region is flushing, these increment will read the kv from snapshot > and memstore with whose timestamp is larger, and write it back to memstore. > If the snapshot's timestamp larger than the memstore, the increment will got > the old data and then do the increment, it's wrong. > Secondly: > Also there is a risk in increment. Because it writes the memstore first and > then HLog, so if it writes HLog failed, the client will also read the > incremented value. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6195) Increment data will lost when the memstore flushed
Xing Shi created HBASE-6195: --- Summary: Increment data will lost when the memstore flushed Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi There are two problems in increment() now. First: the timestamp (the variable 'now') in HRegion's increment() is generated before the rowLock is taken, so when multiple threads increment the same row, a thread that generated its timestamp earlier may acquire the lock later. Because increment stores just one version, the result is still right up to this point. But when the region is flushing, an increment reads the KV from whichever of the snapshot and the memstore has the larger timestamp and writes it back to the memstore. If the snapshot's timestamp is larger than the memstore's, the increment reads the old data and then increments it, which is wrong. Secondly: there is another risk in increment. Because it writes to the memstore first and then to the HLog, if the HLog write fails the client can still read the incremented value. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
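The first problem can be illustrated in isolation. The sketch below is hypothetical, not HBase code: it models a store that keeps one value per timestamp and serves reads from the entry with the newest timestamp. If a thread samples its timestamp before winning the row lock, its increment can become invisible to readers:

```java
import java.util.TreeMap;

// Hypothetical model: a store keyed by timestamp whose readers take the
// value at the newest timestamp, like a one-version-per-read store.
public class StaleTimestampDemo {

  static long read(TreeMap<Long, Long> store) {
    return store.isEmpty() ? 0 : store.lastEntry().getValue();
  }

  /**
   * Two threads each increment once. Thread A sampled now=1 and thread B
   * sampled now=2, but B acquired the row lock first. Returns the value a
   * newest-timestamp reader sees afterwards.
   */
  static long demo() {
    TreeMap<Long, Long> store = new TreeMap<>();
    long tsA = 1, tsB = 2;
    store.put(tsB, read(store) + 1); // B increments: store holds (2 -> 1)
    store.put(tsA, read(store) + 1); // A increments with its stale ts: (1 -> 2)
    return read(store);              // the reader still sees B's value
  }

  public static void main(String[] args) {
    System.out.println("visible value after two increments: " + demo());
  }
}
```

Two increments ran, yet the newest-timestamp read returns only one of them. Sampling 'now' after the lock is acquired (so A would have gotten ts=3) makes the last write also the newest timestamp, which is the fix the patch takes.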
[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover
[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292589#comment-13292589 ] stack commented on HBASE-6060: -- Rajesh says: bq. Lets suppose Region server went down after spawning OpenRegionHandler and before transitioning to OPENING then its SSH responsibility to assign regions in OFFLINE/PENDING_OPEN. I say: bq. Shouldn't the region belong to the master or the regionserver, without a gray area in-between, while PENDING_OPEN is going on? Ram says: bq. Stack, in this case first the region belongs to master. Only after the RS changes to OPENING the znode of the regions belongs to the RS. So, Rajesh identifies a hole, I claim the hole is murky and underspecified, and Ram claims there is no hole really. Below I argue that there is a hole, and that a small change cleans up the RegionState states, making SSH processing cleaner. Ram, what you say is true if you are looking at znode states only. If you are looking at RegionState, the in-memory reflection of what a region's state is according to the master, then what PENDING_OPEN covers -- a state that does not have a corresponding znode state -- is unclear. I want to rely on what's in RegionState when figuring out what SSH should process (Rajesh's latest patch seems to want to walk this path too). Currently, PENDING_OPEN spans the master's sending of the open rpc. It is set before we do the rpc invocation, so if the regionserver goes down while a region's state is PENDING_OPEN, should it be handled by SSH, or will it get retried by the single-assign method? I can't tell for sure. If the regionserver went down while the rpc was outstanding, the single-assign will retry. It will actually set the RegionState back to OFFLINE temporarily -- which makes it even harder to figure out what's going on when looking from another thread. PENDING_OPEN as-is is worse than useless.
How about this, Ram and Rajesh:
{code}
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
@@ -1715,14 +1715,18 @@ public class AssignmentManager extends ZooKeeperListener {
       try {
         LOG.debug("Assigning region " + state.getRegion().getRegionNameAsString() +
           " to " + plan.getDestination().toString());
-        // Transition RegionState to PENDING_OPEN
-        state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
-          plan.getDestination());
         // Send OPEN RPC. This can fail if the server on other end is is not up.
         // Pass the version that was obtained while setting the node to OFFLINE.
         RegionOpeningState regionOpenState = serverManager.sendRegionOpen(plan
           .getDestination(), state.getRegion(), versionOfOfflineNode);
-        if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
+        if (regionOpenState.equals(RegionOpeningState.OPENED)) {
+          // Transition RegionState to PENDING_OPEN. It covers the period between the send of the
+          // rpc and our getting the callback setting the region state to OPENING. This is an
+          // in-memory only change. Out in zk the znode is OFFLINE and we are waiting on the
+          // regionserver to assume ownership by moving it to OPENING.
+          state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
+            plan.getDestination());
+        } else if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
           // Remove region from in-memory transition and unassigned node from ZK
           // While trying to enable the table the regions of the table were
           // already enabled.
{code}
Here we set the region to PENDING_OPEN AFTER we send the open rpc. Now we know that a region that is PENDING_OPEN will not be retried by the single-assign, and the state is clear: it's the period after the open rpc but before we get the znode callback which sets the RegionState to OPENING. Over in SSH, I can safely add PENDING_OPEN regions to the set of those to bulk assign if they are against the dead server currently being processed. What do you fellas think?
I need to look at OFFLINE states too to see if they will always get retried by single-assign. If so, we can leave these out of the SSH recover. > Regions's in OPENING state from failed regionservers takes a long time to > recover > - > > Key: HBASE-6060 > URL: https://issues.apache.org/jira/browse/HBASE-6060 > Project: HBase > Issue Type: Bug > Components: master, regionserver >Reporter: Enis Soztutar >Assignee: rajeshbabu > Fix For: 0.96.0, 0.94.1, 0.92.3 > > Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, > 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, > 6060-trunk_3.patch, 6
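The ordering stack proposes can be sketched as a toy model (hypothetical names, not the real AssignmentManager API): a region becomes PENDING_OPEN only after the open RPC has been sent successfully, so a PENDING_OPEN region on a dead server is unambiguously the ServerShutdownHandler's to reclaim, while an OFFLINE one is left to the single-assign retry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the proposed ordering. Names like assign/regionsToReassign are
// illustrative, not the actual HBase methods.
public class AssignSketch {
  public enum State { OFFLINE, PENDING_OPEN, OPENING, OPENED }

  public static final Map<String, State> regionStates = new HashMap<>();
  public static final Map<String, String> regionServer = new HashMap<>();

  /** Stand-in for ServerManager.sendRegionOpen: true if the RPC got through. */
  static boolean sendOpenRpc(String region, String server, boolean serverUp) {
    return serverUp;
  }

  public static void assign(String region, String server, boolean serverUp) {
    regionStates.put(region, State.OFFLINE); // znode is OFFLINE at this point
    if (sendOpenRpc(region, server, serverUp)) {
      // Only after the RPC succeeds do we know the server owns the open.
      regionStates.put(region, State.PENDING_OPEN);
      regionServer.put(region, server);
    }
    // Otherwise the region stays OFFLINE and the single-assign retry owns it.
  }

  /** SSH: collect regions the dead server acknowledged but never opened. */
  public static List<String> regionsToReassign(String deadServer) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, State> e : regionStates.entrySet()) {
      if (e.getValue() == State.PENDING_OPEN
          && deadServer.equals(regionServer.get(e.getKey()))) {
        out.add(e.getKey());
      }
    }
    return out;
  }
}
```

In this model there is no gray area: PENDING_OPEN always means "the open RPC reached this server", so SSH can bulk-assign exactly those regions when that server dies.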
[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover
[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292549#comment-13292549 ] ramkrishna.s.vasudevan commented on HBASE-6060: --- @Ted Either Rajesh or I will investigate the test failures tomorrow once we reach the office. > Regions's in OPENING state from failed regionservers takes a long time to > recover > - > > Key: HBASE-6060 > URL: https://issues.apache.org/jira/browse/HBASE-6060 > Project: HBase > Issue Type: Bug > Components: master, regionserver >Reporter: Enis Soztutar >Assignee: rajeshbabu > Fix For: 0.96.0, 0.94.1, 0.92.3 > > Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, > 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, > 6060-trunk_3.patch, 6060_alternative_suggestion.txt, > 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, > 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, > HBASE-6060-92.patch, HBASE-6060-94.patch > > > we have seen a pattern in tests, that the regions are stuck in OPENING state > for a very long time when the region server who is opening the region fails. > My understanding of the process: > > - master calls rs to open the region. If rs is offline, a new plan is > generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in > master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), > HMaster.assign() > - RegionServer, starts opening a region, changes the state in znode. But > that znode is not ephemeral. (see ZkAssign) > - Rs transitions zk node from OFFLINE to OPENING. See > OpenRegionHandler.process() > - rs then opens the region, and changes znode from OPENING to OPENED > - when rs is killed between OPENING and OPENED states, then zk shows OPENING > state, and the master just waits for rs to change the region state, but since > rs is down, that wont happen.
> - There is a AssignmentManager.TimeoutMonitor, which does exactly guard > against these kind of conditions. It periodically checks (every 10 sec by > default) the regions in transition to see whether they timedout > (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, > which explains what you and I are seeing. > - ServerShutdownHandler in Master does not reassign regions in OPENING > state, although it handles other states. > Lowering that threshold from the configuration is one option, but still I > think we can do better. > Will investigate more. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4368) Expose processlist in shell (per regionserver and perhaps by cluster)
[ https://issues.apache.org/jira/browse/HBASE-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shahin Saneinejad updated HBASE-4368: - Attachment: HBASE-4368.patch Attached a patch, I'd appreciate any feedback. I'm planning on adding unit tests modelled on admin_test.rb as soon as I figure out how to run individual jruby unit tests (is there a maven option?). > Expose processlist in shell (per regionserver and perhaps by cluster) > - > > Key: HBASE-4368 > URL: https://issues.apache.org/jira/browse/HBASE-4368 > Project: HBase > Issue Type: Task > Components: shell >Reporter: stack > Labels: noob > Attachments: HBASE-4368.patch > > > HBASE-4057 adds processlist and it shows in the RS UI. This issue is about > getting the processlist to show in the shell, like it does in mysql. > Labelling it noob; this is a pretty substantial issue but it shouldn't be too > hard -- it'd mostly be plumbing from RS into the shell. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4791) Allow Secure Zookeeper JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file)
[ https://issues.apache.org/jira/browse/HBASE-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-4791: --- Attachment: HBASE-4791-v0.patch I've attached a first draft patch that allows the Master, Region Servers and Quorum Peer to be started without {code}-Djava.security.auth.login.config=jaas.conf{code} using the hbase-site.xml configuration instead:
* hbase.zookeeper.client.keytab.file
* hbase.zookeeper.client.kerberos.principal
The "Client" properties are used by the HBase Master and Region Servers.
* hbase.zookeeper.server.keytab.file
* hbase.zookeeper.server.kerberos.principal
The "Server" properties are used by the Quorum Peer when ZooKeeper is not external. You still need to specify the login.config -D option when using the hbase shell or your own client application. _Refactoring hadoop.security.UserGroupInformation a bit and extracting HadoopConfiguration, we could remove the JaasConfiguration code and simplify the ZK login._ > Allow Secure Zookeeper JAAS configuration to be programmatically set (rather > than only by reading JAAS configuration file) > -- > > Key: HBASE-4791 > URL: https://issues.apache.org/jira/browse/HBASE-4791 > Project: HBase > Issue Type: Improvement >Reporter: Eugene Koontz >Assignee: Eugene Koontz > Labels: security, zookeeper > Attachments: HBASE-4791-v0.patch > > > In the currently proposed fix for HBASE-2418, there must be a JAAS file > specified in System.setProperty("java.security.auth.login.config"). > However, it might be preferable to construct a JAAS configuration > programmatically, as is done with secure Hadoop (see > https://github.com/apache/hadoop-common/blob/a48eceb62c9b5c1a5d71ee2945d9eea2ed62527b/src/java/org/apache/hadoop/security/UserGroupInformation.java#L175). > This would have the benefit of avoiding a usage of a system property setting, > and allow instead an HBase-local configuration setting. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
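For reference, the programmatic approach mirrors what secure Hadoop does: subclass javax.security.auth.login.Configuration and build the Krb5LoginModule entry from configuration values. A sketch assuming illustrative class and section names (the patch's actual names may differ):

```java
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Sketch: build the JAAS "Client" section in code, from values that would
// come from hbase-site.xml, instead of reading a jaas.conf file. The class
// name ProgrammaticJaas is illustrative.
public class ProgrammaticJaas extends Configuration {
  private final String keytab;
  private final String principal;

  public ProgrammaticJaas(String keytab, String principal) {
    this.keytab = keytab;
    this.principal = principal;
  }

  @Override
  public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
    if (!"Client".equals(name)) {
      return null; // only the ZooKeeper client section is defined here
    }
    Map<String, String> opts = new HashMap<>();
    opts.put("useKeyTab", "true");
    opts.put("storeKey", "true");
    opts.put("keyTab", keytab);
    opts.put("principal", principal);
    return new AppConfigurationEntry[] {
      new AppConfigurationEntry(
          "com.sun.security.auth.module.Krb5LoginModule",
          AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
          opts)
    };
  }
}
```

A process would install it with Configuration.setConfiguration(new ProgrammaticJaas(keytab, principal)) before the ZooKeeper login, avoiding the java.security.auth.login.config system property entirely.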
[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover
[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292530#comment-13292530 ] Zhihong Ted Yu commented on HBASE-6060: --- For 6060_suggestion_toassign_rs_wentdown_beforerequest.patch: Can you give the following variable a better name?
{code}
+Set regionPlans = new ConcurrentSkipListSet();
{code}
The set doesn't hold region plans. The following javadoc needs to be adjusted accordingly:
{code}
+ * @return Pair that has all regionplans that pertain to this dead server and a list that has
{code}
{code}
+    if ((region.getState() == RegionState.State.OFFLINE)
+        && (region.getState() == RegionState.State.PENDING_OPEN)) {
{code}
A region cannot be in both states at the same time; '||' should be used instead of '&&'.
{code}
+deadRegions = new TreeSet(assignedRegions);
{code}
Since the fulfillment of deadRegions above is in a different code block from the following:
{code}
if (deadRegions.remove(region.getRegion())) {
{code}
running testSSHWhenSourceRSandDestRSInRegionPlanGoneDown (from v3) leads to an NPE w.r.t. deadRegions. After fixing the above, testSSHWhenSourceRSandDestRSInRegionPlanGoneDown still fails.
> Regions's in OPENING state from failed regionservers takes a long time to > recover > - > > Key: HBASE-6060 > URL: https://issues.apache.org/jira/browse/HBASE-6060 > Project: HBase > Issue Type: Bug > Components: master, regionserver >Reporter: Enis Soztutar >Assignee: rajeshbabu > Fix For: 0.96.0, 0.94.1, 0.92.3 > > Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, > 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, > 6060-trunk_3.patch, 6060_alternative_suggestion.txt, > 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, > 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, > HBASE-6060-92.patch, HBASE-6060-94.patch > > > we have seen a pattern in tests, that the regions are stuck in OPENING state > for a very long time when the region server who is opening the region fails. > My understanding of the process: > > - master calls rs to open the region. If rs is offline, a new plan is > generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in > master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), > HMaster.assign() > - RegionServer, starts opening a region, changes the state in znode. But > that znode is not ephemeral. (see ZkAssign) > - Rs transitions zk node from OFFLINE to OPENING. See > OpenRegionHandler.process() > - rs then opens the region, and changes znode from OPENING to OPENED > - when rs is killed between OPENING and OPENED states, then zk shows OPENING > state, and the master just waits for rs to change the region state, but since > rs is down, that wont happen. > - There is a AssignmentManager.TimeoutMonitor, which does exactly guard > against these kind of conditions. It periodically checks (every 10 sec by > default) the regions in transition to see whether they timedout > (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, > which explains what you and I are seeing. 
> - ServerShutdownHandler in Master does not reassign regions in OPENING > state, although it handles other states. > Lowering that threshold from the configuration is one option, but still I > think we can do better. > Will investigate more. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira