I was asking about what was going on in the master during that time; I would really like to see it. It should be some time after this exception:
2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.

About resetting the znode: as you can see, in TimeoutMonitor we don't really care whether it was reset or not, as it should take care of doing it. The issue here is getting at the root of the problem.

J-D

On Thu, May 26, 2011 at 6:37 PM, bijieshan <[email protected]> wrote:
> Thanks J-D.
> When a region times out in OPENING, TimeoutMonitor will find it and then assign it again. There's no doubt here.
> But while dealing with the ZK node state, something goes wrong. To describe it another way: when an IOException happens in openRegion(), the ZK node state should be returned to the offline state, but the current code seems to do nothing and just returns directly.
> So the problem is not in the TimeoutMonitor handling process.
>
> Maybe that's not correct; please share your viewpoint on it, thanks.
>
> Please see my comments on the code:
>
> OpenRegionHandler#process:
>
>   /***** Before the opening, transition the ZK node state from offline to opening first *****/
>   /***** So if it fails, this node state should be reset to offline *****/
>   if (!transitionZookeeperOfflineToOpening(encodedName)) {
>     LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>       encodedName);
>     return;
>   }
>   /***** The comments seem correct, but in fact the failure is not handled *****/
>   // Open region.  After a successful open, failures in subsequent
>   // processing needs to do a close as part of cleanup.
>   /***** When an IOException happens in openRegion(), region is null *****/
>   region = openRegion();
>   /***** So we return here but never reset the ZK state *****/
>   if (region == null) return;
>   boolean failed = true;
>   if (tickleOpening("post_region_open")) {
>     if (updateMeta(region)) failed = false;
>   }
>
> Jieshan Bean
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> Sent: May 27, 2011 1:02
> To: [email protected]
> Subject: Re: HRegion.openHRegion IOException caused an endless loop of opening--opening failed
>
> What's going on in the master? When a region times out in OPENING, this is what the TimeoutMonitor does:
>
>   // Attempt to transition node into OFFLINE
>   try {
>     data = new RegionTransitionData(
>       EventType.M_ZK_REGION_OFFLINE,
>       regionInfo.getRegionName(),
>       master.getServerName());
>     if (ZKUtil.setData(watcher, node, data.getBytes(),
>         stat.getVersion())) {
>       // Node is now OFFLINE, let's trigger another assignment
>       ZKUtil.getDataAndWatch(watcher, node); // re-set the watch
>       LOG.info("Successfully transitioned region=" +
>         regionInfo.getRegionNameAsString() + " into OFFLINE" +
>         " and forcing a new assignment");
>       assigns.put(regionState.getRegion(), Boolean.TRUE);
>     }
>
> So the node should be moved into OFFLINE; it looks like that's not happening?
>
> J-D
>
> On Thu, May 26, 2011 at 12:15 AM, bijieshan <[email protected]> wrote:
>> It causes the region to never be opened again, because it has fallen into a loop of opening attempts, each of which fails. The Balancer skips the region because it is still in RIT, so the regions look unbalanced across the regionservers.
>>
>> I'll describe the problem step by step as follows:
>>
>> 1. HMaster sends an open-region message for the region to RS1.
>> 2. RS1 receives the message and starts to open the region. Before the opening, it updates the state of the ZK node from offline to opening.
>> 3. An IOException happens in openRegion(), so the opening fails.
>> 4. The ZK node state is still opening.
>> 5. The HMaster TimeoutMonitor finds that the region opening has timed out, so it sends the open message again, possibly to RS2.
>> 6. RS2 executes the opening; while updating the ZK node state it finds an unexpected state, so it fails again.
>> 7. Steps 5 and 6 repeat in a loop.
>>
>> And from the code:
>>
>> OpenRegionHandler#process
>>   if (!transitionZookeeperOfflineToOpening(encodedName)) {
>>     LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>>       encodedName);
>>     return;
>>   }
>>   /***** An IOException happened, so region is null *****/
>>   region = openRegion();
>>   /***** (region == null) is true, so we return directly *****/
>>   if (region == null) return;
>>   boolean failed = true;
>>   if (tickleOpening("post_region_open")) {
>>     if (updateMeta(region)) failed = false;
>>   }
>>
>> OpenRegionHandler#openRegion
>>   HRegion region = null;
>>   try {
>>     /***** The IOException happened here *****/
>>     region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
>>       this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>>       new CancelableProgressable() {
>>         public boolean progress() {
>>           return tickleOpening("open_region_progress");
>>         }
>>       });
>>   } catch (IOException e) {
>>     LOG.error("Failed open of region=" +
>>       this.regionInfo.getRegionNameAsString(), e);
>>   }
>>   return region;
>>
>> Here are the logs:
>> 2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
>> java.io.IOException: Exception occured while connecting to the server
>>   at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>>   at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>>   at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>>   at $Proxy6.getFileInfo(Unknown Source)
>>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>>   at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>>   at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>>   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>>   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>>   at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>>   at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>   at java.lang.Thread.run(Thread.java:662)
>> 2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:30:27,737 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:48:27,747 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:51:27,748 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>
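
To make the gap concrete, here is a minimal, self-contained sketch (plain Java stubs, not the actual OpenRegionHandler code) of the change bijieshan's comments imply: if openRegion() returns null after an IOException, the handler should undo the OFFLINE-to-OPENING transition instead of returning with the znode stuck in OPENING. All names below (process, openRegion, transitionZookeeperOfflineToOpening, resetZookeeperNodeToOffline) are illustrative stand-ins, not HBase APIs.

    public class OpenRegionHandlerSketch {

      public void process(String encodedName) {
        // Move the unassigned znode from OFFLINE to OPENING, as the real handler does.
        if (!transitionZookeeperOfflineToOpening(encodedName)) {
          System.out.println("Region was hijacked? It no longer exists, encodedName=" + encodedName);
          return;
        }

        Object region = openRegion(encodedName);
        if (region == null) {
          // The gap reported above: the real code returns here and leaves the znode in
          // OPENING, so every later open attempt sees an unexpected state and fails.
          // This sketch instead puts the node back so the master can reassign cleanly.
          resetZookeeperNodeToOffline(encodedName);
          return;
        }
        // ... updateMeta / post-open processing would follow here ...
      }

      // Stubs standing in for the real ZK and region-open logic.
      private boolean transitionZookeeperOfflineToOpening(String encodedName) {
        return true;
      }

      private Object openRegion(String encodedName) {
        // Returns null when HRegion.openHRegion throws an IOException, as in the logs above.
        return null;
      }

      private void resetZookeeperNodeToOffline(String encodedName) {
        // Hypothetical cleanup: force the unassigned node back to OFFLINE (or delete it)
        // so the next assignment does not find a stale OPENING node.
        System.out.println("Resetting znode for " + encodedName + " back to OFFLINE");
      }
    }

In the real handler such a cleanup would presumably go through the same ZKAssign utilities the OFFLINE-to-OPENING transition already uses; the stub only shows where the call belongs.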

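And a toy model of why step 6 keeps failing: the transition is effectively a compare-and-set on the unassigned node's state, so a stale RS_ZK_REGION_OPENING value makes every later OFFLINE-to-OPENING attempt fail, which is exactly the repeating WARN pattern in the logs. This is a simplified illustration under those assumptions, not ZKAssign itself.

    enum NodeState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING }

    // In-memory stand-in for the unassigned znode; it starts in the stale state
    // left behind by the failed open.
    class UnassignedNode {
      private NodeState state = NodeState.RS_ZK_REGION_OPENING;

      // The transition succeeds only if the node is currently in the expected state,
      // mirroring the state/version check the real transition performs.
      synchronized boolean transition(NodeState expected, NodeState target) {
        if (state != expected) {
          System.out.println("Attempt to transition from " + expected + " to " + target
              + " failed, the node was in state " + state);
          return false;
        }
        state = target;
        return true;
      }
    }

    public class StuckOpeningDemo {
      public static void main(String[] args) {
        UnassignedNode node = new UnassignedNode();
        // What RS2 tries in step 6 above; it fails every time, so the loop never ends
        // until something resets the node to M_ZK_REGION_OFFLINE.
        boolean ok = node.transition(NodeState.M_ZK_REGION_OFFLINE, NodeState.RS_ZK_REGION_OPENING);
        System.out.println("open attempt succeeded? " + ok); // prints false
      }
    }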