What's going on in the master? When a region times out in OPENING this
is what the TimeoutMonitor does:
// Attempt to transition node into OFFLINE
try {
data = new RegionTransitionData(
EventType.M_ZK_REGION_OFFLINE, regionInfo.getRegionName(),
master.getServerName());
if (ZKUtil.setData(watcher, node, data.getBytes(),
stat.getVersion())) {
// Node is now OFFLINE, let's trigger another assignment
ZKUtil.getDataAndWatch(watcher, node); // re-set the watch
LOG.info("Successfully transitioned region=" +
regionInfo.getRegionNameAsString() + " into OFFLINE" +
" and forcing a new assignment");
assigns.put(regionState.getRegion(), Boolean.TRUE);
}
So the node should be moved into OFFLINE, looks like it's not happening?
J-D
On Thu, May 26, 2011 at 12:15 AM, bijieshan <[email protected]> wrote:
> It caused the region couldn't been open anymore, for it has fallen into an
> loop of opening operations, but failed for each time. The Balancer would skip
> for the region still remain in RIT. So the regions looked un-balance between
> the regionservers.
>
> I describe the problem step by step as following:
>
> 1.HMaster send Msg to openregion on RS1.
> 2.RS1 received the Msg, and start to open the region. Before the opening,
> update the state of ZK node from offline to opening.
> 3.IOException happened while openRegion, so the opening failed.
> 4.The ZK node state was still opening.
> 5.HMaster TimeoutMonitor found the region-opening timeout, so send the
> opening Msg again. Maybe it send to RS2
> 6.RS2 execute the opening, while update the ZK node state, it got an
> unexpected state. So failed again.
> 7.Loop the steps from 5 to 6.
>
> And from the code:
>
> OpenRegionHandler#process
> if (!transitionZookeeperOfflineToOpening(encodedName)) {
> LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
> encodedName);
> return;
> }
>
> /************************************************************************/
> /*********IOException happened, region is
> null***************************/
>
> /************************************************************************/
> region = openRegion();
>
> /************************************************************************/
> /*********(region == null) is true, so return
> directly*******************/
>
> /************************************************************************/
> if (region == null) return;
> boolean failed = true;
> if (tickleOpening("post_region_open")) {
> if (updateMeta(region)) failed = false;
> }
>
> OpenRegionHandler#openRegion
> HRegion region = null;
> try {
>
> /************************************************************************/
> /*********IOException happened here..
> ***********************************/
>
> /************************************************************************/
> region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
> this.server.getConfiguration(), this.rsServices.getFlushRequester(),
> new CancelableProgressable() {
> public boolean progress() {
> return tickleOpening("open_region_progress");
> }
> });
> } catch (IOException e) {
> LOG.error("Failed open of region=" +
> this.regionInfo.getRegionNameAsString(), e);
> }
>
> return region;
>
> Here's the logs:
> 2011-05-20 15:49:48,122 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
> of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
> java.io.IOException: Exception occured while connecting to the server
> at
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
> at
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
> at
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
> at $Proxy6.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
> at
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
> at
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> 2011-05-20 16:21:27,731 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:21:27,732 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:21:27,732 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,737 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:30:27,738 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,738 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,747 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:48:27,748 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,748 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:51:27,748 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>