This left the region unable to open at all: it fell into a loop of open attempts, each of which failed. The Balancer skipped the region because it remained in RIT, so the regions looked unbalanced across the regionservers.
The problem, step by step:
1. HMaster sends an open-region request to RS1.
2. RS1 receives the request and starts to open the region. Before opening, it transitions the ZK node state from OFFLINE to OPENING.
3. An IOException is thrown in openRegion, so the open fails.
4. The ZK node is left in the OPENING state.
5. The HMaster TimeoutMonitor notices that the region open has timed out and sends the open request again, possibly to RS2.
6. RS2 starts the open, but when it tries to update the ZK node state it finds an unexpected state (OPENING instead of OFFLINE), so it fails as well.
7. Steps 5 and 6 repeat forever (a sketch of the loop follows this list).
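To make the loop concrete, here is a small, self-contained Java sketch that models the state handling described above. It is NOT HBase code; the enum, the znode field, and the method names are illustrative stand-ins for the real ZKAssign/OpenRegionHandler logic. It shows that because a failed open never resets the znode, every retry finds OPENING where it expects OFFLINE, refuses the transition, and leaves the state untouched, so the loop never ends.

  // Standalone simulation of the stuck-state loop described in steps 1-7.
  public class StuckOpenSimulation {

    enum NodeState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING }

    // Models the single unassigned znode for the region.
    static NodeState znode = NodeState.M_ZK_REGION_OFFLINE;

    // Models transitionZookeeperOfflineToOpening(): only allowed from OFFLINE.
    static boolean transitionOfflineToOpening() {
      if (znode != NodeState.M_ZK_REGION_OFFLINE) {
        System.out.println("Failed transition from OFFLINE to OPENING, node is " + znode);
        return false;
      }
      znode = NodeState.RS_ZK_REGION_OPENING;
      return true;
    }

    // Models OpenRegionHandler#process on a regionserver.
    static void process(boolean openThrowsIOException) {
      if (!transitionOfflineToOpening()) {
        return;                      // the "Region was hijacked?" path
      }
      if (openThrowsIOException) {
        // openRegion() caught an IOException and returned null; process()
        // returns here with the znode still in RS_ZK_REGION_OPENING.
        return;
      }
      // a successful open would continue here (update META, mark OPENED, ...)
    }

    public static void main(String[] args) {
      process(true);                 // steps 2-4: RS1 fails, znode stuck in OPENING
      for (int retry = 1; retry <= 3; retry++) {
        process(false);              // steps 5-7: every retry refuses the transition
      }
    }
  }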
In the actual code:
OpenRegionHandler#process:

  if (!transitionZookeeperOfflineToOpening(encodedName)) {
    LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
        encodedName);
    return;
  }
  // The IOException happens inside openRegion(), so region comes back null...
  region = openRegion();
  // ...(region == null) is true, so we return directly -- note that nothing
  // moves the znode out of RS_ZK_REGION_OPENING on this path.
  if (region == null) return;
  boolean failed = true;
  if (tickleOpening("post_region_open")) {
    if (updateMeta(region)) failed = false;
  }
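One possible direction for a fix (a hedged sketch only, not the project's actual change) is to have process() put the znode back into a state the master can act on when openRegion() comes back null, instead of returning with the node stuck in OPENING. In terms of the illustrative simulation class above, that would look roughly like this:

  // Added to the StuckOpenSimulation class above: a process() variant that
  // undoes the OFFLINE -> OPENING transition when the open fails, so the
  // master's next assignment attempt starts from a clean OFFLINE node.
  static void processWithReset(boolean openThrowsIOException) {
    if (!transitionOfflineToOpening()) {
      return;
    }
    if (openThrowsIOException) {
      // The open failed; roll the znode back instead of leaving it in OPENING.
      znode = NodeState.M_ZK_REGION_OFFLINE;
      return;
    }
    // a successful open would continue as before
  }

With this variant, the retry in step 5 would find the node back in OFFLINE and could proceed normally. Whether the real fix should roll back to OFFLINE, delete the node, or move it to a dedicated failed-open state is a design question for the actual patch.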
OpenRegionHandler#openRegion:

  HRegion region = null;
  try {
    // The IOException is thrown here, inside HRegion.openHRegion().
    region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
        new CancelableProgressable() {
          public boolean progress() {
            return tickleOpening("open_region_progress");
          }
        });
  } catch (IOException e) {
    // The exception is only logged; the caller simply gets null back.
    LOG.error("Failed open of region=" +
        this.regionInfo.getRegionNameAsString(), e);
  }
  return region;
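In the logs below, the first ERROR entry shows the IOException from a failed open, and the repeated WARN blocks show a region whose open request keeps being re-sent by the master and keeps failing the OFFLINE to OPENING transition because the znode is still in RS_ZK_REGION_OPENING.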
Here are the logs:
2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
java.io.IOException: Exception occured while connecting to the server
        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
        at $Proxy6.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
        at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:30:27,737 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:48:27,747 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:51:27,748 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.