It caused the region couldn't been open anymore, for it has fallen into an loop 
of opening operations, but failed for each time. The Balancer would skip for 
the region still remain in RIT. So the regions looked un-balance between the 
regionservers.

I describe the problem step by step as following:

1.HMaster send Msg to openregion on RS1.
2.RS1 received the Msg, and start to open the region. Before the opening, 
update the state of ZK node from offline to opening.
3.IOException happened while openRegion, so the opening failed.
4.The ZK node state was still opening.
5.HMaster TimeoutMonitor found the region-opening timeout, so send the opening 
Msg again. Maybe it send to RS2
6.RS2 execute the opening, while update the ZK node state, it got an unexpected 
state. So failed again.
7.Loop the steps from 5 to 6.

And from the code:

OpenRegionHandler#process
      if (!transitionZookeeperOfflineToOpening(encodedName)) {
        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
          encodedName);
        return;
      }
      /************************************************************************/
      /*********IOException happened, region is null***************************/
      /************************************************************************/
      region = openRegion(); 
      /************************************************************************/
      /*********(region == null) is true, so return 
directly*******************/    
      
/************************************************************************/  
      if (region == null) return;
      boolean failed = true;
      if (tickleOpening("post_region_open")) {
        if (updateMeta(region)) failed = false;
      }

OpenRegionHandler#openRegion
    HRegion region = null;
    try {
      /************************************************************************/
      /*********IOException happened here.. 
***********************************/    
      /************************************************************************/
        region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
        new CancelableProgressable() {
          public boolean progress() {
            return tickleOpening("open_region_progress");
          }
        });
    } catch (IOException e) {
      LOG.error("Failed open of region=" +
        this.regionInfo.getRegionNameAsString(), e);
    }
    
    return region;

Here's the logs:
2011-05-20 15:49:48,122 ERROR 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of 
region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
java.io.IOException: Exception occured while connecting to the server
        at 
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
        at 
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
        at 
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
        at $Proxy6.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
        at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
        at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-05-20 16:21:27,731 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open 
of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:20020-0x3300c164fe0002c Attempting to transition node 
d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node 
for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING failed, the node existed but was in the state 
RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:21:27,732 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:21:27,732 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was 
hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:30:27,737 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:30:27,738 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open 
of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:20020-0x3300c164fe0002c Attempting to transition node 
d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING
2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node 
for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING failed, the node existed but was in the state 
RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:30:27,738 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:30:27,738 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was 
hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:48:27,747 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:48:27,747 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open 
of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:20020-0x3300c164fe0002c Attempting to transition node 
d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING
2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node 
for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING failed, the node existed but was in the state 
RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:48:27,748 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:48:27,748 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was 
hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:51:27,748 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.

Reply via email to