What's going on in the master? When a region times out in OPENING this
is what the TimeoutMonitor does:

                  // Attempt to transition node into OFFLINE
                  try {
                    data = new RegionTransitionData(
                      EventType.M_ZK_REGION_OFFLINE, regionInfo.getRegionName(),
                      master.getServerName());
                    if (ZKUtil.setData(watcher, node, data.getBytes(),
                        stat.getVersion())) {
                      // Node is now OFFLINE, let's trigger another assignment
                      ZKUtil.getDataAndWatch(watcher, node); // re-set the watch
                      LOG.info("Successfully transitioned region=" +
                          regionInfo.getRegionNameAsString() + " into OFFLINE" +
                          " and forcing a new assignment");
                      assigns.put(regionState.getRegion(), Boolean.TRUE);
                    }

So the node should be moved into OFFLINE, looks like it's not happening?

J-D

On Thu, May 26, 2011 at 12:15 AM, bijieshan <[email protected]> wrote:
> It caused the region couldn't been open anymore, for it has fallen into an 
> loop of opening operations, but failed for each time. The Balancer would skip 
> for the region still remain in RIT. So the regions looked un-balance between 
> the regionservers.
>
> I describe the problem step by step as following:
>
> 1.HMaster send Msg to openregion on RS1.
> 2.RS1 received the Msg, and start to open the region. Before the opening, 
> update the state of ZK node from offline to opening.
> 3.IOException happened while openRegion, so the opening failed.
> 4.The ZK node state was still opening.
> 5.HMaster TimeoutMonitor found the region-opening timeout, so send the 
> opening Msg again. Maybe it send to RS2
> 6.RS2 execute the opening, while update the ZK node state, it got an 
> unexpected state. So failed again.
> 7.Loop the steps from 5 to 6.
>
> And from the code:
>
> OpenRegionHandler#process
>      if (!transitionZookeeperOfflineToOpening(encodedName)) {
>        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>          encodedName);
>        return;
>      }
>      
> /************************************************************************/
>      /*********IOException happened, region is 
> null***************************/
>      
> /************************************************************************/
>      region = openRegion();
>      
> /************************************************************************/
>      /*********(region == null) is true, so return 
> directly*******************/
>      
> /************************************************************************/
>      if (region == null) return;
>      boolean failed = true;
>      if (tickleOpening("post_region_open")) {
>        if (updateMeta(region)) failed = false;
>      }
>
> OpenRegionHandler#openRegion
>    HRegion region = null;
>    try {
>      
> /************************************************************************/
>      /*********IOException happened here.. 
> ***********************************/
>      
> /************************************************************************/
>        region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
>        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>        new CancelableProgressable() {
>          public boolean progress() {
>            return tickleOpening("open_region_progress");
>          }
>        });
>    } catch (IOException e) {
>      LOG.error("Failed open of region=" +
>        this.regionInfo.getRegionNameAsString(), e);
>    }
>
>    return region;
>
> Here's the logs:
> 2011-05-20 15:49:48,122 ERROR 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open 
> of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
> java.io.IOException: Exception occured while connecting to the server
>        at 
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>        at 
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>        at 
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>        at $Proxy6.getFileInfo(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>        at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>        at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>        at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-05-20 16:21:27,731 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:20020-0x3300c164fe0002c Attempting to transition node 
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned 
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING failed, the node existed but was in the state 
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:21:27,732 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
> transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:21:27,732 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was 
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,737 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:20020-0x3300c164fe0002c Attempting to transition node 
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned 
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING failed, the node existed but was in the state 
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:30:27,738 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
> transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,738 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was 
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,747 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:20020-0x3300c164fe0002c Attempting to transition node 
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned 
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING failed, the node existed but was in the state 
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:48:27,748 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
> transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,748 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was 
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:51:27,748 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>

Reply via email to