After tracing through the logs and the code, I found the problem.
Maybe I didn't describe the problem clearly the first time; the title was also confusing.

Let me try again and show the scenario that creates the problem:

1. HMaster assigned region A to RS1, so the RegionState was set to
PENDING_OPEN.
2. Because there were too many opening requests, the open processing on RS1 was blocked.
3. Some time later, the TimeoutMonitor found that the assignment of A had timed out. Since the
RegionState was PENDING_OPEN, it went into the following handler code (which just
puts the region into a waiting-to-assign set):

   case PENDING_OPEN:
      LOG.info("Region has been PENDING_OPEN for too " +
          "long, reassigning region=" +
          regionInfo.getRegionNameAsString());
      assigns.put(regionState.getRegion(), Boolean.TRUE);
      break; 
So we can see that in this case the code assumes the ZK node state is OFFLINE.
In a normal flow, that assumption is fine.

4. But before the reassignment actually happened, the backlogged open requests on RS1 were
processed. That interfered with the new assignment, because the stale open updated the ZK node
state from OFFLINE to OPENING.

5. The new assignment then started and sent the region to be opened on RS2. During that
open, RS2 needed to transition the ZK node state from OFFLINE to OPENING. Since the
current state was already OPENING, that transition failed (a toy sketch of this
expected-state check follows below). So the region could never be opened successfully.
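
To make step 5 concrete, below is a toy model (plain Java, not the real HBase/ZKAssign code; all names here are made up for illustration) of a transition that only succeeds when the node is currently in the state the caller expects. That is exactly why RS2's OFFLINE-to-OPENING attempt is rejected once the stale open on RS1 has already moved the node to OPENING:

   // Toy model of the expected-state check; not HBase code.
   enum NodeState { OFFLINE, OPENING, OPENED }

   final class UnassignedNode {
     private NodeState state = NodeState.OFFLINE;

     // Succeeds only when the current state matches what the caller expects,
     // mirroring the conditional znode transition.
     synchronized boolean transition(NodeState expected, NodeState next) {
       if (state != expected) {
         return false;   // e.g. RS2 expects OFFLINE but finds OPENING
       }
       state = next;
       return true;
     }
   }

   public class TransitionRace {
     public static void main(String[] args) {
       UnassignedNode node = new UnassignedNode();
       // Stale open on RS1 runs first: OFFLINE -> OPENING succeeds.
       System.out.println("RS1 open: " + node.transition(NodeState.OFFLINE, NodeState.OPENING));
       // The new assignment to RS2 tries the same transition and fails.
       System.out.println("RS2 open: " + node.transition(NodeState.OFFLINE, NodeState.OPENING));
     }
   }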

So I think that to avoid this problem, in the PENDING_OPEN case of the
TimeoutMonitor, we should first transition the ZK node state back to OFFLINE.
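
Roughly, something like this in the TimeoutMonitor's PENDING_OPEN case (only a sketch of the idea; the watcher/master fields and the exact signature of ZKAssign.createOrForceNodeOffline are assumptions on my side and may differ in our version):

   case PENDING_OPEN:
      LOG.info("Region has been PENDING_OPEN for too " +
          "long, reassigning region=" +
          regionInfo.getRegionNameAsString());
      try {
        // Force the unassigned znode back to OFFLINE so a stale OPENING
        // written by the old regionserver cannot block the reassignment.
        ZKAssign.createOrForceNodeOffline(watcher, regionInfo,
            master.getServerName());
      } catch (KeeperException ke) {
        LOG.error("Error forcing node offline for " +
            regionInfo.getRegionNameAsString(), ke);
        break;
      }
      assigns.put(regionState.getRegion(), Boolean.TRUE);
      break;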

Thanks!

Jieshan Bean 

------------------------

Hi,
During that time, there were too many regions being assigned at once.
I have read the related code, but the problem still has me scratching my head.
The fact is that the region could not open because the ZK state was not the expected one:

2011-05-20 16:02:58,993 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:20020-0x1300c11b4f30051 Attempt to transition the unassigned node 
for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING failed, the node existed but was in the state 
RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161

So the question is: under what conditions could the states become inconsistent like this?

This is a segment of the HMaster logs from around that time (there are many logs
like this):

15:49:47,864 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning 
region ufdr,051410,1305873959469.14cfc2222fff69c0b44bf2cdc9e20dd1. to 
157-5-111-13,20020,1305877624933
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENED, 
server=157-5-111-14,20020,1305877627727, region=5910a81f573f8e9e255db473e9407ab4
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Forcing OFFLINE; 
was=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. 
state=PENDING_OPEN, ts=1305877600490
2011-05-20 15:49:47,867 DEBUG 
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
event for 5910a81f573f8e9e255db473e9407ab4; deleting unassigned node
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
No previous transition plan was found (or we are ignoring an existing plan) for 
ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. so generated a 
random one; hri=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb., 
src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null) 
available servers

Regards,
Jieshan Bean



--------------

I was asking about what was going on in the master during that time; I
really would like to see it. It should be some time after that
exception:

2011-05-20 15:49:48,122 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.

About resetting the znode, as you can see in TimeoutMonitor we don't
really care if it was reset or not as it should take care of doing it.
The issue here is getting at the root of the problem.

J-D
