That sounds reasonable, Jieshan. Would you mind filing an issue referring to this mail thread? If you have a patch, that'd be excellent.
St.Ack

2011/5/23 bijieshan <[email protected]>:
> There are two references to assignRoot():
>
> 1. HMaster#assignRootAndMeta:
>
>   if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>     this.assignmentManager.assignRoot();
>     this.catalogTracker.waitForRoot();
>     assigned++;
>   }
>
> 2. ServerShutdownHandler#process:
>
>   if (isCarryingRoot()) { // -ROOT-
>     try {
>       this.services.getAssignmentManager().assignRoot();
>     } catch (KeeperException e) {
>       this.server.abort("In server shutdown processing, assigning root", e);
>       throw new IOException("Aborting", e);
>     }
>   }
>
> I think that each time assignRoot() is called, we should verify the ROOT
> region's location first, because the ROOT region could already have been
> assigned from another place before this assignment runs.
> Looking forward to any reply.
>
> Thanks!
>
> Regards,
> Jieshan Bean
>
>
> -----Original Message-----
> From: bijieshan [mailto:[email protected]]
> Sent: May 20, 2011 15:34
> To: [email protected]
> Cc: Chenjian
> Subject: ROOT region appeared in two regionserver's onlineRegions at the same time
>
> This can happen, with low probability, through the following steps
> (suppose the cluster nodes are named RS1/RS2/HM, and there are more than
> 10,000 regions in the cluster):
>
> 1. The ROOT region was opened on RS1.
> 2. For some reason (maybe the HDFS process became abnormal), RS1 aborted.
> 3. ServerShutdownHandler processing started.
> 4. HMaster was restarted. During finishInitialization, the ROOT region was
>    unset and assigned to RS2.
> 5. The ROOT region was opened successfully on RS2.
> 6. After a while, the ROOT region was unset again by RS1's
>    ServerShutdownHandler and then reassigned. Before that, RS1 had been
>    restarted. So there are two possibilities:
>    Case a:
>    The ROOT region was assigned to RS1.
>    Nothing seemed to be affected, but the ROOT region was still online
>    on RS2.
>
>    Case b:
>    The ROOT region was assigned to RS2.
>    The ROOT region couldn't be opened until it was reassigned to another
>    regionserver, because it was already shown as online on this regionserver.
>
> This can be seen in the logs:
>
> 1. The ROOT region was opened twice:
> 2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
> 2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
>
> 2. Regionserver 162-2-16-6 aborted, so the ROOT region was reassigned to
>    162-2-77-0, but it was already online on that server:
> 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
> 10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
> 10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server
>
> This could leave the ROOT region offline for a long time, though it only
> happens under a special scenario. I have checked the code, and it seems
> to be a small bug.
>
> Thanks!
>
> Regards,
> Jieshan Bean
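
For anyone following the thread, the proposal above (verify the ROOT location before every call to assignRoot()) could be sketched roughly as below. This is only an illustration, not HBase code and not a patch: RootAssignGuard, RootLocationVerifier, RootAssigner and assignRootIfNotOnline are hypothetical names that stand in for the roles of catalogTracker.verifyRootRegionLocation(timeout) and assignmentManager.assignRoot(), and checked exceptions such as KeeperException are left out for brevity.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these types exist in HBase. */
final class RootAssignGuard {

    /** Stand-in for the role of catalogTracker.verifyRootRegionLocation(timeout). */
    interface RootLocationVerifier {
        boolean verifyRootRegionLocation(long timeoutMs);
    }

    /** Stand-in for the role of assignmentManager.assignRoot(). */
    interface RootAssigner {
        void assignRoot();
    }

    /**
     * Issues an assignment only when verification reports that -ROOT- is not
     * currently served. Returns true if assignRoot() was called.
     */
    static boolean assignRootIfNotOnline(RootLocationVerifier verifier,
                                         RootAssigner assigner,
                                         long timeoutMs) {
        if (verifier.verifyRootRegionLocation(timeoutMs)) {
            return false; // already assigned elsewhere, so skip the reassignment
        }
        assigner.assignRoot();
        return true;
    }

    public static void main(String[] args) {
        AtomicInteger assignments = new AtomicInteger();
        RootAssigner assigner = assignments::incrementAndGet;

        // Step 6 of the scenario: the restarted master already placed -ROOT- on RS2,
        // so the late ServerShutdownHandler call is skipped.
        boolean late = assignRootIfNotOnline(timeout -> true, assigner, 1000L);

        // -ROOT- is really unassigned, so the assignment goes ahead.
        boolean needed = assignRootIfNotOnline(timeout -> false, assigner, 1000L);

        System.out.println(late + " " + needed + " " + assignments.get()); // prints "false true 1"
    }
}
```

Note that the check and the assignment are two separate steps here, so this sketch narrows the window described in the thread but does not by itself make verify-then-assign atomic; a real fix would have to consider that.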
