[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

2011-11-08 Thread mingjian (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146748#comment-13146748 ]

mingjian commented on HBASE-3914:
-

@stack:
I created HBASE-4762.

 ROOT region appeared in two regionserver's onlineRegions at the same time
 -

 Key: HBASE-3914
 URL: https://issues.apache.org/jira/browse/HBASE-3914
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.3
Reporter: Jieshan Bean
Assignee: Jieshan Bean
 Fix For: 0.90.4

 Attachments: HBASE-3914-V2.patch, HBASE-3914.patch


 This can happen, with low probability, through the following steps:
 (Assume the cluster nodes are named RS1/RS2/HM, and there are more than 
 10,000 regions in the cluster.)
 1. The ROOT region was opened on RS1.
 2. For some reason (perhaps the HDFS process became abnormal), RS1 aborted.
 3. ServerShutdownHandler processing started.
 4. HMaster was restarted. During finishInitialization, the ROOT region 
 location was unset and ROOT was assigned to RS2.
 5. The ROOT region was opened successfully on RS2.
 6. After a while, the ROOT region location was unset again by RS1's 
 ServerShutdownHandler, and ROOT was reassigned. RS1 had been restarted 
 before that, so there are two possibilities:
   Case a:
     ROOT was assigned to RS1.
     Nothing appears to be affected, but the ROOT region is still online 
     on RS2.

   Case b:
     ROOT was assigned to RS2.
     The ROOT region cannot be opened until it is reassigned to another 
     regionserver, because it already shows as online on this regionserver.
 The logs confirm this:
 1. The ROOT region was opened twice:
 2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
 2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
 2. Regionserver 162-2-16-6 aborted, so ROOT was reassigned to 162-2-77-0, 
 where it was already online:
 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
 10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
 10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server
 This can leave the ROOT region offline for a long time, even though it 
 happens only in a special scenario. I have checked the code, and there 
 appears to be a small bug.
 There are 2 references to assignRoot():
 1.
 HMaster# assignRootAndMeta:
 if (!catalogTracker.verifyRootRegionLocation(timeout)) {
   this.assignmentManager.assignRoot();
   this.catalogTracker.waitForRoot();
   assigned++;
 }
 2.
 ServerShutdownHandler#process:
   if (isCarryingRoot()) { // -ROOT-
     try {
       this.services.getAssignmentManager().assignRoot();
     } catch (KeeperException e) {
       this.server.abort("In server shutdown processing, assigning root", e);
       throw new IOException("Aborting", e);
     }
   }
 I think that each time assignRoot() is called, we should first verify the 
 ROOT region's location, because the ROOT region may already have been 
 assigned from elsewhere before this assignment runs.
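
 The verify-before-assign idea proposed above could be sketched roughly as 
 follows. This is only an illustration, not the attached patch: RootLocator, 
 RootAssigner and GuardedRootAssignment are hypothetical stand-ins for 
 CatalogTracker and AssignmentManager, and the synchronized guard is an 
 added assumption so that two callers in the same master cannot both pass 
 the check.

```java
/** Hypothetical stand-in for CatalogTracker.verifyRootRegionLocation(timeout). */
interface RootLocator {
    boolean verifyRootRegionLocation(long timeoutMs);
}

/** Hypothetical stand-in for AssignmentManager.assignRoot(). */
interface RootAssigner {
    void assignRoot();
}

/** Sketch: issue a ROOT assignment only when its location cannot be verified. */
final class GuardedRootAssignment {
    private final RootLocator locator;
    private final RootAssigner assigner;

    GuardedRootAssignment(RootLocator locator, RootAssigner assigner) {
        this.locator = locator;
        this.assigner = assigner;
    }

    /**
     * Returns true if an assignment was issued, false if ROOT was already
     * verified as online. Synchronized (an assumption of this sketch) so the
     * check and the assignment are not interleaved between two callers.
     */
    synchronized boolean verifyAndAssignRoot(long timeoutMs) {
        if (locator.verifyRootRegionLocation(timeoutMs)) {
            return false; // someone else already assigned ROOT; do nothing
        }
        assigner.assignRoot();
        return true;
    }
}
```

 With this shape, both the master start-up path and the shutdown handler 
 would go through the same guarded call, so the second caller finds ROOT 
 already online and skips the reassignment.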

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3914) ROOT region appeared in two regionserver's onlineRegions at the same time

2011-11-07 Thread mingjian (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146111#comment-13146111 ]

mingjian commented on HBASE-3914:
-

@stack: The following is our master log:
{noformat}
2011-10-19 19:13:34,873 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ipc.ServerNotRunningException: Server is not running yet
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1090)

    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:256)
    at $Proxy7.getRegionInfo(Unknown Source)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:424)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:471)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRoot(ServerShutdownHandler.java:90)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:126)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
{noformat}

After this, the -ROOT- region is never assigned, and clients keep failing like this:
{noformat}
2011-10-19 19:18:40,000 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=-ROOT-, metaLocation=address: dw79.kgb.sqa.cm4:60020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: -ROOT-,,0
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2771)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getClosestRowBefore(HRegionServer.java:1802)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:569)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1091)
{noformat}
So we should rewrite the verify method in both branch-0.90 and trunk.
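
One possible direction for such a rewrite, sketched under assumptions: the first trace shows ServerNotRunningException escaping from verifyRootRegionLocation inside verifyAndAssignRoot, which ends the handler before any assignment happens. The sketch below treats that condition as a failed attempt and retries instead of letting it propagate. ServerNotReadyException, RootProbe and TolerantRootVerifier are hypothetical names, not HBase classes, and the retry count is arbitrary.

```java
/** Hypothetical unchecked stand-in for the ServerNotRunningException in the log. */
final class ServerNotReadyException extends RuntimeException {
    ServerNotReadyException(String message) {
        super(message);
    }
}

/** Hypothetical probe: true when ROOT is confirmed online at its recorded location. */
interface RootProbe {
    boolean isRootOnline();
}

/** Sketch: a verify step that does not let "server not running yet" escape. */
final class TolerantRootVerifier {
    private TolerantRootVerifier() {}

    /**
     * Probes up to maxAttempts times. A definite answer is returned at once;
     * a ServerNotReadyException counts as a failed attempt and is retried.
     * Returns false when no attempt produced an answer, so the caller can
     * go on to assign ROOT instead of aborting the handler.
     */
    static boolean verify(RootProbe probe, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return probe.isRootOnline();
            } catch (ServerNotReadyException e) {
                // Target server is still starting; try again.
                // A real version would sleep or back off between attempts.
            }
        }
        return false;
    }
}
```

Whether "server not running yet" should count as "ROOT is not there" or should be waited out is a design decision for the actual fix; the sketch only shows that the exception is handled inside the verify step.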


[jira] [Commented] (HBASE-4377) [hbck] Offline rebuild .META. from fs data only.

2011-11-06 Thread mingjian (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145192#comment-13145192 ]

mingjian commented on HBASE-4377:
-

@Jonathan: If a region is in the middle of a split, how do we fix it when 
neither the parent nor the daughters are online?

 [hbck] Offline rebuild .META. from fs data only.
 

 Key: HBASE-4377
 URL: https://issues.apache.org/jira/browse/HBASE-4377
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.92.0
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh
 Attachments: 
 0001-HBASE-4377-hbck-Offline-rebuild-.META.-from-fs-data-.0.90-v4.patch, 
 0001-HBASE-4377-hbck-Offline-rebuild-.META.-from-fs-data-.0.90.v3.patch, 
 0001-HBASE-4377-hbck-Offline-rebuild-.META.-from-fs-data-.patch, 
 0001-HBASE-4377-hbck-Offline-rebuild-.META.-from-fs-data-.trunk.v3.patch, 
 0001-HBASE-4377-hbck-Offline-rebuild-.META.-from-fs-data.0.92.v1.patch, 
 0001-HBASE-4377-hbck-Offline-rebuild-.META.-from-fs-data.0.92.v2.patch, 
 EXT_AC.regioninfo, EXT_ATU_05f84d32cbc0bdabf00e00bc2f3570f0.regioninfo, 
 hbase-4377-trunk.v2.patch, hbase-4377.0.90.v6.patch, hbase-4377.trunk.v3.txt, 
 hbase-4377.trunk.v4.txt, hbase-4377.trunk.v5.txt, hbase-4377.trunk.v6.patch


 In a worst-case situation, it may be helpful to have an offline .META. 
 rebuilder that looks only at the file system's .regioninfo files and rebuilds 
 .META. from scratch.  Users could move bad regions out until there is a clean 
 rebuild.  
 It would likely fill in region split holes.  Follow-on work could give 
 options to merge or select regions that overlap, or do online rebuilds.
