I've just had another of those region server shutdowns (RegionServerSuicide).  

   * how do I avoid disruption when this sort of thing happens
   * what is the best procedure for recovering from such a thing (i.e., is 
there something to be done other than simply restarting the region server?)

I've heard others talk of this event as being easy to recover from.  I'd like 
to understand how I can get into a configuration where that is the case.  
Currently, an RS shutdown causes significant disruption.  Clients are unable to 
query HBase.   Simply restarting the region server doesn't improve things.  In 
every case, I've needed to do a complete restart of the entire cluster which 
isn't exactly super fast.  A single region server shutdown means a minimum of 
1/2 hour down time, and sometimes that is more like hours depending on the 
reason for the RS shutdown (once it took an entire day).  Is the experience 
similarly painful for others?

Common error messages on the client are: java.lang.RuntimeException: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up 
proxy to domU-12-31-39-18-1B-05.compute-1.internal/10.104.55.247:60020 after 
attempts=1
... and ... the RS complains with a "NotServingRegion" exception.    

As you can see form the selected messages below, after I restart the affected 
RS, the cluster is still unusable, and gives no indication that this state 
still change.  

It is of course no problem just restart a region server if it goes down (either 
with a script or manually), the real question is how do I avoid downtime?

Details are below...

Thank you for your help, 

Matthew

I'm running in EC2 on DFS from CDH3b2, using a version of HBase from 20100830.  

START OF THE SHUTDOWN ON AFFECTED RS: 

2010-09-22 22:00:39,887 DEBUG org.apache.hadoop.hbase.regionserver.Store: Major 
compaction triggered on store pg_payload; time since last major compaction 
93578344ms
2010-09-22 22:00:39,887 INFO org.apache.hadoop.hbase.regionserver.Store: 
Started compaction of 1 file(s) in pg_payload of 
userh,\x1D\xAA2f\xE0\x1F\x12:\xA4\x98r\xFF!\xB8\x1A`,1285083827570.6be256da616fc20a6ac78a41952104e8.
  into 
hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/userh/6be256da616fc20a6ac78a41952104e8/.tmp,
 sequenceid=77142618
2010-09-22 22:03:23,317 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
LRU Stats: total=1.57 GB, free=396.14 MB, max=1.95 GB, blocks=27819, 
accesses=38513233, hits=14159798, hitRatio=36.76%%, evictions=6494, 
evicted=20462105, evictedPerRun=3150.924560546875
2010-09-22 22:03:23,361 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 181390ms for sessionid 
0x12b266801550065, closing socket connection and attempting reconnect
2010-09-22 22:03:23,363 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 183256ms for sessionid 
0x12b266801550002, closing socket connection and attempting reconnect
2010-09-22 22:03:23,451 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server 
serverName=domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559, 
load=(requests=4458, regions=134, usedHeap=8872, maxHeap=9995): Unhandled 
exception
org.apache.hadoop.hbase.YouAreDeadException: 
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently 
processing domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 as 
dead server
        at 
org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
        at 
org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
        at 
org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:753)
        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
        at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)

THE RESTART ON THE AFFECTED RS: 

2010-09-22 22:10:04,534 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
2010-09-22 22:13:23,449 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 4628396768048238482 
lease expired
2010-09-22 22:13:23,449 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 891312872292456529 
lease expired
2010-09-22 22:13:23,449 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 2977737069391109815 
lease expired
2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.Leases: 
regionserver60020.leaseChecker closing leases
2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.Leases: 
regionserver60020.leaseChecker closed leases
2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Shutdown hook starting; hbase.shutdown.hook=true; 
fsShutdownHook=Thread[Thread-9,5,main]
2010-09-22 22:13:23,451 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Starting fs shutdown hook thread.
2010-09-22 22:13:23,452 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Shutdown hook finished.
Wed Sep 22 22:15:11 UTC 2010 Starting regionserver on domU-12-31-39-18-1B-05
2010-09-22 22:15:12,716 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: vmInputArguments=[-Xmx768m, 
-XX:+HeapDumpOnOutOfMemoryError, -XX:+UseConcMarkSweepGC, 
-XX:+CMSIncrementalMode, -XX:+HeapDumpOnOutOfMemoryError, 
-XX:+UseConcMarkSweepGC, -XX:+CMSIncrementalMode, -Xmx10000m, 
-Dhbase.log.dir=/mnt/tmp/hbaselog, 
-Dhbase.log.file=hbase-mlcamus-regionserver-domU-12-31-39-18-1B-05.log, 
-Dhbase.home.dir=/home/mlcamus/apps/hbase/bin/.., -Dhbase.id.str=mlcamus, 
-Dhbase.root.logger=INFO,DRFA]
...
2010-09-22 22:25:03,530 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:04,288 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:05,299 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:05,348 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:06,308 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:06,358 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:07,368 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:07,545 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:08,318 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:08,377 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:08,548 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:08,552 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
user,^...@^tj�,1284527871890.bfee076d62bd372a70299f5854517145.
2010-09-22 22:25:09,288 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:09,558 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
2010-09-22 22:25:09,698 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
.META.,,1
... (these messages continue)

ON THE MASTER WHEN THE RS WAS SHUTTING DOWN:

2010-09-22 21:59:51,588 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 
.META. region(s) scanned
2010-09-22 22:00:22,931 INFO org.apache.hadoop.hbase.master.ServerManager: 2 
region servers, 0 dead, average load 134.0
2010-09-22 22:00:39,331 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scanning meta region {server: 10.104.37.247:60020, 
regionname: -ROOT-,,0.70236052, startKey: <>}
2010-09-22 22:00:39,333 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scan of 1 row(s) of meta region {server: 
10.104.37.247:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
2010-09-22 22:00:51,368 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.metaScanner scanning meta region {server: 10.104.55.247:60020, 
regionname: .META.,,1.1028785192, startKey: <>}
2010-09-22 22:00:51,801 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.metaScanner scan of 266 row(s) of meta region {server: 
10.104.55.247:60020, regionname: .META.,,1.1028785192, startKey: <>} complete
2010-09-22 22:00:51,801 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 
.META. region(s) scanned
2010-09-22 22:01:22,941 INFO org.apache.hadoop.hbase.master.ServerManager: 2 
region servers, 0 dead, average load 134.0
2010-09-22 22:01:39,341 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scanning meta region {server: 10.104.37.247:60020, 
regionname: -ROOT-,,0.70236052, startKey: <>}
2010-09-22 22:01:39,343 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scan of 1 row(s) of meta region {server: 
10.104.37.247:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
2010-09-22 22:01:51,371 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.metaScanner scanning meta region {server: 10.104.55.247:60020, 
regionname: .META.,,1.1028785192, startKey: <>}
2010-09-22 22:02:22,951 INFO org.apache.hadoop.hbase.master.ServerManager: 2 
region servers, 0 dead, average load 134.0
2010-09-22 22:02:24,203 INFO org.apache.hadoop.hbase.master.ServerManager: 
domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 znode expired
2010-09-22 22:02:24,204 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
Added=domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 to dead 
servers, added shutdown processing operation
2010-09-22 22:02:24,204 INFO org.apache.hadoop.hbase.master.RegionManager: META 
region whose startkey is  removed from onlineMetaRegions
2010-09-22 22:02:24,204 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: 
ProcessServerShutdown of 
domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559
2010-09-22 22:02:24,204 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of 
server domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559: logSplit: 
false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
2010-09-22 22:02:24,242 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
Splitting 35 hlog(s) in 
hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559

ON THE MASTER AFTER GIVING A FEW MINUTES TO RECOVER: 

2010-09-22 22:31:00,300 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, 
onlineMetaRegions.size(): 0
2010-09-22 22:31:00,300 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all 
meta regions are online
2010-09-22 22:31:00,301 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: 
ProcessRegionClose of 
userh,R\x80\xBD\x0F\xCB\xEAf\x95"+\xE1\x9A)/\xD7\x19,1285081795205.bf1a30fb67c1b0ea421da1f458d83c44.,
 false, reassign: true
2010-09-22 22:31:00,301 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, 
onlineMetaRegions.size(): 0
2010-09-22 22:31:00,301 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all 
meta regions are online
2010-09-22 22:31:10,310 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: 
ProcessRegionClose of 
userh,R\x80\xBD\x0F\xCB\xEAf\x95"+\xE1\x9A)/\xD7\x19,1285081795205.bf1a30fb67c1b0ea421da1f458d83c44.,
 false, reassign: true
2010-09-22 22:31:10,310 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, 
onlineMetaRegions.size(): 0
2010-09-22 22:31:10,310 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all 
meta regions are online
2010-09-22 22:31:10,311 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: 
ProcessRegionClose of 
userh,\xE5\x1A\x93\xCEQ\xC9\x8E\x8DIlvh9\xE1\xC8O,1285067882016.8aecae7491b95dd76aefab0058a71cff.,
 false, reassign: true
2010-09-22 22:31:10,311 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, 
onlineMetaRegions.size(): 0
2010-09-22 22:31:10,311 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all 
meta regions are online
2010-09-22 22:31:20,320 DEBUG 
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: 
ProcessRegionClose of 
userh,\xE5\x1A\x93\xCEQ\xC9\x8E\x8DIlvh9\xE1\xC8O,1285067882016.8aecae7491b95dd76aefab00

WHEN RUNNING "list" FROM "hbase shell": 

org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2217)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1808)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
        at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)


Reply via email to