I've just had another of those region server shutdowns (RegionServerSuicide).
* how do I avoid disruption when this sort of thing happens
* what is the best procedure for recovering from such a thing (i.e., is
there something to be done other than simply restarting the region server?)
I've heard others talk of this event as being easy to recover from. I'd like
to understand how I can get into a configuration where that is the case.
Currently, an RS shutdown causes significant disruption. Clients are unable to
query HBase. Simply restarting the region server doesn't improve things. In
every case, I've needed to do a complete restart of the entire cluster which
isn't exactly super fast. A single region server shutdown means a minimum of
1/2 hour down time, and sometimes that is more like hours depending on the
reason for the RS shutdown (once it took an entire day). Is the experience
similarly painful for others?
Common error messages on the client are: java.lang.RuntimeException:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up
proxy to domU-12-31-39-18-1B-05.compute-1.internal/10.104.55.247:60020 after
attempts=1
... and ... the RS complains with a "NotServingRegion" exception.
As you can see form the selected messages below, after I restart the affected
RS, the cluster is still unusable, and gives no indication that this state
still change.
It is of course no problem just restart a region server if it goes down (either
with a script or manually), the real question is how do I avoid downtime?
Details are below...
Thank you for your help,
Matthew
I'm running in EC2 on DFS from CDH3b2, using a version of HBase from 20100830.
START OF THE SHUTDOWN ON AFFECTED RS:
2010-09-22 22:00:39,887 DEBUG org.apache.hadoop.hbase.regionserver.Store: Major
compaction triggered on store pg_payload; time since last major compaction
93578344ms
2010-09-22 22:00:39,887 INFO org.apache.hadoop.hbase.regionserver.Store:
Started compaction of 1 file(s) in pg_payload of
userh,\x1D\xAA2f\xE0\x1F\x12:\xA4\x98r\xFF!\xB8\x1A`,1285083827570.6be256da616fc20a6ac78a41952104e8.
into
hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/userh/6be256da616fc20a6ac78a41952104e8/.tmp,
sequenceid=77142618
2010-09-22 22:03:23,317 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
LRU Stats: total=1.57 GB, free=396.14 MB, max=1.95 GB, blocks=27819,
accesses=38513233, hits=14159798, hitRatio=36.76%%, evictions=6494,
evicted=20462105, evictedPerRun=3150.924560546875
2010-09-22 22:03:23,361 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 181390ms for sessionid
0x12b266801550065, closing socket connection and attempting reconnect
2010-09-22 22:03:23,363 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 183256ms for sessionid
0x12b266801550002, closing socket connection and attempting reconnect
2010-09-22 22:03:23,451 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server
serverName=domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559,
load=(requests=4458, regions=134, usedHeap=8872, maxHeap=9995): Unhandled
exception
org.apache.hadoop.hbase.YouAreDeadException:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently
processing domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 as
dead server
at
org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
at
org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
at
org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:753)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
THE RESTART ON THE AFFECTED RS:
2010-09-22 22:10:04,534 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
2010-09-22 22:13:23,449 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 4628396768048238482
lease expired
2010-09-22 22:13:23,449 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 891312872292456529
lease expired
2010-09-22 22:13:23,449 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 2977737069391109815
lease expired
2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.Leases:
regionserver60020.leaseChecker closing leases
2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.Leases:
regionserver60020.leaseChecker closed leases
2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook:
Shutdown hook starting; hbase.shutdown.hook=true;
fsShutdownHook=Thread[Thread-9,5,main]
2010-09-22 22:13:23,451 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook:
Starting fs shutdown hook thread.
2010-09-22 22:13:23,452 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook:
Shutdown hook finished.
Wed Sep 22 22:15:11 UTC 2010 Starting regionserver on domU-12-31-39-18-1B-05
2010-09-22 22:15:12,716 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: vmInputArguments=[-Xmx768m,
-XX:+HeapDumpOnOutOfMemoryError, -XX:+UseConcMarkSweepGC,
-XX:+CMSIncrementalMode, -XX:+HeapDumpOnOutOfMemoryError,
-XX:+UseConcMarkSweepGC, -XX:+CMSIncrementalMode, -Xmx10000m,
-Dhbase.log.dir=/mnt/tmp/hbaselog,
-Dhbase.log.file=hbase-mlcamus-regionserver-domU-12-31-39-18-1B-05.log,
-Dhbase.home.dir=/home/mlcamus/apps/hbase/bin/.., -Dhbase.id.str=mlcamus,
-Dhbase.root.logger=INFO,DRFA]
...
2010-09-22 22:25:03,530 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:04,288 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:05,299 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:05,348 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:06,308 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:06,358 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:07,368 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:07,545 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:08,318 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:08,377 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:08,548 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:08,552 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
user,^...@^tj�,1284527871890.bfee076d62bd372a70299f5854517145.
2010-09-22 22:25:09,288 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:09,558 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
2010-09-22 22:25:09,698 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
.META.,,1
... (these messages continue)
ON THE MASTER WHEN THE RS WAS SHUTTING DOWN:
2010-09-22 21:59:51,588 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1
.META. region(s) scanned
2010-09-22 22:00:22,931 INFO org.apache.hadoop.hbase.master.ServerManager: 2
region servers, 0 dead, average load 134.0
2010-09-22 22:00:39,331 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.104.37.247:60020,
regionname: -ROOT-,,0.70236052, startKey: <>}
2010-09-22 22:00:39,333 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.104.37.247:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
2010-09-22 22:00:51,368 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.104.55.247:60020,
regionname: .META.,,1.1028785192, startKey: <>}
2010-09-22 22:00:51,801 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 266 row(s) of meta region {server:
10.104.55.247:60020, regionname: .META.,,1.1028785192, startKey: <>} complete
2010-09-22 22:00:51,801 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1
.META. region(s) scanned
2010-09-22 22:01:22,941 INFO org.apache.hadoop.hbase.master.ServerManager: 2
region servers, 0 dead, average load 134.0
2010-09-22 22:01:39,341 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.104.37.247:60020,
regionname: -ROOT-,,0.70236052, startKey: <>}
2010-09-22 22:01:39,343 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.104.37.247:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
2010-09-22 22:01:51,371 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.104.55.247:60020,
regionname: .META.,,1.1028785192, startKey: <>}
2010-09-22 22:02:22,951 INFO org.apache.hadoop.hbase.master.ServerManager: 2
region servers, 0 dead, average load 134.0
2010-09-22 22:02:24,203 INFO org.apache.hadoop.hbase.master.ServerManager:
domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 znode expired
2010-09-22 22:02:24,204 DEBUG org.apache.hadoop.hbase.master.ServerManager:
Added=domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 to dead
servers, added shutdown processing operation
2010-09-22 22:02:24,204 INFO org.apache.hadoop.hbase.master.RegionManager: META
region whose startkey is removed from onlineMetaRegions
2010-09-22 22:02:24,204 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
ProcessServerShutdown of
domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559
2010-09-22 22:02:24,204 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of
server domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559: logSplit:
false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
2010-09-22 22:02:24,242 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
Splitting 35 hlog(s) in
hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559
ON THE MASTER AFTER GIVING A FEW MINUTES TO RECOVER:
2010-09-22 22:31:00,300 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1,
onlineMetaRegions.size(): 0
2010-09-22 22:31:00,300 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all
meta regions are online
2010-09-22 22:31:00,301 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
ProcessRegionClose of
userh,R\x80\xBD\x0F\xCB\xEAf\x95"+\xE1\x9A)/\xD7\x19,1285081795205.bf1a30fb67c1b0ea421da1f458d83c44.,
false, reassign: true
2010-09-22 22:31:00,301 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1,
onlineMetaRegions.size(): 0
2010-09-22 22:31:00,301 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all
meta regions are online
2010-09-22 22:31:10,310 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
ProcessRegionClose of
userh,R\x80\xBD\x0F\xCB\xEAf\x95"+\xE1\x9A)/\xD7\x19,1285081795205.bf1a30fb67c1b0ea421da1f458d83c44.,
false, reassign: true
2010-09-22 22:31:10,310 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1,
onlineMetaRegions.size(): 0
2010-09-22 22:31:10,310 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all
meta regions are online
2010-09-22 22:31:10,311 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
ProcessRegionClose of
userh,\xE5\x1A\x93\xCEQ\xC9\x8E\x8DIlvh9\xE1\xC8O,1285067882016.8aecae7491b95dd76aefab0058a71cff.,
false, reassign: true
2010-09-22 22:31:10,311 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1,
onlineMetaRegions.size(): 0
2010-09-22 22:31:10,311 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not all
meta regions are online
2010-09-22 22:31:20,320 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
ProcessRegionClose of
userh,\xE5\x1A\x93\xCEQ\xC9\x8E\x8DIlvh9\xE1\xC8O,1285067882016.8aecae7491b95dd76aefab00
WHEN RUNNING "list" FROM "hbase shell":
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2217)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1808)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)