It seem that the log splitting didn't succeed (can't tell why, small log snippet is too small). Did it get stuck on EOF or something like that? In any case, it looks like a bug.
> * how do I avoid disruption when this sort of thing happens Your region server was unavailable for 3 minutes, probably a GC pause (did you enable GC logging?), so the master splits the logs after it was notified via a ZK watcher. The timeout is 1 minute, you can set that lower if you want to answer to region server failure faster (but be more affected by shorter GC pauses or similar). You can also lower the recovery time by setting a smaller hbase.regionserver.maxlogs (which is 32 by default), at the expense of flushing a lot more. > * what is the best procedure for recovering from such a thing (i.e., is > there something to be done other than simply restarting the region server?) It's fatal if the master isn't able to complete log splitting properly, so restarting is your best bet until we can figure how you got stuck there. J-D On Wed, Sep 22, 2010 at 3:41 PM, Matthew LeMieux <[email protected]> wrote: > I've just had another of those region server shutdowns (RegionServerSuicide). > > * how do I avoid disruption when this sort of thing happens > * what is the best procedure for recovering from such a thing (i.e., is > there something to be done other than simply restarting the region server?) > > I've heard others talk of this event as being easy to recover from. I'd like > to understand how I can get into a configuration where that is the case. > Currently, an RS shutdown causes significant disruption. Clients are unable > to query HBase. Simply restarting the region server doesn't improve things. > In every case, I've needed to do a complete restart of the entire cluster > which isn't exactly super fast. A single region server shutdown means a > minimum of 1/2 hour down time, and sometimes that is more like hours > depending on the reason for the RS shutdown (once it took an entire day). Is > the experience similarly painful for others? > > Common error messages on the client are: java.lang.RuntimeException: > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up > proxy to domU-12-31-39-18-1B-05.compute-1.internal/10.104.55.247:60020 after > attempts=1 > ... and ... the RS complains with a "NotServingRegion" exception. > > As you can see form the selected messages below, after I restart the affected > RS, the cluster is still unusable, and gives no indication that this state > still change. > > It is of course no problem just restart a region server if it goes down > (either with a script or manually), the real question is how do I avoid > downtime? > > Details are below... > > Thank you for your help, > > Matthew > > I'm running in EC2 on DFS from CDH3b2, using a version of HBase from 20100830. > > START OF THE SHUTDOWN ON AFFECTED RS: > > 2010-09-22 22:00:39,887 DEBUG org.apache.hadoop.hbase.regionserver.Store: > Major compaction triggered on store pg_payload; time since last major > compaction 93578344ms > 2010-09-22 22:00:39,887 INFO org.apache.hadoop.hbase.regionserver.Store: > Started compaction of 1 file(s) in pg_payload of > userh,\x1D\xAA2f\xE0\x1F\x12:\xA4\x98r\xFF!\xB8\x1A`,1285083827570.6be256da616fc20a6ac78a41952104e8. > into > hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/userh/6be256da616fc20a6ac78a41952104e8/.tmp, > sequenceid=77142618 > 2010-09-22 22:03:23,317 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: > LRU Stats: total=1.57 GB, free=396.14 MB, max=1.95 GB, blocks=27819, > accesses=38513233, hits=14159798, hitRatio=36.76%%, evictions=6494, > evicted=20462105, evictedPerRun=3150.924560546875 > 2010-09-22 22:03:23,361 INFO org.apache.zookeeper.ClientCnxn: Client session > timed out, have not heard from server in 181390ms for sessionid > 0x12b266801550065, closing socket connection and attempting reconnect > 2010-09-22 22:03:23,363 INFO org.apache.zookeeper.ClientCnxn: Client session > timed out, have not heard from server in 183256ms for sessionid > 0x12b266801550002, closing socket connection and attempting reconnect > 2010-09-22 22:03:23,451 FATAL > org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server > serverName=domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559, > load=(requests=4458, regions=134, usedHeap=8872, maxHeap=9995): Unhandled > exception > org.apache.hadoop.hbase.YouAreDeadException: > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; > currently processing > domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 as dead server > at > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217) > at > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271) > at > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:753) > at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919) > > THE RESTART ON THE AFFECTED RS: > > 2010-09-22 22:10:04,534 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting > 2010-09-22 22:13:23,449 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner > 4628396768048238482 lease expired > 2010-09-22 22:13:23,449 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner > 891312872292456529 lease expired > 2010-09-22 22:13:23,449 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner > 2977737069391109815 lease expired > 2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.Leases: > regionserver60020.leaseChecker closing leases > 2010-09-22 22:13:23,450 INFO org.apache.hadoop.hbase.Leases: > regionserver60020.leaseChecker closed leases > 2010-09-22 22:13:23,450 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; > hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-9,5,main] > 2010-09-22 22:13:23,451 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook > thread. > 2010-09-22 22:13:23,452 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished. > Wed Sep 22 22:15:11 UTC 2010 Starting regionserver on domU-12-31-39-18-1B-05 > 2010-09-22 22:15:12,716 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: > vmInputArguments=[-Xmx768m, -XX:+HeapDumpOnOutOfMemoryError, > -XX:+UseConcMarkSweepGC, -XX:+CMSIncrementalMode, > -XX:+HeapDumpOnOutOfMemoryError, -XX:+UseConcMarkSweepGC, > -XX:+CMSIncrementalMode, -Xmx10000m, -Dhbase.log.dir=/mnt/tmp/hbaselog, > -Dhbase.log.file=hbase-mlcamus-regionserver-domU-12-31-39-18-1B-05.log, > -Dhbase.home.dir=/home/mlcamus/apps/hbase/bin/.., -Dhbase.id.str=mlcamus, > -Dhbase.root.logger=INFO,DRFA] > ... > 2010-09-22 22:25:03,530 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:04,288 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:05,299 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:05,348 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:06,308 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:06,358 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:07,368 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:07,545 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:08,318 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:08,377 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:08,548 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:08,552 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; > user,^...@^tj�,1284527871890.bfee076d62bd372a70299f5854517145. > 2010-09-22 22:25:09,288 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:09,558 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > 2010-09-22 22:25:09,698 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; .META.,,1 > ... (these messages continue) > > ON THE MASTER WHEN THE RS WAS SHUTTING DOWN: > > 2010-09-22 21:59:51,588 INFO org.apache.hadoop.hbase.master.BaseScanner: All > 1 .META. region(s) scanned > 2010-09-22 22:00:22,931 INFO org.apache.hadoop.hbase.master.ServerManager: 2 > region servers, 0 dead, average load 134.0 > 2010-09-22 22:00:39,331 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scanning meta region {server: 10.104.37.247:60020, > regionname: -ROOT-,,0.70236052, startKey: <>} > 2010-09-22 22:00:39,333 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > 10.104.37.247:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete > 2010-09-22 22:00:51,368 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scanning meta region {server: 10.104.55.247:60020, > regionname: .META.,,1.1028785192, startKey: <>} > 2010-09-22 22:00:51,801 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scan of 266 row(s) of meta region {server: > 10.104.55.247:60020, regionname: .META.,,1.1028785192, startKey: <>} complete > 2010-09-22 22:00:51,801 INFO org.apache.hadoop.hbase.master.BaseScanner: All > 1 .META. region(s) scanned > 2010-09-22 22:01:22,941 INFO org.apache.hadoop.hbase.master.ServerManager: 2 > region servers, 0 dead, average load 134.0 > 2010-09-22 22:01:39,341 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scanning meta region {server: 10.104.37.247:60020, > regionname: -ROOT-,,0.70236052, startKey: <>} > 2010-09-22 22:01:39,343 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > 10.104.37.247:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete > 2010-09-22 22:01:51,371 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scanning meta region {server: 10.104.55.247:60020, > regionname: .META.,,1.1028785192, startKey: <>} > 2010-09-22 22:02:22,951 INFO org.apache.hadoop.hbase.master.ServerManager: 2 > region servers, 0 dead, average load 134.0 > 2010-09-22 22:02:24,203 INFO org.apache.hadoop.hbase.master.ServerManager: > domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 znode expired > 2010-09-22 22:02:24,204 DEBUG org.apache.hadoop.hbase.master.ServerManager: > Added=domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 to dead > servers, added shutdown processing operation > 2010-09-22 22:02:24,204 INFO org.apache.hadoop.hbase.master.RegionManager: > META region whose startkey is removed from onlineMetaRegions > 2010-09-22 22:02:24,204 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: > ProcessServerShutdown of > domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 > 2010-09-22 22:02:24,204 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of > server domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559: > logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, > onlineMetaRegions.size(): 0 > 2010-09-22 22:02:24,242 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: > Splitting 35 hlog(s) in > hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-1B-05.compute-1.internal,60020,1284840302559 > > ON THE MASTER AFTER GIVING A FEW MINUTES TO RECOVER: > > 2010-09-22 22:31:00,300 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, > onlineMetaRegions.size(): 0 > 2010-09-22 22:31:00,300 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not > all meta regions are online > 2010-09-22 22:31:00,301 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: > ProcessRegionClose of > userh,R\x80\xBD\x0F\xCB\xEAf\x95"+\xE1\x9A)/\xD7\x19,1285081795205.bf1a30fb67c1b0ea421da1f458d83c44., > false, reassign: true > 2010-09-22 22:31:00,301 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, > onlineMetaRegions.size(): 0 > 2010-09-22 22:31:00,301 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not > all meta regions are online > 2010-09-22 22:31:10,310 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: > ProcessRegionClose of > userh,R\x80\xBD\x0F\xCB\xEAf\x95"+\xE1\x9A)/\xD7\x19,1285081795205.bf1a30fb67c1b0ea421da1f458d83c44., > false, reassign: true > 2010-09-22 22:31:10,310 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, > onlineMetaRegions.size(): 0 > 2010-09-22 22:31:10,310 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not > all meta regions are online > 2010-09-22 22:31:10,311 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: > ProcessRegionClose of > userh,\xE5\x1A\x93\xCEQ\xC9\x8E\x8DIlvh9\xE1\xC8O,1285067882016.8aecae7491b95dd76aefab0058a71cff., > false, reassign: true > 2010-09-22 22:31:10,311 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: numberOfMetaRegions: 1, > onlineMetaRegions.size(): 0 > 2010-09-22 22:31:10,311 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperation: Requeuing because not > all meta regions are online > 2010-09-22 22:31:20,320 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: > ProcessRegionClose of > userh,\xE5\x1A\x93\xCEQ\xC9\x8E\x8DIlvh9\xE1\xC8O,1285067882016.8aecae7491b95dd76aefab00 > > WHEN RUNNING "list" FROM "hbase shell": > > org.apache.hadoop.hbase.NotServingRegionException: > org.apache.hadoop.hbase.NotServingRegionException: .META.,,1 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2217) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1808) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919) > > >
