Following up on a discussion that started in the asynchbase group...

It looks like our RS failure was related to a ZooKeeper timeout; it seems we may 
have overloaded that RS.  The cause of the failure is not as important to me 
right now as our ability to recover from it.  To answer some of Stack's 
questions:

We are running hbase version hbase-0.90.4-cdh3u3.

We are running hadoop version hadoop-0.20.2-cdh3u3.

Anything particular about this 'OPENING' region?  If you trace it in the master 
log, anything anomalous going on (it was being balanced when the regionserver 
it was on crashed or something like that)?

The anomalous thing seems to be this:
2012-05-12 00:58:14,325 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path 
hdfs://db1/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/recovered.edits/0000000017087735812.temp
 (wrote 3015517 edits in 34450ms)

There were only 3 tables/regions with a large number of edits - all 3 of them 
contain only Atomic Increments.  This is the only region that had over 3M 
edits; the other 2 similar regions (each for a different table that holds 
similar data) had < 825K edits.  All other tables/regions this server opened 
had < 10K edits.  It's not clear why this one region had so many edits to 
write - one of the other Atomic Increment-only tables gets more rows written 
to it and has the same write rate/flush interval in the client code (we 
coalesce atomic increments into batches and write the batches out on a timer).
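
For reference, the client-side coalescing is roughly the shape of the sketch 
below.  This is a simplified stand-in rather than our production code; the 
CounterSink interface is hypothetical and would be backed by whatever actually 
issues the increment RPCs (for us, asynchbase atomic increments):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class IncrementCoalescer {

  /** Hypothetical stand-in for the client that sends the increment RPCs. */
  public interface CounterSink {
    void increment(String table, String row, String qualifier, long delta);
  }

  private final ConcurrentHashMap<String, AtomicLong> pending =
      new ConcurrentHashMap<String, AtomicLong>();
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();
  private final CounterSink sink;

  public IncrementCoalescer(CounterSink sink, long flushIntervalMs) {
    this.sink = sink;
    // Write the accumulated batches out on a fixed timer, as described above.
    timer.scheduleAtFixedRate(new Runnable() {
      public void run() {
        flush();
      }
    }, flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
  }

  /** Accumulate a delta in memory instead of issuing one RPC per increment. */
  public void add(String table, String row, String qualifier, long delta) {
    String key = table + "|" + row + "|" + qualifier;
    AtomicLong counter = pending.get(key);
    if (counter == null) {
      AtomicLong fresh = new AtomicLong();
      counter = pending.putIfAbsent(key, fresh);
      if (counter == null) {
        counter = fresh;
      }
    }
    counter.addAndGet(delta);
  }

  /** Drain the batch: one atomic-increment call per distinct counter. */
  private void flush() {
    for (Map.Entry<String, AtomicLong> e : pending.entrySet()) {
      long delta = e.getValue().getAndSet(0);
      if (delta != 0) {
        String[] parts = e.getKey().split("\\|", 3);
        sink.increment(parts[0], parts[1], parts[2], delta);
      }
    }
  }
}

Per-key deltas just accumulate between flushes, so the number of increment 
calls per flush is bounded by the number of distinct counters being touched.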


Sounds like something up w/ this particular region.  When it's opening on 
another regionserver, what's showing in the regionserver logs?

It looks like the region is replaying recovered edits and a lease expires 
during that replay (the exception below is an HDFS LeaseExpiredException from 
the NameNode):
2012-05-12 00:58:28,767 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Started memstore flush for 
PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44., 
current region memstore size 128.0m; wal is null, using passed 
sequenceid=17087861627
2012-05-12 00:58:28,767 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Finished snapshotting, commencing flushing stores
2012-05-12 00:58:30,054 INFO org.apache.hadoop.hbase.regionserver.Store: 
Renaming flushed file at 
hdfs://db1/hbase/PROD_rHour/fe15757724e007251016b638ab55f151/.tmp/3732133664174711660
 to 
hdfs://db1/hbase/PROD_rHour/fe15757724e007251016b638ab55f151/C/5986787969762915560
2012-05-12 00:58:30,057 INFO org.apache.hadoop.hbase.regionserver.Store: Added 
hdfs://db1/hbase/PROD_rHour/fe15757724e007251016b638ab55f151/C/5986787969762915560,
 entries=579194, sequenceid=17091885583, memsize=128.0m, filesize=5.4m
2012-05-12 00:58:30,057 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Finished memstore flush of ~128.0m for region 
PROD_rHour,,1323484845768.fe15757724e007251016b638ab55f151. in 3976ms, 
sequenceid=17091885583, compaction requested=false; wal=null
2012-05-12 00:58:32,982 INFO org.apache.hadoop.hbase.regionserver.Store: 
Renaming flushed file at 
hdfs://db1/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/.tmp/2211059944788521319
 to 
hdfs://db1/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/C/414592015524271636
2012-05-12 00:58:32,985 INFO org.apache.hadoop.hbase.regionserver.Store: Added 
hdfs://db1/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/C/414592015524271636,
 entries=573364, sequenceid=17087861627, memsize=128.0m, filesize=6.1m
2012-05-12 00:58:32,985 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Finished memstore flush of ~128.0m for region 
PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44. in 
4218ms, sequenceid=17087861627, compaction requested=true; wal=null
2012-05-12 00:58:34,722 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Started memstore flush for 
PROD_rHour,,1323484845768.fe15757724e007251016b638ab55f151., current region 
memstore size 128.0m; wal is null, using passed sequenceid=17092246834
2012-05-12 00:58:34,722 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Finished snapshotting, commencing flushing stores
2012-05-12 00:58:39,260 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Started memstore flush for 
PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44., 
current region memstore size 128.0m; wal is null, using passed 
sequenceid=17087925859

2012-05-12 00:58:44,807 INFO org.apache.hadoop.hbase.regionserver.Store: 
Renaming flushed file at 
hdfs://db1/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/.tmp/6354552965813660275
 to 
hdfs://db1/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/C/6008227162161883094
2012-05-12 00:58:44,810 INFO org.apache.hadoop.hbase.regionserver.Store: Added 
hdfs://db1/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/C/6008227162161883094,
 entries=572482, sequenceid=17087925859, memsize=128.0m, filesize=6.1m
2012-05-12 00:58:44,810 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Finished memstore flush of ~128.0m for region 
PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44. in 
5550ms, sequenceid=17087925859, compaction requested=true; wal=null
2012-05-12 00:58:46,708 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Applied 1633459, skipped 554562, firstSequenceidInLog=17087735911, 
maxSequenceidInLog=17092544014

The memstore flushing continues for some time... until...

2012-05-12 01:05:19,756 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Started memstore flush for 
PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44., 
current region memstore size 128.0m; wal is null, using passed 
sequenceid=17090407726
2012-05-12 01:05:19,756 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Finished snapshotting, commencing flushing stores
2012-05-12 01:05:23,547 ERROR 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of 
region=PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44.
org.apache.hadoop.hbase.DroppedSnapshotException: region: 
PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44.
        at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1055)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:2042)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:1917)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:367)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2770)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2756)
        at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:312)
        at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:158)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on 
/hbase/PROD_rDay/9d0f42dd7d703c8ae029e31ac7f13f44/.tmp/66369361
75557646629 File does not exist. [Lease.  Holder: 
DFSClient_hb_rs_hb5,60020,1336672254585_1336672255146, pendingcreates: 2]
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1593)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1584)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1639)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1627)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:687)
        at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

        at org.apache.hadoop.ipc.Client.call(Client.java:1107)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy5.complete(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy5.complete(Unknown Source)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3941)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3856)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at org.apache.hadoop.hbase.io.hfile.HFile$Writer.close(HFile.java:655)
        at 
org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:877)
        at 
org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:539)
        at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:494)
        at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:83)
        at 
org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1573)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1033)
        ... 11 more
2012-05-12 01:05:23,547 INFO 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of 
region REGION => {NAME => 
'PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44.', 
STARTKEY => 'm:669:20111104', ENDKEY => '', ENCODED => 
9d0f42dd7d703c8ae029e31ac7f13f44, TABLE => {{NAME => 'PROD_rDay', FAMILIES => 
[{NAME => 'C', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => 
'1', COMPRESSION => 'LZO', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY 
=> 'true', BLOCKCACHE => 'true'}]}} failed, marking as FAILED_OPEN in ZK
2012-05-12 01:05:23,547 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x23660ab0d427043 Attempting to transition node 
9d0f42dd7d703c8ae029e31ac7f13f44 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_FAILED_OPEN
2012-05-12 01:05:23,547 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x23660ab0d427043 Attempt to transition the unassigned node 
for 9d0f42dd7d703c8ae029e31ac7f13f44 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_FAILED_OPEN failed, the node existed but was version 7 not the 
expected version 5
2012-05-12 01:05:23,547 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark 
region REGION => {NAME => 
'PROD_rDay,m:669:20111104,1323489256750.9d0f42dd7d703c8ae029e31ac7f13f44.', 
STARTKEY => 'm:669:20111104', ENDKEY => '', ENCODED => 
9d0f42dd7d703c8ae029e31ac7f13f44, TABLE => {{NAME => 'PROD_rDay', FAMILIES => 
[{NAME => 'C', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => 
'1', COMPRESSION => 'LZO', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY 
=> 'true', BLOCKCACHE => 'true'}]}} as FAILED_OPEN. It's likely that the master 
already timed out this open attempt, and thus another RS already has the region.


I'm not clear why this particular region had so many edits to write compared to 
the other regions that were running on the failed server.  It seems that 
flushing the memstore during the recovered-edits replay takes so long that the 
lease times out before the open can complete.  What can be done about that?
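
Is raising the relevant timeouts the right direction?  The sketch below is 
only an illustration of the two knobs I have in mind - the ZooKeeper session 
timeout and the master's OPENING-assignment timeout (per the "master already 
timed out this open attempt" warning above).  The property names are my best 
reading of the 0.90-era configuration, so treat them as assumptions, and in 
practice they would go in hbase-site.xml rather than code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TimeoutKnobs {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // How long a regionserver's ZooKeeper session can go silent before the
    // server is declared dead.
    conf.setInt("zookeeper.session.timeout", 180000);

    // How long the master waits on a region stuck in OPENING before it times
    // out the assignment and hands the region to another server.
    conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 600000);

    System.out.println("zookeeper.session.timeout = "
        + conf.getInt("zookeeper.session.timeout", -1));
    System.out.println("hbase.master.assignment.timeoutmonitor.timeout = "
        + conf.getInt("hbase.master.assignment.timeoutmonitor.timeout", -1));
  }
}

Even if that bought enough time for the 3M-edit replay to finish, it would not 
explain why this one region accumulated so many edits in the first place.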

Per the history below, once we bring the failed RS back online, all the issues 
seem to clear up and things start working as expected again.  I'm not clear on 
what the failed RS does at startup that clears the issue up.  I'd like to get 
to a point where the failed RS can stay down indefinitely and all of the 
regions it was serving come back up on other RSs without intervention.

Thanks,
Steve


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Monday, May 07, 2012 1:55 PM
To: Steve Boyle
Subject: Re: recovering from a region server failure

On Mon, May 7, 2012 at 11:09 AM, Steve Boyle <[email protected]> wrote:
> We experienced our Region Server failure today and I'm hoping I can get your 
> opinion on what we saw.  Everything was running fine, then we had a Region 
> Server crash (not sure why it crashed at the moment).

It'll say why in the logs.  You might have to read up from the last log message 
a page or two to get to the root reason (look for stuff like an expired 
zookeeper session -- that's a common one).

> All of the regions that the failed RS was serving fail over to other RSs 
> pretty quickly (quick enough for our purposes, say in less than a minute), 
> except for a single region.  One region seems to get stuck in the 'opening' 
> state on another, running RS.

What version of hbase? (I went back over this correspondence and I don't think 
you say)

Anything particular about this 'OPENING' region?  If you trace it in the master 
log, anything anomalous going on (it was being balanced when the regionserver 
it was on crashed or something like that)?
The master log is noisy but you can usually figure out the basics w/ a grep on 
the region name.


> We go into the hbase shell and try to 'unassign' the region (with the force 
> flag); this causes the region to get stuck in 'opening' on yet another region 
> server.  We try to 'unassign' the region several times after it does not open 
> on a given server for about 30-45 seconds; it does not seem to open at all.

Sounds like something up w/ this particular region.  When it's opening on 
another regionserver, what's showing in the regionserver logs?

> Somewhere in this process, another person restarts the failed RS - just as 
> soon as the failed RS is restarted, the region that is stuck in 'opening' is 
> able to open/start on whatever region server it was last waiting to open on.  
> It seems odd that we need to restart the failed RS to get all of the regions 
> it used to be serving back online.  Any suggestions as to how we can better 
> understand or even solve this problem?
>

Hmm..  What version of hadoop?  Are you running an appropriate hadoop?
See http://hbase.apache.org/book.html#hadoop (I'm guessing that if you looked 
in the master logs you'd see a complaint that the master is unable to split WAL 
logs because the lease is still held by the regionserver that 'crashed'... this 
would happen if you were running a hadoop that didn't support sync and was 
missing the steal-lease api necessary for smooth hbase sailing).

We should also probably do any continuation over on the hbase lists, Steve, 
rather than here on the async list, since it seems you are having hbase rather 
than asynchbase issues.

Go easy,
St.Ack
