This was something that happened a week back in our cluster: a flash death of 
region servers. A few of the region servers did have near-full heaps, so I 
thought GC could be at play, but many of them crashed after a few DFS errors 
followed by a YouAreDeadException, and those didn't have GC problems.

This was in a 700-node cluster. Writes happen only through bulk loads, with 50 
regions per region server.

After we restarted the cluster, it started running fine.
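The kind of triage this came down to can be sketched with a couple of greps 
(the file name and sample lines below are made-up stand-ins, trimmed from the 
real log further down; the point is distinguishing a ZooKeeper session expiry 
from an OOM/GC death):

```shell
# Hypothetical triage sketch: write two trimmed sample log lines, then
# check whether the aborts were ZooKeeper session expiries.
cat > /tmp/rs.log <<'EOF'
2011-04-05 05:15:38,940 FATAL HRegionServer: ABORTING region server ... received expired from ZooKeeper, aborting
2011-04-05 05:15:39,022 FATAL HRegionServer: ABORTING region server ... YouAreDeadException: Server REPORT rejected
EOF
grep -c "FATAL" /tmp/rs.log                    # total aborts -> 2
grep -c "expired from ZooKeeper" /tmp/rs.log   # session-expiry aborts -> 1
```

If the expiry count matches the abort count and the heap numbers in the abort 
line are well under max, GC pauses are a less likely culprit than network or 
ZooKeeper trouble.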


------ Forwarded Message
From: Vidhyashankar Venkataraman <[email protected]>
Date: Tue, 12 Apr 2011 09:44:24 -0700
To: stack <[email protected]>
Conversation: Hbase configs.
Subject: Re: Hbase configs.

As for the DFS errors: this is a sample log from one of the region servers 
that had this flash death last week. Notice that the final nail in the coffin 
was the YouAreDeadException, thrown because ZooKeeper was unable to receive a 
timely update from the region server. Usually this happens when the heap is 
full, but that doesn't seem to be the case here (grep for FATAL in the 
following log). The abort was preceded by quite a few DFSClient errors, which 
I have usually noticed when there are too many files floating around; for now, 
we haven't been able to reproduce these errors. Please note that splits 
weren't disabled. Compactions were happening once every 5-7 hours per region, 
with 50 regions per node, and the max file/region size was 8 GB.

2011-04-04 01:27:28,273 INFO org.apache.hadoop.hdfs.DFSClient: 
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
replicated 
yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
2011-04-04 01:27:28,273 WARN org.apache.hadoop.hdfs.DFSClient: 
NotReplicatedYetException sleeping 
/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries 
left 4
2011-04-04 01:27:28,759 INFO org.apache.hadoop.hdfs.DFSClient: 
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
replicated 
yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
2011-04-04 01:27:28,759 WARN org.apache.hadoop.hdfs.DFSClient: 
NotReplicatedYetException sleeping 
/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries 
left 3
2011-04-04 01:27:29,562 INFO org.apache.hadoop.hdfs.DFSClient: 
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
replicated 
yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
2011-04-04 01:27:29,562 WARN org.apache.hadoop.hdfs.DFSClient: 
NotReplicatedYetException sleeping 
/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries 
left 2
java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel 
to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending 
remote=b3130001.yst.yahoo.net/67.195.49.108:60020]
2011-04-04 03:32:41,280 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: 
Failed verification of .META.,,1 at address=b3130001.yst.yahoo.net:60020; 
java.net.ConnectException: Connection refused
2011-04-04 05:23:52,460 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_-3142768255648714362_53688655 from any node: java.io.IOException: No 
live nodes contain current block. Will get new block locations from namenode 
and retry...
2011-04-04 09:57:14,606 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_-8807886313487927171_53799733 from any node: java.io.IOException: No 
live nodes contain current block. Will get new block locations from namenode 
and retry...
java.io.InterruptedIOException: Aborting compaction of store content in region 
WCC,r:jp#co#toptour#topsrv1!/searchtour/domestic/tur_lst.php?dst_are1=10&dst_dit1=50%2C173&dst_chg=1&dp_d=&dpt=2&tur_prd=&tur_sty=&sel_ple=2&mny_sm=&mny_bg=&air_typ=&grp=31&cal_flg=1&dst_flg=0&are1=10&cty1=50&dit1=173&grp=31&are2=&cty2=&dit2=&are3=&cty3=&dit3=&pps=&agt=&sort=1&htl=&opt_flg7=&opt_flg8=&opt_flg9=&kwd_dst=&kwd_htl=!http,1301454677216.9ca97f291d075c375143d3e65de1168c.
 because user requested stop.
bash-3.00$ grep FATAL 
hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-04
bash-3.00$ grep FATAL 
hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
2011-04-05 05:15:38,940 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, 
regions=77, usedHeap=2521, maxHeap=7993): 
regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 
regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from 
ZooKeeper, aborting
2011-04-05 05:15:39,022 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, 
regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception: 
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently 
processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
bash-3.00$ grep -b35 -a30 FATAL 
hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
607897318-    at javax.security.auth.Subject.doAs(Subject.java:396)
607897373-    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
607897435-
607897436-    at org.apache.hadoop.ipc.Client.call(Client.java:742)
607897491-    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
607897550-    at $Proxy4.commitBlockSynchronization(Unknown Source)
607897605-    at 
org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1570)
607897687-    at 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
607897772-    at 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
607897857-    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
607897921-    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
607898003-    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
607898093-    at java.lang.reflect.Method.invoke(Method.java:597)
607898146-    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
607898202-    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
607898266-    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:958)
607898330-    at java.security.AccessController.doPrivileged(Native Method)
607898393-    at javax.security.auth.Subject.doAs(Subject.java:396)
607898448-    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
607898510-
607898511-    at org.apache.hadoop.ipc.Client.call(Client.java:742)
607898566-    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
607898625-    at $Proxy9.recoverBlock(Unknown Source)
607898666-    at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
607898761-    at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
607898847-    at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
607898938-2011-04-05 05:15:38,841 WARN org.apache.hadoop.hdfs.DFSClient: Error 
Recovery for block blk_8685130736369835846_54964870 failed  because recovery 
from primary datanode 67.195.53.55:4610 failed 1 times.  Pipeline was 
67.195.53.55:4610, 67.195.57.241:4610. Will retry...
607899207-2011-04-05 05:15:38,842 INFO org.apache.hadoop.io.compress.CodecPool: 
Got brand-new compressor
607899302-2011-04-05 05:15:38,852 INFO org.apache.hadoop.io.compress.CodecPool: 
Got brand-new compressor
607899397-2011-04-05 05:15:38,852 INFO org.apache.hadoop.io.compress.CodecPool: 
Got brand-new decompressor
607899494:2011-04-05 05:15:38,940 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, 
regions=77, usedHeap=2521, maxHeap=7993): 
regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 
regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from 
ZooKeeper, aborting
607899866-org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired
607899962-    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
607900060-    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
607900150-    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
607900232-    at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
607900305-2011-04-05 05:15:38,945 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: 
request=0.0, regions=77, stores=385, storefiles=1554, storefileIndexSize=2179, 
memstoreSize=0, compactionQueueSize=55, usedHeap=2522, maxHeap=7993, 
blockCacheSize=13752888, blockCacheFree=1662631752, blockCacheCount=0, 
blockCacheHitCount=0, blockCacheMissCount=43383235, blockCacheEvictedCount=0, 
blockCacheHitRatio=0, blockCacheHitCachingRatio=0
607900750-2011-04-05 05:15:38,945 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: 
regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 
regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from 
ZooKeeper, aborting
607900992-2011-04-05 05:15:38,945 INFO org.apache.zookeeper.ClientCnxn: 
EventThread shut down
607901076:2011-04-05 05:15:39,022 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, 
regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception: 
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently 
processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
607901465-org.apache.hadoop.hbase.YouAreDeadException: 
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently 
processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
607901658-    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
607901732-    at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
607901829-    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
607901934-    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
607902002-    at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
607902090-    at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
607902179-    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
607902280-    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
607902363-    at java.lang.Thread.run(Thread.java:619)
607902405-Caused by: org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently 
processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
607902603-    at 
org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
607902688-    at 
org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
607902780-    at 
org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:638)
607902860-    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
607902924-    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
607903014-    at java.lang.reflect.Method.invoke(Method.java:597)
607903067-    at 
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
607903139-    at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
607903218-
607903219-    at 
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
607903290-    at 
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
607903365-    at $Proxy3.regionServerReport(Unknown Source)
607903412-    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:723)
607903513-    ... 2 more
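To check the 5-7 hour compaction cadence mentioned above, one could bucket the 
region server's compaction log lines by hour. A sketch (the sample lines are 
made up, and the exact "Completed compaction" phrasing is an assumption that 
varies across HBase versions):

```shell
# Write a few made-up compaction log lines, then count compactions per hour:
# field 1 is the date, field 2 the time, so split the time on ":" to get the hour.
cat > /tmp/rs_compactions.log <<'EOF'
2011-04-04 01:12:03,000 INFO regionserver.Store: Completed compaction of content
2011-04-04 01:55:41,000 INFO regionserver.Store: Completed compaction of content
2011-04-04 07:03:19,000 INFO regionserver.Store: Completed compaction of content
EOF
grep "Completed compaction" /tmp/rs_compactions.log \
  | awk '{split($2, t, ":"); print $1, t[1]"h"}' | sort | uniq -c
```

Run against a full day's log, this gives a per-hour compaction count that can 
be eyeballed against the expected cadence.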



On 4/11/11 10:12 PM, "stack" <[email protected]> wrote:

It would be worth studying when compactions run... pick a few
regions. You should do all you can to minimize how much compacting
you do (I wish I could come hang w/ you lot for a week to play with
this stuff).
St.Ack

On Mon, Apr 11, 2011 at 9:35 PM, Vidhyashankar Venkataraman
<[email protected]> wrote:
> In fact, I had wanted to let splits run for 30 days and then disable them
> (so that I get a rough distribution of URLs over a 30-day period: that's the
> max expiration time of docs).
>
> But there were too many things happening to pinpoint the exact bottleneck.
> So that's our next task once we disable splits: to find out a good
> compaction frequency. Also, just so you realize, the max compact files
> setting is 5, which means a minor compaction happens roughly every 5 hours
> or more for every region.
>
>
> On 4/11/11 9:28 PM, "stack" <[email protected]> wrote:
>
>
> Hmm... OK.
>
> Every hour.  Yes.  You want minor compactions to run then.  You want
> to be careful though that we don't over compact.  We should study your
> running cluster and see if there is anything we can surmise from how
> it runs.  See if we can optimize our compaction settings for your
> particular hourly case.
>
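For reference, the "max compact files" knob discussed above presumably maps to 
an hbase-site.xml entry along these lines in 0.90-era HBase; the property name 
and value here are illustrative, not taken from our production config:

```xml
<!-- Illustrative sketch only, not our actual production settings. -->
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <!-- number of StoreFiles in a Store before a minor compaction is triggered -->
  <value>5</value>
</property>
```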


------ End of Forwarded Message
