YouAreDeadException means the master was already processing the death of those region servers by the time they reported back to it. Could this have been a network split?
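In rough terms, the master-side logic works like this (a simplified sketch, not the actual HBase source; every name here is illustrative except checkIsDead, which shows up in the stack trace further down):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of the master-side bookkeeping behind
// YouAreDeadException. The class, fields, and method bodies are
// illustrative; only checkIsDead appears in the real stack trace.
public class DeadServerCheck {

    static class YouAreDeadException extends Exception {
        YouAreDeadException(String msg) { super(msg); }
    }

    // Servers whose expiry (log splitting, region reassignment) is in flight.
    private final Set<String> deadServersInProgress = new HashSet<>();

    // Invoked when ZooKeeper reports the server's ephemeral node expired,
    // e.g. after a long GC pause or a network partition.
    public void expireServer(String serverName) {
        deadServersInProgress.add(serverName);
    }

    // Invoked on every region server heartbeat (REPORT). A server that was
    // merely paused comes back, reports, and is told it is already dead.
    public void checkIsDead(String serverName) throws YouAreDeadException {
        if (deadServersInProgress.contains(serverName)) {
            throw new YouAreDeadException("Server REPORT rejected; currently processing "
                    + serverName + " as dead server");
        }
    }

    public static void main(String[] args) {
        DeadServerCheck master = new DeadServerCheck();
        String rs = "b3130123.yst.yahoo.net,60020,1301703415323";
        master.expireServer(rs);
        try {
            master.checkIsDead(rs);
        } catch (YouAreDeadException e) {
            // Same shape as the message in the FATAL log lines below.
            System.out.println(e.getMessage());
        }
    }
}
```

The point being: once the session expires, the master commits to expiring the server, so a late heartbeat cannot "un-kill" it; the region server has to abort and restart.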
J-D

On Tue, Apr 12, 2011 at 11:33 AM, Vidhyashankar Venkataraman
<[email protected]> wrote:
> This was something that happened a week back in our cluster: There was a
> flash death of region servers. A few of the region servers did have near-full
> heaps, so I thought GC could be at play. But many of them crashed after a few
> DFS errors followed by a YouAreDeadException, and they didn't have GC
> problems.
>
> This was in a 700-node cluster. Writes happen only through bulk loads. 50
> regions per region server.
>
> After we restarted the cluster, it started running fine.
>
>
> ------ Forwarded Message
> From: Vidhyashankar Venkataraman <[email protected]>
> Date: Tue, 12 Apr 2011 09:44:24 -0700
> To: stack <[email protected]>
> Conversation: Hbase configs.
> Subject: Re: Hbase configs.
>
> As for the DFS errors: this was a sample log from one of the region servers
> that had this flash death last week. Notice that the final nail in the coffin
> was the YouAreDeadException, because ZooKeeper was unable to receive a timely
> update from the region server. Usually this happens when the heap is full,
> but that doesn't seem to be the case here (grep for FATAL in the following
> log). It was, however, preceded by quite a few DFSClient errors. I have
> usually noticed these DFSClient errors when there are too many files floating
> around; for now, we haven't been able to reproduce these errors as such.
> Please note that splits weren't disabled. Compactions were happening once
> every 5-7 hours per region. 50 regions per node. And the max file/region size
> was 8 gigs.
>
> 2011-04-04 01:27:28,273 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:28,273 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 4
> 2011-04-04 01:27:28,759 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:28,759 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 3
> 2011-04-04 01:27:29,562 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:29,562 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 2
> java.net.SocketTimeoutException: 20000 millis timeout while waiting for
> channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=b3130001.yst.yahoo.net/67.195.49.108:60020]
> 2011-04-04 03:32:41,280 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
> Failed verification of .META.,,1 at address=b3130001.yst.yahoo.net:60020;
> java.net.ConnectException: Connection refused
> 2011-04-04 05:23:52,460 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-3142768255648714362_53688655 from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> 2011-04-04 09:57:14,606 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-8807886313487927171_53799733 from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> java.io.InterruptedIOException: Aborting compaction of store content in
> region
> WCC,r:jp#co#toptour#topsrv1!/searchtour/domestic/tur_lst.php?dst_are1=10&dst_dit1=50%2C173&dst_chg=1&dp_d=&dpt=2&tur_prd=&tur_sty=&sel_ple=2&mny_sm=&mny_bg=&air_typ=&grp=31&cal_flg=1&dst_flg=0&are1=10&cty1=50&dit1=173&grp=31&are2=&cty2=&dit2=&are3=&cty3=&dit3=&pps=&agt=&sort=1&htl=&opt_flg7=&opt_flg8=&opt_flg9=&kwd_dst=&kwd_htl=!http,1301454677216.9ca97f291d075c375143d3e65de1168c.
> because user requested stop.
> bash-3.00$ grep FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-04
> bash-3.00$ grep FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> 2011-04-05 05:15:38,940 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993):
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> 2011-04-05 05:15:39,022 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> bash-3.00$ grep -b35 -a30 FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> 607897318- at javax.security.auth.Subject.doAs(Subject.java:396)
> 607897373- at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> 607897435-
> 607897436- at org.apache.hadoop.ipc.Client.call(Client.java:742)
> 607897491- at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> 607897550- at $Proxy4.commitBlockSynchronization(Unknown Source)
> 607897605- at
> org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1570)
> 607897687- at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
> 607897772- at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
> 607897857- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 607897921- at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 607898003- at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 607898093- at java.lang.reflect.Method.invoke(Method.java:597)
> 607898146- at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> 607898202- at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
> 607898266- at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:958)
> 607898330- at java.security.AccessController.doPrivileged(Native Method)
> 607898393- at javax.security.auth.Subject.doAs(Subject.java:396)
> 607898448- at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> 607898510-
> 607898511- at org.apache.hadoop.ipc.Client.call(Client.java:742)
> 607898566- at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> 607898625- at $Proxy9.recoverBlock(Unknown Source)
> 607898666- at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
> 607898761- at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
> 607898847- at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
> 607898938-2011-04-05 05:15:38,841 WARN org.apache.hadoop.hdfs.DFSClient:
> Error Recovery for block blk_8685130736369835846_54964870 failed because
> recovery from primary datanode 67.195.53.55:4610 failed 1 times. Pipeline
> was 67.195.53.55:4610, 67.195.57.241:4610. Will retry...
> 607899207-2011-04-05 05:15:38,842 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> 607899302-2011-04-05 05:15:38,852 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> 607899397-2011-04-05 05:15:38,852 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 607899494:2011-04-05 05:15:38,940 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993):
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> 607899866-org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> 607899962- at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
> 607900060- at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
> 607900150- at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> 607900232- at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 607900305-2011-04-05 05:15:38,945 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=77, stores=385, storefiles=1554,
> storefileIndexSize=2179, memstoreSize=0, compactionQueueSize=55,
> usedHeap=2522, maxHeap=7993, blockCacheSize=13752888,
> blockCacheFree=1662631752, blockCacheCount=0, blockCacheHitCount=0,
> blockCacheMissCount=43383235, blockCacheEvictedCount=0, blockCacheHitRatio=0,
> blockCacheHitCachingRatio=0
> 607900750-2011-04-05 05:15:38,945 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> 607900992-2011-04-05 05:15:38,945 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 607901076:2011-04-05 05:15:39,022 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> 607901465-org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> 607901658- at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 607901732- at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> 607901829- at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> 607901934- at
> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> 607902002- at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
> 607902090- at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
> 607902179- at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> 607902280- at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
> 607902363- at java.lang.Thread.run(Thread.java:619)
> 607902405-Caused by: org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> 607902603- at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
> 607902688- at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
> 607902780- at
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:638)
> 607902860- at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 607902924- at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 607903014- at java.lang.reflect.Method.invoke(Method.java:597)
> 607903067- at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> 607903139- at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
> 607903218-
> 607903219- at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
> 607903290- at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> 607903365- at $Proxy3.regionServerReport(Unknown Source)
> 607903412- at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:723)
> 607903513- ... 2 more
>
>
>
> On 4/11/11 10:12 PM, "stack" <[email protected]> wrote:
>
> It would be worth studying when compactions run..... pick a few
> regions. You should do all you can to minimize how much compacting
> you do (I wish I could come hang w/ you lot for a week to play with
> this stuff)
> St.Ack
>
> On Mon, Apr 11, 2011 at 9:35 PM, Vidhyashankar Venkataraman
> <[email protected]> wrote:
>> In fact, I had wanted to get a split for 30 days and then disable the splits
>> (so that I get a rough distribution of URLs over a 30-day period: that's the
>> max expiration time of docs).
>>
>> But there were too many things happening to pinpoint the exact bottleneck.
>> So that's our next task once we disable splits: to find out a good
>> compaction frequency. Also, just so you realize, the max compact files is 5,
>> which means minor compactions happen roughly every 5 hours or greater for
>> every region.
>>
>> On 4/11/11 9:28 PM, "stack" <[email protected]> wrote:
>>
>>
>> Hmm... OK.
>>
>> Every hour. Yes. You want minor compactions to run then. You want
>> to be careful, though, that we don't over-compact. We should study your
>> running cluster and see if there is anything we can surmise from how
>> it runs. See if we can optimize our compaction settings for your
>> particular hourly case.
>>
>
>
> ------ End of Forwarded Message
>
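For readers hitting the same symptoms, the knobs discussed in this thread map to hbase-site.xml settings roughly as follows. This is an illustrative fragment, not a recommendation: the values are examples, and the mapping of "max compact files is 5" to hbase.hstore.compactionThreshold is my reading of the thread, so verify names and defaults against your HBase version.

```xml
<!-- Illustrative hbase-site.xml fragment; values are examples only. -->
<configuration>
  <!-- How long ZooKeeper waits before declaring a region server's session
       expired. A GC pause (or partition) longer than this triggers the
       session-expiry / YouAreDeadException sequence seen in the log above. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
  <!-- Likely the "max compact files is 5" setting: a store with this many
       store files becomes a candidate for minor compaction. -->
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>5</value>
  </property>
  <!-- The "max file/region size was 8 gigs" setting: a region splits once
       it exceeds this size (moot once splits are disabled). -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>8589934592</value>
  </property>
</configuration>
```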
