At first sight, this looks like a network problem. Some datanodes were
unreachable from the HBase regionserver, as evidenced by the following
log line:

2011-04-04 09:57:14,606 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-8807886313487927171_53799733 from any node:
java.io.IOException: No live nodes contain current block. Will get new block
locations from namenode and retry...

The regionserver is designed to shut down when it encounters DFS errors,
so the HBase YouAreDeadExceptions are expected.
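
To illustrate the design (a sketch of the fail-fast pattern only, not the
actual HRegionServer code): on an unrecoverable DFS error the server
aborts and lets the master reassign its regions, rather than keep serving
in a bad state.

    import java.io.IOException;

    // Sketch of the fail-fast pattern; flushToHdfs is a stand-in for a
    // real HDFS write that has started failing.
    public class FailFastSketch {
        static void flushToHdfs() throws IOException {
            throw new IOException("No live nodes contain current block");
        }
        public static void main(String[] args) {
            try {
                flushToHdfs();
            } catch (IOException e) {
                // Abort: the master will split the WAL and reassign regions.
                System.err.println("ABORTING: unrecoverable DFS error: " + e);
                System.exit(1);
            }
        }
    }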

-dhruba


On Tue, Apr 12, 2011 at 11:45 AM, Jean-Daniel Cryans <[email protected]> wrote:

> YouAreDead means that the master is already processing the death of
> those region servers when the region server talks back to the master.
> Network split?
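>
> To illustrate, the master-side check is roughly the following (a
> hypothetical sketch based on the ServerManager.checkIsDead frame in
> your trace, not the actual source):
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     // Hypothetical sketch: once a server's ZK node expires, the master
>     // puts it on a dead-server list; any later report it sends is
>     // rejected with YouAreDeadException.
>     public class ServerManagerSketch {
>         static class YouAreDeadException extends Exception {
>             YouAreDeadException(String msg) { super(msg); }
>         }
>         private final Set<String> deadServers = new HashSet<String>();
>         void checkIsDead(String serverName) throws YouAreDeadException {
>             if (deadServers.contains(serverName)) {
>                 throw new YouAreDeadException("Server REPORT rejected; "
>                     + "currently processing " + serverName + " as dead server");
>             }
>         }
>     }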
>
> J-D
>
> On Tue, Apr 12, 2011 at 11:33 AM, Vidhyashankar Venkataraman
> <[email protected]> wrote:
> > This was something that happened a week back in our cluster: there was a
> > flash death of region servers. A few of the region servers did have
> > near-full heaps, so I thought GC could be at play. But many of them
> > crashed after a few DFS errors followed by a YouAreDeadException, and
> > those didn't have GC problems.
> >
> > This was in a 700-node cluster. Writes happen only through bulk loads,
> > with 50 regions per region server.
> >
> > After we restarted the cluster, it started running fine.
> >
> >
> > ------ Forwarded Message
> > From: Vidhyashankar Venkataraman <[email protected]>
> > Date: Tue, 12 Apr 2011 09:44:24 -0700
> > To: stack <[email protected]>
> > Conversation: Hbase configs.
> > Subject: Re: Hbase configs.
> >
> > As for the DFS errors: this is a sample log from one of the region
> > servers that had the flash death last week. Notice that the final nail
> > in the coffin was the YouAreDeadException, raised because ZK did not
> > receive a timely update from the regionserver. Usually that happens
> > when the heap is full, but that doesn't seem to be the case here (grep
> > for FATAL in the following log). The crash was preceded by quite a few
> > DFSClient errors, which I have usually noticed when there are too many
> > files floating around; for now, we haven't been able to reproduce these
> > errors as such. Please note that splits weren't disabled. Compactions
> > were happening once every 5-7 hours per region, with 50 regions per
> > node, and the max file/region size was 8 gigs.
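> >
> > Before the log, here is a minimal illustration of why the session
> > expiry is fatal (a sketch only; the quorum address and timeout are
> > made up):
> >
> >     import org.apache.zookeeper.WatchedEvent;
> >     import org.apache.zookeeper.Watcher;
> >     import org.apache.zookeeper.ZooKeeper;
> >
> >     // A GC pause (or partition) longer than the session timeout makes
> >     // ZK expire the session; the client learns of it on reconnect, and
> >     // the only safe reaction is to abort, which is the FATAL you see
> >     // in the log below.
> >     public class ExpirySketch implements Watcher {
> >         public void process(WatchedEvent event) {
> >             if (event.getState() == Event.KeeperState.Expired) {
> >                 System.err.println("received expired from ZooKeeper, aborting");
> >                 System.exit(1);
> >             }
> >         }
> >         public static void main(String[] args) throws Exception {
> >             new ZooKeeper("zkhost:2181", 30000, new ExpirySketch());
> >             Thread.sleep(Long.MAX_VALUE);
> >         }
> >     }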
> >
> > 2011-04-04 01:27:28,273 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> > 2011-04-04 01:27:28,273 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 4
> > 2011-04-04 01:27:28,759 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> > 2011-04-04 01:27:28,759 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 3
> > 2011-04-04 01:27:29,562 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> > 2011-04-04 01:27:29,562 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 2
> > java.net.SocketTimeoutException: 20000 millis timeout while waiting for
> channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=
> b3130001.yst.yahoo.net/67.195.49.108:60020]
> > 2011-04-04 03:32:41,280 INFO
> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
> .META.,,1 at address=b3130001.yst.yahoo.net:60020;
> java.net.ConnectException: Connection refused
> > 2011-04-04 05:23:52,460 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-3142768255648714362_53688655 from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> > 2011-04-04 09:57:14,606 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-8807886313487927171_53799733 from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> > java.io.InterruptedIOException: Aborting compaction of store content in
> region
> WCC,r:jp#co#toptour#topsrv1!/searchtour/domestic/tur_lst.php?dst_are1=10&dst_dit1=50%2C173&dst_chg=1&dp_d=&dpt=2&tur_prd=&tur_sty=&sel_ple=2&mny_sm=&mny_bg=&air_typ=&grp=31&cal_flg=1&dst_flg=0&are1=10&cty1=50&dit1=173&grp=31&are2=&cty2=&dit2=&are3=&cty3=&dit3=&pps=&agt=&sort=1&htl=&opt_flg7=&opt_flg8=&opt_flg9=&kwd_dst=&kwd_htl=!http,1301454677216.9ca97f291d075c375143d3e65de1168c.
> because user requested stop.
> > bash-3.00$ grep FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-04
> > bash-3.00$ grep FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> > 2011-04-05 05:15:38,940 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993):
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> > 2011-04-05 05:15:39,022 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead
> server
> > bash-3.00$ grep -b35 -a30 FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> > 607897318-    at javax.security.auth.Subject.doAs(Subject.java:396)
> > 607897373-    at
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> > 607897435-
> > 607897436-    at org.apache.hadoop.ipc.Client.call(Client.java:742)
> > 607897491-    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> > 607897550-    at $Proxy4.commitBlockSynchronization(Unknown Source)
> > 607897605-    at
> org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1570)
> > 607897687-    at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
> > 607897772-    at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
> > 607897857-    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > 607897921-    at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > 607898003-    at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > 607898093-    at java.lang.reflect.Method.invoke(Method.java:597)
> > 607898146-    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> > 607898202-    at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
> > 607898266-    at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:958)
> > 607898330-    at java.security.AccessController.doPrivileged(Native
> Method)
> > 607898393-    at javax.security.auth.Subject.doAs(Subject.java:396)
> > 607898448-    at
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> > 607898510-
> > 607898511-    at org.apache.hadoop.ipc.Client.call(Client.java:742)
> > 607898566-    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> > 607898625-    at $Proxy9.recoverBlock(Unknown Source)
> > 607898666-    at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
> > 607898761-    at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
> > 607898847-    at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
> > 607898938-2011-04-05 05:15:38,841 WARN org.apache.hadoop.hdfs.DFSClient:
> Error Recovery for block blk_8685130736369835846_54964870 failed  because
> recovery from primary datanode 67.195.53.55:4610 failed 1 times.  Pipeline
> was 67.195.53.55:4610, 67.195.57.241:4610. Will retry...
> > 607899207-2011-04-05 05:15:38,842 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> > 607899302-2011-04-05 05:15:38,852 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> > 607899397-2011-04-05 05:15:38,852 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> > 607899494:2011-04-05 05:15:38,940 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993):
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> > 607899866-org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> > 607899962-    at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
> > 607900060-    at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
> > 607900150-    at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> > 607900232-    at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> > 607900305-2011-04-05 05:15:38,945 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=77, stores=385, storefiles=1554,
> storefileIndexSize=2179, memstoreSize=0, compactionQueueSize=55,
> usedHeap=2522, maxHeap=7993, blockCacheSize=13752888,
> blockCacheFree=1662631752, blockCacheCount=0, blockCacheHitCount=0,
> blockCacheMissCount=43383235, blockCacheEvictedCount=0,
> blockCacheHitRatio=0, blockCacheHitCachingRatio=0
> > 607900750-2011-04-05 05:15:38,945 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> > 607900992-2011-04-05 05:15:38,945 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> > 607901076:2011-04-05 05:15:39,022 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead
> server
> > 607901465-org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead
> server
> > 607901658-    at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > 607901732-    at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > 607901829-    at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > 607901934-    at
> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > 607902002-    at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
> > 607902090-    at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
> > 607902179-    at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > 607902280-    at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
> > 607902363-    at java.lang.Thread.run(Thread.java:619)
> > 607902405-Caused by: org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead
> server
> > 607902603-    at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
> > 607902688-    at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
> > 607902780-    at
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:638)
> > 607902860-    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown
> Source)
> > 607902924-    at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > 607903014-    at java.lang.reflect.Method.invoke(Method.java:597)
> > 607903067-    at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > 607903139-    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
> > 607903218-
> > 607903219-    at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
> > 607903290-    at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > 607903365-    at $Proxy3.regionServerReport(Unknown Source)
> > 607903412-    at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:723)
> > 607903513- ... 2 more
> >
> >
> >
> > On 4/11/11 10:12 PM, "stack" <[email protected]> wrote:
> >
> > It would be worth studying when compactions run... pick a few
> > regions. You should do all you can to minimize how much compacting
> > you do (I wish I could come hang w/ you lot for a week to play with
> > this stuff).
> > St.Ack
> >
> > On Mon, Apr 11, 2011 at 9:35 PM, Vidhyashankar Venkataraman
> > <[email protected]> wrote:
> >> In fact, I had wanted to let splits run for 30 days and then disable
> >> them (so that I get a rough distribution of urls over a 30-day period;
> >> that's the max expiration time of docs).
> >>
> >> But there were too many things happening to pinpoint the exact
> >> bottleneck. So that's our next task once we disable splits: to find
> >> out a good compaction frequency. Also, just so you realize, the max
> >> compact files setting is 5, which means minor compactions happen
> >> roughly every 5 hours or more for every region.
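> >>
> >> For reference, a sketch of the knobs in play (property names from
> >> hbase-default.xml; the values are just the ones described above, so
> >> treat this as illustrative):
> >>
> >>     import org.apache.hadoop.conf.Configuration;
> >>     import org.apache.hadoop.hbase.HBaseConfiguration;
> >>
> >>     public class CompactionConfigSketch {
> >>         public static void main(String[] args) {
> >>             Configuration conf = HBaseConfiguration.create();
> >>             // A store becomes a minor-compaction candidate once it
> >>             // holds this many files.
> >>             conf.setInt("hbase.hstore.compactionThreshold", 5);
> >>             // Max store file size before a region splits (8 GB, as above).
> >>             conf.setLong("hbase.hregion.max.filesize", 8L * 1024 * 1024 * 1024);
> >>         }
> >>     }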
> >>
> >>
> >> On 4/11/11 9:28 PM, "stack" <[email protected]> wrote:
> >>
> >>
> >> Hmm... OK.
> >>
> >> Every hour.  Yes.  You want minor compactions to run then.  You want
> >> to be careful though that we don't over-compact.  We should study your
> >> running cluster and see if there is anything we can surmise from how
> >> it runs.  See if we can optimize our compaction settings for your
> >> particular hourly case.
> >>
> >
> >
> > ------ End of Forwarded Message
> >
>



-- 
Connect to me at http://www.facebook.com/dhruba
