YouAreDeadException means the master was already processing the death of those region servers by the time they reported back to it. Could this have been a network split?
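In rough terms, the master-side logic works like this (a simplified sketch, not the actual HBase source; every name here is illustrative except checkIsDead, which shows up in the stack trace further down):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of the master-side bookkeeping behind
// YouAreDeadException. The class, fields, and method bodies are
// illustrative; only checkIsDead appears in the real stack trace.
public class DeadServerCheck {

    static class YouAreDeadException extends Exception {
        YouAreDeadException(String msg) { super(msg); }
    }

    // Servers whose expiry (log splitting, region reassignment) is in flight.
    private final Set<String> deadServersInProgress = new HashSet<>();

    // Invoked when ZooKeeper reports the server's ephemeral node expired,
    // e.g. after a long GC pause or a network partition.
    public void expireServer(String serverName) {
        deadServersInProgress.add(serverName);
    }

    // Invoked on every region server heartbeat (REPORT). A server that was
    // merely paused comes back, reports, and is told it is already dead.
    public void checkIsDead(String serverName) throws YouAreDeadException {
        if (deadServersInProgress.contains(serverName)) {
            throw new YouAreDeadException("Server REPORT rejected; currently processing "
                    + serverName + " as dead server");
        }
    }

    public static void main(String[] args) {
        DeadServerCheck master = new DeadServerCheck();
        String rs = "b3130123.yst.yahoo.net,60020,1301703415323";
        master.expireServer(rs);
        try {
            master.checkIsDead(rs);
        } catch (YouAreDeadException e) {
            // Same shape as the message in the FATAL log lines below.
            System.out.println(e.getMessage());
        }
    }
}
```

The point being: once the session expires, the master commits to expiring the server, so a late heartbeat cannot "un-kill" it; the region server has to abort and restart.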
J-D

On Tue, Apr 12, 2011 at 11:33 AM, Vidhyashankar Venkataraman
<[email protected]> wrote:
> This was something that happened a week back in our cluster: There was a
> flash death of region servers. A few of the region servers did have near-full
> heaps, so I thought GC could be at play. But many of them crashed after a few
> DFS errors followed by a YouAreDeadException, and they didn't have GC
> problems.
>
> This was in a 700-node cluster. Writes happen only through bulk loads. 50
> regions per region server.
>
> After we restarted the cluster, it started running fine.
>
>
> ------ Forwarded Message
> From: Vidhyashankar Venkataraman <[email protected]>
> Date: Tue, 12 Apr 2011 09:44:24 -0700
> To: stack <[email protected]>
> Conversation: Hbase configs.
> Subject: Re: Hbase configs.
>
> As for the DFS errors: this was a sample log from one of the region servers
> that had this flash death last week. Notice that the final nail in the coffin
> was the YouAreDeadException, because ZooKeeper was unable to receive a timely
> update from the region server. Usually this happens when the heap is full,
> but that doesn't seem to be the case here (grep for FATAL in the following
> log). It was, however, preceded by quite a few DFSClient errors. I have
> usually noticed these DFSClient errors when there are too many files floating
> around; for now, we haven't been able to reproduce these errors as such.
> Please note that splits weren't disabled. Compactions were happening once
> every 5-7 hours per region. 50 regions per node. And the max file/region size
> was 8 gigs.
>
> 2011-04-04 01:27:28,273 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:28,273 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 4
> 2011-04-04 01:27:28,759 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:28,759 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 3
> 2011-04-04 01:27:29,562 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:29,562 WARN org.apache.hadoop.hdfs.DFSClient:
> NotReplicatedYetException sleeping
> /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries
> left 2
> java.net.SocketTimeoutException: 20000 millis timeout while waiting for
> channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=b3130001.yst.yahoo.net/67.195.49.108:60020]
> 2011-04-04 03:32:41,280 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
> Failed verification of .META.,,1 at address=b3130001.yst.yahoo.net:60020;
> java.net.ConnectException: Connection refused
> 2011-04-04 05:23:52,460 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-3142768255648714362_53688655 from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> 2011-04-04 09:57:14,606 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-8807886313487927171_53799733 from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> java.io.InterruptedIOException: Aborting compaction of store content in
> region
> WCC,r:jp#co#toptour#topsrv1!/searchtour/domestic/tur_lst.php?dst_are1=10&dst_dit1=50%2C173&dst_chg=1&dp_d=&dpt=2&tur_prd=&tur_sty=&sel_ple=2&mny_sm=&mny_bg=&air_typ=&grp=31&cal_flg=1&dst_flg=0&are1=10&cty1=50&dit1=173&grp=31&are2=&cty2=&dit2=&are3=&cty3=&dit3=&pps=&agt=&sort=1&htl=&opt_flg7=&opt_flg8=&opt_flg9=&kwd_dst=&kwd_htl=!http,1301454677216.9ca97f291d075c375143d3e65de1168c.
> because user requested stop.
> bash-3.00$ grep FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-04
> bash-3.00$ grep FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> 2011-04-05 05:15:38,940 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993):
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> 2011-04-05 05:15:39,022 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> bash-3.00$ grep -b35 -a30 FATAL
> hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> 607897318- at javax.security.auth.Subject.doAs(Subject.java:396)
> 607897373- at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> 607897435-
> 607897436- at org.apache.hadoop.ipc.Client.call(Client.java:742)
> 607897491- at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> 607897550- at $Proxy4.commitBlockSynchronization(Unknown Source)
> 607897605- at
> org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1570)
> 607897687- at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
> 607897772- at
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
> 607897857- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 607897921- at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 607898003- at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 607898093- at java.lang.reflect.Method.invoke(Method.java:597)
> 607898146- at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> 607898202- at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
> 607898266- at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:958)
> 607898330- at java.security.AccessController.doPrivileged(Native Method)
> 607898393- at javax.security.auth.Subject.doAs(Subject.java:396)
> 607898448- at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> 607898510-
> 607898511- at org.apache.hadoop.ipc.Client.call(Client.java:742)
> 607898566- at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> 607898625- at $Proxy9.recoverBlock(Unknown Source)
> 607898666- at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
> 607898761- at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
> 607898847- at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
> 607898938-2011-04-05 05:15:38,841 WARN org.apache.hadoop.hdfs.DFSClient:
> Error Recovery for block blk_8685130736369835846_54964870 failed because
> recovery from primary datanode 67.195.53.55:4610 failed 1 times. Pipeline
> was 67.195.53.55:4610, 67.195.57.241:4610. Will retry...
> 607899207-2011-04-05 05:15:38,842 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> 607899302-2011-04-05 05:15:38,852 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> 607899397-2011-04-05 05:15:38,852 INFO
> org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 607899494:2011-04-05 05:15:38,940 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993):
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> 607899866-org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> 607899962- at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
> 607900060- at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
> 607900150- at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> 607900232- at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 607900305-2011-04-05 05:15:38,945 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=77, stores=385, storefiles=1554,
> storefileIndexSize=2179, memstoreSize=0, compactionQueueSize=55,
> usedHeap=2522, maxHeap=7993, blockCacheSize=13752888,
> blockCacheFree=1662631752, blockCacheCount=0, blockCacheHitCount=0,
> blockCacheMissCount=43383235, blockCacheEvictedCount=0, blockCacheHitRatio=0,
> blockCacheHitCachingRatio=0
> 607900750-2011-04-05 05:15:38,945 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
> regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from
> ZooKeeper, aborting
> 607900992-2011-04-05 05:15:38,945 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 607901076:2011-04-05 05:15:39,022 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
> regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> 607901465-org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> 607901658- at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 607901732- at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> 607901829- at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> 607901934- at
> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> 607902002- at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
> 607902090- at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
> 607902179- at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> 607902280- at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
> 607902363- at java.lang.Thread.run(Thread.java:619)
> 607902405-Caused by: org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
> 607902603- at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
> 607902688- at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
> 607902780- at
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:638)
> 607902860- at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 607902924- at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 607903014- at java.lang.reflect.Method.invoke(Method.java:597)
> 607903067- at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> 607903139- at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
> 607903218-
> 607903219- at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
> 607903290- at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> 607903365- at $Proxy3.regionServerReport(Unknown Source)
> 607903412- at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:723)
> 607903513- ... 2 more
>
>
>
> On 4/11/11 10:12 PM, "stack" <[email protected]> wrote:
>
> It would be worth studying when compactions run..... pick a few
> regions. You should do all you can to minimize how much compacting
> you do (I wish I could come hang w/ you lot for a week to play with
> this stuff)
> St.Ack
>
> On Mon, Apr 11, 2011 at 9:35 PM, Vidhyashankar Venkataraman
> <[email protected]> wrote:
>> In fact, I had wanted to get a split for 30 days and then disable the splits
>> (so that I get a rough distribution of URLs over a 30-day period: that's the
>> max expiration time of docs).
>>
>> But there were too many things happening to pinpoint the exact bottleneck.
>> So that's our next task once we disable splits: to find out a good
>> compaction frequency. Also, just so you realize, the max compact files is 5,
>> which means minor compactions happen roughly every 5 hours or greater for
>> every region.
>>
>> On 4/11/11 9:28 PM, "stack" <[email protected]> wrote:
>>
>>
>> Hmm... OK.
>>
>> Every hour. Yes. You want minor compactions to run then. You want
>> to be careful, though, that we don't over-compact. We should study your
>> running cluster and see if there is anything we can surmise from how
>> it runs. See if we can optimize our compaction settings for your
>> particular hourly case.
>>
>
>
> ------ End of Forwarded Message
>
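For readers hitting the same symptoms, the knobs discussed in this thread map to hbase-site.xml settings roughly as follows. This is an illustrative fragment, not a recommendation: the values are examples, and the mapping of "max compact files is 5" to hbase.hstore.compactionThreshold is my reading of the thread, so verify names and defaults against your HBase version.

```xml
<!-- Illustrative hbase-site.xml fragment; values are examples only. -->
<configuration>
  <!-- How long ZooKeeper waits before declaring a region server's session
       expired. A GC pause (or partition) longer than this triggers the
       session-expiry / YouAreDeadException sequence seen in the log above. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
  <!-- Likely the "max compact files is 5" setting: a store with this many
       store files becomes a candidate for minor compaction. -->
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>5</value>
  </property>
  <!-- The "max file/region size was 8 gigs" setting: a region splits once
       it exceeds this size (moot once splits are disabled). -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>8589934592</value>
  </property>
</configuration>
```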
