This happened a week back in our cluster: a flash death of region servers. A few of the region servers did have near-full heaps, so I thought GC could be at play, but many of them crashed after a few DFS errors followed by a YouAreDeadException, and those did not have GC problems.
This was in a 700-node cluster. Writes happen only through bulk loads, with 50 regions per region server. After we restarted the cluster, it ran fine.

------ Forwarded Message
From: Vidhyashankar Venkataraman <[email protected]>
Date: Tue, 12 Apr 2011 09:44:24 -0700
To: stack <[email protected]>
Conversation: Hbase configs.
Subject: Re: Hbase configs.

As for the DFS errors: below is a sample log from one of the region servers that had this flash death last week. Notice that the final nail in the coffin was the YouAreDeadException, raised because ZooKeeper did not receive a timely update from the regionserver. Usually this happens when the heap is full, but that doesn't seem to be the case here (grep for FATAL in the following log). The abort was preceded by quite a few DFSClient errors. I have usually noticed these DFSClient errors when there are too many files floating around; for now, we haven't been able to reproduce them. Please note that splits weren't disabled. Compactions were happening once every 5-7 hours per region, with 50 regions per node, and the max file/region size was 8 gigs.
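One way to tell a GC-driven session expiry apart from network or HDFS trouble is to grep backwards from the abort. This is a minimal sketch: it runs against a two-line stand-in log (paraphrased from the incident below) so it is runnable anywhere; on the real cluster you would point LOG at the actual regionserver log instead.

```shell
# Stand-in log; on a real cluster, set LOG to the regionserver log file.
LOG=sample-regionserver.log
cat > "$LOG" <<'EOF'
2011-04-05 05:15:38,841 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8685130736369835846_54964870 failed
2011-04-05 05:15:38,940 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server: received expired from ZooKeeper, aborting
EOF

# 1) Locate the session-expiry abort:
grep -n "received expired from ZooKeeper" "$LOG"

# 2) Show what immediately preceded it (here, a DFSClient recovery failure;
#    on the real log a larger -B window would surface the
#    NotReplicatedYetException run):
grep -B 1 "received expired from ZooKeeper" "$LOG"

# 3) Look for GC evidence; zero matches (and no FATAL OOM lines)
#    points away from GC pauses as the cause:
grep -Ec "concurrent mode failure|promotion failed|Full GC" "$LOG" || true
```

The same three greps, run over each crashed server's log, would quickly show whether every death followed the DFS-errors-then-expiry pattern described above.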
2011-04-04 01:27:28,273 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
2011-04-04 01:27:28,273 WARN org.apache.hadoop.hdfs.DFSClient: NotReplicatedYetException sleeping /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries left 4
2011-04-04 01:27:28,759 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
2011-04-04 01:27:28,759 WARN org.apache.hadoop.hdfs.DFSClient: NotReplicatedYetException sleeping /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries left 3
2011-04-04 01:27:29,562 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
2011-04-04 01:27:29,562 WARN org.apache.hadoop.hdfs.DFSClient: NotReplicatedYetException sleeping /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries left 2
java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=b3130001.yst.yahoo.net/67.195.49.108:60020]
2011-04-04 03:32:41,280 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=b3130001.yst.yahoo.net:60020; java.net.ConnectException: Connection refused
2011-04-04 05:23:52,460 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-3142768255648714362_53688655 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
2011-04-04 09:57:14,606 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-8807886313487927171_53799733 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
java.io.InterruptedIOException: Aborting compaction of store content in region WCC,r:jp#co#toptour#topsrv1!/searchtour/domestic/tur_lst.php?dst_are1=10&dst_dit1=50%2C173&dst_chg=1&dp_d=&dpt=2&tur_prd=&tur_sty=&sel_ple=2&mny_sm=&mny_bg=&air_typ=&grp=31&cal_flg=1&dst_flg=0&are1=10&cty1=50&dit1=173&grp=31&are2=&cty2=&dit2=&are3=&cty3=&dit3=&pps=&agt=&sort=1&htl=&opt_flg7=&opt_flg8=&opt_flg9=&kwd_dst=&kwd_htl=!http,1301454677216.9ca97f291d075c375143d3e65de1168c. because user requested stop.

bash-3.00$ grep FATAL hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-04
bash-3.00$ grep FATAL hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
2011-04-05 05:15:38,940 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, regions=77, usedHeap=2521, maxHeap=7993): regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from ZooKeeper, aborting
2011-04-05 05:15:39,022 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server

bash-3.00$ grep -b35 -a30 FATAL hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
607897318-        at javax.security.auth.Subject.doAs(Subject.java:396)
607897373-        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
607897435-
607897436-        at org.apache.hadoop.ipc.Client.call(Client.java:742)
607897491-        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
607897550-        at $Proxy4.commitBlockSynchronization(Unknown Source)
607897605-        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1570)
607897687-        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
607897772-        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
607897857-        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
607897921-        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
607898003-        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
607898093-        at java.lang.reflect.Method.invoke(Method.java:597)
607898146-        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
607898202-        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
607898266-        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:958)
607898330-        at java.security.AccessController.doPrivileged(Native Method)
607898393-        at javax.security.auth.Subject.doAs(Subject.java:396)
607898448-        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
607898510-
607898511-        at org.apache.hadoop.ipc.Client.call(Client.java:742)
607898566-        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
607898625-        at $Proxy9.recoverBlock(Unknown Source)
607898666-        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
607898761-        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
607898847-        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
607898938-2011-04-05 05:15:38,841 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8685130736369835846_54964870 failed because recovery from primary datanode 67.195.53.55:4610 failed 1 times. Pipeline was 67.195.53.55:4610, 67.195.57.241:4610. Will retry...
607899207-2011-04-05 05:15:38,842 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
607899302-2011-04-05 05:15:38,852 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
607899397-2011-04-05 05:15:38,852 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
607899494:2011-04-05 05:15:38,940 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, regions=77, usedHeap=2521, maxHeap=7993): regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from ZooKeeper, aborting
607899866-org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
607899962-        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
607900060-        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
607900150-        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
607900232-        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
607900305-2011-04-05 05:15:38,945 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=77, stores=385, storefiles=1554, storefileIndexSize=2179, memstoreSize=0, compactionQueueSize=55, usedHeap=2522, maxHeap=7993, blockCacheSize=13752888, blockCacheFree=1662631752, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=43383235, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
607900750-2011-04-05 05:15:38,945 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from ZooKeeper, aborting
607900992-2011-04-05 05:15:38,945 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
607901076:2011-04-05 05:15:39,022 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
607901465-org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
607901658-        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
607901732-        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
607901829-        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
607901934-        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
607902002-        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
607902090-        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
607902179-        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
607902280-        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
607902363-        at java.lang.Thread.run(Thread.java:619)
607902405-Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as dead server
607902603-        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
607902688-        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
607902780-        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:638)
607902860-        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
607902924-        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
607903014-        at java.lang.reflect.Method.invoke(Method.java:597)
607903067-        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
607903139-        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
607903218-
607903219-        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
607903290-        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
607903365-        at $Proxy3.regionServerReport(Unknown Source)
607903412-        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:723)
607903513-        ... 2 more

On 4/11/11 10:12 PM, "stack" <[email protected]> wrote:

It would be worth studying when compactions run... pick a few regions. You should do all you can to minimize how much compacting you do. (I wish I could come hang w/ you lot for a week to play with this stuff.)

St.Ack

On Mon, Apr 11, 2011 at 9:35 PM, Vidhyashankar Venkataraman <[email protected]> wrote:
> In fact, I had wanted to let splits run for 30 days and then disable them
> (so that I get a rough distribution of urls over a 30-day period: that's
> the max expiration time of docs).
>
> But there were too many things happening to pinpoint the exact bottleneck.
> So that's our next task once we disable splits: to find a good compaction
> frequency. Also, just so you realize, the max compact files setting is 5,
> which means minor compactions happen roughly every 5 hours or more for
> every region.
>
> On 4/11/11 9:28 PM, "stack" <[email protected]> wrote:
>
> Hmm... OK. Every hour. Yes. You want minor compactions to run then. You
> want to be careful, though, that we don't over-compact. We should study
> your running cluster and see if there is anything we can surmise from how
> it runs. See if we can optimize our compaction settings for your
> particular hourly case.

------ End of Forwarded Message
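For reference, the knobs this thread keeps circling back to map onto a handful of hbase-site.xml properties (0.90-era names). The snippet below writes an illustrative fragment to a temp file; the values mirror what the thread describes (5 compact files, 8 GB regions, the ZK session whose expiry produced the YouAreDeadException), and are not tuning recommendations.

```shell
# Illustrative hbase-site.xml fragment; values echo the thread, not advice.
cat <<'EOF' > /tmp/hbase-compaction-snippet.xml
<!-- Minor compaction kicks in once a store has this many files
     (the "max compact files is 5" mentioned above). -->
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>5</value>
</property>
<!-- A region splits once a store file passes this size
     (the 8 gig max file/region size above). -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>8589934592</value>
</property>
<!-- The ZooKeeper session timeout; expiry of this session is what
     triggered the abort and YouAreDeadException in the log. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>
EOF
```

Checking what values a crashed cluster was actually running with (rather than what was intended) is often the first step when compaction frequency and session timeouts are suspects.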
