Ben, the compaction is done in a background thread; it doesn't block anything. Now if you had a heap close to 2GB, you could easily run into issues.
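If GC pauses on a ~2GB heap are the suspect, one quick check is to enable GC logging for the region server JVM so long stop-the-world pauses become visible. A minimal sketch for hbase-env.sh; the log path and the exact flag set are assumptions, not the only reasonable choices:

```shell
# hbase-env.sh -- log GC activity so multi-second pauses show up
# (path and flags are illustrative; adjust for your install)
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"
```

A pause in that log longer than the zookeeper session timeout would line up with the master declaring the region server dead.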
J-D

On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <[email protected]> wrote:
> Just to chime in here, the other thing we changed was our max_file_size,
> which is now set to 2GB instead of 512MB. This could be causing long
> compaction times. If a compaction takes too long, the region server won't
> respond and can be marked as dead. I have had this happen on my dev
> cluster a few times.
>
> -Ben
>
> On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> This is probably a red herring. For example, if the region server had a
>> big GC pause, then the master could have already split the log and the
>> region server wouldn't be able to close it (that's our version of IO
>> fencing). So from that exception, look back in the log and see if
>> there's anything like:
>>
>> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
>> not heard from server in some_big_number ms
>>
>> J-D
>>
>> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins <[email protected]> wrote:
>> >
>> > Thanks for the response, Stack. Yes, we tried increasing
>> > dfs.datanode.handler.count to 8. At this point I would say it didn't
>> > seem to resolve the issue we are seeing, but it also doesn't seem to
>> > be hurting anything, so for right now we're going to leave it at 8
>> > while we continue to debug.
>> >
>> > In regard to the original error I posted ("Block 'x' is not valid"),
>> > we have chased that down thanks to your suggestion of looking at the
>> > logs for the history of the block. It _looks_ like our "is not valid"
>> > block errors are unrelated, and due to chmod'ing or deleting mapreduce
>> > output directories directly after a run. We are still isolating that,
>> > but it looks like it's not HBase related, so I'll move that to another
>> > list. Thank you very much for your debugging suggestions.
>> >
>> > The one issue we are still seeing is that we will occasionally have a
>> > region server die with the following exception.
>> > I need to chase that down a little more, but it seems similar to a
>> > post from 2/13/2011
>> > (http://www.mail-archive.com/[email protected]/msg05550.html)
>> > that I'm not sure was ever resolved. If anyone has any insight on how
>> > to debug the following error a little more, I would appreciate any
>> > thoughts you might have.
>> >
>> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
>> > closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
>> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654
>> > failed because recovery from primary datanode 10.18.0.16:50010 failed
>> > 6 times. Pipeline was 10.18.0.16:50010. Aborting...
>> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654
>> > failed because recovery from primary datanode 10.18.0.16:50010 failed
>> > 6 times. Pipeline was 10.18.0.16:50010. Aborting...
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>> >
>> > Other than the above exception causing a region server to die
>> > occasionally, everything seems to be working well.
>> >
>> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 (hadoop
>> > 0.20.2+923.21 and hbase 0.90.1+15.18) and still see the above
>> > exception. We do have ulimit set (memory unlimited and files 32k) for
>> > the user running hbase.
>> >
>> > Thanks again for your help
>> >
>> > Andy
>> >
>> > -----Original Message-----
>> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>> > Sent: Sunday, April 10, 2011 1:16 PM
>> > To: [email protected]
>> > Cc: Andy Sautins
>> > Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
>> >
>> > Did you try upping it, Andy? Andrew Purtell's recommendation, though
>> > old, would have come of experience.
>> > The Intel article reads like sales, but there is probably merit to
>> > its suggestion. The Cloudera article is more unsure about the effect
>> > of upping handlers, though it allows it "...could be set a bit higher."
>> >
>> > I just looked at our prod frontend and it's set to 3 still. I don't
>> > see your exceptions in our DN log.
>> >
>> > What version of hadoop? You say hbase 0.91. You mean 0.90.1?
>> >
>> > ulimit and nproc are set sufficiently high for the hadoop/hbase user?
>> >
>> > If you grep 163126943925471435_28809750 in the namenode log, do you
>> > see a delete occur before a later open?
>> >
>> > St.Ack
>> >
>> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <[email protected]> wrote:
>> >>
>> >> I ran across a mailing list posting from 1/4/2009 that seemed to
>> >> indicate increasing dfs.datanode.handler.count could help improve DFS
>> >> stability
>> >> (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E).
>> >> The posting seems to indicate the wiki was updated, but I don't see
>> >> anything in the wiki about increasing dfs.datanode.handler.count. I
>> >> have seen a few other notes with examples that raise
>> >> dfs.datanode.handler.count, including one from an Intel article
>> >> (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/)
>> >> and the Pro Hadoop book, but other than that, the only other mention
>> >> I see, from Cloudera, seems lukewarm on increasing
>> >> dfs.datanode.handler.count
>> >> (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/).
>> >>
>> >> Given the post is from 2009, I thought I'd ask if anyone has had any
>> >> success improving the stability of HBase/DFS by increasing
>> >> dfs.datanode.handler.count.
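For reference, the setting under discussion lives in hdfs-site.xml on each datanode. A minimal sketch of the change described in this thread (the value 8 comes from Andy's test, not from any general recommendation; the default in this era of Hadoop was 3):

```xml
<!-- hdfs-site.xml: raise the datanode's RPC handler thread count -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>
```

Each datanode must be restarted for the change to take effect.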
>> >> The specific error we are seeing somewhat frequently (a few hundred
>> >> times per day) in the datanode logs is as follows:
>> >>
>> >> 2011-04-09 00:12:48,035 ERROR
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> DatanodeRegistration(10.18.0.33:50010,
>> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>> >>
>> >> The above seems to correspond to ClosedChannelExceptions in the hbase
>> >> regionserver logs, as well as some warnings about long writes to the
>> >> hlog (some in the 50+ seconds).
>> >>
>> >> The biggest end-user facing issue we are seeing is that Task Trackers
>> >> keep getting blacklisted. It's quite possible our problem is
>> >> unrelated to anything HBase, but I thought it was worth asking given
>> >> what we've been seeing.
>> >>
>> >> We are currently running 0.91 on an 18 node cluster with ~3k total
>> >> regions, and each region server is running with 2G of memory.
>> >>
>> >> Any insight would be appreciated.
>> >>
>> >> Thanks
>> >>
>> >> Andy
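Stack's suggestion above (grep the block ID in the namenode log and look for a delete before a later open) can be sketched as follows. The log excerpt here is entirely synthetic so the example is self-contained; only the block ID comes from this thread, and in practice you would point grep at your real namenode log:

```shell
# Follow a block's history to see whether a delete happens before a
# later open. The lines below are made-up placeholders, not real
# namenode log format -- only the block ID matters for the grep.
cat > /tmp/namenode-excerpt.log <<'EOF'
... allocate ... blk_-163126943925471435_28809750
... delete ... blk_-163126943925471435_28809750
... open ... blk_-163126943925471435_28809750
EOF

# A delete preceding a later open means a client held a stale block
# reference -- consistent with "Block ... is not valid" on the datanode.
grep -n 'blk_-163126943925471435_28809750' /tmp/namenode-excerpt.log
```

Sorting the matches by timestamp (real log lines start with one) makes the allocate/delete/open ordering obvious.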
