Then this would be a heap-limitation problem; have a look at your GC log during the compaction.
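A minimal sketch of that check, for the record: scan the GC log for stop-the-world pauses longer than the ZooKeeper session timeout. This assumes GC logging is enabled with -XX:+PrintGCDetails and a 60-second session timeout; the sample log lines below are illustrative placeholders in that style, not output from this cluster.

```python
import re

# Illustrative sample lines in the -XX:+PrintGCDetails style.
# These are placeholders, NOT logs from the cluster in this thread;
# in practice, read your real GC log file instead.
GC_LOG = """\
1021.342: [GC [ParNew: 118016K->13056K(118016K), 0.0412310 secs] 411231K->320145K(2080768K), 0.0413550 secs] [Times: user=0.31 sys=0.01, real=0.04 secs]
1189.557: [Full GC [CMS: 1843200K->912331K(1962752K), 87.1123450 secs] 1901322K->912331K(2080768K), 87.1131200 secs] [Times: user=88.91 sys=0.40, real=87.11 secs]
"""

# Assumed ZK session timeout; a pause longer than this can get the
# region server declared dead even though compaction runs in another thread.
THRESHOLD_SECS = 60.0

def long_pauses(log_text, threshold=THRESHOLD_SECS):
    """Return (uptime, pause_secs) for every GC pause over the threshold."""
    pauses = []
    for line in log_text.splitlines():
        m = re.search(r"real=(\d+\.\d+) secs", line)
        if m and float(m.group(1)) > threshold:
            pauses.append((line.split(":", 1)[0], float(m.group(1))))
    return pauses

for uptime, secs in long_pauses(GC_LOG):
    print("GC pause of %.2fs at uptime %ss -- longer than ZK timeout" % (secs, uptime))
```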
J-D

On Thu, Apr 14, 2011 at 10:31 AM, Ben Aldrich <[email protected]> wrote:
> Our heap size is set to 2gb. I think my dev issue was because I was running
> things off of a few vm's. Even though the compaction is in another thread, it
> would still fail to respond during major compaction.
>
> -Ben
>
> On Thu, Apr 14, 2011 at 11:26 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> Ben, the compaction is done in a background thread; it doesn't block
>> anything. Now, if you had a heap close to 2GB, you could easily run
>> into issues.
>>
>> J-D
>>
>> On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <[email protected]> wrote:
>> > Just to chime in here, the other thing we changed was our max_file_size is
>> > now set to 2gb instead of 512mb. This could be causing long compaction
>> > times. If a compaction takes too long it won't respond and can be marked as
>> > dead. I have had this happen on my dev cluster a few times.
>> >
>> > -Ben
>> >
>> > On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <[email protected]> wrote:
>> >
>> >> This is probably a red herring; for example, if the region server had a
>> >> big GC pause then the master could have already split the log and the
>> >> region server wouldn't be able to close it (that's our version of IO
>> >> fencing). So from that exception, look back in the log and see if
>> >> there's anything like:
>> >>
>> >> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
>> >> not heard from server in some_big_number ms
>> >>
>> >> J-D
>> >>
>> >> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins <[email protected]> wrote:
>> >> >
>> >> > Thanks for the response, Stack. Yes, we tried increasing
>> >> > dfs.datanode.handler.count to 8. At this point I would say it didn't seem
>> >> > to resolve the issue we are seeing, but it also doesn't seem to be
>> >> > hurting anything, so for right now we're going to leave it in at 8 while we
>> >> > continue to debug.
>> >> >
>> >> > In regard to the original error I posted (Block 'x' is not valid), we
>> >> > have chased that down thanks to your suggestion of looking at the logs for
>> >> > the history of the block. It _looks_ like our 'is not valid' block errors
>> >> > are unrelated, and due to chmod-ing or deleting mapreduce output directories
>> >> > directly after a run. We are still isolating that, but it looks like it's
>> >> > not HBase related, so I'll move that to another list. Thank you very much
>> >> > for your debugging suggestions.
>> >> >
>> >> > The one issue we are still seeing is that we will occasionally have a
>> >> > regionserver die with the following exception. I need to chase that down a
>> >> > little more, but it seems similar to a post from 2/13/2011
>> >> > (http://www.mail-archive.com/[email protected]/msg05550.html) that I'm
>> >> > not sure was ever resolved. If anyone has any insight on how to
>> >> > debug the following error a little more, I would appreciate any thoughts
>> >> > you might have.
>> >> >
>> >> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
>> >> > closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
>> >> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654
>> >> > failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times.
>> >> > Pipeline was 10.18.0.16:50010. Aborting...
>> >> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654
>> >> > failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times.
>> >> > Pipeline was 10.18.0.16:50010. Aborting...
>> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>> >> >
>> >> > Other than the above exception occasionally causing a region server to
>> >> > die, everything seems to be working well.
>> >> >
>> >> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 (hadoop
>> >> > 0.20.2+923.21 and hbase 0.90.1+15.18) and still see the above exception.
>> >> > We do have ulimit set (memory unlimited and files 32k) for the user
>> >> > running hbase.
>> >> >
>> >> > Thanks again for your help
>> >> >
>> >> > Andy
>> >> >
>> >> > -----Original Message-----
>> >> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>> >> > Sent: Sunday, April 10, 2011 1:16 PM
>> >> > To: [email protected]
>> >> > Cc: Andy Sautins
>> >> > Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
>> >> >
>> >> > Did you try upping it, Andy? Andrew Purtell's recommendation, though old,
>> >> > would have come from experience. The Intel article reads like sales, but
>> >> > there is probably merit to its suggestion. The Cloudera article is more
>> >> > unsure about the effect of upping handlers, though it allows it "...could
>> >> > be set a bit higher."
>> >> >
>> >> > I just looked at our prod frontend and it's set to 3 still. I don't see
>> >> > your exceptions in our DN log.
>> >> >
>> >> > What version of hadoop? You say hbase 0.91. You mean 0.90.1?
>> >> >
>> >> > ulimit and nproc are set sufficiently high for the hadoop/hbase user?
>> >> >
>> >> > If you grep 163126943925471435_28809750 in the namenode log, do you see a
>> >> > delete occur before a later open?
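The check Stack suggests above, sketched as a small script: collect every namenode-log line that mentions the block and see whether a delete shows up before a later open. The sample lines below are abbreviated illustrative placeholders, not verbatim NameNode output; only the block id is taken from this thread.

```python
# Sketch of the suggested namenode-log check. The log lines are
# illustrative placeholders (note the "..."), NOT real NameNode output.
BLOCK = "163126943925471435_28809750"

NAMENODE_LOG = [
    "2011-04-08 23:41:10,112 INFO ...: allocateBlock: /user/foo/part-00000. blk_-163126943925471435_28809750",
    "2011-04-09 00:02:31,009 INFO ...: delete: blk_-163126943925471435_28809750 added to invalidSet",
    "2011-04-09 00:12:40,551 INFO ...: open attempt touching blk_-163126943925471435_28809750",
]

# The grep: keep only lines mentioning the block, in timestamp order.
history = [line for line in NAMENODE_LOG if BLOCK in line]

# Was the block deleted, and then touched again afterwards? That pattern
# would explain a "Block ... is not valid" error on the datanode.
delete_idx = next((i for i, line in enumerate(history) if "delete" in line), None)
open_after = delete_idx is not None and any(
    "open" in line for line in history[delete_idx + 1:]
)
print("delete before a later open:", open_after)
```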
>> >> >
>> >> > St.Ack
>> >> >
>> >> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <[email protected]> wrote:
>> >> >>
>> >> >> I ran across a mailing list posting from 1/4/2009 that seemed to
>> >> >> indicate increasing dfs.datanode.handler.count could help improve DFS
>> >> >> stability
>> >> >> (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E).
>> >> >> The posting seems to indicate the wiki was updated, but I don't see
>> >> >> anything in the wiki about increasing dfs.datanode.handler.count. I have
>> >> >> seen a few other notes that show examples raising
>> >> >> dfs.datanode.handler.count, including one from an Intel article
>> >> >> (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/)
>> >> >> and the Pro Hadoop book, but other than that the only other mention I
>> >> >> see, from Cloudera, seems lukewarm on increasing dfs.datanode.handler.count
>> >> >> (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/).
>> >> >>
>> >> >> Given the post is from 2009, I thought I'd ask if anyone has had any
>> >> >> success improving the stability of HBase/DFS when increasing
>> >> >> dfs.datanode.handler.count. The specific error we are seeing somewhat
>> >> >> frequently (a few hundred times per day) in the datanode logs is as
>> >> >> follows:
>> >> >>
>> >> >> 2011-04-09 00:12:48,035 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> >> DatanodeRegistration(10.18.0.33:50010,
>> >> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>> >> >>
>> >> >> The above seems to correspond to ClosedChannelExceptions in the hbase
>> >> >> regionserver logs, as well as some warnings about long writes to the
>> >> >> hlog (some in the 50+ seconds).
>> >> >>
>> >> >> The biggest end-user facing issue we are seeing is that Task Trackers
>> >> >> keep getting blacklisted. It's quite possible our problem is unrelated
>> >> >> to anything HBase, but I thought it was worth asking given what we've
>> >> >> been seeing.
>> >> >>
>> >> >> We are currently running 0.91 on an 18 node cluster with ~3k total
>> >> >> regions, and each region server is running with 2G of memory.
>> >> >>
>> >> >> Any insight would be appreciated.
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> Andy
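For reference, the two settings this thread keeps coming back to live in hdfs-site.xml and hbase-site.xml respectively. A sketch of the fragments, using the values mentioned in the thread (8 handlers, 2gb max region size) rather than any recommended defaults:

```xml
<!-- hdfs-site.xml: datanode handler threads
     (the thread raised this from the default of 3 to 8) -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>

<!-- hbase-site.xml: max store file size before a region splits
     (the thread raised this from 512mb to 2gb; bigger regions mean
     longer major compactions) -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>2147483648</value>
</property>
```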
