Then this would be a heap limitation problem, have a look at your GC
log during the compaction.

J-D
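
A minimal sketch of the GC log check suggested above, assuming the region server JVM was started with -XX:+PrintGCDetails so that each logged GC event ends in a "[Times: user=... sys=..., real=... secs]" entry. The function name and the 10-second default threshold are illustrative choices, not HBase settings; a sensible threshold depends on your ZooKeeper session timeout.

```python
import re

# Wall-clock time as printed by HotSpot with -XX:+PrintGCDetails, e.g.
# "[Times: user=0.10 sys=0.00, real=12.34 secs]".
# Assumption: the GC log uses this format; other JVMs or flags may differ.
REAL_TIME = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_gc_events(lines, threshold_secs=10.0):
    """Return (line_number, seconds) for each GC event at or above the threshold.

    Note: concurrent collector phases also print a "real" time without
    stopping the application, so each hit should be read in context.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = REAL_TIME.search(line)
        if match is None:
            continue
        seconds = float(match.group(1))
        if seconds >= threshold_secs:
            hits.append((number, seconds))
    return hits
```

Pass it the open GC log file (for example `long_gc_events(open(path))`) and compare the line numbers it reports against the time window of the major compaction.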

On Thu, Apr 14, 2011 at 10:31 AM, Ben Aldrich <[email protected]> wrote:
> Our heapsize is set to 2gb, I think my dev issue was because I was running
> things off of a few vm's. Even though the compaction is in another thread it
> would still fail to respond during major compaction.
>
> -Ben
>
> On Thu, Apr 14, 2011 at 11:26 AM, Jean-Daniel Cryans 
> <[email protected]>wrote:
>
>> Ben, the compaction is done in a background thread, it doesn't block
>> anything. Now if you had a heap close to 2GB, you could easily run
>> into issues.
>>
>> J-D
>>
>> On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <[email protected]> wrote:
>> > Just to chime in here, the other thing we changed was our max_file_size
>> is
>> > now set to 2gb instead of 512mb. This could be causing long compaction
>> > times. If a compaction takes too long it won't respond and can be marked
>> as
>> > dead. I have had this happen on my dev cluster a few times.
>> >
>> > -Ben
>> >
>> > On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <
>> [email protected]>wrote:
>> >
>> >> This is probably a red herring, for example if the region server had a
>> >> big GC pause then the master could have already split the log and the
>> >> region server wouldn't be able to close it (that's our version of IO
>> >> fencing). So from that exception look back in the log and see if
>> >> there's anything like :
>> >>
>> >> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
>> >> not heard from server in some_big_number ms
>> >>
>> >> J-D
>> >>
>> >> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
>> >> <[email protected]> wrote:
>> >> >
>> >> >  Thanks for the response stack.  Yes we tried increasing
>> >> dfs.datanode.handler.count to 8.   At this point I would say it didn't
>> seem
>> >> to resolve the issue we are seeing, but it also doesn't seem to be
>> >> hurting anything so for right now we're going to leave it in at 8 while
>> we
>> >> continue to debug.
>> >> >
>> >> >  In regard to the original error I posted ( Block 'x' is not valid )
>> we
>> >> have chased that down thanks to your suggestion of looking at the logs
>> for
>> >> the history of the block.  It _looks_ like our 'is not valid' block
>> errors
>> >> are unrelated and due to chmod or deleting mapreduce output directories
>> >> directly after a run.  We are still isolating that but it looks like
>> it's
>> >> not HBase related so I'll move that to another list.  Thank you very
>> much
>> >> for your debugging suggestions.
>> >> >
>> >> >   The one issue we are still seeing is that we will occasionally have
>> a
>> >> regionserver die with the following exception.  I need to chase that
>> down a
>> >> little more but it seems similar to a post from 2/13/2011 (
>> >> http://www.mail-archive.com/[email protected]/msg05550.html ) that
>> I'm
>> >> not sure was ever resolved or not.  If anyone has any insight on how to
>> >> debug the following error a little more I would appreciate any thoughts
>> you
>> >> might have.
>> >> >
>> >> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient:
>> Exception
>> >> closing file /user/hbase/.logs/hd10.dfs.returnpath.net
>> >> ,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
>> >> java.io.IOException: Error Recovery for block
>> >> blk_1315316969665710488_29842654 failed  because recovery from primary
>> >> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was
>> 10.18.0.16:50010.
>> >> Aborting...
>> >> > java.io.IOException: Error Recovery for block
>> >> blk_1315316969665710488_29842654 failed  because recovery from primary
>> >> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was
>> 10.18.0.16:50010.
>> >> Aborting...
>> >> >        at
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>> >> >        at
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>> >> >
>> >> > Other than the above exception causing a region server to die
>> >> occasionally everything seems to be working well.
>> >> >
>> >> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
>> >> 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above
>> exception.
>> >>  We do have ulimit set ( memory unlimited and files 32k ) for the user
>> >> running hbase.
>> >> >
>> >> > Thanks again for your help
>> >> >
>> >> >  Andy
>> >> >
>> >> > -----Original Message-----
>> >> > From: [email protected] [mailto:[email protected]] On Behalf Of
>> >> Stack
>> >> > Sent: Sunday, April 10, 2011 1:16 PM
>> >> > To: [email protected]
>> >> > Cc: Andy Sautins
>> >> > Subject: Re: DFS stability running HBase and
>> >> dfs.datanode.handler.count...
>> >> >
>> >> > Did you try upping it Andy?  Andrew Purtell's recommendation though
>> old
>> >> would have come of experience.  The Intel article reads like sales but
>> there
>> >> is probably merit to its suggestion.  The Cloudera article is more
>> unsure
>> >> about the effect of upping handlers though it allows "...could be set a
>> bit
>> >> higher."
>> >> >
>> >> > I just looked at our prod frontend and it's set to 3 still.  I don't
>> see
>> >> your exceptions in our DN log.
>> >> >
>> >> > What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
>> >> >
>> >> > ulimit and nproc are set sufficiently high for hadoop/hbase user?
>> >> >
>> >> > If you grep 163126943925471435_28809750 in namenode log, do you see a
>> >> delete occur before a later open?
>> >> >
>> >> > St.Ack
>> >> >
>> >> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <
>> >> [email protected]> wrote:
>> >> >>
>> >> >>    I ran across a mailing list posting from 1/4/2009 that seemed to
>> >> indicate increasing dfs.datanode.handler.count could help improve DFS
>> >> stability (
>> >>
>> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E).
>>  The posting seems to indicate the wiki was updated, but I don't see
>> >> anything in the wiki about increasing dfs.datanode.handler.count.   I
>> have
>> >> seen a few other notes that seem to show examples that have raised
>> >> dfs.datanode.handler.count including one from an Intel article (
>> >>
>> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/)
>> and the Pro Hadoop book, but other than that the only other mention I see
>> >> is from Cloudera, which seems lukewarm on increasing
>> dfs.datanode.handler.count (
>> >>
>> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
>> ).
>> >> >>
>> >> >>    Given the post is from 2009 I thought I'd ask if anyone has had
>> any
>> >> success improving stability of HBase/DFS when increasing
>> >> dfs.datanode.handler.count.  The specific error we are seeing somewhat
>> >>  frequently ( few hundred times per day ) in the datanode logs is as
>> >> follows:
>> >> >>
>> >> >> 2011-04-09 00:12:48,035 ERROR
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> >> DatanodeRegistration(10.18.0.33:50010,
>> >> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not
>> >> valid.
>> >> >>
>> >> >>   The above seems to correspond to ClosedChannelExceptions in the
>> hbase
>> >> regionserver logs as well as some warnings about long write to hlog (
>> some
>> >> in the 50+ seconds ).
>> >> >>
>> >> >>    The biggest end-user facing issue we are seeing is that Task
>> Trackers
>> >> keep getting blacklisted.  It's quite possible our problem is unrelated
>> to
>> >> anything HBase, but I thought it was worth asking given what we've been
>> >> seeing.
>> >> >>
>> >> >>   We are currently running 0.91 on an 18 node cluster with ~3k total
>> >> regions and each region server is running with 2G of memory.
>> >> >>
>> >> >>   Any insight would be appreciated.
>> >> >>
>> >> >>   Thanks
>> >> >>
>> >> >>    Andy
>> >> >>
>> >> >
>> >>
>> >
>>
>
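
J-D's earlier suggestion in this thread, to look back in the region server log for a ZooKeeper "Client session timed out" message, can be sketched roughly as follows. The message text in the pattern is taken from his mail; the function name is hypothetical, and the sketch assumes each log entry sits on a single line in the real log file (it is only wrapped in the mail above).

```python
import re

# Message text as quoted in the thread; the number of milliseconds varies.
SESSION_TIMEOUT = re.compile(
    r"Client session timed out, have not heard from server in (\d+)\s*ms"
)


def session_timeouts(lines):
    """Return (line_number, milliseconds) for each ZooKeeper session timeout."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = SESSION_TIMEOUT.search(line)
        if match is not None:
            hits.append((number, int(match.group(1))))
    return hits
```

A hit shortly before the DFSClient "Error Recovery for block" exception would support the explanation that a long pause came first and the failed close of the log file was only a consequence.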
