Ben, the compaction is done in a background thread, so it doesn't block
anything. Now, if your heap is close to 2GB, you could easily run
into issues.
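
A quick generic-JVM sketch (nothing HBase-specific, just an illustration of
the kind of check I mean) is to log the JVM's effective max heap and see how
close it really sits to that range:

  // Hypothetical check: compare the JVM's configured max heap to the ~2GB
  // range where long stop-the-world GC pauses start to become a real risk.
  long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024L * 1024L);
  System.out.println("JVM max heap: " + maxHeapMb + " MB");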

J-D

On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <[email protected]> wrote:
> Just to chime in here: the other thing we changed is that our max_file_size is
> now set to 2GB instead of 512MB, which could be causing long compaction
> times. If a compaction takes too long, the region server stops responding and
> can be marked as dead. I have had this happen on my dev cluster a few times.
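>
> In case it's useful, here is a small untested sketch (plain 0.90 client
> classes; I'm assuming "max_file_size" above maps to hbase.hregion.max.filesize)
> for confirming what the cluster-wide split size actually resolves to:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>
>   public class CheckMaxFileSize {
>     public static void main(String[] args) {
>       // Reads hbase-site.xml from the classpath like any other client would.
>       Configuration conf = HBaseConfiguration.create();
>       // 268435456 (256MB) is the stock default; per-table MAX_FILESIZE overrides it.
>       long maxFileSize = conf.getLong("hbase.hregion.max.filesize", 268435456L);
>       System.out.println("hbase.hregion.max.filesize = " + maxFileSize);
>     }
>   }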
>
> -Ben
>
> On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> This is probably a red herring. For example, if the region server had a
>> big GC pause, the master could have already split the log and the
>> region server wouldn't be able to close it (that's our version of IO
>> fencing). So from that exception, look back in the log and see if
>> there's anything like:
>>
>> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
>> not heard from server in some_big_number ms
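>>
>> The timeout it has to blow through is zookeeper.session.timeout (180000 ms
>> by default in 0.90, if I remember right). A rough sketch, assuming the usual
>> client config is on the classpath, for checking what the cluster is set to:
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>
>>   public class CheckZkTimeout {
>>     public static void main(String[] args) {
>>       Configuration conf = HBaseConfiguration.create();
>>       // A GC pause longer than this gets the region server declared dead.
>>       int zkTimeoutMs = conf.getInt("zookeeper.session.timeout", 180000);
>>       System.out.println("zookeeper.session.timeout = " + zkTimeoutMs + " ms");
>>     }
>>   }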
>>
>> J-D
>>
>> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
>> <[email protected]> wrote:
>> >
>> >  Thanks for the response, Stack.  Yes, we tried increasing
>> dfs.datanode.handler.count to 8.  At this point I would say it didn't seem
>> to resolve the issue we are seeing, but it also doesn't seem to be
>> hurting anything, so for right now we're going to leave it at 8 while we
>> continue to debug.
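>> >
>> >  Side note: here is a small untested sketch (plain Hadoop Configuration,
>> > with hdfs-site.xml assumed to be on the classpath) for double-checking the
>> > value the datanodes should be picking up; 3 is the stock default:
>> >
>> >   import org.apache.hadoop.conf.Configuration;
>> >
>> >   public class CheckHandlerCount {
>> >     public static void main(String[] args) {
>> >       Configuration conf = new Configuration();
>> >       // hdfs-site.xml is not pulled in automatically by a bare client Configuration.
>> >       conf.addResource("hdfs-site.xml");
>> >       int handlers = conf.getInt("dfs.datanode.handler.count", 3);
>> >       System.out.println("dfs.datanode.handler.count = " + handlers);
>> >     }
>> >   }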
>> >
>> >  In regard to the original error I posted (Block 'x' is not valid), we
>> have chased that down thanks to your suggestion of looking at the logs for
>> the history of the block.  It _looks_ like our 'is not valid' block errors
>> are unrelated, and are due to chmod'ing or deleting mapreduce output directories
>> directly after a run.  We are still isolating that, but it looks like it's
>> not HBase related, so I'll move that to another list.  Thank you very much
>> for your debugging suggestions.
>> >
>> >   The one issue we are still seeing is that we will occasionally have a
>> regionserver die with the following exception.  I need to chase it down a
>> little more, but it seems similar to a post from 2/13/2011 (
>> http://www.mail-archive.com/[email protected]/msg05550.html ) that I'm
>> not sure was ever resolved.  If anyone has any insight on how to
>> debug the following error a little more, I would appreciate any thoughts you
>> might have.
>> >
>> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
>> closing file /user/hbase/.logs/hd10.dfs.returnpath.net
>> ,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
>> java.io.IOException: Error Recovery for block
>> blk_1315316969665710488_29842654 failed  because recovery from primary
>> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
>> Aborting...
>> > java.io.IOException: Error Recovery for block
>> blk_1315316969665710488_29842654 failed  because recovery from primary
>> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
>> Aborting...
>> >        at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>> >        at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>> >
>> > Other than the above exception occasionally causing a region server to die,
>> everything seems to be working well.
>> >
>> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 (Hadoop
>> 0.20.2+923.21 and HBase 0.90.1+15.18) and still see the above exception.
>>  We do have ulimit set (memory unlimited and files 32k) for the user
>> running hbase.
>> >
>> > Thanks again for your help
>> >
>> >  Andy
>> >
>> > -----Original Message-----
>> > From: [email protected] [mailto:[email protected]] On Behalf Of
>> Stack
>> > Sent: Sunday, April 10, 2011 1:16 PM
>> > To: [email protected]
>> > Cc: Andy Sautins
>> > Subject: Re: DFS stability running HBase and
>> dfs.datanode.handler.count...
>> >
>> > Did you try upping it, Andy?  Andrew Purtell's recommendation, though old,
>> would have come from experience.  The Intel article reads like sales, but there
>> is probably merit to its suggestion.  The Cloudera article is more unsure
>> about the effect of upping handlers, though it allows it "...could be set a bit
>> higher."
>> >
>> > I just looked at our prod frontend and it's still set to 3.  I don't see
>> your exceptions in our DN log.
>> >
>> > What version of Hadoop?  You say HBase 0.91.  Do you mean 0.90.1?
>> >
>> > ulimit and nproc are set sufficiently high for the hadoop/hbase user?
>> >
>> > If you grep for 163126943925471435_28809750 in the namenode log, do you see a
>> delete occur before a later open?
>> >
>> > St.Ack
>> >
>> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <
>> [email protected]> wrote:
>> >>
>> >>    I ran across a mailing list posting from 1/4/2009 that seemed to
>> indicate that increasing dfs.datanode.handler.count could help improve DFS
>> stability (
>> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E).
>>   The posting seems to indicate the wiki was updated, but I don't see
>> anything in the wiki about increasing dfs.datanode.handler.count.   I have
>> seen a few other notes that show examples of raising
>> dfs.datanode.handler.count, including an Intel article (
>> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/)
>>  and the Pro Hadoop book, but other than that the only other mention I see,
>> from Cloudera, seems lukewarm on increasing dfs.datanode.handler.count (
>> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/).
>> >>
>> >>    Given that the post is from 2009, I thought I'd ask if anyone has had any
>> success improving the stability of HBase/DFS by increasing
>> dfs.datanode.handler.count.  The specific error we are seeing somewhat
>> frequently (a few hundred times per day) in the datanode logs is as
>> follows:
>> >>
>> >> 2011-04-09 00:12:48,035 ERROR
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> DatanodeRegistration(10.18.0.33:50010,
>> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not
>> valid.
>> >>
>> >>   The above seems to correspond to ClosedChannelExceptions in the HBase
>> regionserver logs, as well as some warnings about long writes to the HLog (some
>> in the 50+ second range).
>> >>
>> >>    The biggest end-user-facing issue we are seeing is that Task Trackers
>> keep getting blacklisted.  It's quite possible our problem is unrelated to
>> HBase, but I thought it was worth asking given what we've been
>> seeing.
>> >>
>> >>   We are currently running 0.91 on an 18-node cluster with ~3k total
>> regions, and each region server is running with 2GB of memory.
>> >>
>> >>   Any insight would be appreciated.
>> >>
>> >>   Thanks
>> >>
>> >>    Andy
>> >>
>> >
>>
>
