On Thu, Sep 23, 2010 at 10:10 AM, Scott Whitecross <[email protected]> wrote:
> On Fri, Sep 17, 2010 at 12:58 PM, Stack <[email protected]> wrote:
>> On Fri, Sep 17, 2010 at 9:14 AM, Scott Whitecross <[email protected]> wrote:
>> > Hi all -
>> >
>> > A couple of nights ago I enabled cron jobs to run major compactions against
>> > a few of the tables that I use in HBase. This has caused multiple worker
>> > machines on the cluster to fail. Either because of the compaction or because
>> > of losing the worker nodes, many of the regions are stuck in transition with
>> > a state of PENDING_CLOSE. I believe restarting the HBase master will solve
>> > that, which I will do after a few of the current processes finish. What is
>> > the risk of losing the regions stuck in transition? (Running HBase 0.20.5)
>>
>> Restarting the master should address this.
>>
>> If you want to figure out how we got into this state, grab the region name
>> that shows as PENDING_CLOSE. Grep for it in the master log. Find which host
>> it's on. Go to that host. Check what it's up to. Is it one of the hosts
>> that had trouble talking to HDFS? Or was it one of the regionservers that
>> shut down?
>>
>> My guess is that the regionserver crashed while the CLOSE was in flight.
>
> Thanks. Restarting the master and loading back in the missing files seemed
> to solve the immediate HBase problems.
>
>> > I am concerned about not being able to successfully run compactions on our
>> > cluster. It was my understanding that major compactions happen automatically
>> > roughly every 24 hours, so I'm surprised that forcing the process caused
>> > issues.
>>
>> Can you check your logs to see if they were actually running? Maybe
>> they weren't. See HBASE-2990.
>
> I haven't had a chance yet to search the logs. Will the master note that a
> major compaction is starting?
>
>> > Any suggestions on how to start debugging the issue, or what settings to
>> > look at? Starting to dig through the logs shows that HBase couldn't access
>> > HDFS on the same box. (Log below.)
>>
>> It'd be good to see more log from around the shutdown of a regionserver.
>>
>> You upped the ulimit and xceiver count?
>
> How high can these counts go? Is it possible to determine the ideal xceiver
> count for different-sized clusters?
>
> Right now the ulimit is at 100000 and xceivers are at 4096, though we may
> raise these as we're seeing errors with some jobs.
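
For what it's worth, a quick way to sanity-check those two settings on a
datanode/regionserver box is below. The "hadoop" user and the config path are
assumptions for a stock 0.20-era layout; adjust to wherever your daemons and
configs actually live:

  # open-file limit as seen by the user that runs the datanode/regionserver
  su - hadoop -c 'ulimit -n'

  # xceiver ceiling configured on the datanode (the property name really is
  # spelled "xcievers" in Hadoop)
  grep -A1 'dfs.datanode.max.xcievers' /etc/hadoop/conf/hdfs-site.xml

If the property isn't set at all, the datanode falls back to a much lower
built-in default (256 on the 0.20 line, if I remember right), which is usually
when the xceiver-limit errors start showing up under compaction load.
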
>> You think the loading from the mass compaction overloaded HDFS?
>>
>> How big is your table?
>
> I think we may have had issues running another intensive job at the same
> time. I did up the xceiver count as well, which seems to have solved some
> of the issues.
>
>> > Currently running a cluster with 40 workers, a dedicated jobtracker box,
>> > and a combined namenode/HBase master.
>> >
>> > The cron call that caused the issue:
>> > 0 2 * * * echo "major_compact 'hbase_table'" | /usr/lib/hbase-0.20/bin/hbase shell > /tmp/hbase_table 2>&1
>>
>> This looks good. If the table is large and it hasn't been compacted in a
>> while, you are going to put a big load on your HDFS.
>
> Are there estimates for how long compactions should take?
>
>> > 2010-09-16 20:37:12,917 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125488064 bytes
>> > 2010-09-16 20:37:12,952 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115363256 bytes. Priority Sizes: Single=294.45364MB (308757016), Multi=488.6598MB (512396944), Memory=224.37555MB (235274808)
>> > 2010-09-16 20:37:29,011 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125542912 bytes
>> > 2010-09-16 20:37:29,040 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115365552 bytes. Priority Sizes: Single=333.65866MB (349866464), Multi=449.44424MB (471276440), Memory=224.37555MB (235274808)
>> > 2010-09-16 20:37:39,626 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=951.4796MB (997698720), Free=245.1954MB (257106016), Max=1196.675MB (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883, Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit Ratio=72.07574248313904%, Miss Ratio=27.924257516860962%, Evicted/Run=2569.053955078125
>> > 2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /11.11.11.11:50010 for file /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block -7375581532956939954:java.io.EOFException
>>
>> Check the datanode log on 11.11.11.11 from around this time (nice IP, by
>> the way). You could also grep for 7375581532956939954 in the namenode
>> logs. It can be revealing.
>>
>> Thanks,
>> St.Ack
>>
>> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
>> >         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
>> >
>> > Thanks.
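
In case it's useful, here is roughly how I'd chase the block and the stuck
region from the shell. The log locations are assumptions for a CDH-style
layout; point them at wherever your daemons actually log:

  # what the namenode knew about the block the client failed to read
  grep 7375581532956939954 /var/log/hadoop/*namenode*.log*

  # what the datanode on 11.11.11.11 was doing around 20:37
  ssh 11.11.11.11 "grep 7375581532956939954 /var/log/hadoop/*datanode*.log*"

  # which regions the master still has in transition, and on which hosts
  grep PENDING_CLOSE /var/log/hbase/*master*.log* | tail -n 20

The first two usually tell you whether the datanode dropped the connection
(ulimit/xceivers) or the block really went missing; the last one gives you the
region names to grep for next.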

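On whether the automatic major compactions were actually happening (the
HBASE-2990 question) and how long they take: compactions run on the
regionservers, not the master, so their logs are the place to look. A rough
sketch below; the log path is an assumption, the exact log wording varies a
bit between 0.20.x releases, and 'other_table' is just a made-up second table
name to illustrate staggering the cron entries so the whole cluster isn't
compacting at 2 AM:

  # did major compactions run, and how long did they take?
  grep -i "major compaction" /var/log/hbase/*regionserver*.log* | tail -n 20

  # stagger the forced compactions a couple of hours apart per table
  0 2 * * * echo "major_compact 'hbase_table'" | /usr/lib/hbase-0.20/bin/hbase shell > /tmp/hbase_table 2>&1
  0 4 * * * echo "major_compact 'other_table'" | /usr/lib/hbase-0.20/bin/hbase shell > /tmp/other_table 2>&1

Spreading them out at least bounds how much rewrite load hits HDFS at any one
time, which seems to be what was overloading the datanodes here.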