On Fri, Sep 17, 2010 at 12:58 PM, Stack <[email protected]> wrote:
> On Fri, Sep 17, 2010 at 9:14 AM, Scott Whitecross <[email protected]> wrote:
> > Hi all -
> >
> > A couple of nights ago I enabled cron jobs to run major compactions against
> > a few of the tables that I use in HBase. This has caused multiple worker
> > machines on the cluster to fail. Either from the compaction or from losing
> > the worker nodes, many of the regions are stuck in transition with a state
> > of PENDING_CLOSE. I believe restarting the HBase master will solve that,
> > which I will do after a few of the current processes finish. What is the
> > risk of losing the regions stuck in transition? (Running HBase 0.20.5)
>
> Restarting the master should address this.
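
For the archives: bouncing just the master on our install is roughly the
following (the path assumes the same /usr/lib/hbase-0.20 layout as the cron
entry further down; adjust for your own install):

    # stop and start only the master daemon, leaving regionservers alone
    /usr/lib/hbase-0.20/bin/hbase-daemon.sh stop master
    /usr/lib/hbase-0.20/bin/hbase-daemon.sh start master
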
> If you want to figure out how we got into this state, grab the region name
> that shows as PENDING_CLOSE. Grep for it in the master log. Find which host
> it's on. Go to that host. Check what it's up to. Is it one of the hosts that
> had trouble talking to HDFS? Or was it one of the regionservers that shut
> down?
>
> My guess is that the regionserver crashed while the CLOSE was in flight.

Thanks. Restarting the master and loading back in the missing files seemed to
solve the immediate HBase problems.

> > I am concerned about not being able to successfully run compactions on our
> > cluster. It was my understanding that major compactions happened
> > automatically around every 24 hours, so I'm surprised that forcing the
> > process to happen caused issues.
>
> Can you check your logs to see if they were actually running? Maybe they
> weren't. See HBASE-2990.

I haven't had a chance yet to search the logs. Will the master log note that a
major compaction is starting?

> > Any suggestions on how to start debugging the issue, or what settings to
> > look at? Starting to dig through the logs shows that HBase couldn't access
> > HDFS on the same box. (Log below.)
>
> It'd be good to see more log from around the shutdown of a regionserver.
>
> You upped the ulimit and the xceiver count?

How high can these counts go? Is it possible to determine the ideal xceiver
count for different-sized clusters?

> You think the load from the mass compaction overloaded HDFS?
>
> How big is your table?

I think we may have had issues running another intensive job at the same time.
I did up the xceiver count as well, which seems to have solved some of the
issues.
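
In case it helps anyone else reading the thread later, the two settings
usually bumped for this look roughly as below. The values are illustrative,
not recommendations; dfs.datanode.max.xcievers (note Hadoop's own misspelling
of the property name) goes in hdfs-site.xml on every datanode, and the
open-file limit goes in /etc/security/limits.conf for the user running the
hadoop/hbase daemons. The datanodes need a restart, and the ulimit change only
applies to new login sessions.

    <!-- hdfs-site.xml on each datanode -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

    # /etc/security/limits.conf (user name is whatever runs the daemons)
    hadoop  -  nofile  32768
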
> > Currently running a cluster with 40 workers, a dedicated jobtracker box,
> > and a namenode/HBase master.
> >
> > The cron call that caused the issue:
> >
> > 0 2 * * * echo "major_compact 'hbase_table'" | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/hbase_table 2>&1
>
> This looks good. If the table is large and it hasn't compacted in a while,
> you are going to put a big load on your HDFS.

Are there estimates for how long compactions should take?

> > 2010-09-16 20:37:12,917 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction started. Attempting to free 125488064 bytes
> > 2010-09-16 20:37:12,952 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction completed. Freed 115363256 bytes. Priority Sizes:
> > Single=294.45364MB (308757016), Multi=488.6598MB (512396944), Memory=224.37555MB (235274808)
> > 2010-09-16 20:37:29,011 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction started. Attempting to free 125542912 bytes
> > 2010-09-16 20:37:29,040 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction completed. Freed 115365552 bytes. Priority Sizes:
> > Single=333.65866MB (349866464), Multi=449.44424MB (471276440), Memory=224.37555MB (235274808)
> > 2010-09-16 20:37:39,626 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Cache Stats: Sizes: Total=951.4796MB (997698720), Free=245.1954MB (257106016),
> > Max=1196.675MB (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883,
> > Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit Ratio=72.07574248313904%,
> > Miss Ratio=27.924257516860962%, Evicted/Run=2569.053955078125
> > 2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /11.11.11.11:50010 for file
> > /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block
> > -7375581532956939954:java.io.EOFException
>
> Check the datanode log on 11.11.11.11 from around this time (nice IP by the
> way). You could also grep 7375581532956939954 in the namenode logs. It can
> be revealing.
>
> Thanks,
> St.Ack
>
> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
> >
> > Thanks.
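
PS: for anyone searching the archives later, the greps St.Ack suggests look
roughly like this (log locations are from our layout and will differ per
install):

    # on the namenode, trace the block's history
    grep 7375581532956939954 /var/log/hadoop/*namenode*.log*

    # on 11.11.11.11, around 2010-09-16 20:37
    grep -B2 -A2 7375581532956939954 /var/log/hadoop/*datanode*.log*
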
