On Fri, Sep 17, 2010 at 9:14 AM, Scott Whitecross <[email protected]> wrote:
> Hi all -
>
> A couple of nights ago I enabled cron jobs to run major compactions against
> a few of the tables that I use in HBase.  This has caused multiple worker
> machines on the cluster to fail.  Based on the compaction or losing the
> worker nodes, many of the regions are stuck in transition with a state of
> PENDING_CLOSE.  I believe resetting the HBase master will solve that, which I
> will do after a few of the current processes finish.  What is the risk of
> losing the regions stuck in transition?  (Running HBase 0.20.5)
>

Restarting the master should address this.

If you want to figure out how we got into this state, grab the region name
that shows as PENDING_CLOSE.  Grep for it in the master log.  Find which host
it's on.  Go to that host.  Check what it's up to.  Is it one of the
hosts that had trouble talking to HDFS?  Or was it one of the regionservers
that shut down?
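For example, something like the below, run on the master and then on the
regionserver it points at, should turn up the stuck region and the last thing
it was doing.  The log locations and region name here are only illustrative;
adjust them to your install:

grep PENDING_CLOSE /var/log/hbase/hbase-hadoop-master-*.log
# take the region name it reports, e.g. 'hbase_table,somerow,1284700000000', then:
grep 'hbase_table,somerow,1284700000000' /var/log/hbase/hbase-hadoop-master-*.log
grep 'hbase_table,somerow,1284700000000' /var/log/hbase/hbase-hadoop-regionserver-*.log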

My guess is that the regionserver crashed while CLOSE was in flight.

> I am concerned about not being able to successfully run compactions on our
> cluster.  It was my understanding that major compactions happened
> automatically around every 24 hours, so I'm surprised forcing the process to
> happen caused issues.

Can you check your logs to see if they were actually running?  Maybe
they weren't.  See HBASE-2990.
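A loose grep of the regionserver logs should tell you (log location assumed;
the exact wording of the compaction messages varies, so keep the pattern broad):

grep -i 'major compaction' /var/log/hbase/hbase-hadoop-regionserver-*.log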

> Any suggestions on how to start debugging the issue,
> or what settings to look at?  Starting to dig through logs shows that HBase
> couldn't access HDFS on the same box. (Log Below)
>

It'd be good to see more of the log from around the shutdown of a regionserver.

You upped ulimit and xceiver count?
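If not, the usual recommendations are a file-descriptor limit well above the
1024 default for the user running HBase/HDFS, and a raised
dfs.datanode.max.xcievers on the datanodes.  The values below are common
starting points, not numbers tuned for your cluster:

# as the user running hbase/hadoop; 1024 is the OS default and is too low
ulimit -n

# hdfs-site.xml on each datanode (the property name really is misspelled)
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>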

You think the loading from mass compaction overloaded HDFS?

How big is your table?

> Currently running a cluster with 40 workers, a dedicated jobtracker box, and
> namenode/hbase master.
>
> The cron call that caused the issue:
> 0 2 * * * echo "major_compact 'hbase_table' " |
> /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/hbase_table 2>&1
>

This looks good.  If the table is large and hasn't been compacted in a while,
you are going to put a big load on your HDFS.
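If you keep the crons, you might stagger them per table so everything isn't
compacting at once, e.g. (table names made up):

0 2 * * * echo "major_compact 'table_a'" | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/table_a 2>&1
0 4 * * * echo "major_compact 'table_b'" | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/table_b 2>&1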

> 2010-09-16 20:37:12,917 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started.  Attempting to free 125488064 bytes
> 2010-09-16 20:37:12,952 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed. Freed 115363256 bytes.  Priority Sizes: Single=294.45364MB
> (308757016), Multi=488.6598MB (512396944),Memory=224.37555MB (235274808)
> 2010-09-16 20:37:29,011 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started.  Attempting to free 125542912 bytes
> 2010-09-16 20:37:29,040 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed. Freed 115365552 bytes.  Priority Sizes: Single=333.65866MB
> (349866464), Multi=449.44424MB (471276440),Memory=224.37555MB (235274808)
> 2010-09-16 20:37:39,626 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=951.4796MB (997698720), Free=245.1954MB (257106016), Max=1196.675MB
> (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883,
> Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit
> Ratio=72.07574248313904%, Miss Ratio=27.924257516860962%,
> Evicted/Run=2569.053955078125
> 2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /11.11.11.11 :50010 for file
> /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block
> -7375581532956939954:java.io.EOFException



Check the datanode log on 11.11.11.11 from around this time (nice IP,
by the way).  You could also grep for 7375581532956939954 in the namenode
logs.  It can be revealing.
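Something along these lines, with the log paths assumed:

# on the namenode
grep 7375581532956939954 /var/log/hadoop/hadoop-hadoop-namenode-*.log
# on 11.11.11.11
grep 7375581532956939954 /var/log/hadoop/hadoop-hadoop-datanode-*.log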

Thanks,
St.Ack


> at java.io.DataInputStream.readShort(DataInputStream.java:298)
> at
> org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
>
>
> Thanks.
>
