Major Compaction Causes Cluster Failure

Scott Whitecross Fri, 17 Sep 2010 09:14:37 -0700

Hi all -

A couple of nights ago I enabled cron jobs to run major compactions against
a few of the tables that I use in HBase.  This has caused multiple worker
machines on the cluster to fail.  Whether because of the compaction or the
loss of the worker nodes, many of the regions are stuck in transition with a
state of PENDING_CLOSE.  I believe resetting the HBase master will solve that,
which I will do after a few of the current processes finish.  What is the risk
of losing the regions stuck in transition?  (Running HBase 0.20.5)


I am concerned about not being able to successfully run compactions on our
cluster.  It was my understanding that major compactions happened
automatically around every 24 hours, so I'm surprised forcing the process to
happen caused issues.  Any suggestions on how to start debugging the issue,
or what settings to look at?  An initial look through the logs shows that
HBase couldn't access HDFS on the same box.  (Log below.)

Currently running a cluster with 40 workers, a dedicated jobtracker box, and
a namenode/HBase master box.

The cron call that caused the issue:
0 2 * * * echo "major_compact 'hbase_table' " | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/hbase_table 2>&1

2010-09-16 20:37:12,917 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started.  Attempting to free 125488064 bytes
2010-09-16 20:37:12,952 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115363256 bytes.  Priority Sizes: Single=294.45364MB (308757016), Multi=488.6598MB (512396944),Memory=224.37555MB (235274808)
2010-09-16 20:37:29,011 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started.  Attempting to free 125542912 bytes
2010-09-16 20:37:29,040 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115365552 bytes.  Priority Sizes: Single=333.65866MB (349866464), Multi=449.44424MB (471276440),Memory=224.37555MB (235274808)
2010-09-16 20:37:39,626 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=951.4796MB (997698720), Free=245.1954MB (257106016), Max=1196.675MB (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883, Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit Ratio=72.07574248313904%, Miss Ratio=27.924257516860962%, Evicted/Run=2569.053955078125
2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /11.11.11.11:50010 for file /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block -7375581532956939954:java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
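
As a possible first debugging step, a rough sketch like the one below could tally the DFSClient "Failed to connect" warnings per datanode address across the region server logs, to see whether the failures cluster on a few boxes. It assumes each log entry sits on a single line, as in the region server's own log file, and that the message format matches the warning shown here. The function name count_failed_datanodes is hypothetical and not part of HBase or Hadoop.

```python
import re
from collections import Counter

# Pattern based on the warning in the log excerpt, e.g.
# "WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /11.11.11.11:50010 for file ..."
# Assumption: the message format is the same throughout the log.
FAILED_CONNECT = re.compile(
    r"WARN org\.apache\.hadoop\.hdfs\.DFSClient: Failed to connect to /(\S+?:\d+)"
)


def count_failed_datanodes(lines):
    """Return a Counter mapping datanode 'host:port' to failed-connect count.

    `lines` is any iterable of log lines (an open file works).
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        match = FAILED_CONNECT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_failed_datanodes and printing the result of most_common() would list the datanodes with the most failed connections first.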


Thanks.
