Hi all - A couple of nights ago I enabled cron jobs to run major compactions against a few of the tables that I use in HBase. This has caused multiple worker machines on the cluster to fail. Whether because of the compaction itself or the loss of the worker nodes, many of the regions are now stuck in transition with a state of PENDING_CLOSE. I believe restarting the HBase master will resolve that, which I will do after a few of the currently running processes finish. What is the risk of losing the regions that are stuck in transition? (Running HBase 0.20.5)
I am concerned about not being able to run compactions successfully on our cluster. It was my understanding that major compactions happen automatically roughly every 24 hours, so I'm surprised that forcing the process caused issues. Any suggestions on how to start debugging the issue, or which settings to look at? Digging through the logs shows that HBase couldn't access HDFS on the same box (log below).

We are currently running a cluster with 40 workers, a dedicated jobtracker box, and a namenode/HBase master box.

The cron call that caused the issue:

0 2 * * * echo "major_compact 'hbase_table' " | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/hbase_table 2>&1

Log:

2010-09-16 20:37:12,917 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125488064 bytes
2010-09-16 20:37:12,952 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115363256 bytes. Priority Sizes: Single=294.45364MB (308757016), Multi=488.6598MB (512396944),Memory=224.37555MB (235274808)
2010-09-16 20:37:29,011 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125542912 bytes
2010-09-16 20:37:29,040 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115365552 bytes. Priority Sizes: Single=333.65866MB (349866464), Multi=449.44424MB (471276440),Memory=224.37555MB (235274808)
2010-09-16 20:37:39,626 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=951.4796MB (997698720), Free=245.1954MB (257106016), Max=1196.675MB (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883, Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit Ratio=72.07574248313904%, Miss Ratio=27.924257516860962%, Evicted/Run=2569.053955078125
2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /11.11.11.11:50010 for file /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block -7375581532956939954:java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)

Thanks.
