Sounds like there's an underlying HDFS issue; check those machines' datanode logs around the time of the failure for any exceptions.
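
Something along these lines should turn up the relevant errors (the log location is a guess, adjust it for your install):

  # on the datanode that refused the connection (11.11.11.11 in your log excerpt):
  grep -B 2 -A 15 -i "exception" /var/log/hadoop/*datanode*.log*

  # from any node, check the blocks of the file the region server couldn't read:
  hadoop fsck /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 -files -blocks -locations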
J-D

On Fri, Sep 17, 2010 at 9:14 AM, Scott Whitecross <[email protected]> wrote:
> Hi all -
>
> A couple of nights ago I enabled cron jobs to run major compactions against a few of the tables that I use in HBase. This has caused multiple worker machines on the cluster to fail. Because of the compaction or the lost worker nodes, many of the regions are stuck in transition with a state of PENDING_CLOSE. I believe resetting the HBase master will solve that, which I will do after a few of the current processes finish. What is the risk of losing the regions stuck in transition? (Running HBase 0.20.5)
>
> I am concerned about not being able to successfully run compactions on our cluster. It was my understanding that major compactions happen automatically around every 24 hours, so I'm surprised that forcing the process to happen caused issues. Any suggestions on how to start debugging the issue, or what settings to look at? Starting to dig through the logs shows that HBase couldn't access HDFS on the same box. (Log below)
>
> Currently running a cluster with 40 workers, a dedicated jobtracker box, and a namenode/HBase master.
>
> The cron call that caused the issue:
>
> 0 2 * * * echo "major_compact 'hbase_table'" | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/hbase_table 2>&1
>
> 2010-09-16 20:37:12,917 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125488064 bytes
> 2010-09-16 20:37:12,952 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115363256 bytes. Priority Sizes: Single=294.45364MB (308757016), Multi=488.6598MB (512396944), Memory=224.37555MB (235274808)
> 2010-09-16 20:37:29,011 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125542912 bytes
> 2010-09-16 20:37:29,040 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115365552 bytes. Priority Sizes: Single=333.65866MB (349866464), Multi=449.44424MB (471276440), Memory=224.37555MB (235274808)
> 2010-09-16 20:37:39,626 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=951.4796MB (997698720), Free=245.1954MB (257106016), Max=1196.675MB (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883, Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit Ratio=72.07574248313904%, Miss Ratio=27.924257516860962%, Evicted/Run=2569.053955078125
> 2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /11.11.11.11:50010 for file /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block -7375581532956939954:java.io.EOFException
>         at java.io.DataInputStream.readShort(DataInputStream.java:298)
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
>
> Thanks.
