On Fri, Sep 17, 2010 at 12:58 PM, Stack <[email protected]> wrote:
> On Fri, Sep 17, 2010 at 9:14 AM, Scott Whitecross <[email protected]> wrote:
> > Hi all -
> >
> > A couple of nights ago I enabled cron jobs to run major compactions against
> > a few of the tables that I use in HBase. This has caused multiple worker
> > machines on the cluster to fail. Either from the compaction or from losing
> > the worker nodes, many of the regions are stuck in transition with a state
> > of PENDING_CLOSE. I believe restarting the HBase master will solve that,
> > which I will do after a few of the current processes finish. What is the
> > risk of losing the regions stuck in transition? (Running HBase 0.20.5)
>
> Restarting the master should address this.
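
For the archives: bouncing just the master on our install is roughly the
following (the path assumes the same /usr/lib/hbase-0.20 layout as the cron
entry further down; adjust for your own install):

    # stop and start only the master daemon, leaving regionservers alone
    /usr/lib/hbase-0.20/bin/hbase-daemon.sh stop master
    /usr/lib/hbase-0.20/bin/hbase-daemon.sh start master
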
> If you want to figure out how we got into this state, grab the region name
> that shows as PENDING_CLOSE. Grep for it in the master log. Find which host
> it's on. Go to that host. Check what it's up to. Is it one of the hosts that
> had trouble talking to HDFS? Or was it one of the regionservers that shut
> down?
>
> My guess is that the regionserver crashed while the CLOSE was in flight.

Thanks. Restarting the master and loading back in the missing files seemed to
solve the immediate HBase problems.

> > I am concerned about not being able to successfully run compactions on our
> > cluster. It was my understanding that major compactions happened
> > automatically around every 24 hours, so I'm surprised that forcing the
> > process to happen caused issues.
>
> Can you check your logs to see if they were actually running? Maybe they
> weren't. See HBASE-2990.

I haven't had a chance yet to search the logs. Will the master log note that a
major compaction is starting?

> > Any suggestions on how to start debugging the issue, or what settings to
> > look at? Starting to dig through the logs shows that HBase couldn't access
> > HDFS on the same box. (Log below.)
>
> It'd be good to see more log from around the shutdown of a regionserver.
>
> You upped the ulimit and the xceiver count?

How high can these counts go? Is it possible to determine the ideal xceiver
count for different-sized clusters?

> You think the load from the mass compaction overloaded HDFS?
>
> How big is your table?

I think we may have had issues running another intensive job at the same time.
I did up the xceiver count as well, which seems to have solved some of the
issues.
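
In case it helps anyone else reading the thread later, the two settings
usually bumped for this look roughly as below. The values are illustrative,
not recommendations; dfs.datanode.max.xcievers (note Hadoop's own misspelling
of the property name) goes in hdfs-site.xml on every datanode, and the
open-file limit goes in /etc/security/limits.conf for the user running the
hadoop/hbase daemons. The datanodes need a restart, and the ulimit change only
applies to new login sessions.

    <!-- hdfs-site.xml on each datanode -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

    # /etc/security/limits.conf (user name is whatever runs the daemons)
    hadoop  -  nofile  32768
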
> > Currently running a cluster with 40 workers, a dedicated jobtracker box,
> > and a namenode/HBase master.
> >
> > The cron call that caused the issue:
> >
> > 0 2 * * * echo "major_compact 'hbase_table'" | /usr/lib/hbase-0.20/bin/hbase shell >> /tmp/hbase_table 2>&1
>
> This looks good. If the table is large and it hasn't compacted in a while,
> you are going to put a big load on your HDFS.

Are there estimates for how long compactions should take?

> > 2010-09-16 20:37:12,917 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction started. Attempting to free 125488064 bytes
> > 2010-09-16 20:37:12,952 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction completed. Freed 115363256 bytes. Priority Sizes:
> > Single=294.45364MB (308757016), Multi=488.6598MB (512396944), Memory=224.37555MB (235274808)
> > 2010-09-16 20:37:29,011 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction started. Attempting to free 125542912 bytes
> > 2010-09-16 20:37:29,040 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Block cache LRU eviction completed. Freed 115365552 bytes. Priority Sizes:
> > Single=333.65866MB (349866464), Multi=449.44424MB (471276440), Memory=224.37555MB (235274808)
> > 2010-09-16 20:37:39,626 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> > Cache Stats: Sizes: Total=951.4796MB (997698720), Free=245.1954MB (257106016),
> > Max=1196.675MB (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883,
> > Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit Ratio=72.07574248313904%,
> > Miss Ratio=27.924257516860962%, Evicted/Run=2569.053955078125
> > 2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /11.11.11.11:50010 for file
> > /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block
> > -7375581532956939954:java.io.EOFException
>
> Check the datanode log on 11.11.11.11 from around this time (nice IP by the
> way). You could also grep 7375581532956939954 in the namenode logs. It can
> be revealing.
>
> Thanks,
> St.Ack
>
> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
> >
> > Thanks.
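
PS: for anyone searching the archives later, the greps St.Ack suggests look
roughly like this (log locations are from our layout and will differ per
install):

    # on the namenode, trace the block's history
    grep 7375581532956939954 /var/log/hadoop/*namenode*.log*

    # on 11.11.11.11, around 2010-09-16 20:37
    grep -B2 -A2 7375581532956939954 /var/log/hadoop/*datanode*.log*
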
