On Thu, Sep 23, 2010 at 10:10 AM, Scott Whitecross <[email protected]> wrote:
> On Fri, Sep 17, 2010 at 12:58 PM, Stack <[email protected]> wrote:
>> On Fri, Sep 17, 2010 at 9:14 AM, Scott Whitecross <[email protected]> wrote:
>> > Hi all -
>> >
>> > A couple of nights ago I enabled cron jobs to run major compactions against
>> > a few of the tables that I use in HBase. This has caused multiple worker
>> > machines on the cluster to fail. Either because of the compaction or because
>> > of losing the worker nodes, many of the regions are stuck in transition with
>> > a state of PENDING_CLOSE. I believe restarting the HBase master will solve
>> > that, which I will do after a few of the current processes finish. What is
>> > the risk of losing the regions stuck in transition? (Running HBase 0.20.5)
>>
>> Restarting the master should address this.
>>
>> If you want to figure out how we got into this state, grab the region name
>> that shows as PENDING_CLOSE. Grep for it in the master log. Find which host
>> it's on. Go to that host. Check what it's up to. Is it one of the hosts
>> that had trouble talking to HDFS? Or was it one of the regionservers that
>> shut down?
>>
>> My guess is that the regionserver crashed while the CLOSE was in flight.
>
> Thanks. Restarting the master and loading back in the missing files seemed
> to solve the immediate HBase problems.
>
>> > I am concerned about not being able to successfully run compactions on our
>> > cluster. It was my understanding that major compactions happen automatically
>> > roughly every 24 hours, so I'm surprised that forcing the process caused
>> > issues.
>>
>> Can you check your logs to see if they were actually running? Maybe
>> they weren't. See HBASE-2990.
>
> I haven't had a chance yet to search the logs. Will the master note that a
> major compaction is starting?
>
>> > Any suggestions on how to start debugging the issue, or what settings to
>> > look at? Starting to dig through the logs shows that HBase couldn't access
>> > HDFS on the same box. (Log below.)
>>
>> It'd be good to see more log from around the shutdown of a regionserver.
>>
>> You upped the ulimit and xceiver count?
>
> How high can these counts go? Is it possible to determine the ideal xceiver
> count for different-sized clusters?
>
> Right now the ulimit is at 100000 and xceivers are at 4096, though we may
> raise these as we're seeing errors with some jobs.
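
For what it's worth, a quick way to sanity-check those two settings on a
datanode/regionserver box is below. The "hadoop" user and the config path are
assumptions for a stock 0.20-era layout; adjust to wherever your daemons and
configs actually live:

  # open-file limit as seen by the user that runs the datanode/regionserver
  su - hadoop -c 'ulimit -n'

  # xceiver ceiling configured on the datanode (the property name really is
  # spelled "xcievers" in Hadoop)
  grep -A1 'dfs.datanode.max.xcievers' /etc/hadoop/conf/hdfs-site.xml

If the property isn't set at all, the datanode falls back to a much lower
built-in default (256 on the 0.20 line, if I remember right), which is usually
when the xceiver-limit errors start showing up under compaction load.
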
>> You think the loading from the mass compaction overloaded HDFS?
>>
>> How big is your table?
>
> I think we may have had issues running another intensive job at the same
> time. I did up the xceiver count as well, which seems to have solved some
> of the issues.
>
>> > Currently running a cluster with 40 workers, a dedicated jobtracker box,
>> > and a combined namenode/HBase master.
>> >
>> > The cron call that caused the issue:
>> > 0 2 * * * echo "major_compact 'hbase_table'" | /usr/lib/hbase-0.20/bin/hbase shell > /tmp/hbase_table 2>&1
>>
>> This looks good. If the table is large and it hasn't been compacted in a
>> while, you are going to put a big load on your HDFS.
>
> Are there estimates for how long compactions should take?
>
>> > 2010-09-16 20:37:12,917 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125488064 bytes
>> > 2010-09-16 20:37:12,952 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115363256 bytes. Priority Sizes: Single=294.45364MB (308757016), Multi=488.6598MB (512396944), Memory=224.37555MB (235274808)
>> > 2010-09-16 20:37:29,011 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 125542912 bytes
>> > 2010-09-16 20:37:29,040 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 115365552 bytes. Priority Sizes: Single=333.65866MB (349866464), Multi=449.44424MB (471276440), Memory=224.37555MB (235274808)
>> > 2010-09-16 20:37:39,626 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=951.4796MB (997698720), Free=245.1954MB (257106016), Max=1196.675MB (1254804736), Counts: Blocks=38388, Access=5559267, Hit=4006883, Miss=1552384, Evictions=260, Evicted=667954, Ratios: Hit Ratio=72.07574248313904%, Miss Ratio=27.924257516860962%, Evicted/Run=2569.053955078125
>> > 2010-09-16 20:37:59,037 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /11.11.11.11:50010 for file /hbase/my_hbase_table/1606901662/my_family/3686634885331153450 for block -7375581532956939954:java.io.EOFException
>>
>> Check the datanode log on 11.11.11.11 from around this time (nice IP, by
>> the way). You could also grep for 7375581532956939954 in the namenode
>> logs. It can be revealing.
>>
>> Thanks,
>> St.Ack
>>
>> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
>> >         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1373)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1830)
>> >
>> > Thanks.
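
In case it's useful, here is roughly how I'd chase the block and the stuck
region from the shell. The log locations are assumptions for a CDH-style
layout; point them at wherever your daemons actually log:

  # what the namenode knew about the block the client failed to read
  grep 7375581532956939954 /var/log/hadoop/*namenode*.log*

  # what the datanode on 11.11.11.11 was doing around 20:37
  ssh 11.11.11.11 "grep 7375581532956939954 /var/log/hadoop/*datanode*.log*"

  # which regions the master still has in transition, and on which hosts
  grep PENDING_CLOSE /var/log/hbase/*master*.log* | tail -n 20

The first two usually tell you whether the datanode dropped the connection
(ulimit/xceivers) or the block really went missing; the last one gives you the
region names to grep for next.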

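On whether the automatic major compactions were actually happening (the
HBASE-2990 question) and how long they take: compactions run on the
regionservers, not the master, so their logs are the place to look. A rough
sketch below; the log path is an assumption, the exact log wording varies a
bit between 0.20.x releases, and 'other_table' is just a made-up second table
name to illustrate staggering the cron entries so the whole cluster isn't
compacting at 2 AM:

  # did major compactions run, and how long did they take?
  grep -i "major compaction" /var/log/hbase/*regionserver*.log* | tail -n 20

  # stagger the forced compactions a couple of hours apart per table
  0 2 * * * echo "major_compact 'hbase_table'" | /usr/lib/hbase-0.20/bin/hbase shell > /tmp/hbase_table 2>&1
  0 4 * * * echo "major_compact 'other_table'" | /usr/lib/hbase-0.20/bin/hbase shell > /tmp/other_table 2>&1

Spreading them out at least bounds how much rewrite load hits HDFS at any one
time, which seems to be what was overloading the datanodes here.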