There was a 47-second gap in the region server log (where the calls to subList() might have happened):
1. 2014-05-29 19:09:02,257 INFO org.apache.hadoop.hbase.regionserver.compactions.CompactSelection: Deleting the expired store file by compaction: hdfs://cluster/hbase/IngestProcessing/bf754ed8764ca705a2acc0058e13b69c/data/22b41ad9388f488cb672cca3de0614e9 whose maxTimeStamp is -1 while the max expired timestamp is 1401318542257
2. 2014-05-29 19:09:49,324 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -6708632874853984071 lease expired on region WorldcatCrossref,4333961705,1334582131683.90c82e6c71dd99f21a18df41df28e5b0.

Rather than assigning the result of subList() back to the same member variable, it would be better practice to clear the sublist that is no longer needed.

Cheers


On Fri, May 30, 2014 at 9:52 AM, Andrew Purtell <[email protected]> wrote:

> Maybe we can kill the zookeeper connection in the abort handler.
>
>
> On Fri, May 30, 2014 at 9:38 AM, Buckley,Ron <[email protected]> wrote:
>
> > Thanks Ted. I should have seen that.
> >
> > I finally had to 'kill -9' the rs, as I couldn't get it to shut down any
> > other way.
> >
> > It seems like the Region Server shouldn't have kept telling ZooKeeper that
> > all was well, even though it was trying to abort with a fatal error.
> >
> >
> > -----Original Message-----
> > From: Ted Yu [mailto:[email protected]]
> > Sent: Friday, May 30, 2014 12:11 PM
> > To: [email protected]
> > Subject: Re: Region Server hung during shutdown after StackOverflow error
> >
> > Looking at the StackOverflowError in pastebin, the cause was too many
> > calls to subList().
> > J-D fixed one similar bug in HBASE-10312
> >
> > I searched for '\.subList(' in 0.94 codebase but haven't pinpointed which
> > class was the source of such calls.
> >
> > Will dig deeper when I have time.
> >
> > Cheers
> >
> >
> > On Fri, May 30, 2014 at 8:24 AM, Buckley,Ron <[email protected]> wrote:
> >
> > > Interesting case happened on our dev HBase cluster overnight. (We're
> > > running HBase 0.94.15 from CDH 4.6.0)
> > >
> > > A region server took a StackOverflow error, it looks like during
> > > a minor compaction.
> > >
> > > The region server is trying to shut down with a Fatal, but is now hung
> > > during shutdown.
> > >
> > > The particularly troublesome thing is that the RS is alive enough to
> > > keep zookeeper happy.
> > >
> > > So, the regions aren't moving off, but our apps can't get to them
> > > because the RS is mostly dead.
> > >
> > > I put some of the details on pastebin.
> > >
> > > JStack -> http://pastebin.com/hnLtaG54
> > > Outfile -> http://pastebin.com/5F1UcGjg
> > > Logfile -> http://pastebin.com/TBL1YSZM
> > >
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
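
To illustrate the subList() suggestion at the top of this message, here is a minimal sketch, assuming the member variable is a plain java.util.ArrayList. The class and method names are hypothetical and are not taken from the HBase codebase. subList() returns a view backed by the original list, so assigning that view back to the same variable wraps the previous view in another one on every pass, and the chain of views keeps growing. Clearing the unwanted range through the view shrinks the original list in place and leaves no chain behind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of HBase.
final class SubListTrim {

    // Risky pattern: the caller typically writes
    //   files = dropPrefixByReassign(files, n);
    // Each call returns a view that wraps the previous list (or view),
    // so repeated reassignment builds an ever deeper chain of views.
    static <T> List<T> dropPrefixByReassign(List<T> files, int count) {
        return files.subList(count, files.size());
    }

    // Suggested pattern: take a view of the part that is NOT needed and
    // clear it. This removes the elements from the backing list itself,
    // and the temporary view is discarded immediately.
    static <T> void dropPrefixInPlace(List<T> files, int count) {
        files.subList(0, count).clear();
    }

    public static void main(String[] args) {
        List<Integer> files = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            files.add(i);
        }

        // Drop the first element twice; 'files' stays a plain ArrayList.
        dropPrefixInPlace(files, 1);
        dropPrefixInPlace(files, 1);

        System.out.println(files); // prints [2, 3, 4]
    }
}
```

Whether this is the exact pattern behind the StackOverflowError in the pastebin is not confirmed here; the sketch only shows the difference between the two idioms.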
