Alright so we think we figured it out: https://issues.apache.org/jira/browse/HBASE-4462
In short, the SocketTimeoutExceptions are getting retried, which creates multiple calls to the same scanner, making things even worse than they already are.

J-D

On Mon, Sep 19, 2011 at 11:15 AM, Jean-Daniel Cryans <[email protected]> wrote:
> There's something odd with your jstack, most of the lock ids are
> missing... anyways there's one I can trace:
>
> "IPC Server handler 8 on 60020" daemon prio=10 tid=aaabcc31800
> nid=0x3219 waiting for monitor entry [0x0000000044c8e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
>         - waiting to lock <fca7e28> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1823)
>
> "IPC Server handler 6 on 60020" daemon prio=10 tid=aaabc9a5000
> nid=0x3217 runnable [0x0000000044a8c000]
> ...
>         - locked <fca7e28> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
>         - locked <fca7e28> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
>
> It clearly shows that two handlers are trying to use the same
> RegionScanner object. It would be nice to have a stack dump with
> correct lock information tho...
>
> Regarding your code, would it be possible to see the unaltered
> version? Feel free to send it directly to me, and if I do find
> something I'll post the findings back here.
>
> Thanks,
>
> J-D
>
> On Fri, Sep 16, 2011 at 10:58 PM, Douglas Campbell <[email protected]> wrote:
>> Answers below.
>>
>> ________________________________
>> From: Jean-Daniel Cryans <[email protected]>
>> To: [email protected]
>> Sent: Friday, September 16, 2011 2:08 PM
>> Subject: Re: REcovering from SocketTimeout during scan in 90.3
>>
>> On Fri, Sep 16, 2011 at 12:17 PM, Douglas Campbell <[email protected]>
>> wrote:
>>> The min/max keys are for each region right? Are they pretty big?
>>>
>>> doug: Typically around 100 keys, and each key is 24 bytes.
>>
>> A typical region would be like - stores=4, storefiles=4,
>> storefileSizeMB=1005, memstoreSizeMB=46, storefileIndexSizeMB=6
>>
>> Sorry, I meant to ask how big the regions were, not the rows.
>>
>>> Are you sharing scanners between multiple threads?
>>>
>>> doug: no - but each Result from the scan is passed to a thread to merge
>>> with input and write back.
>>
>> Yeah, this really isn't what I'm reading tho... Would it be possible
>> to see a full stack trace that contains those BLOCKED threads? (please
>> put it in a pastebin)
>>
>> http://kpaste.net/02f67d
>>
>>>> I had one or more runs where this error occurred and I wasn't taking care
>>>> to call scanner.close()
>>
>> The other thing I was thinking, did you already implement the re-init
>> of the scanner? If so, what's the code like?
>>
>>>>> The code traps runtime exceptions around the scanner iterator (pseudoish):
>>
>> while (toprocess.size() > 0 && !donescanning) {
>>   Scanner scanner = table.getScanner(buildScan(toprocess));
>>   try {
>>     for (Result r : scanner) {
>>       toprocess.remove(r.getRow());
>>       // fork thread with r
>>       if (toprocess.size() == 0) donescanning = true;
>>     }
>>   } catch (RuntimeException e) {
>>     scanner.close();
>>     if (e.getCause() instanceof IOException) { // probably hbase ex
>>       scanner = getScanner(buildScan(toprocess));
>>     } else {
>>       donescanning = true;
>>     }
>>   }
>> }
>>
>>>>> buildScan takes the keys and crams them in the filter.
>>
>> Thx,
>>
>> J-D
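
For reference, below is a minimal, self-contained sketch of the re-init loop described above, written against the 0.90-era client API (HTable, Scan, ResultScanner, ScannerTimeoutException). The class name, processRemaining(), and the buildScan() stub are illustrative stand-ins, not the poster's actual code; the point is the shape of the loop: the scanner is created, consumed, and closed by a single thread, close() runs in a finally block so the server-side scanner is always released, and a scanner timeout just causes a fresh scanner to be opened for the remaining keys. It does not address the retry behaviour tracked in HBASE-4462 itself.

// A sketch of the scanner re-init loop discussed above, against the 0.90-era
// client API. RetryingScanLoop, processRemaining(), and the buildScan() stub
// are made-up names; buildScan() stands in for the original (unshown) method
// that crams the remaining keys into a filter.
import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.ScannerTimeoutException;

public class RetryingScanLoop {

  // Placeholder for the poster's buildScan(): build a Scan whose filter
  // matches only the keys that still need processing.
  static Scan buildScan(Set<byte[]> remaining) {
    Scan scan = new Scan();
    // ... set a filter over the remaining keys here ...
    return scan;
  }

  // toProcess must compare byte[] by value, e.g. a
  // TreeSet<byte[]>(Bytes.BYTES_COMPARATOR), so remove(r.getRow()) works.
  static void processRemaining(HTable table, Set<byte[]> toProcess) throws IOException {
    boolean doneScanning = false;
    while (!toProcess.isEmpty() && !doneScanning) {
      ResultScanner scanner = table.getScanner(buildScan(toProcess));
      try {
        Result r;
        while ((r = scanner.next()) != null) {
          toProcess.remove(r.getRow());
          // hand r off to a worker thread here; the scanner itself is only
          // ever touched by this thread
          if (toProcess.isEmpty()) {
            break;
          }
        }
        doneScanning = true; // the scan completed without an exception
      } catch (ScannerTimeoutException e) {
        // the scanner lease expired on the region server; fall through and
        // open a fresh scanner for whatever keys are left
      } finally {
        scanner.close(); // always release the server-side scanner
      }
    }
  }
}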
