Alright so we think we figured it out:
https://issues.apache.org/jira/browse/HBASE-4462

In short, the SocketTimeoutExceptions are getting retried, which creates
multiple calls to the same scanner and makes things even worse than they
already are.
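
In the meantime, here's a minimal sketch of one way to cope on the client
side (hypothetical code, not from HBase or from this job; the class name,
handle() and the resume logic are placeholders): remember the last row you
finished and reopen the scan just past it, instead of letting the RPC layer
retry next() against a scanner the server may still be busy with:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RestartingScan {
    static void scanAll(HTable table, Scan scan) throws IOException {
        byte[] lastRow = null;
        boolean done = false;
        while (!done) {
            if (lastRow != null) {
                // resume just past the last row we actually processed
                scan.setStartRow(Bytes.add(lastRow, new byte[] { 0 }));
            }
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    handle(r);            // placeholder for the real work
                    lastRow = r.getRow();
                }
                done = true;              // iterated to the end cleanly
            } catch (RuntimeException e) {
                // the scanner iterator wraps IOExceptions (timeouts included);
                // fall through and open a fresh scanner on the next pass
            } finally {
                scanner.close();
            }
        }
    }

    private static void handle(Result r) {
        // placeholder
    }
}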

J-D

On Mon, Sep 19, 2011 at 11:15 AM, Jean-Daniel Cryans
<[email protected]> wrote:
> There's something odd with your jstack, most of the lock ids are
> missing... anyways there's one I can trace:
>
> "IPC Server handler 8 on 60020" daemon prio=10 tid=aaabcc31800
> nid=0x3219 waiting for monitor entry [0x0000000044c8e000]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
>        - waiting to lock <fca7e28> (a
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1823)
>
> "IPC Server handler 6 on 60020" daemon prio=10 tid=aaabc9a5000
> nid=0x3217 runnable [0x0000000044a8c000]
> ...
>        - locked <fca7e28> (a
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
>        - locked <fca7e28> (a
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
>
>
> It clearly shows that two handlers are trying to use the same
> RegionScanner object. It would be nice to have a stack dump with
> correct lock information tho...
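>
> To make that concrete, a tiny hypothetical sketch (not the actual HBase
> source): next() holds the scanner's monitor for the whole call, so a
> second call on the same scanner id, e.g. a retried RPC, blocks on that
> monitor exactly the way handler 8 does above:
>
> // hypothetical illustration of the contention, not HBase code
> class ScannerLike {
>     private long position;
>     synchronized long next() {
>         // the monitor is held for the whole (possibly slow) call, so a
>         // second handler calling next() on this instance blocks right here
>         return ++position;
>     }
> }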
>
> Regarding your code, would it be possible to see the unaltered
> version? Feel free to send it directly to me, and if I do find
> something I'll post the findings back here.
>
> Thanks,
>
> J-D
>
> On Fri, Sep 16, 2011 at 10:58 PM, Douglas Campbell <[email protected]> wrote:
>> Answers below.
>>
>>
>>
>> ________________________________
>> From: Jean-Daniel Cryans <[email protected]>
>> To: [email protected]
>> Sent: Friday, September 16, 2011 2:08 PM
>> Subject: Re: Recovering from SocketTimeout during scan in 90.3
>>
>> On Fri, Sep 16, 2011 at 12:17 PM, Douglas Campbell <[email protected]> 
>> wrote:
>>> The min/max keys are for each region right? Are they pretty big?
>>>
>>> doug: Typically around 100 keys and each key is 24 bytes.
>>
>> A typical region would be like - stores=4, storefiles=4, 
>> storefileSizeMB=1005, memstoreSizeMB=46, storefileIndexSizeMB=6
>>
>> Sorry, I meant to ask how big the regions were, not the rows.
>>
>>> Are you sharing scanners between multiple threads?
>>>
>>> doug: no - but each Result from the scan is passed to a thread to merge 
>>> with input and write back.
>>
>> Yeah, this really isn't what I'm reading tho... Would it be possible
>> to see a full stack trace that contains those BLOCKED threads? (please
>> put it in a pastebin)
>>
>> http://kpaste.net/02f67d
>>
>>>> I had one or more runs where this error occurred and I wasn't taking care
>>>> to call scanner.close().
>>
>> The other thing I was thinking, did you already implement the re-init
>> of the scanner? If so, what's the code like?
>>
>>>>> The code traps runtime exceptions around the scanner iterator (pseudoish):
>> while (toprocess.size() > 0 && !donescanning) {
>>     ResultScanner scanner = table.getScanner(buildScan(toprocess));
>>     try {
>>         for (Result r : scanner) {
>>             toprocess.remove(r.getRow());
>>             // fork thread with r
>>             if (toprocess.size() == 0) donescanning = true;
>>         }
>>     } catch (RuntimeException e) {
>>         if (e.getCause() instanceof IOException) { // probably hbase ex
>>             // let the while loop re-open a scanner over what's left
>>         } else {
>>             donescanning = true;
>>         }
>>     } finally {
>>         scanner.close();
>>     }
>> }
>>
>>>>> buildScan takes the keys and crams them in the filter.
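>>
>> For reference, a hypothetical sketch of that kind of buildScan (not the
>> exact code from this job; assumes one RowFilter per outstanding key,
>> OR'd together in a FilterList):
>>
>> import java.util.Collection;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.filter.BinaryComparator;
>> import org.apache.hadoop.hbase.filter.CompareFilter;
>> import org.apache.hadoop.hbase.filter.FilterList;
>> import org.apache.hadoop.hbase.filter.RowFilter;
>>
>> Scan buildScan(Collection<byte[]> toprocess) {
>>     // match any of the remaining keys in a single scan
>>     FilterList keys = new FilterList(FilterList.Operator.MUST_PASS_ONE);
>>     for (byte[] key : toprocess) {
>>         keys.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
>>                                      new BinaryComparator(key)));
>>     }
>>     Scan scan = new Scan();
>>     scan.setFilter(keys);
>>     return scan;
>> }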
>>
>> Thx,
>>
>> J-D
>
