JIRA is down so I cannot see Igal's patch.

Normally we start with a patch for trunk. After review passes, the backport is
done.

Cheers

On Tue, May 22, 2012 at 1:48 PM, Ted Yu <[email protected]> wrote:

> Makes sense.
> Do you mind opening a JIRA for adding a debug log?
>
>
> On Tue, May 22, 2012 at 1:42 PM, Igal Shilman <[email protected]> wrote:
>
>> Hi Ted,
>>
>> Thank you for your reply. I've followed your advice and added a log
>> message in the catch block.
>> I've been trying to reproduce the problem (tried running sparse scans, a
>> long job, etc.), and it hasn't happened yet.
>>
>> I think that adding a log message there (even at debug level) might be
>> useful in other scenarios as well, since some code paths might silently
>> drop previous exceptions (some paths in translateException result in an
>> exception being thrown).
>>
>> Thanks,
>> Igal.
>>
>> On Mon, May 21, 2012 at 7:46 PM, Ted Yu <[email protected]> wrote:
>>
>> > Thanks for the analysis.
>> >
>> > It shouldn't be difficult to verify your hypothesis.
>> > In the following code:
>> >
>> >         } catch (Throwable t) {
>> >           t = translateException(t);
>> >           exceptions.add(t);
>> > You can add a log to show the type of t along with information about
>> > callable.
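
A minimal, self-contained sketch of the suggested debug log. The translateException stand-in and the callable description string here are hypothetical illustrations, not HBase's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class RetryLogSketch {
    // Hypothetical stand-in for translateException: unwrap to the root
    // cause if one exists, otherwise return the throwable unchanged.
    static Throwable translateException(Throwable t) {
        return (t.getCause() != null) ? t.getCause() : t;
    }

    public static void main(String[] args) {
        List<Throwable> exceptions = new ArrayList<>();
        // Illustrative description of the callable being retried.
        String callable = "ScannerCallable{scannerId=69901879722105864}";
        try {
            throw new RuntimeException("rpc wrapper",
                    new java.net.SocketTimeoutException("read timed out"));
        } catch (Throwable t) {
            t = translateException(t);
            // The suggested debug log: record the translated exception type
            // and the callable before the exception is quietly added to the
            // retry list.
            System.out.println("DEBUG attempt failed with "
                    + t.getClass().getName() + " for " + callable);
            exceptions.add(t);
        }
    }
}
```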
>> >
>> > When LeaseException happens again, it would be easier to correlate logs.
>> >
>> > Cheers
>> >
>> > On Mon, May 21, 2012 at 7:34 AM, Igal Shilman <[email protected]> wrote:
>> >
>> > > Hi,
>> > >
>> > > We've noticed in our production cluster (0.90.4-cdh3u3) that from time
>> > > to time some of our map tasks fail due to a LeaseException thrown while
>> > > scanning.
>> > >
>> > > We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both
>> > > set to 5 minutes.
>> > >
>> > > What's strange about this is the sequence of events that causes the
>> > > maps to fail:
>> > > (Relevant log parts are here: http://pastebin.com/d1yckmz6)
>> > >
>> > > (a) A client calls next(69901879722105864, 100).
>> > > (b) HRegionServer:next tries to call removeLease(69901879722105864) and
>> > > a LeaseException is thrown ("lease 69901879722105864 does not exists.").
>> > > (c) A few milliseconds later the mapper logs the same error and
>> > > terminates immediately.
>> > > (d) A minute later we see that RegionServer$Responder.doRespond fails
>> > > because the stream is closed (our client died a minute ago).
>> > > (e) Five minutes later (= our lease period) the RegionServer's log
>> > > shows: Scanner 69901879722105864 lease expired.
>> > >
>> > > Now that seems pretty odd, especially since (b) happened 5 minutes
>> > > before (e).
>> > >
>> > > This might be possible, IMHO, in the following scenario:
>> > > 1. A ScannerCallable wishing to call next(69901879722105864, 100) is
>> > > passed to getRegionServerWithRetries.
>> > >
>> > > 2. The RS accepts it, enters next(69901879722105864, 100), and removes
>> > > the lease associated with "69901879722105864".
>> > >
>> > > 3. Meanwhile, getRegionServerWithRetries catches an exception that is
>> > > not of type DoNotRetryIOException (perhaps a socket timeout?) while
>> > > waiting for this callable to complete.
>> > >
>> > > getRegionServerWithRetries just silently adds this to a list of
>> > > exceptions.
>> > >
>> > > 4. Then a retry causes (b), and then a rethrow of a LeaseException
>> > > (masking any previous exceptions that were accumulated in step 3).
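
The masking in steps 3 and 4 above can be sketched with a toy retry loop. The method name and behaviour here are a simplified assumption for illustration, not HBase's actual getRegionServerWithRetries:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class MaskingSketch {
    // Toy retry loop mirroring the masking behaviour described above: every
    // failure is appended to `exceptions`, but only the exception from the
    // final attempt propagates to the caller, hiding the earlier causes.
    static <T> T callWithRetries(Callable<T> callable, int retries,
                                 List<Throwable> exceptions) throws Throwable {
        for (int attempt = 0; attempt < retries; attempt++) {
            try {
                return callable.call();
            } catch (Throwable t) {
                exceptions.add(t); // silently accumulated, never logged
            }
        }
        // Only the last exception surfaces; earlier ones are masked.
        throw exceptions.get(exceptions.size() - 1);
    }

    public static void main(String[] args) {
        List<Throwable> exceptions = new ArrayList<>();
        int[] attempt = {0};
        try {
            callWithRetries(() -> {
                attempt[0]++;
                // First attempt: a retriable timeout (step 3); second
                // attempt: the LeaseException analogue (step 4).
                if (attempt[0] == 1) {
                    throw new java.net.SocketTimeoutException("read timed out");
                }
                throw new IllegalStateException("lease does not exist");
            }, 2, exceptions);
        } catch (Throwable t) {
            System.out.println("surfaced: " + t.getMessage());
            System.out.println("masked: " + (exceptions.size() - 1));
        }
    }
}
```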
>> > >
>> > > Does this scenario seem possible to anyone?
>> > >
>> > > Thanks,
>> > > Igal.
>> > >
>> >
>>
>
>
