Will do.

On May 22, 2012 11:48 PM, "Ted Yu" <[email protected]> wrote:
> Makes sense.
> Do you mind opening a JIRA for adding the debug log?
>
> On Tue, May 22, 2012 at 1:42 PM, Igal Shilman <[email protected]> wrote:
>
> > Hi Ted,
> >
> > Thank you for your reply. I've followed your advice and added a log
> > message in the catch block.
> > I've been trying to reproduce the problem (tried running sparse scans,
> > long jobs, etc.), and it hasn't happened yet.
> >
> > I think that adding a log message there (even at debug level) might be
> > useful in other scenarios as well, since some scenarios might silently
> > drop previous exceptions too (some paths in translateException result
> > in an exception being thrown).
> >
> > Thanks,
> > Igal.
> >
> > On Mon, May 21, 2012 at 7:46 PM, Ted Yu <[email protected]> wrote:
> >
> > > Thanks for the analysis.
> > >
> > > It shouldn't be difficult to verify your hypothesis.
> > > In the following code:
> > >
> > >   } catch (Throwable t) {
> > >     t = translateException(t);
> > >     exceptions.add(t);
> > >
> > > you can add a log to show the type of t along with information about
> > > the callable.
> > >
> > > When the LeaseException happens again, it would be easier to
> > > correlate logs.
> > >
> > > Cheers
> > >
> > > On Mon, May 21, 2012 at 7:34 AM, Igal Shilman <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > We've noticed in our production cluster (0.90.4-cdh3u3) that from
> > > > time to time some of our map tasks fail due to a LeaseException
> > > > thrown while scanning.
> > > >
> > > > We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout"
> > > > both set to 5 minutes.
> > > >
> > > > What's strange about this is the sequence of events that causes the
> > > > maps to fail (relevant log parts are here:
> > > > http://pastebin.com/d1yckmz6):
> > > >
> > > > (a) A client calls next(69901879722105864, 100).
> > > > (b) HRegionServer.next tries to call removeLease(69901879722105864)
> > > >     and a LeaseException is thrown (lease 69901879722105864 does
> > > >     not exist).
> > > > (c) A few milliseconds later the mapper logs the same error and
> > > >     terminates immediately.
> > > > (d) A minute later we see that RegionServer$Responder.doRespond
> > > >     fails because the stream is closed (our client died a minute
> > > >     earlier).
> > > > (e) Five minutes later (= our lease period) the RegionServer's log
> > > >     shows: Scanner 69901879722105864 lease expired.
> > > >
> > > > Now that seems pretty odd, especially since (b) happened 5 minutes
> > > > before (e).
> > > >
> > > > This might be possible, IMHO, in the following scenario:
> > > >
> > > > 1. A ScannerCallable wishing to call next(69901879722105864, 100)
> > > >    is passed to getRegionServerWithRetries.
> > > >
> > > > 2. The RS accepts it, enters next(69901879722105864, 100), and
> > > >    removes the lease associated with 69901879722105864.
> > > >
> > > > 3. Meanwhile, getRegionServerWithRetries catches an exception that
> > > >    is not of type DoNotRetryIOException (perhaps a socket timeout?)
> > > >    while waiting for this callable to complete.
> > > >    getRegionServerWithRetries just silently adds it to a list of
> > > >    exceptions.
> > > >
> > > > 4. Then a retry causes (b), followed by a rethrow of the
> > > >    LeaseException (masking any previous exceptions that were
> > > >    accumulated in step 3).
> > > >
> > > > Does this scenario seem possible to anyone?
> > > >
> > > > Thanks,
> > > > Igal.
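For readers following the retry discussion above, here is a minimal,
self-contained Java sketch of the masking behavior Igal describes in
steps 3 and 4, with the per-attempt debug log Ted suggests. It is an
illustration only, not the actual HBase 0.90 HConnectionManager code:
callWithRetries stands in for getRegionServerWithRetries, the
translateException step is elided, and System.err stands in for the
commons-logging LOG.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;

    public class RetryMaskingSketch {

      // Stand-in for getRegionServerWithRetries: every failed attempt is
      // accumulated in a list, but only the exception from the final
      // attempt propagates to the caller.
      static <T> T callWithRetries(Callable<T> callable, int numRetries)
          throws Exception {
        List<Throwable> exceptions = new ArrayList<Throwable>();
        for (int tries = 0; tries < numRetries; tries++) {
          try {
            return callable.call();
          } catch (Throwable t) {
            // The debug log proposed in the thread: record each attempt's
            // exception so an earlier failure (e.g. a socket timeout) can
            // be correlated with the LeaseException that later masks it.
            System.err.println("DEBUG attempt " + tries + " failed for "
                + callable + ": " + t);
            exceptions.add(t);
            if (tries == numRetries - 1) {
              // Only the last exception is rethrown; the earlier entries
              // in 'exceptions' are never seen by the caller.
              throw new Exception("retries exhausted after "
                  + exceptions.size() + " attempts", t);
            }
          }
        }
        throw new IllegalStateException("unreachable");
      }

      public static void main(String[] args) throws Exception {
        // Attempt 1 times out; attempt 2 fails the way step (b) describes.
        final int[] attempt = { 0 };
        callWithRetries(new Callable<String>() {
          public String call() throws Exception {
            if (attempt[0]++ == 0) {
              throw new java.net.SocketTimeoutException("read timed out");
            }
            throw new RuntimeException(
                "lease 69901879722105864 does not exist");
          }
        }, 2);
      }
    }

Running this prints both attempts at debug level, while the exception
that actually propagates mentions only the lease error, which is exactly
the masking hypothesized in step 4.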
