Yet Another LeaseException :-(

Igal Shilman Mon, 21 May 2012 07:35:39 -0700

Hi,

We've noticed in our production cluster (0.90.4-cdh3u3) that from time to
time some of our map tasks fail due to a LeaseException thrown while
scanning.


We have both "hbase.regionserver.lease.period" and "hbase.rpc.timeout" set
to 5 minutes.

What's strange about this is the sequence of events that causes the maps to
fail:
(Relevant log parts are here: http://pastebin.com/d1yckmz6)

(a) A client calls next(69901879722105864, 100).
(b) HRegionServer:next tries to call removeLease(69901879722105864) and a
LeaseException is thrown (lease 69901879722105864 does not exist).
(c) A few milliseconds later the mapper logs the same error and terminates
immediately.
(d) A minute later we see that RegionServer$Responder.doRespond fails
because the stream is closed (our client died a minute earlier).
(e) Five minutes later (= our lease period) the RegionServer's log shows:
Scanner 69901879722105864 lease expired.

Now that seems pretty odd, especially since (b) happened 5 minutes before (e).

This might be possible, IMHO, in the following scenario:
1. A ScannerCallable wishing to call: next(69901879722105864, 100) is
passed to getRegionServerWithRetries

2. RS accepts it, enters next(69901879722105864, 100), and removes the
lease associated with "69901879722105864".

3. Meanwhile, getRegionServerWithRetries catches an exception that is not of
type DoNotRetryIOException (perhaps a socket timeout?) while waiting for this
callable to complete.

getRegionServerWithRetries just silently adds this to a list of exceptions.

4. Then a retry causes (b), and then a rethrow of a LeaseException (masking
any previous exceptions that were accumulated in (3)).
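
To make steps 1-4 concrete, here is a minimal single-threaded sketch of the suspected interleaving. It is only a toy model of my hypothesis, not HBase code: ToyRegionServer, beginNext, finishNext and the local LeaseException class are all made-up stand-ins, and the SocketTimeoutException is just my guess at what the first attempt might have hit.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LeaseRetrySketch {
    // Stand-in for HBase's LeaseException; not the real class.
    static class LeaseException extends RuntimeException {
        LeaseException(String message) { super(message); }
    }

    // Toy region server: next() is split in two so a call can be paused midway.
    static class ToyRegionServer {
        private final Set<Long> leases = new HashSet<>();

        void openScanner(long scannerId) { leases.add(scannerId); }

        // Step 2: entering next() removes the lease while the batch is fetched.
        void beginNext(long scannerId) {
            if (!leases.remove(scannerId)) {
                throw new LeaseException("lease " + scannerId + " does not exist");
            }
        }

        // The lease is put back only when the original call finishes.
        void finishNext(long scannerId) { leases.add(scannerId); }

        boolean hasLease(long scannerId) { return leases.contains(scannerId); }
    }

    // Replays steps 1-4 and returns what the client ends up seeing.
    static List<String> simulate() {
        long scannerId = 69901879722105864L;
        ToyRegionServer server = new ToyRegionServer();
        server.openScanner(scannerId);

        List<String> trace = new ArrayList<>();
        List<Throwable> swallowed = new ArrayList<>();

        // Attempt 1: the server starts next(), but the client gives up waiting.
        server.beginNext(scannerId);
        swallowed.add(new SocketTimeoutException("hypothetical client-side timeout"));
        trace.add("attempt 1 swallowed: " + swallowed.get(0).getClass().getSimpleName());

        // Attempt 2: the retry arrives while attempt 1 still holds the lease.
        try {
            server.beginNext(scannerId);
            trace.add("attempt 2 succeeded");
        } catch (LeaseException e) {
            // Only this exception reaches the caller; the timeout stays hidden.
            trace.add("attempt 2 threw: " + e.getMessage());
        }

        // Attempt 1 finally completes on the server and restores the lease,
        // which would then sit idle until it expires, as in event (e).
        server.finishNext(scannerId);
        trace.add("lease present after attempt 1 finishes: " + server.hasLease(scannerId));
        return trace;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

If the real code paths behave like this model, the mapper would only ever see the LeaseException from the retry, while the lease restored by the first call would expire one lease period later.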

Does this scenario seem possible to anyone?

Thanks,
Igal.
