Hi Ted,

Thank you for your reply. I've followed your advice and added a log message in the catch block. I've been trying to reproduce the problem (I tried running sparse scans, long jobs, etc.), but it hasn't happened yet.
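For reference, here is a minimal sketch of the kind of logging described above. It is not the actual HBase client code: the class `RetryLoggingSketch`, the method `callWithRetries`, and the placeholder `translate` are hypothetical names used only for illustration, and java.util.logging stands in for whatever logging framework the client really uses.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the client retry loop; not the actual HBase source.
class RetryLoggingSketch {
    private static final Logger LOG = Logger.getLogger(RetryLoggingSketch.class.getName());

    // Failures seen across attempts, kept so callers can inspect them later.
    final List<Throwable> exceptions = new ArrayList<>();

    // Placeholder for translateException: here it only unwraps a cause if one is present.
    static Throwable translate(Throwable t) {
        return t.getCause() != null ? t.getCause() : t;
    }

    <T> T callWithRetries(Callable<T> callable, int maxTries) throws IOException {
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                return callable.call();
            } catch (Throwable t) {
                t = translate(t);
                exceptions.add(t);
                // The added log line: exception type, the callable, and the attempt number.
                LOG.log(Level.FINE, "Attempt " + (tries + 1) + " of " + maxTries
                        + " failed for " + callable + ": " + t.getClass().getName(), t);
            }
        }
        throw new IOException("Failed after " + maxTries + " attempts; exceptions: " + exceptions);
    }
}
```

With a line like this in place, a later LeaseException can be matched against whatever was caught on the earlier attempts for the same callable.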
I think that adding a log message there (even at debug level) might be useful in other scenarios as well, since other code paths might silently drop previous exceptions too (some paths in translateException result in an exception being thrown).

Thanks,
Igal.

On Mon, May 21, 2012 at 7:46 PM, Ted Yu <[email protected]> wrote:

> Thanks for the analysis.
>
> It shouldn't be difficult to verify your hypothesis.
> In the following code:
>       } catch (Throwable t) {
>         t = translateException(t);
>         exceptions.add(t);
> You can add a log to show the type of t along with information about
> callable.
>
> When LeaseException happens again, it would be easier to correlate logs.
>
> Cheers
>
> On Mon, May 21, 2012 at 7:34 AM, Igal Shilman <[email protected]> wrote:
>
> > Hi,
> >
> > We've noticed in our production cluster (0.90.4-cdh3u3) that from time to
> > time some of our map tasks fail due to a LeaseException thrown while
> > scanning.
> >
> > We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both
> > set to 5 minutes.
> >
> > What's strange about this is the sequence of events that causes the maps
> > to fail:
> > (Relevant log parts are here: http://pastebin.com/d1yckmz6)
> >
> > (a) A client calls next(69901879722105864, 100).
> > (b) HRegionServer:next tries to call removeLease(69901879722105864) and a
> > LeaseException is thrown (lease 69901879722105864 does not exists.)
> > (c) A few milliseconds later the mapper logs the same error and terminates
> > immediately.
> > (d) A minute later we see that RegionServer$Responder.doRespond fails
> > because the stream is closed (our client died a minute earlier).
> > (e) Five minutes later (= our lease period) the RegionServer's log shows:
> > Scanner 69901879722105864 lease expired.
> >
> > Now that seems pretty odd, especially since (b) happened 5 minutes before
> > (e).
> >
> > This might be possible, IMHO, in the following scenario:
> >
> > 1. A ScannerCallable wishing to call next(69901879722105864, 100) is
> > passed to getRegionServerWithRetries.
> >
> > 2. The RS accepts it, enters next(69901879722105864, 100), and removes the
> > lease associated with "69901879722105864".
> >
> > 3. Meanwhile, getRegionServerWithRetries catches an exception that is not
> > of type DoNotRetryIOException (perhaps a socket timeout?) while waiting
> > for this callable to complete.
> >
> > getRegionServerWithRetries just silently adds this to a list of
> > exceptions.
> >
> > 4. Then a retry causes (b), and then a rethrow of a LeaseException
> > (masking any previous exceptions that were accumulated in (3)).
> >
> > Does this scenario seem possible to anyone?
> >
> > Thanks,
> > Igal.
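
The hypothesized sequence (steps 1 to 4 in the quoted message) can be modelled with a toy client and server. This is only a sketch under stated assumptions, not HBase code: `LeaseMaskingSketch`, `ToyRegionServer`, `nextWithRetries`, and the two exception classes are hypothetical stand-ins, and the first-attempt timeout is simulated rather than produced by a real socket.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of steps 1-4; every class and method here is hypothetical, not HBase code.
class LeaseMaskingSketch {
    // Stand-ins for the non-retriable exception types named in the thread.
    static class DoNotRetryException extends IOException {
        DoNotRetryException(String msg) { super(msg); }
    }

    static class LeaseException extends DoNotRetryException {
        LeaseException(String msg) { super(msg); }
    }

    static class ToyRegionServer {
        private final Set<Long> leases = new HashSet<>();

        void openScanner(long scannerId) { leases.add(scannerId); }

        // Step 2: next() removes the lease on entry. The toy never restores it,
        // modelling a first call that is still in progress when the retry arrives.
        void next(long scannerId) throws LeaseException {
            if (!leases.remove(scannerId)) {
                throw new LeaseException("lease " + scannerId + " does not exist");
            }
        }
    }

    // Exceptions swallowed by the retry loop (step 3).
    final List<Throwable> swallowed = new ArrayList<>();

    void nextWithRetries(ToyRegionServer rs, long scannerId, int maxTries) throws IOException {
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                rs.next(scannerId);
                if (tries == 0) {
                    // Simulated: the server accepted the call, but the client timed out waiting.
                    throw new SocketTimeoutException("simulated timeout");
                }
                return;
            } catch (DoNotRetryException e) {
                throw e; // Step 4: rethrown alone; the earlier timeout is not attached.
            } catch (IOException e) {
                swallowed.add(e); // Step 3: recorded silently, then retried.
            }
        }
        throw new IOException("retries exhausted: " + swallowed);
    }
}
```

In this model, the caller of `nextWithRetries` sees only the LeaseException, while the simulated timeout that triggered the retry sits in `swallowed`. That is the masking effect a log line in the catch block would expose.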
