Will do.

On May 22, 2012 11:48 PM, "Ted Yu" <[email protected]> wrote:
> Makes sense.
> Do you mind opening a JIRA for adding the debug log?
>
> On Tue, May 22, 2012 at 1:42 PM, Igal Shilman <[email protected]> wrote:
>
> > Hi Ted,
> >
> > Thank you for your reply. I've followed your advice and added a log
> > message in the catch block.
> > I've been trying to reproduce the problem (tried running sparse scans,
> > long jobs, etc.), and it hasn't happened yet.
> >
> > I think that adding a log message there (even at debug level) might be
> > useful in other scenarios as well, since some scenarios might silently
> > drop previous exceptions too (some paths in translateException result
> > in an exception being thrown).
> >
> > Thanks,
> > Igal.
> >
> > On Mon, May 21, 2012 at 7:46 PM, Ted Yu <[email protected]> wrote:
> >
> > > Thanks for the analysis.
> > >
> > > It shouldn't be difficult to verify your hypothesis.
> > > In the following code:
> > >
> > >   } catch (Throwable t) {
> > >     t = translateException(t);
> > >     exceptions.add(t);
> > >
> > > you can add a log to show the type of t along with information about
> > > the callable.
> > >
> > > When the LeaseException happens again, it would be easier to
> > > correlate logs.
> > >
> > > Cheers
> > >
> > > On Mon, May 21, 2012 at 7:34 AM, Igal Shilman <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > We've noticed in our production cluster (0.90.4-cdh3u3) that from
> > > > time to time some of our map tasks fail due to a LeaseException
> > > > thrown while scanning.
> > > >
> > > > We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout"
> > > > both set to 5 minutes.
> > > >
> > > > What's strange about this is the sequence of events that causes the
> > > > maps to fail (relevant log parts are here:
> > > > http://pastebin.com/d1yckmz6):
> > > >
> > > > (a) A client calls next(69901879722105864, 100).
> > > > (b) HRegionServer.next tries to call removeLease(69901879722105864)
> > > >     and a LeaseException is thrown (lease 69901879722105864 does
> > > >     not exist).
> > > > (c) A few milliseconds later the mapper logs the same error and
> > > >     terminates immediately.
> > > > (d) A minute later we see that RegionServer$Responder.doRespond
> > > >     fails because the stream is closed (our client died a minute
> > > >     earlier).
> > > > (e) Five minutes later (= our lease period) the RegionServer's log
> > > >     shows: Scanner 69901879722105864 lease expired.
> > > >
> > > > Now that seems pretty odd, especially since (b) happened 5 minutes
> > > > before (e).
> > > >
> > > > This might be possible, IMHO, in the following scenario:
> > > >
> > > > 1. A ScannerCallable wishing to call next(69901879722105864, 100)
> > > >    is passed to getRegionServerWithRetries.
> > > >
> > > > 2. The RS accepts it, enters next(69901879722105864, 100), and
> > > >    removes the lease associated with 69901879722105864.
> > > >
> > > > 3. Meanwhile, getRegionServerWithRetries catches an exception that
> > > >    is not of type DoNotRetryIOException (perhaps a socket timeout?)
> > > >    while waiting for this callable to complete.
> > > >    getRegionServerWithRetries just silently adds it to a list of
> > > >    exceptions.
> > > >
> > > > 4. Then a retry causes (b), followed by a rethrow of the
> > > >    LeaseException (masking any previous exceptions that were
> > > >    accumulated in step 3).
> > > >
> > > > Does this scenario seem possible to anyone?
> > > >
> > > > Thanks,
> > > > Igal.
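For readers following the retry discussion above, here is a minimal,
self-contained Java sketch of the masking behavior Igal describes in
steps 3 and 4, with the per-attempt debug log Ted suggests. It is an
illustration only, not the actual HBase 0.90 HConnectionManager code:
callWithRetries stands in for getRegionServerWithRetries, the
translateException step is elided, and System.err stands in for the
commons-logging LOG.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;

    public class RetryMaskingSketch {

      // Stand-in for getRegionServerWithRetries: every failed attempt is
      // accumulated in a list, but only the exception from the final
      // attempt propagates to the caller.
      static <T> T callWithRetries(Callable<T> callable, int numRetries)
          throws Exception {
        List<Throwable> exceptions = new ArrayList<Throwable>();
        for (int tries = 0; tries < numRetries; tries++) {
          try {
            return callable.call();
          } catch (Throwable t) {
            // The debug log proposed in the thread: record each attempt's
            // exception so an earlier failure (e.g. a socket timeout) can
            // be correlated with the LeaseException that later masks it.
            System.err.println("DEBUG attempt " + tries + " failed for "
                + callable + ": " + t);
            exceptions.add(t);
            if (tries == numRetries - 1) {
              // Only the last exception is rethrown; the earlier entries
              // in 'exceptions' are never seen by the caller.
              throw new Exception("retries exhausted after "
                  + exceptions.size() + " attempts", t);
            }
          }
        }
        throw new IllegalStateException("unreachable");
      }

      public static void main(String[] args) throws Exception {
        // Attempt 1 times out; attempt 2 fails the way step (b) describes.
        final int[] attempt = { 0 };
        callWithRetries(new Callable<String>() {
          public String call() throws Exception {
            if (attempt[0]++ == 0) {
              throw new java.net.SocketTimeoutException("read timed out");
            }
            throw new RuntimeException(
                "lease 69901879722105864 does not exist");
          }
        }, 2);
      }
    }

Running this prints both attempts at debug level, while the exception
that actually propagates mentions only the lease error, which is exactly
the masking hypothesized in step 4.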
