JIRA is down, so I cannot see Igal's patch. Normally we start with a patch for trunk; once it passes review, it is backported.

Cheers

On Tue, May 22, 2012 at 1:48 PM, Ted Yu <[email protected]> wrote:

> Makes sense.
> Do you mind opening a JIRA for adding a debug log?
>
> On Tue, May 22, 2012 at 1:42 PM, Igal Shilman <[email protected]> wrote:
>
>> Hi Ted,
>>
>> Thank you for your reply. I've followed your advice and added a log
>> message in the catch block.
>> I've been trying to reproduce the problem (tried running sparse scans,
>> long jobs, etc.), and it hasn't happened yet.
>>
>> I think that adding a log message there (even at debug level) might be
>> useful in other scenarios as well, since some scenarios might silently
>> drop previous exceptions as well (some paths in translateException result
>> in an exception being thrown).
>>
>> Thanks,
>> Igal.
>>
>> On Mon, May 21, 2012 at 7:46 PM, Ted Yu <[email protected]> wrote:
>>
>> > Thanks for the analysis.
>> >
>> > It shouldn't be difficult to verify your hypothesis.
>> > In the following code:
>> >         } catch (Throwable t) {
>> >           t = translateException(t);
>> >           exceptions.add(t);
>> > You can add a log to show the type of t along with information about
>> > callable.
>> >
>> > When LeaseException happens again, it would be easier to correlate logs.
>> >
>> > Cheers
>> >
>> > On Mon, May 21, 2012 at 7:34 AM, Igal Shilman <[email protected]> wrote:
>> >
>> > > Hi,
>> > >
>> > > We've noticed in our production cluster (0.90.4-cdh3u3) that from time
>> > > to time some of our map tasks fail due to a LeaseException thrown while
>> > > scanning.
>> > >
>> > > We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both
>> > > set to 5 minutes.
>> > >
>> > > What's strange about this is the sequence of events that causes the
>> > > maps to fail:
>> > > (Relevant log parts are here: http://pastebin.com/d1yckmz6)
>> > >
>> > > (a) A client calls next(69901879722105864, 100).
>> > > (b) HRegionServer:next tries to call removeLease(69901879722105864) and
>> > > a LeaseException is thrown (lease 69901879722105864 does not exists.)
>> > > (c) A few milliseconds later the mapper logs the same error and
>> > > terminates immediately.
>> > > (d) A minute later we see that RegionServer$Responder.doRespond fails
>> > > because the stream is closed (our client died a minute earlier).
>> > > (e) Five minutes later (= our lease period) the RegionServer's log
>> > > shows: Scanner 69901879722105864 lease expired.
>> > >
>> > > Now that seems pretty odd, especially since (b) happened 5 minutes
>> > > before (e).
>> > >
>> > > This might be possible, IMHO, in the following scenario:
>> > > 1. A ScannerCallable wishing to call next(69901879722105864, 100) is
>> > > passed to getRegionServerWithRetries.
>> > >
>> > > 2. The RS accepts it, enters next(69901879722105864, 100), and removes
>> > > the lease associated with "69901879722105864".
>> > >
>> > > 3. Meanwhile getRegionServerWithRetries catches an exception that is
>> > > not of type DoNotRetryIOException (perhaps a socket timeout?) while
>> > > waiting for this callable to complete.
>> > >
>> > > getRegionServerWithRetries just silently adds this to a list of
>> > > exceptions.
>> > >
>> > > 4. Then a retry causes (b), and then a rethrow of a LeaseException
>> > > (masking any previous exceptions that were accumulated in (3)).
>> > >
>> > > Does this scenario seem possible to anyone?
>> > >
>> > > Thanks,
>> > > Igal.
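
For anyone following the thread, the masking described in steps 3 and 4 can be illustrated with a small standalone sketch. This is not the HBase client code: RetryLoggingSketch, NoRetryException and callWithRetries are hypothetical stand-ins (NoRetryException plays the role of a non-retryable error such as the LeaseException seen here), and the sketch catches Exception rather than Throwable to stay short. It only shows why a debug-level log in the catch block would help correlate a later non-retryable failure with an earlier retryable one.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RetryLoggingSketch {
    private static final Logger LOG =
            Logger.getLogger(RetryLoggingSketch.class.getName());

    /** Hypothetical stand-in for a non-retryable error (not an HBase class). */
    static class NoRetryException extends IOException {
        NoRetryException(String msg) { super(msg); }
    }

    /**
     * Simplified retry loop (an assumption, not the real client method).
     * Retryable failures are collected in 'seen'; a NoRetryException is
     * rethrown at once, so the caller never learns about earlier failures
     * unless each one was logged when it was caught.
     */
    static <T> T callWithRetries(Callable<T> callable, int maxTries,
                                 List<Throwable> seen) throws Exception {
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return callable.call();
            } catch (Exception e) {
                // The proposed addition: log the exception type and the
                // callable on every failure, even at debug (FINE) level.
                LOG.log(Level.FINE, "attempt " + attempt + " of " + callable
                        + " failed with " + e.getClass().getName(), e);
                if (e instanceof NoRetryException) {
                    throw e; // masks everything accumulated in 'seen'
                }
                seen.add(e);
            }
        }
        throw new IOException("retries exhausted after " + seen.size() + " failures");
    }

    public static void main(String[] args) {
        List<Throwable> seen = new ArrayList<>();
        int[] calls = {0};
        // First call simulates a client-side timeout; the retry then hits
        // the simulated "lease does not exist" error.
        Callable<String> scannerNext = () -> {
            if (calls[0]++ == 0) {
                throw new SocketTimeoutException("simulated timeout");
            }
            throw new NoRetryException("simulated: lease does not exist");
        };
        try {
            callWithRetries(scannerNext, 3, seen);
        } catch (Exception e) {
            System.out.println("caller sees: " + e.getClass().getSimpleName());
            System.out.println("earlier failures hidden from caller: " + seen);
        }
    }
}
```

With the log line in place, the simulated timeout from the first attempt shows up in the client log next to the later failure, which is the correlation suggested above.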
