Hi,

We've noticed in our production cluster (0.90.4-cdh3u3) that from time to time some of our map tasks fail due to a LeaseException thrown while scanning. We have "hbase.regionserver.lease.period" and "hbase.rpc.timeout" both set to 5 minutes.

What's strange is the sequence of events that causes the maps to fail (relevant log parts are here: http://pastebin.com/d1yckmz6):

(a) A client calls next(69901879722105864, 100).
(b) HRegionServer:next tries to call removeLease(69901879722105864) and a LeaseException is thrown (lease 69901879722105864 does not exist).
(c) A few milliseconds later the mapper logs the same error and terminates immediately.
(d) A minute later we see that RegionServer$Responder.doRespond fails because the stream is closed (our client died a minute earlier).
(e) Five minutes later (= our lease period) the RegionServer's log shows: Scanner 69901879722105864 lease expired.

That seems pretty odd, especially since (b) happened 5 minutes before (e): the lease was reported missing well before it expired.

IMHO this might be possible in the following scenario:

1. A ScannerCallable wishing to call next(69901879722105864, 100) is passed to getRegionServerWithRetries.
2. The RS accepts it, enters next(69901879722105864, 100), and removes the lease associated with "69901879722105864".
3. Meanwhile, getRegionServerWithRetries catches an exception that is not of type DoNotRetryIOException (perhaps a socket timeout?) while waiting for this callable to complete. getRegionServerWithRetries just silently adds it to a list of exceptions.
4. A retry then causes (b), followed by a rethrow of the LeaseException (masking any previous exceptions that were accumulated in step 3).

Does this scenario seem possible to anyone?

Thanks,
Igal.
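
P.S. To make the suspected sequence in steps 1-4 concrete, here is a minimal sketch in plain Java. It does not use any HBase classes: ToyRegionServer, the nested LeaseException, Outcome and nextWithRetries are all hypothetical stand-ins I made up for illustration, and the "timeout" is simulated by having the first call fail after the lease has already been removed. It only shows how a retry could turn a swallowed timeout into a LeaseException, not how the real client or server is implemented.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LeaseRaceSketch {

    /** Hypothetical stand-in for the real LeaseException. */
    static class LeaseException extends IOException {
        LeaseException(String msg) { super(msg); }
    }

    /** Toy server that tracks scanner leases (assumption: not real HBase code). */
    static class ToyRegionServer {
        private final Set<Long> leases = new HashSet<>();
        private boolean firstCall = true;

        void openScanner(long scannerId) { leases.add(scannerId); }

        void next(long scannerId) throws IOException {
            // Step 2: the lease is removed for the duration of the call.
            if (!leases.remove(scannerId)) {
                throw new LeaseException("lease '" + scannerId + "' does not exist");
            }
            if (firstCall) {
                firstCall = false;
                // Step 3 (simulated): the client stops waiting while the call is
                // still in flight, so the lease has not been put back yet.
                throw new SocketTimeoutException("simulated client-side timeout");
            }
            // Normal path: the lease is restored once the call completes.
            leases.add(scannerId);
        }
    }

    /** What the caller ends up seeing versus what was silently collected. */
    static class Outcome {
        final List<IOException> swallowed = new ArrayList<>();
        IOException surfaced;
    }

    /** Toy retry loop: retriable errors are collected, LeaseException is not retried. */
    static Outcome nextWithRetries(ToyRegionServer rs, long scannerId, int tries) {
        Outcome out = new Outcome();
        for (int i = 0; i < tries; i++) {
            try {
                rs.next(scannerId);
                return out;
            } catch (LeaseException e) {
                // Step 4: the retry hits the missing lease; this is what surfaces.
                out.surfaced = e;
                return out;
            } catch (IOException e) {
                // Step 3: silently added to a list, then retried.
                out.swallowed.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long scannerId = 69901879722105864L;
        ToyRegionServer rs = new ToyRegionServer();
        rs.openScanner(scannerId);
        Outcome out = nextWithRetries(rs, scannerId, 3);
        System.out.println("surfaced:  " + out.surfaced);
        System.out.println("swallowed: " + out.swallowed);
    }
}
```

In this toy model the mapper would only ever see the LeaseException, while the timeout that actually started the problem stays hidden in the swallowed list, which matches what we observe in (b) and (c).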
