Hi,

Do you mind taking a look at HBASE-6071? It was submitted as a result of this mail (back in May):
http://mail-archives.apache.org/mod_mbox/hbase-user/201205.mbox/%3CCAFebPXBq9V9BVdzRTNr-MB3a1Lz78SZj6gvP6On0b%2Bajt9StAg%40mail.gmail.com%3E
I've recently submitted logs that (I think) confirm this theory.

Thanks,
Igal

On Thu, Sep 20, 2012 at 4:55 PM, Harsh J <[email protected]> wrote:
> Hi Daniel,
>
> That sounds fine to do (easier a solution, my brain's gotten complex today ha).
>
> We should classify the two types of error in the docs for users the way
> you have here, to indicate what the issue is in each of the error cases
> - UnknownScannerException and LeaseException. Mind filing a JIRA? :)
>
> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <[email protected]> wrote:
> > Thaaank you! I was waiting for this email for months. I've read all the
> > posts regarding lease timeouts and see that people usually have them for
> > two reasons. One, the normal case where the client app does not process
> > the row fast enough, so they get UnknownScannerException; and some had
> > the issue below and get LeaseException instead.
> >
> > How about using a try/catch around the
> >
> >   // Remove lease while its being processed in server; protects against case
> >   // where processing of request takes > lease expiration time.
> >   lease = this.leases.removeLease(scannerName);
> >
> > and re-throwing an IllegalStateException, or logging a warning message?
> > A client with an active scanner but no lease does not seem to be in the
> > right state.
> >
> > Just an idea, but you know better.
> > Daniel
> >
> > On 09/20/2012 03:42 PM, Harsh J wrote:
> >
> > Hi,
> >
> > I hit this today and got down to investigating it, and one of my
> > colleagues discovered this thread. Since I got some more clues, I
> > thought I'd bump this thread for good.
> >
> > Lucian almost got the issue here. The thing we missed thinking about
> > is the client retry. The client of HBaseRPC seems to silently retry on
> > timeouts. So if you take Lucian's theory below and add that a client
> > retry calls next(ID, Rows) yet again, you can construct this issue:
> >
> > - Client calls next(ID, Rows) the first time.
> > - RS receives the handler-sent request, removes the lease (so as not to
> >   expire it during the next() call) and begins work.
> > - RS#next hangs during the work (for whatever reason we can assume -
> >   large values, locks, or whatever).
> > - Client times out after a minute and retries (due to the default
> >   behavior). The retry seems to be silent, though?
> > - A new next(ID, Rows) call is invoked. The scanner still exists, so no
> >   UnknownScannerException is thrown. But when next() tries to remove the
> >   lease, a LeaseException is thrown (which the client gets immediately,
> >   and dies), since the other, parallel handler has already removed the
> >   lease object and holds it in its stuck state.
> > - A few secs/mins later, the original next() unfreezes, adds the lease
> >   back to the queue, tries to write back its response, and runs into a
> >   ClosedChannelException, as the client had already thrown its original
> >   socket away. End of client.
> > - A lease-period later, the lease is formally removed without any hitches.
> >
> > Ideally, to prevent this, the rpc.timeout must be > the lease period, as
> > was pointed out, since in that case we'd have waited X units more for
> > the original next() to unblock and continue, and would not have retried.
> > That is how this is avoided, unintentionally, but it can still happen if
> > next() takes very long.
> >
> > I haven't seen a LeaseException in any other case so far, so maybe we
> > can improve that exception's message to indicate what's going on in
> > simpler terms, so clients can reconfigure to fix themselves?
> >
> > Also, we could add some measures to prevent next()-duping, as that is
> > never going to work given the lease-required system. Perhaps when next()
> > stores the removed lease, we could store it somewhere global (like
> > ActiveLeases or summat) and deny next()-duping if the requested lease is
> > already in ActiveLeases? That just ends up giving a better message,
> > though - not a solution.
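[Editor's note: the race Harsh describes, and Daniel's try/catch suggestion above, can be sketched as a toy model. This is not the HBase source - the class, map, and messages are invented for illustration - but it captures the mechanism: a lease is absent from the map while a next() call holds it, so a retried next() for the same scanner fails fast with a lease error.]

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model (NOT HBase code) of the scanner-lease race described above.
class LeaseRaceSketch {
    static class LeaseException extends RuntimeException {
        LeaseException(String m) { super(m); }
    }

    // scannerName -> lease token; a lease is absent while a next() holds it.
    static final Map<String, Object> leases = new ConcurrentHashMap<>();

    // Hypothetical guard in the spirit of Daniel's suggestion: surface a
    // clearer error instead of a bare LeaseException deep in the server.
    static Object removeLease(String scannerName) {
        Object lease = leases.remove(scannerName);
        if (lease == null) {
            throw new LeaseException("Lease for scanner " + scannerName
                + " is held by an in-flight next() - likely a client retry");
        }
        return lease;
    }

    public static void main(String[] args) {
        leases.put("scanner-1", new Object());
        // First next(): takes the lease and then hangs (simulated).
        Object held = removeLease("scanner-1");
        boolean retryFailed = false;
        try {
            removeLease("scanner-1"); // retried next(): lease already gone
        } catch (LeaseException e) {
            retryFailed = true;       // this is what the client sees
        }
        // Much later, the original next() unfreezes and restores the lease,
        // but the client connection is long gone by then.
        leases.put("scanner-1", held);
        System.out.println(retryFailed);
    }
}
```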
> >
> > Hope this helps others who've run into the same issue.
> >
> > On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
> > <[email protected]> wrote:
> >
> > So you should see the SocketTimeoutException in your *client* logs (in
> > your case, the mappers), not a LeaseException. At this point, yes,
> > you're going to time out, but if you spend so much time cycling on the
> > server side then you shouldn't set a high caching configuration on your
> > scanner, as IO isn't your bottleneck.
> >
> > J-D
> >
> > On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> > <[email protected]> wrote:
> >
> > Hi,
> >
> > The servers have been restarted (I have had this configuration for more
> > than a month, so that is not the problem). As for the stack traces, they
> > show exactly the same thing: a lot of ClosedChannelExceptions and
> > LeaseExceptions.
> >
> > But I found something that could be the problem: hbase.rpc.timeout. This
> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> > could happen this way:
> > - the mapper makes a scanner.next() call to the region server
> > - the region server needs more than 60 seconds to execute it (I use
> >   multiple filters, and it can take a lot of time)
> > - the scan client hits the timeout and cuts the connection
> > - the region server tries to send the results to the client ==>
> >   ClosedChannelException
> >
> > I will take a deeper look into it tomorrow. If you have other
> > suggestions, please let me know!
> >
> > Thanks,
> > Lucian
> >
> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
> > <[email protected]> wrote:
> >
> > Did you restart the region servers after changing the config?
> >
> > Are you sure it's the same exception/stack trace?
> >
> > J-D
> >
> > On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> > <[email protected]> wrote:
> >
> > Hi all,
> >
> > I have exactly the same problem that Eran had.
> > But there is something I don't understand: in my case, I have set the
> > lease time to 240000 (4 minutes), but most of the map tasks that are
> > failing run for about 2 minutes. How is it possible to get a
> > LeaseException if the task runs for less than the configured lease time?
> >
> > Regards,
> > Lucian Iordache
> >
> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
> >
> > Perfect! Thanks.
> >
> > -eran
> >
> > On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans
> > <[email protected]> wrote:
> >
> > hbase.regionserver.lease.period
> >
> > Set it bigger than 60000.
> >
> > J-D
> >
> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
> >
> > Thanks J-D!
> > Since my main table is expected to keep growing, I guess at some point
> > even setting the cache size to 1 will not be enough. Is there a way to
> > configure the lease timeout?
> >
> > -eran
> >
> > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans
> > <[email protected]> wrote:
> >
> > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
> >
> > Hi J-D,
> > Thanks for the detailed explanation.
> > So if I understand correctly, the lease we're talking about is a
> > scanner lease, and the timeout is between two scanner calls, correct? I
> > think that makes sense, because I now realize that the jobs that fail
> > (some jobs continued to fail even after reducing the number of map
> > tasks, as Stack suggested) use filters to fetch relatively few rows out
> > of a very large table, so they could be spending a lot of time on the
> > region server scanning rows until they reach my setCaching value, which
> > was 1000. Setting the caching value to 1 seems to allow these jobs to
> > complete.
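[Editor's note: pulling together the two settings discussed in this thread, a sketch of the relevant hbase-site.xml entries might look like the following. The values are illustrative only; the key point from Harsh's analysis above is that the RPC timeout should exceed the scanner lease period so a slow next() does not trigger a silent client retry. As J-D notes, the region servers must be restarted to pick the lease setting up.]

```xml
<!-- hbase-site.xml (illustrative values, not recommendations) -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <!-- 4 minutes allowed between two scanner next() calls -->
  <value>240000</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <!-- keep the RPC timeout above the lease period (5 min > 4 min) -->
  <value>300000</value>
</property>
```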
> > I think it has to be the above, since my rows are small, with just a
> > few columns, and processing them is very quick.
> >
> > Excellent!
> >
> > However, there are still a couple of things I don't understand:
> > 1. What is the difference between setCaching and setBatch?
> >
> >   * Set the maximum number of values to return for each call to next()
> >
> >   VS
> >
> >   * Set the number of rows for caching that will be passed to scanners.
> >
> > The former is useful if you have rows with millions of columns and you
> > could setBatch to get only 1000 of them at a time. You could call that
> > intra-row scanning.
> >
> > 2. Examining the region server logs more closely than I did yesterday,
> > I see a lot of ClosedChannelExceptions in addition to the expired
> > leases (but no UnknownScannerException). Is that expected? You can see
> > an excerpt of the log from one of the region servers here:
> > http://pastebin.com/NLcZTzsY
> >
> > It means that by the time the server got to process that client request
> > and started reading from the socket, the client was already gone.
> > Killing a client does that (as does killing a MR job that scans), and
> > so does a SocketTimeoutException. This should probably go in the book.
> > We should also print something nicer :)
> >
> > J-D
> >
> > --
> > Harsh J
>
> --
> Harsh J
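[Editor's note: J-D's setCaching/setBatch distinction can be illustrated with some toy arithmetic - plain Java, not the HBase client API. Roughly, setCaching decides how many rows come back per next() RPC round trip, while setBatch caps how many values (columns) each Result carries, which is why a large caching value makes each server-side call do more filtering work before anything is returned.]

```java
// Toy arithmetic (NOT the HBase client) contrasting the two scan knobs.
class ScanKnobsSketch {
    // Rough number of client->server round trips needed to drain a scan,
    // with setCaching = 'caching' rows fetched per next() RPC.
    static long rpcCalls(long totalRows, int caching) {
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    // Rough number of Result objects one wide row is split into,
    // with setBatch = 'batch' values per Result (intra-row scanning).
    static long resultsPerRow(int columnsInRow, int batch) {
        return (columnsInRow + batch - 1) / batch;
    }

    public static void main(String[] args) {
        // caching=1000 amortizes RPCs, but each call must scan/filter up to
        // 1000 rows server-side before returning - the slow-filter case
        // that blows the lease in this thread. caching=1 returns after
        // every row, at the cost of many more round trips.
        System.out.println(rpcCalls(1_000_000, 1000)); // 1000 round trips
        System.out.println(rpcCalls(1_000_000, 1));    // 1000000 round trips
        // A row with a million columns, delivered 1000 values at a time.
        System.out.println(resultsPerRow(1_000_000, 1000)); // 1000 Results
    }
}
```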
