I'll add something in the docs.
On 10/27/11 3:35 AM, "Lucian Iordache" <[email protected]> wrote:

>Yep. It did not work entirely.
>
>I had a job to run on 1000 regions, and the caching was 200. The job crashed
>with a lot of ClosedChannelExceptions + LeaseExceptions.
>
>Set the caching to 10 ==> the same.
>Set the caching to 1 ==> ~600 successfully completed tasks, but still a lot
>of them crashed ==> job crashed
>Set the hbase.rpc.timeout to 240000 (which is the lease timeout on the
>region server) ==> the job completed successfully, without any failed
>attempts.
>
>The problem was that we have some very large regions (2GB), and some of
>them contain very little data; that's why it takes more than 60 seconds to
>get even the first row. As Daniel said, the documentation for the region
>server lease timeout and hbase.rpc.timeout should mention that you need to
>be careful when modifying them, because you can run into problems like in
>our case.
>
>Regards,
>Lucian
>
>On Wed, Oct 26, 2011 at 7:53 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> Did you try setting the scanner caching down like I mentioned?
>>
>> J-D
>>
>> On Wed, Oct 26, 2011 at 8:48 AM, Lucian Iordache
>> <[email protected]> wrote:
>> > Problem solved. It was like I said, the server took more than the
>> > hbase.rpc.timeout to run the call and the client closed the connection.
>> >
>> > Best Regards,
>> > Lucian
>> >
>> > On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache
>> > <[email protected]> wrote:
>> >
>> >> Yes, I will try to see the SocketTimeoutException after setting logging
>> >> to debug, because, like it says here
>> >> https://issues.apache.org/jira/browse/HBASE-3154 , this is logged at
>> >> debug level on the client side.
>> >>
>> >> Regards,
>> >> Lucian
>> >>
>> >> On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans
>> >> <[email protected]> wrote:
>> >>
>> >>> So you should see the SocketTimeoutException in your *client* logs (in
>> >>> your case, mappers), not LeaseException.
>> >>> At this point yes you're going to time out, but if you spend so much
>> >>> time cycling on the server side then you shouldn't set a high caching
>> >>> configuration on your scanner as IO isn't your bottleneck.
>> >>>
>> >>> J-D
>> >>>
>> >>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>> >>> <[email protected]> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > The servers have been restarted (I have had this configuration for
>> >>> > more than a month, so this is not the problem).
>> >>> > About the stack traces, they show exactly the same, a lot of
>> >>> > ClosedChannelExceptions and LeaseExceptions.
>> >>> >
>> >>> > But I found something that could be the problem: hbase.rpc.timeout.
>> >>> > This defaults to 60 seconds, and I did not modify it in
>> >>> > hbase-site.xml. So it could happen the following way:
>> >>> > - the mapper makes a scanner.next call to the region server
>> >>> > - the region server needs more than 60 seconds to execute it (I use
>> >>> >   multiple filters, and it could take a lot of time)
>> >>> > - the scan client gets the timeout and cuts the connection
>> >>> > - the region server tries to send the results to the client ==>
>> >>> >   ClosedChannelException
>> >>> >
>> >>> > I will take a deeper look into it tomorrow. If you have other
>> >>> > suggestions, please let me know!
>> >>> >
>> >>> > Thanks,
>> >>> > Lucian
>> >>> >
>> >>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>> >>> > <[email protected]> wrote:
>> >>> >
>> >>> >> Did you restart the region servers after changing the config?
>> >>> >>
>> >>> >> Are you sure it's the same exception/stack trace?
>> >>> >>
>> >>> >> J-D
>> >>> >>
>> >>> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> >>> >> <[email protected]> wrote:
>> >>> >> > Hi all,
>> >>> >> >
>> >>> >> > I have exactly the same problem that Eran had.
>> >>> >> > But there is something I don't understand: in my case, I have set
>> >>> >> > the lease time to 240000 (4 minutes). But most of the map tasks
>> >>> >> > that are failing run about 2 minutes. How is it possible to get a
>> >>> >> > LeaseException if the task runs less than the configured time for
>> >>> >> > a lease?
>> >>> >> >
>> >>> >> > Regards,
>> >>> >> > Lucian Iordache
>> >>> >> >
>> >>> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
>> >>> >> >
>> >>> >> >> Perfect! Thanks.
>> >>> >> >>
>> >>> >> >> -eran
>> >>> >> >>
>> >>> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans
>> >>> >> >> <[email protected]> wrote:
>> >>> >> >>
>> >>> >> >> > hbase.regionserver.lease.period
>> >>> >> >> >
>> >>> >> >> > Set it bigger than 60000.
>> >>> >> >> >
>> >>> >> >> > J-D
>> >>> >> >> >
>> >>> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
>> >>> >> >> > >
>> >>> >> >> > > Thanks J-D!
>> >>> >> >> > > Since my main table is expected to continue growing I guess at
>> >>> >> >> > > some point even setting the cache size to 1 will not be enough.
>> >>> >> >> > > Is there a way to configure the lease timeout?
>> >>> >> >> > >
>> >>> >> >> > > -eran
>> >>> >> >> > >
>> >>> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans
>> >>> >> >> > > <[email protected]> wrote:
>> >>> >> >> > >
>> >>> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]>
>> >>> >> >> > > > wrote:
>> >>> >> >> > > >
>> >>> >> >> > > > > Hi J-D,
>> >>> >> >> > > > > Thanks for the detailed explanation.
>> >>> >> >> > > > > So if I understand correctly the lease we're talking about is
>> >>> >> >> > > > > a scanner lease and the timeout is between two scanner calls,
>> >>> >> >> > > > > correct?
>> >>> >> >> > > > > I think that makes sense because I now realize that jobs that
>> >>> >> >> > > > > fail (some jobs continued to fail even after reducing the
>> >>> >> >> > > > > number of map tasks as Stack suggested) use filters to fetch
>> >>> >> >> > > > > relatively few rows out of a very large table, so they could
>> >>> >> >> > > > > be spending a lot of time on the region server scanning rows
>> >>> >> >> > > > > until it reached my setCaching value, which was 1000. Setting
>> >>> >> >> > > > > the caching value to 1 seems to allow these jobs to complete.
>> >>> >> >> > > > > I think it has to be the above, since my rows are small, with
>> >>> >> >> > > > > just a few columns, and processing them is very quick.
>> >>> >> >> > > > >
>> >>> >> >> > > >
>> >>> >> >> > > > Excellent!
>> >>> >> >> > > >
>> >>> >> >> > > > >
>> >>> >> >> > > > > However, there are still a couple of things I don't understand:
>> >>> >> >> > > > > 1. What is the difference between setCaching and setBatch?
>> >>> >> >> > > > >
>> >>> >> >> > > >
>> >>> >> >> > > > * Set the maximum number of values to return for each call to next()
>> >>> >> >> > > >
>> >>> >> >> > > > VS
>> >>> >> >> > > >
>> >>> >> >> > > > * Set the number of rows for caching that will be passed to scanners.
>> >>> >> >> > > >
>> >>> >> >> > > > The former is useful if you have rows with millions of columns
>> >>> >> >> > > > and you could setBatch to get only 1000 of them at a time. You
>> >>> >> >> > > > could call that intra-row scanning.
>> >>> >> >> > > >
>> >>> >> >> > > > >
>> >>> >> >> > > > > 2.
>> >>> >> >> > > > > Examining the region server logs more closely than I did
>> >>> >> >> > > > > yesterday I see a lot of ClosedChannelExceptions in addition
>> >>> >> >> > > > > to the expired leases (but no UnknownScannerException), is
>> >>> >> >> > > > > that expected? You can see an excerpt of the log from one of
>> >>> >> >> > > > > the region servers here: http://pastebin.com/NLcZTzsY
>> >>> >> >> > > >
>> >>> >> >> > > > It means that when the server got to process that client request
>> >>> >> >> > > > and started reading from the socket, the client was already
>> >>> >> >> > > > gone. Killing a client does that (or killing an MR job that
>> >>> >> >> > > > scans), so does SocketTimeoutException. This should probably go
>> >>> >> >> > > > in the book. We should also print something nicer :)
>> >>> >> >> > > >
>> >>> >> >> > > > J-D
>> >>
>> >> --
>> >> All the best,
>> >> Lucian
>> >
>> > --
>> > All the best,
>> > Lucian
>
>--
>All the best,
>Lucian
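
A rough way to reason about the numbers in this thread: when a filter matches few rows, a single next() call may have to scan many rows before it has collected as many results as the caching value asks for, and that time has to stay under hbase.rpc.timeout (60 seconds by default, per the thread). The sketch below is a back-of-the-envelope model in plain Java, not HBase code. The class name, the method names, the scan rate, and the rows-scanned-per-match figure are all made-up assumptions for illustration.

```java
/**
 * Back-of-the-envelope model, NOT HBase code. All names and numbers here
 * are hypothetical; measure your own scan rate and filter selectivity.
 */
public class ScannerTimeoutEstimate {

    /**
     * Estimated milliseconds for one next() call to collect `caching` rows,
     * assuming the filter passes one row out of every `rowsScannedPerMatch`
     * and the region server scans `rowsScannedPerSecond` rows per second.
     */
    static long estimateNextCallMillis(long caching, long rowsScannedPerMatch,
                                       long rowsScannedPerSecond) {
        long rowsToScan = caching * rowsScannedPerMatch;
        return rowsToScan * 1000L / rowsScannedPerSecond;
    }

    /** Largest caching value whose estimated next() time fits in the timeout. */
    static long maxCachingWithin(long timeoutMillis, long rowsScannedPerMatch,
                                 long rowsScannedPerSecond) {
        long rowsScannableInTimeout = timeoutMillis * rowsScannedPerSecond / 1000L;
        return rowsScannableInTimeout / rowsScannedPerMatch;
    }

    public static void main(String[] args) {
        long rpcTimeoutMillis = 60_000L;      // default hbase.rpc.timeout per the thread
        long rowsScannedPerMatch = 5_000L;    // assumed: 1 matching row per 5000 scanned
        long rowsScannedPerSecond = 50_000L;  // assumed server-side scan rate

        for (long caching : new long[] {1000L, 200L, 10L, 1L}) {
            long millis = estimateNextCallMillis(
                    caching, rowsScannedPerMatch, rowsScannedPerSecond);
            System.out.println("caching=" + caching + " -> ~" + millis
                    + " ms per next() call"
                    + (millis > rpcTimeoutMillis ? " (exceeds rpc timeout)" : ""));
        }
        System.out.println("max caching within timeout: " + maxCachingWithin(
                rpcTimeoutMillis, rowsScannedPerMatch, rowsScannedPerSecond));
    }
}
```

If a region is sparse enough, even a caching value of 1 can exceed the timeout in this model, which is consistent with Lucian's report that the job only completed after hbase.rpc.timeout was raised to 240000 to match the region server lease period.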
