Did you try setting the scanner caching down like I mentioned?

J-D

On Wed, Oct 26, 2011 at 8:48 AM, Lucian Iordache <[email protected]> wrote:
> Problem solved. It was as I said: the server took longer than
> hbase.rpc.timeout to run the call, and the client closed the connection.
>
> Best Regards,
> Lucian
>
> On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache <[email protected]> wrote:
>
>> Yes, I will try to see the SocketTimeoutException after setting the log
>> level to debug because, as it says here
>> https://issues.apache.org/jira/browse/HBASE-3154 , this is logged at
>> debug level on the client side.
>>
>> Regards,
>> Lucian
>>
>>
>> On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>
>>> So you should see the SocketTimeoutException in your *client* logs (in
>>> your case, mappers), not LeaseException. At this point yes, you're
>>> going to time out, but if you spend so much time cycling on the server
>>> side then you shouldn't set a high caching configuration on your
>>> scanner, as IO isn't your bottleneck.
>>>
>>> J-D
>>>
>>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > The servers have been restarted (I have had this configuration for
>>> > more than a month, so this is not the problem).
>>> > About the stack traces, they show exactly the same: a lot of
>>> > ClosedChannelExceptions and LeaseExceptions.
>>> >
>>> > But I found something that could be the problem: hbase.rpc.timeout.
>>> > This defaults to 60 seconds, and I did not modify it in
>>> > hbase-site.xml. So it could happen as follows:
>>> > - the mapper makes a scanner.next call to the region server
>>> > - the region server needs more than 60 seconds to execute it (I use
>>> >   multiple filters, and it could take a lot of time)
>>> > - the scan client gets the timeout and cuts the connection
>>> > - the region server tries to send the results to the client ==>
>>> >   ClosedChannelException
>>> >
>>> > I will take a deeper look into it tomorrow. If you have other
>>> > suggestions, please let me know!
>>> >
>>> > Thanks,
>>> > Lucian
>>> >
>>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>> >
>>> >> Did you restart the region servers after changing the config?
>>> >>
>>> >> Are you sure it's the same exception/stack trace?
>>> >>
>>> >> J-D
>>> >>
>>> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache <[email protected]> wrote:
>>> >> > Hi all,
>>> >> >
>>> >> > I have exactly the same problem that Eran had.
>>> >> > But there is something I don't understand: in my case, I have set
>>> >> > the lease time to 240000 (4 minutes), but most of the failing map
>>> >> > tasks run for about 2 minutes. How is it possible to get a
>>> >> > LeaseException if the task runs for less than the configured
>>> >> > lease time?
>>> >> >
>>> >> > Regards,
>>> >> > Lucian Iordache
>>> >> >
>>> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
>>> >> >
>>> >> >> Perfect! Thanks.
>>> >> >>
>>> >> >> -eran
>>> >> >>
>>> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <[email protected]> wrote:
>>> >> >>
>>> >> >> > hbase.regionserver.lease.period
>>> >> >> >
>>> >> >> > Set it bigger than 60000.
>>> >> >> >
>>> >> >> > J-D
>>> >> >> >
>>> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
>>> >> >> > >
>>> >> >> > > Thanks J-D!
>>> >> >> > > Since my main table is expected to continue growing, I guess
>>> >> >> > > at some point even setting the cache size to 1 will not be
>>> >> >> > > enough. Is there a way to configure the lease timeout?
>>> >> >> > >
>>> >> >> > > -eran
>>> >> >> > >
>>> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <[email protected]> wrote:
>>> >> >> > >
>>> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
>>> >> >> > > >
>>> >> >> > > > > Hi J-D,
>>> >> >> > > > > Thanks for the detailed explanation.
>>> >> >> > > > > So if I understand correctly, the lease we're talking
>>> >> >> > > > > about is a scanner lease and the timeout is between two
>>> >> >> > > > > scanner calls, correct? I think that makes sense because
>>> >> >> > > > > I now realize that the jobs that fail (some jobs
>>> >> >> > > > > continued to fail even after reducing the number of map
>>> >> >> > > > > tasks as Stack suggested) use filters to fetch relatively
>>> >> >> > > > > few rows out of a very large table, so they could be
>>> >> >> > > > > spending a lot of time on the region server scanning rows
>>> >> >> > > > > until it reached my setCaching value, which was 1000.
>>> >> >> > > > > Setting the caching value to 1 seems to allow these jobs
>>> >> >> > > > > to complete.
>>> >> >> > > > > I think it has to be the above, since my rows are small,
>>> >> >> > > > > with just a few columns, and processing them is very
>>> >> >> > > > > quick.
>>> >> >> > > > >
>>> >> >> > > >
>>> >> >> > > > Excellent!
>>> >> >> > > >
>>> >> >> > > > >
>>> >> >> > > > > However, there are still a couple of things I don't
>>> >> >> > > > > understand:
>>> >> >> > > > > 1. What is the difference between setCaching and setBatch?
>>> >> >> > > > >
>>> >> >> > > >
>>> >> >> > > > * setBatch: Set the maximum number of values to return for
>>> >> >> > > >   each call to next()
>>> >> >> > > >
>>> >> >> > > > VS
>>> >> >> > > >
>>> >> >> > > > * setCaching: Set the number of rows for caching that will
>>> >> >> > > >   be passed to scanners.
>>> >> >> > > >
>>> >> >> > > > The former is useful if you have rows with millions of
>>> >> >> > > > columns: you could setBatch to get only 1000 of them at a
>>> >> >> > > > time. You could call that intra-row scanning.
>>> >> >> > > >
>>> >> >> > > > > 2. Examining the region server logs more closely than I
>>> >> >> > > > > did yesterday, I see a lot of ClosedChannelExceptions in
>>> >> >> > > > > addition to the expired leases (but no
>>> >> >> > > > > UnknownScannerException). Is that expected? You can see
>>> >> >> > > > > an excerpt of the log from one of the region servers
>>> >> >> > > > > here: http://pastebin.com/NLcZTzsY
>>> >> >> > > >
>>> >> >> > > > It means that when the server got to process that client
>>> >> >> > > > request and started reading from the socket, the client was
>>> >> >> > > > already gone. Killing a client does that (or killing an MR
>>> >> >> > > > job that scans), and so does SocketTimeoutException. This
>>> >> >> > > > should probably go in the book. We should also print
>>> >> >> > > > something nicer :)
>>> >> >> > > >
>>> >> >> > > > J-D
>>> >> >> > > >
>>
>> --
>> All the best,
>> Lucian
>
> --
> All the best,
> Lucian

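For anyone reading this thread in the archive: the failure mode discussed above (a selective filter plus a high caching value making a single scanner.next call run longer than the RPC timeout) can be illustrated with some back-of-envelope arithmetic. The Java sketch below is not HBase code and calls no HBase API. The class name, the method names, and the scan-cost numbers (rows scanned per matching row, microseconds per scanned row) are all made-up assumptions for illustration. Only the 60000 ms timeout and the caching values 1000 and 1 come from the thread.

```java
public class ScannerTimeoutSketch {

    // From the thread: hbase.rpc.timeout defaults to 60 seconds.
    static final long RPC_TIMEOUT_MILLIS = 60_000L;

    /**
     * Rough server-side time for one scanner.next() call, in milliseconds.
     * Assumes the region server keeps scanning until it has collected
     * `caching` matching rows, and that the filter passes one row out of
     * every `rowsScannedPerMatch` scanned (hypothetical selectivity).
     */
    static long estimateNextCallMillis(int caching, long rowsScannedPerMatch,
                                       long microsPerRowScanned) {
        return caching * rowsScannedPerMatch * microsPerRowScanned / 1000L;
    }

    /** Largest caching value whose estimated next() time fits in the timeout (at least 1). */
    static long maxCachingWithinTimeout(long timeoutMillis, long rowsScannedPerMatch,
                                        long microsPerRowScanned) {
        long microsPerMatch = rowsScannedPerMatch * microsPerRowScanned;
        return Math.max(1L, timeoutMillis * 1000L / microsPerMatch);
    }

    public static void main(String[] args) {
        long rowsScannedPerMatch = 10_000L; // assumption: very selective filter
        long microsPerRowScanned = 10L;     // assumption: per-row scan cost

        // Caching values mentioned in the thread: 1000 (failing) and 1 (working).
        for (int caching : new int[] {1000, 1}) {
            long millis = estimateNextCallMillis(caching, rowsScannedPerMatch,
                                                 microsPerRowScanned);
            System.out.println("caching=" + caching + " -> ~" + millis
                    + " ms per next(), "
                    + (millis > RPC_TIMEOUT_MILLIS ? "exceeds" : "within")
                    + " the RPC timeout");
        }
        System.out.println("max caching within timeout: "
                + maxCachingWithinTimeout(RPC_TIMEOUT_MILLIS, rowsScannedPerMatch,
                                          microsPerRowScanned));
    }
}
```

With these invented numbers, caching=1000 works out to roughly 100 seconds per next() call, which is over the 60 second timeout, while caching=1 works out to roughly 100 ms. Real numbers depend entirely on the table, the filters, and the hardware, so treat this only as a way to reason about which direction to move the caching value or the timeouts.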