Problem solved. As I suspected, the server took longer than hbase.rpc.timeout to run the call, so the client closed the connection.
Best Regards,
Lucian

On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache <[email protected]> wrote:

> Yes, I will try to see the SocketTimeoutException after setting the log level to debug because, as noted in https://issues.apache.org/jira/browse/HBASE-3154 , it is logged at debug level on the client side.
>
> Regards,
> Lucian
>
> On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> So you should see the SocketTimeoutException in your *client* logs (in your case, mappers), not LeaseException. At this point yes, you're going to time out, but if you spend so much time cycling on the server side then you shouldn't set a high caching configuration on your scanner, as IO isn't your bottleneck.
>>
>> J-D
>>
>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache <[email protected]> wrote:
>> > Hi,
>> >
>> > The servers have been restarted (I have had this configuration for more than a month, so this is not the problem).
>> > About the stack traces, they show exactly the same thing: a lot of ClosedChannelExceptions and LeaseExceptions.
>> >
>> > But I found something that could be the problem: hbase.rpc.timeout. This defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it could happen as follows:
>> > - the mapper makes a scanner.next call to the region server
>> > - the region server needs more than 60 seconds to execute it (I use multiple filters, and it could take a lot of time)
>> > - the scan client gets the timeout and cuts the connection
>> > - the region server tries to send the results to the client ==> ClosedChannelException
>> >
>> > I will take a deeper look into it tomorrow. If you have other suggestions, please let me know!
>> >
>> > Thanks,
>> > Lucian
>> >
>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> >
>> >> Did you restart the region servers after changing the config?
>> >>
>> >> Are you sure it's the same exception/stack trace?
>> >>
>> >> J-D
>> >>
>> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache <[email protected]> wrote:
>> >> > Hi all,
>> >> >
>> >> > I have exactly the same problem that Eran had.
>> >> > But there is something I don't understand: in my case, I have set the lease time to 240000 (4 minutes), but most of the map tasks that are failing run for about 2 minutes. How is it possible to get a LeaseException if the task runs for less than the configured lease time?
>> >> >
>> >> > Regards,
>> >> > Lucian Iordache
>> >> >
>> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
>> >> >
>> >> >> Perfect! Thanks.
>> >> >>
>> >> >> -eran
>> >> >>
>> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <[email protected]> wrote:
>> >> >>
>> >> >> > hbase.regionserver.lease.period
>> >> >> >
>> >> >> > Set it bigger than 60000.
>> >> >> >
>> >> >> > J-D
>> >> >> >
>> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
>> >> >> > >
>> >> >> > > Thanks J-D!
>> >> >> > > Since my main table is expected to continue growing, I guess at some point even setting the cache size to 1 will not be enough. Is there a way to configure the lease timeout?
>> >> >> > >
>> >> >> > > -eran
>> >> >> > >
>> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <[email protected]> wrote:
>> >> >> > >
>> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
>> >> >> > > >
>> >> >> > > > > Hi J-D,
>> >> >> > > > > Thanks for the detailed explanation.
>> >> >> > > > > So if I understand correctly, the lease we're talking about is a scanner lease and the timeout is between two scanner calls, correct? I think that makes sense, because I now realize that the jobs that fail (some jobs continued to fail even after reducing the number of map tasks as Stack suggested) use filters to fetch relatively few rows out of a very large table, so they could be spending a lot of time on the region server scanning rows until it reached my setCaching value, which was 1000. Setting the caching value to 1 seems to allow these jobs to complete.
>> >> >> > > > > I think it has to be the above, since my rows are small, with just a few columns, and processing them is very quick.
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > > Excellent!
>> >> >> > > >
>> >> >> > > > >
>> >> >> > > > > However, there are still a couple of things I don't understand:
>> >> >> > > > > 1. What is the difference between setCaching and setBatch?
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > > * setBatch: Set the maximum number of values to return for each call to next()
>> >> >> > > >
>> >> >> > > > VS
>> >> >> > > >
>> >> >> > > > * setCaching: Set the number of rows for caching that will be passed to scanners.
>> >> >> > > >
>> >> >> > > > The former is useful if you have rows with millions of columns and you could setBatch to get only 1000 of them at a time. You could call that intra-row scanning.
>> >> >> > > >
>> >> >> > > > > 2. Examining the region server logs more closely than I did yesterday, I see a lot of ClosedChannelExceptions in addition to the expired leases (but no UnknownScannerException). Is that expected? You can see an excerpt of the log from one of the region servers here: http://pastebin.com/NLcZTzsY
>> >> >> > > >
>> >> >> > > > It means that when the server got to process that client request and started reading from the socket, the client was already gone. Killing a client does that (or killing an MR job that scans), and so does a SocketTimeoutException. This should probably go in the book. We should also print something nicer :)
>> >> >> > > >
>> >> >> > > > J-D
>
> --
> All the best,
> Lucian

--
All the best,
Lucian
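
The failure sequence described in the thread (a filtered scanner.next call that runs longer than hbase.rpc.timeout) can be put into rough numbers. The Java sketch below is only a back-of-the-envelope model, not HBase code: the class ScanTimeoutSketch, its methods, and the filter selectivity and per-row cost figures are all hypothetical. Only the 60-second hbase.rpc.timeout default and the caching values 1000 and 1 come from the thread.

```java
// Rough model of a single scanner.next() RPC. Not HBase code; every
// name and number here is an illustrative assumption unless noted.
final class ScanTimeoutSketch {

    // Estimated server-side time for one next() call, in milliseconds.
    //   caching             rows the server collects before replying (Scan.setCaching)
    //   rowsScannedPerMatch rows the filters examine per row that passes (assumed)
    //   microsPerRowScanned cost of examining one row, in microseconds (assumed)
    static long nextCallMillis(long caching, long rowsScannedPerMatch, long microsPerRowScanned) {
        return caching * rowsScannedPerMatch * microsPerRowScanned / 1000;
    }

    // The client drops the connection once the call outlasts its RPC timeout.
    static boolean clientGivesUp(long callMillis, long rpcTimeoutMillis) {
        return callMillis > rpcTimeoutMillis;
    }

    public static void main(String[] args) {
        long rpcTimeoutMillis = 60_000;   // hbase.rpc.timeout default cited in the thread
        long rowsScannedPerMatch = 5_000; // hypothetical filter selectivity
        long microsPerRowScanned = 20;    // hypothetical per-row scan cost

        // Compare the two caching values discussed in the thread.
        for (long caching : new long[] {1000, 1}) {
            long ms = nextCallMillis(caching, rowsScannedPerMatch, microsPerRowScanned);
            System.out.println("caching=" + caching + ": " + ms + " ms per next() call, "
                    + "client times out: " + clientGivesUp(ms, rpcTimeoutMillis));
        }
    }
}
```

With these assumed figures, caching=1000 works out to about 100 seconds per call, which is over the 60-second timeout, while caching=1 takes about 100 ms. That matches the two remedies suggested in the thread: lower the caching value so each call returns sooner, or raise the timeouts so a slow call is allowed to finish.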
