Hi,

The servers have been restarted (I have had this configuration for more than a month, so that is not the problem). As for the stack traces, they show exactly the same thing: a lot of ClosedChannelExceptions and LeaseExceptions.

But I found something that could be the problem: hbase.rpc.timeout. It defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it could happen as follows:
- the mapper makes a scanner.next call to the region server
- the region server needs more than 60 seconds to execute it (I use multiple filters, so it can take a long time)
- the scan client hits the timeout and closes the connection
- the region server tries to send the results to the client ==> ClosedChannelException

I will take a deeper look into it tomorrow. If you have other suggestions, please let me know!

Thanks,
Lucian

On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <[email protected]> wrote:

> Did you restart the region servers after changing the config?
>
> Are you sure it's the same exception/stack trace?
>
> J-D
>
> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache <[email protected]> wrote:
> > Hi all,
> >
> > I have exactly the same problem that Eran had.
> > But there is something I don't understand: in my case, I have set the lease
> > time to 240000 (4 minutes). But most of the map tasks that are failing run
> > about 2 minutes. How is it possible to get a LeaseException if the task runs
> > less than the configured time for a lease?
> >
> > Regards,
> > Lucian Iordache
> >
> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
> >
> >> Perfect! Thanks.
> >>
> >> -eran
> >>
> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <[email protected]> wrote:
> >>
> >> > hbase.regionserver.lease.period
> >> >
> >> > Set it bigger than 60000.
> >> >
> >> > J-D
> >> >
> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
> >> > >
> >> > > Thanks J-D!
> >> > > Since my main table is expected to continue growing I guess at some point
> >> > > even setting the cache size to 1 will not be enough. Is there a way to
> >> > > configure the lease timeout?
> >> > >
> >> > > -eran
> >> > >
> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <[email protected]> wrote:
> >> > >
> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
> >> > > >
> >> > > > > Hi J-D,
> >> > > > > Thanks for the detailed explanation.
> >> > > > > So if I understand correctly the lease we're talking about is a scanner
> >> > > > > lease and the timeout is between two scanner calls, correct? I think that
> >> > > > > makes sense because I now realize that jobs that fail (some jobs continued
> >> > > > > to fail even after reducing the number of map tasks as Stack suggested) use
> >> > > > > filters to fetch relatively few rows out of a very large table, so they
> >> > > > > could be spending a lot of time on the region server scanning rows until
> >> > > > > it reached my setCaching value, which was 1000. Setting the caching value
> >> > > > > to 1 seems to allow these jobs to complete.
> >> > > > > I think it has to be the above, since my rows are small, with just a few
> >> > > > > columns, and processing them is very quick.
> >> > > > >
> >> > > >
> >> > > > Excellent!
> >> > > >
> >> > > > >
> >> > > > > However, there are still a couple of things I don't understand:
> >> > > > > 1. What is the difference between setCaching and setBatch?
> >> > > > >
> >> > > >
> >> > > > * Set the maximum number of values to return for each call to next()
> >> > > >
> >> > > > VS
> >> > > >
> >> > > > * Set the number of rows for caching that will be passed to scanners.
> >> > > >
> >> > > > The former is useful if you have rows with millions of columns and you
> >> > > > could setBatch to get only 1000 of them at a time. You could call that
> >> > > > intra-row scanning.
> >> > > >
> >> > > > > 2. Examining the region server logs more closely than I did yesterday I
> >> > > > > see a lot of ClosedChannelExceptions in addition to the expired leases
> >> > > > > (but no UnknownScannerException), is that expected? You can see an excerpt
> >> > > > > of the log from one of the region servers here:
> >> > > > > http://pastebin.com/NLcZTzsY
> >> > > >
> >> > > > It means that when the server got to process that client request and
> >> > > > started reading from the socket, the client was already gone. Killing a
> >> > > > client does that (or killing a MR that scans), so does
> >> > > > SocketTimeoutException. This should probably go in the book. We should
> >> > > > also print something nicer :)
> >> > > >
> >> > > > J-D
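
For reference, the two timeouts discussed in this thread could be raised together in hbase-site.xml roughly as sketched below. The property names and the 240000 ms lease value come from the messages above. Setting hbase.rpc.timeout to the same value is an assumption: it follows from the still-unverified theory that the 60-second RPC timeout is what closes the connection while the region server is busy with a long scanner.next call. As J-D's question implies, the region servers have to be restarted after changing the config, and the configuration seen by the map tasks presumably needs the same hbase.rpc.timeout value (also an assumption).

```xml
<!-- hbase-site.xml fragment (sketch only; values are illustrative) -->
<configuration>
  <!-- Scanner lease: 4 minutes, the value Lucian reports using. -->
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>240000</value>
  </property>
  <!-- RPC timeout: 60000 ms by default according to the thread.
       Matching it to the lease period is an assumption, not a confirmed fix. -->
  <property>
    <name>hbase.rpc.timeout</name>
    <value>240000</value>
  </property>
</configuration>
```

If the theory holds, a scanner.next call that takes longer than 60 seconds but less than 4 minutes would no longer be cut off by the client. Lowering the setCaching value, as Eran did, attacks the same problem from the other side by shortening each call.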
