Hello,

I would suggest logging the exception caused by hbase.rpc.timeout on the client side at WARN level, not at DEBUG as it is right now.
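
For anyone who hits the same problem in the meantime, here is a minimal sketch of the two settings discussed in this thread, as they might appear in hbase-site.xml. The values are illustrative assumptions only: 240000 is the lease period from my earlier mail, and setting hbase.rpc.timeout to the same value is just a guess at a reasonable starting point, not a recommendation. Both values are in milliseconds. As far as I understand, the lease period is read by the region servers (so they need a restart), while the RPC timeout has to be visible to the scanning client, in my case the mappers.

```xml
<!-- Sketch only: values are assumptions for illustration, tune for your own cluster. -->
<configuration>
  <!-- Scanner lease on the region server side; the thread suggests a value bigger than 60000. -->
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>240000</value>
  </property>
  <!-- Client RPC timeout; defaults to 60 seconds. The value here is an assumed
       example, chosen so that a slow scanner.next call is not cut off first. -->
  <property>
    <name>hbase.rpc.timeout</name>
    <value>240000</value>
  </property>
</configuration>
```

If next() calls still take this long, lowering the scanner caching, as suggested further down in the thread, is probably the better fix.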

Regards,
Lucian

On Wed, Oct 26, 2011 at 6:51 PM, Stack <[email protected]> wrote:

> What would you suggest we do to improve the messages we emit around
> here making it more clear whats going on?
>
> St.Ack
>
> On Tue, Oct 25, 2011 at 1:15 AM, Lucian Iordache
> <[email protected]> wrote:
> > Yes, I will try to see the SocketTimeoutException after putting log on
> > debug, because, like it says here
> > https://issues.apache.org/jira/browse/HBASE-3154 , this is logged on debug
> > on the client side.
> >
> > Regards,
> > Lucian
> >
> > On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> >> So you should see the SocketTimeoutException in your *client* logs (in
> >> your case, mappers), not LeaseException. At this point yes you're
> >> going to timeout, but if you spend so much time cycling on the server
> >> side then you shouldn't set a high caching configuration on your
> >> scanner as IO isn't your bottle neck.
> >>
> >> J-D
> >>
> >> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> >> <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > The servers have been restarted (I have this configuration for more than a
> >> > month, so this is not the problem).
> >> > About the stack traces, they show exactly the same, a lot of
> >> > ClosedChannelConnections and LeaseExceptions.
> >> >
> >> > But I found something that could be the problem: hbase.rpc.timeout . This
> >> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> >> > could happen the next way:
> >> > - the mapper makes a scanner.next call to the region server
> >> > - the region servers needs more than 60 seconds to execute it (I use
> >> >   multiple filters, and it could take a lot of time)
> >> > - the scan client gets the timeout and cuts the connection
> >> > - the region server tries to send the results to the client ==>
> >> >   ClosedChannelConnection
> >> >
> >> > I will get a deeper look into it tomorrow. If you have other suggestions,
> >> > please let me know!
> >> >
> >> > Thanks,
> >> > Lucian
> >> >
> >> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >
> >> >> Did you restart the region servers after changing the config?
> >> >>
> >> >> Are you sure it's the same exception/stack trace?
> >> >>
> >> >> J-D
> >> >>
> >> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> >> >> <[email protected]> wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > I have exactly the same problem that Eran had.
> >> >> > But there is something I don't understand: in my case, I have set the lease
> >> >> > time to 240000 (4 minutes). But most of the map tasks that are failing run
> >> >> > about 2 minutes. How is it possible to get a LeaseException if the task runs
> >> >> > less than the configured time for a lease?
> >> >> >
> >> >> > Regards,
> >> >> > Lucian Iordache
> >> >> >
> >> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
> >> >> >
> >> >> >> Perfect! Thanks.
> >> >> >>
> >> >> >> -eran
> >> >> >>
> >> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <[email protected]> wrote:
> >> >> >>
> >> >> >> > hbase.regionserver.lease.period
> >> >> >> >
> >> >> >> > Set it bigger than 60000.
> >> >> >> >
> >> >> >> > J-D
> >> >> >> >
> >> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
> >> >> >> > >
> >> >> >> > > Thanks J-D!
> >> >> >> > > Since my main table is expected to continue growing I guess at some point
> >> >> >> > > even setting the cache size to 1 will not be enough. Is there a way to
> >> >> >> > > configure the lease timeout?
> >> >> >> > >
> >> >> >> > > -eran
> >> >> >> > >
> >> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <[email protected]> wrote:
> >> >> >> > >
> >> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
> >> >> >> > > >
> >> >> >> > > > > Hi J-D,
> >> >> >> > > > > Thanks for the detailed explanation.
> >> >> >> > > > > So if I understand correctly the lease we're talking about is a scanner
> >> >> >> > > > > lease and the timeout is between two scanner calls, correct? I think that
> >> >> >> > > > > make sense because I now realize that jobs that fail (some jobs continued
> >> >> >> > > > > to fail even after reducing the number of map tasks as Stack suggested) use
> >> >> >> > > > > filters to fetch relatively few rows out of a very large table, so they
> >> >> >> > > > > could be spending a lot of time on the region server scanning rows until it
> >> >> >> > > > > reached my setCaching value which was 1000. Setting the caching value to 1
> >> >> >> > > > > seem to allow these job to complete.
> >> >> >> > > > > I think it has to be the above, since my rows are small, with just a few
> >> >> >> > > > > columns and processing them is very quick.
> >> >> >> > > > >
> >> >> >> > > >
> >> >> >> > > > Excellent!
> >> >> >> > > >
> >> >> >> > > > >
> >> >> >> > > > > However, there are still a couple of thing I don't understand:
> >> >> >> > > > > 1. What is the difference between setCaching and setBatch?
> >> >> >> > > > >
> >> >> >> > > >
> >> >> >> > > > * Set the maximum number of values to return for each call to next()
> >> >> >> > > >
> >> >> >> > > > VS
> >> >> >> > > >
> >> >> >> > > > * Set the number of rows for caching that will be passed to scanners.
> >> >> >> > > >
> >> >> >> > > > The former is useful if you have rows with millions of columns and you
> >> >> >> > > > could setBatch to get only 1000 of them at a time. You could call that
> >> >> >> > > > intra-row scanning.
> >> >> >> > > >
> >> >> >> > > > > 2. Examining the region server logs more closely than I did yesterday I
> >> >> >> > > > > see a log of ClosedChannelExceptions in addition to the expired leases (but
> >> >> >> > > > > no UnknownScannerException), is that expected? You can see an excerpt of
> >> >> >> > > > > the log from one of the region servers here: http://pastebin.com/NLcZTzsY
> >> >> >> > > >
> >> >> >> > > > It means that when the server got to process that client request and
> >> >> >> > > > started reading from the socket, the client was already gone. Killing a
> >> >> >> > > > client does that (or killing a MR that scans), so does SocketTimeoutException.
> >> >> >> > > > This should probably go in the book. We should also print something nicer :)
> >> >> >> > > >
> >> >> >> > > > J-D
> >> >> >> > > >
> >
> > --
> > Numai bine,
> > Lucian
> >
>
