So you should see the SocketTimeoutException in your *client* logs (in
your case, the mappers), not a LeaseException. At that point, yes, you're
going to time out, but if you spend that much time cycling on the server
side, you shouldn't set a high caching value on your scanner, since IO
isn't your bottleneck.
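To make the caching advice concrete, here is a rough back-of-the-envelope model in Python. It is a toy under stated assumptions, not HBase's actual cost accounting: it assumes that to return `caching` filtered rows, the region server examines about `caching * rows_per_match` rows at a fixed cost per row. The names `next_call_ms`, `largest_safe_caching`, `rows_per_match` and `us_per_row` are hypothetical, and the workload figures are invented for illustration.

```python
def next_call_ms(caching: int, rows_per_match: int, us_per_row: int) -> float:
    """Estimated server-side time (ms) for one scanner next() call.

    Toy model (assumption): to return `caching` rows that pass the
    filters, the region server examines caching * rows_per_match rows,
    each costing us_per_row microseconds.
    """
    rows_examined = caching * rows_per_match
    return rows_examined * us_per_row / 1000.0


def largest_safe_caching(timeout_ms: int, rows_per_match: int,
                         us_per_row: int, safety_pct: int = 50) -> int:
    """Largest caching value whose estimated next() time stays within
    safety_pct percent of timeout_ms (never less than 1)."""
    budget_us = timeout_ms * 1000 * safety_pct // 100
    us_per_result = rows_per_match * us_per_row
    return max(1, budget_us // us_per_result)


if __name__ == "__main__":
    # Hypothetical workload: 1 matching row per 10,000 scanned, 10 us per row.
    print(next_call_ms(1000, 10_000, 10))            # 100000.0 ms per call
    print(next_call_ms(1, 10_000, 10))               # 100.0 ms per call
    print(largest_safe_caching(60_000, 10_000, 10))  # 300
```

With those made-up figures, caching=1000 means roughly 100 seconds of server work per next() call, well past a 60-second timeout, while caching=1 is about 0.1 seconds. The same model shows why caching=1 is not a permanent fix: if matches become rare enough, even a single result per call can exceed the budget.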

J-D

On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
<[email protected]> wrote:
> Hi,
>
> The servers have been restarted (I have this configuration for more than a
> month, so this is not the problem).
> As for the stack traces, they show exactly the same thing: a lot of
> ClosedChannelExceptions and LeaseExceptions.
>
> But I found something that could be the problem: hbase.rpc.timeout. This
> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> could happen like this:
> - the mapper makes a scanner.next call to the region server
> - the region server needs more than 60 seconds to execute it (I use
> multiple filters, and that can take a lot of time)
> - the scan client hits the timeout and drops the connection
> - the region server tries to send the results to the client ==>
> ClosedChannelException
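The sequence above can be written down as a toy timeline, which also shows why a 240000 ms lease would not come into play if this hypothesis is right: the client gives up at the RPC timeout whatever the lease period is. This is an illustrative Python model only; `next_call_timeline` is a hypothetical name, and it deliberately ignores scanner leases and client retries.

```python
from typing import List, Tuple


def next_call_timeline(server_ms: int,
                       rpc_timeout_ms: int = 60_000) -> List[Tuple[int, str]]:
    """Toy timeline of (ms, event) pairs for one scanner.next() RPC.

    Models only the client-side RPC timeout (hbase.rpc.timeout, 60 s by
    default per the message above); leases and retries are ignored.
    """
    events = [(0, "client sends next()")]
    if server_ms <= rpc_timeout_ms:
        events.append((server_ms, "server replies, client gets results"))
    else:
        # The client stops waiting first...
        events.append((rpc_timeout_ms, "client times out and drops the connection"))
        # ...so the server's late reply has nowhere to go.
        events.append((server_ms, "server tries to reply: ClosedChannelException"))
    return events


if __name__ == "__main__":
    # A filtered next() that needs 90 s on the server (hypothetical figure).
    for t, event in next_call_timeline(90_000):
        print(f"{t:>6} ms  {event}")
```

If that is what is happening, the settings to look at are hbase.rpc.timeout and the amount of work done per call (the caching value), not only the lease period.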
>
> I will take a deeper look into it tomorrow. If you have other suggestions,
> please let me know!
>
> Thanks,
> Lucian
>
> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
> <[email protected]> wrote:
>
>> Did you restart the region servers after changing the config?
>>
>> Are you sure it's the same exception/stack trace?
>>
>> J-D
>>
>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> <[email protected]> wrote:
>> > Hi all,
>> >
>> > I have exactly the same problem that Eran had.
>> > But there is something I don't understand: in my case, I have set the lease
>> > time to 240000 ms (4 minutes), yet most of the map tasks that are failing
>> > run for only about 2 minutes. How is it possible to get a LeaseException if
>> > the task runs for less than the configured lease time?
>> >
>> > Regards,
>> > Lucian Iordache
>> >
>> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
>> >
>> >> Perfect! Thanks.
>> >>
>> >> -eran
>> >>
>> >>
>> >>
>> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <[email protected]> wrote:
>> >>
>> >> > hbase.regionserver.lease.period
>> >> >
>> >> > Set it bigger than 60000.
>> >> >
>> >> > J-D
>> >> >
>> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
>> >> > >
>> >> > > Thanks J-D!
>> >> > > Since my main table is expected to keep growing, I guess at some point
>> >> > > even setting the cache size to 1 will not be enough. Is there a way to
>> >> > > configure the lease timeout?
>> >> > >
>> >> > > -eran
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <[email protected]> wrote:
>> >> > >
>> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
>> >> > > >
>> >> > > > > Hi J-D,
>> >> > > > > Thanks for the detailed explanation.
>> >> > > > > So if I understand correctly, the lease we're talking about is a
>> >> > > > > scanner lease and the timeout is between two scanner calls, correct?
>> >> > > > > I think that makes sense, because I now realize that the jobs that fail
>> >> > > > > (some jobs continued to fail even after reducing the number of map
>> >> > > > > tasks as Stack suggested) use filters to fetch relatively few rows out
>> >> > > > > of a very large table, so they could be spending a lot of time on the
>> >> > > > > region server scanning rows until it reaches my setCaching value,
>> >> > > > > which was 1000. Setting the caching value to 1 seems to allow these
>> >> > > > > jobs to complete.
>> >> > > > > I think it has to be the above, since my rows are small, with just a
>> >> > > > > few columns, and processing them is very quick.
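Eran's reading of the lease (the timeout applies to the gap between two scanner calls, not to the whole task) can be checked with a small simulation. This is a sketch under an assumption that matches his description rather than a trace of the region server code: the lease is renewed when a next() call arrives and expires if the following call does not arrive within the lease period, where the gap is the server time to fill the batch plus the client time to process it. The function name and all figures are hypothetical.

```python
from typing import List, Optional


def first_lease_expiry(call_server_ms: List[int], client_ms_per_batch: int,
                       lease_ms: int = 60_000) -> Optional[int]:
    """Index of the first next() call after which the scanner lease would
    expire, or None if every gap fits inside the lease.

    Assumed model: the lease is renewed when each call arrives; the gap to
    the next call is the server time for this call plus the client's time
    to process the returned batch.
    """
    for i, server_ms in enumerate(call_server_ms):
        gap_ms = server_ms + client_ms_per_batch
        if gap_ms > lease_ms:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical: caching=1000 with sparse filter matches, ~100 s per call.
    print(first_lease_expiry([100_000] * 5, client_ms_per_batch=50))  # 0
    # Hypothetical: caching=1, ~100 ms per call, 5000 calls in total.
    print(first_lease_expiry([100] * 5000, client_ms_per_batch=5))    # None
```

In the second case the scan runs for about 525 seconds in total (5000 calls of roughly 105 ms each), far longer than the lease, yet never expires it: under this model total task time plays no part, only the gap between consecutive calls does.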
>> >> > > > >
>> >> > > >
>> >> > > > Excellent!
>> >> > > >
>> >> > > >
>> >> > > > >
>> >> > > > > However, there are still a couple of things I don't understand:
>> >> > > > > 1. What is the difference between setCaching and setBatch?
>> >> > > > >
>> >> > > >
>> >> > > > * setBatch: Set the maximum number of values to return for each call to next()
>> >> > > >
>> >> > > > VS
>> >> > > >
>> >> > > > * setCaching: Set the number of rows for caching that will be passed to scanners.
>> >> > > >
>> >> > > > The former (setBatch) is useful if you have rows with millions of columns:
>> >> > > > you could use setBatch to get only 1000 of them at a time. You could call
>> >> > > > that intra-row scanning.
>> >> > > >
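The difference between the two settings shows up in how many next() round trips a scan needs. The sketch below is a Python back-of-the-envelope model, assuming that caching counts Result objects per RPC and that, with batching on, a wide row is split into partial Results of at most `batch` values each; `ceil_div`, `results_per_row` and `next_rpcs` are hypothetical helpers, not HBase API.

```python
from typing import Optional


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding on large counts."""
    return -(-a // b)


def results_per_row(columns_per_row: int, batch: Optional[int]) -> int:
    """Partial Results one row is split into (1 when batching is off)."""
    if batch is None:
        return 1
    return ceil_div(columns_per_row, batch)


def next_rpcs(rows: int, columns_per_row: int, caching: int,
              batch: Optional[int] = None) -> int:
    """Approximate number of next() RPCs for a full scan (toy model)."""
    total_results = rows * results_per_row(columns_per_row, batch)
    return ceil_div(total_results, caching)


if __name__ == "__main__":
    # 10 rows with a million columns each (hypothetical).
    print(next_rpcs(10, 1_000_000, caching=1))              # 10, one whole row per RPC
    print(next_rpcs(10, 1_000_000, caching=1, batch=1000))  # 10000 smaller RPCs
```

Under this model setBatch bounds the values per Result (intra-row), while setCaching bounds the Results per round trip, so one response carries at most about caching * batch values.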
>> >> > > >
>> >> > > > > 2. Examining the region server logs more closely than I did yesterday, I
>> >> > > > > see a lot of ClosedChannelExceptions in addition to the expired leases
>> >> > > > > (but no UnknownScannerException). Is that expected? You can see an
>> >> > > > > excerpt of the log from one of the region servers here:
>> >> > > > > http://pastebin.com/NLcZTzsY
>> >> > > >
>> >> > > >
>> >> > > > It means that when the server got around to processing that client
>> >> > > > request and started reading from the socket, the client was already
>> >> > > > gone. Killing a client does that (or killing an MR job that scans), and
>> >> > > > so does a SocketTimeoutException. This should probably go in the book.
>> >> > > > We should also print something nicer :)
>> >> > > >
>> >> > > > J-D
>> >> > > >
>> >> >
>> >>
>> >
>>
>
