Problem solved. It was as I said: the region server took longer than
hbase.rpc.timeout to execute the call, and the client closed the connection.

Best Regards,
Lucian

On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache <[email protected]> wrote:

> Yes, I will try to find the SocketTimeoutException after setting the log
> level to DEBUG because, as described in
> https://issues.apache.org/jira/browse/HBASE-3154 , it is logged at DEBUG
> level on the client side.
>
> Regards,
> Lucian
>
>
> On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> So you should see the SocketTimeoutException in your *client* logs (in
>> your case, the mappers), not a LeaseException. At this point, yes, you're
>> going to time out, but if you spend so much time cycling on the server
>> side, you shouldn't set a high caching value on your scanner, since IO
>> isn't your bottleneck.
>>
>> J-D
>>
>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > The servers have been restarted (I have had this configuration for more
>> > than a month, so this is not the problem).
>> > As for the stack traces, they show exactly the same thing: a lot of
>> > ClosedChannelExceptions and LeaseExceptions.
>> >
>> > But I found something that could be the problem: hbase.rpc.timeout. It
>> > defaults to 60 seconds, and I did not change it in hbase-site.xml. So it
>> > could happen like this:
>> > - the mapper makes a scanner.next call to the region server
>> > - the region server needs more than 60 seconds to execute it (I use
>> >   multiple filters, and it could take a lot of time)
>> > - the scan client hits the timeout and closes the connection
>> > - the region server tries to send the results to the client ==>
>> >   ClosedChannelException
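The four-step sequence above can be sketched as a toy Java model. This is not HBase code: simulateNext, Outcome and the scaled-down millisecond values are hypothetical, a sleeping task stands in for the filtered scan, and an AtomicBoolean stands in for the socket. It only illustrates the ordering Lucian describes: the client stops waiting once the RPC timeout elapses, while the server keeps working and later finds nobody to send its results to.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the sequence above (hypothetical names, not HBase code).
// Times are in milliseconds and scaled down; the thread says hbase.rpc.timeout defaults to 60 s.
class RpcTimeoutDemo {
    static final class Outcome {
        final boolean clientTimedOut;   // client gave up waiting for the call
        final boolean serverSendFailed; // server finished, but the connection was gone
        Outcome(boolean clientTimedOut, boolean serverSendFailed) {
            this.clientTimedOut = clientTimedOut;
            this.serverSendFailed = serverSendFailed;
        }
        @Override public String toString() {
            return "clientTimedOut=" + clientTimedOut + ", serverSendFailed=" + serverSendFailed;
        }
    }

    static Outcome simulateNext(long serverWorkMillis, long rpcTimeoutMillis) {
        AtomicBoolean connectionOpen = new AtomicBoolean(true);
        AtomicBoolean sendFailed = new AtomicBoolean(false);
        ExecutorService server = Executors.newSingleThreadExecutor();
        try {
            // "Region server": executes the filtered scan, then tries to reply.
            Future<?> call = server.submit(() -> {
                try {
                    Thread.sleep(serverWorkMillis); // stands in for slow filtering
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (!connectionOpen.get()) {
                    sendFailed.set(true); // where Lucian expects the ClosedChannelException
                }
            });
            boolean timedOut = false;
            try {
                // "Client": waits at most rpcTimeoutMillis for the reply.
                call.get(rpcTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timedOut = true;
                connectionOpen.set(false); // client gives up and closes the connection
                call.get();                // demo only: let the server side finish
            }
            return new Outcome(timedOut, sendFailed.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            server.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulateNext(300, 50)); // server slower than the timeout
        System.out.println(simulateNext(10, 500)); // server answers in time
    }
}
```

When the server is slower than the timeout both flags come back true; when it answers in time both are false.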
>> >
>> > I will take a deeper look into it tomorrow. If you have other
>> > suggestions, please let me know!
>> >
>> > Thanks,
>> > Lucian
>> >
>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> >
>> >> Did you restart the region servers after changing the config?
>> >>
>> >> Are you sure it's the same exception/stack trace?
>> >>
>> >> J-D
>> >>
>> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> >> <[email protected]> wrote:
>> >> > Hi all,
>> >> >
>> >> > I have exactly the same problem that Eran had.
>> >> > But there is something I don't understand: in my case, I have set the
>> >> > lease time to 240000 ms (4 minutes), yet most of the failing map tasks
>> >> > run for only about 2 minutes. How is it possible to get a LeaseException
>> >> > if the task runs for less than the configured lease time?
>> >> >
>> >> > Regards,
>> >> > Lucian Iordache
>> >> >
>> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
>> >> >
>> >> >> Perfect! Thanks.
>> >> >>
>> >> >> -eran
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <[email protected]> wrote:
>> >> >>
>> >> >> > hbase.regionserver.lease.period
>> >> >> >
>> >> >> > Set it higher than 60000 (milliseconds).
>> >> >> >
>> >> >> > J-D
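A simplified model of what hbase.regionserver.lease.period controls, assuming (as Eran describes it in this thread) that the scanner lease clock restarts each time the client calls next(), so the lease only expires when the gap between two consecutive calls exceeds the period. firstExpiredCall is a hypothetical helper, not an HBase API, and the call times are made-up values mimicking next() calls that each spend about 90 s on the region server.

```java
// Simplified scanner-lease model (hypothetical, not HBase code). Assumption:
// the lease clock restarts every time the client issues next(), so the lease
// only expires if the gap between two consecutive calls exceeds the period.
class ScannerLeaseModel {
    /**
     * @param callTimesMillis   times at which the client calls next(), ascending;
     *                          element 0 is when the scanner was opened
     * @param leasePeriodMillis e.g. the value of hbase.regionserver.lease.period
     * @return index of the first call that arrives after the lease expired, or -1
     */
    static int firstExpiredCall(long[] callTimesMillis, long leasePeriodMillis) {
        for (int i = 1; i < callTimesMillis.length; i++) {
            long gap = callTimesMillis[i] - callTimesMillis[i - 1];
            if (gap > leasePeriodMillis) {
                return i; // the server has already expired this scanner's lease
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Calls roughly 90 s apart, mostly spent scanning on the region server.
        long[] calls = {0, 90_000, 180_000, 270_000};
        System.out.println(firstExpiredCall(calls, 60_000));  // 1
        System.out.println(firstExpiredCall(calls, 240_000)); // -1
    }
}
```

With a 60000 ms period the very first follow-up call is too late; at 240000 ms every call fits. Raising the period only buys headroom: if the time per call keeps growing, it will eventually exceed any fixed value.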
>> >> >> >
>> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
>> >> >> > >
>> >> >> > > Thanks J-D!
>> >> >> > > Since my main table is expected to keep growing, I guess at some
>> >> >> > > point even setting the caching to 1 will not be enough. Is there a
>> >> >> > > way to configure the lease timeout?
>> >> >> > >
>> >> >> > > -eran
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <[email protected]> wrote:
>> >> >> > >
>> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
>> >> >> > > >
>> >> >> > > > > Hi J-D,
>> >> >> > > > > Thanks for the detailed explanation.
>> >> >> > > > > So if I understand correctly, the lease we're talking about is a
>> >> >> > > > > scanner lease and the timeout is between two scanner calls, correct?
>> >> >> > > > > I think that makes sense, because I now realize that the jobs that
>> >> >> > > > > fail (some jobs continued to fail even after reducing the number of
>> >> >> > > > > map tasks as Stack suggested) use filters to fetch relatively few
>> >> >> > > > > rows out of a very large table, so they could be spending a lot of
>> >> >> > > > > time on the region server scanning rows until they reach my
>> >> >> > > > > setCaching value, which was 1000. Setting the caching value to 1
>> >> >> > > > > seems to allow these jobs to complete.
>> >> >> > > > > I think it has to be the above, since my rows are small, with just
>> >> >> > > > > a few columns, and processing them is very quick.
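Eran's reasoning can be put into rough numbers. If only a fraction of rows passes the filters, the region server has to scan about caching / matchFraction rows before a single next() call can return a full cache. The sketch below is a back-of-envelope model, not HBase code: the match fraction, the per-row cost and the 60 s limit are assumed values chosen only for illustration, and it ignores region boundaries and assumes matches are spread evenly.

```java
// Back-of-envelope model (hypothetical, not HBase code): to return `caching`
// matching rows from one next() call, the region server scans about
// caching / matchFraction rows when only matchFraction of rows pass the filters.
class CachingCostModel {
    static double expectedRowsScanned(int caching, double matchFraction) {
        if (caching < 1 || !(matchFraction > 0 && matchFraction <= 1)) {
            throw new IllegalArgumentException("need caching >= 1 and 0 < matchFraction <= 1");
        }
        return caching / matchFraction;
    }

    static double expectedCallMillis(int caching, double matchFraction, double millisPerScannedRow) {
        return expectedRowsScanned(caching, matchFraction) * millisPerScannedRow;
    }

    public static void main(String[] args) {
        double matchFraction = 1e-4; // assumed: 1 row in 10,000 passes the filters
        double millisPerRow = 0.01;  // assumed server-side cost per scanned row
        long limitMillis = 60_000;   // e.g. a 60 s timeout
        for (int caching : new int[] {1000, 1}) {
            double ms = expectedCallMillis(caching, matchFraction, millisPerRow);
            System.out.printf("caching=%d: ~%.0f ms per next(), over limit: %b%n",
                    caching, ms, ms > limitMillis);
        }
    }
}
```

Under these assumed numbers, caching 1000 makes each call take on the order of 100 s, while caching 1 brings it down to about 100 ms. The cost per call still scales with 1 / matchFraction, which is why a growing table could eventually push even caching 1 past the limit.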
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > > Excellent!
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > >
>> >> >> > > > > However, there are still a couple of things I don't understand:
>> >> >> > > > > 1. What is the difference between setCaching and setBatch?
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > > * setBatch: Set the maximum number of values to return for each
>> >> >> > > >   call to next()
>> >> >> > > >
>> >> >> > > > VS
>> >> >> > > >
>> >> >> > > > * setCaching: Set the number of rows for caching that will be
>> >> >> > > >   passed to scanners.
>> >> >> > > >
>> >> >> > > > The former (setBatch) is useful if you have rows with millions of
>> >> >> > > > columns: you could setBatch to get only 1000 of them at a time.
>> >> >> > > > You could call that intra-row scanning.
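A toy Java model (not HBase code) of how the two settings interact. It assumes that setBatch caps the number of values per Result, so a wide row is split into several partial Results, and that setCaching caps how many Results come back per next() round trip. resultSizes and roundTrips are hypothetical helpers, and the extra calls a real scanner makes to open and close are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not HBase code). Assumptions: batch caps the number of
// values per Result, so wide rows are split into partial Results; caching caps the
// number of Results returned by each next() round trip.
class CachingVsBatch {
    /** Value counts of the Results a scan would return, in order; batch <= 0 means "not set". */
    static List<Integer> resultSizes(int[] valuesPerRow, int batch) {
        List<Integer> sizes = new ArrayList<>();
        for (int values : valuesPerRow) {
            if (batch <= 0) {
                sizes.add(values); // whole row in a single Result
                continue;
            }
            for (int remaining = values; remaining > 0; remaining -= batch) {
                sizes.add(Math.min(batch, remaining)); // one partial Result per chunk
            }
        }
        return sizes;
    }

    /** Round trips needed when each one returns at most `caching` Results. */
    static int roundTrips(int resultCount, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (resultCount + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        int[] valuesPerRow = {2500, 3, 3}; // one very wide row, two narrow ones
        List<Integer> sizes = resultSizes(valuesPerRow, 1000);
        System.out.println(sizes);                       // [1000, 1000, 500, 3, 3]
        System.out.println(roundTrips(sizes.size(), 2)); // 3
    }
}
```

Here the 2500-column row becomes three Results of at most 1000 values each, and with caching 2 the five Results take three round trips.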
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > > 2. Examining the region server logs more closely than I did
>> >> >> > > > > yesterday, I see a lot of ClosedChannelExceptions in addition to
>> >> >> > > > > the expired leases (but no UnknownScannerException). Is that
>> >> >> > > > > expected? You can see an excerpt of the log from one of the
>> >> >> > > > > region servers here: http://pastebin.com/NLcZTzsY
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > It means that when the server got around to processing that client
>> >> >> > > > request and started reading from the socket, the client was
>> >> >> > > > already gone. Killing a client does that (as does killing an MR
>> >> >> > > > job that scans), and so does a SocketTimeoutException. This should
>> >> >> > > > probably go in the book. We should also print something nicer :)
>> >> >> > > >
>> >> >> > > > J-D
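A plain java.nio sketch of where such an exception can come from. It is not HBase's IPC code, and it assumes that once the server notices the client is gone it closes its own end of the connection; a response written afterwards on that channel then fails with java.nio.channels.ClosedChannelException. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not HBase's IPC code.
class ClosedChannelDemo {
    /** Returns true if the late response write fails with ClosedChannelException. */
    static boolean lateWriteFailsWithClosedChannel() {
        try (ServerSocketChannel listener = ServerSocketChannel.open()) {
            listener.bind(new InetSocketAddress("127.0.0.1", 0)); // any free local port
            SocketChannel client = SocketChannel.open(listener.getLocalAddress());
            SocketChannel serverSide = listener.accept();

            // The client times out (or is killed) and goes away.
            client.close();

            // The server's read sees end-of-stream (-1); assume it then drops the connection.
            serverSide.read(ByteBuffer.allocate(16));
            serverSide.close();

            // A handler that was still busy now tries to send its results.
            try {
                serverSide.write(ByteBuffer.wrap("results".getBytes(StandardCharsets.UTF_8)));
                return false;
            } catch (ClosedChannelException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("late write hit ClosedChannelException: "
                + lateWriteFailsWithClosedChannel());
    }
}
```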
>> >> >> > > >
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>
>
>
> --
> All the best,
> Lucian
>



