Hi,
Do you mind taking a look at HBASE-6071 ?

It was submitted as a result of this mail (back in May):
http://mail-archives.apache.org/mod_mbox/hbase-user/201205.mbox/%3CCAFebPXBq9V9BVdzRTNr-MB3a1Lz78SZj6gvP6On0b%2Bajt9StAg%40mail.gmail.com%3E

I've recently submitted logs that (I think) confirm this theory.

Thanks,
Igal.

On Thu, Sep 20, 2012 at 4:55 PM, Harsh J <[email protected]> wrote:

> Hi Daniel,
>
> That sounds fine to do (an easier solution; my brain's gotten complex today,
> ha).
>
> We should classify the two types of error in the docs for users the
> way you have here, to indicate what the issue is in each of the error
> cases - UnknownScannerException and LeaseException. Mind filing a
> JIRA? :)
>
> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <[email protected]>
> wrote:
> > Thaaank you! I have been waiting for this email for months. I've read all the
> > posts regarding lease timeouts and see that people usually hit them for two
> > reasons. One is the normal case, where the client app does not process rows
> > fast enough and gets UnknownScannerException; the other is the issue below,
> > where the client gets LeaseException instead.
> >
> > How about using a try/catch for the
> >
> >       // Remove lease while its being processed in server; protects against
> >       // case where processing of request takes > lease expiration time.
> >       lease = this.leases.removeLease(scannerName);
> >
> > and re-throw an IllegalStateException or log a warning message, because a
> > client with an active scanner but no lease does not seem to be in the right
> > state?
> >
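Daniel's suggestion could be sketched roughly as follows. This is not the actual region server code: `LeaseTable`, `LeaseMissingException` and `ScannerNextGuard` are hypothetical stand-ins, assumed here only so that the example is self-contained. Only the `removeLease(scannerName)` call shape is taken from the snippet quoted above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the exception thrown when a lease is missing.
class LeaseMissingException extends Exception {
    LeaseMissingException(String message) { super(message); }
}

// Hypothetical stand-in for the region server's lease bookkeeping.
class LeaseTable {
    private final Map<String, Long> leases = new ConcurrentHashMap<>();

    void addLease(String name, long periodMillis) { leases.put(name, periodMillis); }

    // Fails if the lease is not present, like the removeLease() call in the thread.
    long removeLease(String name) throws LeaseMissingException {
        Long period = leases.remove(name);
        if (period == null) {
            throw new LeaseMissingException("lease '" + name + "' does not exist");
        }
        return period;
    }
}

class ScannerNextGuard {
    // Scanner exists but its lease is gone: report it as an inconsistent state
    // with a message that hints at the likely cause.
    static long takeLease(LeaseTable leases, String scannerName) {
        try {
            return leases.removeLease(scannerName);
        } catch (LeaseMissingException e) {
            throw new IllegalStateException("Scanner " + scannerName
                + " is open but its lease is held elsewhere;"
                + " possibly a retried next() while the first call is still running", e);
        }
    }

    public static void main(String[] args) {
        LeaseTable leases = new LeaseTable();
        leases.addLease("scanner-1", 60000L);
        System.out.println("first call took lease: " + takeLease(leases, "scanner-1"));
        try {
            takeLease(leases, "scanner-1"); // second call while the lease is still out
        } catch (IllegalStateException e) {
            System.out.println("second call: " + e.getMessage());
        }
    }
}
```

Whether to throw or only log a warning is a design choice left open in the thread; the sketch throws so that the caller sees the more descriptive message.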
> > Just an idea but you know better.
> > Daniel
> >
> > On 09/20/2012 03:42 PM, Harsh J wrote:
> >
> > Hi,
> >
> > I hit this today and got down to investigate it and one of my
> > colleagues discovered this thread. Since I got some more clues, I
> > thought I'll bump up this thread for good.
> >
> > Lucian almost got the issue here. The thing we missed thinking about
> > is the client retry. The client of HBaseRPC seems to silently retry on
> > timeouts. So if you apply Lucian's theory below and apply that a
> > client retry calls next(ID, Rows) yet again, you can construct this
> > issue:
> >
> > - Client calls next(ID, Rows) first time.
> > - RS receives the handler-sent request, removes lease (to not expire
> > it during next() call) and begins work.
> > - RS#next hangs during work (for whatever reason we can assume - large
> > values or locks or whatever)
> > - Client times out after a minute, retries (due to default nature).
> > Retry seems to be silent though?
> > - New next(ID, Rows) call is invoked. Scanner still exists so no
> > UnknownScanner is thrown. But when next() tries to remove lease, we
> > get thrown LeaseException (and the client gets this immediately and
> > dies) as the other parallel handler has the lease object already
> > removed and held in its stuck state.
> > - A few secs/mins later, the original next() unfreezes, adds back
> > lease to the queue, tries to write back response, runs into
> > ClosedChannelException as the client had already thrown its original
> > socket away. End of clients.
> > - Lease-period expiry later, the lease is now formally removed without
> > any hitches.
> >
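The sequence above can be reproduced with a toy model. `ToyRegionServer` and `LeaseRaceDemo` are illustrative names, not HBase classes, and the exception names are returned as plain strings; the model only assumes what the steps describe: next() checks that the scanner exists, removes the lease while it works, and puts the lease back when it finishes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Toy region server: one lease per open scanner.
class ToyRegionServer {
    private final Map<String, Boolean> scanners = new ConcurrentHashMap<>();
    private final Map<String, Boolean> leases = new ConcurrentHashMap<>();

    void openScanner(String id) {
        scanners.put(id, true);
        leases.put(id, true);
    }

    // Simplified next(): check scanner, take the lease, work, return the lease.
    String next(String id, CountDownLatch leaseTaken, CountDownLatch mayFinish)
            throws InterruptedException {
        if (!scanners.containsKey(id)) return "UnknownScannerException";
        if (leases.remove(id) == null) return "LeaseException";
        leaseTaken.countDown(); // this handler now holds the lease
        mayFinish.await();      // simulate a slow scan (filters, large values, locks)
        leases.put(id, true);   // lease goes back once the call completes
        return "rows";
    }
}

class LeaseRaceDemo {
    static List<String> run() {
        ToyRegionServer rs = new ToyRegionServer();
        rs.openScanner("scanner-1");
        CountDownLatch leaseTaken = new CountDownLatch(1);
        CountDownLatch mayFinish = new CountDownLatch(1);
        List<String> outcomes = Collections.synchronizedList(new ArrayList<>());

        // First next(): stalls inside the server while holding the lease.
        Thread first = new Thread(() -> {
            try {
                outcomes.add("first: " + rs.next("scanner-1", leaseTaken, mayFinish));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            first.start();
            leaseTaken.await();
            // Client RPC timeout fires; the silent retry lands on another handler.
            outcomes.add("retry: " + rs.next("scanner-1", leaseTaken, mayFinish));
            mayFinish.countDown(); // original call unfreezes; its client is gone by now
            first.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return outcomes;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The retry fails immediately with the lease error even though the scanner is still open, while the original call completes later with nobody left to read its response.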
> > Ideally, to prevent this, the rpc.timeout must be > lease period, as
> > was pointed out, since in that case we'd have waited X units more
> > for the original next() to unblock and continue by itself, and would not
> > have retried. That is how this is avoided, unintentionally, but it can
> > still happen if the next() takes very long.
> >
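The rule of thumb above (rpc timeout greater than lease period) can be checked mechanically. The sketch below uses a plain map rather than any HBase configuration class; `ScannerTimeoutCheck` is a hypothetical helper, and the 60000 ms defaults are the values mentioned in this thread, not looked up from a particular HBase release.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: flags settings where a client can retry next()
// before the lease period has passed.
class ScannerTimeoutCheck {
    static final String RPC_TIMEOUT = "hbase.rpc.timeout";
    static final String LEASE_PERIOD = "hbase.regionserver.lease.period";
    static final String DEFAULT_MS = "60000"; // default quoted in this thread

    // True when hbase.rpc.timeout is not greater than the lease period.
    static boolean rpcTimeoutTooShort(Map<String, String> conf) {
        long rpcTimeout = Long.parseLong(conf.getOrDefault(RPC_TIMEOUT, DEFAULT_MS));
        long leasePeriod = Long.parseLong(conf.getOrDefault(LEASE_PERIOD, DEFAULT_MS));
        return rpcTimeout <= leasePeriod;
    }

    public static void main(String[] args) {
        // Lucian's setup: lease period raised to 4 minutes, rpc timeout left alone.
        Map<String, String> conf = new HashMap<>();
        conf.put(LEASE_PERIOD, "240000");
        System.out.println("rpc timeout too short: " + rpcTimeoutTooShort(conf));
    }
}
```

Passing the check does not rule the problem out: as noted above, a next() call that runs longer than the rpc timeout still triggers the retry.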
> > I haven't seen a LeaseException in any other case so far, so maybe we
> > can improve that exception's message to indicate what's going on in
> > simpler terms so clients can reconfigure to fix themselves?
> >
> > Also we could add in some measures to prevent next()-duping, as that
> > is never bound to work given the lease-required system. Perhaps when
> > the next() stores the removed lease, we can store it somewhere global
> > (like ActiveLeases or summat) and deny next() duping if their
> > requested lease is already in ActiveLeases? Just ends up giving a
> > better message, not a solution.
> >
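A minimal sketch of the ActiveLeases idea, assuming a registry keyed by scanner name. The class, its methods and the message text are hypothetical; nothing here exists in HBase under these names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of scanners whose lease is currently held by a running next().
class ActiveLeases {
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    // Returns null if the call may proceed, or a client-facing message for a duplicate.
    String tryBegin(String scannerName) {
        if (inFlight.add(scannerName)) {
            return null;
        }
        return "next() for scanner " + scannerName + " is already running on this server;"
            + " this is probably a client retry after hbase.rpc.timeout";
    }

    // Called when next() finishes and the lease is handed back.
    void end(String scannerName) {
        inFlight.remove(scannerName);
    }
}

class ActiveLeasesDemo {
    public static void main(String[] args) {
        ActiveLeases active = new ActiveLeases();
        System.out.println("first call:  " + active.tryBegin("scanner-1"));
        System.out.println("duplicate:   " + active.tryBegin("scanner-1"));
        active.end("scanner-1");
        System.out.println("after end(): " + active.tryBegin("scanner-1"));
    }
}
```

As the message above says, this only improves the error the client sees; the duplicate call is still rejected.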
> > Hope this helps others who've run into the same issue.
> >
> > On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
> > <[email protected]> wrote:
> >
> > So you should see the SocketTimeoutException in your *client* logs (in
> > your case, mappers), not LeaseException. At this point yes you're
> > going to time out, but if you spend so much time cycling on the server
> > side then you shouldn't set a high caching configuration on your
> > scanner as IO isn't your bottleneck.
> >
> > J-D
> >
> > On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> > <[email protected]> wrote:
> >
> > Hi,
> >
> > The servers have been restarted (I have had this configuration for more than a
> > month, so this is not the problem).
> > About the stack traces, they show exactly the same, a lot of
> > ClosedChannelExceptions and LeaseExceptions.
> >
> > But I found something that could be the problem: hbase.rpc.timeout. This
> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> > could happen the following way:
> > - the mapper makes a scanner.next call to the region server
> > - the region server needs more than 60 seconds to execute it (I use
> > multiple filters, and it could take a lot of time)
> > - the scan client gets the timeout and cuts the connection
> > - the region server tries to send the results to the client ==>
> > ClosedChannelException
> >
> > I will take a deeper look into it tomorrow. If you have other suggestions,
> > please let me know!
> >
> > Thanks,
> > Lucian
> >
> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
> > <[email protected]> wrote:
> >
> > Did you restart the region servers after changing the config?
> >
> > Are you sure it's the same exception/stack trace?
> >
> > J-D
> >
> > On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> > <[email protected]> wrote:
> >
> > Hi all,
> >
> > I have exactly the same problem that Eran had.
> > But there is something I don't understand: in my case, I have set the lease
> > time to 240000 (4 minutes). But most of the map tasks that are failing run
> > about 2 minutes. How is it possible to get a LeaseException if the task runs
> > less than the configured time for a lease?
> >
> > Regards,
> > Lucian Iordache
> >
> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <[email protected]> wrote:
> >
> > Perfect! Thanks.
> >
> > -eran
> >
> >
> >
> > On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <[email protected]> wrote:
> >
> > hbase.regionserver.lease.period
> >
> > Set it bigger than 60000.
> >
> > J-D
> >
> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <[email protected]> wrote:
> >
> > Thanks J-D!
> > Since my main table is expected to continue growing I guess at some point
> > even setting the cache size to 1 will not be enough. Is there a way to
> > configure the lease timeout?
> >
> > -eran
> >
> >
> >
> > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <[email protected]> wrote:
> >
> > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <[email protected]> wrote:
> >
> > Hi J-D,
> > Thanks for the detailed explanation.
> > So if I understand correctly the lease we're talking about is a scanner
> > lease and the timeout is between two scanner calls, correct? I think that
> > makes sense because I now realize that jobs that fail (some jobs continued
> > to fail even after reducing the number of map tasks as Stack suggested) use
> > filters to fetch relatively few rows out of a very large table, so they
> > could be spending a lot of time on the region server scanning rows until
> > it reached my setCaching value which was 1000. Setting the caching value to
> > 1 seems to allow these jobs to complete.
> > I think it has to be the above, since my rows are small, with just a few
> > columns and processing them is very quick.
> >
> > Excellent!
> >
> >
> > However, there are still a couple of things I don't understand:
> > 1. What is the difference between setCaching and setBatch?
> >
> > * Set the maximum number of values to return for each call to next()
> >
> > VS
> >
> > * Set the number of rows for caching that will be passed to scanners.
> >
> > The former (setBatch) is useful if you have rows with millions of columns
> > and you could setBatch to get only 1000 of them at a time. You could call
> > that intra-row scanning.
> >
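A rough arithmetic model of how the two settings interact may help. It is a sketch under stated assumptions, not HBase code: it assumes setBatch caps the number of values per returned result (so a wide row is split into several results), that setCaching caps the number of results shipped per next() round trip, and it ignores size limits and the final empty call. `ScanShapeModel` is a hypothetical name.

```java
// Hypothetical model of scan "shape"; not part of the HBase client API.
class ScanShapeModel {
    // Results produced for one row of `columns` values with the given batch setting.
    static long resultsPerRow(long columns, int batch) {
        if (batch <= 0) {
            return 1; // no batching: the whole row comes back as one result
        }
        return (columns + batch - 1) / batch; // ceil(columns / batch)
    }

    // Round trips needed to ship every result with the given caching setting.
    static long roundTrips(long rows, long columns, int batch, int caching) {
        long results = rows * resultsPerRow(columns, batch);
        return (results + caching - 1) / caching; // ceil(results / caching)
    }

    public static void main(String[] args) {
        // Wide rows: a million columns each, fetched 1000 values at a time.
        System.out.println("results per wide row: " + resultsPerRow(1_000_000L, 1000));
        // Narrow rows, as in Eran's table: caching decides how much work one call does.
        System.out.println("round trips, caching=1000: " + roundTrips(1000, 5, 0, 1000));
        System.out.println("round trips, caching=1:    " + roundTrips(1000, 5, 0, 1));
    }
}
```

With heavy filtering, a high caching value means a single next() call keeps scanning until it has collected that many matching rows, which is why lowering it to 1 shortened each call in the case described above.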
> >
> > 2. Examining the region server logs more closely than I did yesterday I see
> > a lot of ClosedChannelExceptions in addition to the expired leases (but no
> > UnknownScannerException), is that expected? You can see an excerpt of the
> > log from one of the region servers here: http://pastebin.com/NLcZTzsY
> >
> > It means that when the server got to process that client request and started
> > reading from the socket, the client was already gone. Killing a client does
> > that (or killing a MR that scans), so does SocketTimeoutException. This
> > should probably go in the book. We should also print something nicer :)
> >
> > J-D
> >
> >
> >
> > --
> > Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
