[ 
https://issues.apache.org/jira/browse/HBASE-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531746#comment-17531746
 ] 

Bryan Beaudreault commented on HBASE-26997:
-------------------------------------------

The impetus for this issue was to find an hbase 2+ solution to a problem we 
often see in hbase 1.2. In our 1.2-based fork we have a patch to automatically 
re-open scanners when UnknownScannerException is seen in MR jobs, and I'd like 
to remove that.

I wrote up an initial solution here for master branch and, in testing, realized 
that it's virtually impossible to get an UnknownScannerException in master 
branch. The AsyncClientScanner's prefetching routines automatically call 
renewLease whenever prefetching is paused because the cache is full. In master 
branch, setCaching just determines the number of rows returned per RPC and 
setMaxResultSize determines the size of the local cache. I tried numerous 
configurations and could never trigger an UnknownScannerException. This 
is great!

Next I tried applying my patch to branch-2 and was surprised to see that my 
test also passed there without even enabling the feature. Digging in, this is 
because error handling has improved a lot since 1.2: UnknownScannerException 
is now automatically retried up to the max number of retries. So it seems an 
additional mitigation for this issue in hbase2 would be to bump max retries up 
a bunch. This could still be problematic, though, if your use case really can't 
avoid long periods between next calls. Over the course of a long mapper where 
you might hit UnknownScannerException multiple times, it'd still be possible to 
exceed retries. UnknownScannerExceptions also add noise to both exception 
metrics and logs, which can make you think there are underlying issues when 
there actually aren't.

So I think it makes sense to add this feature to branch-2x only.

> Auto renew scanner lease in TableRecordReader
> ---------------------------------------------
>
>                 Key: HBASE-26997
>                 URL: https://issues.apache.org/jira/browse/HBASE-26997
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
> A common problem with hadoop jobs is when the mapper takes too long to 
> process individual inputs. This is especially problematic with 
> TableInputFormat because if you don't process a scanner.next() batch within 
> the scanner timeout period your job will fail with UnknownScannerException.
> The fix here is usually to reduce Scan.setCaching, so that fewer rows are 
> returned within each batch. This isn't always a great solution because maybe 
> not all batches are uniform in their processing time, or maybe even 
> processing a single row (the smallest caching size) might take a while.
> We can improve this for users by providing a configurable period at which the 
> TableRecordReader will automatically call scanner.renewLease() unless next() 
> was recently called.
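
The renewal rule in the description above (call renewLease() on a configurable 
period unless next() was called recently) can be sketched roughly as follows. 
This is not the patch attached to the issue. RenewableScanner is a hypothetical 
stand-in for the one ResultScanner method the sketch needs, and LeaseRenewer, 
recordNext and renewIfIdle are illustrative names only. The sketch also does 
not address whether renewLease() can safely run concurrently with next() on a 
real scanner; a real implementation would need to coordinate the two.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class LeaseRenewalSketch {

  /** Hypothetical stand-in for a scanner; only renewLease() is needed here. */
  public interface RenewableScanner {
    boolean renewLease();
  }

  /** Renews the scanner lease once next() has been idle for a full period. */
  public static final class LeaseRenewer implements AutoCloseable {
    private final RenewableScanner scanner;
    private final long periodNanos;
    private final LongSupplier clock; // injectable so the logic can be tested
    private final AtomicLong lastNextNanos;
    private ScheduledExecutorService executor; // null until start() is called

    public LeaseRenewer(RenewableScanner scanner, long periodMillis, LongSupplier clock) {
      this.scanner = scanner;
      this.periodNanos = TimeUnit.MILLISECONDS.toNanos(periodMillis);
      this.clock = clock;
      this.lastNextNanos = new AtomicLong(clock.getAsLong());
    }

    /** The record reader calls this every time it invokes next(). */
    public void recordNext() {
      lastNextNanos.set(clock.getAsLong());
    }

    /** Returns true if a renewal was sent, false if next() ran recently. */
    public boolean renewIfIdle() {
      if (clock.getAsLong() - lastNextNanos.get() < periodNanos) {
        return false;
      }
      scanner.renewLease();
      return true;
    }

    /** Starts a daemon thread that checks for idleness once per period. */
    public synchronized void start() {
      if (executor != null) {
        return;
      }
      executor = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "scanner-lease-renewer");
        t.setDaemon(true);
        return t;
      });
      executor.scheduleAtFixedRate(this::renewIfIdle, periodNanos, periodNanos,
          TimeUnit.NANOSECONDS);
    }

    @Override
    public synchronized void close() {
      if (executor != null) {
        executor.shutdownNow();
      }
    }
  }
}
```

In this sketch a reader would construct the renewer with System::nanoTime as 
the clock and a period comfortably shorter than the scanner timeout, call 
start() once, call recordNext() around each next(), and close() the renewer 
together with the scanner.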



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
