forgot below code: this.scan.setStartRow(this.lastResult.getRow()); we reset scanner to original position, so if we want to retry for OutOfOrderScannerNextException, perhaps one retry would be sufficient(since the exception already indicates do not retry...)
filed HBASE-12266 On Wed, Oct 15, 2014 at 5:41 PM, Qiang Tian <[email protected]> wrote: > James, > thanks for the details, an interesting case. > > It looks to me the OutOfOrderScannerNextException thrown by RS is correct > -- note the internal cursor in scanner has moved on, even if the client > side retries, client will not get the correct data(will lose data that the > scanner already scanned) > the dead loop is caused by "retryAfterOutOfOrderException = true;" at the > last line of the try block. > not sure why it is set true there. but looking at below code it looks it > just want to retry once? > > if (e instanceof OutOfOrderScannerNextException) { > if (retryAfterOutOfOrderException) { > retryAfterOutOfOrderException = false; > } else { > // TODO: Why wrap this in a DNRIOE when it already is a > DNRIOE? > throw new DoNotRetryIOException("Failed after retry of " + > "OutOfOrderScannerNextException: was there a rpc > timeout?", e); > } > } > > > and it looks we throw OutOfOrderScannerNextException here > > } else { > // If exception is any but the list below throw it back to > the client; else setup > // the scanner and retry. > Throwable cause = e.getCause(); > if ((cause != null && cause instanceof > NotServingRegionException) || > (cause != null && cause instanceof > RegionServerStoppedException) || > e instanceof OutOfOrderScannerNextException) { > // Pass > // It is easier writing the if loop test as list of what > is allowed rather than > // as a list of what is not allowed... so if in here, it > means we do not throw. > } else { > throw e; > } > } > > > > > > > On Wed, Oct 15, 2014 at 3:58 AM, James Estes <[email protected]> > wrote: > >> I'm having an issue where a Scan gets in a loop, and never completes and >> never times out. I've seen it run for hours, and is repeatable on my >> system. After looking through hbase code and logs, I think I understand >> what is going on. >> >> I'm using hbase 0.96.0-hadoop2, running on Hadoop 2.2.0. I'm using the >> same >> version of the HBase client. >> >> We do an hourly incremental backup using a simple Java class that sets up >> a >> Scan with a time range. We're only adding about 250,000 cells ~100MB per >> hour across 46 regions on 8 region servers, so there isn't much, but it'll >> be skipping a lot of data (after a compaction). This Scan will regularly >> get 'stuck' where we'll be alerted that it has been running for hours. >> >> Turns out that one of the scan rpc calls is taking a long time (> >> rpcTimeout which we have set to 10s), and the ClientScanner.next call >> loops >> forever retrying because for some reason, when the RpcRetryingCaller >> retries after the timeout, it gets an OutOfOrderScannerNextException which >> the ClientScanner treats as needing to reset the scan. So without another >> break condition, the ClientScanner.next keeps resetting the scanner, only >> to get the exact same result. >> >> Here is a rough sequence: >> 1 - the scan gets near the end of a region, and we process those results >> 2 - then it goes back for more via ClientScanner.next(), and there isn't >> any new data at the end of that region, so the rpc call comes back to the >> client (but no new keys are in the result) >> 3 - then it moves the scan forward to the next region, and goes to fetch >> more data. This time though, this next region has recently been compacted >> and it cannot prune store files by the timestamp range, and has a large >> amount of data to scan through (again filtering by timestamp). >> 4 - We have our rpcTimeout set to 10s. This scan rpc call takes longer >> than >> that b/c of all the data it is churning through. >> 5 - The RetryingRpcHandler gets a timeout exception, and will retry >> 6 - The server almost immediately notices this (usually within 10-20ms). I >> think this indicates that the scan is actively progressing pretty quickly >> through the data because it is able to check and notice the client has >> gone >> away so quickly...there is just a lot of data to get through. >> 7 - The client retries after 100ms pause time (from the >> RetryingRpcHandler). But the server immediately rejects the scan returning >> an OutOfOrderScannerNextException. The client thinks the call >> next_call_seq=0, server says next_call_seq=1. This maybe shouldn't matter >> much, it just means the client would reset the scanner and we've only had >> one extra rpc call here since the ClientScanner treats >> OutOfOrderScannerNextException as a signal to reset the scanner. But there >> is no other condition to stop the loop via a timeout or retry count. >> 8 - So the client side resets the scanner back to the last key from the >> previous results, and goes to fetch data. And we are back to #1. Right >> here, there seems to be no check to decide if the ClientScanner.next has >> taken too long overall. >> >> Here are some annotated logs: >> https://gist.github.com/housejester/a0a530279eaa868db877 >> >> Increasing our rpcTimeout fixes the issue, since we give the Scan enough >> time to complete (it didn't take much, 12s was fine, but we've bumped it >> up >> to 60s). However, it seems like the ClientScanner.next should either have >> a >> timeout or max reset count? Maybe rpcTimeout * retries? I think if the >> OutOfOrderScannerNextException didn't occur, then the normal timeout path >> would have worked. >> >> James >> > >
