Hi Rob,

One solution is to use an Accumulo iterator. Suppose you want to scan a set of non-overlapping ranges R. Use a (non-batch) Scanner whose range spans from the least start key in R to the greatest end key in R, together with a server-side iterator that works as follows:
- Pass R to the server-side iterator via iterator options.
- On a call to seek(Range r, ..., ...) in the iterator, have the iterator seek its parent to the first range in R that intersects r.
- On a call to next(), if the currently seeked range is finished, seek the parent to the next range in R that intersects r, and repeat until no more ranges in R intersect r. At that point the scan is finished.

The result is that you can scan a number of disjoint ranges with "one Scanner call" whose results come back in order. In effect, this moves seek control into the land of iterators.

One word of caution: if the number of ranges is very large, you might run into ACCUMULO-3710 <https://issues.apache.org/jira/browse/ACCUMULO-3710>: too many Range objects get materialized at the tablet server, which results in an out-of-memory error.

I have implemented something like this in the Graphulo project under SeekFilterIterator <https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/SeekFilterIterator.java> and its related classes. Take a look at that if you want to try this idea, and feel free to follow up with questions.

Cheers,
Dylan

On Tue, Oct 27, 2015 at 3:21 PM, Rob Povey <[email protected]> wrote:
> What I want is something that behaves like a BatchScanner (i.e. takes a
> collection of Ranges in a single RPC), but preserves the scan ordering.
> I understand this would greatly impact performance, but in my case I can
> manually partition my request on the client, and send one request per
> tablet.
> I can’t use scanners, because in some cases I have tens of thousands of
> non-consecutive ranges.
> If I use a single threaded BatchScanner, and only request data from a
> single Tablet, am I guaranteed ordering?
> This appears to work correctly in my small tests (albeit slower than a
> single 1 thread Batch scanner call), but I don’t really want to have to
> rely on it if the semantic isn’t guaranteed.
> If not, is there another “efficient” way to do this?
>
> Thanks
>
> Rob Povey
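
For illustration only, here is a minimal, self-contained Java sketch of the seek()/next() logic described in the reply above. It deliberately does not use the Accumulo API: a TreeMap stands in for the parent iterator's sorted data, keys are plain strings, and ranges are half-open [start, end). The names MultiRangeSeekSketch, KeyRange, and scan are hypothetical and are not taken from Accumulo or Graphulo; see SeekFilterIterator for a real implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Simulation of the idea only; this is NOT an Accumulo SortedKeyValueIterator.
class MultiRangeSeekSketch {

    /** Hypothetical stand-in for a range: half-open interval [start, end). */
    static final class KeyRange {
        final String start, end;
        KeyRange(String start, String end) {
            if (start.compareTo(end) >= 0) throw new IllegalArgumentException("empty range");
            this.start = start;
            this.end = end;
        }
        boolean intersects(KeyRange o) {
            return start.compareTo(o.end) < 0 && o.start.compareTo(end) < 0;
        }
        /** Intersection with o; only valid when intersects(o) is true. */
        KeyRange clip(KeyRange o) {
            String s = start.compareTo(o.start) >= 0 ? start : o.start;
            String e = end.compareTo(o.end) <= 0 ? end : o.end;
            return new KeyRange(s, e);
        }
    }

    private final NavigableMap<String, String> parent; // stands in for the parent iterator
    private final List<KeyRange> targets;              // the set R, assumed non-overlapping
    private KeyRange seekRange;                        // the r passed to seek()
    private int idx;                                   // index into R of the current range
    private Iterator<Map.Entry<String, String>> cur;   // position inside the current range
    private Map.Entry<String, String> top;

    MultiRangeSeekSketch(NavigableMap<String, String> parent, List<KeyRange> targets) {
        this.parent = parent;
        this.targets = new ArrayList<>(targets);
        this.targets.sort(Comparator.comparing((KeyRange k) -> k.start));
    }

    /** seek(r): position on the first range in R that intersects r. */
    void seek(KeyRange r) {
        seekRange = r;
        idx = -1;
        advanceToNextRange();
    }

    /** next(): stay in the current range if possible, else move to the next one in R. */
    void next() {
        if (cur != null && cur.hasNext()) top = cur.next();
        else advanceToNextRange();
    }

    boolean hasTop() { return top != null; }

    Map.Entry<String, String> getTop() { return top; }

    // "Seek the parent" to each later range in R that intersects r until one yields an entry.
    private void advanceToNextRange() {
        top = null;
        while (top == null && ++idx < targets.size()) {
            KeyRange t = targets.get(idx);
            if (!t.intersects(seekRange)) continue;
            KeyRange c = t.clip(seekRange);
            cur = parent.subMap(c.start, true, c.end, false).entrySet().iterator();
            if (cur.hasNext()) top = cur.next();
        }
    }

    /** Drives the iterator the way a single Scanner would and collects the keys in order. */
    static List<String> scan(NavigableMap<String, String> table, List<KeyRange> targets, KeyRange r) {
        MultiRangeSeekSketch it = new MultiRangeSeekSketch(table, targets);
        it.seek(r);
        List<String> keys = new ArrayList<>();
        while (it.hasTop()) {
            keys.add(it.getTop().getKey());
            it.next();
        }
        return keys;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String k : new String[] {"a", "b", "c", "d", "e", "f", "g"}) table.put(k, "v");
        List<KeyRange> r = List.of(new KeyRange("b", "d"), new KeyRange("f", "h"));
        // One scan over the span [b, h) returns only keys inside R, in sorted order.
        System.out.println(scan(table, r, new KeyRange("b", "h"))); // prints [b, c, f, g]
    }
}
```

Clipping each range in R against r matters in this sketch because the seek range need not cover all of R; only the part of R inside r is returned.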
