Thanks, I had thought about trying this, and it’s good to know it’s a viable solution.
However, I’m pretty reluctant right now to add any more iterators to our project; they’ve been a testing nightmare for us internally. Because of the way our internal process works, at any point in time we have many versions of our product running against a subset of tables in a single Accumulo instance, and at least in 1.6 there doesn’t appear to be a good way to have the tablet servers automatically reload the iterators when builds are updated (you can specify paths to watch, but it doesn’t seem to deal with wildcards). Our internal servers have literally hundreds of tables that require different versions of the iterators, so the iterators live in differing HDFS paths.

Thanks
Rob Povey

From: Dylan Hutchison <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, October 27, 2015 at 4:35 PM
To: Accumulo User List <[email protected]>
Subject: Re: Is there a sensible way to do this? Sequential Batch Scanner

Hi Rob,

One solution is to use an Accumulo iterator. Suppose you want to scan a set of non-overlapping ranges R. Use a (non-batch) Scanner, with a range spanning the least start key in R to the greatest end key in R, and a server-side iterator that works as follows:

* Pass R to the server-side iterator via iterator options.
* On a call to seek(Range r, ..., ...) in the iterator: let the iterator seek its parent to the first range in R that intersects with r.
* On a call to next(), if the current seek'ed range is finished, seek the parent to the next range in R that intersects with r, until no more ranges in R intersect with r. At that point the scan is finished.

The result is that you can scan a number of disjoint ranges with "one Scanner call" whose results come back in order. We did this by moving seek control into the land of iterators.
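To make the seek()/next() steps above concrete, here is a minimal sketch in plain Java. It is a toy model and not Accumulo API code: keys are plain strings, ranges are closed intervals, and a NavigableSet stands in for the parent iterator. The names MultiRangeSeekSketch, KeyRange, and scan are made up for illustration. A real version would presumably implement Accumulo's SortedKeyValueIterator, decode R from the iterator options, and deal with exclusive bounds and re-seeks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy model of the multi-range seeking iterator (hypothetical names throughout). */
final class MultiRangeSeekSketch {

    /** Closed interval [start, end] over String keys; a stand-in for a real Range. */
    static final class KeyRange {
        final String start;
        final String end;
        KeyRange(String start, String end) { this.start = start; this.end = end; }

        /** Intersection with other; only call this when the two ranges overlap. */
        KeyRange clip(KeyRange other) {
            String s = start.compareTo(other.start) >= 0 ? start : other.start;
            String e = end.compareTo(other.end) <= 0 ? end : other.end;
            return new KeyRange(s, e);
        }
    }

    private final NavigableSet<String> parent;  // sorted keys, standing in for the parent iterator
    private final List<KeyRange> targets;       // R: must be sorted and non-overlapping
    private KeyRange seekRange;                 // r from the most recent seek()
    private int nextTarget;                     // index of the next range in R to try
    private Iterator<String> current = Collections.emptyIterator();
    private String top;                         // null once the scan is finished

    MultiRangeSeekSketch(NavigableSet<String> parent, List<KeyRange> targets) {
        this.parent = parent;
        this.targets = targets;
    }

    /** Start at the first range in R that intersects r. */
    void seek(KeyRange r) {
        seekRange = r;
        nextTarget = 0;
        current = Collections.emptyIterator();
        next();
    }

    boolean hasTop() { return top != null; }

    String getTopKey() { return top; }

    /** Advance; when the current range is exhausted, "seek the parent" to the next one. */
    void next() {
        while (!current.hasNext() && nextTarget < targets.size()) {
            KeyRange t = targets.get(nextTarget++);
            if (t.end.compareTo(seekRange.start) < 0) {
                continue;                       // t lies entirely before r
            }
            if (t.start.compareTo(seekRange.end) > 0) {
                nextTarget = targets.size();    // t and everything after it lie past r
                break;
            }
            KeyRange c = t.clip(seekRange);
            current = parent.subSet(c.start, true, c.end, true).iterator();
        }
        top = current.hasNext() ? current.next() : null;
    }

    /** Convenience driver: collect every key the iterator returns for seek range r. */
    static List<String> scan(NavigableSet<String> keys, List<KeyRange> targets, KeyRange r) {
        MultiRangeSeekSketch it = new MultiRangeSeekSketch(keys, targets);
        List<String> out = new ArrayList<>();
        for (it.seek(r); it.hasTop(); it.next()) {
            out.add(it.getTopKey());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys =
            new TreeSet<>(Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h"));
        List<KeyRange> r = Arrays.asList(
            new KeyRange("b", "c"), new KeyRange("e", "e"), new KeyRange("g", "h"));
        System.out.println(scan(keys, r, new KeyRange("c", "g"))); // prints [c, e, g]
    }
}
```

Because R is kept sorted and each range is visited at most once, keys come back in order, which is the property the BatchScanner does not promise. The sketch restarts from the first range in R on every seek(); with tens of thousands of ranges, a binary search for the first intersecting range would likely be a better choice.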
One word of caution: if the number of ranges is very large, you might run into ACCUMULO-3710 <https://issues.apache.org/jira/browse/ACCUMULO-3710>, where too many Range objects get materialized at the tablet server, which results in an out-of-memory error.

I have implemented something like this in the Graphulo project under SeekFilterIterator <https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/SeekFilterIterator.java> and its related classes. Take a look at that if you want to try this idea, and feel free to follow up with questions.

Cheers,
Dylan

On Tue, Oct 27, 2015 at 3:21 PM, Rob Povey <[email protected]> wrote:

What I want is something that behaves like a BatchScanner (i.e. takes a collection of Ranges in a single RPC), but preserves the scan ordering. I understand this would greatly impact performance, but in my case I can manually partition my request on the client and send one request per tablet. I can’t use Scanners, because in some cases I have tens of thousands of non-consecutive ranges.

If I use a single-threaded BatchScanner and only request data from a single tablet, am I guaranteed ordering? This appears to work correctly in my small tests (albeit slower than a single one-thread BatchScanner call), but I don’t really want to rely on it if the semantics aren’t guaranteed. If not, is there another “efficient” way to do this?

Thanks
Rob Povey
