Thanks, I had thought about trying this, and it’s good to know it’s a viable 
solution.

However, I’m pretty reluctant right now to add any more iterators to our 
project; they’ve been a testing nightmare for us internally.
Because of the way our internal process works, at any point in time we have 
many versions of our product running against a subset of tables in a single 
Accumulo instance. At least in 1.6, there doesn’t appear to be a good way to 
have the tablet servers automatically reload the iterators when builds are 
updated (you can specify paths to watch, but that doesn't seem to handle 
wildcards). Our internal servers have literally hundreds of tables that require 
different versions of iterators, so they live in different HDFS paths.

Thanks

Rob Povey


From: Dylan Hutchison <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, October 27, 2015 at 4:35 PM
To: Accumulo User List <[email protected]>
Subject: Re: Is there a sensible way to do this? Sequential Batch Scanner

Hi Rob,

One solution is to use an Accumulo iterator.  Suppose you want to scan a set of 
non-overlapping ranges R.  Use a (non-batch) Scanner with a range spanning from 
the least start key in R to the greatest end key in R, and a server-side 
iterator that works as follows:

  *   Pass R to the server-side iterator via iterator options.
  *   On a call to seek(Range r, ..., ...) in the iterator: let the iterator 
seek its parent for the first range in R that intersects with r.
  *   On a call to next(), if the current seek'ed range is finished, seek its 
parent to the next range in R that intersects with r, until no more ranges in R 
intersect with r.  At that point the scan is finished.
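
The steps above can be sketched in plain Java. This is only a simulation of the 
idea, not the Graphulo code and not a real Accumulo iterator: a sorted map 
stands in for the parent iterator, keys are plain strings, and the names 
RangeSkippingSketch, KeyRange, and scan are hypothetical. A real version would 
implement Accumulo's SortedKeyValueIterator interface and receive R through 
iterator options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Plain-Java simulation of the seek-skipping idea; no Accumulo classes are used. */
final class RangeSkippingSketch {

    /** Hypothetical inclusive key range [start, end]. */
    static final class KeyRange {
        final String start;
        final String end;
        KeyRange(String start, String end) { this.start = start; this.end = end; }
    }

    private final NavigableMap<String, String> source; // stands in for the parent iterator
    private final List<KeyRange> targets;              // R: sorted and non-overlapping
    private KeyRange seekRange;                        // r: the range handed to seek()
    private int targetIdx;                             // index of the current range in R
    private Map.Entry<String, String> top;             // current entry, or null when finished

    RangeSkippingSketch(NavigableMap<String, String> source, List<KeyRange> targets) {
        this.source = source;
        this.targets = targets;
    }

    /** seek(r): position on the first entry of the first range in R that intersects r. */
    void seek(KeyRange r) {
        seekRange = r;
        targetIdx = 0;
        advance(null);
    }

    boolean hasTop() { return top != null; }

    String getTopKey() { return top.getKey(); }

    /** next(): step within the current range, or re-seek to the next range in R.
     *  Call only while hasTop() is true. */
    void next() { advance(top.getKey()); }

    private void advance(String afterKey) {
        top = null;
        while (targetIdx < targets.size()) {
            KeyRange t = targets.get(targetIdx);
            // Clip the current range of R to the seek range r.
            String lo = t.start.compareTo(seekRange.start) >= 0 ? t.start : seekRange.start;
            String hi = t.end.compareTo(seekRange.end) <= 0 ? t.end : seekRange.end;
            if (lo.compareTo(hi) <= 0) {
                // Continue after the last returned key, or "seek the parent" to lo.
                Map.Entry<String, String> e = (afterKey != null && afterKey.compareTo(lo) >= 0)
                        ? source.higherEntry(afterKey)
                        : source.ceilingEntry(lo);
                if (e != null && e.getKey().compareTo(hi) <= 0) {
                    top = e;
                    return;
                }
            }
            targetIdx++; // current range is exhausted or does not intersect r
        }
    }

    /** Collects every key the sketch returns for seek range r, in order. */
    static List<String> scan(NavigableMap<String, String> source,
                             List<KeyRange> targets, KeyRange r) {
        RangeSkippingSketch it = new RangeSkippingSketch(source, targets);
        it.seek(r);
        List<String> keys = new ArrayList<>();
        while (it.hasTop()) {
            keys.add(it.getTopKey());
            it.next();
        }
        return keys;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data = new TreeMap<>();
        for (String k : "a b c d e f g h".split(" ")) {
            data.put(k, "value-" + k);
        }
        List<KeyRange> r = Arrays.asList(new KeyRange("b", "c"), new KeyRange("f", "g"));
        System.out.println(scan(data, r, new KeyRange("a", "z")));
        System.out.println(scan(data, r, new KeyRange("c", "f")));
    }
}
```

Clipping each range in R to r matters because a tablet server may call seek 
again with a narrower range, for example when a scan resumes partway through; 
the sketch then returns only keys that fall in both r and R.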

The result is that you can scan a number of non-overlapping ranges with "one 
Scanner call" whose results come back in order.  We did this by moving seek 
control into the land of iterators.  One word of caution: if the number of 
ranges is very large, you might run into 
ACCUMULO-3710<https://issues.apache.org/jira/browse/ACCUMULO-3710>: too many 
range objects get materialized at the tablet server, which results in an 
out-of-memory error.

I have implemented something like this in the Graphulo project under 
SeekFilterIterator<https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/SeekFilterIterator.java>
 and its related classes.  Take a look at that if you want to try this idea, 
and feel free to follow up with questions.

Cheers, Dylan




On Tue, Oct 27, 2015 at 3:21 PM, Rob Povey 
<[email protected]<mailto:[email protected]>> wrote:
What I want is something that behaves like a BatchScanner (i.e. takes a 
collection of Ranges in a single RPC) but preserves the scan ordering.
I understand this would greatly impact performance, but in my case I can 
manually partition my request on the client and send one request per tablet.
I can’t use Scanners, because in some cases I have tens of thousands of 
non-consecutive ranges.
If I use a single-threaded BatchScanner and only request data from a single 
tablet, am I guaranteed ordering?
This appears to work correctly in my small tests (albeit slower than a single 
one-thread BatchScanner call), but I don’t really want to have to rely on it if 
the semantics aren’t guaranteed.
If not, is there another “efficient” way to do this?

Thanks

Rob Povey

