Re: strategies beyond intersecting iterators?

Josh Elser Sun, 01 Jul 2012 15:27:22 -0700

I had started a response, but Bill beat me to it, so let me reiterate.

The tear-down is more for assuring responsiveness when multiple scans are happening at one time. There's a buffer between the TabletServer(s) and the client; if memory serves, once it's filled, the scan session becomes a candidate to be torn down and later recreated.

To avoid duplicate work by your Accumulo iterators, the last key the iterators returned is maintained by Accumulo.


For example, if you started a scan with a Range:

(-inf, +inf)

Say you scanned 2000/10000 keys in a table of monotonically increasing Keys where only the row is populated. The buffer was filled, the iterators torn down, and re-created some amount of time later. Instead of getting the (-inf, +inf) range again, you would then get the range:


(2000, +inf)

Meaning, the initial infinite start key would be replaced with a start key which was the last key your previous scan returned, non-inclusive.
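
The resume behavior described above can be modeled without any Accumulo classes. The following is only a sketch under stated assumptions: `ScanResumeSketch`, `nextBatch`, and `scanAll` are hypothetical names, a `TreeSet<Integer>` stands in for the sorted table, and `batchSize` stands in for the buffer filling up. It is not the Accumulo API; it only shows the start of the range moving to the last returned key, exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical, stdlib-only model of scan teardown and resume; not the Accumulo API.
class ScanResumeSketch {

    /** Returns up to batchSize rows strictly after lastReturned (null models -inf). */
    static List<Integer> nextBatch(TreeSet<Integer> table, Integer lastReturned, int batchSize) {
        // A fresh view is built on every call: nothing survives from the previous batch.
        NavigableSet<Integer> view =
                (lastReturned == null) ? table : table.tailSet(lastReturned, false);
        List<Integer> batch = new ArrayList<>();
        for (Integer row : view) {
            if (batch.size() == batchSize) {
                break; // "buffer" is full, so this scan session would be torn down
            }
            batch.add(row);
        }
        return batch;
    }

    /** Drives the scan to completion, remembering only the last key returned. */
    static List<Integer> scanAll(TreeSet<Integer> table, int batchSize) {
        List<Integer> out = new ArrayList<>();
        Integer lastReturned = null;
        while (true) {
            List<Integer> batch = nextBatch(table, lastReturned, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            out.addAll(batch);
            lastReturned = batch.get(batch.size() - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<Integer> table = new TreeSet<>();
        for (int i = 1; i <= 10000; i++) {
            table.add(i);
        }
        List<Integer> first = nextBatch(table, null, 2000);
        List<Integer> second = nextBatch(table, first.get(first.size() - 1), 2000);
        System.out.println(second.get(0)); // prints 2001: the range is now (2000, +inf)
    }
}
```

Because the only thing carried between batches is `lastReturned`, no key is returned twice and none is skipped in this model.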

In short, it's good practice to try to keep Accumulo iterators from holding on to state in memory, otherwise you may get stuck creating the same in-memory members on your iterators repeatedly. See ACCUMULO-625 for some thoughts about trying to avoid this lost-state issue.
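
To make the cost of in-memory state concrete, here is a rough stdlib-only model. Everything in it is an assumption for illustration: `StatefulRebuildSketch`, `buildState`, and `scanWithTeardowns` are hypothetical names, the unsorted `source` list stands in for doc IDs read across a range of terms, and a teardown is assumed to happen after every `batchSize` results. It is not how Accumulo is implemented; it only counts how often the same entries get re-read when a sorted set has to be rebuilt after each teardown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, stdlib-only model; class and method names are not Accumulo's.
class StatefulRebuildSketch {

    /** Total source entries read while (re)building in-memory state in the last scan. */
    static long entriesRead = 0;

    /** Models an iterator that buffers every not-yet-returned doc ID into a TreeSet. */
    static TreeSet<Integer> buildState(List<Integer> source, Integer lastReturned) {
        TreeSet<Integer> sorted = new TreeSet<>();
        for (Integer id : source) {
            entriesRead++; // this read is repeated on every rebuild
            if (lastReturned == null || id > lastReturned) {
                sorted.add(id);
            }
        }
        return sorted;
    }

    /** Scans everything, discarding the in-memory state after each batch. */
    static List<Integer> scanWithTeardowns(List<Integer> source, int batchSize) {
        entriesRead = 0;
        List<Integer> out = new ArrayList<>();
        Integer lastReturned = null;
        while (true) {
            // The previous TreeSet is gone; it is rebuilt from scratch here.
            TreeSet<Integer> state = buildState(source, lastReturned);
            if (state.isEmpty()) {
                break;
            }
            int emitted = 0;
            for (Integer id : state) {
                if (emitted == batchSize) {
                    break; // modeled teardown
                }
                out.add(id);
                lastReturned = id;
                emitted++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(5, 3, 9, 1, 7, 10, 2, 8, 4, 6);
        List<Integer> result = scanWithTeardowns(source, 2);
        System.out.println(result);
        System.out.println("entries read: " + entriesRead);
    }
}
```

In this model the results still come out in sorted order, but the source is re-read once per teardown, which is the repeated work the paragraph above warns about.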


- Josh

On 07/01/2012 05:18 PM, William Slacum wrote:

By iterator stack I am referring to the Accumulo iterators. Resource sharing among scan sessions is implemented by destroying a user scan session and eventually recreating the iterator stack. The new stack is then seek'd to the last key returned by the entire stack. If you were holding some state, such as a set of keys, it would be rebuilt every time the stack is created.

On Jul 1, 2012 5:55 PM, "Sukant Hajra" <[email protected]> wrote:


    Excerpts from William Slacum's message of Thu Jun 28 16:04:32 -0500 2012:
    >
    > You're pretty much on the spot regarding two aspects about the current
    > IntersectingIterator:
    >
    > 1- It's not really extensible (there are hooks for building doc IDs,
    > but you still need the same `partition term: docId` key structure)
    > 2- Its main strength is that it can do the merges of sorted lists of
    > doc IDs based on equality expressions (ie, `author=="bob" and
    > day=="20120627"`)
    >
    > Fortunately, the logic isn't very complicated for re-creating the
    > merging stuff. Personally, I think it's easy enough to separate the
    > logic of joining N streams of iterator results from the actual
    > scanning. Unfortunately, this would be left up to you to do at the
    > moment :)
    >
    > You could do range searches by consuming sets of values and sorting
    > all of the docIds in that range by throwing them into a TreeSet. That
    > would let you emit doc IDs in a globally sorted order for the given
    > range of terms.

    I understand everything above, I think.  Thanks for the prompt reply.

    > This can get problematic if the range ends up being very large because
    > your iterator stack may periodically be destroyed and rebuilt.

    This particular statement confused me.  When you said TreeSet, you're
    talking about a straight-forward in-memory collection from java.util or
    similar, right?

    Because I'm confused about which "iterator stack may periodically be
    destroyed and rebuilt."  It sounds like we're talking about some garbage
    collection specific to Accumulo.  Am I missing something here?

    -Sukant
