I had started a response, but Bill beat me to it, so let me reiterate.

The tear-down is mostly there to keep scans responsive when multiple scans are happening at the same time. There's a buffer between the TabletServer(s) and the client; once it's filled (if memory serves), the scan session becomes a candidate to be torn down and later recreated.

To avoid duplicate work by your Accumulo iterators, the last key the iterators returned is maintained by Accumulo.

For example, if you started a scan with a Range:

(-inf, +inf)

Say you scanned 2000/10000 keys in a table of monotonically increasing Keys where only the row is populated. The buffer was filled, the iterators torn down, and re-created some amount of time later. Instead of getting the (-inf, +inf) range again, you would then get the range:

(2000, +inf)

Meaning, the initial infinite start key would be replaced with a start key which was the last key your previous scan returned, non-inclusive.
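
To make the re-seek concrete, here's a minimal sketch in plain Java. It does not use the Accumulo API at all: the "table" is just a TreeSet of integer rows, and the class name, method names, and the fixed batch size standing in for "the buffer filled up" are all hypothetical simplifications of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class ReseekSketch {
    // Hypothetical stand-in for a table of monotonically increasing rows 1..size.
    static NavigableSet<Integer> table(int size) {
        NavigableSet<Integer> rows = new TreeSet<>();
        for (int i = 1; i <= size; i++) {
            rows.add(i);
        }
        return rows;
    }

    // Scans the whole table, "tearing down" after every batchSize keys
    // (an assumed simplification of the buffer filling up). Returns the
    // range each re-created session was handed.
    static List<String> scan(NavigableSet<Integer> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> rangesSeen = new ArrayList<>();
        Integer lastKey = null; // null models -inf
        while (true) {
            rangesSeen.add("(" + (lastKey == null ? "-inf" : lastKey) + ", +inf)");
            // The new start key is the last key returned, non-inclusive.
            NavigableSet<Integer> remaining =
                    lastKey == null ? rows : rows.tailSet(lastKey, false);
            int returned = 0;
            for (Integer row : remaining) {
                if (returned == batchSize) {
                    break; // buffer "full": session gets torn down here
                }
                lastKey = row;
                returned++;
            }
            if (returned < batchSize) {
                break; // ran out of keys, scan is finished
            }
        }
        return rangesSeen;
    }

    public static void main(String[] args) {
        System.out.println(scan(table(10000), 2000));
    }
}
```

With 10000 rows and a batch of 2000, the second range the sketch reports is (2000, +inf), matching the example above.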

In short, it's good practice to keep Accumulo iterators from holding on to state in memory; otherwise, you may get stuck rebuilding the same in-memory members on your iterators every time they are re-created. See ACCUMULO-625 for some thoughts about trying to avoid this lost-state issue.
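
As a rough illustration of that cost, here's another plain-Java sketch (again, not the Accumulo API; the class, the method name, and the fixed batch size are hypothetical). It models an iterator that buffers everything left in its range into a TreeSet each time it is created, and counts how many keys get buffered in total across rebuilds.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

public class RebuildCostSketch {
    // Counts keys buffered by a hypothetical stateful iterator that is
    // torn down after every batchSize keys it returns (an assumed
    // simplification of the real buffer-based tear-down).
    static long keysBufferedAcrossRebuilds(int tableSize, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        NavigableSet<Integer> table = new TreeSet<>();
        for (int i = 1; i <= tableSize; i++) {
            table.add(i);
        }

        long buffered = 0;
        Integer lastKey = null; // null models -inf
        while (true) {
            NavigableSet<Integer> range =
                    lastKey == null ? table : table.tailSet(lastKey, false);
            if (range.isEmpty()) {
                break;
            }
            // In-memory state is lost on tear-down, so it is rebuilt
            // from scratch over whatever remains of the range.
            TreeSet<Integer> state = new TreeSet<>(range);
            buffered += state.size();

            int returned = 0;
            for (Integer key : state) {
                if (returned == batchSize) {
                    break; // torn down before the state is fully used
                }
                lastKey = key;
                returned++;
            }
        }
        return buffered;
    }

    public static void main(String[] args) {
        System.out.println(keysBufferedAcrossRebuilds(10000, 2000));
    }
}
```

For 10000 keys and a tear-down every 2000, the sketch buffers 10000 + 8000 + 6000 + 4000 + 2000 = 30000 keys to return 10000, which is the kind of repeated work a stateless iterator avoids.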

- Josh

On 07/01/2012 05:18 PM, William Slacum wrote:

By iterator stack I am referring to the Accumulo iterators. Resource sharing among scan sessions is implemented by destroying a user scan session and eventually recreating the iterator stack. The new stack is then seek'd to the last key returned by the entire stack. If you were holding some state, such as a set of keys, it would be rebuilt every time the stack is created.

On Jul 1, 2012 5:55 PM, "Sukant Hajra" wrote:

    Excerpts from William Slacum's message of Thu Jun 28 16:04:32 -0500 2012:
    >
    > You're pretty much on the spot regarding two aspects about the current
    > IntersectingIterator:
    >
    > 1- It's not really extensible (there are hooks for building doc IDs,
    > but you still need the same `partition term: docId` key structure)
    > 2- Its main strength is that it can do the merges of sorted lists of
    > doc IDs based on equality expressions (ie, `author=="bob" and
    > day=="20120627"`)
    >
    > Fortunately, the logic isn't very complicated for re-creating the
    > merging stuff. Personally, I think it's easy enough to separate the
    > logic of joining N streams of iterator results from the actual
    > scanning. Unfortunately, this would be left up to you to do at the
    > moment :)
    >
    > You could do range searches by consuming sets of values and sorting
    > all of the docIds in that range by throwing them into a TreeSet. That
    > would let you emit doc IDs in a globally sorted order for the given
    > range of terms.

    I understand everything above, I think.  Thanks for the prompt reply.

    > This can get problematic if the range ends up being very large because your
    > iterator stack may periodically be destroyed and rebuilt.

    This particular statement confused me.  When you said TreeSet, you're talking
    about a straight-forward in-memory collection from java.util or similar, right?

    Because I'm confused about which "iterator stack may periodically be destroyed
    and rebuilt."  It sounds like we're talking about some garbage collection
    specific to Accumulo.  Am I missing something here?

    -Sukant
