Eugene Cheipesh wrote:
That is the idea I am playing with, optimizing subsequent calls to .next
vs .reseek.

Using the VersioningIterator as an I was able to get reseek working, the
trick turned out to be to check getSource.hasTop a little more
carefully. Thank you for that pointer. Disappointingly enough there does
not appear to be a huge difference in performance from a Filter that
performs the same checking without seeking forward.

seek()'ing doesn't always imply an increase in performance -- remember that RFiles (the files that back Accumulo tables), are composed of multiple blocks/sections with an index of them. A seek is comprised of using that index to find the block/section of the RFile and then a linear scan forward to find the first key for the range you seek()'ed to.

Thus, if you're repeatedly re-seek()'ing within the same block, you'll waste a lot of time re-read the same data. In your situation, it sounds like the cost of re-reading the data after a seek is about the same as naively consuming the records.

You can try altering table.file.compress.blocksize (and then compacting your table) to see how this changes.

I am attempting to use accumulo tracing to come up with an explanation
but results are spotty. Query runtime is ~2.3s for 200k entries and I am
only capturing tiny fraction of that through tracing.

You mean that your iterator isn't accounting for a majority of the execution time? Are you bounded by the time it takes to read the data from disk?

I am using
AccumuloInputFormat to pull results into a spark job. Would I get better
results if I was creating the BatchScanner directly? Otherwise what
would be the best method to debug the iterator and query performance?

Are there 200k entries in the table or are you extracting 200k out of N entries? Hooking into tracing is definitely the right approach to take to get perf information.

--
Eugene Cheipesh

From: Russ Weeks <rwe...@newbrightidea.com>
<mailto:rwe...@newbrightidea.com>
Reply: user@accumulo.apache.org <user@accumulo.apache.org>>
<mailto:user@accumulo.apache.org>
Date: January 9, 2015 at 11:32:13 PM
To: user@accumulo.apache.org <user@accumulo.apache.org>>
<mailto:user@accumulo.apache.org>
Subject: Re: Seeking Iterator

On Fri, Jan 9, 2015 at 7:56 PM, Christopher <ctubb...@apache.org
<mailto:ctubb...@apache.org>> wrote:

    Another optimization you can try: instead of always seeking to the
    computed next, you can advance internally inside your iterator by
    calling its source's next method a few times. If you don't reach
    the next element that you would have seek'd to in some reasonable
    number of iterations, you can then seek. This also is a strategy
    that is hard to optimize: Do I need to advance, on average 3 or 20
    or 10000000 keys? How many before it would have been more
    efficient to just seek? There's no easy answer. Experimentation helps.


The VersioningIterator has a good example of this approach:
https://github.com/apache/accumulo/blob/901d60ef1cf72c2d55c90746fce94e108a992d3b/core/src/main/java/org/apache/accumulo/core/iterators/user/VersioningIterator.java#L95

-Russ



    --
    Christopher L Tubbs II
    http://gravatar.com/ctubbsii

    On Fri, Jan 9, 2015 at 6:54 PM, Eugene Cheipesh
    <echeip...@gmail.com <mailto:echeip...@gmail.com>> wrote:

        That’s would work well enough and is my next choice.

        The thought was, rows are stored in increasing order, so as
        long as I know when I walked off the edge, and flag the
        iterator as empty it’d be good. I’m just chasing the optimal
        in this case, but if it doesn’t exist, oh well.

        Thank you for the reference link, it’s very helpful.

        --
        Eugene Cheipesh

        From: Russ Weeks <rwe...@newbrightidea.com>
        <mailto:rwe...@newbrightidea.com>
        Reply: user@accumulo.apache.org
        <mailto:user@accumulo.apache.org> <user@accumulo.apache.org>>
        <mailto:user@accumulo.apache.org>
        Date: January 9, 2015 at 6:48:47 PM
        To: user@accumulo.apache.org <mailto:user@accumulo.apache.org>
        <user@accumulo.apache.org>> <mailto:user@accumulo.apache.org>
        Subject: Re: Seeking Iterator

        Hi, Eugene,

        I think the conventional approach is to decompose your search
        area (bounding box?) into a set of scan ranges that introduce
        minimal extraneous curve segments, and then pass all those
        scan ranges into a BatchScanner. The excellent Accumulo
        Recipes site has an example[1]. Does this approach not work
        for you?

        In general, your custom iterator should never try to seek to
        a row id different from the current row id, because that row
        could be hosted by a different tablet server.

        -Russ

        1:
        
https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/support/QuadTreeHelper.java#L33

        On Fri, Jan 9, 2015 at 2:37 PM, Eugene Cheipesh
        <echeip...@gmail.com <mailto:echeip...@gmail.com>> wrote:

            Hello,

            I am attempting to write an Iterator based on a Z-curve
            index to search through multi-dimensional data.
            Essentially, given a record that I have encountered that
            is in the index range not in the multi-demensional query
            range I have a way to generate the next candidate record,
            potentially far ahead of the current point.

            Ideally I would be able to refine my search range with
            subsequent calls to seek(). It appears that Accumulo will
            create an iterator for every RFile (or some split other
            split point). The beginning of the range argument to seek
            will be the record at beginning of this split (which is
            good), however all instances of the iterator have the
            same, global range end (which is bad).

            I need to avoid the case where I seek past the range
            boundary of each individual iterator instance and throw a
            NullPointerException. Is there any way to get enough
            information to achieve this?

            Thank you,

            --
            Eugene Cheipesh




Reply via email to