Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

Josh Elser Mon, 14 Jul 2014 14:49:30 -0700

A quick sanity check is to make sure you have data in the table and that you can read the data without your iterator (I've thought I had a bug because I didn't have proper visibilities more times than I'd like to admit).

Alternatively, you can also enable remote debugging via Eclipse into the TabletServer, which might help you understand more of what's going on.

Lots of articles cover how to set this up [1]. In short, add -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8000 to ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the tserver, connect Eclipse to port 8000 via the Debug configuration menu, set a breakpoint in your init, seek and next methods, and `scan` in the shell.

[1] http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html


On 7/14/14, 5:33 PM, Michael Moss wrote:

Hmm...Still doesn't return anything from the shell.

http://pastebin.com/ndRhspf8

Any thoughts? What's the best way to debug these?


On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
<wilhelm.von.cl...@accumulo.net <mailto:wilhelm.von.cl...@accumulo.net>>
wrote:

    Ah, an artifact of me just willy nilly writing an iterator :) Any
    reference to `this.source` should be replaced with
    `this.getSource()`. In `next()`, your workaround ends up calling
    `this.hasTop()` as the while loop condition. It will always return
    false because two lines up we set `top_key` to null. We need to make
    sure that the source iterator has a top, because we want to read
    data from it. We'll have to change the loop condition to
    `while(this.getSource().hasTop())`. On line 38 of your code we'll
    need to call `this.getSource().next()` instead of `this.next()`.
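
A minimal sketch of the corrected loop follows. It uses a simplified stand-in interface rather than Accumulo's real SortedKeyValueIterator, and it is not the code from the pastebin; the names `KvSource`, `ListSource`, and `SummingIterator` are hypothetical and exist only for illustration.

```java
import java.util.List;
import java.util.Map;

/** Simplified stand-in for an Accumulo source iterator; NOT the real interface. */
interface KvSource {
    boolean hasTop();
    String getTopKey();
    long getTopValue();
    void next();
}

/** In-memory source backed by a list, for demonstration only. */
final class ListSource implements KvSource {
    private final List<Map.Entry<String, Long>> entries;
    private int pos = 0;

    ListSource(List<Map.Entry<String, Long>> entries) { this.entries = entries; }

    public boolean hasTop() { return pos < entries.size(); }
    public String getTopKey() { return entries.get(pos).getKey(); }
    public long getTopValue() { return entries.get(pos).getValue(); }
    public void next() { pos++; }
}

/** Sums every value in the source and exposes the result as one key/value pair. */
final class SummingIterator implements KvSource {
    private final KvSource source;
    private String topKey;   // null means "no top"
    private long topValue;

    SummingIterator(KvSource source) { this.source = source; }

    KvSource getSource() { return source; }

    /** Stand-in for seek(): the source is already positioned, so just find the first top. */
    void seek() { next(); }

    public boolean hasTop() { return topKey != null; }
    public String getTopKey() { return topKey; }
    public long getTopValue() { return topValue; }

    public void next() {
        topKey = null;
        long sum = 0;
        String lastKey = null;
        // Loop on the SOURCE's hasTop(); this.hasTop() is false here because topKey was just cleared.
        while (getSource().hasTop()) {
            lastKey = getSource().getTopKey();
            sum += getSource().getTopValue();
            getSource().next();   // advance the source, not this iterator
        }
        if (lastKey != null) {    // only publish a top if the source had data
            topKey = lastKey;
            topValue = sum;
        }
    }

    public static void main(String[] args) {
        SummingIterator it = new SummingIterator(
                new ListSource(List.of(Map.entry("A", 2L), Map.entry("A", 3L))));
        it.seek();
        System.out.println(it.getTopKey() + "=" + it.getTopValue());
    }
}
```

Using the last key seen as the top key is an arbitrary choice made for this sketch; a real iterator would decide which key should carry the aggregate.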

    The iterator interface is documented, but there hasn't been a
    definitive go-to for making one. I've been drafting a blog post, but
    since it doesn't exist yet, hopefully the following will suffice.

    The lifetime of an iterator is (usually) as follows:

    (1) A new instance is created via Class.newInstance (so a no-args
    constructor is needed)
    (2) Init is called. This allows users to configure the iterator, set
    its source, and possibly check the environment. We can also call
    `deepCopy` on the source if we want to have multiple sources (we'd
    do this if we wanted to do a merge read out of multiple column
    families within a row).
    (3) seek() is called. This gets our readers to the correct positions
    in the data that are within the scan range the user requested, as
    well as turning column families on or off. The name should be
    reminiscent of seeking to some key on disk.
    (4) hasTop() is called. If true, that means we have data, and the
    iterator has a key/value pair that can be retrieved by calling
    getTopKey() and getTopValue(). If false, we're done because there's
    no data to return.
    (5) next() is called. This will attempt to find a new top key and
    value. We go back to (4) to see if next was successful in finding a
    new top key/value and will repeat until the client is satisfied or
    hasTop() returns false.

    You can kind of make a state machine out of those steps where we
    loop between (4) and (5) until there's no data. There are more
    advanced workflows where next() can be reading from multiple
    sources, as well as seeking them to different positions in the tablet.
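
The five steps above can be sketched as a small driver loop. This is a toy model under stated assumptions: `SimpleIterator`, `RangeIterator`, and `LifecycleDemo` are hypothetical names, the integer range stands in for Accumulo's Range and column-family arguments, and none of this is the real tablet server code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for the iterator contract described above; NOT Accumulo's interface. */
interface SimpleIterator {
    void init(Map<String, String> options);
    void seek(int startInclusive, int endExclusive);
    boolean hasTop();
    int getTopKey();
    String getTopValue();
    void next();
}

/** Toy iterator whose keys are the integers inside the seek range. */
final class RangeIterator implements SimpleIterator {
    private String prefix = "";
    private int current;
    private int end;

    RangeIterator() {}   // step (1) requires a no-args constructor

    public void init(Map<String, String> options) {
        prefix = options.getOrDefault("prefix", "");
    }
    public void seek(int startInclusive, int endExclusive) {
        current = startInclusive;
        end = endExclusive;
    }
    public boolean hasTop() { return current < end; }
    public int getTopKey() { return current; }
    public String getTopValue() { return prefix + current; }
    public void next() { current++; }
}

final class LifecycleDemo {
    /** Drives an iterator through steps (1) to (5) and collects what it returns. */
    static List<String> scan(Class<? extends SimpleIterator> type,
                             Map<String, String> options, int start, int end) {
        SimpleIterator it;
        try {
            it = type.getDeclaredConstructor().newInstance();   // (1) reflective creation
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("iterator needs a no-args constructor", e);
        }
        it.init(options);        // (2) configure
        it.seek(start, end);     // (3) position on the requested range
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {    // (4) is there a key/value pair?
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();           // (5) look for the next top, then back to (4)
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan(RangeIterator.class, Map.of("prefix", "v"), 0, 3));
        // prints [0=v0, 1=v1, 2=v2]
    }
}
```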


    On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
    <michael.m...@gmail.com <mailto:michael.m...@gmail.com>> wrote:

        Thanks, William. I was just hitting you up for an example :)

        I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but
        noticed that "this.source" in your example didn't have
        visibility. Did I work around it correctly?

        When I add my iterator to my table and run scan from the shell,
        it returns nothing - what should I expect here? In general I've
        found the iterator interface pretty confusing and haven't spent
        the time wrapping my head around it yet. Any documentation or
        examples (beyond what I could find on the site or in the code)
        appreciated!

        root@dev> table pojo
        root@dev pojo> listiter -scan -t pojo
        -
        -    Iterator counter, scan scope options:
        -        iteratorPriority = 10
        -        iteratorClassName = iterators.Counter
        -
        root@dev pojo> scan
        root@dev pojo>

        Best,

        -Mike




        On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
        <wilhelm.von.cl...@accumulo.net
        <mailto:wilhelm.von.cl...@accumulo.net>> wrote:

            For a bit of pseudocode, I'd probably make a class that did
            something akin to: http://pastebin.com/pKqAeeCR

            I wrote that up real quick in a text editor-- it won't
            compile or anything, but should point you in the right
            direction.


            On Mon, Jul 14, 2014 at 3:44 PM, William Slacum
            <wilhelm.von.cl...@accumulo.net
            <mailto:wilhelm.von.cl...@accumulo.net>> wrote:

                Hi Mike!

                The Combiner interface is only for aggregating keys
                within a single row. You can probably get away with
                implementing your combining logic in a WrappingIterator
                that reads across all the rows in a given tablet.

                To do some combine/fold/reduce operation, Accumulo needs
                the input type to be the same as the output type. The
                combiner doesn't have a notion of a "present" type (as
                you'd see in something like Algebird's Groups), but you
                can use another iterator to perform your transformation.
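
To illustrate the type constraint in plain Java (no Accumulo classes involved): a reduce needs one type on both sides, so the extraction has to happen as a separate step first. `HugeRecord` and `MapThenReduce` are hypothetical names, with `HugeRecord` standing in for the Avro object.

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Hypothetical stand-in for a large Avro record with a "count" field. */
final class HugeRecord {
    private final String payload;
    private final int count;

    HugeRecord(String payload, int count) {
        this.payload = payload;
        this.count = count;
    }

    String getPayload() { return payload; }
    int getCount() { return count; }
}

final class MapThenReduce {
    /**
     * First transform each V into an R (the job of the extra iterator),
     * then fold the R values together (the job of the combiner, where
     * input and output types are both R).
     */
    static <V, R> R mapThenReduce(List<V> values, Function<V, R> extract,
                                  R identity, BinaryOperator<R> combine) {
        return values.stream().map(extract).reduce(identity, combine);
    }

    public static void main(String[] args) {
        List<HugeRecord> values = List.of(new HugeRecord("x", 2), new HugeRecord("y", 3));
        Integer total = mapThenReduce(values, HugeRecord::getCount, 0, Integer::sum);
        System.out.println(total);   // prints 5
    }
}
```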

                If you wanted to extract the "count" field from your
                Avro object, you could write a new Iterator that took
                your Avro object, extracted the desired field, and
                returned it as its top value. You can then set this
                iterator as the source of the aggregator, either
                programmatically or by wrapping the source object
                passed to the aggregator in its
                SortedKeyValueIterator#init call.

                This is a bit inefficient as you'd have to serialize to
                a Value and then immediately deserialize it in the
                iterator above it. You could mitigate this by exposing a
                method that would get the extracted value before
                serializing it.
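
A rough sketch of that mitigation, again with stand-in types instead of Accumulo's Value and iterator classes. `AvroLikeRecord`, `ExtractingIterator`, `CountAggregator`, and `getExtractedCount()` are all hypothetical names; the point is only that the upper iterator can ask for the field directly and skip the serialize/deserialize round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical stand-in for a large Avro record. */
final class AvroLikeRecord {
    final int count;
    AvroLikeRecord(int count) { this.count = count; }
}

/** Extracts "count" from each record; simplified, NOT the real Accumulo API. */
final class ExtractingIterator {
    private final List<AvroLikeRecord> records;
    private int pos = 0;

    ExtractingIterator(List<AvroLikeRecord> records) { this.records = records; }

    boolean hasTop() { return pos < records.size(); }
    void next() { pos++; }

    /** The extracted field, available before any serialization happens. */
    int getExtractedCount() { return records.get(pos).count; }

    /** The serialized form that a generic downstream iterator would read. */
    byte[] getTopValue() {
        return Integer.toString(getExtractedCount()).getBytes(StandardCharsets.UTF_8);
    }
}

final class CountAggregator {
    /** Generic path: every value is serialized below and parsed again here. */
    static long sumViaBytes(ExtractingIterator source) {
        long sum = 0;
        while (source.hasTop()) {
            sum += Integer.parseInt(new String(source.getTopValue(), StandardCharsets.UTF_8));
            source.next();
        }
        return sum;
    }

    /** Shortcut path: reads the extracted field directly. */
    static long sumDirect(ExtractingIterator source) {
        long sum = 0;
        while (source.hasTop()) {
            sum += source.getExtractedCount();
            source.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<AvroLikeRecord> data = List.of(new AvroLikeRecord(2), new AvroLikeRecord(3));
        System.out.println(sumViaBytes(new ExtractingIterator(data)));
        System.out.println(sumDirect(new ExtractingIterator(data)));
    }
}
```

The shortcut couples the aggregator to one specific source type, which is the trade-off for avoiding the extra serialization.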

                This kind of counting also requires client side logic to
                do a final combine operation, since the aggregations
                from all the tservers are partial results.
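
The client-side step can be as small as merging per-tablet partial sums by key. A minimal sketch in plain Java, assuming the partial results have already been read from the scanner into maps (the `ClientSideCombine` class and the string keys are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClientSideCombine {
    /** Merges partial counts, one map per tablet, into final totals per key. */
    static Map<String, Long> combine(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            // merge() inserts the count if the key is new, otherwise adds to the running total
            partial.forEach((key, count) -> totals.merge(key, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> fromTablet1 = Map.of("A foo", 2L);
        Map<String, Long> fromTablet2 = Map.of("A foo", 3L, "B foo", 1L);
        System.out.println(combine(List.of(fromTablet1, fromTablet2)));
        // prints {A foo=5, B foo=1}
    }
}
```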

                I believe that CountingIterator is not meant for user
                consumption, but I do not know if it's related to your
                issue in trying to use it from the shell. Iterators set
                through the shell, in previous versions of Accumulo,
                have a requirement to implement OptionDescriber. Many
                default iterators do not implement this, and thus can't
                be set in the shell.



                On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss
                <michael.m...@gmail.com <mailto:michael.m...@gmail.com>>
                wrote:

                    Hi, All.

                    I'm curious what the best practices are around
                    persisting complex types/data in Accumulo (and
                    aggregating on fields within them).

                    Let's say I have (row, column family, column
                    qualifier, value):
                    "A" "foo" "" MyHugeAvroObject(count=2)
                    "A" "foo" "" MyHugeAvroObject(count=3)

                    Let's say MyHugeAvroObject has a field "Integer
                    count" with the values above.

                    What is the best way to aggregate on row, column
                    family, column qualifier by count? In my above example:
                    "A" "foo" "" 5

                    The TypedValueCombiner.typedReduce method can
                    deserialize any "V", in my case MyHugeAvroObject,
                    but it needs to return a value of type "V". What are
                    the best practices for deeply nested/complex
                    objects? It's not always straightforward to map a
                    complex Avro type into Row -> Column Family ->
                    Column Qualifier.

                    Rather than using a TypedCombiner, I looked into
                    using an Aggregator (which appears deprecated as of
                    1.4), which appears to let me return arbitrary
                    values, but despite running setiter, my aggregator
                    doesn't seem to do anything.

                    I also tried looking at implementing a
                    WrappingIterator, which also appears to allow me to
                    return arbitrary values (such as Accumulo's
                    CountingIterator), but I get cryptic errors when
                    trying to setiter (I'm on Accumulo 1.6):

                    root@dev kyt> setiter -t kyt -scan -p 10 -n
                    countingIter -class
                    org.apache.accumulo.core.iterators.system.CountingIterator
                    2014-07-14 11:12:55,623 [shell.Shell] ERROR:
                    java.lang.IllegalArgumentException:
                    org.apache.accumulo.core.iterators.system.CountingIterator

                    This is odd because other included implementations
                    of WrappingIterator seem to work (perhaps the
                    implementation of CountingIterator is dated):
                    root@dev kyt> setiter -t kyt -scan -p 10 -n
                    deletingIterator -class
                    org.apache.accumulo.core.iterators.system.DeletingIterator
                    The iterator class does not implement
                    OptionDescriber. Consider this for better iterator
                    configuration using this setiter command.
                    Name for iterator (enter to skip):

                    All in all, how can I aggregate simple values, like
                    counters from rows with complex Avro objects as
                    Values without having to add aggregation fields to
                    these Value objects?

                    Thanks!

                    -Mike

Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
