Hmm... It still doesn't return anything from the shell.
http://pastebin.com/ndRhspf8
Any thoughts? What's the best way to debug these?
On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
<wilhelm.von.cl...@accumulo.net>
wrote:
Ah, an artifact of me just willy nilly writing an iterator :) Any
reference to `this.source` should be replaced with
`this.getSource()`. In `next()`, your workaround ends up calling
`this.hasTop()` as the while loop condition. It will always return
false because two lines up we set `top_key` to null. We need to make
sure that the source iterator has a top, because we want to read
data from it. We'll have to change the loop condition to
`while(this.getSource().hasTop())`. On line 38 of your code we'll
need to call `this.getSource().next()` instead of `this.next()`.
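A minimal sketch of the corrected loop follows. It is not the code from the pastebin and does not use Accumulo's classes: `CountSource`, `ListCountSource`, and `SummingWrapper` are hypothetical stand-ins with plain long values, written only to show why the loop must test and advance the source rather than the wrapper itself.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified stand-in for a source iterator over long counts. */
interface CountSource {
    boolean hasTop();
    long getTopValue();
    void next();
}

/** In-memory source used only for this illustration. */
class ListCountSource implements CountSource {
    private final Iterator<Long> it;
    private Long top; // null means the source is exhausted

    ListCountSource(List<Long> values) {
        this.it = values.iterator();
        next(); // position on the first entry, if any
    }

    public boolean hasTop() { return top != null; }
    public long getTopValue() { return top; }
    public void next() { top = it.hasNext() ? it.next() : null; }
}

/** Sums every value read from its source and exposes the total as one top value. */
class SummingWrapper {
    private final CountSource source;
    private Long topValue; // null means this wrapper has no top

    SummingWrapper(CountSource source) { this.source = source; }

    CountSource getSource() { return source; }
    boolean hasTop() { return topValue != null; }
    long getTopValue() { return topValue; }

    void next() {
        topValue = null; // clear our own top first
        if (!getSource().hasTop()) {
            return; // nothing left to read, so we stay without a top
        }
        long sum = 0;
        // Test the SOURCE here. this.hasTop() is always false at this point
        // because topValue was just cleared.
        while (getSource().hasTop()) {
            sum += getSource().getTopValue();
            getSource().next(); // advance the source, not this wrapper
        }
        topValue = sum;
    }
}
```

With a source holding 2 and 3, one call to `next()` on the wrapper leaves it with a top value of 5, and a second call leaves it with no top.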
The iterator interface is documented, but there hasn't been a
definitive go-to for making one. I've been drafting a blog post, but
since it doesn't exist yet, hopefully the following will suffice.
The lifetime of an iterator is (usually) as follows:
(1) A new instance is created via Class.newInstance (so a no-args
constructor is needed)
(2) Init is called. This allows users to configure the iterator, set
its source, and possibly check the environment. We can also call
`deepCopy` on the source if we want to have multiple sources (we'd
do this if we wanted to do a merge read out of multiple column
families within a row).
(3) seek() is called. This gets our readers to the correct positions
in the data that are within the scan range the user requested, as
well as turning column families on or off. The name should be
reminiscent of seeking to some key on disk.
(4) hasTop() is called. If true, that means we have data, and the
iterator has a key/value pair that can be retrieved by calling
getTopKey() and getTopValue(). If false, we're done because there's
no data to return.
(5) next() is called. This will attempt to find a new top key and
value. We go back to (4) to see if next was successful in finding a
new top key/value and will repeat until the client is satisfied or
hasTop() returns false.
You can kind of make a state machine out of those steps where we
loop between (4) and (5) until there's no data. There are more
advanced workflows where next() can be reading from multiple
sources, as well as seeking them to different positions in the tablet.
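The five steps above can be sketched as a small driver loop. This is an illustration under stated assumptions, not Accumulo's API: `MiniIterator` is a hypothetical, much-simplified interface with String keys and values, `init` takes the data directly instead of a source, options, and environment, and `seek` takes an inclusive start and end key instead of a Range and column families. The sketch uses `getDeclaredConstructor().newInstance()` because `Class.newInstance` is deprecated in recent Java versions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified model of the iterator contract described above. */
interface MiniIterator {
    void init(TreeMap<String, String> data);
    void seek(String startKey, String endKey);
    boolean hasTop();
    String getTopKey();
    String getTopValue();
    void next();
}

/** Reads sorted entries from an in-memory map. */
class InMemoryIterator implements MiniIterator {
    private TreeMap<String, String> data;
    private Iterator<Map.Entry<String, String>> cursor;
    private Map.Entry<String, String> top; // null means no top

    InMemoryIterator() { } // step (1) requires a no-args constructor

    public void init(TreeMap<String, String> data) { this.data = data; }

    public void seek(String startKey, String endKey) {
        // Position the cursor inside the requested (inclusive) range.
        cursor = data.subMap(startKey, true, endKey, true).entrySet().iterator();
        next(); // load the first top, if any
    }

    public boolean hasTop() { return top != null; }
    public String getTopKey() { return top.getKey(); }
    public String getTopValue() { return top.getValue(); }
    public void next() { top = cursor.hasNext() ? cursor.next() : null; }
}

class LifecycleDemo {
    static List<String> scan(Class<? extends MiniIterator> clazz,
                             TreeMap<String, String> data,
                             String start, String end) {
        MiniIterator iter;
        try {
            iter = clazz.getDeclaredConstructor().newInstance(); // (1) instantiate
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("iterator needs a no-args constructor", e);
        }
        iter.init(data);       // (2) configure and set the source
        iter.seek(start, end); // (3) move to the scan range
        List<String> out = new ArrayList<>();
        while (iter.hasTop()) { // (4) is there a key/value pair?
            out.add(iter.getTopKey() + "=" + iter.getTopValue());
            iter.next();        // (5) look for the next top, then back to (4)
        }
        return out;
    }
}
```

The `while` loop is the (4)/(5) state machine: it ends either when the caller stops reading or when `hasTop()` returns false.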
On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
<michael.m...@gmail.com> wrote:
Thanks, William. I was just hitting you up for an example :)
I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but
noticed that "this.source" in your example didn't have
visibility. Did I work around it correctly?
When I add my iterator to my table and run scan from the shell,
it returns nothing - what should I expect here? In general I've
found the iterator interface pretty confusing and haven't spent
the time wrapping my head around it yet. Any documentation or
examples (beyond what I could find on the site or in the code)
appreciated!
root@dev> table pojo
root@dev pojo> listiter -scan -t pojo
-
- Iterator counter, scan scope options:
- iteratorPriority = 10
- iteratorClassName = iterators.Counter
-
root@dev pojo> scan
root@dev pojo>
Best,
-Mike
On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
<wilhelm.von.cl...@accumulo.net> wrote:
For a bit of pseudocode, I'd probably make a class that did
something akin to: http://pastebin.com/pKqAeeCR
I wrote that up real quick in a text editor-- it won't
compile or anything, but should point you in the right
direction.
On Mon, Jul 14, 2014 at 3:44 PM, William Slacum
<wilhelm.von.cl...@accumulo.net> wrote:
Hi Mike!
The Combiner interface is only for aggregating keys
within a single row. You can probably get away with
implementing your combining logic in a WrappingIterator
that reads across all the rows in a given tablet.
To do some combine/fold/reduce operation, Accumulo needs
the input type to be the same as the output type. The
combiner doesn't have a notion of a "present" type (as
you'd see in something like Algebird's Groups), but you
can use another iterator to perform your transformation.
If you wanted to extract the "count" field from your
Avro object, you could write a new Iterator that took
your Avro object, extracted the desired field, and
returned it as its top value. You can then set this
iterator as the source of the aggregator, either
programmatically or by wrapping the source object
passed to the aggregator in its
SortedKeyValueIterator#init call.
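The extract-then-combine idea can be sketched without any Accumulo classes. In this sketch, `HugeRecord` is a hypothetical stand-in for the Avro object, and `extractAndReduce` plays the role of the two stacked iterators: the extraction function maps each record to a Long, so the reduce step sees the same type for input and output.

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Hypothetical stand-in for the Avro record; only `count` matters here. */
class HugeRecord {
    final int count;
    final String payload;

    HugeRecord(int count, String payload) {
        this.count = count;
        this.payload = payload;
    }
}

class ExtractThenCombine {
    /**
     * Maps each value of type V to the "present" type T, then folds the
     * results with a combine function whose input and output are both T.
     */
    static <V, T> T extractAndReduce(List<V> values,
                                     Function<V, T> extract,
                                     T identity,
                                     BinaryOperator<T> combine) {
        T acc = identity;
        for (V v : values) {
            acc = combine.apply(acc, extract.apply(v));
        }
        return acc;
    }

    /** Sums the `count` field across records. */
    static long sumCounts(List<HugeRecord> records) {
        return extractAndReduce(records, r -> (long) r.count, 0L, Long::sum);
    }
}
```

For the two example rows with counts 2 and 3, `sumCounts` returns 5. Because the extraction and the reduction happen in the same call here, nothing is serialized in between; stacked iterators would pass a Value between the two steps.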
This is a bit inefficient as you'd have to serialize to
a Value and then immediately deserialize it in the
iterator above it. You could mitigate this by exposing a
method that would get the extracted value before
serializing it.
This kind of counting also requires client side logic to
do a final combine operation, since the aggregations
from all the tservers are partial results.
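As a rough sketch of that final client-side step, assume the scan has already returned one partial count per tablet and that each one has been decoded to a long. The class and method names here are made up for illustration.

```java
import java.util.List;

class PartialCounts {
    /** Folds the partial sums returned by the servers into one total. */
    static long combine(List<Long> partials) {
        long total = 0;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }
}
```

This only works because addition is associative and commutative, so the order in which partial results arrive does not matter.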
I believe that CountingIterator is not meant for user
consumption, but I do not know if it's related to your
issue in trying to use it from the shell. Iterators set
through the shell, in previous versions of Accumulo,
have a requirement to implement OptionDescriber. Many
default iterators do not implement this, and thus can't
be set in the shell.
On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss
<michael.m...@gmail.com>
wrote:
Hi, All.
I'm curious what the best practices are around
persisting complex types/data in Accumulo (and
aggregating on fields within them).
Let's say I have (row, column family, column
qualifier, value):
"A" "foo" "" MyHugeAvroObject(count=2)
"A" "foo" "" MyHugeAvroObject(count=3)
Let's say MyHugeAvroObject has a field "Integer
count" with the values above.
What is the best way to aggregate on row, column
family, column qualifier by count? In my above example:
"A" "foo" "" 5
The TypedValueCombiner.typedReduce method can
deserialize any "V", in my case MyHugeAvroObject,
but it needs to return a value of type "V". What are
the best practices for deeply nested/complex
objects? It's not always straightforward to map a
complex Avro type into Row -> Column Family ->
Column Qualifier.
Rather than using a TypedCombiner, I looked into
using an Aggregator (which appears deprecated as of
1.4), which appears to let me return arbitrary
values, but despite running setiter, my aggregator
doesn't seem to do anything.
I also tried looking at implementing a
WrappingIterator, which also appears to allow me to
return arbitrary values (such as Accumulo's
CountingIterator), but I get cryptic errors when
trying to setiter, I'm on Accumulo 1.6:
root@dev kyt> setiter -t kyt -scan -p 10 -n
countingIter -class
org.apache.accumulo.core.iterators.system.CountingIterator
2014-07-14 11:12:55,623 [shell.Shell] ERROR:
java.lang.IllegalArgumentException:
org.apache.accumulo.core.iterators.system.CountingIterator
This is odd because other included implementations
of WrappingIterator seem to work (perhaps the
implementation of CountingIterator is dated):
root@dev kyt> setiter -t kyt -scan -p 10 -n
deletingIterator -class
org.apache.accumulo.core.iterators.system.DeletingIterator
The iterator class does not implement
OptionDescriber. Consider this for better iterator
configuration using this setiter command.
Name for iterator (enter to skip):
All in all, how can I aggregate simple values, like
counters from rows with complex Avro objects as
Values without having to add aggregation fields to
these Value objects?
Thanks!
-Mike