Hi, folks,

The StatsCombiner[1] shows one way for an Iterator to distinguish between
processed and unprocessed data. In this case, the StatsCombiner treats
string representations of integers as unprocessed data and comma-separated
string representations of integers as processed data.
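
To illustrate the distinction described above, here is a minimal sketch in plain Java. It is not the actual StatsCombiner source. The class name StatsSketch and the "min,max,sum,count" field order are assumptions made for the example. The point is only that the encoding itself tells the code whether a value is raw or already aggregated.

```java
import java.util.List;

// Hypothetical stand-in for a combiner's reduce step; no Accumulo classes are used.
final class StatsSketch {

    // Folds raw values ("7") and partial aggregates ("min,max,sum,count")
    // into a single aggregate string in the same four-field format.
    static String reduce(List<String> values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long sum = 0;
        long count = 0;

        for (String v : values) {
            String[] parts = v.split(",");
            if (parts.length == 1) {
                // Unprocessed data: a single integer.
                long n = Long.parseLong(parts[0]);
                min = Math.min(min, n);
                max = Math.max(max, n);
                sum += n;
                count += 1;
            } else if (parts.length == 4) {
                // Processed data: output of an earlier reduce call.
                min = Math.min(min, Long.parseLong(parts[0]));
                max = Math.max(max, Long.parseLong(parts[1]));
                sum += Long.parseLong(parts[2]);
                count += Long.parseLong(parts[3]);
            } else {
                throw new IllegalArgumentException("Unrecognized value: " + v);
            }
        }
        return min + "," + max + "," + sum + "," + count;
    }
}
```

Because the output uses the processed format, feeding it back in alongside new raw values gives the same totals as reducing all the raw values at once.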

Two questions: First, is it possible to do this in an arbitrary fashion?
For example, let's say my Iterator adds Values to a bloom filter which it
maintains internally - like a combiner, but potentially across multiple
CFs. If the iterator encounters unprocessed data, it should offer it to
the bloom filter. If it encounters processed data (i.e., a bloom filter), it
should merge it with its own bloom filter.

The only way that I can think of to do this is to have a higher-priority
iterator that "escapes" Values, and have my Iterator emit unescaped Values.
Then my iterator can make decisions based on whether a current Value is or
isn't escaped. I find this approach pretty kludgy though, and any advice is
welcome.
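
To make the escaping idea concrete, here is a minimal sketch in plain Java, under stated assumptions. A one-byte marker stands in for the "escape", and a java.util.BitSet with two ad hoc hash functions stands in for a real bloom filter. Every name here (TaggedFilterSketch, RAW, FILTER, tag, offer, emit) is hypothetical and none of it is Accumulo API.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: values carry a leading marker byte so the consumer
// can tell raw data from an already-built filter.
final class TaggedFilterSketch {
    static final byte RAW = 0;     // payload is an unprocessed value
    static final byte FILTER = 1;  // payload is a serialized bit set
    static final int BITS = 1024;  // toy filter size, chosen arbitrarily

    private final BitSet bits = new BitSet(BITS);

    // Prepends the marker byte to the payload.
    static byte[] tag(byte marker, byte[] payload) {
        byte[] out = new byte[payload.length + 1];
        out[0] = marker;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    // Adds raw data to the filter, or merges in a previously emitted filter.
    void offer(byte[] tagged) {
        if (tagged.length == 0) {
            throw new IllegalArgumentException("Missing marker byte");
        }
        byte[] payload = Arrays.copyOfRange(tagged, 1, tagged.length);
        if (tagged[0] == RAW) {
            add(payload);
        } else if (tagged[0] == FILTER) {
            bits.or(BitSet.valueOf(payload));
        } else {
            throw new IllegalArgumentException("Unknown marker: " + tagged[0]);
        }
    }

    // Serializes the current filter, marked as processed data.
    byte[] emit() {
        return tag(FILTER, bits.toByteArray());
    }

    // True if the value may have been added; false means definitely not.
    boolean mightContain(byte[] raw) {
        int h = Arrays.hashCode(raw);
        return bits.get(index(h)) && bits.get(index(h * 31 + 17));
    }

    private void add(byte[] raw) {
        int h = Arrays.hashCode(raw);
        bits.set(index(h));
        bits.set(index(h * 31 + 17));
    }

    private static int index(int h) {
        return Math.floorMod(h, BITS);
    }
}
```

In this sketch the marker byte would have to be prepended by whatever writes the raw data or by an earlier iterator, which is the same extra step that makes the approach feel kludgy.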

Second question: is the need to distinguish between processed and
unprocessed data due to the Iterator running in all three scopes? Would a
per-scanner Iterator or an Iterator running only in scan scope be guaranteed
to see only unprocessed data?

Thanks,
-Russ

1:
https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java
