William,
Thanks for the quick response.  Let me start by stating what I understand about 
Iterators (to be sure I'm not completely off my rocker).

1. An iterator receives, as its source, another iterator (by way of the init 
method), which becomes its source of data.
2. When seek is called on an iterator, the iterator should respond by moving 
the pointer to the first key/value that applies to that iterator and is within 
the range.
    a. Depending on the iterator, that may not be the first key in the range
    b. Only keys (and their corresponding values) which include one of the 
column families listed in the family list should be available as topKey and 
topValue.  (This restriction should continue until seek is called again, 
meaning that subsequent calls to next will only proceed to key/values that 
also match the list provided.)
    c. Generally speaking, a seek will result in the iterator calling seek on 
its source iterator (although the parameters passed in may be different)
3. If an iterator needs configuration beyond just the source obtained in the 
init call, it can get that through the options and/or env.
4. Iterators do not necessarily return the same types of key/values as they 
consume, i.e., a Combiner may call next() and getTopValue() on its source 
multiple times each time those methods are called on it.  And the key it 
returns as topKey may be a key that doesn't actually exist in the datastore 
itself.
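
To make items 2b and 2c concrete, here is a minimal sketch of the behavior I am describing. It is a simplified, in-memory model, not the Accumulo API: Entry, MiniIterator, ListSource, FamilyFilter and ContractDemo are all hypothetical names, keys are reduced to plain strings, and the range is treated as inclusive on both ends. It only shows a wrapping iterator that passes seek through to its source and keeps applying the family list on every later call to next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the contract; NOT the Accumulo API.
final class Entry {
    final String row, family, qualifier;
    Entry(String row, String family, String qualifier) {
        this.row = row; this.family = family; this.qualifier = qualifier;
    }
    @Override public String toString() { return row + "/" + family + "/" + qualifier; }
}

interface MiniIterator {
    void seek(String startRow, String endRow, Set<String> families);
    boolean hasTop();
    Entry getTop();
    void next();
}

// Leaf source over a list already sorted by row; it ignores the family list.
final class ListSource implements MiniIterator {
    private final List<Entry> data;
    private String endRow = "";
    private int pos;
    ListSource(List<Entry> data) { this.data = data; }
    public void seek(String startRow, String endRow, Set<String> families) {
        this.endRow = endRow;
        pos = 0;
        while (pos < data.size() && data.get(pos).row.compareTo(startRow) < 0) pos++;
    }
    public boolean hasTop() {
        return pos < data.size() && data.get(pos).row.compareTo(endRow) <= 0;
    }
    public Entry getTop() { return data.get(pos); }
    public void next() { pos++; }
}

// Wraps a source and keeps the family restriction from the last seek().
final class FamilyFilter implements MiniIterator {
    private final MiniIterator source;
    private Set<String> families = new HashSet<>();
    FamilyFilter(MiniIterator source) { this.source = source; }
    public void seek(String startRow, String endRow, Set<String> families) {
        this.families = families;                // remembered until the next seek
        source.seek(startRow, endRow, families); // item 2c: delegate to the source
        skipNonMatching();
    }
    public boolean hasTop() { return source.hasTop(); }
    public Entry getTop() { return source.getTop(); }
    public void next() { source.next(); skipNonMatching(); }
    private void skipNonMatching() {
        while (source.hasTop() && !families.contains(source.getTop().family)) source.next();
    }
}

final class ContractDemo {
    static List<String> run() {
        List<Entry> data = Arrays.asList(
            new Entry("bin1", "D", "p1"),
            new Entry("bin1", "gender_male", "p1"),
            new Entry("bin2", "D", "p2"),
            new Entry("bin2", "gender_male", "p2"),
            new Entry("bin3", "D", "p3"));
        MiniIterator it = new FamilyFilter(new ListSource(data));
        it.seek("bin1", "bin2", new HashSet<>(Arrays.asList("D")));
        List<String> seen = new ArrayList<>();
        while (it.hasTop()) { seen.add(it.getTop().toString()); it.next(); }
        return seen;
    }
    public static void main(String[] args) { System.out.println(run()); }
}
```

In this model, seeking over bin1 through bin2 with the family list {"D"} yields only bin1/D/p1 and bin2/D/p2; the gender_male entries are skipped after next as well as after seek.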


So my questions:
Is it correct that once seek is called, only topKeys that conform to the 
columnFamilies collection should be returned, and that this behavior persists 
until seek is called again, even when next has been called?
How do iterators like the OrIterator obtain multiple sources?  (I assume you 
were trying to address that with #3 in your response, but I don't understand 
what you mean by clone()ing the source.  That would give me copies of the one 
source, but not multiple sources)
Why do some iterators have so many constructors if the system will simply 
construct them from the default constructor?
Some iterators (such as OrIterator) throw an exception if init is called.  How 
do these iterators get constructed and initialized?

If OrIterator can do what I'm asking for, how do I pass it the "terms" and what 
format do they come in?  You mentioned JEXL expressions, but I haven't seen 
anything about them in the documentation.


As for my statement about the OrIterator and multiple rows, the comments on the 
compareTo for OrIterator.TermSource state "If your implementation can have more 
than one row in a tablet, you must compare row key here first, then column 
qualifier."  But the code does not do so.  It may be that I'm just not fully 
understanding the code, however.
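
For reference, this is roughly the ordering I would expect from that comment. It is only a sketch: TermPosition and RowThenQualifier are hypothetical stand-ins, not the actual OrIterator.TermSource class, and keys are plain strings. The comparator orders by row first and falls back to the column qualifier only when the rows are equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a term source's current position;
// this is NOT the real OrIterator.TermSource.
final class TermPosition {
    final String term, row, qualifier;
    TermPosition(String term, String row, String qualifier) {
        this.term = term; this.row = row; this.qualifier = qualifier;
    }
}

final class RowThenQualifier {
    // Compare the row first, then the column qualifier.
    static final Comparator<TermPosition> ORDER =
        Comparator.comparing((TermPosition p) -> p.row)
                  .thenComparing(p -> p.qualifier);

    // Drains a priority queue of term positions in comparator order.
    static List<String> drain() {
        PriorityQueue<TermPosition> queue = new PriorityQueue<>(ORDER);
        queue.add(new TermPosition("hair_brown", "bin2", "p1"));
        queue.add(new TermPosition("hair_blond", "bin1", "p9"));
        queue.add(new TermPosition("hair_brown", "bin1", "p3"));
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermPosition p = queue.poll();
            out.add(p.row + ":" + p.qualifier);
        }
        return out;
    }

    public static void main(String[] args) { System.out.println(drain()); }
}
```

With a qualifier-only comparison, bin2:p1 would sort ahead of both bin1 entries, which is the multi-row problem I think the comment is warning about.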

Finally, I'm actually trying to do something a little more complex than just 
what I described below.  This reply is already too long and has too many 
questions in it, but I'll get more detail out after I have a better handle on 
how the iterator framework works.

Thanks,
Tejay

From: William Slacum [mailto:[email protected]]
Sent: Wednesday, August 22, 2012 3:00 PM
To: [email protected]
Subject: EXTERNAL: Re: Custom Iterators

An or clause should be able to handle an enumeration of values, as that's 
supported in a JEXL expression. It would not, however, surprise me if those 
iterators could not handle multiple rows in a tablet. If you can reproduce 
that, please file a ticket. There will be a large update occurring to the Wiki 
example in the near future.

Do you have any specific questions about how you should structure your iterator 
or the contract? Making a tutorial has been on my to-do list, but we all know 
how to-do lists end up...

The big things to remember are:

1) The call order: Your iterator will be created via the default constructor, 
init() will be called, then seek(). After seek() is called, your iterator 
should have a top if there is data available. A client then can call hasTop(), 
getTopKey() and getTopValue() to check and retrieve data (similar to hasNext() 
and next()) and then next() to advance the pointer.

2) Your iterator can be destroyed during a scan and then reconstructed, being 
passed in the last key returned to the client as the start of the range.

3) You can have multiple sources feed into a single iterator in a tree-like 
fashion by clone()'ing the source passed in to init().
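
Points 1 and 2 can be sketched roughly as follows. This is a simplified model rather than the real API: SketchIterator and LifecycleDemo are hypothetical names, keys are plain strings, and it assumes the rebuilt iterator is seeked with the last returned key as an exclusive start. The point is that an iterator holding no state beyond what init() and seek() give it produces the same results whether or not it is torn down mid-scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the lifecycle; NOT the Accumulo API.
final class SketchIterator {
    private List<String> source;
    private int pos;

    SketchIterator() {}                                       // 1: default constructor
    void init(List<String> source) { this.source = source; }  // 2: init() supplies the source

    // 3: seek() positions the iterator on the first key in the range.
    void seek(String start, boolean startInclusive) {
        pos = 0;
        while (pos < source.size()) {
            int c = source.get(pos).compareTo(start);
            if (c > 0 || (c == 0 && startInclusive)) break;
            pos++;
        }
    }
    boolean hasTop() { return pos < source.size(); }
    String getTopKey() { return source.get(pos); }
    void next() { pos++; }
}

final class LifecycleDemo {
    // Scans every key, discarding and rebuilding the iterator after each batch.
    static List<String> scanWithTeardown(List<String> keys, int batchSize) {
        List<String> out = new ArrayList<>();
        String lastKey = null;
        boolean more = true;
        while (more) {
            SketchIterator it = new SketchIterator();
            it.init(keys);
            if (lastKey == null) {
                it.seek("", true);        // first pass: start of the sorted keys
            } else {
                it.seek(lastKey, false);  // resume just after the last key returned
            }
            int returned = 0;
            while (it.hasTop() && returned < batchSize) {
                lastKey = it.getTopKey();
                out.add(lastKey);
                it.next();
                returned++;
            }
            more = it.hasTop();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scanWithTeardown(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Any state the iterator accumulates between calls would be lost at the teardown, so it has to be recoverable from the seek range alone.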
On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E 
<[email protected]<mailto:[email protected]>> wrote:
All,
I'm interested in writing a custom iterator, and I've been looking for 
documentation on how to do so.  Thus far, I've not been able to find anything 
beyond the java docs in SortedKeyValueIterator and a few other sub-classes.  A 
few of the examples use Iterators, but provide no real info on how to properly 
implement one.  Is there anywhere to find general guidance on the iterator 
stack?

(If you're interested)
Specifically, for those that are curious, I'm trying to implement something 
similar to the wikisearch example, but with some key differences.  In my case, 
I've got a file with various attributes that are being indexed.  So for each file 
there are 5 attributes, and each attribute has a fixed number of possible 
values.  For example (totally made up):
personID, gender, hair color, country, race, personRecord

Row:binID; ColFam:Attribute_AttributeValue; ColQ:personID; Val:blank
AND
Row:binID; ColFam:"D"; ColQ:personID; Val:personRecord

A typical query would be:
Give me the personRecord for all people with:
Gender: male &
Hair color: blond or brown &
Country: USA or England or china or korea &
Race: white or oriental
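
The logic I am after, for a single bin, can be sketched in memory like this. It is not an iterator and does not touch Accumulo; AttributeQuery and the sample person IDs are made up for illustration, and the race clause is left out for brevity. Each clause is the union of the person IDs under its column families, and the clauses are then intersected.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of one bin's index entries:
// column family "attribute_value" -> person IDs (the column qualifiers).
final class AttributeQuery {
    // OR within each clause, AND across clauses.
    static Set<String> evaluate(Map<String, Set<String>> index, List<List<String>> clauses) {
        Set<String> result = null;
        for (List<String> anyOf : clauses) {
            Set<String> union = new TreeSet<>();
            for (String family : anyOf) {
                union.addAll(index.getOrDefault(family, Collections.emptySet()));
            }
            if (result == null) result = union; else result.retainAll(union);
        }
        return result == null ? new TreeSet<>() : result;
    }

    static Set<String> demo() {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("gender_male", new TreeSet<>(Arrays.asList("p1", "p2", "p3", "p5")));
        index.put("hair_blond", new TreeSet<>(Arrays.asList("p1")));
        index.put("hair_brown", new TreeSet<>(Arrays.asList("p3", "p4", "p5")));
        index.put("country_USA", new TreeSet<>(Arrays.asList("p1")));
        index.put("country_England", new TreeSet<>(Arrays.asList("p5")));
        index.put("country_korea", new TreeSet<>(Arrays.asList("p4")));
        index.put("country_france", new TreeSet<>(Arrays.asList("p3")));
        List<List<String>> clauses = Arrays.asList(
            Arrays.asList("gender_male"),
            Arrays.asList("hair_blond", "hair_brown"),
            Arrays.asList("country_USA", "country_England", "country_china", "country_korea"));
        return evaluate(index, clauses);
    }

    public static void main(String[] args) { System.out.println(demo()); }
}
```

The resulting person IDs would then be looked up under the "D" column family to fetch each personRecord.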

The existing Iterators used in the wikisearch example are unable to handle the 
"or" clauses in each attribute.
The OrIterator doesn't appear to handle the possibility of more than one row 
per tablet.

Thanks,
Tejay Cardon
