Re: EXTERNAL: Re: Custom Iterators

Josh Elser Wed, 22 Aug 2012 15:54:51 -0700

... and I just realized I was looking at the OrIterator in trunk, not contrib/wikisearch x.x

Still, I think most of my comments apply. Should verify with test cases...


On 08/22/2012 06:44 PM, Josh Elser wrote:

You could compare clone()'ing multiple sources inside of an iterator to maintaining multiple pointers at different offsets to a file on disk. The clone()'ed iterators are all operating over the same row; however, they are all pointing at different offsets (keys).
Concretely, the OrIterator is sent a list of terms to union, and clone()'s the source it was given for each term (note the addTerm() method on the class). The OrIterator attempts to find the index entries for each term, and return the minimum docid to satisfy the SortedKeyValueIterator contract.
Given your comment on the TermSource.compareTo() method's comment (....), yes, it does appear that you have found a bug. That comment about "multiple rows in a tablet" should really be removed, IMO. It's rather confusing, and shouldn't matter when you're writing an iterator. In other words, you, as a developer, don't need to know what rows are contained in a tablet. The only issue you need to worry about is if you're trying to do some operation *across* rows. Given that all of the index entries for a single document are contained in one row (which happens to just be a bucket in the Wiki application), this point is meaningless.
You might also note that the next() method on the OrIterator doesn't check if the new topKey for the term it just advanced is contained in the current Range before adding it back to the PriorityQueue. This could cause a term that has passed outside of the initial Range provided to seek() to be added unnecessarily to said PriorityQueue.
+2 bugs
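A minimal sketch of the union described above, including the range check before a term goes back into the PriorityQueue. It uses only java.util; TermCursor, OrUnionSketch, and the [start, end) docid range are hypothetical stand-ins for the clone()'ed sources and the Range passed to seek(), not the actual OrIterator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for one clone()'ed source positioned on a single
// term's index entries; it yields that term's docids in sorted order.
class TermCursor implements Comparable<TermCursor> {
    private final Iterator<String> docIds;
    String top; // current docid, or null when the term is exhausted

    TermCursor(List<String> sortedDocIds) {
        this.docIds = sortedDocIds.iterator();
        advance();
    }

    void advance() {
        top = docIds.hasNext() ? docIds.next() : null;
    }

    @Override
    public int compareTo(TermCursor other) {
        // Only cursors with a non-null top are ever placed in the queue.
        return top.compareTo(other.top);
    }
}

class OrUnionSketch {
    // Returns the docids in [start, end) matched by any term, sorted, no duplicates.
    static List<String> union(List<TermCursor> cursors, String start, String end) {
        PriorityQueue<TermCursor> queue = new PriorityQueue<>();
        for (TermCursor c : cursors) {
            // Skip entries before the range start (a stand-in for seek()).
            while (c.top != null && c.top.compareTo(start) < 0) {
                c.advance();
            }
            if (inRange(c.top, end)) {
                queue.add(c);
            }
        }
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermCursor smallest = queue.poll(); // term with the minimum docid
            String docId = smallest.top;
            if (result.isEmpty() || !result.get(result.size() - 1).equals(docId)) {
                result.add(docId);
            }
            smallest.advance();
            // Re-queue the term only if its new top is still inside the range.
            if (inRange(smallest.top, end)) {
                queue.add(smallest);
            }
        }
        return result;
    }

    private static boolean inRange(String docId, String end) {
        return docId != null && docId.compareTo(end) < 0;
    }

    public static void main(String[] args) {
        List<TermCursor> cursors = Arrays.asList(
            new TermCursor(Arrays.asList("doc1", "doc4", "doc7")),
            new TermCursor(Arrays.asList("doc2", "doc4", "doc9")));
        System.out.println(union(cursors, "doc2", "doc8"));
    }
}
```

In this sketch a term whose next docid falls outside the range is simply dropped, so the queue only ever holds terms that can still contribute a result.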

On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
William,
Thanks for the quick response. Let me start by stating what I understand about Iterators (to be sure I’m not completely off my rocker).
1. An iterator receives, as its source, another iterator (by way of the init method), which becomes its source of data.
2. When seek is called on an iterator, the iterator should respond by moving the pointer to the first key/value that applies to that iterator and is within the range
a. Depending on the iterator, that may not be the first key in the range
b. Only keys (and their corresponding values) which include one of the column families listed in the family list should be available as topKey and topValue. (This restriction should continue until seek is called again, meaning that subsequent calls to next will only proceed to key/values that also match the list provided.)
c. Generally speaking, a seek will result in the iterator calling seek on its source iterator (although the parameters passed in may be different)
3. If an iterator needs configuration beyond just the source obtained in the init call, it can get that through the options and/or env.
4. Iterators do not necessarily return the same types of key/values as they consume. i.e., a Combiner may call next() and getTopValue multiple times each time those methods are called on it. And the value it returns as topKey may be a key that doesn’t actually exist in the datastore itself.
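Point 4 can be illustrated with a small sketch, assuming a hypothetical summing wrapper over an in-memory list of sorted {key, value} pairs. SummingSketch is a stand-in for the idea of a Combiner, not Accumulo's Combiner class: one call to its next() may consume several entries from the source, and the value it exposes never existed as a stored entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical summing wrapper; not Accumulo's Combiner.
class SummingSketch {
    private final List<String[]> source; // sorted {key, value} pairs; keys may repeat
    private int pos = 0;
    private String topKey;
    private long topValue;

    SummingSketch(List<String[]> source) {
        this.source = source;
        next(); // position on the first combined entry
    }

    boolean hasTop() { return topKey != null; }
    String getTopKey() { return topKey; }
    long getTopValue() { return topValue; }

    // One call here may consume several entries from the source.
    void next() {
        if (pos >= source.size()) {
            topKey = null;
            return;
        }
        topKey = source.get(pos)[0];
        topValue = 0;
        while (pos < source.size() && source.get(pos)[0].equals(topKey)) {
            topValue += Long.parseLong(source.get(pos)[1]);
            pos++;
        }
    }

    // Reads the wrapper to the end and renders each combined entry as key=value.
    static List<String> drain(List<String[]> source) {
        SummingSketch it = new SummingSketch(source);
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[] {"a", "1"},
            new String[] {"a", "2"},
            new String[] {"b", "5"});
        System.out.println(drain(entries));
    }
}
```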
So my questions:
Is it correct that once seek is called, only topKeys that conform to the columnFamilies collection should be returned. And that this behavior persists until seek is called again, even when next has been called?
How do iterators like the OrIterator obtain multiple sources? (I assume you were trying to address that with #3 in your response, but I don’t understand what you mean by clone()ing the source. That would give me copies of the one source, but not multiple sources)
Why do some iterators have so many constructors if the system will simply construct them from the default constructor?
Some iterators (such as OrIterator) throw an exception if init is called. How do these iterators get constructed and initialized?
If OrIterator can do what I’m asking for, how do I get it the “terms” and what format do they come in? You mentioned JEXL expressions, but I haven’t seen anything about them in the documentation.
As for my statement about the OrIterator and multiple rows, the comments on the compareTo for OrIterator.TermSource state “If your implementation can have more than one row in a tablet, you must compare row key here first, then column qualifier.” But the code does not do so. It may be that I’m just not fully understanding the code, however.
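For illustration, a minimal sketch of the ordering that comment describes: compare the row first, then the column qualifier. SimpleKey and its fields are hypothetical stand-ins, not Accumulo's Key class or the actual TermSource code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a term source's top key; not Accumulo's Key class.
class SimpleKey implements Comparable<SimpleKey> {
    final String row;
    final String columnQualifier; // holds the docid in the index layout discussed here

    SimpleKey(String row, String columnQualifier) {
        this.row = row;
        this.columnQualifier = columnQualifier;
    }

    @Override
    public int compareTo(SimpleKey other) {
        // Compare the row first so keys from different rows never interleave.
        int cmp = row.compareTo(other.row);
        if (cmp != 0) {
            return cmp;
        }
        // Same row: order by column qualifier.
        return columnQualifier.compareTo(other.columnQualifier);
    }

    @Override
    public String toString() {
        return row + ":" + columnQualifier;
    }
}

class RowFirstCompareSketch {
    // Sorts three sample keys spread over two rows and renders them as row:qualifier.
    static List<String> sortedKeys() {
        List<SimpleKey> keys = new ArrayList<>();
        keys.add(new SimpleKey("bin2", "doc1"));
        keys.add(new SimpleKey("bin1", "doc9"));
        keys.add(new SimpleKey("bin1", "doc3"));
        Collections.sort(keys);
        List<String> out = new ArrayList<>();
        for (SimpleKey k : keys) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedKeys());
    }
}
```

Comparing on the qualifier alone would place bin2:doc1 ahead of both bin1 entries, which is the kind of mis-ordering the comment warns about when a tablet holds more than one row.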
Finally, I’m actually trying to do something a little more complex than just what I described below. This reply is already too long and has too many questions in it, but I’ll get more detail out after I have a better handle on how the iterator framework works.
Thanks,

Tejay

*From:*William Slacum [mailto:[email protected]]
*Sent:* Wednesday, August 22, 2012 3:00 PM
*To:* [email protected]
*Subject:* EXTERNAL: Re: Custom Iterators
An or clause should be able to handle an enumeration of values, as that's supported in a JEXL expression. It would not, however, surprise me if those iterators could not handle multiple rows in a tablet. If you can reproduce that, please file a ticket. There will be a large update occurring to the Wiki example in the near future.
Do you have any specific questions about how you should structure your iterator or the contract? Making a tutorial has been on my to-do list, but we all know how to-do lists end up...
The big things to remember are:
1) The call order: Your iterator will be created via the default constructor, init() will be called, then seek(). After seek() is called, your iterator should have a top if there is data available. A client then can call hasTop(), getTopKey() and getTopValue() to check and retrieve data (similar to hasNext() and next()) and then next() to advance the pointer.
2) Your iterator can be destroyed during a scan and then reconstructed, being passed in the last key returned to the client as the start of the range.
3) You can have multiple sources feed into a single iterator in a tree-like fashion by clone()'ing the source passed in to init.
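A minimal sketch of points 1 and 2, assuming a much-simplified, hypothetical MiniIterator over an in-memory sorted map (not Accumulo's SortedKeyValueIterator, whose init() and seek() take more parameters). The scan loop throws the iterator away after every batch and rebuilds it, re-seeking from the last key returned. The sketch treats that key as an exclusive start so the entry is not returned twice.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical, much-simplified iterator; not Accumulo's SortedKeyValueIterator.
class MiniIterator {
    private NavigableMap<String, String> source;
    private Iterator<Map.Entry<String, String>> cursor;
    private Map.Entry<String, String> top;

    MiniIterator() {} // created through the default constructor

    void init(NavigableMap<String, String> source) {
        this.source = source;
    }

    // Positions on the first entry after lastKey, or on the first entry when lastKey is null.
    void seek(String lastKey) {
        NavigableMap<String, String> view =
            lastKey == null ? source : source.tailMap(lastKey, false);
        cursor = view.entrySet().iterator();
        next();
    }

    boolean hasTop() { return top != null; }
    String getTopKey() { return top.getKey(); }
    String getTopValue() { return top.getValue(); }
    void next() { top = cursor.hasNext() ? cursor.next() : null; }
}

class CallOrderSketch {
    // Scans all entries, discarding and rebuilding the iterator after every batchSize entries.
    static List<String> scan(NavigableMap<String, String> data, int batchSize) {
        List<String> returned = new ArrayList<>();
        String lastKey = null;
        boolean more = true;
        while (more) {
            MiniIterator it = new MiniIterator(); // 1. default constructor
            it.init(data);                        // 2. init()
            it.seek(lastKey);                     // 3. seek(), resuming after the last key returned
            int n = 0;
            while (it.hasTop() && n < batchSize) {
                lastKey = it.getTopKey();
                returned.add(lastKey + "=" + it.getTopValue());
                it.next();
                n++;
            }
            more = it.hasTop();
        }
        return returned;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data = new TreeMap<>();
        data.put("a", "1");
        data.put("b", "2");
        data.put("c", "3");
        System.out.println(scan(data, 2));
    }
}
```

Because the rebuilt iterator keeps no state from its predecessor, anything it needs in order to resume has to be recoverable from the key it is re-seeked with.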
On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E <[email protected]> wrote:
All,
I’m interested in writing a custom iterator, and I’ve been looking for documentation on how to do so. Thus far, I’ve not been able to find anything beyond the java docs in SortedKeyValueIterator and a few other sub-classes. A few of the examples use Iterators, but provide no real info on how to properly implement one. Is there anywhere to find general guidance on the iterator stack?
(If you’re interested)
Specifically, for those that are curious, I’m trying to implement something similar to the wikisearch example, but with some key differences. In my case, I’ve got a file with various attributes that are being indexed. So for each file there are 5 attributes, and each attribute has a fixed number of possible values. For example (totally made up):
personID, gender, hair color, country, race, personRecord

Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank

AND
Row:binID; ColFam:“D”; ColQ:personID; value:personRecord

A typical query would be:

Give me the personRecord for all people with:

Gender: male &

Hair color: blond or brown &

Country: USA or England or china or korea &

Race: white or oriental
The existing Iterators used in the wikisearch example are unable to handle the “or” clauses in each attribute.
The OrIterator doesn’t appear to handle the possibility of more than one row per tablet
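The query above amounts to an intersection across attributes of a union within each attribute. A minimal in-memory sketch of that logic follows, assuming a hypothetical map from column family ("attribute_value") to the personIDs stored as column qualifiers in one bin. It only illustrates the set logic, not how an Accumulo iterator stack would evaluate it, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class AndOfOrsSketch {
    // index: column family "attribute_value" -> personIDs within one bin (hypothetical layout)
    // clauses: one inner list per attribute; the terms inside a list are OR'ed, the lists are AND'ed
    static SortedSet<String> query(Map<String, SortedSet<String>> index,
                                   List<List<String>> clauses) {
        SortedSet<String> result = null;
        for (List<String> orTerms : clauses) {
            SortedSet<String> union = new TreeSet<>();
            for (String term : orTerms) {
                union.addAll(index.getOrDefault(term, Collections.<String>emptySortedSet()));
            }
            if (result == null) {
                result = union;          // first clause seeds the result
            } else {
                result.retainAll(union); // later clauses intersect with it
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    // Made-up sample data and a query shaped like the one in the message.
    static List<String> demo() {
        Map<String, SortedSet<String>> index = new HashMap<>();
        index.put("gender_male", new TreeSet<>(Arrays.asList("p1", "p2", "p3")));
        index.put("hair_blond", new TreeSet<>(Arrays.asList("p1")));
        index.put("hair_brown", new TreeSet<>(Arrays.asList("p3", "p4")));
        index.put("country_USA", new TreeSet<>(Arrays.asList("p1", "p3")));
        List<List<String>> clauses = Arrays.asList(
            Arrays.asList("gender_male"),
            Arrays.asList("hair_blond", "hair_brown"),
            Arrays.asList("country_USA", "country_England"));
        return new ArrayList<>(query(index, clauses));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The personIDs that survive every clause would then be looked up under the “D” column family to fetch each personRecord.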
Thanks,

Tejay Cardon
