And I'm actually looking at the OrIterator in 1.4.1. I really need to pull trunk just for the additional insights it may give me, but ultimately I'll be running on the 1.4.1 release.
Tejay

-----Original Message-----
From: Josh Elser [mailto:[email protected]]
Sent: Wednesday, August 22, 2012 5:55 PM
To: [email protected]
Subject: Re: EXTERNAL: Re: Custom Iterators

... and I just realized I was looking at the OrIterator in trunk, not
contrib/wikisearch x.x

Still, I think most of my comments still apply. Should verify with test
cases...

On 08/22/2012 06:44 PM, Josh Elser wrote:
> You could compare clone()'ing multiple sources inside of an iterator
> to maintaining multiple pointers at different offsets to a file on
> disk. The clone()'ed iterators are all operating over the same row;
> however, they are all pointing at different offsets (keys).
>
> Concretely, the OrIterator is sent a list of terms to union, and
> clone()'s the source it was given for each term (note the addTerm()
> method on the class). The OrIterator attempts to find the index
> entries for each term, and return the minimum docid to satisfy the
> SortedKeyValueIterator contract.
>
> Given your comment on the TermSource.compareTo() method's comment
> (....), yes, it does appear that you have found a bug. That comment
> about "multiple rows in a tablet" should really be removed, IMO. It's
> rather confusing, and shouldn't matter when you're writing an
> iterator. In other words, you, as a developer, don't need to know what
> rows are contained in a tablet. The only issue you need to worry about
> is if you're trying to do some operation *across* rows. Given that all
> of the index entries for a single document are contained in one row
> (which happens to just be a bucket in the Wiki application), this
> point is meaningless.
>
> You might also note that the next() method on the OrIterator doesn't
> check if the new topKey for the term it just advanced is contained in
> the current Range before adding it back to the PriorityQueue. This
> could cause a term that has passed outside of the initial Range
> provided to seek() to be added unnecessarily to said PriorityQueue.
>
> +2 bugs
>
> On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
>>
>> William,
>>
>> Thanks for the quick response. Let me start by stating what I
>> understand about Iterators (to be sure I'm not completely off my
>> rocker).
>>
>> 1. An iterator receives, as its source, another iterator (by way of
>> the init method), which becomes its source of data.
>>
>> 2. When seek is called on an iterator, the iterator should respond by
>> moving the pointer to the first key/value that applies to that
>> iterator and is within the range.
>>
>> a. Depending on the iterator, that may not be the first key in the
>> range.
>>
>> b. Only keys (and their corresponding values) which include one of
>> the column families listed in the family list should be available as
>> topKey and topValue. (This restriction should continue until seek is
>> called again, meaning that subsequent calls to next will only proceed
>> to key/values that also match the list provided.)
>>
>> c. Generally speaking, a seek will result in the iterator calling
>> seek on its source iterator (although the parameters passed in may be
>> different).
>>
>> 3. If an iterator needs configuration beyond just the source obtained
>> in the init call, it can get that through the options and/or env.
>>
>> 4. Iterators do not necessarily return the same types of key/values
>> as they consume, i.e., a Combiner may call next() and getTopValue
>> multiple times each time those methods are called on it. And the
>> value it returns as topKey may be a key that doesn't actually exist
>> in the datastore itself.
>>
>> So my questions:
>>
>> Is it correct that once seek is called, only topKeys that conform to
>> the columnFamilies collection should be returned? And that this
>> behavior persists until seek is called again, even when next has been
>> called?
>>
>> How do iterators like the OrIterator obtain multiple sources?
>> (I assume you were trying to address that with #3 in your response,
>> but I don't understand what you mean by clone()ing the source. That
>> would give me copies of the one source, but not multiple sources.)
>>
>> Why do some iterators have so many constructors if the system will
>> simply construct them from the default constructor?
>>
>> Some iterators (such as OrIterator) throw an exception if init is
>> called. How do these iterators get constructed and initialized?
>>
>> If OrIterator can do what I'm asking for, how do I get it the "terms"
>> and what format do they come in? You mentioned JEXL expressions, but
>> I haven't seen anything about them in the documentation.
>>
>> As for my statement about the OrIterator and multiple rows, the
>> comments on the compareTo for OrIterator.TermSource state "If your
>> implementation can have more than one row in a tablet, you must
>> compare row key here first, then column qualifier." But the code does
>> not do so. It may be that I'm just not fully understanding the code,
>> however.
>>
>> Finally, I'm actually trying to do something a little more complex
>> than just what I described below. This reply is already too long and
>> has too many questions in it, but I'll get more detail out after I
>> have a better handle on how the iterator framework works.
>>
>> Thanks,
>>
>> Tejay
>>
>> *From:* William Slacum [mailto:[email protected]]
>> *Sent:* Wednesday, August 22, 2012 3:00 PM
>> *To:* [email protected]
>> *Subject:* EXTERNAL: Re: Custom Iterators
>>
>> An or clause should be able to handle an enumeration of values, as
>> that's supported in a JEXL expression. It would not, however,
>> surprise me if those iterators could not handle multiple rows in a
>> tablet. If you can reproduce that, please file a ticket. There will
>> be a large update occurring to the Wiki example in the near future.
>>
>> Do you have any specific questions about how you should structure
>> your iterator or the contract?
>> Making a tutorial has been on my to-do list, but we all know how
>> to-do lists end up...
>>
>> The big things to remember are:
>>
>> 1) The call order: Your iterator will be created via the default
>> constructor, init() will be called, then seek(). After seek() is
>> called, your iterator should have a top if there is data available. A
>> client then can call hasTop(), getTopKey() and getTopValue() to check
>> and retrieve data (similar to hasNext() and next()) and then next()
>> to advance the pointer.
>>
>> 2) Your iterator can be destroyed during a scan and then
>> reconstructed, being passed in the last key returned to the client as
>> the start of the range.
>>
>> 3) You can have multiple sources feed into a single iterator in a
>> tree-like fashion by clone()'ing the source passed in to init.
>>
>> On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> All,
>>
>> I'm interested in writing a custom iterator, and I've been looking
>> for documentation on how to do so. Thus far, I've not been able to
>> find anything beyond the Javadocs in SortedKeyValueIterator and a
>> few other sub-classes. A few of the examples use Iterators, but
>> provide no real info on how to properly implement one. Is there
>> anywhere to find general guidance on the iterator stack?
>>
>> (If you're interested)
>>
>> Specifically, for those that are curious, I'm trying to implement
>> something similar to the wikisearch example, but with some key
>> differences. In my case, I've got a file with various attributes that
>> are being indexed. So for each file there are 5 attributes, and each
>> attribute has a fixed number of possible values.
>> For example (totally made up):
>>
>> personID, gender, hair color, country, race, personRecord
>>
>> Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>>
>> AND
>>
>> Row:binID; ColFam:"D"; ColQ:personID; value:personRecord
>>
>> A typical query would be:
>>
>> Give me the personRecord for all people with:
>>
>> Gender: male &
>>
>> Hair color: blond or brown &
>>
>> Country: USA or England or China or Korea &
>>
>> Race: white or oriental
>>
>> The existing Iterators used in the wikisearch example are unable to
>> handle the "or" clauses in each attribute.
>>
>> The OrIterator doesn't appear to handle the possibility of more than
>> one row per tablet.
>>
>> Thanks,
>>
>> Tejay Cardon
>>
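
For anyone following the thread, the union Josh describes (one cursor per term, return the minimum docid, and check the Range before putting a cursor back on the PriorityQueue) can be sketched in plain Java. This is only a rough illustration under stated assumptions and is not the Accumulo OrIterator or its API: OrUnionSketch, TermCursor, union(), and the string doc ids are hypothetical names, and each sorted list merely stands in for one clone()'d source positioned on a term's index entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for the OrIterator idea; does not use the Accumulo API. */
final class OrUnionSketch {

    /** One cursor per term over its sorted doc ids (plays the role of a clone()'d source). */
    private static final class TermCursor implements Comparable<TermCursor> {
        private final Iterator<String> ids;
        private String top; // current doc id, or null when exhausted

        TermCursor(List<String> sortedIds) {
            this.ids = sortedIds.iterator();
            advance();
        }

        void advance() {
            top = ids.hasNext() ? ids.next() : null;
        }

        /** Skip forward to the first doc id >= start (inclusive start of the range). */
        void seek(String start) {
            while (top != null && top.compareTo(start) < 0) {
                advance();
            }
        }

        /** True if the cursor still has a top that lies before the exclusive end of the range. */
        boolean inRange(String endExclusive) {
            return top != null && top.compareTo(endExclusive) < 0;
        }

        @Override
        public int compareTo(TermCursor other) {
            return top.compareTo(other.top);
        }
    }

    /** Union of all terms' doc ids within [start, endExclusive), in sorted order, without duplicates. */
    static List<String> union(List<List<String>> termPostings, String start, String endExclusive) {
        PriorityQueue<TermCursor> queue = new PriorityQueue<>();
        for (List<String> postings : termPostings) {
            TermCursor cursor = new TermCursor(postings);
            cursor.seek(start);
            if (cursor.inRange(endExclusive)) {
                queue.add(cursor);
            }
        }
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermCursor cursor = queue.poll(); // cursor holding the minimum doc id
            String id = cursor.top;
            if (result.isEmpty() || !result.get(result.size() - 1).equals(id)) {
                result.add(id);
            }
            cursor.advance();
            // The range check discussed in the thread: only re-add a cursor that is still in range.
            if (cursor.inRange(endExclusive)) {
                queue.add(cursor);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> postings = Arrays.asList(
                Arrays.asList("d1", "d3", "d5"),   // e.g. hair color "blond" (made-up data)
                Arrays.asList("d2", "d3", "d9"));  // e.g. hair color "brown" (made-up data)
        System.out.println(union(postings, "d2", "d6")); // prints [d2, d3, d5]
    }
}
```

Without the inRange() check before queue.add(), the second cursor would be re-added with "d9" even though it lies past the end of the range, which is the unnecessary re-add described above.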
