I am trying a simple two-step strategy. First, an iterator looks for all the unique values of a property that fall inside the queried range. Then, if the number of unique values exceeds a predefined threshold, I give up on the index and fall back to a full scan over the data; otherwise, another set of iterators computes the intersection and the union on the index using the values retrieved by the first iterator. This way, if I have a query like A ∈ [500,000 - 600,000] and the seek range is fairly small, I may end up with just a few unique values for property A.

The wikisearch example inspired this approach. In particular, I am looking at the UniqFieldNameValueIterator for implementing the first iterator, although I am not sure it works correctly. Has anybody ever played with it?
Thanks.
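A rough sketch of the two-step strategy described above, written as a plain Python model rather than as Accumulo iterators. Everything here is an assumption made for illustration: the in-memory index layout (sorted `(value, doc_id)` pairs per property), the threshold, and all function names are hypothetical and do not correspond to the wikisearch code or to any Accumulo API.

```python
from bisect import bisect_left

def unique_values_in_range(entries, lo, hi, limit):
    """Step 1: distinct indexed values in [lo, hi].

    `entries` is a list of (value, doc_id) pairs sorted by value, then doc_id.
    Returns None as soon as more than `limit` distinct values are seen.
    """
    values = []
    i = bisect_left(entries, (lo,))  # first entry with value >= lo
    while i < len(entries) and entries[i][0] <= hi:
        value = entries[i][0]
        if not values or values[-1] != value:
            values.append(value)
            if len(values) > limit:
                return None  # too many distinct values: give up on the index
        i += 1
    return values

def doc_ids_for_values(entries, values):
    """Union of the doc ids indexed under each discrete value."""
    ids = set()
    for value in values:
        i = bisect_left(entries, (value,))
        while i < len(entries) and entries[i][0] == value:
            ids.add(entries[i][1])
            i += 1
    return ids

def query(index, data, ranges, threshold):
    """Step 2: AND together range conditions given as {property: (lo, hi)}."""
    per_property = []
    for prop, (lo, hi) in ranges.items():
        values = unique_values_in_range(index[prop], lo, hi, threshold)
        if values is None:
            # Fallback: full scan over the data section, filtering every record.
            return sorted(
                doc_id for doc_id, rec in data.items()
                if all(l <= rec[p] <= h for p, (l, h) in ranges.items())
            )
        per_property.append(doc_ids_for_values(index[prop], values))
    if not per_property:
        return sorted(data)
    return sorted(set.intersection(*per_property))

# Hypothetical contents of a single bin.
data = {
    "d0": {"A": 499999, "B": 0.5},
    "d1": {"A": 500000, "B": 0.2},
    "d2": {"A": 500000, "B": 0.5},
    "d3": {"A": 550000, "B": 0.3},
    "d4": {"A": 700000, "B": 0.05},
}
index = {
    prop: sorted((rec[prop], doc_id) for doc_id, rec in data.items())
    for prop in ("A", "B")
}
ranges = {"A": (500000, 600000), "B": (0.1, 0.4)}

print(query(index, data, ranges, threshold=10))  # index path
print(query(index, data, ranges, threshold=1))   # full-scan fallback
```

Both calls should return the same documents; only the access path differs. The early return in step 1 is what bounds the cost of the probe when the range turns out to contain many distinct values.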

Regards,
Max



From: Christopher <ctubb...@apache.org>
To: user@accumulo.apache.org, "Kepner, Jeremy - 0553 - MITLL" <kep...@ll.mit.edu>
Date: 10/01/2017 04:46
Subject: Re: is there any "trick" to save the state of an iterator?



FWIW, there is an open pull request on that issue that puts the work very near to completion. It could probably use a bit more testing and review, though.

On Mon, Jan 9, 2017 at 9:37 PM Josh Elser <josh.el...@gmail.com> wrote:

And yet, Accumulo still doesn't have the API to safely do it.

See ACCUMULO-1280 if you'd like to contribute towards to those efforts for the community.

On Jan 9, 2017 20:23, "Jeremy Kepner" <kep...@ll.mit.edu> wrote:

It's done in D4M (d4m.mit.edu), you might look there.
Dylan can explain (if necessary).

Regards. -Jeremy

On Mon, Jan 09, 2017 at 07:30:03PM -0500, Josh Elser wrote:
> Great. Glad I wasn't derailing things :)
>
> Unfortunately, I don't think this is a very well-documented area of the
> code (it's quite advanced and would just confuse most users).
>
> I'll have to think about it some more and see if I can come up with
> anything clever. I know there are some others subscribed to this list
> who might be more clever than I am -- I'm sure they'll weigh in if they
> have any suggestions.
>
> Finally, if you're interested in helping us put together some sort of
> "advanced indexing" docs for the project, I'm sure we could find a few
> people who would be happy to get something published on the Accumulo
> website.
>
> Massimilian Mattetti wrote:
> > Thank you for your answer John, you understood perfectly what my use
> > case is.
> >
> > The possible solutions that you propose came to mind to me, too. This
> > confirms to me that, unfortunately, there is no fancy way to overcome
> > this problem.
> >
> > Is there any good documentation on different query planning for Accumulo
> > that could help with my use case?
> > Thanks.
> >
> > Regards,
> > Max
> >
> >
> > From: Josh Elser <josh.el...@gmail.com>
> > To: user@accumulo.apache.org
> > Date: 09/01/2017 21:55
> > Subject: Re: is there any "trick" to save the state of an iterator?
> > ------------------------------------------------------------------------
> >
> > Hey Max,
> >
> > There is no provided mechanism to do this, and this is a problem with
> > supporting "range queries". I'm hoping I'm understanding your use-case
> > correctly; sorry in advance if I'm going off on a tangent.
> >
> > When performing the standard sort-merge join across some columns to
> > implement intersections and unions, the un-sorted range of values you
> > want to scan over (500k-600k) breaks the ordering of the docIds which
> > you are trying to catch.
> >
> > The trivial solution is to convert a range into a union of discrete
> > values (500000 || 500001 || 500002 || ..) but you can see how this
> > quickly falls apart. An inverted index could be used to enumerate the
> > values that exist in the range.
> >
> > Another trivial solution would be to select all records matching the
> > smaller condition, and then post-filter the other condition.
> >
> > There might be some more trickier query planning decisions you could
> > also experiment with (I'd have to give it lots more thought). In short,
> > I'd recommend against trying to solve the problem via saving state.
> > Architecturally, this is just not something that Accumulo Iterators are
> > designed to support at this time.
> >
> > - Josh
> >
> > Massimilian Mattetti wrote:
> > > Hi all,
> > >
> > > I am working with a Document-Partitioned Index table whose index
> > > sections are accessed using ranges over the indexed properties (e.g.
> > > property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4], etc.). The
> > > iterator that handles this table works by: 1st - calculating (doing
> > > intersection and union on different properties) all the result from the
> > > index section of a single bin; 2nd - using the ids retrieved from the
> > > index, it goes over the data section of the specific bin.
> > > This iterator has proved to have significant performance penalty
> > > whenever the amount of data retrieved from the index is orders of
> > > magnitude bigger than the table_scan_max_memory i.e. the iterator is
> > > teardown tens of times for each bin. Since there is no explicit way to
> > > save the state of an iterator, is there any other mechanism/approach
> > > that I could use/follow in order to avoid to re-calculate the index
> > > result set after each teardown?
> > > Thanks.
> > >
> > >
> > > Regards,
> > > Max
> > >
> > .
> >
> >

--
Christopher
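
For reference, the post-filter alternative Josh mentions in the quoted thread (select records by one condition from the index, then check the other conditions on the fetched records) might look roughly like the following. This is only an illustrative Python model under assumed data structures: the sorted `(value, doc_id)` index layout and the names `ids_in_range` and `select_then_post_filter` are hypothetical, not Accumulo APIs.

```python
from bisect import bisect_left

def ids_in_range(entries, lo, hi):
    """Yield doc ids whose indexed value lies in [lo, hi].

    `entries` is a list of (value, doc_id) pairs sorted by value, then doc_id.
    """
    i = bisect_left(entries, (lo,))  # first entry with value >= lo
    while i < len(entries) and entries[i][0] <= hi:
        yield entries[i][1]
        i += 1

def select_then_post_filter(index, data, driving, others):
    """Answer `driving` from the index, then filter on `others` per record.

    `driving` is (property, lo, hi); `others` is a list of such triples.
    """
    prop, lo, hi = driving
    hits = []
    for doc_id in ids_in_range(index[prop], lo, hi):
        record = data[doc_id]  # lookup in the data section of the bin
        if all(l <= record[p] <= h for p, l, h in others):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical contents of a single bin.
data = {
    "d1": {"A": 500000, "B": 0.2},
    "d2": {"A": 500000, "B": 0.5},
    "d3": {"A": 550000, "B": 0.3},
    "d4": {"A": 700000, "B": 0.05},
}
index = {"B": sorted((rec["B"], doc_id) for doc_id, rec in data.items())}

result = select_then_post_filter(
    index, data, driving=("B", 0.1, 0.4), others=[("A", 500000, 600000)]
)
print(result)
```

Because the candidates are streamed one at a time and filtered immediately, this variant never builds a large intermediate result set from the index, which is the part that had to be recomputed after each teardown in the original question.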