FWIW, there is an open pull request on that issue that puts the work very near to completion. It could probably use a bit more testing and review, though.
On Mon, Jan 9, 2017 at 9:37 PM Josh Elser <josh.el...@gmail.com> wrote:

> And yet, Accumulo still doesn't have the API to safely do it.
>
> See ACCUMULO-1280 if you'd like to contribute towards those efforts for
> the community.
>
> On Jan 9, 2017 20:23, "Jeremy Kepner" <kep...@ll.mit.edu> wrote:
>
> It's done in D4M (d4m.mit.edu), you might look there.
> Dylan can explain (if necessary).
> Regards. -Jeremy
>
> On Mon, Jan 09, 2017 at 07:30:03PM -0500, Josh Elser wrote:
> > Great. Glad I wasn't derailing things :)
> >
> > Unfortunately, I don't think this is a very well-documented area of the
> > code (it's quite advanced and would just confuse most users).
> >
> > I'll have to think about it some more and see if I can come up with
> > anything clever. I know there are some others subscribed to this list
> > who might be more clever than I am -- I'm sure they'll weigh in if they
> > have any suggestions.
> >
> > Finally, if you're interested in helping us put together some sort of
> > "advanced indexing" docs for the project, I'm sure we could find a few
> > people who would be happy to get something published on the Accumulo
> > website.
> >
> > Massimilian Mattetti wrote:
> > > Thank you for your answer John, you understood perfectly what my use
> > > case is.
> > >
> > > The possible solutions that you propose came to my mind, too. This
> > > confirms to me that, unfortunately, there is no fancy way to overcome
> > > this problem.
> > >
> > > Is there any good documentation on different query planning for
> > > Accumulo that could help with my use case?
> > > Thanks.
> > >
> > > Regards,
> > > Max
> > >
> > >
> > > From: Josh Elser <josh.el...@gmail.com>
> > > To: user@accumulo.apache.org
> > > Date: 09/01/2017 21:55
> > > Subject: Re: is there any "trick" to save the state of an iterator?
> > > ------------------------------------------------------------------------
> > >
> > > Hey Max,
> > >
> > > There is no provided mechanism to do this, and this is a problem with
> > > supporting "range queries". I'm hoping I'm understanding your use-case
> > > correctly; sorry in advance if I'm going off on a tangent.
> > >
> > > When performing the standard sort-merge join across some columns to
> > > implement intersections and unions, the un-sorted range of values you
> > > want to scan over (500k-600k) breaks the ordering of the docIds which
> > > you are trying to catch.
> > >
> > > The trivial solution is to convert a range into a union of discrete
> > > values (500000 || 500001 || 500002 || ..) but you can see how this
> > > quickly falls apart. An inverted index could be used to enumerate the
> > > values that exist in the range.
> > >
> > > Another trivial solution would be to select all records matching the
> > > smaller condition, and then post-filter the other condition.
> > >
> > > There might be some trickier query planning decisions you could
> > > also experiment with (I'd have to give it lots more thought). In short,
> > > I'd recommend against trying to solve the problem via saving state.
> > > Architecturally, this is just not something that Accumulo Iterators are
> > > designed to support at this time.
> > >
> > > - Josh
> > >
> > > Massimilian Mattetti wrote:
> > > > Hi all,
> > > >
> > > > I am working with a Document-Partitioned Index table whose index
> > > > sections are accessed using ranges over the indexed properties (e.g.
> > > > property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4], etc.).
> > > > The iterator that handles this table works by: 1st - calculating
> > > > (doing intersection and union on different properties) all the
> > > > results from the index section of a single bin; 2nd - using the ids
> > > > retrieved from the index, it goes over the data section of the
> > > > specific bin.
> > > > This iterator has proved to have a significant performance penalty
> > > > whenever the amount of data retrieved from the index is orders of
> > > > magnitude bigger than the table_scan_max_memory, i.e. the iterator is
> > > > torn down tens of times for each bin. Since there is no explicit way
> > > > to save the state of an iterator, is there any other
> > > > mechanism/approach that I could use/follow in order to avoid
> > > > re-calculating the index result set after each teardown?
> > > > Thanks.
> > > >
> > > >
> > > > Regards,
> > > > Max

-- Christopher
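
P.S. For anyone reading this thread later, here is a rough sketch of the second option Josh describes in the quoted message (select the records matching the smaller condition, then post-filter the other condition). It is plain Python over an in-memory stand-in for a single bin and does not use any Accumulo API. The names PropertyIndex, count, scan and query, as well as the sample documents, are made up for illustration. The idea is that only one index range is walked and nothing large has to be held between calls, so there is less to recompute after a teardown.

```python
from bisect import bisect_left, bisect_right


class PropertyIndex:
    """Hypothetical index section: (value, doc_id) entries sorted by value."""

    def __init__(self, docs, prop):
        entries = sorted((fields[prop], doc_id) for doc_id, fields in docs.items())
        self.values = [value for value, _ in entries]
        self.doc_ids = [doc_id for _, doc_id in entries]

    def _bounds(self, lo, hi):
        # Positions of the first and one-past-last entries inside [lo, hi].
        return bisect_left(self.values, lo), bisect_right(self.values, hi)

    def count(self, lo, hi):
        start, end = self._bounds(lo, hi)
        return end - start

    def scan(self, lo, hi):
        start, end = self._bounds(lo, hi)
        return self.doc_ids[start:end]


def query(docs, indexes, ranges):
    """Intersect range conditions: scan the index of the condition with the
    fewest matches, then post-filter the rest against the data section."""
    driver = min(ranges, key=lambda prop: indexes[prop].count(*ranges[prop]))
    results = []
    for doc_id in indexes[driver].scan(*ranges[driver]):
        fields = docs[doc_id]
        # Post-filter: check every other condition on the document itself.
        if all(lo <= fields[prop] <= hi
               for prop, (lo, hi) in ranges.items() if prop != driver):
            results.append(doc_id)
    return sorted(results)


# Made-up sample data for one bin.
docs = {
    "d1": {"A": 510_000, "B": 0.2},
    "d2": {"A": 550_000, "B": 0.9},
    "d3": {"A": 100_000, "B": 0.3},
    "d4": {"A": 599_999, "B": 0.1},
}
indexes = {prop: PropertyIndex(docs, prop) for prop in ("A", "B")}
ranges = {"A": (500_000, 600_000), "B": (0.1, 0.4)}

result = query(docs, indexes, ranges)
print(result)  # ['d1', 'd4']
```

The trade-off is that every candidate from the driving condition costs a lookup in the data section, so this only helps when one condition is much more selective than the others.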