It's done in D4M (d4m.mit.edu); you might look there.
Dylan can explain (if necessary).

Regards.  -Jeremy

On Mon, Jan 09, 2017 at 07:30:03PM -0500, Josh Elser wrote:
> Great. Glad I wasn't derailing things :)
>
> Unfortunately, I don't think this is a very well-documented area of the
> code (it's quite advanced and would just confuse most users).
>
> I'll have to think about it some more and see if I can come up with
> anything clever. I know there are some others subscribed to this list
> who might be more clever than I am -- I'm sure they'll weigh in if they
> have any suggestions.
>
> Finally, if you're interested in helping us put together some sort of
> "advanced indexing" docs for the project, I'm sure we could find a few
> people who would be happy to get something published on the Accumulo
> website.
>
> Massimilian Mattetti wrote:
> > Thank you for your answer John, you understood perfectly what my use
> > case is.
> >
> > The possible solutions that you propose had occurred to me, too. This
> > confirms to me that, unfortunately, there is no fancy way to overcome
> > this problem.
> >
> > Is there any good documentation on different query planning for Accumulo
> > that could help with my use case?
> > Thanks.
> >
> > Regards,
> > Max
> >
> > From: Josh Elser <josh.el...@gmail.com>
> > To: user@accumulo.apache.org
> > Date: 09/01/2017 21:55
> > Subject: Re: is there any "trick" to save the state of an iterator?
> > ------------------------------------------------------------------------
> >
> > Hey Max,
> >
> > There is no provided mechanism to do this, and this is a problem with
> > supporting "range queries". I'm hoping I'm understanding your use-case
> > correctly; sorry in advance if I'm going off on a tangent.
> >
> > When performing the standard sort-merge join across some columns to
> > implement intersections and unions, the un-sorted range of values you
> > want to scan over (500k-600k) breaks the ordering of the docIds which
> > you are trying to catch.
> >
> > The trivial solution is to convert a range into a union of discrete
> > values (500000 || 500001 || 500002 || ..) but you can see how this
> > quickly falls apart. An inverted index could be used to enumerate the
> > values that exist in the range.
> >
> > Another trivial solution would be to select all records matching the
> > smaller condition, and then post-filter the other condition.
> >
> > There might be some trickier query planning decisions you could
> > also experiment with (I'd have to give it lots more thought). In short,
> > I'd recommend against trying to solve the problem via saving state.
> > Architecturally, this is just not something that Accumulo Iterators are
> > designed to support at this time.
> >
> > - Josh
> >
> > Massimilian Mattetti wrote:
> > > Hi all,
> > >
> > > I am working with a Document-Partitioned Index table whose index
> > > sections are accessed using ranges over the indexed properties (e.g.
> > > property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4], etc.). The
> > > iterator that handles this table works by: 1st - calculating (doing
> > > intersection and union on different properties) all the results from
> > > the index section of a single bin; 2nd - using the ids retrieved from
> > > the index, going over the data section of the specific bin.
> > > This iterator has proved to have a significant performance penalty
> > > whenever the amount of data retrieved from the index is orders of
> > > magnitude bigger than table.scan.max.memory, i.e. the iterator is
> > > torn down tens of times for each bin. Since there is no explicit way
> > > to save the state of an iterator, is there any other mechanism/approach
> > > that I could use/follow in order to avoid recalculating the index
> > > result set after each teardown?
> > > Thanks.
> > >
> > > Regards,
> > > Max
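
For illustration, here is a minimal sketch of the second suggestion quoted above (select all records matching the smaller condition, then post-filter the other condition). It is plain Python over an in-memory stand-in for a single bin, not Accumulo's iterator API, and it does nothing about iterator teardown. The names docs, build_index, count_in_range, range_lookup and query are hypothetical, and the sample data is made up.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for the data section of one bin: docId -> properties.
docs = {
    "d1": {"A": 500_010, "B": 0.25},
    "d2": {"A": 550_000, "B": 0.90},
    "d3": {"A": 610_000, "B": 0.30},
    "d4": {"A": 599_999, "B": 0.10},
    "d5": {"A": 100, "B": 0.20},
}

def build_index(docs, prop):
    """Sorted (value, docId) pairs, standing in for the bin's index section."""
    return sorted((props[prop], doc_id) for doc_id, props in docs.items())

def _bounds(index, lo, hi):
    # Assumes docIds are non-empty strings that sort below "\uffff".
    return bisect_left(index, (lo, "")), bisect_right(index, (hi, "\uffff"))

def count_in_range(index, lo, hi):
    """Number of index entries with lo <= value <= hi, without enumerating them."""
    start, end = _bounds(index, lo, hi)
    return end - start

def range_lookup(index, lo, hi):
    """Enumerate the docIds whose indexed value lies in [lo, hi]."""
    start, end = _bounds(index, lo, hi)
    return [doc_id for _, doc_id in index[start:end]]

def query(docs, indexes, ranges):
    """Drive the scan from the most selective range, post-filter the others."""
    driver = min(ranges, key=lambda p: count_in_range(indexes[p], *ranges[p]))
    lo, hi = ranges[driver]
    result = []
    for doc_id in sorted(range_lookup(indexes[driver], lo, hi)):
        props = docs[doc_id]
        # Post-filter: check every condition against the document itself.
        if all(l <= props[p] <= h for p, (l, h) in ranges.items()):
            result.append(doc_id)
    return result

indexes = {p: build_index(docs, p) for p in ("A", "B")}
print(query(docs, indexes, {"A": (500_000, 600_000), "B": (0.1, 0.4)}))
# prints ['d1', 'd4']
```

Only the most selective index range is enumerated; the remaining conditions are checked per document, so no intersection result set has to be built up front.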