Thank you for your answer Josh, you understood my use case perfectly.

The possible solutions you propose had come to my mind, too. This 
confirms to me that, unfortunately, there is no fancy way around this 
problem. 

Is there any good documentation on query-planning strategies for Accumulo 
that could help with my use case?
Thanks.

Regards,
Max




From:   Josh Elser <josh.el...@gmail.com>
To:     user@accumulo.apache.org
Date:   09/01/2017 21:55
Subject:        Re: is there any "trick" to save the state of an iterator?



Hey Max,

There is no provided mechanism to do this, and this is a problem with
supporting "range queries". I'm hoping I'm understanding your use-case
correctly; sorry in advance if I'm going off on a tangent.

When performing the standard sort-merge join across some columns to
implement intersections and unions, the un-sorted range of values you
want to scan over (500k-600k) breaks the ordering of the docIds that
you are trying to match.
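To make the failure mode concrete, here is a minimal sketch (not Accumulo's actual iterator code) of a sort-merge intersection over two docId posting lists. The merge assumes both inputs are sorted; a range predicate hands back docIds grouped per indexed value, so the concatenated stream is no longer globally sorted and the merge's precondition fails. The `postings` layout is a hypothetical in-memory stand-in for an index section:

```python
def sort_merge_intersect(a, b):
    """Intersect two docId lists; both inputs MUST be sorted ascending."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])      # docId present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the stream that is behind
        else:
            j += 1
    return out

# Per-value posting lists are each sorted, but their concatenation across
# a value range generally is not -- which is what breaks the merge join.
postings = {500_000: [7, 42], 500_001: [3, 19]}
range_stream = postings[500_000] + postings[500_001]   # [7, 42, 3, 19]
```

Running the intersect on the unsorted `range_stream` would silently drop matches, which is why the range has to be resolved to a sorted docId stream first.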

The trivial solution is to convert a range into a union of discrete
values (500000 || 500001 || 500002 || ..) but you can see how this
quickly falls apart. An inverted index could be used to enumerate the
values that exist in the range.
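A rough sketch of that inverted-index approach, with `inverted_index` as a hypothetical in-memory stand-in (value -> sorted docId list) for an index section: enumerate only the values that actually exist in the range, then union their postings with a k-way merge so the result stays sorted for downstream joins.

```python
import heapq
from bisect import bisect_left, bisect_right

def docids_for_range(inverted_index, lo, hi):
    """Union the postings of every indexed value in [lo, hi].

    Instead of expanding the range into every possible discrete value,
    enumerate only the values present in the index (the sorted key list
    plays the role of the inverted index's term dictionary)."""
    values = sorted(inverted_index)
    in_range = values[bisect_left(values, lo):bisect_right(values, hi)]
    out = []
    # heapq.merge streams the k sorted posting lists in global docId order
    for doc_id in heapq.merge(*(inverted_index[v] for v in in_range)):
        if not out or out[-1] != doc_id:   # de-duplicate adjacent hits
            out.append(doc_id)
    return out
```

The output is again a sorted docId list, so it can feed the same sort-merge machinery as an exact-match term.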

Another trivial solution would be to select all records matching the
smaller condition, and then post-filter the other condition.
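As a sketch of that select-then-post-filter idea (all names and the in-memory `docs`/`b_index` layout are hypothetical; a real system would drive this off the index and data sections): use the more selective condition to pick candidates, then evaluate the range condition per record instead of via the index.

```python
def select_then_filter(docs, b_index, b_value, a_lo, a_hi):
    """Drive selection with the smaller condition (exact match on B),
    then post-filter the range condition on A record by record.

    docs maps docId -> record; b_index maps B-value -> sorted docId list."""
    candidates = b_index.get(b_value, [])
    return [d for d in candidates if a_lo <= docs[d]["A"] <= a_hi]

docs = {
    1: {"A": 550_000, "B": 0.2},
    2: {"A": 700_000, "B": 0.2},
    3: {"A": 510_000, "B": 0.9},
}
b_index = {0.2: [1, 2], 0.9: [3]}
```

The trade-off is the usual one: post-filtering reads every candidate record, so it only pays off when the driving condition is genuinely selective.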

There might be some trickier query-planning decisions you could
also experiment with (I'd have to give it more thought). In short,
I'd recommend against trying to solve the problem via saving state.
Architecturally, this is just not something that Accumulo Iterators are
designed to support at this time.

- Josh

Massimilian Mattetti wrote:
> Hi all,
> 
> I am working with a Document-Partitioned Index table whose index 
> sections are accessed using ranges over the indexed properties (e.g. 
> property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4], etc.). The 
> iterator that handles this table works by: 1st - calculating (doing 
> intersection and union on different properties) all the result from the 
> index section of a single bin; 2nd - using the ids retrieved from the 
> index, it goes over the data section of the specific bin.
> This iterator has proved to carry a significant performance penalty 
> whenever the amount of data retrieved from the index is orders of 
> magnitude bigger than table_scan_max_memory, i.e. the iterator is 
> torn down tens of times for each bin. Since there is no explicit way to 
> save the state of an iterator, is there any other mechanism/approach 
> that I could use/follow in order to avoid re-calculating the index 
> result set after each teardown?
> Thanks.
> 
> 
> Regards,
> Max
> 