Hello Marcel,

On Thu, Jun 18, 2009 at 1:25 PM, Marcel Reutegger<[email protected]> wrote:
> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <[email protected]> wrote:
>> If you happen to find the holy grail solution, I suppose you'll let us
>> know :-) Also if you have some memory usage numbers with and without my
>> suggestion of reducing the precision of your Date field, that would be
>> very valuable.
>
> hmm, I've been thinking about a solution that I would call
> flyweight-substring-collation-key. It assumes that there is usually a
> major overlap between substrings of the values to sort on, e.g. a
> lastModified value. So instead of always keeping the entire value, we'd
> have a collation key that references multiple reusable substrings.
>
> Assume we have the following values:
>
> - msqyw2shb
> - msqyw2t93
> - msqyw2u0v
> - msqyw2usn
> - msqyw2vkf
> - msqyw2wc7
> - msqyw2x3z
> - msqyw2xvr
> - msqyw2ynj
> - msqyw2zfb
>
> (those are date property values, each one second after the previous one)
>
> We could create collation keys for use as Comparables in the field
> cache like this:
>
> substring cache:
> [0] msq
> [1] shb
> [2] t93
> [3] u0v
> [4] usn
> [5] vkf
> [6] wc7
> [7] x3z
> [8] xvr
> [9] ynj
> [10] yw2
> [11] zfb
>
> and then the actual Comparables that reference the substrings in the cache:
>
> - {0, 10, 1}
> - {0, 10, 2}
> - {0, 10, 3}
> - {0, 10, 4}
> - {0, 10, 5}
> - {0, 10, 6}
> - {0, 10, 7}
> - {0, 10, 8}
> - {0, 10, 9}
> - {0, 10, 11}
>
> This will result in lower memory consumption, and using the reference
> indexes could even speed up the comparison.
>
> A quick test with 1 million date values showed that memory consumption
> drops to 50% with this approach.
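For the archives, here is a minimal sketch of how I read the flyweight-substring-collation-key idea: values are split into fixed-size chunks, each distinct chunk is interned once in a shared cache, and every key holds only int indexes into that cache. All class and method names here are illustrative, not Jackrabbit or Lucene API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Shared cache: each distinct substring is stored exactly once.
class SubstringCache {
    private final Map<String, Integer> index = new HashMap<String, Integer>();
    private final List<String> chunks = new ArrayList<String>();

    // returns the index of the chunk, adding it on first use
    int intern(String chunk) {
        Integer i = index.get(chunk);
        if (i == null) {
            i = chunks.size();
            chunks.add(chunk);
            index.put(chunk, i);
        }
        return i;
    }

    String get(int i) {
        return chunks.get(i);
    }
}

// Collation key: an int[] of cache indexes instead of the full String.
class FlyweightKey implements Comparable<FlyweightKey> {
    private final SubstringCache cache;
    final int[] parts;

    FlyweightKey(SubstringCache cache, String value, int chunkLen) {
        this.cache = cache;
        int n = (value.length() + chunkLen - 1) / chunkLen;
        parts = new int[n];
        for (int i = 0; i < n; i++) {
            int start = i * chunkLen;
            int end = Math.min(start + chunkLen, value.length());
            parts[i] = cache.intern(value.substring(start, end));
        }
    }

    public int compareTo(FlyweightKey other) {
        int n = Math.min(parts.length, other.parts.length);
        for (int i = 0; i < n; i++) {
            // equal indexes mean equal chunks: skip the String comparison
            if (parts[i] != other.parts[i]) {
                int c = cache.get(parts[i]).compareTo(cache.get(other.parts[i]));
                if (c != 0) {
                    return c;
                }
            }
        }
        return parts.length - other.parts.length;
    }
}
```

With the ten sample values above and a chunk length of 3, the cache holds the 12 substrings from Marcel's example, and each key is just three ints, which is where the memory saving comes from.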
I have been having similar thoughts, though I did not implement anything, and the 50% drop is roughly what I expected: I think you reduced the SharedFieldCache memory usage by something like 90%, but the SharedFieldCache accounts for only a little more than half of the memory actually used; the other 50% is kept internally in Lucene by the open IndexReader.

*If* we are able to actively flush that cache (the TermEnum cache) in our open Lucene indexes, I think we are there: the SharedFieldCache already contains the sorting data, so keeping the TermEnums as well is redundant, and we might then drop memory usage well below 50%. I looked at this quite some time ago, but did not have time to figure out how we could flush the Lucene TermEnums...

Actually, I think the sorting you describe is really valuable for date indexing, since the Lucene terms have so much in common; you would expect Lucene itself could also benefit from it. Date sorting might deserve special treatment in Lucene, as it is such a common case...

Also, I have given a lot of thought to indexing dates split into 8 separate fields of 16 bits each, which would allow fast, memory-efficient sorting *and* date range queries that are almost linear in time with respect to the length of the range and the number of documents with unique dates. I think it has some resemblance to TrieRange, though I do not know its ins and outs; my idea would be based on fractals... This is really out of scope, though, and I am totally unsure whether it is even feasible. I hope to have some time to test it someday :-))

For now, great job Marcel!! You actually go beyond the ideas and implement them! :-)

Regards
Ard

> regards
> marcel
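P.S. To make the field-splitting idea concrete: the sketch below splits a timestamp into 16-bit fields and joins them back. Note that a 64-bit millisecond timestamp needs only four such fields (the "8 fields" above would cover 128 bits); this is purely illustrative and not existing Jackrabbit or Lucene code:

```java
// Illustrative sketch: split a 64-bit millisecond timestamp into four
// 16-bit fields (fields[0] = most significant). Indexing each field
// separately would let a range query be decomposed into a small number
// of per-field clauses, similar in spirit to TrieRange.
class SplitDate {
    static int[] split(long millis) {
        int[] fields = new int[4];
        for (int i = 3; i >= 0; i--) {
            fields[i] = (int) (millis & 0xFFFF); // take the lowest 16 bits
            millis >>>= 16;
        }
        return fields;
    }

    static long join(int[] fields) {
        long millis = 0;
        for (int f : fields) {
            millis = (millis << 16) | (f & 0xFFFFL);
        }
        return millis;
    }
}
```

Sorting on the four fields in order is equivalent to sorting on the original timestamp, which is what makes the decomposition attractive.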
