Re: Query that sorts a large result set.

Ard Schrijvers Thu, 18 Jun 2009 14:20:41 -0700

Hello Marcel,

Although I like this solution, it seems to me to be suitable only for dates,
right? How do we know that we are sorting on a date: by checking
whether it has length 9, or that it starts with msq? Furthermore, I am
quite curious how you implemented the approach below. If you just used
substrings, we could gain quite a bit more, but I am not sure
whether you already do this:


Suppose

String s = "msqyw2shb";

If you have

String[] parts = new String[1];
parts[0] = s.substring(0, 3);

we reduce memory usage quite a bit more with

parts[0] = new String(s.substring(0, 3));

Also see [1]. But perhaps you are already doing this.

A small improvement we could make right away is replacing:

retArray[termDocs.doc()] = term.text().substring(prefix.length());

with

retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));

It looks a bit strange, but substring() keeps a reference to the char
array of the full term text. For dates I think the prefix is something
like "lastModified" plus a delimiter, so suppose 13 chars. Copying the
substring would bring the char array retained in memory down from 22
chars to 9 (for dates).
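
To make the arithmetic concrete, here is a minimal sketch. The class name,
the helper stripPrefix and the ':' delimiter are only assumptions for
illustration, not the actual Jackrabbit code or its real delimiter. Note
also that the saving depends on the JVM: it applies where substring()
shares the parent's char array, as described in [1]. On JVMs where
substring() already copies, the extra new String(...) is harmless but
redundant.

```java
class PrefixStrip {
    // Hypothetical delimiter between property name and value (assumption).
    static final char DELIMITER = ':';

    // Returns the value part of the term text as an independent String,
    // so the result does not keep the longer term text's char array alive
    // on JVMs where substring() shares the backing array.
    static String stripPrefix(String termText, String prefix) {
        return new String(termText.substring(prefix.length()));
    }

    public static void main(String[] args) {
        String prefix = "lastModified" + DELIMITER;  // 12 chars + 1 delimiter
        String termText = prefix + "msqyw2shb";      // full term text
        String value = stripPrefix(termText, prefix);
        // prints "22 -> 9"
        System.out.println(termText.length() + " -> " + value.length());
    }
}
```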

Furthermore, it follows that using short property names saves
memory. This dependency could be avoided in the end if we indexed each
property in its own Lucene field, instead of putting them all in
:_PROPERTIES and prefixing the value with the property name. That would
require quite some rewriting of the indexing code, I think.

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622



On Thu, Jun 18, 2009 at 1:25 PM, Marcel
Reutegger<[email protected]> wrote:
> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <[email protected]> 
> wrote:
>> If you happen to find the holy grail solution, I suppose you'll let us know
>> :-) Also if you would have some memory usage numbers with and without the
>> suggestion of mine regarding reducing the precision of your Date field, this
>> would be very valuable.
>
> hmm, I've been thinking about a solution that I would call
> flyweight-substring-collation-key. it assumes that there is usually a
> major overlap of substrings of the values to sort on. i.e. a
> lastModified value. so instead of always keeping the entire value we'd
> have a collation key that references multiple reusable substrings.
>
> assume we have the following values:
>
> - msqyw2shb
> - msqyw2t93
> - msqyw2u0v
> - msqyw2usn
> - msqyw2vkf
> - msqyw2wc7
> - msqyw2x3z
> - msqyw2xvr
> - msqyw2ynj
> - msqyw2zfb
>
> (those are date property values each 1 second after the previous one)
>
> we could create collation keys for use as comparable in the field
> cache like this:
>
> substring cache:
> [0] msq
> [1] shb
> [2] t93
> [3] u0v
> [4] usn
> [5] vkf
> [6] wc7
> [7] x3z
> [8] xvr
> [9] ynj
> [10] yw2
> [11] zfb
>
> and then the actual comparable that reference the substrings in the cache:
>
> - {0, 10, 1}
> - {0, 10, 2}
> - {0, 10, 3}
> - {0, 10, 4}
> - {0, 10, 5}
> - {0, 10, 6}
> - {0, 10, 7}
> - {0, 10, 8}
> - {0, 10, 9}
> - {0, 10, 11}
>
> this will result in a lower memory consumption and using the reference
> indexes could even speed up the comparison.
>
> a quick test with 1 million date values showed that the memory
> consumption drops to 50% with this approach.
>
> regards
>  marcel
>
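
For reference, the quoted idea could be sketched roughly as follows. This
is only my reading of it, not Marcel's implementation: the class and
method names are made up, and I assume fixed chunks of 3 chars and a
substring cache that is sorted, so that comparing cache indexes gives the
same order as comparing the strings themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class FlyweightKeys {
    static final int CHUNK = 3; // assumed chunk length

    // Splits a value into consecutive chunks of at most CHUNK chars.
    static List<String> chunks(String value) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < value.length(); i += CHUNK) {
            out.add(value.substring(i, Math.min(value.length(), i + CHUNK)));
        }
        return out;
    }

    // Builds the sorted cache of distinct chunks over all values.
    static String[] buildCache(String[] values) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String v : values) {
            sorted.addAll(chunks(v));
        }
        return sorted.toArray(new String[0]);
    }

    // Encodes a value as the cache indexes of its chunks.
    static int[] toKey(String value, String[] cache) {
        List<String> parts = chunks(value);
        int[] key = new int[parts.size()];
        for (int i = 0; i < key.length; i++) {
            key[i] = Arrays.binarySearch(cache, parts.get(i));
        }
        return key;
    }

    // Compares two keys index by index; a shorter key that is a prefix
    // of the other sorts first.
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        String[] values = {
            "msqyw2shb", "msqyw2t93", "msqyw2u0v", "msqyw2usn", "msqyw2vkf",
            "msqyw2wc7", "msqyw2x3z", "msqyw2xvr", "msqyw2ynj", "msqyw2zfb"
        };
        String[] cache = buildCache(values);
        for (String v : values) {
            System.out.println(v + " " + Arrays.toString(toKey(v, cache)));
        }
    }
}
```

A real field cache would have to build the substring cache while
enumerating the terms of the index, which this sketch leaves out.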
