Hello Marcel,

Much as I like this solution, it seems to me to be suitable only for dates, right? How do we know that we are sorting on a date: by checking whether it has length 9, or that it starts with msq? Furthermore, I am quite curious how you implemented the approach quoted below. If you just used substrings, we could gain quite a bit more, but I am not sure whether you already do this:
Suppose

    String s = "msqyw2shb";

If you have

    cache[0] = s.substring(0, 3);

then the substring still keeps a reference to the full char array of s, so we reduce memory usage quite a bit more with

    cache[0] = new String(s.substring(0, 3));

Also see [1]. But perhaps you are already doing this.

A small improvement we could make directly is replacing

    retArray[termDocs.doc()] = term.text().substring(prefix.length());

with

    retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));

It looks a bit strange, but for dates I think the prefix is something like "lastModified" plus a delimiter, say 13 chars, so this would bring the char array retained in memory down from 22 to 9 chars (for dates).

It also follows that, as things stand, using short property names saves you memory. We could avoid this dependency altogether by indexing each property in its own Lucene field, instead of putting everything in :_PROPERTIES and prefixing the value with the property name. That would require quite a rewrite of the indexing, though, I think.

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622

On Thu, Jun 18, 2009 at 1:25 PM, Marcel Reutegger<[email protected]> wrote:
> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <[email protected]> wrote:
>> If you happen to find the holy grail solution, I suppose you'll let us know
>> :-) Also if you would have some memory usage numbers with and without the
>> suggestion of mine regarding reducing the precision of your Date field, this
>> would be very valuable.
>
> hmm, I've been thinking about a solution that I would call
> flyweight-substring-collation-key. it assumes that there is usually a
> major overlap of substrings of the values to sort on. e.g. a
> lastModified value. so instead of always keeping the entire value we'd
> have a collation key that references multiple reusable substrings.
>
> assume we have the following values:
>
> - msqyw2shb
> - msqyw2t93
> - msqyw2u0v
> - msqyw2usn
> - msqyw2vkf
> - msqyw2wc7
> - msqyw2x3z
> - msqyw2xvr
> - msqyw2ynj
> - msqyw2zfb
>
> (those are date property values each 1 second after the previous one)
>
> we could create collation keys for use as comparable in the field
> cache like this:
>
> substring cache:
> [0] msq
> [1] shb
> [2] t93
> [3] u0v
> [4] usn
> [5] vkf
> [6] wc7
> [7] x3z
> [8] xvr
> [9] ynj
> [10] yw2
> [11] zfb
>
> and then the actual comparables that reference the substrings in the cache:
>
> - {0, 10, 1}
> - {0, 10, 2}
> - {0, 10, 3}
> - {0, 10, 4}
> - {0, 10, 5}
> - {0, 10, 6}
> - {0, 10, 7}
> - {0, 10, 8}
> - {0, 10, 9}
> - {0, 10, 11}
>
> this will result in a lower memory consumption and using the reference
> indexes could even speed up the comparison.
>
> a quick test with 1 million date values showed that the memory
> consumption drops to 50% with this approach.
>
> regards
>  marcel
>
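
To make the quoted idea concrete, here is a minimal sketch of such a key in Java. It is not Marcel's implementation: the class name FlyweightKeys, the fixed chunk size of 3 chars, and the two-pass build (collect all chunks first, then assign indexes) are assumptions made purely for illustration. Because the substring cache is kept sorted, comparing two keys index by index gives the same order as comparing the original strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
final class FlyweightKeys {
    // Assumption: values are split into fixed-size chunks of 3 chars.
    static final int CHUNK = 3;

    // Splits a value into consecutive chunks; the last one may be shorter.
    static List<String> chunks(String value) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < value.length(); i += CHUNK) {
            parts.add(value.substring(i, Math.min(value.length(), i + CHUNK)));
        }
        return parts;
    }

    // Builds the sorted substring cache from all values to be sorted.
    static String[] buildCache(List<String> values) {
        TreeSet<String> distinct = new TreeSet<>();
        for (String v : values) {
            distinct.addAll(chunks(v));
        }
        return distinct.toArray(new String[0]);
    }

    // Encodes a value as indexes into the cache. The value must have been
    // passed to buildCache, otherwise binarySearch returns a negative index.
    static int[] key(String value, String[] cache) {
        List<String> parts = chunks(value);
        int[] key = new int[parts.size()];
        for (int i = 0; i < key.length; i++) {
            key[i] = Arrays.binarySearch(cache, parts.get(i));
        }
        return key;
    }

    // Lexicographic comparison of two keys; a key that is a prefix of
    // another sorts first.
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

For the ten values listed in the quote, buildCache yields the same twelve-entry cache, and key("msqyw2shb", cache) is {0, 10, 1}. The sketch requires all values up front; an incremental cache would need either re-indexing or a comparison that looks up the referenced substrings instead of comparing the indexes.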

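
The new String(...) copy suggested at the top of this mail can be sketched as below. The class name CompactSubstring, the method name stripPrefix, and the ':' delimiter in the usage note are hypothetical and only serve the illustration. The copy matters on the JVMs this thread is about, where substring() shares the char array of the original string; since Java 7 update 6 substring() copies the chars itself, so there the extra new String(...) is harmless but no longer needed.

```java
// Hypothetical helper, for illustration only.
final class CompactSubstring {
    // Returns the part of termText after the prefix as an independent
    // String, so the result does not keep the full term text alive on
    // JVMs where substring() shares the backing char array.
    static String stripPrefix(String termText, String prefix) {
        return new String(termText.substring(prefix.length()));
    }
}
```

With a 13-char prefix such as "lastModified:" and the 22-char term text "lastModified:msqyw2shb", stripPrefix returns the 9-char value "msqyw2shb", which is the 22 to 9 reduction mentioned above.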