Hi Ard, I think this discussion rather belongs to the dev list.
I'll reply there...

regards
 marcel

On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers wrote:
> Hello Marcel,
>
> Much as I like this solution, it seems to me to be suitable only for
> dates, right? How do we know that we are sorting on a date: by checking
> whether it has length 9, or that it starts with msq? Furthermore, I am
> quite curious how you implemented this below. If you just used
> substrings, we could gain quite a bit more, but I am not sure
> whether you already do this:
>
> Suppose
>
> String s = "msqyw2shb";
>
> If you are having
>
> parts[0] = s.substring(0, 3);
>
> we reduce memory usage quite a bit more with
>
> parts[0] = new String(s.substring(0, 3));
>
> Also see [1]. But perhaps you are already doing this.
>
> A small improvement we could make directly is replacing:
>
> retArray[termDocs.doc()] = term.text().substring(prefix.length());
>
> with
>
> retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));
>
> It is a bit strange, but for dates I think the prefix is
> something like "lastModified" plus a delimiter, say 13 chars. This
> would bring the char array retained in memory down from 22 to
> 9 chars (for dates).
>
> Furthermore, it follows that using short property names saves you
> memory. This could be avoided in the end if we indexed each property in
> its own Lucene field, instead of all in :_PROPERTIES with the
> value prefixed by the property name. This would require quite some
> rewrite of the indexing, though, I think.
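
For illustration, the copy-on-substring idea quoted above could look roughly like the sketch below. The class name, the method name and the '#' delimiter are made up for this example and are not taken from the Jackrabbit sources. The copy only helps on JVMs where substring() shares the parent string's char array, which is the behaviour described in [1]; where substring() already copies, the extra constructor call gains nothing.

```java
final class CompactSubstring {

    /**
     * Returns the part of termText that follows the prefix as a freshly
     * allocated String. Where substring() shares the parent's char array,
     * the copy constructor keeps only the characters that are needed, so
     * the longer parent array can be garbage collected.
     */
    static String stripPrefix(String termText, String prefix) {
        return new String(termText.substring(prefix.length()));
    }

    public static void main(String[] args) {
        // Hypothetical term layout: property name, delimiter, value.
        String prefix = "lastModified#";      // 13 chars
        String term = prefix + "msqyw2shb";   // 22 chars
        String value = stripPrefix(term, prefix);
        System.out.println(value + " (" + value.length() + " chars)");
    }
}
```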
>
> [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
>
> On Thu, Jun 18, 2009 at 1:25 PM, Marcel Reutegger wrote:
>> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers wrote:
>>> If you happen to find the holy grail solution, I suppose you'll let us know
>>> :-) Also if you would have some memory usage numbers with and without the
>>> suggestion of mine regarding reducing the precision of your Date field, this
>>> would be very valuable.
>>
>> hmm, I've been thinking about a solution that I would call
>> flyweight-substring-collation-key. it assumes that there is usually a
>> major overlap of substrings of the values to sort on, e.g. a
>> lastModified value. so instead of always keeping the entire value we'd
>> have a collation key that references multiple reusable substrings.
>>
>> assume we have the following values:
>>
>> - msqyw2shb
>> - msqyw2t93
>> - msqyw2u0v
>> - msqyw2usn
>> - msqyw2vkf
>> - msqyw2wc7
>> - msqyw2x3z
>> - msqyw2xvr
>> - msqyw2ynj
>> - msqyw2zfb
>>
>> (those are date property values, each 1 second after the previous one)
>>
>> we could create collation keys for use as comparables in the field
>> cache like this:
>>
>> substring cache:
>> [0] msq
>> [1] shb
>> [2] t93
>> [3] u0v
>> [4] usn
>> [5] vkf
>> [6] wc7
>> [7] x3z
>> [8] xvr
>> [9] ynj
>> [10] yw2
>> [11] zfb
>>
>> and then the actual comparables that reference the substrings in the cache:
>>
>> - {0, 10, 1}
>> - {0, 10, 2}
>> - {0, 10, 3}
>> - {0, 10, 4}
>> - {0, 10, 5}
>> - {0, 10, 6}
>> - {0, 10, 7}
>> - {0, 10, 8}
>> - {0, 10, 9}
>> - {0, 10, 11}
>>
>> this will result in lower memory consumption, and using the reference
>> indexes could even speed up the comparison.
>>
>> a quick test with 1 million date values showed that the memory
>> consumption drops to 50% with this approach.
>>
>> regards
>> marcel
>>
>
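
One way the flyweight-substring-collation-key idea quoted above might be sketched is shown below. This is not the implementation that was tested. The class and method names are hypothetical, and splitting every value into fixed chunks of 3 characters is an assumption taken from the example values only. The sketch collects all distinct chunks in sorted order, so a key is just an int array of cache positions and two keys can be compared without touching any characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class FlyweightKeys {

    /** Assumed chunk length; the example values split into 3 + 3 + 3 chars. */
    static final int CHUNK = 3;

    /** Splits a value into consecutive chunks of at most CHUNK characters. */
    static List<String> chunks(String value) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < value.length(); i += CHUNK) {
            parts.add(value.substring(i, Math.min(value.length(), i + CHUNK)));
        }
        return parts;
    }

    /** Builds the shared substring cache: all distinct chunks in sorted order. */
    static List<String> buildCache(List<String> values) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String value : values) {
            sorted.addAll(chunks(value));
        }
        return new ArrayList<>(sorted);
    }

    /** Maps each value to the cache positions of its chunks. */
    static int[][] buildKeys(List<String> values, List<String> cache) {
        Map<String, Integer> position = new HashMap<>();
        for (int i = 0; i < cache.size(); i++) {
            position.put(cache.get(i), i);
        }
        int[][] keys = new int[values.size()][];
        for (int v = 0; v < values.size(); v++) {
            List<String> parts = chunks(values.get(v));
            int[] key = new int[parts.size()];
            for (int p = 0; p < key.length; p++) {
                key[p] = position.get(parts.get(p));
            }
            keys[v] = key;
        }
        return keys;
    }

    /** Compares two keys position by position; a shorter prefix sorts first. */
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Because the cache is sorted, comparing positions gives the same order as comparing the original strings chunk by chunk. The price is that all values must be known before the keys are built, so adding a value later with a new chunk would shift the positions and invalidate existing keys.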
