Hello Marcel,

On Thu, Jun 18, 2009 at 1:25 PM, Marcel Reutegger<[email protected]> wrote:
> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <[email protected]> wrote:
>> If you happen to find the holy grail solution, I suppose you'll let us
>> know :-) Also if you have some memory usage numbers with and without my
>> suggestion of reducing the precision of your Date field, that would be
>> very valuable.
>
> hmm, I've been thinking about a solution that I would call
> flyweight-substring-collation-key. It assumes that there is usually a
> major overlap between substrings of the values to sort on, e.g. a
> lastModified value. So instead of always keeping the entire value, we'd
> have a collation key that references multiple reusable substrings.
>
> Assume we have the following values:
>
> - msqyw2shb
> - msqyw2t93
> - msqyw2u0v
> - msqyw2usn
> - msqyw2vkf
> - msqyw2wc7
> - msqyw2x3z
> - msqyw2xvr
> - msqyw2ynj
> - msqyw2zfb
>
> (those are date property values, each one second after the previous one)
>
> We could create collation keys for use as Comparables in the field
> cache like this:
>
> substring cache:
> [0] msq
> [1] shb
> [2] t93
> [3] u0v
> [4] usn
> [5] vkf
> [6] wc7
> [7] x3z
> [8] xvr
> [9] ynj
> [10] yw2
> [11] zfb
>
> and then the actual Comparables that reference the substrings in the cache:
>
> - {0, 10, 1}
> - {0, 10, 2}
> - {0, 10, 3}
> - {0, 10, 4}
> - {0, 10, 5}
> - {0, 10, 6}
> - {0, 10, 7}
> - {0, 10, 8}
> - {0, 10, 9}
> - {0, 10, 11}
>
> This will result in lower memory consumption, and using the reference
> indexes could even speed up the comparison.
>
> A quick test with 1 million date values showed that memory consumption
> drops to 50% with this approach.
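For the archives, here is a minimal sketch of how I read the flyweight-substring-collation-key idea: values are split into fixed-size chunks, each distinct chunk is interned once in a shared cache, and every key holds only int indexes into that cache. All class and method names here are illustrative, not Jackrabbit or Lucene API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Shared cache: each distinct substring is stored exactly once.
class SubstringCache {
    private final Map<String, Integer> index = new HashMap<String, Integer>();
    private final List<String> chunks = new ArrayList<String>();

    // returns the index of the chunk, adding it on first use
    int intern(String chunk) {
        Integer i = index.get(chunk);
        if (i == null) {
            i = chunks.size();
            chunks.add(chunk);
            index.put(chunk, i);
        }
        return i;
    }

    String get(int i) {
        return chunks.get(i);
    }
}

// Collation key: an int[] of cache indexes instead of the full String.
class FlyweightKey implements Comparable<FlyweightKey> {
    private final SubstringCache cache;
    final int[] parts;

    FlyweightKey(SubstringCache cache, String value, int chunkLen) {
        this.cache = cache;
        int n = (value.length() + chunkLen - 1) / chunkLen;
        parts = new int[n];
        for (int i = 0; i < n; i++) {
            int start = i * chunkLen;
            int end = Math.min(start + chunkLen, value.length());
            parts[i] = cache.intern(value.substring(start, end));
        }
    }

    public int compareTo(FlyweightKey other) {
        int n = Math.min(parts.length, other.parts.length);
        for (int i = 0; i < n; i++) {
            // equal indexes mean equal chunks: skip the String comparison
            if (parts[i] != other.parts[i]) {
                int c = cache.get(parts[i]).compareTo(cache.get(other.parts[i]));
                if (c != 0) {
                    return c;
                }
            }
        }
        return parts.length - other.parts.length;
    }
}
```

With the ten sample values above and a chunk length of 3, the cache holds the 12 substrings from Marcel's example, and each key is just three ints, which is where the memory saving comes from.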
I have been having similar thoughts, though I did not implement anything, and the 50% drop is roughly what I expected: I think you reduced the SharedFieldCache memory usage by something like 90%, but the SharedFieldCache accounts for only a little more than half of the memory actually used; the other 50% is kept internally in Lucene by the open IndexReader.

*If* we are able to actively flush that cache (the TermEnum cache) in our open Lucene indexes, I think we are there: the SharedFieldCache already contains the sorting data, so keeping the TermEnums as well is redundant, and we might then drop memory usage well below 50%. I looked at this quite some time ago, but did not have time to figure out how we could flush the Lucene TermEnums...

Actually, I think the sorting you describe is really valuable for date indexing, since the Lucene terms have so much in common; you would expect Lucene itself could also benefit from it. Date sorting might deserve special treatment in Lucene, as it is such a common case...

Also, I have given a lot of thought to indexing dates split into 8 separate fields of 16 bits each, which would allow fast, memory-efficient sorting *and* date range queries that are almost linear in time with respect to the length of the range and the number of documents with unique dates. I think it has some resemblance to TrieRange, though I do not know its ins and outs; my idea would be based on fractals... This is really out of scope, though, and I am totally unsure whether it is even feasible. I hope to have some time to test it someday :-))

For now, great job Marcel!! You actually go beyond the ideas and implement them! :-)

Regards
Ard

> regards
> marcel
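P.S. To make the field-splitting idea concrete: the sketch below splits a timestamp into 16-bit fields and joins them back. Note that a 64-bit millisecond timestamp needs only four such fields (the "8 fields" above would cover 128 bits); this is purely illustrative and not existing Jackrabbit or Lucene code:

```java
// Illustrative sketch: split a 64-bit millisecond timestamp into four
// 16-bit fields (fields[0] = most significant). Indexing each field
// separately would let a range query be decomposed into a small number
// of per-field clauses, similar in spirit to TrieRange.
class SplitDate {
    static int[] split(long millis) {
        int[] fields = new int[4];
        for (int i = 3; i >= 0; i--) {
            fields[i] = (int) (millis & 0xFFFF); // take the lowest 16 bits
            millis >>>= 16;
        }
        return fields;
    }

    static long join(int[] fields) {
        long millis = 0;
        for (int f : fields) {
            millis = (millis << 16) | (f & 0xFFFFL);
        }
        return millis;
    }
}
```

Sorting on the four fields in order is equivalent to sorting on the original timestamp, which is what makes the decomposition attractive.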
