Hello, I think there is another issue you might hit before the part Marcel describes:
Suppose, as your result set is 1,000,000 nodes, that you have 10,000,000 nodes containing a date. A date is internally stored in Lucene as 9 chars, plus the prefix of the property name (say 'lastModified'), plus some namespace and delimiter overhead, say 5 chars. That is 26 chars in total (a shorter name than 'lastModified' will save memory in the end, though it has been a while, so I might be wrong).

Now, when you want to sort in Lucene, first *all* the lastModified Lucene terms are read into memory. Suppose 26 chars take ~100 bytes and 9 chars take ~80 bytes in memory: 10,000,000 * 100 bytes = 1 GB of memory in Lucene terms, and the Jackrabbit SharedFieldCache will occupy another 10,000,000 * 80 bytes (plus overhead for nodes not having a date, which might be 90% * 4 bytes apiece).

Conclusion: if you have 10,000,000 nodes with lastModified, sorting on it will directly cost you 1.8 GB, which cannot be freed by a GC but is lost for the rest of the JVM's life (until indexes merge, but that is a rare corner case for big indices). Basically, this is imo the first issue with sorting large data sets (if you sort on title or a property that contains large strings, memory is gone even faster). Also, the doubling (1 GB in Lucene terms and 0.8 GB in the SharedFieldCache) could be avoided, but that needs a large change with respect to indexing properties.

Regarding the resultFetchSize: typically, when you want an archive where you display all pages, lowering it is not an option, is it? I suppose that if I use setLimit(3) on a query, it lowers the resultFetchSize at runtime? That would indeed make it much more efficient if you only want the last 10 news items added. Is this correct?

Regarding [2], I think it would be nice if we can add this.
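To make the arithmetic above easy to check, here is a minimal sketch. The per-entry byte sizes are the rough assumptions from this mail (100 bytes per Lucene term, 80 bytes per SharedFieldCache entry), not measured values:

```java
// Back-of-the-envelope estimate for sorting 10M dated nodes on lastModified.
// The byte sizes are the assumptions from the mail above, not measurements.
public class SortMemoryEstimate {
    public static void main(String[] args) {
        long nodesWithDate = 10_000_000L;
        long bytesPerLuceneTerm = 100L;   // ~26 chars incl. field name + overhead
        long bytesPerCacheEntry = 80L;    // ~9 chars held by the SharedFieldCache

        long luceneTerms = nodesWithDate * bytesPerLuceneTerm;       // 1.0 GB
        long sharedFieldCache = nodesWithDate * bytesPerCacheEntry;  // 0.8 GB

        System.out.printf("lucene terms:     %.1f GB%n", luceneTerms / 1e9);
        System.out.printf("SharedFieldCache: %.1f GB%n", sharedFieldCache / 1e9);
        System.out.printf("total:            %.1f GB%n",
                (luceneTerms + sharedFieldCache) / 1e9);
    }
}
```

Per the mail, this memory is held for the life of the JVM, so the estimate is a floor on steady-state heap usage, not a transient spike.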
If it happens to be really hard, we could perhaps more easily create an indexing configuration where we define the precision/granularity of the Date property to be indexed... this is easy and gives a major performance increase; the only cost is that precision is lowered when searching on dates.

Regards,
Ard

On Wed, Jun 17, 2009 at 10:13 AM, Marcel Reutegger <[email protected] > wrote:
> Hi,
>
> the sorting is pretty well optimized, it basically uses underlying
> lucene functionality for that. there are two other important points
> that will influence performance:
>
> 1) workspace configuration
>
> the default workspace configuration will cause initial fetching of the
> entire result set. you can change this behavior by setting the
> resultFetchSize parameter. See [0].
>
> 2) Ian wrote: "I only want to see a small number of items eg 100 after
> a particular date."
>
> that might actually become a problem. it will result in a range query
> that potentially selects lots (millions?) of nodes with distinct date
> properties. this case is not optimized. there's a new indexing
> technique in lucene called trierange queries [1] which was
> specifically built to perform such queries efficiently. but this is
> not yet integrated with jackrabbit.
>
> I've created a JIRA issue to discuss and keep track of such an
> enhancement in jackrabbit: [2]
>
> regards
> marcel
>
> [0] http://issues.apache.org/jira/browse/JCR-651
> [1] http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
> [2] https://issues.apache.org/jira/browse/JCR-2151
>
> On Wed, Jun 17, 2009 at 01:50, Ian Boston <[email protected]> wrote:
> > Hi,
> >
> > I want to perform a query where the full result set could be millions of
> > items. That set needs to be sorted by the lastModified attribute on the
> > node, and I only want to see a small number of items eg 100 after a
> > particular date.
> >
> > If I do this, will there be scalability issues, or is the sorting of a
> > date field optimized in the query engine?
> >
> > Thanks
> > Ian
> >
>
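For anyone who wants to try Marcel's point 1: resultFetchSize is a parameter on the SearchIndex element in workspace.xml. A sketch of the relevant fragment (the value 100 is just for illustration; tune it to your page size):

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- fetch results in batches instead of the entire result set up front;
       100 is an illustrative value, not a recommendation -->
  <param name="resultFetchSize" value="100"/>
</SearchIndex>
```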
