Hello, I think there is another issue you might hit before the part Marcel describes:
Suppose, as your result set is 1,000,000 nodes, that you have 10,000,000 nodes containing a date. A date is internally stored in Lucene as 9 chars, plus the prefix of the property name (say 'lastModified'), plus some namespace and delimiter overhead, say 5 chars. That is 26 chars in total (a shorter name than 'lastModified' will save memory in the end, though it has been a while, so I might be wrong).

Now, when you want to sort in Lucene, first *all* the lastModified Lucene terms are read into memory. Suppose 26 chars take ~100 bytes and 9 chars take ~80 bytes in memory: 10,000,000 * 100 bytes = 1 GB of memory in Lucene terms, and the Jackrabbit SharedFieldCache will occupy another 10,000,000 * 80 bytes (plus overhead for nodes not having a date, which might be 90% * 4 bytes apiece).

Conclusion: if you have 10,000,000 nodes with lastModified, sorting on it will directly cost you 1.8 GB, which cannot be freed by a GC but is lost for the rest of the JVM's life (until indexes merge, but that is a rare corner case for big indices). Basically, this is imo the first issue with sorting large data sets (if you sort on title or a property that contains large strings, memory is gone even faster). Also, the doubling (1 GB in Lucene terms and 0.8 GB in the SharedFieldCache) could be avoided, but that needs a large change with respect to indexing properties.

Regarding the resultFetchSize: typically, when you want an archive where you display all pages, lowering it is not an option, is it? I suppose that if I use setLimit(3) on a query, it lowers the resultFetchSize at runtime? That would indeed make it much more efficient if you only want the last 10 news items added. Is this correct?

Regarding [2], I think it would be nice if we can add this.
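To make the arithmetic above easy to check, here is a minimal sketch. The per-entry byte sizes are the rough assumptions from this mail (100 bytes per Lucene term, 80 bytes per SharedFieldCache entry), not measured values:

```java
// Back-of-the-envelope estimate for sorting 10M dated nodes on lastModified.
// The byte sizes are the assumptions from the mail above, not measurements.
public class SortMemoryEstimate {
    public static void main(String[] args) {
        long nodesWithDate = 10_000_000L;
        long bytesPerLuceneTerm = 100L;   // ~26 chars incl. field name + overhead
        long bytesPerCacheEntry = 80L;    // ~9 chars held by the SharedFieldCache

        long luceneTerms = nodesWithDate * bytesPerLuceneTerm;       // 1.0 GB
        long sharedFieldCache = nodesWithDate * bytesPerCacheEntry;  // 0.8 GB

        System.out.printf("lucene terms:     %.1f GB%n", luceneTerms / 1e9);
        System.out.printf("SharedFieldCache: %.1f GB%n", sharedFieldCache / 1e9);
        System.out.printf("total:            %.1f GB%n",
                (luceneTerms + sharedFieldCache) / 1e9);
    }
}
```

Per the mail, this memory is held for the life of the JVM, so the estimate is a floor on steady-state heap usage, not a transient spike.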
If it happens to be really hard, we could perhaps more easily create an indexing configuration where we define the precision/granularity of the Date property to be indexed... this is easy and gives a major performance increase; the only cost is that precision is lowered when searching on dates.

Regards,
Ard

On Wed, Jun 17, 2009 at 10:13 AM, Marcel Reutegger <[email protected] > wrote:
> Hi,
>
> the sorting is pretty well optimized, it basically uses underlying
> lucene functionality for that. there are two other important points
> that will influence performance:
>
> 1) workspace configuration
>
> the default workspace configuration will cause initial fetching of the
> entire result set. you can change this behavior by setting the
> resultFetchSize parameter. See [0].
>
> 2) Ian wrote: "I only want to see a small number of items eg 100 after
> a particular date."
>
> that might actually become a problem. it will result in a range query
> that potentially selects lots (millions?) of nodes with distinct date
> properties. this case is not optimized. there's a new indexing
> technique in lucene called trierange queries [1] which was
> specifically built to perform such queries efficiently. but this is
> not yet integrated with jackrabbit.
>
> I've created a JIRA issue to discuss and keep track of such an
> enhancement in jackrabbit: [2]
>
> regards
> marcel
>
> [0] http://issues.apache.org/jira/browse/JCR-651
> [1] http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
> [2] https://issues.apache.org/jira/browse/JCR-2151
>
> On Wed, Jun 17, 2009 at 01:50, Ian Boston <[email protected]> wrote:
> > Hi,
> >
> > I want to perform a query where the full result set could be millions of
> > items. That set needs to be sorted by the lastModified attribute on the
> > node, and I only want to see a small number of items eg 100 after a
> > particular date.
> >
> > If I do this, will there be scalability issues, or is the sorting of a
> > date field optimized in the query engine?
> >
> > Thanks
> > Ian
> >
>
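For anyone who wants to try Marcel's point 1: resultFetchSize is a parameter on the SearchIndex element in workspace.xml. A sketch of the relevant fragment (the value 100 is just for illustration; tune it to your page size):

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- fetch results in batches instead of the entire result set up front;
       100 is an illustrative value, not a recommendation -->
  <param name="resultFetchSize" value="100"/>
</SearchIndex>
```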
