On 17 Jun 2009, at 10:04, Ard Schrijvers wrote:

Hello,

I think there is another issue you might hit before the part Marcel
describes:

Suppose that, while your result set is 1,000,000 nodes, you have 10,000,000 nodes containing a Date property.

A date is stored internally in Lucene as 9 chars plus the prefix of the property name, say 'lastModified', plus some namespace and delimiter overhead of about 5 chars: 26 chars in total. (A shorter name than 'lastModified' would save memory in the end, though it has been a while, so I might be wrong.)

Now, when you want to sort in Lucene, first *all* the lastModified Lucene terms are read into memory (assume 26 chars ~ 100 bytes and 9 chars ~ 80 bytes in memory):

10,000,000 * 100 bytes = 1 GB of memory in Lucene terms, plus the Jackrabbit SharedFieldCache will occupy another 10,000,000 * 80 bytes (plus overhead for the nodes not having a date, which might be 90% of them at 4 bytes apiece).
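The arithmetic above can be written out as a back-of-the-envelope calculation. The per-entry byte counts are the rough assumptions from this thread, not measured values:

```java
// Rough estimate of the memory pinned by sorting on a date property.
// The byte counts per term/entry are assumptions from the discussion
// above, not measurements.
public class SortMemoryEstimate {
    public static void main(String[] args) {
        long nodesWithDate = 10_000_000L;
        long bytesPerLuceneTerm = 100;  // ~26-char term held as a Java String
        long bytesPerCacheEntry = 80;   // ~9-char value in the SharedFieldCache

        long luceneTermsBytes = nodesWithDate * bytesPerLuceneTerm;      // 1.0 GB
        long sharedFieldCacheBytes = nodesWithDate * bytesPerCacheEntry; // 0.8 GB

        System.out.printf("total: %.1f GB%n",
                (luceneTermsBytes + sharedFieldCacheBytes) / 1e9); // total: 1.8 GB
    }
}
```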

This is what I was worried about.



Anyway, the conclusion: if you have 10,000,000 nodes with lastModified, sorting on it will directly cost you 1.8 GB, which cannot be freed by GC but is lost for the rest of the JVM's life (until the indexes merge, but that is a rare corner case for big indices).


This makes me a bit more worried, since I thought that at least the memory would be GC'd at the end of the request. So presumably, if the user asks for the first 100 hits sorted by lastModified, then by subject, then by status, will each of those distinct sorts consume additional memory that is not freed at the end of the request?

There are two problems here for us: the UX people are demanding sorting on every column that is displayed, and we are using Sling, which has a search servlet that accepts XPath or SQL, so I could craft a query that generates an OOM in the JVM even if the UI is not causing the problem. We may have to remove that servlet if my fears are real.




Basically, this is IMO the first issue with sorting large data sets (if you sort on a title or a property that contains large strings, memory is gone even faster). Also, the doubling (1 GB in Lucene and 0.8 GB in the SharedFieldCache) could be avoided, but that needs a large change with respect to indexing properties.

Regarding the resultFetchSize: typically, when you have an archive, displaying all pages at once is not an option anyway, is it?

Agreed, a UX with 1M items in a list isn't really usable; the max they want is 100, so there is not much point in fetching the entire set.


I suppose that if I use setLimit(3) on a query, it lowers the resultFetchSize at runtime, doesn't it? That would indeed make it much more efficient if you only want the last 10 news items added. Is this correct?

I think so, if I follow you correctly.
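To make the assumption in the exchange above concrete: the idea is that the engine would fetch no more rows than the query limit. This is a purely illustrative sketch of that assumed behaviour; the helper and its name are hypothetical, and the real decision lives inside Jackrabbit's query handler:

```java
// Illustrative sketch only: models the assumption that a query limit
// caps how many results the engine actually fetches up front.
public class FetchSizeSketch {
    // assumed behaviour: fetch no more rows than min(configured, limit)
    static long effectiveFetch(long resultFetchSize, long limit) {
        return Math.min(resultFetchSize, limit);
    }

    public static void main(String[] args) {
        // with an effectively unbounded resultFetchSize and setLimit(3),
        // only 3 rows would need to be fetched
        System.out.println(effectiveFetch(Integer.MAX_VALUE, 3)); // prints 3
    }
}
```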


Regarding [2], I think it would be nice if we could add this. If it happens to be really hard, we could perhaps more easily create an indexing configuration where we define the precision/granularity of the Date property to be indexed... this is easy and gives a major performance increase; the only downside is that precision is lowered when searching on dates.
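The granularity idea can be sketched with plain Java: truncate each timestamp to, say, day resolution before indexing, so that many distinct timestamps collapse into one term. The format and method name here are illustrative, not Jackrabbit's actual indexing configuration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of indexing dates at day granularity: distinct timestamps on
// the same day map to one indexed term, shrinking the term dictionary.
public class DateGranularity {
    static String indexTerm(Date d) {
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        return day.format(d); // e.g. "20090617"
    }

    public static void main(String[] args) {
        Date morning = new Date(1245225840000L); // 2009-06-17, morning UTC
        Date evening = new Date(1245265200000L); // 2009-06-17, evening UTC
        // both fall on the same day, hence the same indexed term
        System.out.println(indexTerm(morning).equals(indexTerm(evening))); // true
    }
}
```

The trade-off is exactly the one mentioned above: a range query can no longer distinguish times within the same day.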

Regards Ard


On Wed, Jun 17, 2009 at 10:13 AM, Marcel Reutegger <[email protected]> wrote:

Hi,

the sorting is pretty well optimized, it basically uses underlying
lucene functionality for that. there are two other important points
that will influence performance:

1) workspace configuration

the default workspace configuration will cause initial fetching of the
entire result set. you can change this behavior by setting the
resultFetchSize parameter. See [0].
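For reference, that parameter goes on the SearchIndex element in workspace.xml. A minimal sketch, with path and value illustrative:

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- fetch results lazily in batches instead of the whole set up front -->
  <param name="resultFetchSize" value="100"/>
</SearchIndex>
```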

2) Ian wrote: "I only want to see a small number of items eg 100 after
a particular date."

that might actually become a problem. it will result in a range query
that potentially selects lots (millions?) of nodes with distinct date
properties. this case is not optimized. there's a new indexing
technique in lucene called trierange queries [1] which was
specifically built to perform such queries efficiently. but this is
not yet integrated with jackrabbit.
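This is not Lucene's actual implementation, but the core trie idea can be sketched in plain Java: index each numeric value at several precisions (here by shifting low bits away), so a range query can be covered by a handful of coarse terms instead of one exact term per distinct timestamp. The term encoding and step size below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the trie idea behind Lucene's TrieRange queries: each value
// is indexed at multiple precisions, so wide sub-ranges of a range query
// can be matched by a few coarse terms.
public class TrieRangeSketch {
    static List<String> termsFor(long value, int precisionStepBits) {
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 64; shift += precisionStepBits) {
            terms.add(shift + ":" + (value >>> shift)); // coarser as shift grows
        }
        return terms;
    }

    public static void main(String[] args) {
        // nearby timestamps differ only in their finest-precision terms,
        // so the coarse terms cover wide sub-ranges cheaply
        List<String> a = termsFor(1245225840000L, 16);
        List<String> b = termsFor(1245225841000L, 16);
        System.out.println(a.get(1).equals(b.get(1))); // true: coarse term shared
    }
}
```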

I've created a JIRA issue to discuss and keep track of such an
enhancement in jackrabbit: [2]

regards
marcel

[0] http://issues.apache.org/jira/browse/JCR-651
[1]
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
[2] https://issues.apache.org/jira/browse/JCR-2151

On Wed, Jun 17, 2009 at 01:50, Ian Boston <[email protected]> wrote:
Hi,

I want to perform a query where the full result set could be millions of items. That set needs to be sorted by the lastModified attribute on the node, and I only want to see a small number of items, e.g. 100, after a particular date.

If I do this, will there be scalability issues, or is the sorting of a date field optimized in the query engine?

Thanks
Ian


