On 17 Jun 2009, at 10:04, Ard Schrijvers wrote:
Hello,
I think there is another issue you might hit before the part Marcel describes.

Suppose, while your result set is 1,000,000 nodes, that you have 10,000,000 nodes containing a date. A date is stored internally in Lucene as 9 chars plus the prefix of the property name, suppose 'lastModified', plus some namespace and delimiter overhead of say 5 chars. That is 26 chars in total (a shorter name than 'lastModified' will save memory in the end, though it has been a while, so I might be wrong).

Now, when you want to sort in Lucene, first *all* the lastModified Lucene terms are read into memory (assume a 26-char term takes ~100 bytes of memory and a 9-char one ~80 bytes):

10,000,000 * 100 bytes = 1 GB of memory in Lucene terms, and the Jackrabbit SharedFieldCache will occupy another 10,000,000 * 80 bytes (plus overhead for nodes not having a date, which might be 90% * 4 bytes a piece).
This is what I was worried about.
Anyway, the conclusion: if you have 10,000,000 nodes with lastModified, sorting on it will directly cost you 1.8 GB, which cannot be freed by a GC but is lost for the rest of the JVM's life (until the indexes merge, but that is a rare corner case for big indices).
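The back-of-the-envelope estimate above can be written out as plain arithmetic; the per-term byte costs are the rough assumptions from this thread, not measured values:

```java
public class SortMemoryEstimate {
    public static void main(String[] args) {
        long nodesWithDate = 10_000_000L;    // nodes carrying a lastModified date
        long bytesPerLuceneTerm = 100L;      // assumed in-memory cost of a 26-char term
        long bytesPerCacheEntry = 80L;       // assumed cost per SharedFieldCache entry

        long luceneTerms = nodesWithDate * bytesPerLuceneTerm;       // ~1 GB
        long sharedFieldCache = nodesWithDate * bytesPerCacheEntry;  // ~0.8 GB

        System.out.printf("Lucene terms:     %.1f GB%n", luceneTerms / 1e9);
        System.out.printf("SharedFieldCache: %.1f GB%n", sharedFieldCache / 1e9);
        System.out.printf("Total:            %.1f GB%n",
                (luceneTerms + sharedFieldCache) / 1e9);
    }
}
```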
This makes me a bit more worried, since I thought that at least the memory would be GC'd at the end of the request.

So presumably, if the user asks for the first 100 hits sorted by lastModified, then by subject, then by status, will each of those distinct searches consume additional memory that is not freed at the end of the request?
There are two problems here for us: the UX people are demanding sorting on every column that is displayed, and we are using Sling, which has a search servlet that accepts XPath or SQL, so I can craft a query that will generate an OOM in the JVM even if the UI is not causing the problem. We may have to remove that servlet if my fears are real.
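For illustration, a query of roughly this shape (a hypothetical example; the actual servlet request is not shown in this thread) would be enough to trigger the sort, and thus the term loading, described above:

```
//element(*, nt:base)[@lastModified > xs:dateTime('2009-01-01T00:00:00.000Z')]
  order by @lastModified descending
```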
Basically, this is IMO the first issue with sorting large data sets (if you sort on a title or a property that contains large strings, memory is gone even faster). Also, the doubling (1 GB in Lucene and 0.8 GB in the SharedFieldCache) could be avoided, but that needs a large change with regard to how properties are indexed.
Regarding the resultFetchSize: typically, when you want to have an archive where you want to display all pages, that is not an option, is it?
Agreed, a UX with 1M items in a list isn't really usable; the max they want is 100, so there is not much point in fetching the entire set.
I suppose that if I use setLimit(3) on a query, it lowers the resultFetchSize at runtime, doesn't it? This would indeed make it much more efficient if you only want the last 10 news items added. Is this correct?
I think so, if I follow you correctly.
Regarding [2], I think it would be nice if we could add this. If it happens to be really hard, we could perhaps more easily create an indexing configuration where we define the precision/granularity of the date property to be indexed. This is easy and gives a major performance increase; only, the precision is lowered when searching on dates.
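A sketch of what such a configuration could look like. Note this is hypothetical: the granularity attribute does not exist in Jackrabbit's indexing configuration today; only the index-rule/property shape is taken from the real format:

```xml
<!-- hypothetical sketch: a 'granularity' attribute is the proposal above,
     not an existing Jackrabbit indexing_configuration.xml option -->
<configuration>
  <index-rule nodeType="nt:base">
    <!-- index lastModified rounded to a day, so a year of data produces
         at most ~365 distinct terms instead of millions -->
    <property granularity="day">lastModified</property>
  </index-rule>
</configuration>
```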
Regards, Ard
On Wed, Jun 17, 2009 at 10:13 AM, Marcel Reutegger <[email protected]> wrote:
Hi,
the sorting is pretty well optimized; it basically uses the underlying Lucene functionality. There are two other important points that will influence performance:

1) workspace configuration

The default workspace configuration will cause initial fetching of the entire result set. You can change this behavior by setting the resultFetchSize parameter. See [0].
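In workspace.xml this is a parameter on the SearchIndex element; a sketch (the value of 100 is just an example, and the path parameter is shown only for context):

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- fetch results in batches of 100 instead of loading
       the whole result set up front -->
  <param name="resultFetchSize" value="100"/>
</SearchIndex>
```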
2) Ian wrote: "I only want to see a small number of items e.g. 100 after a particular date."

That might actually become a problem. It will result in a range query that potentially selects lots (millions?) of nodes with distinct date properties, and this case is not optimized. There is a new indexing technique in Lucene called trie range queries [1], which was specifically built to perform such queries efficiently, but it is not yet integrated with Jackrabbit.
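The intuition behind trie range (and behind the precision-lowering idea discussed in this thread) can be shown without Lucene: if date terms are truncated to a coarser precision, a range scan has far fewer distinct terms to visit. A self-contained sketch with made-up timestamps:

```java
import java.util.TreeSet;

public class DatePrecisionDemo {
    public static void main(String[] args) {
        TreeSet<String> fullPrecision = new TreeSet<>();
        TreeSet<String> dayPrecision = new TreeSet<>();

        // simulate 10,000 distinct timestamps spread over 7 days
        for (int i = 0; i < 10_000; i++) {
            int day = i % 7 + 1;                     // days 2009-06-11 .. 2009-06-17
            int secondOfDay = (i * 7919) % 86_400;   // pseudo-random second of day
            String term = String.format("2009061%d%05d", day, secondOfDay);
            fullPrecision.add(term);                 // one term per distinct timestamp
            dayPrecision.add(term.substring(0, 8));  // truncate to yyyyMMdd
        }

        // a range scan over day-precision terms visits 7 terms, not ~10,000
        System.out.println("full-precision terms: " + fullPrecision.size());
        System.out.println("day-precision terms:  " + dayPrecision.size());
    }
}
```

Trie-encoded fields take this further by indexing each value at several precisions at once, so range queries stay exact while still visiting few terms.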
I've created a JIRA issue to discuss and keep track of such an
enhancement in jackrabbit: [2]
regards
marcel
[0] http://issues.apache.org/jira/browse/JCR-651
[1] http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
[2] https://issues.apache.org/jira/browse/JCR-2151
On Wed, Jun 17, 2009 at 01:50, Ian Boston<[email protected]> wrote:
Hi,
I want to perform a query where the full result set could be millions of items. That set needs to be sorted by the lastModified attribute on the node, and I only want to see a small number of items, e.g. 100, after a particular date.

If I do this, will there be scalability issues, or is the sorting of a date field optimized in the query engine?
Thanks
Ian