On 17 Jun 2009, at 10:04, Ard Schrijvers wrote:

Hello,

I think there is another issue you might hit before the part Marcel
describes:

Suppose that, while your result set is 1,000,000 nodes, you have 10,000,000 nodes containing a Date property.

A date is stored internally in Lucene as 9 chars plus the prefix of the property name, say 'lastModified', plus some namespace and delimiter overhead of about 5 chars: 26 chars in total. (A shorter name than 'lastModified' would save memory in the end, though it has been a while, so I might be wrong.)

Now, when you want to sort in Lucene, first *all* the lastModified Lucene terms are read into memory (assume 26 chars ~ 100 bytes and 9 chars ~ 80 bytes in memory):

10,000,000 * 100 bytes = 1 GB of memory in Lucene terms, plus the Jackrabbit SharedFieldCache will occupy another 10,000,000 * 80 bytes (plus overhead for the nodes not having a date, which might be 90% of them at 4 bytes apiece).
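The arithmetic above can be written out as a back-of-the-envelope calculation. The per-entry byte counts are the rough assumptions from this thread, not measured values:

```java
// Rough estimate of the memory pinned by sorting on a date property.
// The byte counts per term/entry are assumptions from the discussion
// above, not measurements.
public class SortMemoryEstimate {
    public static void main(String[] args) {
        long nodesWithDate = 10_000_000L;
        long bytesPerLuceneTerm = 100;  // ~26-char term held as a Java String
        long bytesPerCacheEntry = 80;   // ~9-char value in the SharedFieldCache

        long luceneTermsBytes = nodesWithDate * bytesPerLuceneTerm;      // 1.0 GB
        long sharedFieldCacheBytes = nodesWithDate * bytesPerCacheEntry; // 0.8 GB

        System.out.printf("total: %.1f GB%n",
                (luceneTermsBytes + sharedFieldCacheBytes) / 1e9); // total: 1.8 GB
    }
}
```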

This is what I was worried about.



Anyway, the conclusion: if you have 10,000,000 nodes with lastModified, sorting on it will directly cost you 1.8 GB, which cannot be freed by GC but is lost for the rest of the JVM's life (until the indexes merge, but that is a rare corner case for big indices).


This makes me a bit more worried, since I thought that at least the memory would be GC'd at the end of the request. So presumably, if the user asks for the first 100 hits sorted by lastModified, then by subject, then by status, will each of those distinct sorts consume additional memory that is not freed at the end of the request?

There are two problems here for us: the UX people are demanding sorting on every column that is displayed, and we are using Sling, which has a search servlet that accepts XPath or SQL, so I could craft a query that generates an OOM in the JVM even if the UI is not causing the problem. We may have to remove that servlet if my fears are real.




Basically, this is IMO the first issue with sorting large data sets (if you sort on a title or a property that contains large strings, memory is gone even faster). Also, the doubling (1 GB in Lucene and 0.8 GB in the SharedFieldCache) could be avoided, but that needs a large change with respect to indexing properties.

Regarding the resultFetchSize: typically, when you have an archive, displaying all pages at once is not an option anyway, is it?

Agreed, a UX with 1M items in a list isn't really usable; the max they want is 100, so there is not much point in fetching the entire set.


I suppose that if I use setLimit(3) on a query, it lowers the resultFetchSize at runtime, doesn't it? That would indeed make it much more efficient if you only want the last 10 news items added. Is this correct?

I think so, if I follow you correctly.
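To make the assumption in the exchange above concrete: the idea is that the engine would fetch no more rows than the query limit. This is a purely illustrative sketch of that assumed behaviour; the helper and its name are hypothetical, and the real decision lives inside Jackrabbit's query handler:

```java
// Illustrative sketch only: models the assumption that a query limit
// caps how many results the engine actually fetches up front.
public class FetchSizeSketch {
    // assumed behaviour: fetch no more rows than min(configured, limit)
    static long effectiveFetch(long resultFetchSize, long limit) {
        return Math.min(resultFetchSize, limit);
    }

    public static void main(String[] args) {
        // with an effectively unbounded resultFetchSize and setLimit(3),
        // only 3 rows would need to be fetched
        System.out.println(effectiveFetch(Integer.MAX_VALUE, 3)); // prints 3
    }
}
```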


Regarding [2], I think it would be nice if we could add this. If it happens to be really hard, we could perhaps more easily create an indexing configuration where we define the precision/granularity of the Date property to be indexed... this is easy and gives a major performance increase; the only downside is that precision is lowered when searching on dates.
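The granularity idea can be sketched with plain Java: truncate each timestamp to, say, day resolution before indexing, so that many distinct timestamps collapse into one term. The format and method name here are illustrative, not Jackrabbit's actual indexing configuration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of indexing dates at day granularity: distinct timestamps on
// the same day map to one indexed term, shrinking the term dictionary.
public class DateGranularity {
    static String indexTerm(Date d) {
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        return day.format(d); // e.g. "20090617"
    }

    public static void main(String[] args) {
        Date morning = new Date(1245225840000L); // 2009-06-17, morning UTC
        Date evening = new Date(1245265200000L); // 2009-06-17, evening UTC
        // both fall on the same day, hence the same indexed term
        System.out.println(indexTerm(morning).equals(indexTerm(evening))); // true
    }
}
```

The trade-off is exactly the one mentioned above: a range query can no longer distinguish times within the same day.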

Regards Ard


On Wed, Jun 17, 2009 at 10:13 AM, Marcel Reutegger <[email protected]> wrote:

Hi,

the sorting is pretty well optimized, it basically uses underlying
lucene functionality for that. there are two other important points
that will influence performance:

1) workspace configuration

the default workspace configuration will cause initial fetching of the
entire result set. you can change this behavior by setting the
resultFetchSize parameter. See [0].
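For reference, that parameter goes on the SearchIndex element in workspace.xml. A minimal sketch, with path and value illustrative:

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- fetch results lazily in batches instead of the whole set up front -->
  <param name="resultFetchSize" value="100"/>
</SearchIndex>
```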

2) Ian wrote: "I only want to see a small number of items eg 100 after
a particular date."

that might actually become a problem. it will result in a range query
that potentially selects lots (millions?) of nodes with distinct date
properties. this case is not optimized. there's a new indexing
technique in lucene called trierange queries [1] which was
specifically built to perform such queries efficiently. but this is
not yet integrated with jackrabbit.
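This is not Lucene's actual implementation, but the core trie idea can be sketched in plain Java: index each numeric value at several precisions (here by shifting low bits away), so a range query can be covered by a handful of coarse terms instead of one exact term per distinct timestamp. The term encoding and step size below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the trie idea behind Lucene's TrieRange queries: each value
// is indexed at multiple precisions, so wide sub-ranges of a range query
// can be matched by a few coarse terms.
public class TrieRangeSketch {
    static List<String> termsFor(long value, int precisionStepBits) {
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 64; shift += precisionStepBits) {
            terms.add(shift + ":" + (value >>> shift)); // coarser as shift grows
        }
        return terms;
    }

    public static void main(String[] args) {
        // nearby timestamps differ only in their finest-precision terms,
        // so the coarse terms cover wide sub-ranges cheaply
        List<String> a = termsFor(1245225840000L, 16);
        List<String> b = termsFor(1245225841000L, 16);
        System.out.println(a.get(1).equals(b.get(1))); // true: coarse term shared
    }
}
```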

I've created a JIRA issue to discuss and keep track of such an
enhancement in jackrabbit: [2]

regards
marcel

[0] http://issues.apache.org/jira/browse/JCR-651
[1]
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
[2] https://issues.apache.org/jira/browse/JCR-2151

On Wed, Jun 17, 2009 at 01:50, Ian Boston <[email protected]> wrote:
Hi,

I want to perform a query where the full result set could be millions of items. That set needs to be sorted by the lastModified attribute on the node, and I only want to see a small number of items, e.g. 100, after a particular date.

If I do this, will there be scalability issues, or is the sorting of a date field optimized in the query engine?

Thanks
Ian


