Re: Explanation and solutions of some Jackrabbit queries regarding performance

Marcel Reutegger Wed, 23 Jan 2008 02:12:52 -0800

Hi Ard,

excellent work. this should definitely be placed on a query faq wiki page.


regards
 marcel

Ard Schrijvers wrote:

Hello Martin Zdila regarding JCR-1196 et al,

from time to time I see mails about query performance and about slow
calls such as queryResult.getNodes().hasNext(). Some queries can be
slow, some data modelling structures can be slow, and even seemingly
trivial calls like queryResult.getNodes().hasNext() can be slow. I keep
writing 'can', because everything can and must be blisteringly fast
with millions of documents, and most of the time the solution is
extremely simple. We just have to document the pitfalls of some easily
made mistakes. I'll try to find some time in the near future to document
the parts I am aware of in the form of a FAQ, like the rest of this
mail. For now, just some frequently made mistakes off the top of my
head:

@Martin Zdila: if you are not interested in reading the rest of this
mail, just add <param name="respectDocumentOrder" value="false"/> to the
<SearchIndex> element of your workspace.xml (and repository.xml). Also
try to avoid having 4000 child nodes (certainly same-name siblings)
under one node; try to create a deeper tree in which no node contains
many child nodes. Just like a directory on your filesystem, a node with
thousands of direct children is not fast.


Question 1: Why is a search for the xpath '/jcr:root/a/b/c' slower than
'//c' or '//[EMAIL PROTECTED]'?

Answer 1: When a query with a path like '/jcr:root/a/b/c' or
'/jcr:root/a//*/c' is executed, the hierarchy manager has to check for
every node found whether its parents are correct. Since Jackrabbit does
not store hierarchical data (if it did, it could no longer move a node
efficiently, at least in the current architecture), hierarchies need to
be checked by iterating through the lucene indexes to find the parent
nodes of a result. This is CPU intensive. Although the hierarchy has
been cached properly since Jackrabbit 1.4, returning many results is
still an expensive operation. The first execution of a query might be
slow because the hierarchy cache needs to be built up. Queries like
'//c' or '//[EMAIL PROTECTED]' do not need to check hierarchies, because
their results do not have to be validated against their parent nodes.

Conclusion 1: When the result set of a search is expected to be large,
try to avoid path information in the xpath. Try to distinguish nodes
based on, for example, node type or some property.
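
As a rough illustration of Conclusion 1, the sketch below builds both
kinds of XPath statement as plain strings. The node type 'my:article'
and the property 'my:category' used in the usage note are made-up names
and not part of Jackrabbit; substitute whatever your own node types
define. The helper class itself is hypothetical as well.

```java
final class QueryShapes {
    /** Wraps a value in single quotes, doubling embedded quotes as XPath string literals require. */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Path-constrained statement: each hit needs its ancestors checked. */
    static String byPath(String parentPath, String nodeName) {
        return "/jcr:root" + parentPath + "/" + nodeName;
    }

    /** Statement constrained by node type and property only, without path information. */
    static String byTypeAndProperty(String nodeType, String property, String value) {
        return "//element(*, " + nodeType + ")[@" + property + " = " + quote(value) + "]";
    }
}
```

For example, byTypeAndProperty("my:article", "my:category", "news")
yields //element(*, my:article)[@my:category = 'news'], which selects
by type and property instead of by location.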

Question 2: My xpath was '//c' and the result size is 10,000 nodes. When
I call queryResult.getNodes().hasNext(), it can take minutes for this
call to complete.

Answer 2: For Jackrabbit versions < 1.5, the default setting in the
<SearchIndex> configuration in repository.xml is
<param name="respectDocumentOrder" value="true"/>. This means that when
a query does *not* have an 'order by' clause, result nodes will be
returned in document order. Returning nodes in document order becomes
increasingly slow for many results (> 1000). You can fix this by either
setting respectDocumentOrder to false in your repository.xml (and in
workspace.xml if you already have an existing workspace) *or* by adding
an 'order by' clause to your query. A delay of minutes will then drop
to 0-15 ms.

Conclusion 2: When you have a lot of results, either include an 'order
by' clause or set respectDocumentOrder to false. Modelling your content
with many child nodes below one single node makes the problem even
worse when respectDocumentOrder = true and you do not define an 'order
by' clause.
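
If changing the configuration is not an option, the 'order by' route
can be applied in code. The following is only a sketch: the helper
class is hypothetical, the property name passed in is whatever your
content actually carries, and the check for an existing clause is a
naive substring test that would be fooled by a string literal
containing the words 'order by'.

```java
import java.util.Locale;

final class OrderByHelper {
    /** Naive check: true if the statement text already contains " order by ". */
    static boolean hasOrderBy(String xpath) {
        return xpath.toLowerCase(Locale.ROOT).contains(" order by ");
    }

    /** Appends an explicit ordering so the result is not sorted in document order. */
    static String withOrderBy(String xpath, String property, boolean descending) {
        if (hasOrderBy(xpath)) {
            return xpath; // leave an existing clause untouched
        }
        return xpath + " order by @" + property
                + (descending ? " descending" : " ascending");
    }
}
```

With the made-up property 'my:date', withOrderBy("//c", "my:date", true)
produces //c order by @my:date descending.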

Question 3: My xpath is '//*[jcr:like(@propertyName, '%somevalue%')]'
and it takes minutes to complete.

Answer 3: A jcr:like with % is translated to a lucene WildcardQuery. In
order to prevent extremely slow WildcardQueries, a wildcard term should
not start with one of the wildcards * or ?. So this is not a Jackrabbit
implementation detail, but a general Lucene issue (and, I think, an
issue of inverted indexes in general) [1].
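
The reason a leading wildcard hurts can be shown with a toy model of a
sorted term dictionary. This is not Lucene's actual implementation, and
the class and its names are invented for illustration: with a known
prefix the lookup can seek to the first candidate and stop at the first
term that no longer matches, whereas a leading wildcard gives it
nothing to seek to, so every term has to be examined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class TermDictionarySketch {
    private final TreeSet<String> terms = new TreeSet<>();
    int termsVisited; // number of terms examined by the most recent lookup

    TermDictionarySketch(List<String> allTerms) {
        terms.addAll(allTerms);
    }

    /** Models 'prefix%': seek to the prefix, stop at the first non-matching term. */
    List<String> prefixMatch(String prefix) {
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            termsVisited++;
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    /** Models '%infix%': no seek is possible, so all terms are examined. */
    List<String> infixMatch(String infix) {
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            termsVisited++;
            if (term.contains(infix)) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

In this model the cost of infixMatch grows with the total number of
distinct terms in the dictionary, while prefixMatch only pays for the
terms that share the prefix plus one.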

Conclusion 3: Avoid % prefixes in jcr:like. Use jcr:contains when
searching for a specific word. If jcr:contains is not suitable, you can
work around the problem by creating a custom lucene analyzer for the
specific property (see IndexingConfiguration [2] at Index Analyzers).

Question 4: I am not searching through nodes but traversing them, and
this is slow.

Answer 4: Model your repository so that no node has very many child
nodes directly below it. Try to avoid extremely 'large folders', which
become slow in the same way as on your filesystem.
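
One common way to avoid 'large folders' is to derive intermediate
levels from the node name. The sketch below is one possible scheme and
not something Jackrabbit prescribes; the class name, the two-level
layout and the use of String.hashCode() are all assumptions made for
the example.

```java
final class BucketedPath {
    /**
     * Returns a path of the form parent/xx/yy/name, where xx and yy are
     * two-digit hex values taken from the name's hash code. Each bucket
     * level can have at most 256 children.
     */
    static String pathFor(String parent, String name) {
        int hash = name.hashCode();
        String level1 = String.format("%02x", (hash >>> 8) & 0xff);
        String level2 = String.format("%02x", hash & 0xff);
        return parent + "/" + level1 + "/" + level2 + "/" + name;
    }
}
```

Because the path is computed from the name alone, the same name always
maps to the same location, so a node can be found again without a
query. How evenly the nodes spread over the buckets depends on how the
names hash.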

This mail is getting too long :-) I'll come up with some extra FAQs
from time to time, and if people are interested I will make a (wiki?)
document for it. I might need some help though, because in some areas
my knowledge might be insufficient.

To be continued,

Regards Ard

[1] http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/WildcardQuery.html
[2] http://wiki.apache.org/jackrabbit/IndexingConfiguration
