Re: Explanation and solutions of some Jackrabbit queries regarding performance

Alessandro Bologna Tue, 22 Jan 2008 14:38:03 -0800

+1 for putting this in the wiki. It's the best explanation I have
read so far of how to optimize some queries on Jackrabbit and why
some behave unexpectedly. //foo being faster than /bar/baz/foo was
one of them.


Thanks!
Alessandro


On Jan 22, 2008 4:17 PM, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> Hello Martin Zdila, regarding JCR-1196 et al.,
>
> From time to time I see mails about the performance of queries and
> about slow calls like queryResult.getNodes().hasNext(). There are
> queries which can be slow, there are data modelling structures which
> might be slow, and there are seemingly trivial things like
> queryResult.getNodes().hasNext() which might be slow. I write 'might'
> all the time because everything can and must be blisteringly fast with
> millions of documents, and most of the time the solutions to achieve
> this are extremely simple. We just have to document some pitfalls and
> easily made mistakes. I'll try to find some time in the near future to
> document the parts I am aware of in the form of a FAQ, like the rest of
> this mail. For now, just some frequently made mistakes off the top of
> my head:
>
> @Martin Zdila: if you are not interested in reading the rest of this
> mail, just add <param name="respectDocumentOrder" value="false"/> to the
> <SearchIndex> element of your workspace.xml (and repository.xml). Also
> try to avoid 4000 child nodes (certainly same-name nodes) under one
> node; try to create a larger tree in which nodes do not contain many
> child nodes. Just like on your filesystem, this is not fast.
>
>
> Question 1: why is search for xpath '/jcr:root/a/b/c' slower than '//c'
> or '//[EMAIL PROTECTED]' ?
>
> Answer 1: When a query with a path like '/jcr:root/a/b/c' or
> '/jcr:root/a//*/c' is executed, the hierarchy manager has to check for
> every found node whether its parents are correct. Since Jackrabbit does
> not store hierarchical data (if it did, it could no longer move a node
> efficiently, at least in the current architecture), hierarchies need to
> be checked by iterating through the Lucene indexes to find the parent
> nodes of a result. This is CPU-intensive. Although the hierarchy has
> been cached properly since Jackrabbit 1.4, returning many results is
> still an expensive operation. The first execution of a query might be
> slow because the hierarchy cache needs to be built up. Queries like
> '//c' or '//[EMAIL PROTECTED]' do not need to check hierarchies, because
> their results do not have to be checked against their parent nodes.
>
> Conclusion 1: When the result set of the search is expected to be
> large, try to avoid path info in the xpath. Try to distinguish on, for
> example, node type or some property instead.
>
> Question 2: My xpath was '//c' and the result size is 10,000 nodes. When
> I call queryResult.getNodes().hasNext(), it can take minutes for this
> call to complete.
>
> Answer 2: For Jackrabbit versions < 1.5, the default setting in the
> <SearchIndex> configuration in repository.xml is
> <param name="respectDocumentOrder" value="true"/>. This means that when
> a query does *not* have an 'order by' clause, result nodes will be in
> document order. Returning nodes in document order becomes increasingly
> slow for many results (> 1000). You can fix this by either setting
> respectDocumentOrder to false in your repository.xml (and in
> workspace.xml if you already have an existing workspace) *or* by adding
> an 'order by' clause to your query. A delay of minutes will drop to
> 0-15 ms.
>
> Conclusion 2: When you have a lot of results, either include an 'order
> by' clause or set respectDocumentOrder to false. Modelling your content
> with many child nodes below one single node makes the problem even
> worse when respectDocumentOrder is true and you do not define an
> 'order by' clause.
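
A minimal sketch of the 'order by' route from Conclusion 2, assuming the XPath statements are built as plain strings. XPathOrdering and withOrderBy are hypothetical names, not part of Jackrabbit or the JCR API, and the check for an existing clause is deliberately naive:

```java
import java.util.Locale;

/** Hypothetical helper: appends an explicit ordering to a JCR XPath statement. */
final class XPathOrdering {

    private XPathOrdering() {
    }

    /**
     * Returns the statement unchanged if it already contains an 'order by'
     * clause, otherwise appends the given ordering expression.
     */
    static String withOrderBy(String xpath, String ordering) {
        String trimmed = xpath.trim();
        // Naive check: the query is not parsed, so an ' order by ' inside a
        // string literal would also be treated as an existing clause.
        if (trimmed.toLowerCase(Locale.ROOT).contains(" order by ")) {
            return trimmed;
        }
        return trimmed + " order by " + ordering;
    }
}
```

For example, withOrderBy("//c", "jcr:score() descending") yields "//c order by jcr:score() descending", which would then be passed to QueryManager.createQuery(statement, Query.XPATH) as usual. Which ordering expression makes sense depends on your content.
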
>
> Question 3: My xpath is '//*[jcr:like(@propertyName, '%somevalue%')]'
> and it takes minutes to complete.
>
> Answer 3: A jcr:like with % is translated to a Lucene WildcardQuery.
> To prevent extremely slow WildcardQueries, a wildcard term should not
> start with one of the wildcards * or ?. So this is not a Jackrabbit
> implementation detail, but a general Lucene (and, I think, a general
> inverted index) issue [1].
>
> Conclusion 3: Avoid % prefixes in jcr:like. Use jcr:contains when
> searching for a specific word. If jcr:contains is not suitable, you can
> work around the problem by creating a custom Lucene analyzer for the
> specific property (see IndexingConfiguration [2] at Index Analyzers).
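
To illustrate why a leading wildcard hurts, here is a toy model of a sorted term dictionary. It is only a sketch of the general inverted-index argument and not Lucene's actual data structure; the TermDictionary class and its method names are made up for this example. A 'prefix%' search can seek to the prefix and stop early, while a '%infix%' search has to test every term:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy model of a sorted term dictionary; not how Lucene stores terms. */
final class TermDictionary {

    private final NavigableSet<String> terms = new TreeSet<>();
    private int termsVisited;

    void add(String term) {
        terms.add(term);
    }

    /** Like 'prefix%': seek to the prefix, stop at the first term that no longer matches. */
    List<String> startingWith(String prefix) {
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            termsVisited++;
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    /** Like '%infix%': no seek is possible, so every term is tested. */
    List<String> containing(String infix) {
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            termsVisited++;
            if (term.contains(infix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    /** Number of terms examined by the most recent search. */
    int termsVisited() {
        return termsVisited;
    }
}
```

With millions of distinct terms, the second method always walks the whole dictionary, whereas the first only touches the matching range plus one term.
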
>
> Question 4: I am not searching through nodes, but traversing, and this
> is slow.
>
> Answer 4: Model your repository so that a node does not have very many
> child nodes directly below it. Try to structure your repository so that
> it has no extremely 'large folders', which would become slow in the same
> way they do on your filesystem.
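
One possible way to avoid 'large folders', sketched under the assumption that the node names are known up front and that the exact location of a node does not matter to the application. BucketedPath and forName are hypothetical names; the two hash-derived levels are just one choice of tree shape:

```java
/** Hypothetical helper: spreads nodes over intermediate 'bucket' folders. */
final class BucketedPath {

    private BucketedPath() {
    }

    /**
     * Builds parent/xx/yy/name, where xx and yy are two hex digits each,
     * derived from the hash of the name. Every level then has at most 256
     * bucket children instead of one folder holding thousands of nodes.
     */
    static String forName(String parent, String name) {
        int hash = name.hashCode();
        String level1 = String.format("%02x", (hash >>> 8) & 0xff);
        String level2 = String.format("%02x", hash & 0xff);
        return parent + "/" + level1 + "/" + level2 + "/" + name;
    }
}
```

Because the path is computed from the name alone, the same name always maps to the same location, so a node can be found again without a query.
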
>
> This mail is getting too long :-) I'll come up with some extra FAQs
> from time to time, and if people are interested I will make a (wiki?)
> document for it. I might need some help though, because in some parts
> my knowledge might be insufficient.
>
> To be continued,
>
> Regards Ard
>
> [1] http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/WildcardQuery.html
> [2] http://wiki.apache.org/jackrabbit/IndexingConfiguration
>
> --
>
> Hippo
> Oosteinde 11
> 1017WT Amsterdam
> The Netherlands
> Tel  +31 (0)20 5224466
> -------------------------------------------------------------
> [EMAIL PROTECTED] / [EMAIL PROTECTED] / http://www.hippo.nl
> --------------------------------------------------------------
>
