+1 for putting this in the wiki. It's the best explanation I have read so far of how to optimize some queries on Jackrabbit and why some behave unexpectedly. '//foo' being faster than '/bar/baz/foo' was one of them.
Thanks!
Alessandro

On Jan 22, 2008 4:17 PM, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> Hello Martin Zdila regarding JCR-1196 et al,
>
> from time to time I see mails regarding performance of queries and slow
> things like queryResult.getNodes().hasNext(). There are queries which
> can be slow, there are data modelling structures which might be slow,
> and there are seemingly trivial things like
> queryResult.getNodes().hasNext() which might be slow. I write 'might'
> all the time, because everything can and must be blistering fast with
> millions of documents, and most of the time the solutions to achieve
> this are extremely simple. We just have to document some pitfalls and
> easily made mistakes. I'll try to find some time in the near future to
> document the parts I am aware of in the form of a FAQ, like the rest of
> this mail. For now, just some frequently made mistakes off the top of
> my head:
>
> @Martin Zdila: if you are not interested in reading the rest of this
> mail, just add <param name="respectDocumentOrder" value="false"/> to the
> <SearchIndex> element of your workspace.xml (and repository.xml). Also
> try to avoid 4000 child nodes (certainly same-name nodes) under one
> node; try to create a larger tree where nodes do not contain many child
> nodes. Just as with your filesystem, this is not fast.
>
>
> Question 1: why is a search for xpath '/jcr:root/a/b/c' slower than
> '//c' or '//[EMAIL PROTECTED]'?
>
> Answer 1: When a path like '/jcr:root/a/b/c' or '/jcr:root/a//*/c' is
> executed, the hierarchy manager has to check for all found nodes
> whether their parents are correct. Since Jackrabbit does not store
> hierarchical data (if it did, it could no longer move a node
> efficiently, at least in the current architecture), hierarchies need to
> be checked by iterating through the Lucene indexes to find the parent
> nodes of a result. This is CPU-consuming.
> Although the hierarchy has been cached properly since Jackrabbit 1.4,
> returning many results is still an expensive operation. The first
> execution of a query might be slow because the hierarchy cache needs to
> be built up. Queries like '//c' or '//[EMAIL PROTECTED]' do not need to
> check hierarchies, because their results do not have to be checked
> against their parent nodes.
>
> Conclusion 1: When the result set of the search is expected to be large,
> try to avoid path info in the xpath. Try to distinguish based on, for
> example, node type or some property.
>
> Question 2: My xpath was '//c' and the result size is 10,000 nodes. When
> I call queryResult.getNodes().hasNext() it takes up to minutes to
> complete this call.
>
> Answer 2: For Jackrabbit version < 1.5, the default setting in the
> <SearchIndex> configuration in repository.xml is
> <param name="respectDocumentOrder" value="true"/>. This means that when
> a query does *not* have an 'order by' clause, result nodes will be in
> document order. Returning nodes in document order for many results
> (> 1000) becomes increasingly slow. You can fix this by either setting
> respectDocumentOrder to false in your repository.xml (and in
> workspace.xml if you have an existing workspace already) *or* by adding
> an 'order by' clause to your query. A delay of minutes will drop to
> 0-15 ms.
>
> Conclusion 2: When you have a lot of results, either include an 'order
> by' clause or set respectDocumentOrder to false. Modelling your content
> with many child nodes below one single node makes the problem even
> worse when you have respectDocumentOrder = true and do not define an
> 'order by' clause.
>
> Question 3: My xpath is '//*[jcr:like(@propertyName, '%somevalue%')]'
> and it takes minutes to complete.
>
> Answer 3: a jcr:like with % will be translated to a Lucene
> WildcardQuery.
> In order to prevent extremely slow WildcardQueries, a wildcard term
> should not start with one of the wildcards * or ?. So this is not a
> Jackrabbit implementation detail, but a general Lucene (and I think
> inverted indexes in general) issue [1].
>
> Conclusion 3: Avoid % prefixes in jcr:like. Use jcr:contains when
> searching for a specific word. If jcr:contains is not suitable, you can
> work around the problem by creating a custom Lucene analyzer for the
> specific property (see IndexingConfiguration [2] at Index Analyzers).
>
> Question 4: I am not searching through nodes, but traversing, and this
> is slow.
>
> Answer 4: Model your repository so that it does not have very many
> child nodes directly below a node. Try to structure your repository so
> that it does not have extremely 'large folders', comparable to how your
> filesystem would become slow.
>
> This mail is getting too long :-) I'll come up with some extra FAQs
> from time to time, and if people are interested I will make a (wiki?)
> document for it. I might need some help, though, because in some parts
> my knowledge might be insufficient.
>
> To be continued,
>
> Regards Ard
>
> [1] http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/WildcardQuery.html
> [2] http://wiki.apache.org/jackrabbit/IndexingConfiguration
>
> --
>
> Hippo
> Oosteinde 11
> 1017WT Amsterdam
> The Netherlands
> Tel +31 (0)20 5224466
> -------------------------------------------------------------
> [EMAIL PROTECTED] / [EMAIL PROTECTED] / http://www.hippo.nl
> --------------------------------------------------------------
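
On Answer 3, a rough sketch of why a leading wildcard hurts may help. It models the term dictionary as a plain sorted set, which is an assumption for illustration only (Lucene's real term dictionary and WildcardQuery are more involved), and the class and method names are made up. A prefix pattern can seek into the sorted terms and stop early, while a pattern that starts with a wildcard has to inspect every term:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative model only: a sorted set stands in for an index term dictionary.
class WildcardCostSketch {
    // Number of dictionary terms inspected by the most recent lookup.
    static int inspected;

    // Prefix lookup (like 'some%'): seek to the prefix in sorted order and
    // stop at the first term that no longer starts with it.
    static List<String> prefixMatches(TreeSet<String> terms, String prefix) {
        inspected = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            inspected++;
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    // Leading-wildcard lookup (like '%somevalue%'): sort order gives no
    // starting point, so every term in the dictionary is inspected.
    static List<String> infixMatches(TreeSet<String> terms, String infix) {
        inspected = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            inspected++;
            if (term.contains(infix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 100_000; i++) {
            terms.add("term" + i);
        }
        terms.add("somevalue");
        terms.add("xsomevaluex");

        System.out.println(prefixMatches(terms, "some") + " after inspecting " + inspected + " terms");
        System.out.println(infixMatches(terms, "somevalue") + " after inspecting " + inspected + " terms");
    }
}
```

The inspected count for the second lookup grows with the size of the dictionary, whereas the first one only touches the terms around the prefix.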

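On the advice to avoid thousands of child nodes under one node, one possible way to get a larger tree is to derive intermediate "bucket" levels from a hash of the node name. The sketch below is only an assumption about how that could be done, not something from Ard's mail or from Jackrabbit itself; bucketedPath is a hypothetical helper and the two-hex-digit bucket names are an arbitrary choice:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Jackrabbit or the JCR API.
class BucketedPathSketch {
    // Builds parentPath/<bucket>/.../nodeName, where each bucket is one byte
    // of the name's hash code written as two hex digits (up to 256 buckets
    // per level). levels must be between 1 and 4 because an int has 4 bytes.
    static String bucketedPath(String parentPath, String nodeName, int levels) {
        if (levels < 1 || levels > 4) {
            throw new IllegalArgumentException("levels must be between 1 and 4");
        }
        int hash = nodeName.hashCode();
        StringBuilder path = new StringBuilder(parentPath);
        for (int i = 0; i < levels; i++) {
            int bucket = (hash >>> (8 * i)) & 0xFF;
            path.append('/').append(String.format("%02x", bucket));
        }
        return path.append('/').append(nodeName).toString();
    }

    public static void main(String[] args) {
        // Spread 4000 names over one bucket level and report the fullest bucket.
        Map<String, Integer> childrenPerBucket = new HashMap<>();
        for (int i = 0; i < 4000; i++) {
            String path = bucketedPath("/content", "doc" + i, 1);
            String bucket = path.substring(0, path.lastIndexOf('/'));
            childrenPerBucket.merge(bucket, 1, Integer::sum);
        }
        int max = 0;
        for (int count : childrenPerBucket.values()) {
            max = Math.max(max, count);
        }
        System.out.println(childrenPerBucket.size() + " buckets, largest holds " + max + " nodes");
    }
}
```

The trade-off is that the path of a node is no longer meaningful to a human, so this only fits content that is looked up by name or by query rather than browsed.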