Hi Ard,

Having read through your email and http://www.nabble.com/Explanation-and-solutions-of-some-Jackrabbit-queries-regarding-performance-td15028655.html in more detail, with respect to:

* <param name="respectDocumentOrder" value="false"/>
...if we have JCR types that are explicitly "ordered", will making the above change mean that all ordering is ignored? We have nodes with same-name siblings which we need returned in the right order. Or is it only an issue where there is no default ordering?

Has anyone actually indexed path information? We'd naively assumed that limiting queries using jcr:path was the best way to ensure performance with large data sets.

Regards,

Shaun

-----Original Message-----
From: Ard Schrijvers [mailto:[EMAIL PROTECTED]
Sent: 10 August 2008 11:14
To: [email protected]
Subject: RE: JCR Query Result Caching

Hello Shaun,

First of all, let me point you to a set of tips I wrote some time ago about query performance, see [1].

>
> Hi all,
>
> As our data set increases the overhead of executed JCR
> queries is increasing. For example, we typically want to
> display the top 3 latest BlogEntries on a page requiring
> "select * from acme:BlogEntries where jcr:path like
> '/home/myblog/%'". Profiling shows Lucene access to be a
> hotspot under load. Noted that we can review our node
> structure but ...

After reading the link above, you probably know where the bottleneck is: the path constraint '/home/myblog/%'. What number of nodes are you talking about? 1,000 blog entries, or 1,000,000? (See below for a workaround for your problem, which I am sure will solve your issue.)

>
> Q1: Does JackRabbit provide any facilities to cache the
> results of queries such that they can be shared by concurrent
> sessions for a particular time to live?

Define the result of a query: are you talking about the Lucene result, or the Jackrabbit QueryResult, for instance? Anyway, to answer: Lucene has internal caching, and Jackrabbit has a cache for hierarchical relations (which are needed a lot for your queries, since you have '/home/myblog/%'). Also note that in Jackrabbit versions before 1.4 (from the top of my head, so not 100% sure) this hierarchical cache was broken.
So it also depends on your version. Furthermore, I do not think caching the Jackrabbit QueryResult is a good idea, and it might be session dependent whether some nodes are allowed in the result or not.

>
> As a query returns a set of JCR Nodes, which in turn are
> session specific, I'm assuming caching query results is
> tricky. Caching query results quickly brings us into the
> realm of transactional semantics, isolation levels etc.

Yes, and you won't find your performance improvement here either (obviously it will be fast when cached, but it is not the way to go). Furthermore, because nodes are fetched lazily, it is not the fetching of nodes that is slow (unless you want thousands of results at once), but your hierarchical query.

>
> I'd be interested to hear any experiences in attempting to
> cache JCR query results?

So, by now you'll probably know that your bottleneck most likely lies within the Lucene search. If you have <param name="respectDocumentOrder" value="true"/> (see [1]), this might also be a big performance hit.

So, I think you have 3 options, two of which involve extending some Jackrabbit code, which you might not like:

1) Extend the SearchIndex and cache certain Lucene queries (not exactly sure which and how, but it might be coupled to the kind of queries you are using).

2) During indexing, also index path information (extend NodeIndexer). When searching for a simple path expression like /foo/bar//*, you can easily map this to one single Lucene term, which is blisteringly fast up to millions of nodes. Realize, though, that you give up the almost free-of-charge moving of nodes in Jackrabbit. It is a simple trade-off.

3) If you do not want any programming, then change your SQL/XPath query. Basically, from [1] you should know what the problem is: if the 'where' clause returns many results (if you don't have a where, it will be all nodes), all results need to be checked against the path to decide whether they should be included or not.
So, if you can limit the initial set by thinking of some where clause that returns fewer hits, the query will become faster. So you want the last three BlogEntries added, right? Then you most likely have a timestamp property. Now, suppose that on average 10 entries are added every day; then add a constraint to the 'where' clause that says: only nodes where timestamp > lastweektimestamp. Now far fewer results need to be checked against the path constraint. Still, results from all over the repository added last week will be in the result after the 'where' clause. If you also know that blog entries are of some specific node type, add this information to the 'where' to only include those nodes which are of type 'blog'.

I am quite sure that if you follow the ideas from point 3, your queries will be more than fast enough for millions of nodes in the repository, whereas the query you have now probably slows down after several tens of thousands of nodes.

Hopefully this info helps,

-Ard

[1] http://wiki.apache.org/jackrabbit/Performance

> Regards,
>
> Shaun
>
>
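
For illustration, the query rewrite described under option 3 of the quoted reply might look roughly like the sketch below. It only builds the query string; it does not talk to a repository. The property name acme:created, the one-week window, and the class and method names are assumptions made for this example, and the TIMESTAMP literal follows the JCR 1.0 SQL date syntax as I understand it, so check it against your Jackrabbit version.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: builds a JCR SQL string that narrows the candidate
// set with a date constraint in addition to the path constraint.
final class RecentBlogEntriesQuery {

    // ISO 8601 with milliseconds, always rendered in UTC.
    private static final DateTimeFormatter ISO_MILLIS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                    .withZone(ZoneOffset.UTC);

    // basePath: subtree to search, e.g. "/home/myblog"
    // now:      reference time the cutoff is computed from
    // daysBack: size of the time window (assumption: 7 is enough
    //           to always contain at least three entries)
    static String build(String basePath, Instant now, int daysBack) {
        String cutoff = ISO_MILLIS.format(now.minus(daysBack, ChronoUnit.DAYS));
        // "acme:created" is an assumed timestamp property on the node type.
        return "SELECT * FROM acme:BlogEntries"
                + " WHERE acme:created > TIMESTAMP '" + cutoff + "'"
                + " AND jcr:path LIKE '" + basePath + "/%'"
                + " ORDER BY acme:created DESC";
    }

    public static void main(String[] args) {
        System.out.println(build("/home/myblog", Instant.now(), 7));
    }
}
```

The resulting string would then be passed to the JCR QueryManager as an SQL query, reading only the first three rows of the result. If the window turns out to contain fewer than three entries, the caller would need to retry with a larger daysBack.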
