Re: Query performances

Marcel Reutegger Mon, 19 Mar 2007 01:40:17 -0800

Hi Alessandro,

Alessandro Bologna wrote:

We have run into an interesting behavior doing searches on a quite
large repository (~1,000,000 nodes).
The test data is made of a tree of nodes of type nt:unstructured,
referenceable, with two numeric properties (a sequential count of the node
and a random number between 0 and the count). Each node has a reference to
the parent, and up to 100 child nodes, and is named n<m> where m is the
index of the node, related to the parent node.
So, for instance, /load/n0 is the first node, /load/n1 the second to
/load/n99.
Then each one of them has 100 children and so on, so that a valid path, for
instance, is /load/n23/n34/n50.

One node out of 6 has attached a nt:file node as well, in order to test
full text searches. If requested, I can provide the code to create the test
set.


The strange behavior that prompted me to write to this mailing list, is the
following:

Say that I am searching for a node that contains the word 'beatles' at some
level under the node /load/n40 and I use the following query:
/jcr:root/load/n40//*[jcr:contains(.,'beatles')]
the execution time is 1672ms.
If I use instead:
/jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')]
the execution time is 19749ms.

The second query, in theory, could execute faster than the first, because I
am providing more information (only nodes at the 4th level under /load/n40)
but takes 10 times longer to execute.
Is there a reason why?


there are basically two reasons why the second query takes more time to execute:

- the index does not contain depth (level) information of a node. the depth
  of a node is not stable and may change even if the node itself is not
  changed. if a subtree of nodes is moved to another location the depth of
  all nodes in the subtree changes. the query handler would have to re-index
  the whole subtree.

- multiple child axis with just a * as name test are not optimized.
  /jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')] is translated into
  multiple ChildAxisQuerys each resolves the context nodes and provides a new
  context with the nodes that are the child nodes of the previous context.
  internally the query handler will temporarily have a set of nodes that
  includes all nodes at level 4 under /jcr:root/load/n40. for the query
  /jcr:root/load/n40//*[jcr:contains(.,'beatles')] the index will look up the
  nodes that match the fulltext condition and then filter out the ones that
  do not have /jcr:root/load/n40 as an ancestor. that operation involves less
  nodes and executes faster.
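
To make the difference between the two evaluation strategies concrete, here is a toy model in plain Java. It is only a sketch and not Jackrabbit's query handler: the class AxisCostSketch, its methods, the fan-out of 10 and the "every 500th node matches" rule are all made up for illustration. It counts how many nodes each strategy has to look at: expanding one child axis step at a time, versus starting from the full-text matches and keeping those below the given ancestor.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only; this is not Jackrabbit's query handler.
class AxisCostSketch {
    // parent path -> child paths (stands in for hierarchy lookups)
    final Map<String, List<String>> children = new HashMap<>();
    // paths that match the full-text term (stands in for the index lookup)
    final Set<String> matches = new HashSet<>();
    // number of nodes the most recent strategy had to look at
    long touched;
    private int created;

    // Builds `depth` levels with `fanOut` children each below `path`;
    // every `every`-th node created is marked as a full-text match.
    void build(String path, int fanOut, int depth, int every) {
        if (depth == 0) return;
        List<String> kids = new ArrayList<>();
        children.put(path, kids);
        for (int i = 0; i < fanOut; i++) {
            String child = path + "/n" + i;
            kids.add(child);
            if (++created % every == 0) matches.add(child);
            build(child, fanOut, depth - 1, every);
        }
    }

    // root/*/*/...[match]: each step replaces the context with all of its
    // children; the match test is applied to the final context only.
    List<String> viaChildAxes(String root, int steps) {
        touched = 0;
        List<String> context = Collections.singletonList(root);
        for (int s = 0; s < steps; s++) {
            List<String> next = new ArrayList<>();
            for (String p : context) {
                List<String> kids = children.get(p);
                if (kids != null) next.addAll(kids);
            }
            touched += next.size();
            context = next;
        }
        List<String> hits = new ArrayList<>();
        for (String p : context) {
            if (matches.contains(p)) hits.add(p);
        }
        return hits;
    }

    // root//*[match]: start from the matches, keep those below root.
    List<String> viaDescendantAxis(String root) {
        touched = 0;
        List<String> hits = new ArrayList<>();
        for (String p : matches) {
            touched++;
            if (p.startsWith(root + "/")) hits.add(p);
        }
        return hits;
    }

    public static void main(String[] args) {
        AxisCostSketch model = new AxisCostSketch();
        model.build("/load", 10, 5, 500);
        int childHits = model.viaChildAxes("/load/n4", 4).size();
        System.out.println("child axes: hits=" + childHits + " touched=" + model.touched);
        int descHits = model.viaDescendantAxis("/load/n4").size();
        System.out.println("descendant: hits=" + descHits + " touched=" + model.touched);
    }
}
```

In this model the child-axis strategy touches every node on the way down, no matter how few of them match, while the descendant strategy only touches as many nodes as the full-text condition selects. The two queries are not equivalent (the first returns matches at one level only), which is also true of the two queries discussed above.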

The other, way more worrisome problem, appears to be the opposite:
I have executed the following two queries
/jcr:root/load/n50/n2/* ==> 931ms
/jcr:root/load/n50/n2/*/* ==> 661ms

that's indeed strange. maybe you get this result because the cache is filled
up by the first query and the second one can take advantage of the
pre-filled cache. can you please run those queries a second time just to
make sure that both run against the same cache state?
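
One way to compare both queries against a similar cache state is to run each of them a few times untimed before measuring. The helper below is a minimal sketch of that idea in plain Java; WarmTiming and medianMillis are hypothetical names, not part of Jackrabbit or the JCR API, and the Runnable is a stand-in for code that executes one query and iterates over its result.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Jackrabbit or the JCR API.
class WarmTiming {
    // Runs `task` `warmups` times without timing it, then `runs` times with
    // timing, and returns the (upper) median elapsed time in milliseconds.
    static long medianMillis(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run(); // fills caches; result is discarded
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with code that runs one query
        // and iterates over all of its result nodes.
        Runnable standIn = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum < 0) throw new IllegalStateException();
        };
        System.out.println(medianMillis(standIn, 1, 5) + " ms");
    }
}
```

Timing each query with the same number of warm-up runs, and in both orders, should show whether the 931ms vs. 661ms difference comes from the cache rather than from the queries themselves.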

The first is returning all nodes one level below /load/n50/n2 and the
second two levels below. There are no other nodes under that.
When I tried the following query, which would return the same nodes in one
operation, the result was surprising (in a bad way)
/jcr:root/load/n50/n2//* ==> *353769ms*

The CPU goes 100%, I see in the jackrabbit logs a lot of entries similar to:

DocNumberCache: size=1024/1024, #accesses=17039, #hits=167, #misses=16872,
cacheRatio=1% (DocNumberCache.java, line 155)

and then finally, *some 5 minutes later*, I get the result.
Even if I restrict the query, it still takes the same time:
/jcr:root/load/n50/n2/n96//* and there's maybe only a hundred nodes under
that.

unfortunately those are queries that are not optimized at all and will
result in a full index traversal. see below for a workaround.

I have the exact same behavior if I try with the SQL syntax: select * from
nt:base where jcr:path like '/load/n50/n2/n96/%'

that's because this query is equivalent to the above XPath statement and if
that's the case the lucene query, which is executed ultimately, is the same.

The version of JR is 1.2.2. The backend is Oracle 10g, and I am running the
application on Tomcat 5.5 with jdk 1.5 and 1GB assigned to the JVM (on
Windows)

Does anybody have any idea why this is happening and whether there is a
workaround?

I thought no one would actually execute such queries and didn't bother to
optimize them because there's a simple workaround:


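// needs: javax.jcr.Node, javax.jcr.RepositoryException,
//        javax.jcr.util.TraversingItemVisitor
Node rootNode = ...
// walk the subtree below load/n50/n2 instead of querying for it
rootNode.getNode("load/n50/n2").accept(new TraversingItemVisitor.Default() {
    protected void entering(Node node, int level)
            throws RepositoryException {
        // called once per node in the subtree; use the node here
        node.getPath();
    }
});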

If you feel this should be done efficiently by the query handler please file
a jira issue. Thanks.


regards
 marcel
