Hi Alessandro,

Alessandro Bologna wrote:
Now I have found another unusual behavior, and I was hoping you could
explain this too...
These queries have been executed in sequence (without restarting):


Executing query: /jcr:root/load/n10/n33/*[@random>10000]
Query execution time:10245ms
Number of nodes:91



Executing query: /jcr:root/load/n10/n33/*[@random>10000 and @random<10000000]
Query execution time:20409ms
Number of nodes:91



Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
@random<10000000 and @random<10000001]
Query execution time:30053ms
Number of nodes:91


I think that the execution time of the first query is already quite high (an
equality query takes just a few milliseconds),

This has already been improved with http://issues.apache.org/jira/browse/JCR-804

but what I am more
disconcerted about is that the second query (with two conditions, the second
being a 'dummy' one since it is true for each of the 91 nodes returned by
the first query) takes double the time, and the third query (with the third
condition being basically the same as the first one) takes three times as
much.

Typically I would expect an 'and' query to be executed on the results of the
first one, and therefore to take just a little bit less.

So the questions are:
1. why does it take so long to find 91 nodes in the first query?

this is caused by:
- MultiTermDocs is expensive on large value ranges (-> fixed in JCR-804)
- @random>10000 (probably) selects a great number of nodes, which are later excluded again because of the path constraint

2. why the second and third query take as much time as the first times the
number of expressions?

each of the expressions is evaluated independently, and in a second step the results are 'and'ed together. the predominant cost in your query therefore seems to be the individual expressions: because each range expression selects a lot of nodes, lucene cannot optimize the execution well. see the workaround under question 3 below.
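to illustrate why the time grows with the number of expressions, here is a toy model in Python. it is only a sketch and not Jackrabbit or Lucene code: the function `evaluate_and`, the dictionary `index` and the `scanned` counter are made up for illustration, and the assumption is simply that every expression enumerates its own matches before the results are intersected.

```python
# Toy model only: this is NOT Jackrabbit or Lucene code.
# Assumption: every expression is evaluated on its own over all indexed
# values, and the per-expression hit sets are intersected afterwards.

def evaluate_and(values, predicates):
    """Return (matching doc ids, number of values inspected)."""
    scanned = 0
    result = None
    for predicate in predicates:
        hits = set()
        for doc_id, value in values.items():
            scanned += 1  # each expression inspects every value again
            if predicate(value):
                hits.add(doc_id)
        # second step: 'and' the independent results together
        result = hits if result is None else result & hits
    return result, scanned


# Hypothetical index: doc id -> value of the "random" property.
index = {doc_id: (doc_id * 37) % 100000 for doc_id in range(1000)}

one, cost_one = evaluate_and(index, [lambda v: v > 10000])
two, cost_two = evaluate_and(
    index, [lambda v: v > 10000, lambda v: v < 10000000]
)

print(len(one), cost_one)
print(len(two), cost_two)
```

in this model the second query returns the same nodes as the first, but inspects twice as many values, which matches the pattern of 10s, 20s and 30s in your measurements.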

3. is there a workaround to do range queries?

partitioning the random property into multiple properties may help. the basic idea is to split the random number into its decimal digits and store each digit in a separate property.

@random = 34045

would become:

@random1 = 5
@random10 = 4
@random100 = 0
@random1000 = 4
@random10000 = 3

later, if you search for all nodes with a random value of 12000 or more, you would use a query like:
//*[(@random10000 = 1 and @random1000 >= 2) or (@random10000 >= 2)]

because each of the split-up properties has only a few distinct values, lucene can optimize the query execution much better.
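a minimal sketch of the splitting, in Python, in case it helps. the helper names `split_digits` and `at_least_12000` are made up for this example, and it assumes non-negative integer values below 100000 (five digit properties, as in the example above). the second function mirrors the logic of the query predicate so it can be checked against a plain numeric comparison.

```python
def split_digits(value, digits=5):
    """Split a non-negative integer into one property per decimal digit.

    The property names follow the example in this message:
    random1, random10, random100, ...
    """
    props = {}
    for i in range(digits):
        place = 10 ** i
        props[f"random{place}"] = (value // place) % 10
    return props


def at_least_12000(props):
    """Same logic as the predicate
    (@random10000 = 1 and @random1000 >= 2) or (@random10000 >= 2),
    valid only for values below 100000."""
    return (
        props["random10000"] == 1 and props["random1000"] >= 2
    ) or props["random10000"] >= 2


print(split_digits(34045))
print(at_least_12000(split_digits(34045)))
```

when writing a node you would store the digit properties next to (or instead of) the original property, and rewrite range queries into comparisons on the digits as shown.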

regards
 marcel
