Hi Alessandro,
Alessandro Bologna wrote:
Now I have found another unusual behavior, and I was hoping you could
explain this too...
These queries have been executed in sequence (without restarting):
Executing query: /jcr:root/load/n10/n33/*[@random>10000]
Query execution time:10245ms
Number of nodes:91
Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
@random<10000000]
Query execution time:20409ms
Number of nodes:91
Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
@random<10000000 and @random<10000001]
Query execution time:30053ms
Number of nodes:91
I think that the execution time on the first query is already quite high
(an equality query takes just a few milliseconds),
This has already been improved with http://issues.apache.org/jira/browse/JCR-804
but what I am more disconcerted about is that the second query (with two
conditions, the second being a 'dummy' one since it is true for each of the
91 nodes returned by the second query) takes double the time, and the third
query (with the third condition being basically the same as the first one)
takes three times as much.
Typically I would expect an 'and' query to be executed on the results of
the first one, and therefore to take just a little bit less.
So the questions are:
1. why does it take so long to find 91 nodes in the first query?
this is caused by:
- MultiTermDocs is expensive on large value ranges (-> fixed in JCR-804)
- @random>10000 (probably) selects a great number of nodes, which are later
excluded again because of the path constraint
2. why the second and third query take as much time as the first times the
number of expressions?
each of the expressions is evaluated independently and the results are 'and'ed
together in a second step. therefore the predominant cost in your query seems
to be the individual expressions. because each of the range expressions selects
a lot of nodes, lucene cannot optimize the execution well. see below for a
workaround.
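
to illustrate why the time grows with the number of expressions, here is a toy model in Python. it is not Lucene or Jackrabbit code; the index, the values and the function name evaluate_and are made up for illustration. it only assumes what is described above: every expression is evaluated over the whole index, and the results are intersected afterwards.

```python
# Toy model only: NOT Lucene or Jackrabbit code. The index layout and all
# names here are hypothetical and exist purely to illustrate the cost pattern.

def evaluate_and(index, predicates):
    """Evaluate every predicate over the whole index, then intersect the results.

    Returns the matching node ids and the number of values that were visited.
    """
    visited = 0
    result = None
    for predicate in predicates:
        matches = set()
        for node_id, value in index.items():
            visited += 1
            if predicate(value):
                matches.add(node_id)
        # The 'and' happens only after the expression has been fully evaluated.
        result = matches if result is None else result & matches
    return result, visited


# Hypothetical index: node id -> value of its random property.
index = {node_id: node_id * 7 for node_id in range(1000)}

one = [lambda v: v > 100]
three = [lambda v: v > 100, lambda v: v < 10_000_000, lambda v: v < 10_000_001]

hits1, work1 = evaluate_and(index, one)
hits3, work3 = evaluate_and(index, three)
print(len(hits1), work1)
print(len(hits3), work3)
```

in this model the two extra expressions do not change the result set, but the work is three times that of the single expression, which matches the timings you measured.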
3. is there a workaround to do range queries?
partitioning the random property into multiple properties may help. the basic
idea is that you split the random number into its decimal digits and store each
digit in its own property.
@random = 34045
would become:
@random1 = 5
@random10 = 4
@random100 = 0
@random1000 = 4
@random10000 = 3
later, if you search for all random properties with a value of at least 12000,
you would have a query:
//*[(@random10000 = 1 and @random1000 >= 2) or (@random10000 >= 2)]
because the distinct values of the split up properties are small, lucene can
much better optimize the query execution.
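
a minimal sketch of the splitting and of building such a predicate, written in Python for brevity. the helper names (split_digits, at_least_clauses, to_xpath, matches) are made up for this example and are not part of Jackrabbit; the sketch assumes non-negative integers with at most five digits and a "value >= threshold" search.

```python
# Sketch only: helper names are hypothetical, not a Jackrabbit API.

def split_digits(value, places=5):
    """Map a non-negative integer to one digit property per decimal place."""
    return {f"random{10 ** i}": (value // 10 ** i) % 10 for i in range(places)}


def at_least_clauses(threshold, places=5):
    """Build 'or'ed clauses of (property, operator, digit) tests for value >= threshold."""
    if not 0 < threshold < 10 ** places:
        raise ValueError("threshold must be between 1 and 10**places - 1")
    digits = split_digits(threshold, places)
    # Places below the lowest non-zero digit never need to be tested.
    lowest = min(i for i in range(places) if digits[f"random{10 ** i}"] != 0)
    clauses, prefix = [], []
    for i in range(places - 1, lowest - 1, -1):
        name = f"random{10 ** i}"
        digit = digits[name]
        # Higher places must be strictly larger; the lowest place may be equal.
        bound = digit if i == lowest else digit + 1
        if bound <= 9:
            clauses.append(prefix + [(name, ">=", bound)])
        prefix = prefix + [(name, "=", digit)]
    return clauses


def to_xpath(clauses):
    """Render the clauses as an XPath predicate body."""
    return " or ".join(
        "(" + " and ".join(f"@{n} {op} {d}" for n, op, d in clause) + ")"
        for clause in clauses
    )


def matches(props, clauses):
    """Evaluate the clauses against split properties (used to check the sketch)."""
    def test(name, op, digit):
        return props[name] >= digit if op == ">=" else props[name] == digit
    return any(all(test(*t) for t in clause) for clause in clauses)


print(split_digits(34045))
print("//*[" + to_xpath(at_least_clauses(12000)) + "]")
```

for a threshold of 12000 this yields the same two clauses as the query above, in the opposite order. each test only ever compares against a single digit, so every property has at most ten distinct values.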
regards
marcel