Re: Query performances

Marcel Reutegger Wed, 28 Mar 2007 01:09:32 -0800

Hi Alessandro,

Alessandro Bologna wrote:

Now I have found another unusual behavior, and I was hoping you could
explain this too...
These queries have been executed in sequence (without restarting):



Executing query: /jcr:root/load/n10/n33/[EMAIL PROTECTED]>10000]
Query execution time:10245ms
Number of nodes:91

Executing query: /jcr:root/load/n10/n33/[EMAIL PROTECTED]>10000 and @random<10000000]
Query execution time:20409ms
Number of nodes:91



Executing query: /jcr:root/load/n10/n33/[EMAIL PROTECTED]>10000 and
@random<10000000 and @random<10000001]
Query execution time:30053ms
Number of nodes:91

I think that the execution time on the first query is already quite high (an
equality query takes just a few milliseconds),


This has already been improved with http://issues.apache.org/jira/browse/JCR-804

but what I am more
disconcerted about is that the second query (with two conditions, the second
being a 'dummy' one since it is true for each of the 91 nodes returned by
the second query) takes double the time, and the third query (with the third
condition being basically the same as the first one) takes three times as
much.

Typically I would expect an 'and' query to be executed on the results of the
first one, and therefore to take just a little bit less.

So the questions are:
1. why does it take so long to find 91 nodes in the first query?


this is caused by:
- MultiTermDocs is expensive on large value ranges (-> fixed in JCR-804)

- @random>10000 (probably) selects a great number of nodes, which are later
  excluded again because of the path constraint

2. why the second and third query take as much time as the first times the
number of expressions?

each of the expressions is evaluated independently and in a second step
'and'ed together. therefore the predominant cost in your query seems to be
the individual expressions. because each of the range expressions selects a
lot of nodes lucene cannot optimize the execution well. see below for a
workaround.
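
The following is a toy model of that behaviour, not Lucene or Jackrabbit
code. It assumes a made-up in-memory "index" (NODES) and a hypothetical
evaluate() helper that scans every node once per expression and only then
intersects the hit sets. Under that assumption, a redundant extra expression
leaves the result unchanged but adds a full pass of work, which matches the
roughly 1x / 2x / 3x timings reported above.

```python
# Toy model only: NODES and evaluate() are illustrative assumptions,
# not part of Lucene or Jackrabbit.
NODES = {i: (i * 7919) % 10_000_000 for i in range(100_000)}  # node id -> @random


def evaluate(expressions):
    """Evaluate every expression over all nodes, then intersect the hit sets."""
    comparisons = 0
    result = None
    for expr in expressions:
        hits = set()
        for node_id, value in NODES.items():
            comparisons += 1
            if expr(value):
                hits.add(node_id)
        # The 'and' happens only after each expression was evaluated in full.
        result = hits if result is None else result & hits
    return result, comparisons


one, cost_one = evaluate([lambda v: v > 10000])
two, cost_two = evaluate([lambda v: v > 10000, lambda v: v < 10_000_000])
three, cost_three = evaluate(
    [lambda v: v > 10000, lambda v: v < 10_000_000, lambda v: v < 10_000_001]
)

print(one == two == three)               # True: the extra conditions are redundant
print(cost_one, cost_two, cost_three)    # 100000 200000 300000
```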

3. is there a workaround to do range queries?

partitioning the random property into multiple properties may help. the basic
idea is that you split the random number into a sum of multiple values (here:
one digit per decimal place).


@random = 34045

would become:

@random1 = 5
@random10 = 4
@random100 = 0
@random1000 = 4
@random10000 = 3

later if you search for all random properties with a value larger than 12000
you would have a query:

//*[(@random10000 = 1 and @random1000 >= 2) or (@random10000 >= 2)]

because the distinct values of the split up properties are small, lucene can
much better optimize the query execution.
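
A minimal sketch of the splitting step, in plain Python rather than the JCR
API. The helper names split_digits() and matches_from_12000() are
hypothetical, and the sketch assumes non-negative values with at most five
decimal digits. The second function mirrors the XPath predicate above so it
can be checked against an ordinary range comparison. Note that the predicate
also matches the value 12000 itself, so it behaves like ">= 12000".

```python
def split_digits(value, places=5, name="random"):
    """Return one property per decimal place: random1, random10, random100, ..."""
    if not 0 <= value < 10 ** places:
        raise ValueError("value does not fit in the given number of places")
    return {f"{name}{10 ** i}": (value // 10 ** i) % 10 for i in range(places)}


def matches_from_12000(props):
    """Python equivalent of the digit-based XPath predicate shown above."""
    return (
        props["random10000"] == 1 and props["random1000"] >= 2
    ) or props["random10000"] >= 2


props = split_digits(34045)
print(props)
# {'random1': 5, 'random10': 4, 'random100': 0, 'random1000': 4, 'random10000': 3}

# The digit predicate agrees with a plain range check for every 5-digit value.
agrees = all(
    matches_from_12000(split_digits(v)) == (v >= 12000) for v in range(100_000)
)
print(agrees)  # True
```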


regards
 marcel
