Marcel,

just wanted to get back to you (and the list as well). I downloaded
jackrabbit-webapp-1.3-SNAPSHOT and ran the same tests again. Performance
is much better and the queries seem to be much more optimized.
Congratulations on the improvements.

Alessandro
On 3/28/07, Marcel Reutegger <[EMAIL PROTECTED]> wrote:
Hi Alessandro,

Alessandro Bologna wrote:
> Now I have found another unusual behavior, and I was hoping you could
> explain this too...
> These queries have been executed in sequence (without restarting):
>
> Executing query: /jcr:root/load/n10/n33/*[@random>10000]
> Query execution time: 10245ms
> Number of nodes: 91
>
> Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
> @random<10000000]
> Query execution time: 20409ms
> Number of nodes: 91
>
> Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
> @random<10000000 and @random<10000001]
> Query execution time: 30053ms
> Number of nodes: 91
>
> I think that the execution time on the first query is already quite
> high (an equality query takes just a few milliseconds),

This has already been improved with
http://issues.apache.org/jira/browse/JCR-804

> but what I am more disconcerted about is that the second query (with
> two conditions, the second being a 'dummy' one since it is true for
> each of the 91 nodes returned by the first query) takes double the
> time, and the third query (with the third condition being basically
> the same as the first one) takes three times as much.
>
> Typically I would expect an 'and' query to be executed on the results
> of the first one, and therefore to take just a little bit less.
>
> So the questions are:
> 1. why does it take so long to find 91 nodes in the first query?

this is caused by:
- MultiTermDocs is expensive on large value ranges (-> fixed in JCR-804)
- @random>10000 (probably) selects a great number of nodes, which are
  later excluded again because of the path constraint

> 2. why do the second and third queries take as much time as the first
> times the number of expressions?

each of the expressions is evaluated independently and in a second step
'and'ed together. therefore the predominant cost in your query seems to
be the individual expressions. because each of the range expressions
selects a lot of nodes, lucene cannot optimize the execution well. see
above for a workaround.

> 3. is there a workaround to do range queries?

partitioning the random property into multiple properties may help. the
basic idea is that you split the random number into a sum of multiple
values.

@random = 34045 would become:

@random1 = 5
@random10 = 4
@random100 = 0
@random1000 = 4
@random10000 = 3

later if you search for all random properties with a value larger than
12000 you would have a query:

//*[(@random10000 = 1 and @random1000 >= 2) or (@random10000 >= 2)]

because the distinct values of the split-up properties are small, lucene
can optimize the query execution much better.

regards
marcel
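To make Marcel's partitioning workaround concrete, here is a minimal
Java sketch against the standard JCR 1.0 API. The class name, the
setRandom/greaterThan helpers, and the property naming scheme
(random1 ... random10000, i.e. five decimal digits, values up to 99999)
are illustrative assumptions, not Jackrabbit API. The builder generates
the full strict "greater than" expansion, one clause per digit position,
so it is slightly stricter than the simplified two-clause example above
(which also matches 12000 itself).

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryResult;

public class PartitionedRangeQuery {

    // decimal positions, most significant first; extend for larger values
    private static final long[] SCALES = {10000, 1000, 100, 10, 1};

    // store 'random' as-is plus one single-digit property per position,
    // e.g. 34045 -> random10000=3, random1000=4, random100=0, random10=4, random1=5
    public static void setRandom(Node node, long value)
            throws RepositoryException {
        node.setProperty("random", value);
        for (long scale : SCALES) {
            node.setProperty("random" + scale, (value / scale) % 10);
        }
    }

    // build an XPath predicate matching nodes whose value is strictly
    // greater than 'threshold': for each position, all higher digits
    // equal and this digit greater
    public static String greaterThan(long threshold) {
        StringBuilder predicate = new StringBuilder();
        StringBuilder equalPrefix = new StringBuilder();
        for (long scale : SCALES) {
            long digit = (threshold / scale) % 10;
            if (digit < 9) { // a digit > 9 is impossible, skip that clause
                if (predicate.length() > 0) {
                    predicate.append(" or ");
                }
                predicate.append('(').append(equalPrefix)
                         .append("@random").append(scale)
                         .append(" >= ").append(digit + 1).append(')');
            }
            equalPrefix.append("@random").append(scale)
                       .append(" = ").append(digit).append(" and ");
        }
        return predicate.toString();
    }

    // run the partitioned query; the path is the one from the tests above
    public static QueryResult query(Session session, long threshold)
            throws RepositoryException {
        String stmt = "/jcr:root/load/n10/n33/*["
                + greaterThan(threshold) + "]";
        return session.getWorkspace().getQueryManager()
                .createQuery(stmt, Query.XPATH).execute();
    }
}

For threshold 12000 the generated predicate starts with
(@random10000 >= 2) or (@random10000 = 1 and @random1000 >= 3) or ...,
and since each @randomN property holds only the digits 0-9, every term
touches at most a handful of distinct index values, which is exactly
what lets Lucene evaluate it cheaply.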
