Re: Stats.opt and query optimization

Andy Seaborne Mon, 12 Aug 2013 04:16:25 -0700

On 12/08/13 10:26, [email protected] wrote:

Hi,

(sorry - I didn't reply to the original; the lists are approaching the levels of July and it's only the 12th)


Which version are you using?

The last release now calculates stats for rdf:type - if you don't see

((VAR <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ...) ...)

in the stats file, then it was presumably built before that change.

I have a question about how the query optimizer decides which triple pattern to 
evaluate first. My basic query is:


PREFIX ec: <http://www.eurocat.info/ontology/eurocat.owl#>
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>

SELECT *
WHERE {

?pat a ec:Patient .
?pat ec:Has_Disease ?Disease .
?Disease a ?DiseaseType .
?DiseaseType ec:descendantOf nci:Diseases_and_Disorders .

}



  My stats.opt file shows:


(where do the ";" come from?)

   (<http://www.eurocat.info/ontology/eurocat.owl#Has_Disease>; 755)
   (<http://www.eurocat.info/ontology/eurocat.owl#descendantOf>; 917730)

The execution plan for this query is:

(?DiseaseType <http://www.eurocat.info/ontology/eurocat.owl#descendantOf>; 
<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Diseases_and_Disorders>;)
(?Patient <http://www.eurocat.info/ontology/eurocat.owl#Has_Disease>; ?Disease)
(?Patient rdf:type <http://www.eurocat.info/ontology/eurocat.owl#Patient>;)
(?Disease rdf:type ?DiseaseType)


You can add to the stats file to help it.

( (VAR rdf:type <http://www.eurocat.info/ontology/eurocat.owl#Patient>)
 COUNT1)

from

SELECT (count(*) AS ?COUNT1)
{ ?v a <http://www.eurocat.info/ontology/eurocat.owl#Patient> }
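
Once the count query has been run, the entry can be written out mechanically. A minimal sketch in Python, assuming the entry shape shown above; the function name and the count value 1234 are hypothetical and are not taken from the actual data:

```python
def type_stats_entry(class_iri: str, count: int) -> str:
    """Format one rdf:type entry in the shape shown above.

    class_iri -- the class IRI, without angle brackets
    count     -- the result of the SELECT (count(*) ...) query for that class
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # The entry is an s-expression: ((VAR rdf:type <class>) count)
    return f"((VAR rdf:type <{class_iri}>) {count})"


# Hypothetical count, for illustration only.
entry = type_stats_entry(
    "http://www.eurocat.info/ontology/eurocat.owl#Patient", 1234
)
print(entry)
```

The printed line would then be pasted into the stats file alongside the existing entries; whether the file accepts the rdf:type shorthand or needs the full IRI should be checked against the entries already in it.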

The optimizer had a mild aversion to using rdf:type prior to 2.10.1.

With my current test data, this query takes 43 seconds and returns
777 rows. Since this struck me as very long and the execution plan seems
inefficient, I removed the stats.opt file to make it use the query path
"as is" and the execution time was reduced to 1 second!

descendantOf has a much bigger count in stats.opt than Has_Disease.
Is that the reason why it chose to evaluate it first even though
Has_Disease is the better choice for narrowing down the result set quickly?

The query "?DiseaseType ec:descendantOf nci:Diseases_and_Disorders ."
by itself takes 20 seconds. So this is obviously the expensive operation.


-Wolfgang

P.S.: descendantOf is just rdfs:subClassOf* asserted directly.


Could you post the complete stats file please?

        Andy
