Stats.opt and query optimization

hueyl16 Tue, 06 Aug 2013 03:41:50 -0700

Hi,

I have a question about how the query optimizer decides which triple pattern to 
evaluate first. My basic query is:


SELECT *
WHERE {

?pat a ec:Patient .
?pat ec:Has_Disease ?Disease . 
?Disease a ?DiseaseType . 
?DiseaseType ec:descendantOf nci:Diseases_and_Disorders . 

}

 

My stats.opt file shows:

  (<http://www.eurocat.info/ontology/eurocat.owl#Has_Disease> 755)
  (<http://www.eurocat.info/ontology/eurocat.owl#descendantOf> 917730)

The execution plan for this query is:

(?DiseaseType <http://www.eurocat.info/ontology/eurocat.owl#descendantOf> 
<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Diseases_and_Disorders>) 
(?Patient <http://www.eurocat.info/ontology/eurocat.owl#Has_Disease> ?Disease) 
(?Patient rdf:type <http://www.eurocat.info/ontology/eurocat.owl#Patient>) 
(?Disease rdf:type ?DiseaseType) 

With my current test data, this query takes 43 seconds and returns 777 rows. 
Since that seemed very long and the execution plan looked inefficient, I 
removed the stats.opt file so that the query path is used as is, and the 
execution time dropped to 1 second!

descendantOf has a much bigger count in stats.opt than Has_Disease. Is that 
why the optimizer chose to evaluate it first, even though Has_Disease is the 
better choice for narrowing down the result set quickly?

The pattern "?DiseaseType ec:descendantOf nci:Diseases_and_Disorders ." run as 
a query by itself takes 20 seconds, so this is clearly the expensive operation.

-Wolfgang

P.S.: descendantOf is just rdfs:subClassOf* asserted directly.
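
To illustrate the P.S., here is a rough sketch of the property-path query that 
the descendantOf pattern stands in for. The ec: and nci: prefix URIs are taken 
from the execution plan above. Whether the asserted descendantOf triples also 
cover the zero-length case (the class itself), as rdfs:subClassOf* does, is an 
assumption.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ec:   <http://www.eurocat.info/ontology/eurocat.owl#>
PREFIX nci:  <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>

SELECT ?DiseaseType
WHERE {
  # Materialized form used in the query above:
  #   ?DiseaseType ec:descendantOf nci:Diseases_and_Disorders .
  # Property-path form it was derived from:
  ?DiseaseType rdfs:subClassOf* nci:Diseases_and_Disorders .
}
```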
