Stats.opt and query optimization

hueyl16 Mon, 12 Aug 2013 02:27:56 -0700

Hi,


I have a question about how the query optimizer decides which triple pattern to 
evaluate first. My basic query is:

SELECT *
WHERE {

?pat a ec:Patient .
?pat ec:Has_Disease ?Disease . 
?Disease a ?DiseaseType . 
?DiseaseType ec:descendantOf nci:Diseases_and_Disorders . 

}
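
For reference, here is the same query with the prefixes written out so it can be run on its own. This is only a sketch: the two namespace IRIs are copied from the expanded terms in the execution plan further down, and it assumes ec: and nci: map to exactly those namespaces.

```sparql
# Assumed prefix mappings, taken from the full IRIs shown in the execution plan.
PREFIX ec:  <http://www.eurocat.info/ontology/eurocat.owl#>
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>

SELECT *
WHERE {
  ?pat a ec:Patient .                       # every patient
  ?pat ec:Has_Disease ?Disease .            # the patient's disease instance
  ?Disease a ?DiseaseType .                 # class of that disease instance
  # keep only classes below the NCI "Diseases and Disorders" root
  ?DiseaseType ec:descendantOf nci:Diseases_and_Disorders .
}
```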

 

My stats.opt file shows:

  (<http://www.eurocat.info/ontology/eurocat.owl#Has_Disease> 755)
  (<http://www.eurocat.info/ontology/eurocat.owl#descendantOf> 917730)

The execution plan for this query is:

(?DiseaseType <http://www.eurocat.info/ontology/eurocat.owl#descendantOf> <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Diseases_and_Disorders>)
(?Patient <http://www.eurocat.info/ontology/eurocat.owl#Has_Disease> ?Disease)
(?Patient rdf:type <http://www.eurocat.info/ontology/eurocat.owl#Patient>)
(?Disease rdf:type ?DiseaseType)

With my current test data, this query takes 43 seconds and returns 777 rows. 
Since this struck me as very long and the execution plan seems inefficient, I 
removed the stats.opt file to make it use the query path "as is" and the 
execution time was reduced to 1 second!

descendantOf has a much bigger count in stats.opt than Has_Disease. Is that the 
reason why it chose to evaluate it first even though Has_Disease is the better 
choice for narrowing down the result set quickly?

The pattern "?DiseaseType ec:descendantOf nci:Diseases_and_Disorders ." run as 
a query by itself takes 20 seconds, so it is clearly the expensive operation.

-Wolfgang

P.S.: descendantOf is just rdfs:subClassOf* asserted directly.
