Hi,
I have a question about how the query optimizer decides which triple pattern
to evaluate first. My basic query is:
SELECT *
WHERE {
?pat a ec:Patient .
?pat ec:Has_Disease ?Disease .
?Disease a ?DiseaseType .
?DiseaseType ec:descendantOf nci:Diseases_and_Disorders .
}
My stats.opt file shows:
(<http://www.eurocat.info/ontology/eurocat.owl#Has_Disease> 755)
(<http://www.eurocat.info/ontology/eurocat.owl#descendantOf> 917730)
The execution plan for this query is:
(?DiseaseType <http://www.eurocat.info/ontology/eurocat.owl#descendantOf>
<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Diseases_and_Disorders>)
(?Patient <http://www.eurocat.info/ontology/eurocat.owl#Has_Disease> ?Disease)
(?Patient rdf:type <http://www.eurocat.info/ontology/eurocat.owl#Patient>)
(?Disease rdf:type ?DiseaseType)
With my current test data, this query takes 43 seconds and returns 777 rows.
That struck me as very slow, and the execution plan looked inefficient, so I
removed the stats.opt file to have the query evaluated as written. The
execution time dropped to 1 second!
descendantOf has a much bigger count in stats.opt than Has_Disease. Is that
why the optimizer chose to evaluate it first, even though Has_Disease is the
better choice for narrowing down the result set quickly?
The single pattern "?DiseaseType ec:descendantOf nci:Diseases_and_Disorders ."
run as a query by itself takes 20 seconds, so this is clearly the expensive
operation.
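To illustrate why I think the pattern order matters so much, here is a toy sketch in Python. Everything in it is an assumption for illustration: the triples are made up (5 patients, 200 classes), the evaluation is a naive nested loop, and none of it is Jena/TDB code or my real data. It only counts how many intermediate bindings each ordering produces.

```python
# Toy model with made-up triples; NOT Jena/TDB code and not my real data.
ROOT = "nci:Diseases_and_Disorders"


def build_triples(n_patients=5, n_classes=200):
    """Hypothetical data: each patient has one disease, each disease one type."""
    triples = []
    for i in range(n_patients):
        triples.append((f"p{i}", "rdf:type", "ec:Patient"))
        triples.append((f"p{i}", "ec:Has_Disease", f"d{i}"))
        triples.append((f"d{i}", "rdf:type", f"T{i}"))
    for j in range(n_classes):
        triples.append((f"T{j}", "ec:descendantOf", ROOT))
    return triples


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Variable: bind it, or check it against the existing binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def evaluate(patterns, triples):
    """Evaluate patterns left to right.

    Returns (solutions, work), where work is the sum of the intermediate
    result sizes after each pattern.
    """
    bindings = [{}]
    work = 0
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for t in triples:
                m = match(pattern, t, b)
                if m is not None:
                    next_bindings.append(m)
        bindings = next_bindings
        work += len(bindings)
    return bindings, work


def canonical(solutions):
    """Order-independent form of a solution list, for comparison."""
    return sorted(tuple(sorted(b.items())) for b in solutions)


# Pattern order as written in the query.
WRITTEN = [
    ("?pat", "rdf:type", "ec:Patient"),
    ("?pat", "ec:Has_Disease", "?Disease"),
    ("?Disease", "rdf:type", "?DiseaseType"),
    ("?DiseaseType", "ec:descendantOf", ROOT),
]
# Pattern order as in the execution plan (same patterns, reordered).
PLAN = [WRITTEN[3], WRITTEN[1], WRITTEN[0], WRITTEN[2]]

triples = build_triples()
sol_written, work_written = evaluate(WRITTEN, triples)
sol_plan, work_plan = evaluate(PLAN, triples)

print("written order:", work_written)  # written order: 20
print("plan order:", work_plan)        # plan order: 2205
```

Both orderings return the same 5 solutions, but in this toy model the plan order starts with all 200 descendantOf matches and then joins them with Has_Disease, which shares no variable with the first pattern, so the intermediate result is a cross product.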
-Wolfgang
P.S.: descendantOf is just rdfs:subClassOf* asserted directly.
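For reference, a sketch of what the last pattern corresponds to as a property path, assuming the asserted descendantOf triples match the subclass closure exactly. The ec: and nci: prefixes are taken from the URIs in the plan above; rdfs: is the standard RDFS namespace. Note that rdfs:subClassOf* also matches the zero-length path, so it would include nci:Diseases_and_Disorders itself.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ec:   <http://www.eurocat.info/ontology/eurocat.owl#>
PREFIX nci:  <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>

SELECT *
WHERE {
  ?pat a ec:Patient .
  ?pat ec:Has_Disease ?Disease .
  ?Disease a ?DiseaseType .
  # Property-path form of the materialized ec:descendantOf triples
  ?DiseaseType rdfs:subClassOf* nci:Diseases_and_Disorders .
}
```

This is shown only to make the P.S. concrete; no timing is claimed for it.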