Hi,

I am running into some performance issues and was wondering if I was 
approaching the problem from the correct angle or if there is something more 
efficient.

I have a property "Has_Finding" which is used to assert things about patients. 
There are different kinds of findings, e.g. Dyspnea, Dysphagia, Death, 
Hypertension ... So my data looks this:

patient1 ec:Has_Finding Dyspnea1 . Dyspnea1 a nci:Dyspnea_Score_2 .
patient1 ec:Has_Finding Dysphagia1 . Dysphagia1 a nci:Dysphagia_Score_1.
patient1 ec:Has_Finding Dyspnea2 . Dyspnea2 a nci:Dyspnea_Score_3.
patient2 ec:Has_Finding Dyspnea3. Dyspnea3 a nci:Dyspnea_Score_2.
patient2 ec:Has_Finding Hypertension1 . Hypertension1 a nci:Hypertension . 
etc.

My users want to know about findings. I am offering a GUI-based query tool and 
am generating the Sparql queries automatically. The users are unaware of Sparql 
or anything like that. 

So I can easily get all the findings with the simple sparql query:

SELECT ?pat ?findingType
WHERE { ?pat ec:Has_Finding ?finding . ?finding a ?findingType. } 

The problem is that the user has to figure out what finding a particular result 
row is talking about by looking at the value of ?findingType and by string 
comparison (or something unreliable like that) find out that this is actually 
representing a Dyspnea Score. Or a Dysphagia Score etc. before he/she can work 
with the value itself. But the next row could already be representing something 
else. This way of analyzing the result requires that the user has some semantic 
knowledge about the data. 

So I would like to make the query return the different types of findings as 
"columns" instead of "rows".

SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType
WHERE {
  ?pat a ec:Patient .
  OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType . 
?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . }
  OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a ?dysphagiaType . 
?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . } 
  OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType . ?hypertType 
rdfs:subClassOf* nci:Hypertension . } 
}

Of course there are more than just 3 different finding types. When executing 
these queries, I noticed that the more of these rows I am adding, the longer 
the execution time gets. Which is expected, but from a certain point it seems 
to increase by a factor of 2 or more.

While asking for 1 or 2 types of findings takes 5 seconds. Adding a 3rd takes 
10 seconds. After 5 we are up to 40 seconds. At 10 we are at 2-3 minutes and 
around 15 it takes so long that it is not feasible anymore. I am currently only 
testing on a fraction of my data (about 1000 patients). This query is a little 
simplified since there are other properties that I use, e.g. Has_Id, 
Has_Date_Observed. But these are all datatype properties pointing to literals 
so I left them out for sake of simplicity.

My questions are:
1. Is that a common/expected query?
2. Is there a different way of achieving this without making the query analyze 
these rdfs:subClassOf* triples.
3. Should I just use the first, general query and then do some post-processing 
in my code to split it up into columns before passing it to the user?

Thanks for any help!
Wolfgang

Reply via email to