Hi,
I am running into some performance issues and was wondering if I was
approaching the problem from the correct angle or if there is something more
efficient.
I have a property "Has_Finding" which is used to assert things about patients.
There are different kinds of findings, e.g. Dyspnea, Dysphagia, Death,
Hypertension ... So my data looks this:
patient1 ec:Has_Finding Dyspnea1 . Dyspnea1 a nci:Dyspnea_Score_2 .
patient1 ec:Has_Finding Dysphagia1 . Dysphagia1 a nci:Dysphagia_Score_1.
patient1 ec:Has_Finding Dyspnea2 . Dyspnea2 a nci:Dyspnea_Score_3.
patient2 ec:Has_Finding Dyspnea3. Dyspnea3 a nci:Dyspnea_Score_2.
patient2 ec:Has_Finding Hypertension1 . Hypertension1 a nci:Hypertension .
etc.
My users want to know about findings. I am offering a GUI-based query tool and
am generating the Sparql queries automatically. The users are unaware of Sparql
or anything like that.
So I can easily get all the findings with the simple sparql query:
SELECT ?pat ?findingType
WHERE { ?pat ec:Has_Finding ?finding . ?finding a ?findingType. }
The problem is that the user has to figure out what finding a particular result
row is talking about by looking at the value of ?findingType and by string
comparison (or something unreliable like that) find out that this is actually
representing a Dyspnea Score. Or a Dysphagia Score etc. before he/she can work
with the value itself. But the next row could already be representing something
else. This way of analyzing the result requires that the user has some semantic
knowledge about the data.
So I would like to make the query return the different types of findings as
"columns" instead of "rows".
SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType
WHERE {
?pat a ec:Patient .
OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType .
?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . }
OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a ?dysphagiaType .
?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . }
OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType . ?hypertType
rdfs:subClassOf* nci:Hypertension . }
}
Of course there are more than just 3 different finding types. When executing
these queries, I noticed that the more of these rows I am adding, the longer
the execution time gets. Which is expected, but from a certain point it seems
to increase by a factor of 2 or more.
While asking for 1 or 2 types of findings takes 5 seconds. Adding a 3rd takes
10 seconds. After 5 we are up to 40 seconds. At 10 we are at 2-3 minutes and
around 15 it takes so long that it is not feasible anymore. I am currently only
testing on a fraction of my data (about 1000 patients). This query is a little
simplified since there are other properties that I use, e.g. Has_Id,
Has_Date_Observed. But these are all datatype properties pointing to literals
so I left them out for sake of simplicity.
My questions are:
1. Is that a common/expected query?
2. Is there a different way of achieving this without making the query analyze
these rdfs:subClassOf* triples.
3. Should I just use the first, general query and then do some post-processing
in my code to split it up into columns before passing it to the user?
Thanks for any help!
Wolfgang