Wolfgang,

I think the problem is that you are creating a partial cross product. This leads to two problems : a lot of redundancy in the answer

If you ask and get 2 items and

>> SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType
>> WHERE {
>>    ?pat a ec:Patient .
>>   OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType .
>> ?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . }
>> OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a ?dysphagiaType
>> . ?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . }
>>    OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType .
>> ?hypertType rdfs:subClassOf* nci:Hypertension . }
>> }

(aside - please provide complete queries, inc prefixes. I often have to reformat them to make them readable after email has been involved and I use either sparql.org or qparse to do that. Both read in a query and print it out again but need a complete query to parse)

I'll try to simnpify:

SELECT *
{
   ?pat a ec:Patient
   OPTIONAL { ?pat ec:Has_Finding ?X1 .
              ?X1 a ?T1 .
              ?T1 rdfs:subClassOf* :Z1}
   OPTIONAL { ?pat ec:Has_Finding ?X2 .
              ?X2 a ?T2 .
              ?T2 rdfs:subClassOf* :Z2}

...
}

For every X1/T1/Z1, ARQ will try X2/T2/Z2 so the numbers grow worse than linearly.

if number of(X1/T1/Z1) is 2 and number of(X2/T2/Z2) is 3, there are 6 rows and 3 rows have one of the two set of (X1/T1/Z1) values and the other 3 have the other set.

If it were 5 and 10, there are 50 rows.

What you can do is use UNION.

SELECT *
{
   ?pat a ec:Patient
   {
   { ?pat ec:Has_Finding ?X1 .
              ?X1 a ?T1 .
              ?T1 rdfs:subClassOf* :Z1}
   UNION { ?pat ec:Has_Finding ?X2 .
              ?X2 a ?T2 .
              ?T2 rdfs:subClassOf* :Z2}
   UNION { ... }

...
}

This will put 2 rows for (X1/T1/Z1), with cols for (X2/T2/Z2) being unset, and 3 rows for (X2/T2/Z2) with (X1/T1/Z1) being unset. 5 rows.

At small numbers, not so much difference but if it were 5 and 10, there are 15 rows as opposed to 50.

        Andy

On 25/07/13 14:14, Olivier Rossel wrote:
(My answer does not help very much, but anyway :)

I am not sure the spreadsheet view is ok for very nested data structures
(that makes too many columns in the end).

I would use a CONSTRCUT to retrieve a graph of
patient-findings-findingType, then do some post-processing to build a
facetable data structure.
cf this example:
http://people.csail.mit.edu/dfhuynh/projects/hierarchical-facets/test.html
coming from that JSON:
http://people.csail.mit.edu/dfhuynh/projects/hierarchical-facets/data.json#



On Thu, Jul 25, 2013 at 11:07 AM, <[email protected]> wrote:

Hi,

I am running into some performance issues and was wondering if I was
approaching the problem from the correct angle or if there is something
more efficient.

I have a property "Has_Finding" which is used to assert things about
patients. There are different kinds of findings, e.g. Dyspnea, Dysphagia,
Death, Hypertension ... So my data looks this:

patient1 ec:Has_Finding Dyspnea1 . Dyspnea1 a nci:Dyspnea_Score_2 .
patient1 ec:Has_Finding Dysphagia1 . Dysphagia1 a nci:Dysphagia_Score_1.
patient1 ec:Has_Finding Dyspnea2 . Dyspnea2 a nci:Dyspnea_Score_3.
patient2 ec:Has_Finding Dyspnea3. Dyspnea3 a nci:Dyspnea_Score_2.
patient2 ec:Has_Finding Hypertension1 . Hypertension1 a nci:Hypertension .
etc.

My users want to know about findings. I am offering a GUI-based query tool
and am generating the Sparql queries automatically. The users are unaware
of Sparql or anything like that.

So I can easily get all the findings with the simple sparql query:

SELECT ?pat ?findingType
WHERE { ?pat ec:Has_Finding ?finding . ?finding a ?findingType. }

The problem is that the user has to figure out what finding a particular
result row is talking about by looking at the value of ?findingType and by
string comparison (or something unreliable like that) find out that this is
actually representing a Dyspnea Score. Or a Dysphagia Score etc. before
he/she can work with the value itself. But the next row could already be
representing something else. This way of analyzing the result requires that
the user has some semantic knowledge about the data.

So I would like to make the query return the different types of findings
as "columns" instead of "rows".

SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType
WHERE {
   ?pat a ec:Patient .
   OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType .
?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . }
   OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a ?dysphagiaType
. ?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . }
   OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType .
?hypertType rdfs:subClassOf* nci:Hypertension . }
}

Of course there are more than just 3 different finding types. When
executing these queries, I noticed that the more of these rows I am adding,
the longer the execution time gets. Which is expected, but from a certain
point it seems to increase by a factor of 2 or more.

While asking for 1 or 2 types of findings takes 5 seconds. Adding a 3rd
takes 10 seconds. After 5 we are up to 40 seconds. At 10 we are at 2-3
minutes and around 15 it takes so long that it is not feasible anymore. I
am currently only testing on a fraction of my data (about 1000 patients).
This query is a little simplified since there are other properties that I
use, e.g. Has_Id, Has_Date_Observed. But these are all datatype properties
pointing to literals so I left them out for sake of simplicity.

My questions are:
1. Is that a common/expected query?
2. Is there a different way of achieving this without making the query
analyze these rdfs:subClassOf* triples.
3. Should I just use the first, general query and then do some
post-processing in my code to split it up into columns before passing it to
the user?

Thanks for any help!
Wolfgang




Reply via email to