Hi Andy,

thanks for the UNION suggestion. I am going to give this a shot. Ultimately I 
am looking at 40-50 of these optional triple patterns in one query for about 
50.000 patients. I am a little worried about how the query performance is going 
to be.

I noticed the increased redundancy. It seemed a little unexpected to me coming 
from the relational DB world where a query like that would be quite fast and 
create a fairly efficient result set. But I guess the DB has more hierarchical 
knowledge and would be designed for exactly this purpose whereas a triple store 
has to accommodate a lot more angles.

Would more specific sub-properties be of any help? Instead of Has_Finding use 
Has_Dyspnea, Has_Dysphagia etc. Something like that would only affect the query 
itself (easier to find the triples and the rdfs:subClassOf* triple pattern is 
not required anymore), not the number of rows in the result set, correct? Since 
it still has to do the cross-product? 

Would it be better to just get all Has_Finding triples and format the result in 
custom post-processing?

Or should I split it into multiple queries and combine the results in 
post-processing?

I was playing around with sub-queries a little bit, but I noticed that a 
variable bound in the main query is not passed to the sub-query. So I did not 
make much progress there.

 


 

-Wolfgang


-----Original Message-----
From: Andy Seaborne <[email protected]>
To: users <[email protected]>
Sent: Thu, Jul 25, 2013 10:54 pm
Subject: Re: Converting result rows to columns in Sparql


Wolfgang,

I think the problem is that you are creating a partial cross product. 
This leads to two problems : a lot of redundancy in the answer

If you ask and get 2 items and

 >> SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType
 >> WHERE {
 >>    ?pat a ec:Patient .
 >>   OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType .
 >> ?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . }
 >>    OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a 
?dysphagiaType
 >> . ?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . }
 >>    OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType .
 >> ?hypertType rdfs:subClassOf* nci:Hypertension . }
 >> }

(aside - please provide complete queries, inc prefixes.  I often have to 
reformat them to make them readable after email has been involved and I 
use either sparql.org or qparse to do that.  Both read in a query and 
print it out again but need a complete query to parse)

I'll try to simnpify:

SELECT *
{
    ?pat a ec:Patient
    OPTIONAL { ?pat ec:Has_Finding ?X1 .
               ?X1 a ?T1 .
               ?T1 rdfs:subClassOf* :Z1}
    OPTIONAL { ?pat ec:Has_Finding ?X2 .
               ?X2 a ?T2 .
               ?T2 rdfs:subClassOf* :Z2}

...
}

For every X1/T1/Z1, ARQ will try X2/T2/Z2 so the numbers grow worse than 
linearly.

if number of(X1/T1/Z1) is 2 and number of(X2/T2/Z2) is 3, there are 6 
rows and 3 rows have one of the two set of (X1/T1/Z1) values and the 
other 3 have the other set.

If it were 5 and 10, there are 50 rows.

What you can do is use UNION.

SELECT *
{
    ?pat a ec:Patient
    {
    { ?pat ec:Has_Finding ?X1 .
               ?X1 a ?T1 .
               ?T1 rdfs:subClassOf* :Z1}
    UNION { ?pat ec:Has_Finding ?X2 .
               ?X2 a ?T2 .
               ?T2 rdfs:subClassOf* :Z2}
    UNION { ... }

...
}

This will put 2 rows for (X1/T1/Z1), with cols for (X2/T2/Z2) being 
unset, and 3 rows for (X2/T2/Z2) with (X1/T1/Z1) being unset. 5 rows.

At small numbers, not so much difference but if it were 5 and 10, there 
are 15 rows as opposed to 50.

        Andy

On 25/07/13 14:14, Olivier Rossel wrote:
> (My answer does not help very much, but anyway :)
>
> I am not sure the spreadsheet view is ok for very nested data structures
> (that makes too many columns in the end).
>
> I would use a CONSTRCUT to retrieve a graph of
> patient-findings-findingType, then do some post-processing to build a
> facetable data structure.
> cf this example:
> http://people.csail.mit.edu/dfhuynh/projects/hierarchical-facets/test.html
> coming from that JSON:
> http://people.csail.mit.edu/dfhuynh/projects/hierarchical-facets/data.json#
>
>
>
> On Thu, Jul 25, 2013 at 11:07 AM, <[email protected]> wrote:
>
>> Hi,
>>
>> I am running into some performance issues and was wondering if I was
>> approaching the problem from the correct angle or if there is something
>> more efficient.
>>
>> I have a property "Has_Finding" which is used to assert things about
>> patients. There are different kinds of findings, e.g. Dyspnea, Dysphagia,
>> Death, Hypertension ... So my data looks this:
>>
>> patient1 ec:Has_Finding Dyspnea1 . Dyspnea1 a nci:Dyspnea_Score_2 .
>> patient1 ec:Has_Finding Dysphagia1 . Dysphagia1 a nci:Dysphagia_Score_1.
>> patient1 ec:Has_Finding Dyspnea2 . Dyspnea2 a nci:Dyspnea_Score_3.
>> patient2 ec:Has_Finding Dyspnea3. Dyspnea3 a nci:Dyspnea_Score_2.
>> patient2 ec:Has_Finding Hypertension1 . Hypertension1 a nci:Hypertension .
>> etc.
>>
>> My users want to know about findings. I am offering a GUI-based query tool
>> and am generating the Sparql queries automatically. The users are unaware
>> of Sparql or anything like that.
>>
>> So I can easily get all the findings with the simple sparql query:
>>
>> SELECT ?pat ?findingType
>> WHERE { ?pat ec:Has_Finding ?finding . ?finding a ?findingType. }
>>
>> The problem is that the user has to figure out what finding a particular
>> result row is talking about by looking at the value of ?findingType and by
>> string comparison (or something unreliable like that) find out that this is
>> actually representing a Dyspnea Score. Or a Dysphagia Score etc. before
>> he/she can work with the value itself. But the next row could already be
>> representing something else. This way of analyzing the result requires that
>> the user has some semantic knowledge about the data.
>>
>> So I would like to make the query return the different types of findings
>> as "columns" instead of "rows".
>>
>> SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType
>> WHERE {
>>    ?pat a ec:Patient .
>>    OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType .
>> ?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . }
>>    OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a ?dysphagiaType
>> . ?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . }
>>    OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType .
>> ?hypertType rdfs:subClassOf* nci:Hypertension . }
>> }
>>
>> Of course there are more than just 3 different finding types. When
>> executing these queries, I noticed that the more of these rows I am adding,
>> the longer the execution time gets. Which is expected, but from a certain
>> point it seems to increase by a factor of 2 or more.
>>
>> While asking for 1 or 2 types of findings takes 5 seconds. Adding a 3rd
>> takes 10 seconds. After 5 we are up to 40 seconds. At 10 we are at 2-3
>> minutes and around 15 it takes so long that it is not feasible anymore. I
>> am currently only testing on a fraction of my data (about 1000 patients).
>> This query is a little simplified since there are other properties that I
>> use, e.g. Has_Id, Has_Date_Observed. But these are all datatype properties
>> pointing to literals so I left them out for sake of simplicity.
>>
>> My questions are:
>> 1. Is that a common/expected query?
>> 2. Is there a different way of achieving this without making the query
>> analyze these rdfs:subClassOf* triples.
>> 3. Should I just use the first, general query and then do some
>> post-processing in my code to split it up into columns before passing it to
>> the user?
>>
>> Thanks for any help!
>> Wolfgang
>>
>>
>


 

Reply via email to