Hi Andy, thanks for the UNION suggestion. I am going to give this a shot. Ultimately I am looking at 40-50 of these optional triple patterns in one query for about 50.000 patients. I am a little worried about how the query performance is going to be.
I noticed the increased redundancy. It seemed a little unexpected to me coming from the relational DB world where a query like that would be quite fast and create a fairly efficient result set. But I guess the DB has more hierarchical knowledge and would be designed for exactly this purpose whereas a triple store has to accommodate a lot more angles. Would more specific sub-properties be of any help? Instead of Has_Finding use Has_Dyspnea, Has_Dysphagia etc. Something like that would only affect the query itself (easier to find the triples and the rdfs:subClassOf* triple pattern is not required anymore), not the number of rows in the result set, correct? Since it still has to do the cross-product? Would it be better to just get all Has_Finding triples and format the result in custom post-processing? Or should I split it into multiple queries and combine the results in post-processing? I was playing around with sub-queries a little bit, but I noticed that a variable bound in the main query is not passed to the sub-query. So I did not make much progress there. -Wolfgang -----Original Message----- From: Andy Seaborne <[email protected]> To: users <[email protected]> Sent: Thu, Jul 25, 2013 10:54 pm Subject: Re: Converting result rows to columns in Sparql Wolfgang, I think the problem is that you are creating a partial cross product. This leads to two problems : a lot of redundancy in the answer If you ask and get 2 items and >> SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType >> WHERE { >> ?pat a ec:Patient . >> OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType . >> ?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . } >> OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a ?dysphagiaType >> . ?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . } >> OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType . >> ?hypertType rdfs:subClassOf* nci:Hypertension . } >> } (aside - please provide complete queries, inc prefixes. I often have to reformat them to make them readable after email has been involved and I use either sparql.org or qparse to do that. Both read in a query and print it out again but need a complete query to parse) I'll try to simnpify: SELECT * { ?pat a ec:Patient OPTIONAL { ?pat ec:Has_Finding ?X1 . ?X1 a ?T1 . ?T1 rdfs:subClassOf* :Z1} OPTIONAL { ?pat ec:Has_Finding ?X2 . ?X2 a ?T2 . ?T2 rdfs:subClassOf* :Z2} ... } For every X1/T1/Z1, ARQ will try X2/T2/Z2 so the numbers grow worse than linearly. if number of(X1/T1/Z1) is 2 and number of(X2/T2/Z2) is 3, there are 6 rows and 3 rows have one of the two set of (X1/T1/Z1) values and the other 3 have the other set. If it were 5 and 10, there are 50 rows. What you can do is use UNION. SELECT * { ?pat a ec:Patient { { ?pat ec:Has_Finding ?X1 . ?X1 a ?T1 . ?T1 rdfs:subClassOf* :Z1} UNION { ?pat ec:Has_Finding ?X2 . ?X2 a ?T2 . ?T2 rdfs:subClassOf* :Z2} UNION { ... } ... } This will put 2 rows for (X1/T1/Z1), with cols for (X2/T2/Z2) being unset, and 3 rows for (X2/T2/Z2) with (X1/T1/Z1) being unset. 5 rows. At small numbers, not so much difference but if it were 5 and 10, there are 15 rows as opposed to 50. Andy On 25/07/13 14:14, Olivier Rossel wrote: > (My answer does not help very much, but anyway :) > > I am not sure the spreadsheet view is ok for very nested data structures > (that makes too many columns in the end). > > I would use a CONSTRCUT to retrieve a graph of > patient-findings-findingType, then do some post-processing to build a > facetable data structure. > cf this example: > http://people.csail.mit.edu/dfhuynh/projects/hierarchical-facets/test.html > coming from that JSON: > http://people.csail.mit.edu/dfhuynh/projects/hierarchical-facets/data.json# > > > > On Thu, Jul 25, 2013 at 11:07 AM, <[email protected]> wrote: > >> Hi, >> >> I am running into some performance issues and was wondering if I was >> approaching the problem from the correct angle or if there is something >> more efficient. >> >> I have a property "Has_Finding" which is used to assert things about >> patients. There are different kinds of findings, e.g. Dyspnea, Dysphagia, >> Death, Hypertension ... So my data looks this: >> >> patient1 ec:Has_Finding Dyspnea1 . Dyspnea1 a nci:Dyspnea_Score_2 . >> patient1 ec:Has_Finding Dysphagia1 . Dysphagia1 a nci:Dysphagia_Score_1. >> patient1 ec:Has_Finding Dyspnea2 . Dyspnea2 a nci:Dyspnea_Score_3. >> patient2 ec:Has_Finding Dyspnea3. Dyspnea3 a nci:Dyspnea_Score_2. >> patient2 ec:Has_Finding Hypertension1 . Hypertension1 a nci:Hypertension . >> etc. >> >> My users want to know about findings. I am offering a GUI-based query tool >> and am generating the Sparql queries automatically. The users are unaware >> of Sparql or anything like that. >> >> So I can easily get all the findings with the simple sparql query: >> >> SELECT ?pat ?findingType >> WHERE { ?pat ec:Has_Finding ?finding . ?finding a ?findingType. } >> >> The problem is that the user has to figure out what finding a particular >> result row is talking about by looking at the value of ?findingType and by >> string comparison (or something unreliable like that) find out that this is >> actually representing a Dyspnea Score. Or a Dysphagia Score etc. before >> he/she can work with the value itself. But the next row could already be >> representing something else. This way of analyzing the result requires that >> the user has some semantic knowledge about the data. >> >> So I would like to make the query return the different types of findings >> as "columns" instead of "rows". >> >> SELECT ?pat ?dypsneaType ?dysphagiaType ?hypertType >> WHERE { >> ?pat a ec:Patient . >> OPTIONAL { ?pat ec:Has_Finding ?dyspnea . ?dyspnea a ?dyspneaType . >> ?dyspneaType rdfs:subClassOf* nci:Dyspnea_Score . } >> OPTIONAL { ?pat ec:Has_Finding ?dysphagia . ?dysphagia a ?dysphagiaType >> . ?dysphagiaType rdfs:subClassOf* nci:Dysphagia_Score . } >> OPTIONAL { ?pat ec:Has_Finding ?hypert . ?hypert a ?hyertType . >> ?hypertType rdfs:subClassOf* nci:Hypertension . } >> } >> >> Of course there are more than just 3 different finding types. When >> executing these queries, I noticed that the more of these rows I am adding, >> the longer the execution time gets. Which is expected, but from a certain >> point it seems to increase by a factor of 2 or more. >> >> While asking for 1 or 2 types of findings takes 5 seconds. Adding a 3rd >> takes 10 seconds. After 5 we are up to 40 seconds. At 10 we are at 2-3 >> minutes and around 15 it takes so long that it is not feasible anymore. I >> am currently only testing on a fraction of my data (about 1000 patients). >> This query is a little simplified since there are other properties that I >> use, e.g. Has_Id, Has_Date_Observed. But these are all datatype properties >> pointing to literals so I left them out for sake of simplicity. >> >> My questions are: >> 1. Is that a common/expected query? >> 2. Is there a different way of achieving this without making the query >> analyze these rdfs:subClassOf* triples. >> 3. Should I just use the first, general query and then do some >> post-processing in my code to split it up into columns before passing it to >> the user? >> >> Thanks for any help! >> Wolfgang >> >> >
