On 04/06/2014 04:53, "Mark Feblowitz" <[email protected]> wrote:

>This is a fragment of an even larger query that might never complete in
>our lifetimes:
>
>SELECT DISTINCT ?O ?T ?E
>WHERE {
>  ?E a x:E.
>  { SELECT ?O ?T
>    WHERE {
>      ?O :oE ?E ;
>         :oT ?T .
>    } ORDER BY DESC(?T)
>    LIMIT 3
>  }
>}
>
>Here’s what I want:
>
>For every E, the top 3 O’s as sorted by T.

The above query does not do this; achieving what you want in a single SPARQL query is actually quite tricky and often very expensive on resources. See http://answers.semanticweb.com/questions/9842/how-to-limit-sparql-solution-group-size for some ways of achieving this, none of which will perform particularly well.

As for why your query does not do what you expect: SPARQL evaluation works on the algebra translation of the query and is defined as bottom-up evaluation, not top-down. Your query's algebra is as follows:

(distinct
  (project (?O ?T ?E)
    (sequence
      (bgp (triple ?E rdf:type x:E))
      (project (?O ?T)
        (top (3 (desc ?T))
          (bgp
            (triple ?O :oE ?/E)
            (triple ?O :oT ?T)))))))

So in principle what happens is that your initial BGP is evaluated and then the inner sub-query gets evaluated. The results of the two are then joined together, and as there are no common variables between the outer BGP and the inner sub-query you get a cross product. So you get 3 results multiplied by the 7273 results, giving you ~22k results.

In principle this query shouldn't run that slowly, though from the algebra I think I see what is happening. The sequence operator you see means ARQ has recognised that the join part of the query can be done in a top-down fashion; in most cases this is an optimisation and makes the query run faster. It evaluates the LHS of the join (your ?E a x:E triple pattern) in an iterative fashion and substitutes each solution into the RHS. Here the substitution has no effect because, as you can see in the algebra, the two ?E's are not the same variable (the inner one appears as ?/E) due to SPARQL's scoping rules.
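Coming back to what you actually want (the top 3 O's per E): one of the workaround styles discussed at the answers.semanticweb.com link above can be sketched as a self-join plus aggregation — for each O, count how many O's on the same E sort ahead of it, and keep only rows where fewer than 3 do. This is just a sketch against your simplified schema; the ?O2 variable and the STR(?O) tie-break are my additions to make the cut deterministic when T values tie, and as noted it will not perform well on large data since it is effectively a self-join per E:

```sparql
SELECT ?E ?O ?T
WHERE {
  ?E a x:E .
  ?O :oE ?E ;
     :oT ?T .
  # For each O, match every other O2 on the same E that sorts ahead of it
  # (larger T, with STR() order as an arbitrary tie-break - my assumption)
  OPTIONAL {
    ?O2 :oE ?E ;
        :oT ?T2 .
    FILTER (?T2 > ?T || (?T2 = ?T && STR(?O2) < STR(?O)))
  }
}
GROUP BY ?E ?O ?T
HAVING (COUNT(?O2) < 3)
ORDER BY ?E DESC(?T)
```

The top-ranked O in each E group matches no ?O2 at all, so its count is 0; anything with 3 or more O's ahead of it is dropped by the HAVING clause.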
This means that your inner sub-query is getting executed 7273 times, which is likely why the query doesn't complete in reasonable time.

You can try disabling that specific optimisation and see if that resolves the problem. Assuming this is still talking to a remote Fuseki, you can start your Fuseki instance with the --set arq:optIndexJoinStrategy=false argument added to disable the optimisation in question. Or, if you are instead testing this directly from code against an in-memory dataset, you can use ARQ.getContext().set(ARQ.optIndexJoinStrategy, false) to disable it.

It looks like there is scope for improving this optimisation to be more conservative when potentially expensive operators are involved; if you could provide a test dataset that reproduces the problem, that would help us look at improving the optimisation.

>
>My TDB store contains 779 O’s, 7273 E’s, and a T for every O.
>
>The full query is this:
>
>SELECT DISTINCT ?O ?T ?S ?E ?P
>WHERE {
>  ?E a prov:E.
>  { SELECT ?O ?T
>    WHERE {
>      ?O :otE ?E ;
>         :oT ?T .
>    } ORDER BY DESC(?T)
>    LIMIT 3
>  }
>  { SELECT ?SH
>    WHERE {
>      ?EHS :ehsO ?O.
>      ?EH :ehEHSf ?EHS.
>      ?SHS :shsEHS ?EHS ;
>           :shsSH ?SH .
>      ?SH :rank ?SHRank.
>    } ORDER BY ASC(?SHRank)
>    LIMIT 1
>  }
>  ?SH :shS ?S.
>  OPTIONAL { ?O :op ?P. }
>}
>
>A pretty complex query, and one that *certainly* won’t complete if the
>simpler one won’t.
Yes, the query suffers from the same issue as the earlier one, based on looking at the algebra:

(distinct
  (project (?O ?T ?S ?E ?P)
    (conditional
      (sequence
        (bgp (triple ?E rdf:type prov:E))
        (project (?O ?T)
          (top (3 (desc ?T))
            (bgp
              (triple ?O :otE ?/E)
              (triple ?O :oT ?T))))
        (project (?SH)
          (top (1 (asc ?/SHRank))
            (bgp
              (triple ?/EHS :ehsO ?/O)
              (triple ?/EH :ehEHSf ?/EHS)
              (triple ?/SHS :shsEHS ?/EHS)
              (triple ?/SHS :shsSH ?SH)
              (triple ?SH :rank ?/SHRank))))
        (bgp (triple ?SH :shS ?S)))
      (bgp (triple ?O :op ?P)))))

This query suffers from the above effect even worse, because the results of the ?E a prov:E triple pattern feed into the first sub-query, which in turn feeds into the second sub-query, so you get huge numbers of executions of the second sub-query. Again, disabling the culprit optimisation may significantly improve things.

Btw, the service I used to generate the algebras is the query validator service at http://sparql.org/query-validator.html - this service is also built into Fuseki, so if you haven't customised Fuseki heavily you'll have it in your local instance as well.

Rob

>
>Thanks,
>
>Mark
