
Comments inline:

On 04/06/2014 04:53, "Mark Feblowitz" <[email protected]> wrote:

>This is a fragment of an even larger query that might never complete in
>our lifetimes:
>
>SELECT DISTINCT ?O ?T  ?E
>WHERE{  
>?E a x:E. 
>       { SELECT ?O ?T 
>                 WHERE {
>                       ?O :oE ?E;
>                               :oT ?T .
>                       } ORDER BY DESC(?T)
>                         LIMIT 3
>               }
>       }
>
>Here’s what I want:
>
>For every E, the top 3 O’s as sorted by T.

The above query does not do this.  Achieving what you want in a single
SPARQL query is actually quite tricky and often very expensive on
resources.  See
http://answers.semanticweb.com/questions/9842/how-to-limit-sparql-solution-group-size
for some ways of achieving this, none of which will perform particularly
well.
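For example, one of the workarounds from that link can be sketched roughly
as follows (untested against your data, and effectively a self-join so it
will be slow): for each O, count how many other O's of the same E have a
larger T, and keep only those with fewer than 3 ahead of them:

```sparql
SELECT ?E ?O ?T
WHERE {
  ?E a x:E .
  ?O :oE ?E ;
     :oT ?T .
  # Find the other O's for the same E that sort ahead of this one
  OPTIONAL {
    ?O2 :oE ?E ;
        :oT ?T2 .
    FILTER (?T2 > ?T)
  }
}
GROUP BY ?E ?O ?T
HAVING (COUNT(?O2) < 3)
```

Note that ties on ?T can admit more than 3 rows per E, and the OPTIONAL
self-join is where the cost comes from.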

As for why your query does not do what you expect: SPARQL evaluation
works on the algebra translation of the query and is defined as bottom-up
evaluation, not top-down.

Your query algebra is as follows:

(distinct
 (project (?O ?T ?E)
  (sequence
   (bgp (triple ?E rdf:type x:E))
   (project (?O ?T)
    (top (3 (desc ?T))
     (bgp
      (triple ?O :oE ?/E)
      (triple ?O :oT ?T)
     ))))))


So in principle what happens is that your initial BGP is evaluated and
then the inner sub-query gets evaluated.  The results of the two are then
joined together, and as there are no common variables between the outer
BGP and the inner sub-query you get a cross product.  So you get 3
results multiplied by the 7273 results, giving you ~22k results.

In principle this query shouldn't run that slowly, though from the
algebra I think I see what is happening.

The sequence operator you see means ARQ has recognised that the join part
of the query can be done in a top-down fashion, which in most cases is an
optimisation and makes the query run faster.  It evaluates the LHS of the
join (your ?E a x:E triple) in an iterative fashion and substitutes each
solution into the RHS.  Here that substitution has no effect because, as
you can see in the algebra, the two ?E's are not the same variable due to
SPARQL's scoping rules.
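To illustrate the scoping: the inner SELECT projects only ?O and ?T, so
the ?E used inside it is a fresh variable (shown as ?/E in the algebra).
A sketch of how the query would look if the sub-query projected ?E as
well, making the two occurrences the same variable and turning the cross
product into a real join (though note the LIMIT 3 would then still apply
to the whole sub-query result, not per E):

```sparql
SELECT DISTINCT ?O ?T ?E
WHERE {
  ?E a x:E .
  {
    # Projecting ?E makes it visible to, and joinable with, the outer BGP
    SELECT ?E ?O ?T
    WHERE {
      ?O :oE ?E ;
         :oT ?T .
    }
    ORDER BY DESC(?T)
    LIMIT 3    # still a global limit, not a per-E limit
  }
}
```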

This means that your inner sub-query is getting executed 7273 times,
which is likely why it doesn't complete in reasonable time.

You can try disabling that specific optimisation and see if that resolves
the problem.  Assuming this is still talking to a remote Fuseki you can
start your Fuseki instance with the --set arq:optIndexJoinStrategy=false
argument added to disable the specific optimisation in question.  Or if
you are instead testing this directly from code against an in-memory
dataset you can use ARQ.getContext().set(ARQ.optIndexJoinStrategy, false)
to disable the optimisation.

It looks like there is scope for improving this optimisation to be more
conservative when it involves potentially expensive operators.  If you
could provide a test dataset that reproduces the problem, that would help
us with improving the optimisation.

>
>My TDB store contains 779 O’s, 7273 E’s, and a T for every O.
>
>The full query is this:
>
>SELECT DISTINCT ?O ?T ?S ?E ?P
>WHERE{  
>               ?E a prov:E. 
>               { SELECT ?O ?T 
>                 WHERE {
>                       ?O :otE ?E;
>                               :oT ?T .
>                       } ORDER BY DESC(?T)
>                         LIMIT 3
>               }{ SELECT ?SH
>                 WHERE {
>                 ?EHS :ehsO ?O.
>                 ?EH :ehEHSf ?EHS.
>                 ?SHS :shsEHS ?EHS;
>                       :shsSH ?SH .
>                 ?SH :rank ?SHRank.
>                 } ORDER BY ASC(?SHRank)
>                   LIMIT 1 
>               } 
>               ?SH :shS ?S.
>       OPTIONAL {?O :op ?P.}
>       }
>
>A pretty complex query, and one that *certainly* won’t complete if the
>simpler one won’t.

Yes, the query suffers from the same issue as the earlier one, based on
looking at the algebra:

(distinct
 (project (?O ?T ?S ?E ?P)
  (conditional
   (sequence
    (bgp (triple ?E rdf:type prov:E))
    (project (?O ?T)
     (top (3 (desc ?T))
     (bgp
      (triple ?O :otE ?/E)
      (triple ?O :oT ?T)
     )))
    (project (?SH)
     (top (1 (asc ?/SHRank))
      (bgp
       (triple ?/EHS :ehsO ?/O)
       (triple ?/EH :ehEHSf ?/EHS)
       (triple ?/SHS :shsEHS ?/EHS)
       (triple ?/SHS :shsSH ?SH)
       (triple ?SH :rank ?/SHRank)
      )))
    (bgp (triple ?SH :shS ?S)))
   (bgp (triple ?O :op ?P)))))


This query suffers even worse from the above effect because the results
of the ?E a prov:E triple pattern feed into the first sub-query, which
then feeds into the second sub-query, so you get huge numbers of
executions of the second sub-query.

Again disabling the culprit optimisation may significantly improve things.

Btw the service I used to generate the algebras is the query validator
service at http://sparql.org/query-validator.html - this service is also
built into Fuseki, so if you haven't customised Fuseki heavily you'll
have it in your local instance as well.
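If you have the Jena command line tools installed you can also print the
algebra locally with the qparse tool (the query file name here is just a
placeholder):

```shell
# Print the algebra (op) form of a SPARQL query using Jena's qparse
qparse --print=op --query my-query.rq
```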

Rob

>
>Thanks,
>
>Mark



