Re: SPARQL query optimization question

Rob Vesse Tue, 09 Aug 2016 02:18:10 -0700

Mark

The key thing to understand when talking about SPARQL performance is that, 
strictly speaking, evaluation is bottom up, starting from the leftmost child 
operator of the query. It is easiest to discuss this by looking at the algebra 
form of your two queries. For your first query:


(base <http://example/base/>
  (prefix ((sem: <urn:sem:>))
    (graph <urn:guid:wood>
       (sequence
         (bgp (triple ?pc ?p ?o))
         (path ?e (path+ sem:AnotherPred) ?pc)
         (bgp
           (triple ?e ?ep ?eo)
           (triple ?pc sem:SomePred ?fc)
           (triple ?fc ?fp ?fo)
         )
         (project (?pc)
           (bgp (triple ?pc sem:MyPred "2")))))))

So in this first case the leftmost child is the very generic triple pattern 
that matches everything in the graph, followed by a potentially expensive 
property path and another set of generic triple patterns. Your specific 
subquery is the rightmost child, so it will be evaluated last. Moving the 
subquery earlier in your query may significantly improve performance.
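
As a rough, untested sketch of that reordering, the first query could be 
written with the subquery first, so that it should become the leftmost child 
of the sequence. The PREFIX line is an assumption taken from the sem: mapping 
shown in the algebra above; everything else is your original query with only 
the position of the subquery changed:

```sparql
PREFIX sem: <urn:sem:>

CONSTRUCT {
    ?pc ?p ?o .
    ?e ?ep ?eo .
    ?fc ?fp ?fo
} WHERE {
    GRAPH <urn:guid:wood>
    {
        # Restrictive subquery first, so ?pc is bound before
        # the generic patterns and the property path are evaluated.
        {
            SELECT ?pc
            WHERE
            {
                ?pc sem:MyPred "2"
            }
        }
        ?pc ?p ?o .
        ?e sem:AnotherPred+ ?pc .
        ?e ?ep ?eo .
        ?pc sem:SomePred ?fc .
        ?fc ?fp ?fo
    }
}
```

Unlike the UNION form, this still requires all of the patterns to match 
together for a given ?pc, as in your original query. How much faster it runs 
will depend on your data; checking the algebra for the rewritten query should 
show whether the project operator now comes first in the sequence.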

 For your second query:

(base <http://example/base/>
   (prefix ((sem: <urn:sem:>))
     (graph <urn:guid:wood>
       (union
         (union
           (sequence
             (bgp (triple ?pc ?p ?o))
             (project (?pc)
               (bgp (triple ?pc sem:MyPred "2"))))
           (sequence
             (bgp
               (triple ?pc sem:SomePred ?fc)
               (triple ?fc ?fp ?fo)
             )
             (project (?pc)
               (bgp (triple ?pc sem:MyPred "2")))))
         (sequence
           (path ?e (path+ sem:AnotherPred) ?pc)
           (bgp (triple ?e ?ep ?eo))
           (project (?pc)
             (bgp (triple ?pc sem:MyPred "2"))))))))

Here your specific subqueries are evaluated sooner, since they are further 
left in the operator tree.

In both of these queries you see the use of the sequence operator, which is 
essentially a streaming index join: possible solutions from the earlier 
operators in the sequence are substituted into the operators later in the 
sequence to reduce the search space. The ordering of operators in the second 
query presumably produces a much smaller solution space, hence the faster 
evaluation time.

Subquery results are never reused in Jena. Unfortunately there is no syntactic 
sugar to make repeated use of a subquery easier, nor have I yet seen any 
proposal for what such a syntax should look like. For anything to be 
incorporated into a future standard there typically needs to be a clear use 
case (which there is) but also one or more existing extensions to the language 
that demonstrate such an extension is actively used. Experimenting with this in 
ARQ would be a nice future submission or student project.

 Rob

On 08/08/2016 17:52, "Mark D Wood" <[email protected]> wrote:

    I am trying to piece together three different but connected portions of a 
graph extracted from a large triple store, and I am surprised by some 
performance results that I see.   Some guidance would be appreciated.
    
    The most obvious way to construct the desired data is the following, where 
I'm trying to extract all the predicates pertaining to resources ?pc, ?e and 
?fc, and where subjects ?pc are the critical links.   The values for ?pc are 
defined by the subquery.
    
    CONSTRUCT {
        ?pc ?p ?o .
        ?e ?ep ?eo .
        ?fc ?fp ?fo
    } WHERE {
        GRAPH <urn:guid:wood>
        {
            ?pc ?p ?o .
            ?e sem:AnotherPred+ ?pc .
            ?e ?ep ?eo .
            ?pc  sem:SomePred ?fc .
            ?fc ?fp ?fo
            {
                SELECT ?pc
                WHERE
                {
                    ?pc  sem:MyPred "2"
                }
            }
        }
    }
    
    where the three different patterns in the CONSTRUCT template correspond to 
the three different types of (related) data that I'm extracting.  The subquery 
imposes a restriction on the subjects that I'm interested in.
    
    The above form takes about 70 seconds to run, whereas if I restructure it 
to use the UNION construct, it executes in less than a second:
    
    CONSTRUCT {
        ?pc ?p ?o .
        ?fc ?fp ?fo .
        ?e ?ep ?eo
    }
    WHERE {
        GRAPH <urn:guid:wood>
        {
            {
                ?pc ?p ?o .
                {
                    SELECT ?pc
                    WHERE
                    {
                        ?pc  sem:MyPred "2"
                    }
                }
            } UNION {
                ?pc  sem:SomePred ?fc .
                ?fc ?fp ?fo
                {
                    SELECT ?pc
                    WHERE
                    {
                        ?pc  sem:MyPred "2"
                    }
                }
            } UNION {
                ?e sem:AnotherPred+ ?pc .
                ?e ?ep ?eo
                {
                    SELECT ?pc
                    WHERE
                    {
                        ?pc  sem:MyPred "2"
                    }
                }
            }
        }
    }
    
    
    *       Why is the second form so much faster?
    
    *       Is the SPARQL engine smart enough to see that the subquery is the 
same across the three different UNION statements? (Given the speed in which it 
executes, I would assume so!)
    
    *       Is there any syntactic sugar-coating that I can do, to avoid 
repeating the subquery?  (I'm guessing no, but perhaps something is planned for 
a future version of SPARQL?)
    
    Thanks,
    -Mark
