Re: SPARQL performance question

Andy Seaborne Thu, 27 Feb 2020 05:11:18 -0800

Thanks!

On 27/02/2020 12:27, Steve Vestal wrote:

To answer your question, Andy,


== The old query, some names abbreviated:
     PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
     PREFIX owl:<http://www.w3.org/2002/07/owl#>
     SELECT ?connectionAA ?connectionAB ?connectionBA ?connectionBB
            ?leftA ?leftB ?rightA ?rightB ?singleHardware
     WHERE {
     ?connectionAA rdf:type <#portConnection>.
     ?connectionAB rdf:type <#portConnection>.
     ?connectionBA rdf:type <#portConnection>.
     ?connectionBB rdf:type <#portConnection>.
     ?leftA rdf:type <#thread>.
     ?leftB rdf:type <#thread>.
     ?rightA rdf:type <#thread>.
     ?rightB rdf:type <#thread>.
     ?singleHardware rdf:type <#platform>.
     ?leftA <#simplexConnectTo> ?connectionAA.
     ?connectionAA <#simplexConnectTo> ?rightA.
     ?leftA <#simplexConnectTo> ?connectionAB.
     ?connectionAB <#simplexConnectTo> ?rightB.
     ?leftB <#simplexConnectTo> ?connectionBA.
     ?connectionBA <#simplexConnectTo> ?rightA.
     ?leftB <#simplexConnectTo> ?connectionBB.
     ?connectionBB <#simplexConnectTo> ?rightB.
     ?connectionAA <#boundTo> ?singleHardware.
     ?connectionBA <#boundTo> ?singleHardware.
     FILTER (?connectionAA!=?connectionAB && ?connectionAA!=?connectionBA
             && ?connectionAA!=?connectionBB && ?connectionAA!=?leftA
             && ?connectionAA!=?leftB && ?connectionAA!=?rightA
             && ?connectionAA!=?rightB && ?connectionAA!=?singleHardware
             && ?connectionAB!=?connectionBA && ?connectionAB!=?connectionBB
             && ?connectionAB!=?leftA && ?connectionAB!=?leftB
             && ?connectionAB!=?rightA && ?connectionAB!=?rightB
             && ?connectionAB!=?singleHardware
             && ?connectionBA!=?connectionBB && ?connectionBA!=?leftA
             && ?connectionBA!=?leftB && ?connectionBA!=?rightA
             && ?connectionBA!=?rightB && ?connectionBA!=?singleHardware
             && ?connectionBB!=?leftA && ?connectionBB!=?leftB
             && ?connectionBB!=?rightA && ?connectionBB!=?rightB
             && ?connectionBB!=?singleHardware
             && ?leftA!=?leftB && ?leftA!=?rightA && ?leftA!=?rightB
             && ?leftA!=?singleHardware
             && ?leftB!=?rightA && ?leftB!=?rightB && ?leftB!=?singleHardware
             && ?rightA!=?rightB && ?rightA!=?singleHardware
             && ?rightB!=?singleHardware)
     }

== The new query:
     PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
     PREFIX owl:<http://www.w3.org/2002/07/owl#>
     SELECT ?connectionAA ?connectionAB ?connectionBA ?connectionBB
            ?leftA ?leftB ?rightA ?rightB ?singleHardware
     WHERE {
     ?leftA rdf:type <#thread>.
     ?connectionAA rdf:type <#portConnection>.
     ?leftA <#simplexConnectTo> ?connectionAA.
     FILTER(!sameTerm(?leftA,?connectionAA)).
     ?rightA rdf:type <#thread>.
     ?connectionAA <#simplexConnectTo> ?rightA.
     FILTER(!sameTerm(?leftA,?rightA) && !sameTerm(?connectionAA,?rightA)).
     ?connectionAB rdf:type <#portConnection>.
     ?leftA <#simplexConnectTo> ?connectionAB.
     FILTER(!sameTerm(?rightA,?connectionAB) &&
            !sameTerm(?leftA,?connectionAB) &&
            !sameTerm(?connectionAA,?connectionAB)).
     ?rightB rdf:type <#thread>.
     ?connectionAB <#simplexConnectTo> ?rightB.
     FILTER(!sameTerm(?rightA,?rightB) && !sameTerm(?leftA,?rightB) &&
            !sameTerm(?connectionAA,?rightB) &&
            !sameTerm(?connectionAB,?rightB)).
     ?leftB rdf:type <#thread>.
     ?connectionBA rdf:type <#portConnection>.
     ?leftB <#simplexConnectTo> ?connectionBA.
     FILTER(!sameTerm(?rightA,?leftB) && !sameTerm(?rightB,?leftB) &&
            !sameTerm(?leftA,?leftB) && !sameTerm(?connectionAA,?leftB) &&
            !sameTerm(?connectionAB,?leftB) &&
            !sameTerm(?rightA,?connectionBA) &&
            !sameTerm(?rightB,?connectionBA) &&
            !sameTerm(?leftB,?connectionBA) &&
            !sameTerm(?leftA,?connectionBA) &&
            !sameTerm(?connectionAA,?connectionBA) &&
            !sameTerm(?connectionAB,?connectionBA)).
     ?connectionBA <#simplexConnectTo> ?rightA.
     ?connectionBB rdf:type <#portConnection>.
     ?leftB <#simplexConnectTo> ?connectionBB.
     FILTER(!sameTerm(?rightA,?connectionBB) &&
            !sameTerm(?rightB,?connectionBB) &&
            !sameTerm(?leftB,?connectionBB) &&
            !sameTerm(?leftA,?connectionBB) &&
            !sameTerm(?connectionBA,?connectionBB) &&
            !sameTerm(?connectionAA,?connectionBB) &&
            !sameTerm(?connectionAB,?connectionBB)).
     ?connectionBB <#simplexConnectTo> ?rightB.
     ?singleHardware rdf:type <#platform>.
     ?connectionAA <#boundTo> ?singleHardware.
     FILTER(!sameTerm(?rightA,?singleHardware) &&
            !sameTerm(?rightB,?singleHardware) &&
            !sameTerm(?leftB,?singleHardware) &&
            !sameTerm(?leftA,?singleHardware) &&
            !sameTerm(?connectionBA,?singleHardware) &&
            !sameTerm(?connectionAA,?singleHardware) &&
            !sameTerm(?connectionBB,?singleHardware) &&
            !sameTerm(?connectionAB,?singleHardware)).
     ?connectionBA <#boundTo> ?singleHardware.
}

On 2/26/2020 8:06 AM, Andy Seaborne wrote:



On 26/02/2020 11:26, Steve Vestal wrote:

Reporting back as requested to close this issue.


Thank you - knowing usage and experiences is always helpful, as is
whether suggestions did indeed have a useful effect.

Recall the original select query took ~25 minutes on a small test case,
where the query was issued against an OntModel with four imports, trying
various reasoners since reasoning is necessary to get any results in
this test.

Number of asserted sentences: 712
Number of forward-chained entailments at OntModel creation (making
assumptions about reasoning and import handling): 1130
Size of entailment closure: 4421, which took 136 ms to compute (all
times wall-clock times on a laptop)

Experiments indicated a bottleneck occurred due to a FILTER at the end,
a conjunction of many varA!=varB to anti-alias solutions.  VisualVM
profiling indicated over 99% of the time was spent in cycles of
recursive calls involving
org.apache.jena.sparql.engine.iterator.QueryIterRepeatApply.hasNextBinding,
makeNextStep, *.QueryIteratorBase.hasNext, and
*.QueryIterProcessBinding.hasNextBinding.  (I have no idea what if
anything that means, but in case it is of interest to someone...)


It does to me.

These are the method calls made when moving from one intermediate
result to the next, which suggests there are a lot of intermediate rows
being processed.
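
To get a feel for how many intermediate rows a cross product can
produce, here is a rough Java sketch.  It assumes, purely for
illustration, that the nine rdf:type patterns are joined first with no
reordering by the engine (ARQ may well reorder them), and the class
sizes used in main (20 port connections, 20 threads, 5 platforms) are
made-up numbers, not figures from this thread.  The class and method
names are hypothetical.

```java
// Rough upper bound on intermediate rows when the nine rdf:type patterns
// are joined before any connecting pattern. All sizes are hypothetical.
final class CrossProductEstimate {

    // base^exponent with overflow checking.
    static long power(long base, int exponent) {
        long result = 1;
        for (int i = 0; i < exponent; i++) {
            result = Math.multiplyExact(result, base);
        }
        return result;
    }

    // Four connection variables, four thread variables, one platform variable,
    // each ranging independently over its class.
    static long typeFirstRows(long connections, long threads, long platforms) {
        long rows = Math.multiplyExact(power(connections, 4), power(threads, 4));
        return Math.multiplyExact(rows, platforms);
    }

    public static void main(String[] args) {
        // Assumed class sizes, chosen only for illustration.
        System.out.println(typeFirstRows(20, 20, 5));
    }
}
```

Even with small classes the product grows with the fourth power of each
class size, which is consistent with the advice to let the connecting
patterns narrow the candidates before the type checks are applied.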

The first exercise was to omit the anti-aliasing filter from the select
query itself and post-process the result to ignore solution rows with
aliased variable solutions.  That increased the size of the select
result from 192 to 576 rows but reduced the time from ~25 minutes to
~450 ms.
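
A minimal sketch of that kind of post-processing, in plain Java.  It
is not the code used in this thread: each solution row is modelled as a
Map from variable name to a string form of the bound value, standing in
for a Jena QuerySolution, and the class and method names are
hypothetical.  Comparing the strings is closer in spirit to sameTerm
than to the value comparison done by "!=".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Jena: drops "aliased" rows, i.e. rows in
// which two different variables are bound to the same value.
final class AliasPostFilter {

    // True when every variable in the row is bound to a different value.
    static boolean allDistinct(Map<String, String> row) {
        Set<String> seen = new HashSet<>();
        for (String value : row.values()) {
            if (!seen.add(value)) {
                return false; // value is already bound to another variable
            }
        }
        return true;
    }

    // Keeps only the rows whose bindings are pairwise distinct.
    static List<Map<String, String>> dropAliased(List<Map<String, String>> rows) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (allDistinct(row)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("leftA", "#t1", "leftB", "#t2"),   // distinct, kept
            Map.of("leftA", "#t1", "leftB", "#t1"));  // aliased, dropped
        System.out.println(dropAliased(rows).size());
    }
}
```

The check is one pass over each row with a hash set, so its cost grows
with the number of rows returned rather than with the number of
intermediate rows the engine had to consider.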

The initial query listed all rdf:type triples first, triples that
specified properties between nodes next, and a final big-bang filter at
the end.  The second exercise was to shuffle these triples into an order
intended to progressively narrow down the search space under the
assumption triples are processed in the order they are listed in the
query (as suggested in "Learning SPARQL," noting that Andy's earlier
post said ARQ does do some reordering for optimization).  The original
filter was fragmented into multiple smaller pieces that were also
shuffled among the other triples for this exercise.  This resulted in a
time of 111 ms, further reduced to 99 ms by switching from "!=" to
"!sameTerm" in the filters.
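
One way to produce such fragmented filters mechanically is sketched
below in plain Java.  This is an illustration under assumptions, not
the method used in this thread: given the order in which variables are
first bound, it emits one FILTER per newly introduced variable,
comparing it with every variable bound earlier.  The class and method
names are hypothetical, and a hand-written query may merge adjacent
filters instead of keeping one per variable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one FILTER per newly introduced variable,
// comparing it with all variables that were bound before it.
final class IncrementalFilters {

    // FILTER to place right after the pattern that first binds
    // vars.get(index); empty for the first variable.
    static String filterFor(List<String> vars, int index) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < index; i++) {
            terms.add("!sameTerm(?" + vars.get(i) + ",?" + vars.get(index) + ")");
        }
        return terms.isEmpty() ? "" : "FILTER(" + String.join(" && ", terms) + ")";
    }

    // Total number of pairwise comparisons for n variables: n*(n-1)/2.
    static int pairCount(int n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // Binding order assumed for illustration.
        List<String> vars = List.of("leftA", "connectionAA", "rightA",
            "connectionAB", "rightB", "leftB", "connectionBA",
            "connectionBB", "singleHardware");
        for (int i = 1; i < vars.size(); i++) {
            System.out.println(filterFor(vars, i));
        }
        System.out.println(pairCount(vars.size()) + " comparisons in total");
    }
}
```

Each comparison then becomes applicable as soon as both of its
variables are bound, so aliased partial solutions can be discarded
before they are extended further.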


Good news!

What is the final query?


I'm back in the saddle.  Thanks again for everyone's help.


     Andy


On 2/25/2020 12:33 PM, Andy Seaborne wrote:

Current is 3.14.0.

On 25/02/2020 17:38, Steve Vestal wrote:

I'm currently using 3.8.0 jars.

On 2/25/2020 11:30 AM, Andy Seaborne wrote:



On 25/02/2020 16:25, Steve Vestal wrote:

I read that chapter in DuCharme's book and have some things to try, such
as moving the rdf:type triples around and fragmenting that single filter
into pieces distributed throughout the query, and just doing my own
post-processing to get disjoint variables.  I'll report back when time
permits.

My reading did raise the question of what ARQ does for optimization,
which the book suggested can vary quite a bit between different SPARQL
engines.  I took an admittedly very hasty peek at some sections of the
online ARQ documentation, and it mentions optimization in a number of
places, but is there a tutorial overview on do's and don'ts when
formulating the queries?  A specific question is, will user ordering of
triples have a significant effect and should always be considered
because that's the order in which search will be done, or is the
optimizer going to do its own reordering regardless?  Your suggestion
implies the former.


ARQ does do some reordering but the issue here is made complicated by
the fact that filter placement and reordering interact.

Putting in {} sometimes helps as well because

{ triple patterns FILTERs }
{ triple patterns FILTERs }

is actually a different query and can push the optimizer to make
better choices.

Optimization is a lot of "it depends".

(BTW which version are you running?)

       Andy


On 2/25/2020 8:54 AM, Andy Seaborne wrote:

It might be worth reordering the triple patterns and/or putting in some
clustering: there is a large amount of cross product being done, which
means many, many unwanted or duplicate pieces of work.

For example, move the rdf:type to the end (do you need them at all?)

        Andy

(Replaced long URIs for email:)

?leftA    <#simplexConnectTo>  ?connectionAA .
?connectionAA <#simplexConnectTo>  ?rightA .

?leftA    <#simplexConnectTo>  ?connectionAB .
?connectionAB <#simplexConnectTo>  ?rightB .

?leftB    <#simplexConnectTo>  ?connectionBA .
?connectionBA <#simplexConnectTo>  ?rightA .

?leftB    <#simplexConnectTo>  ?connectionBB .
?connectionBB <#simplexConnectTo>  ?rightB .

?connectionAA <fhowl/singlepointfailpattern#boundTo> ?singleHardware .
?connectionBA <fhowl/singlepointfailpattern#boundTo> ?singleHardware .

?connectionAA rdf:type <#portConnection> .
?connectionAB rdf:type <#portConnection> .
?connectionBA rdf:type <#portConnection> .
?connectionBB rdf:type <#portConnection> .

?leftA    rdf:type              <#thread> .
?leftB    rdf:type              <#thread> .
?rightA   rdf:type              <#thread> .
?rightB   rdf:type              <#thread> .
?singleHardware rdf:type              <#platform> .





On 24/02/2020 10:01, Rob Vesse wrote:

To add to what else has been said:

Query execution in Apache Jena ARQ is based upon lazy evaluation
wherever possible.  Calling execSelect() simply prepares a ResultSet
that is capable of delivering the results but doesn't actually
evaluate the query and produce any results until you call
hasNext()/next().  When you call either of these methods then ARQ
does the minimum amount of work to return the next result (or batch
of results) depending on the underlying algebra of the query.
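
The general pattern can be illustrated with a plain Java iterator.
This is only an analogy and not ARQ's implementation: the LazyResults
class, its workDone counter and the squared values it returns are all
made up for the example.  Constructing the iterator does no work, and
each result is computed only when hasNext() is asked for it.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Stand-in for a lazily evaluated result set; not ARQ's implementation.
final class LazyResults implements Iterator<Integer> {
    private final int limit;
    private int produced = 0;
    private int workDone = 0;   // counts the simulated evaluation steps
    private Integer pending;    // next result, computed on demand

    LazyResults(int limit) {
        this.limit = limit;     // construction does no evaluation work
    }

    @Override
    public boolean hasNext() {
        if (pending == null && produced < limit) {
            workDone++;                     // the expensive step happens here
            pending = produced * produced;
        }
        return pending != null;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Integer result = pending;
        pending = null;
        produced++;
        return result;
    }

    int workDone() {
        return workDone;
    }

    public static void main(String[] args) {
        LazyResults results = new LazyResults(3);
        System.out.println("work after construction: " + results.workDone());
        while (results.hasNext()) {
            System.out.println(results.next());
        }
        System.out.println("work after iteration: " + results.workDone());
    }
}
```

With this shape, a profiler attributes the cost to hasNext() rather
than to the call that created the iterator, which matches the
behaviour reported in the original message.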

Rob

On 23/02/2020, 18:58, "Steve Vestal"
<[email protected]> wrote:

         I'm looking for suggestions on a SPARQL performance issue.  My test
         model has ~800 sentences, and processing of one select query takes
         about 25 minutes.  The query is a basic graph pattern with 9
         variables and 20 triples, plus a filter that forces distinct
         variables to have distinct solutions using pair-wise not-equals
         constraints.  No OPTIONAL clause or anything else fancy.

         I am issuing the query against an inference model.  Most of the
         asserted sentences are in imported models.  If I iterate over all
         the statements in the OntModel, I get ~1500 almost instantly.  I
         experimented with several of the reasoners.

         Below is the basic control flow.  The thing I found curious is that
         the execSelect() method finishes almost instantly.  It is the
         iteration over the ResultSet that is taking all the time, it seems
         in the call to selectResult.hasNext().  The result has 192 rows, 9
         columns.  The results are provided in bursts of 8 rows each, with
         ~1 minute between bursts.

             OntModel ontologyModel = getMyOntModel(); // Tried various reasoners
             String selectQuery = getMySelectQuery();
             QueryExecution selectExec =
                 QueryExecutionFactory.create(selectQuery, ontologyModel);
             ResultSet selectResult = selectExec.execSelect();
             while (selectResult.hasNext()) {  // Time seems to be spent in hasNext
                 QuerySolution selectSolution = selectResult.next();
                 for (String var : getMyVariablesOfInterest()) {
                     RDFNode varValue = selectSolution.get(var);
                     // process varValue
                 }
             }

         Any suggestions would be appreciated.
