Re: SPARQL performance question

Steve Vestal Thu, 27 Feb 2020 04:28:13 -0800

To answer your question, Andy,

== The old query, some names abbreviated:
    PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl:<http://www.w3.org/2002/07/owl#>
    SELECT ?connectionAA ?connectionAB ?connectionBA ?connectionBB
           ?leftA ?leftB ?rightA ?rightB ?singleHardware
    WHERE {
    ?connectionAA rdf:type <#portConnection>.
    ?connectionAB rdf:type <#portConnection>.
    ?connectionBA rdf:type <#portConnection>.
    ?connectionBB rdf:type <#portConnection>.
    ?leftA rdf:type <#thread>.
    ?leftB rdf:type <#thread>.
    ?rightA rdf:type <#thread>.
    ?rightB rdf:type <#thread>.
    ?singleHardware rdf:type <#platform>.
    ?leftA <#simplexConnectTo> ?connectionAA.
    ?connectionAA <#simplexConnectTo> ?rightA.
    ?leftA <#simplexConnectTo> ?connectionAB.
    ?connectionAB <#simplexConnectTo> ?rightB.
    ?leftB <#simplexConnectTo> ?connectionBA.
    ?connectionBA <#simplexConnectTo> ?rightA.
    ?leftB <#simplexConnectTo> ?connectionBB.
    ?connectionBB <#simplexConnectTo> ?rightB.
    ?connectionAA <#boundTo> ?singleHardware.
    ?connectionBA <#boundTo> ?singleHardware.
    FILTER (?connectionAA!=?connectionAB && ?connectionAA!=?connectionBA
            && ?connectionAA!=?connectionBB && ?connectionAA!=?leftA
            && ?connectionAA!=?leftB && ?connectionAA!=?rightA
            && ?connectionAA!=?rightB && ?connectionAA!=?singleHardware
            && ?connectionAB!=?connectionBA && ?connectionAB!=?connectionBB
            && ?connectionAB!=?leftA && ?connectionAB!=?leftB
            && ?connectionAB!=?rightA && ?connectionAB!=?rightB
            && ?connectionAB!=?singleHardware
            && ?connectionBA!=?connectionBB && ?connectionBA!=?leftA
            && ?connectionBA!=?leftB && ?connectionBA!=?rightA
            && ?connectionBA!=?rightB && ?connectionBA!=?singleHardware
            && ?connectionBB!=?leftA && ?connectionBB!=?leftB
            && ?connectionBB!=?rightA && ?connectionBB!=?rightB
            && ?connectionBB!=?singleHardware
            && ?leftA!=?leftB && ?leftA!=?rightA && ?leftA!=?rightB
            && ?leftA!=?singleHardware
            && ?leftB!=?rightA && ?leftB!=?rightB && ?leftB!=?singleHardware
            && ?rightA!=?rightB && ?rightA!=?singleHardware
            && ?rightB!=?singleHardware)
    }
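
The FILTER above compares all 36 pairs of the nine variables. The first
exercise described in the quoted report below dropped it from the query
and rejected aliased rows afterwards. A minimal Java sketch of that
post-processing step follows. It assumes each solution row has already
been copied into a Map from variable name to the string form of the bound
node; the class and method names (AntiAlias, allDistinct, dropAliased)
are hypothetical and not part of Jena.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class AntiAlias {
    // True when no two variables in the row are bound to the same term.
    // Assumption: terms are compared by their string form.
    static boolean allDistinct(Map<String, String> row) {
        Set<String> seen = new HashSet<>();
        for (String term : row.values()) {
            if (!seen.add(term)) {
                return false; // a second variable is bound to this term
            }
        }
        return true;
    }

    // Keeps only the rows whose bindings are pairwise distinct.
    static List<Map<String, String>> dropAliased(List<Map<String, String>> rows) {
        return rows.stream()
                   .filter(AntiAlias::allDistinct)
                   .collect(Collectors.toList());
    }
}
```

The set-based check does the work of the 36 pairwise comparisons in one
pass over each row, which is why it stays cheap even when the unfiltered
result is larger.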


== The new query:
    PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl:<http://www.w3.org/2002/07/owl#>
    SELECT ?connectionAA ?connectionAB ?connectionBA ?connectionBB
           ?leftA ?leftB ?rightA ?rightB ?singleHardware
    WHERE {
    ?leftA rdf:type <#thread>.
    ?connectionAA rdf:type <#portConnection>.
    ?leftA <#simplexConnectTo> ?connectionAA.
    FILTER(!sameTerm(?leftA,?connectionAA)).
    ?rightA rdf:type <#thread>.
    ?connectionAA <#simplexConnectTo> ?rightA.
    FILTER(!sameTerm(?leftA,?rightA) && !sameTerm(?connectionAA,?rightA)).
    ?connectionAB rdf:type <#portConnection>.
    ?leftA <#simplexConnectTo> ?connectionAB.
    FILTER(!sameTerm(?rightA,?connectionAB) &&
           !sameTerm(?leftA,?connectionAB) &&
           !sameTerm(?connectionAA,?connectionAB)).
    ?rightB rdf:type <#thread>.
    ?connectionAB <#simplexConnectTo> ?rightB.
    FILTER(!sameTerm(?rightA,?rightB) && !sameTerm(?leftA,?rightB) &&
           !sameTerm(?connectionAA,?rightB) &&
           !sameTerm(?connectionAB,?rightB)).
    ?leftB rdf:type <#thread>.
    ?connectionBA rdf:type <#portConnection>.
    ?leftB <#simplexConnectTo> ?connectionBA.
    FILTER(!sameTerm(?rightA,?leftB) && !sameTerm(?rightB,?leftB) &&
           !sameTerm(?leftA,?leftB) && !sameTerm(?connectionAA,?leftB) &&
           !sameTerm(?connectionAB,?leftB) &&
           !sameTerm(?rightA,?connectionBA) &&
           !sameTerm(?rightB,?connectionBA) &&
           !sameTerm(?leftB,?connectionBA) &&
           !sameTerm(?leftA,?connectionBA) &&
           !sameTerm(?connectionAA,?connectionBA) &&
           !sameTerm(?connectionAB,?connectionBA)).
    ?connectionBA <#simplexConnectTo> ?rightA.
    ?connectionBB rdf:type <#portConnection>.
    ?leftB <#simplexConnectTo> ?connectionBB.
    FILTER(!sameTerm(?rightA,?connectionBB) &&
           !sameTerm(?rightB,?connectionBB) &&
           !sameTerm(?leftB,?connectionBB) &&
           !sameTerm(?leftA,?connectionBB) &&
           !sameTerm(?connectionBA,?connectionBB) &&
           !sameTerm(?connectionAA,?connectionBB) &&
           !sameTerm(?connectionAB,?connectionBB)).
    ?connectionBB <#simplexConnectTo> ?rightB.
    ?singleHardware rdf:type <#platform>.
    ?connectionAA <#boundTo> ?singleHardware.
    FILTER(!sameTerm(?rightA,?singleHardware) &&
           !sameTerm(?rightB,?singleHardware) &&
           !sameTerm(?leftB,?singleHardware) &&
           !sameTerm(?leftA,?singleHardware) &&
           !sameTerm(?connectionBA,?singleHardware) &&
           !sameTerm(?connectionAA,?singleHardware) &&
           !sameTerm(?connectionBB,?singleHardware) &&
           !sameTerm(?connectionAB,?singleHardware)).
    ?connectionBA <#boundTo> ?singleHardware.
}
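
The filter placement in the new query follows a mechanical rule: when a
variable is first bound, it is compared against every variable bound
before it. A minimal Java sketch that generates such fragments from a
binding order is given here; IncrementalFilters and fragments are
hypothetical names, and the sketch only builds strings, it does not call
Jena.

```java
import java.util.ArrayList;
import java.util.List;

final class IncrementalFilters {
    // For each variable, in the order the pattern first binds it, builds a
    // FILTER comparing it with every variable bound earlier. The first
    // variable has nothing to compare against, so its entry is empty.
    static List<String> fragments(List<String> bindingOrder) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < bindingOrder.size(); i++) {
            List<String> terms = new ArrayList<>();
            for (int j = 0; j < i; j++) {
                terms.add("!sameTerm(?" + bindingOrder.get(j)
                        + ",?" + bindingOrder.get(i) + ")");
            }
            out.add(terms.isEmpty()
                    ? ""
                    : "FILTER(" + String.join(" && ", terms) + ")");
        }
        return out;
    }
}
```

For the binding order used above (leftA, connectionAA, rightA,
connectionAB, rightB, leftB, connectionBA, connectionBB, singleHardware)
this gives 0+1+...+8 = 36 comparisons, the same pairs as the single
FILTER in the old query. The query above merges the leftB and
connectionBA fragments into one FILTER and orders the sameTerm arguments
differently; sameTerm is symmetric, so that does not change the result.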

On 2/26/2020 8:06 AM, Andy Seaborne wrote:
>
>
> On 26/02/2020 11:26, Steve Vestal wrote:
>> Reporting back as requested to close this issue.
>
> Thank you - knowing usage and experiences is always helpful, as is
> whether suggestions did indeed have a useful effect.
>
>> Recall the original select query took ~25 minutes on a small test case,
>> where the query was issued against an OntModel with four imports, trying
>> various reasoners since reasoning is necessary to get any results in
>> this test.
>>
>> Number of asserted sentences: 712
>> Number of forward-chained entailments at OntModel creation (making
>> assumptions about reasoning and import handling): 1130
>> Size of entailment closure: 4421, which took 136 ms to compute (all
>> times wall-clock times on a laptop)
>>
>> Experiments indicated a bottleneck occurred due to a FILTER at the end,
>> a conjunction of many varA!=varB to anti-alias solutions.  VisualVM
>> profiling indicated over 99% of the time was spent in cycles of
>> recursive calls involving
>> org.apache.jena.sparql.engine.iterator.QueryIterRepeatApply.hasNextBinding,
>> makeNextStep, *.QueryIteratorBase.hasNext, and
>> *.QueryIterProcessBinding.hasNextBinding.  (I have no idea what if
>> anything that means, but in case it is of interest to someone...)
>
> It does to me.
>
> These are the method calls made when moving from one intermediate
> result to another, suggesting there are a lot of intermediate rows
> being processed.
>
>> The first exercise was to omit the anti-aliasing filter from the select
>> query itself and post-process the result to ignore solution rows with
>> aliased variable solutions.  That increased the size of the select
>> result from 192 to 576 rows but reduced the time from ~25 minutes to
>> ~450 ms.
>>
>> The initial query listed all rdf:type triples first, triples that
>> specified properties between nodes next, and a final big-bang filter at
>> the end.  The second exercise was to shuffle these triples into an order
>> intended to progressively narrow down the search space under the
>> assumption triples are processed in the order they are listed in the
>> query (as suggested in "Learning SPARQL," noting that Andy's earlier
>> post said ARQ does do some reordering for optimization).  The original
>> filter was fragmented into multiple smaller pieces that were also
>> shuffled among the other triples for this exercise.  This resulted in a
>> time of 111 ms, further reduced to 99 ms by switching from "!=" to
>> "!sameTerm" in the filters.
>
> Good news!
>
> What is the final query?
>
>>
>> I'm back in the saddle.  Thanks again for everyone's help.
>
>     Andy
>
>>
>> On 2/25/2020 12:33 PM, Andy Seaborne wrote:
>>> Current is 3.14.0.
>>>
>>> On 25/02/2020 17:38, Steve Vestal wrote:
>>>> I'm currently using 3.8.0 jars.
>>>>
>>>> On 2/25/2020 11:30 AM, Andy Seaborne wrote:
>>>>>
>>>>>
>>>>> On 25/02/2020 16:25, Steve Vestal wrote:
>>>>>> I read that chapter in DuCharme's book and have some things to try,
>>>>>> such as moving the rdf:type triples around and fragmenting that single
>>>>>> filter into pieces distributed throughout the query, and just doing my
>>>>>> own post-processing to get disjoint variables.  I'll report back when
>>>>>> time permits.
>>>>>>
>>>>>> My reading did raise the question of what ARQ does for optimization,
>>>>>> which the book suggested can vary quite a bit between different SPARQL
>>>>>> engines.  I took an admittedly very hasty peek at some sections of the
>>>>>> online ARQ documentation, and it mentions optimization in a number of
>>>>>> places, but is there a tutorial overview on do's and don'ts when
>>>>>> formulating the queries?  A specific question is, will user ordering of
>>>>>> triples have a significant effect and should always be considered
>>>>>> because that's the order in which search will be done, or is the
>>>>>> optimizer going to do its own reordering regardless?  Your suggestion
>>>>>> implies the former.
>>>>>
>>>>> ARQ does do some reordering but the issue here is made complicated by
>>>>> the fact that filter placement and reordering interact.
>>>>>
>>>>> Putting in {} sometimes helps as well because
>>>>>
>>>>> { triple patterns FILTERs }
>>>>> { triple patterns FILTERs }
>>>>>
>>>>> is actually a different query and can push the optimizer to make
>>>>> better choices.
>>>>>
>>>>> Optimization is a lot of "it depends".
>>>>>
>>>>> (BTW which version are you running?)
>>>>>
>>>>>       Andy
>>>>>
>>>>>>
>>>>>> On 2/25/2020 8:54 AM, Andy Seaborne wrote:
>>>>>>> It might be worth reordering the triple patterns and/or putting in
>>>>>>> some clustering: there is a large amount of cross product being done,
>>>>>>> which means many, many unwanted or duplicate pieces of work.
>>>>>>>
>>>>>>> For example, move the rdf:type to the end (do you need them at all?)
>>>>>>>
>>>>>>>        Andy
>>>>>>>
>>>>>>> (Replaced long URIs for email:)
>>>>>>>
>>>>>>> ?leftA    <#simplexConnectTo>  ?connectionAA .
>>>>>>> ?connectionAA <#simplexConnectTo>  ?rightA .
>>>>>>>
>>>>>>> ?leftA    <#simplexConnectTo>  ?connectionAB .
>>>>>>> ?connectionAB <#simplexConnectTo>  ?rightB .
>>>>>>>
>>>>>>> ?leftB    <#simplexConnectTo>  ?connectionBA .
>>>>>>> ?connectionBA <#simplexConnectTo>  ?rightA .
>>>>>>>
>>>>>>> ?leftB    <#simplexConnectTo>  ?connectionBB .
>>>>>>> ?connectionBB <#simplexConnectTo>  ?rightB .
>>>>>>>
>>>>>>> ?connectionAA <fhowl/singlepointfailpattern#boundTo> ?singleHardware .
>>>>>>> ?connectionBA <fhowl/singlepointfailpattern#boundTo> ?singleHardware .
>>>>>>>
>>>>>>> ?connectionAA rdf:type <#portConnection> .
>>>>>>> ?connectionAB rdf:type <#portConnection> .
>>>>>>> ?connectionBA rdf:type <#portConnection> .
>>>>>>> ?connectionBB rdf:type <#portConnection> .
>>>>>>>
>>>>>>> ?leftA    rdf:type              <#thread> .
>>>>>>> ?leftB    rdf:type              <#thread> .
>>>>>>> ?rightA   rdf:type              <#thread> .
>>>>>>> ?rightB   rdf:type              <#thread> .
>>>>>>> ?singleHardware rdf:type              <#platform> .
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 24/02/2020 10:01, Rob Vesse wrote:
>>>>>>>> To add to what else has been said
>>>>>>>>
>>>>>>>> Query execution in Apache Jena ARQ is based upon lazy evaluation
>>>>>>>> wherever possible.  Calling execSelect() simply prepares a ResultSet
>>>>>>>> that is capable of delivering the results but doesn't actually
>>>>>>>> evaluate the query and produce any results until you call
>>>>>>>> hasNext()/next().  When you call either of these methods then ARQ
>>>>>>>> does the minimum amount of work to return the next result (or batch
>>>>>>>> of results) depending on the underlying algebra of the query.
>>>>>>>>
>>>>>>>> Rob
>>>>>>>>
>>>>>>>> On 23/02/2020, 18:58, "Steve Vestal"
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>         I'm looking for suggestions on a SPARQL performance issue.
>>>>>>>>         My test model has ~800 sentences, and processing of one
>>>>>>>>         select query takes about 25 minutes.  The query is a basic
>>>>>>>>         graph pattern with 9 variables and 20 triples, plus a filter
>>>>>>>>         that forces distinct variables to have distinct solutions
>>>>>>>>         using pair-wise not-equals constraints.  No option clause or
>>>>>>>>         anything else fancy.
>>>>>>>>
>>>>>>>>         I am issuing the query against an inference model.  Most of
>>>>>>>>         the asserted sentences are in imported models.  If I iterate
>>>>>>>>         over all the statements in the OntModel, I get ~1500 almost
>>>>>>>>         instantly.  I experimented with several of the reasoners.
>>>>>>>>
>>>>>>>>         Below is the basic control flow.  The thing I found curious
>>>>>>>>         is that the execSelect() method finishes almost instantly.
>>>>>>>>         It is the iteration over the ResultSet that is taking all
>>>>>>>>         the time, it seems in the call to selectResult.hasNext().
>>>>>>>>         The result has 192 rows, 9 columns.  The results are
>>>>>>>>         provided in bursts of 8 rows each, with ~1 minute between
>>>>>>>>         bursts.
>>>>>>>>
>>>>>>>>             // Tried various reasoners
>>>>>>>>             OntModel ontologyModel = getMyOntModel();
>>>>>>>>             String selectQuery = getMySelectQuery();
>>>>>>>>             QueryExecution selectExec =
>>>>>>>>                 QueryExecutionFactory.create(selectQuery, ontologyModel);
>>>>>>>>             ResultSet selectResult = selectExec.execSelect();
>>>>>>>>             // Time seems to be spent in hasNext
>>>>>>>>             while (selectResult.hasNext()) {
>>>>>>>>                 QuerySolution selectSolution = selectResult.next();
>>>>>>>>                 for (String var : getMyVariablesOfInterest()) {
>>>>>>>>                     RDFNode varValue = selectSolution.get(var);
>>>>>>>>                     // process varValue
>>>>>>>>                 }
>>>>>>>>             }
>>>>>>>>
>>>>>>>>         Any suggestions would be appreciated.
>>>>>>>>          
>>
