Reporting back as requested to close this issue.

Recall that the original select query took ~25 minutes on a small test
case.  The query was issued against an OntModel with four imports, and
I tried various reasoners, since reasoning is necessary to get any
results in this test.

Number of asserted sentences: 712
Number of forward-chained entailments at OntModel creation (making
assumptions about reasoning and import handling): 1130
Size of entailment closure: 4421, which took 136 ms to compute (all
times are wall-clock times on a laptop)

Experiments indicated the bottleneck was a FILTER at the end of the
query, a conjunction of many varA != varB terms used to anti-alias
solutions.  VisualVM profiling indicated over 99% of the time was spent
in cycles of recursive calls involving
org.apache.jena.sparql.engine.iterator.QueryIterRepeatApply.hasNextBinding,
makeNextStep, *.QueryIteratorBase.hasNext, and
*.QueryIterProcessBinding.hasNextBinding.  (I have no idea what, if
anything, that means, but in case it is of interest to someone...)
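
For context, the filter had this general shape (a trimmed, illustrative
sketch using a few of the variable names from the query quoted further
down; the real filter covered every pair of variables that had to stay
distinct):

    FILTER ( ?leftA != ?leftB
          && ?rightA != ?rightB
          && ?connectionAA != ?connectionAB
          && ?connectionAA != ?connectionBA
          && ?connectionAB != ?connectionBB )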

The first exercise was to omit the anti-aliasing filter from the select
query itself and post-process the result, ignoring solution rows in
which distinct variables were bound to the same node.  That increased
the size of the select result from 192 to 576 rows but reduced the time
from ~25 minutes to ~450 ms.
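
The post-processing itself was nothing elaborate, roughly the following
(a sketch only, reusing the helper names from my snippet quoted further
down, and assuming getMyVariablesOfInterest() returns a List<String>):

    // Skip any solution row where two distinct variables of interest
    // are bound to the same node, instead of filtering in the query.
    List<String> vars = getMyVariablesOfInterest();  // assumed List<String>
    while (selectResult.hasNext()) {
        QuerySolution selectSolution = selectResult.next();
        boolean aliased = false;
        for (int i = 0; i < vars.size() && !aliased; i++) {
            for (int j = i + 1; j < vars.size() && !aliased; j++) {
                RDFNode a = selectSolution.get(vars.get(i));
                RDFNode b = selectSolution.get(vars.get(j));
                aliased = (a != null && a.equals(b));
            }
        }
        if (aliased) {
            continue;  // ignore the aliased solution row
        }
        // process the row as before
    }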

The initial query listed all the rdf:type triples first, the triples
that specified properties between nodes next, and a final big-bang
filter at the end.  The second exercise was to shuffle these triples
into an order intended to progressively narrow the search space, on the
assumption that triples are processed in the order they are listed in
the query (as suggested in "Learning SPARQL," noting that Andy's earlier
post said ARQ does do some reordering for optimization).  The original
filter was fragmented into multiple smaller pieces that were also
shuffled in among the other triples for this exercise.  This resulted in
a time of 111 ms, further reduced to 99 ms by switching from "!=" to
"!sameTerm" in the filters.

I'm back in the saddle.  Thanks again for everyone's help.

On 2/25/2020 12:33 PM, Andy Seaborne wrote:
> Current is 3.14.0.
>
> On 25/02/2020 17:38, Steve Vestal wrote:
>> I'm currently using 3.8.0 jars.
>>
>> On 2/25/2020 11:30 AM, Andy Seaborne wrote:
>>>
>>>
>>> On 25/02/2020 16:25, Steve Vestal wrote:
>>>> I read that chapter in DuCharme's book and have some things to try,
>>>> such as moving the rdf:type triples around and fragmenting that
>>>> single filter into pieces distributed throughout the query, and just
>>>> doing my own post-processing to get disjoint variables.  I'll report
>>>> back when time permits.
>>>>
>>>> My reading did raise the question of what ARQ does for optimization,
>>>> which the book suggested can vary quite a bit between different
>>>> SPARQL engines.  I took an admittedly very hasty peek at some
>>>> sections of the online ARQ documentation, and it mentions
>>>> optimization in a number of places, but is there a tutorial overview
>>>> on do's and don'ts when formulating queries?  A specific question
>>>> is: will user ordering of triples have a significant effect and
>>>> should it always be considered because that's the order in which the
>>>> search will be done, or is the optimizer going to do its own
>>>> reordering regardless?  Your suggestion implies the former.
>>>
>>> ARQ does do some reordering but the issue here is made complicated by
>>> the fact that filter placement and reordering interact.
>>>
>>> Putting in {} sometimes helps as well because
>>>
>>> { triple patterns FILTERs }
>>> { triple patterns FILTERs }
>>>
>>> is actually a different query and can push the optimizer to make
>>> better choices.
>>>
>>> Optimization is a lot of "it depends".
>>>
>>> (BTW which version are you running?)
>>>
>>>      Andy
>>>
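
For what it's worth, I read the {} suggestion as grouping related
patterns and their filter fragments into blocks along these lines (an
illustrative sketch only, using two of the patterns from the query
below):

    { ?leftA        <#simplexConnectTo> ?connectionAA .
      ?connectionAA <#simplexConnectTo> ?rightA .
      FILTER ( !sameTerm(?leftA, ?rightA) )
    }
    { ?leftB        <#simplexConnectTo> ?connectionBA .
      ?connectionBA <#simplexConnectTo> ?rightA .
      FILTER ( !sameTerm(?leftB, ?rightA) )
    }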
>>>>
>>>> On 2/25/2020 8:54 AM, Andy Seaborne wrote:
>>>>> It might be worth reordering the triple patterns and/or putting in
>>>>> some clustering: there is a large amount of cross product being
>>>>> done, which means many, many unwanted or duplicate pieces of work.
>>>>>
>>>>> For example, move the rdf:type triples to the end (do you need them
>>>>> at all?)
>>>>>
>>>>>       Andy
>>>>>
>>>>> (Replaced long URIs for email:)
>>>>>
>>>>> ?leftA    <#simplexConnectTo>  ?connectionAA .
>>>>> ?connectionAA <#simplexConnectTo>  ?rightA .
>>>>>
>>>>> ?leftA    <#simplexConnectTo>  ?connectionAB .
>>>>> ?connectionAB <#simplexConnectTo>  ?rightB .
>>>>>
>>>>> ?leftB    <#simplexConnectTo>  ?connectionBA .
>>>>> ?connectionBA <#simplexConnectTo>  ?rightA .
>>>>>
>>>>> ?leftB    <#simplexConnectTo>  ?connectionBB .
>>>>> ?connectionBB <#simplexConnectTo>  ?rightB .
>>>>>
>>>>> ?connectionAA <fhowl/singlepointfailpattern#boundTo> ?singleHardware .
>>>>> ?connectionBA <fhowl/singlepointfailpattern#boundTo> ?singleHardware .
>>>>>
>>>>> ?connectionAA rdf:type <#portConnection> .
>>>>> ?connectionAB rdf:type <#portConnection> .
>>>>> ?connectionBA rdf:type <#portConnection> .
>>>>> ?connectionBB rdf:type <#portConnection> .
>>>>>
>>>>> ?leftA    rdf:type              <#thread> .
>>>>> ?leftB    rdf:type              <#thread> .
>>>>> ?rightA   rdf:type              <#thread> .
>>>>> ?rightB   rdf:type              <#thread> .
>>>>> ?singleHardware rdf:type              <#platform> .
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 24/02/2020 10:01, Rob Vesse wrote:
>>>>>> To add to what else has been said
>>>>>>
>>>>>> Query execution in Apache Jena ARQ is based upon lazy evaluation
>>>>>> wherever possible.  Calling execSelect() simply prepares a ResultSet
>>>>>> that is capable of delivering the results but doesn't actually
>>>>>> evaluate the query and produce any results until you call
>>>>>> hasNext()/next().  When you call either of these methods then ARQ
>>>>>> does the minimum amount of work to return the next result (or batch
>>>>>> of results) depending on the underlying algebra of the query.
>>>>>>
>>>>>> Rob
>>>>>>
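
That matches what I observed: execSelect() returned almost immediately
and essentially all of the time was in the iteration.  A quick way to
see the split is sketched here against the same objects as in my
snippet quoted further down; ResultSetFormatter.consume() simply drains
the ResultSet and returns the row count:

    long t0 = System.currentTimeMillis();
    ResultSet selectResult = selectExec.execSelect();     // returns quickly; nothing evaluated yet
    long t1 = System.currentTimeMillis();
    int rows = ResultSetFormatter.consume(selectResult);  // forces full evaluation
    long t2 = System.currentTimeMillis();
    System.out.println("execSelect: " + (t1 - t0) + " ms, iteration: "
            + (t2 - t1) + " ms, rows: " + rows);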
>>>>>> On 23/02/2020, 18:58, "Steve Vestal"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>     I'm looking for suggestions on a SPARQL performance issue.  My
>>>>>> test model has ~800 sentences, and processing of one select query
>>>>>> takes about 25 minutes.  The query is a basic graph pattern with 9
>>>>>> variables and 20 triples, plus a filter that forces distinct
>>>>>> variables to have distinct solutions using pair-wise not-equals
>>>>>> constraints.  No OPTIONAL clause or anything else fancy.
>>>>>>
>>>>>>     I am issuing the query against an inference model.  Most of the
>>>>>> asserted sentences are in imported models.  If I iterate over all
>>>>>> the statements in the OntModel, I get ~1500 almost instantly.  I
>>>>>> experimented with several of the reasoners.
>>>>>>
>>>>>>     Below is the basic control flow.  The thing I found curious is
>>>>>> that the execSelect() method finishes almost instantly.  It is the
>>>>>> iteration over the ResultSet that is taking all the time, it seems
>>>>>> in the call to selectResult.hasNext().  The result has 192 rows, 9
>>>>>> columns.  The results are provided in bursts of 8 rows each, with
>>>>>> ~1 minute between bursts.
>>>>>>
>>>>>>     OntModel ontologyModel = getMyOntModel(); // Tried various reasoners
>>>>>>     String selectQuery = getMySelectQuery();
>>>>>>     QueryExecution selectExec =
>>>>>>         QueryExecutionFactory.create(selectQuery, ontologyModel);
>>>>>>     ResultSet selectResult = selectExec.execSelect();
>>>>>>     while (selectResult.hasNext()) {  // Time seems to be spent in hasNext
>>>>>>         QuerySolution selectSolution = selectResult.next();
>>>>>>         for (String var : getMyVariablesOfInterest()) {
>>>>>>             RDFNode varValue = selectSolution.get(var);
>>>>>>             // process varValue
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>>     Any suggestions would be appreciated.
>>>>>>
