Re: not performant query

Zen 98052 Fri, 17 Jun 2016 06:05:06 -0700

Thanks Andy! Answers inline ...

________________________________
From: Andy Seaborne <[email protected]>
Sent: Friday, June 10, 2016 11:06 AM
To: [email protected]
Subject: Re: not performant query

On 09/06/16 16:28, Zen 98052 wrote:
> Hi,
>
> I have a SPARQL query below, which doesn't seem efficient.
>
> I noticed when running it, Jena calls execute(OpBGP opBGP,
> QueryIterator ...) so many times.

The default execution strategy - i.e. for in-memory use - is just that -
a default.

It

If your storage layer has different characteristics, e.g. there is a
certain amount of overhead to go and get data, then the default execution
strategy may be the wrong one.  That's the job of the optimizer and of
OpExecutor.

What does your storage layer look like?

[Z] We use Accumulo as the storage layer, based on
https://wiki.apache.org/incubator/RyaProposal.
There are three tables, SPO, POS, and OSP, and each BGP is looked up in one
of them, depending on the pattern.
The serialized triple (e.g. S, P, O delimited by the null char) is stored as
the key, so we can set scan ranges to efficiently get all rows that match the
filter.
Therefore, each time Jena calls my callback (the execute function with the
OpBGP arg), it submits a request to the store and iterates over all the rows.
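
For illustration, the table choice described above could look roughly like
the sketch below. It is not the actual Rya or Accumulo code: the class and
method names are hypothetical, a null argument stands for an unbound variable,
and the exact key layout (in particular the trailing delimiter) is an
assumption.

```java
public class IndexChooser {
    /** Null char delimiter between terms in the row key (assumed layout). */
    static final char DELIM = '\u0000';

    /**
     * Pick the table whose key order puts the bound terms first.
     * A null argument means that position is a variable (unbound).
     */
    static String chooseTable(String s, String p, String o) {
        if (s != null) {
            // S and O bound but P unbound: OSP gives the longer prefix (o, s).
            return (p == null && o != null) ? "OSP" : "SPO";
        }
        if (p != null) return "POS";
        if (o != null) return "OSP";
        return "SPO"; // nothing bound: full scan of any table
    }

    /** Build the row-key prefix to scan for, using the chosen table's order. */
    static String scanPrefix(String s, String p, String o) {
        String table = chooseTable(s, p, o);
        StringBuilder sb = new StringBuilder();
        int bound = 0;
        for (char c : table.toCharArray()) {
            String term = (c == 'S') ? s : (c == 'P') ? p : o;
            if (term == null) break;          // prefix ends at first unbound term
            if (bound > 0) sb.append(DELIM);
            sb.append(term);
            bound++;
        }
        // A trailing delimiter stops the prefix "ab" from matching the term "abc".
        if (bound > 0 && bound < 3) sb.append(DELIM);
        return sb.toString();
    }
}
```

With this sketch, a pattern such as { ?o rdf:type v:Dynamic } maps to the POS
table with the prefix "rdf:type", null char, "v:Dynamic", null char, so one
range scan covers all matching rows.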

> I have my own implementation of that function (overriding the base class
> OpExecutor), which makes a call to our back-end storage.
>
> From qparse output (attached below), it looks like the culprit is
> because the query has BGPs inside the FILTER, which explains the
> behavior I am seeing.

Possibly - there are several points where costs may arise.

> ?o rdf:type ?type.
> FILTER NOT EXISTS
>     {
>       { ?o rdf:type v:Dynamic }
>       UNION
>       { ?o rdf:type v:Static }
>     }

FILTER NOT EXISTS {} can usually be written as MINUS or, in this case, as an
expression FILTER on ?type, as you have already fetched the rdf:type.

FILTER ( ?type != v:Dynamic && ?type != v:Static )

[Z] There's a bug in the query: the '?o rdf:type ?type' pattern shouldn't be
there, so I can't follow your suggestion, but it is still a useful tip for me.

The (sequence) is flowing results one-by-one into the next step.
Depending on the storage, it may be better to switch that rewrite off
and use the hash-join built in - or do your own (parallel hash join maybe?)
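
To make the difference concrete, here is a minimal hash-join sketch in plain
Java. It is not Jena's built-in implementation: the class name is
hypothetical, solutions are modelled as simple variable-to-value maps, and it
assumes every row binds the join variable and that the join variable is the
only variable the two sides share. The point is that each side is fetched
from the store once, instead of issuing one store request per incoming
solution.

```java
import java.util.*;

public class HashJoinSketch {
    /** Join two lists of solutions on a single shared variable. */
    static List<Map<String, String>> hashJoin(List<Map<String, String>> left,
                                              List<Map<String, String>> right,
                                              String joinVar) {
        // Build phase: index the left side by the join variable's value.
        Map<String, List<Map<String, String>>> table = new HashMap<>();
        for (Map<String, String> row : left) {
            table.computeIfAbsent(row.get(joinVar), k -> new ArrayList<>()).add(row);
        }

        // Probe phase: look up each right-side row and merge the matches.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : right) {
            List<Map<String, String>> matches =
                table.getOrDefault(r.get(joinVar), Collections.emptyList());
            for (Map<String, String> l : matches) {
                Map<String, String> merged = new HashMap<>(l);
                merged.putAll(r);
                out.add(merged);
            }
        }
        return out;
    }
}
```

Whether this beats the one-by-one flow depends on the store: it trades memory
for fewer round trips, which tends to pay off when each request carries a
fixed overhead.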

Do you implement solving BGPs in your store and not relying on the
iterative solver that is used by default?

[Z] Yes. What other execution strategies does Jena provide (besides the
default one)? Also, are there any existing samples?

> Is there a better way to rewrite the query below to achieve the same
> result more efficiently (and with better performance)?

If you could give some details of the store it would help.  It's hard to
make many suggestions because it is all about the details.

        Andy

>
>
> Thanks,
>
> Z
>
