Thanks Andy! Answers inline ...
________________________________
From: Andy Seaborne <[email protected]>
Sent: Friday, June 10, 2016 11:06 AM
To: [email protected]
Subject: Re: not performant query

On 09/06/16 16:28, Zen 98052 wrote:
> Hi,
>
> I have a Sparql query below, which doesn't seem efficient.
>
> I noticed when running it, Jena calls execute(OpBGP opBGP,
> QueryIterator ...) so many times.

The default execution strategy - i.e. for in-memory use - is just that - a default.

If your storage layer has different characteristics, e.g. there is a certain amount of overhead to go and get data, then the default execution strategy may be the wrong one. That's the job of the optimizer and of OpExecutor.

What does your storage layer look like?

[Z] We use Accumulo as the storage, based on https://wiki.apache.org/incubator/RyaProposal. Basically, there are three tables, SPO, POS, and OSP, and based on the BGP, the lookup goes to one of them. The serialized triple, e.g. SPO (delimited by a null char), is stored as the key, so we can just set the ranges to efficiently get all 'rows' that match the filter. Therefore, each time Jena calls my callback for a BGP (the execute function with the OpBGP arg), it submits a request to the store and iterates over all the rows.

> I have my own implementation in that function (overrides base class
> OpExecutor), which makes a call to our back-end storage.
>
> From qparse output (attached below), it looks like the culprit is
> because the query has BGPs inside the FILTER, which explains the
> behavior I am seeing.

Possibly - there are several points where costs may arise.

> ?o rdf:type ?type.
> FILTER NOT EXISTS
> {
>   { ?o rdf:type v:Dynamic }
>   UNION
>   { ?o rdf:type v:Static }
> }

FILTER NOT EXISTS {} can usually be written as MINUS or, in this case, as an expression FILTER on ?type, as you have already fetched the rdf:type.

FILTER ( ?type != v:Dynamic && ?type != v:Static )

[Z] There's a bug in the query: the '?o rdf:type ?type' pattern shouldn't be there, so I can't follow your suggestion, but it is still a useful tip for me.

The (sequence) is flowing results one-by-one into the next step. Depending on the storage, it may be better to switch that rewrite off and use the hash-join built in - or do your own (parallel hash join maybe?)

Do you implement solving BGPs in your store rather than relying on the iterative solver that is used by default?

[Z] Yes. What other execution strategies does Jena provide (besides the default one)? Also, are there any existing samples?

> Is there a better way to re-write the query below to achieve same
> result, but more efficient (and lead to better performance)?

If you could give some details of the store it would help. It's hard to make many suggestions because it is all about the details.

	Andy

>
>
> Thanks,
>
> Z
>
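
To make the storage description above more concrete, here is a minimal sketch of how one triple pattern could be mapped to a table and a key range. It is only an illustration under stated assumptions: the class TripleIndexSketch and its methods chooseTable and match are hypothetical names, and a sorted in-memory set stands in for each Accumulo table. It does not use the Rya, Accumulo or Jena APIs. A null argument means the position is a variable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the SPO, POS and OSP tables.
// None of these names come from Rya, Accumulo or Jena.
final class TripleIndexSketch {
    static final char SEP = '\u0000'; // null char between the serialized terms

    // One sorted set per table plays the role of sorted row keys.
    private final TreeSet<String> spo = new TreeSet<>();
    private final TreeSet<String> pos = new TreeSet<>();
    private final TreeSet<String> osp = new TreeSet<>();

    void add(String s, String p, String o) {
        spo.add(s + SEP + p + SEP + o);
        pos.add(p + SEP + o + SEP + s);
        osp.add(o + SEP + s + SEP + p);
    }

    // Pick the table whose key order starts with the bound terms.
    static String chooseTable(String s, String p, String o) {
        if (s != null) {
            // s+o bound with p free is a prefix of OSP (o, s); otherwise SPO.
            return (p == null && o != null) ? "OSP" : "SPO";
        }
        if (p != null) return "POS"; // p, or p+o
        if (o != null) return "OSP"; // o only
        return "SPO";                // nothing bound: full scan
    }

    // Returns matching triples as {s, p, o} arrays.
    List<String[]> match(String s, String p, String o) {
        String table = chooseTable(s, p, o);
        TreeSet<String> rows;
        String[] terms; // pattern terms in the chosen table's key order
        switch (table) {
            case "POS": rows = pos; terms = new String[] {p, o, s}; break;
            case "OSP": rows = osp; terms = new String[] {o, s, p}; break;
            default:    rows = spo; terms = new String[] {s, p, o};
        }
        // Build the key prefix from the leading bound terms.
        StringBuilder prefix = new StringBuilder();
        int bound = 0;
        while (bound < 3 && terms[bound] != null) {
            prefix.append(terms[bound]);
            bound++;
            if (bound < 3) prefix.append(SEP); // so "a" does not match "ab"
        }
        String start = prefix.toString();
        List<String[]> out = new ArrayList<>();
        // Range scan: rows sharing the prefix are contiguous in sorted order.
        for (String row : rows.tailSet(start, true)) {
            if (!row.startsWith(start)) break; // left the range
            String[] k = row.split(String.valueOf(SEP), -1);
            switch (table) {
                case "POS": out.add(new String[] {k[2], k[0], k[1]}); break;
                case "OSP": out.add(new String[] {k[1], k[2], k[0]}); break;
                default:    out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TripleIndexSketch idx = new TripleIndexSketch();
        idx.add("ex:a", "rdf:type", "v:Dynamic");
        idx.add("ex:b", "rdf:type", "v:Static");
        for (String[] t : idx.match(null, "rdf:type", "v:Dynamic")) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

In this sketch every call to match is one range scan, which mirrors the cost being discussed: if the engine feeds bindings into a BGP one at a time, each binding turns into a separate request against the store.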
