On Tue, Jun 12, 2012 at 11:21 AM, Paul Gearon <[email protected]> wrote:
> On Tue, Jun 12, 2012 at 6:41 AM, Andy Seaborne <[email protected]> wrote:
>> On 11/06/12 22:17, Alex Hall wrote:
> << Alex's question removed >>
>
>> Hi Alex,
>>
>> ARQ does not handle MINUS in any particularly clever way, in fact it handles
>> it in a quite naive way. It evaluates and materialises the right-hand-side
>> (RHS) and then loops on the left (LHS) to check each row for compatibility.
>>
>> I understood (from quoll) that Mulgara does MINUS better but also that is
>> MINUS more like FILTER NOT EXISTS. Is there any knowledge Jena can employ
>> and do better?
>
> I remember having the conversation, but I can't recall why the
> conclusion ended up that way. I just looked at it more carefully, and
> Mulgara's MINUS implements the same approach described for MINUS in
> SPARQL 1.1. The main points are the the Mulgara-MINUS does not have
> the broader scope of FILTER NOT EXISTS, and it removes results from
> the minuend based on compatible results in the subtrahend (if there
> are no matching variables between the minuend and subtrahend then
> nothing is removed from the minuend).
>
> Excluding some of the edge cases, the MINUS code is very similar to
> the join code. In each case, as the left-hand-side is iterated over,
> the right-hand-side is searched for a compatible row. Care is taken to
> ensure that the right-hand-side is always efficiently searchable. (The
> RHS will be wrapping a result of an index search, or a lazy operation
> whose parameters are results of index searches).
ARQ currently materializes the RHS into an ArrayList which it then
uses for a linear search on each LHS iteration. An easyish
performance improvement would be to replace that with a HashSet
containing bindings made up of the shared variables.
A more advanced change would be do what you've done and not
materialize the RHS when it has a fast index search, taking scope
differences and shared variables into account. Although as you
mention this becomes essentially the same query optimization decision
as hash vs nested loop join at this point.
> I'm interested in looking at this. Yes, I can find it myself, but it
> would save time if you can provide a rough area in the Jena codebase
> to look for it please.
See "com.hp.hpl.jena.sparql.engine.main.OpExecutor" and
"com.hp.hpl.jena.sparql.engine.iterator.QueryIterMinus".
>> FILTER NOT EXISTS can be faster where equivalent (which it is here I think)
>> could you try that?
>>
>>
>> ?p rdfs:domain ?d .
>> ?s ?p ?o
>> FILTER NOT EXISTS { ?s a ?d }
>>
>>
>> but the
>>
>>
>> ?p rdfs:domain ?d .
>> ?s ?p ?o
>>
>> is going to be a bit painful given ARQs current streaming evaluation (other
>> strategies might be better in this case).
>
> Do you have pointers on how the streaming evaluations works?
>
>> {?p rdfs:domain ?d . ?s ?p ?o}: 62,000
>>
>> At 500K triples, reading into memory [a separate model or dataset] and
>> executing the query might be better.
>
> Unfortunately, transactionality is important here. Do the in-memory
> models handle that?
>
> Regards,
> Paul Gearon