On Tue, Jun 12, 2012 at 6:41 AM, Andy Seaborne <[email protected]> wrote:
> On 11/06/12 22:17, Alex Hall wrote:
<< Alex's question removed >>
> Hi Alex,
>
> ARQ does not handle MINUS in any particularly clever way, in fact it handles
> it in a quite naive way. It evaluates and materialises the right-hand-side
> (RHS) and then loops on the left (LHS) to check each row for compatibility.
>
> I understood (from quoll) that Mulgara does MINUS better but also that its
> MINUS is more like FILTER NOT EXISTS. Is there any knowledge Jena can employ
> and do better?
I remember having the conversation, but I can't recall why the
conclusion ended up that way. I just looked at it more carefully, and
Mulgara's MINUS implements the same approach described for MINUS in
SPARQL 1.1. The main points are that Mulgara's MINUS does not have
the broader scope of FILTER NOT EXISTS, and that it removes results
from the minuend based on compatible results in the subtrahend (if
there are no matching variables between the minuend and the subtrahend
then nothing is removed from the minuend).
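To spell out what I mean by "compatible", here is a minimal sketch of
the SPARQL 1.1 rule (not Mulgara's actual code; solutions are
simplified to plain maps from variable name to value, and unbound
variables are ignored):

    import java.util.*;

    class MinusSketch {
      // A left row is removed when some right row shares at least one
      // variable with it and agrees on every shared variable.
      static List<Map<String, String>> minus(List<Map<String, String>> minuend,
                                             List<Map<String, String>> subtrahend) {
        List<Map<String, String>> result = new ArrayList<Map<String, String>>();
        for (Map<String, String> left : minuend) {
          boolean removed = false;
          for (Map<String, String> right : subtrahend) {
            Set<String> shared = new HashSet<String>(left.keySet());
            shared.retainAll(right.keySet());
            if (shared.isEmpty()) {
              continue;  // no variables in common: this row removes nothing
            }
            boolean compatible = true;
            for (String var : shared) {
              if (!left.get(var).equals(right.get(var))) {
                compatible = false;
                break;
              }
            }
            if (compatible) {
              removed = true;
              break;
            }
          }
          if (!removed) {
            result.add(left);
          }
        }
        return result;
      }
    }

Note that the subtrahend is evaluated on its own, so it never sees
bindings from the minuend; that is the scope difference from FILTER
NOT EXISTS I was referring to.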
Excluding some of the edge cases, the MINUS code is very similar to
the join code. In each case, as the left-hand-side is iterated over,
the right-hand-side is searched for a compatible row. Care is taken to
ensure that the right-hand-side is always efficiently searchable. (The
RHS will be wrapping the result of an index search, or a lazy operation
whose parameters are themselves the results of index searches.)
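In rough outline, and again just as an illustrative sketch rather than
the real code (it ignores the unbound-variable edge cases mentioned
above): when the shared variables are bound on both sides, the
compatibility test collapses to a probe against an index built over
the RHS.

    import java.util.*;

    class IndexedMinusSketch {
      static List<Map<String, String>> minus(Iterator<Map<String, String>> lhs,
                                             List<Map<String, String>> rhs,
                                             List<String> sharedVars) {
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        if (sharedVars.isEmpty()) {
          // No variables in common: MINUS removes nothing.
          while (lhs.hasNext()) {
            out.add(lhs.next());
          }
          return out;
        }
        // Index the RHS once, keyed on the shared variables' values.
        Set<List<String>> index = new HashSet<List<String>>();
        for (Map<String, String> row : rhs) {
          index.add(key(row, sharedVars));
        }
        // Stream the LHS; each row costs one probe rather than a scan.
        while (lhs.hasNext()) {
          Map<String, String> row = lhs.next();
          if (!index.contains(key(row, sharedVars))) {
            out.add(row);  // no compatible RHS row, so the LHS row survives
          }
        }
        return out;
      }

      static List<String> key(Map<String, String> row, List<String> vars) {
        List<String> k = new ArrayList<String>();
        for (String v : vars) {
          k.add(row.get(v));
        }
        return k;
      }
    }

The join code has the same shape; the difference is that a hit
produces a merged row rather than discarding the LHS row.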
I'm interested in looking at the ARQ side of this. Yes, I can find it
myself, but it would save time if you could point me at the rough area
of the Jena codebase to look in, please.
> FILTER NOT EXISTS can be faster where equivalent (which I think it is
> here); could you try that?
>
>
> ?p rdfs:domain ?d .
> ?s ?p ?o
> FILTER NOT EXISTS { ?s a ?d }
>
>
> but the
>
>
> ?p rdfs:domain ?d .
> ?s ?p ?o
>
> is going to be a bit painful given ARQ's current streaming evaluation (other
> strategies might be better in this case).
Do you have pointers on how the streaming evaluation works?
> {?p rdfs:domain ?d . ?s ?p ?o}: 62,000
>
> At 500K triples, reading into memory [a separate model or dataset] and
> executing the query might be better.
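If I'm reading the suggestion right, that would be something along
these lines (just a sketch: the file name is invented, and I'm
assuming the usual ARQ query API and that the ARQ release in use
parses FILTER NOT EXISTS):

    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.rdf.model.*;

    public class InMemoryCheck {
      public static void main(String[] args) {
        // Load the data into a plain in-memory model (file name invented).
        Model model = ModelFactory.createDefaultModel();
        model.read("file:data.nt", "N-TRIPLE");

        String q =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
            "SELECT * WHERE {\n" +
            "  ?p rdfs:domain ?d .\n" +
            "  ?s ?p ?o\n" +
            "  FILTER NOT EXISTS { ?s a ?d }\n" +
            "}";

        QueryExecution qexec =
            QueryExecutionFactory.create(QueryFactory.create(q), model);
        try {
          ResultSetFormatter.out(qexec.execSelect());
        } finally {
          qexec.close();
        }
      }
    }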
Unfortunately, transactionality is important here. Do the in-memory
models handle that?
Regards,
Paul Gearon