On Tue, Jun 12, 2012 at 2:54 PM, Stephen Allen <[email protected]> wrote:
> On Tue, Jun 12, 2012 at 11:21 AM, Paul Gearon <[email protected]> wrote: > > On Tue, Jun 12, 2012 at 6:41 AM, Andy Seaborne <[email protected]> wrote: > >> On 11/06/12 22:17, Alex Hall wrote: > > << Alex's question removed >> > > > >> Hi Alex, > >> > >> ARQ does not handle MINUS in any particularly clever way, in fact it > handles > >> it in a quite naive way. It evaluates and materialises the > right-hand-side > >> (RHS) and then loops on the left (LHS) to check each row for > compatibility. > >> > >> I understood (from quoll) that Mulgara does MINUS better but also that > is > >> MINUS more like FILTER NOT EXISTS. Is there any knowledge Jena can > employ > >> and do better? > > > > I remember having the conversation, but I can't recall why the > > conclusion ended up that way. I just looked at it more carefully, and > > Mulgara's MINUS implements the same approach described for MINUS in > > SPARQL 1.1. The main points are the the Mulgara-MINUS does not have > > the broader scope of FILTER NOT EXISTS, and it removes results from > > the minuend based on compatible results in the subtrahend (if there > > are no matching variables between the minuend and subtrahend then > > nothing is removed from the minuend). > > > > Excluding some of the edge cases, the MINUS code is very similar to > > the join code. In each case, as the left-hand-side is iterated over, > > the right-hand-side is searched for a compatible row. Care is taken to > > ensure that the right-hand-side is always efficiently searchable. (The > > RHS will be wrapping a result of an index search, or a lazy operation > > whose parameters are results of index searches). > > ARQ currently materializes the RHS into an ArrayList which it then > uses for a linear search on each LHS iteration. An easyish > performance improvement would be to replace that with a HashSet > containing bindings made up of the shared variables. > > I think this would give the best bang for the buck over the current situation. The difficult part there is going to be finding the common variables. Mulgara does this by having each Tuple (roughly equivalent to Jena's QueryIterator) provide a list of variables that it contains. I see that's QueryIterator doesn't provide this, but it looks like it could be found at the point where the iterator is constructed in OpMinus by calling OpVars on the LHS and RHS. What's the difference between OpVars.allVars and OpVars.patternVars? Yet another possible optimization here is to avoid materializing the RHS until the first result is requested, since it's possible that the RHS is empty, or that the minus doesn't need to be evaluated at all due to an empty result in some higher-level join. A more advanced change would be do what you've done and not > materialize the RHS when it has a fast index search, taking scope > differences and shared variables into account. Although as you > mention this becomes essentially the same query optimization decision > as hash vs nested loop join at this point. > Depending on the relative size of the LHS and RHS, it might make more sense to materialize a small RHS than to hit a disk-based index thousands of times. At any rate, that's still more information that would need to be accessible through the QueryIterator interface to the optimizer. To this end, Mulgara wound up with an annotation system on its Tuples interface so that adding a new type of optimization didn't force code changes in every iterator implementation. Thanks, Alex > > > I'm interested in looking at this. Yes, I can find it myself, but it > > would save time if you can provide a rough area in the Jena codebase > > to look for it please. > > See "com.hp.hpl.jena.sparql.engine.main.OpExecutor" and > "com.hp.hpl.jena.sparql.engine.iterator.QueryIterMinus". > > > >> FILTER NOT EXISTS can be faster where equivalent (which it is here I > think) > >> could you try that? > >> > >> > >> ?p rdfs:domain ?d . > >> ?s ?p ?o > >> FILTER NOT EXISTS { ?s a ?d } > >> > >> > >> but the > >> > >> > >> ?p rdfs:domain ?d . > >> ?s ?p ?o > >> > >> is going to be a bit painful given ARQs current streaming evaluation > (other > >> strategies might be better in this case). > > > > Do you have pointers on how the streaming evaluations works? > > > >> {?p rdfs:domain ?d . ?s ?p ?o}: 62,000 > >> > >> At 500K triples, reading into memory [a separate model or dataset] and > >> executing the query might be better. > > > > Unfortunately, transactionality is important here. Do the in-memory > > models handle that? > > > > Regards, > > Paul Gearon >
