On 11/06/12 22:17, Alex Hall wrote:
I started a Fuseki server (using the latest 0.2.3-SNAPSHOT release) with a
TDB database using a default configuration, and loaded a file with ~500K
triples into a graph called <data:input>. Now, I'm trying to do some
validation on that data, specifically find resources that use a property
but are not explicitly declared as members of that property's domain:

SELECT (count(*) as ?c) WHERE {
  GRAPH <data:input> {
    ?p rdfs:domain ?d . ?s ?p ?o
    MINUS { ?s a ?d } } }

(I know that if we're using rdfs:domain then any subjects using that
property can be inferred to be members of that property's domain, but
that's beside the point).

This query doesn't return in any reasonable amount of time (I let it run
for about half an hour). So, my next step was to eliminate the join in this
query using a temporary graph:

INSERT { GRAPH <data:output> { ?s <temp:typeByDomain> ?d } } WHERE {
  GRAPH <data:input> {
    ?p rdfs:domain ?d . ?s ?p ?o } }

SELECT (count(*) as ?c) WHERE {
  GRAPH <data:output> { ?s <temp:typeByDomain> ?d }
  MINUS { GRAPH <data:input> { ?s a ?d } } }

This query takes about 15 minutes to execute on my machine -- still longer
than I'd like, but at least it's progress.

Next I attempted to eliminate the effects of materializing the entire
result set by converting this to an ASK query:

ASK WHERE {
  GRAPH <data:output> { ?s <temp:typeByDomain> ?d }
  MINUS { GRAPH <data:input> { ?s a ?d } } }

This query takes about 5 minutes to complete, which is certainly better
than not completing at all but still slower than I would like. Is there any
way to tune or optimize TDB to better handle this query? As I mentioned, I
am using the default TDB configuration (just specifying --loc with an empty
directory to the fuseki-server script and accepting whatever it gives me).
 From what I can tell in the online help, most of the performance tuning
relates to the ordering of triple patterns within a join. Are there any
other suggestions to try?

FWIW, here are the approximate cardinalities of the various query patterns
in my dataset:
?s ?p ?o: 532,000
?p rdfs:domain ?d: 200
{?p rdfs:domain ?d . ?s ?p ?o}: 62,000
{?s rdf:type ?d}: 37,000
{?p rdfs:domain ?d . ?s ?p ?o} MINUS { ?s rdf:type ?d }: 39,000

Thanks,
Alex


Hi Alex,

ARQ does not handle MINUS in any particularly clever way; in fact, it handles it quite naively. It evaluates and materialises the right-hand side (RHS) and then loops over the left-hand side (LHS), checking each row for compatibility.

I understood (from quoll) that Mulgara does MINUS better, but also that its MINUS is more like FILTER NOT EXISTS. Is there any knowledge Jena can employ to do better?

FILTER NOT EXISTS can be faster where the two are equivalent (which I think they are here). Could you try that?

   ?p rdfs:domain ?d .
   ?s ?p ?o
   FILTER NOT EXISTS { ?s a ?d }


but the

   ?p rdfs:domain ?d .
   ?s ?p ?o

is going to be a bit painful given ARQ's current streaming evaluation (other strategies might be better in this case).
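
For reference, a minimal sketch of the original counting query with FILTER NOT EXISTS substituted for MINUS. It assumes the standard rdfs: prefix declaration (the original post omits prefixes) and reuses the <data:input> graph name from the question. It has not been timed against this dataset, so treat it as something to try rather than a known fix.

```sparql
# Assumed prefix; the queries in the original post leave it undeclared.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT (COUNT(*) AS ?c) WHERE {
  GRAPH <data:input> {
    # Every use of a property that has a declared domain ...
    ?p rdfs:domain ?d .
    ?s ?p ?o
    # ... where the subject is not explicitly typed with that domain.
    # The inner pattern is matched against the same graph, <data:input>.
    FILTER NOT EXISTS { ?s a ?d }
  }
}
```

The same substitution should carry over to the two-graph variants: keep the GRAPH <data:output> pattern as it is and use FILTER NOT EXISTS { GRAPH <data:input> { ?s a ?d } } in place of the MINUS block.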

        Andy

{?p rdfs:domain ?d . ?s ?p ?o}: 62,000

At 500K triples, reading the data into memory (as a separate model or dataset) and executing the query there might be better.

        Andy
