I started a Fuseki server (using the latest 0.2.3-SNAPSHOT release) with a
TDB database using a default configuration, and loaded a file with ~500K
triples into a graph called <data:input>. Now, I'm trying to do some
validation on that data, specifically find resources that use a property
but are not explicitly declared as members of that property's domain:

SELECT (count(*) as ?c) WHERE {
 GRAPH <data:input> {
  ?p rdfs:domain ?d . ?s ?p ?o
  MINUS { ?s a ?d } } }

(I know that if we're using rdfs:domain then any subjects using that
property can be inferred to be members of that property's domain, but
that's beside the point).
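
To pin down the result I expect, here is a minimal Python sketch of the same check over a toy set of triples. The `ex:` names, the toy data, and the function `undeclared_rows` are all made up for illustration (none of it comes from my dataset), and it says nothing about how TDB actually evaluates the query. It just spells out the join followed by the MINUS as a set lookup:

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def undeclared_rows(triples):
    """Return (s, p, o, d) rows where s uses property p with domain d
    but there is no explicit (s, rdf:type, d) triple."""
    triples = set(triples)

    # ?p rdfs:domain ?d  ->  map each property to its declared domains
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)

    # ?s a ?d  ->  set of explicitly declared (subject, class) pairs
    declared = {(s, o) for s, p, o in triples if p == RDF_TYPE}

    # Join ?s ?p ?o with the domains, then drop rows whose (s, d)
    # pair is explicitly declared (the MINUS part of the query).
    return [
        (s, p, o, d)
        for s, p, o in triples
        for d in domains.get(p, ())
        if (s, d) not in declared
    ]


# Hypothetical toy data: ex:alice is declared as ex:Person, ex:bob is not.
toy = [
    ("ex:name", "rdfs:domain", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
]

rows = undeclared_rows(toy)
print(len(rows))  # the analogue of count(*) in the query above
```

On the toy data only the ex:bob row survives, since ex:alice is explicitly typed as ex:Person.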

This query doesn't return in any reasonable amount of time (I let it run
for about half an hour). So, my next step was to eliminate the join in this
query using a temporary graph:

INSERT { GRAPH <data:output> { ?s <temp:typeByDomain> ?d } } WHERE {
 GRAPH <data:input> {
  ?p rdfs:domain ?d . ?s ?p ?o } }

SELECT (count(*) as ?c) WHERE {
  GRAPH <data:output> { ?s <temp:typeByDomain> ?d }
  MINUS { GRAPH <data:input> { ?s a ?d } } }

The SELECT takes about 15 minutes to execute on my machine -- still longer
than I'd like, but at least it's progress.

Next I attempted to eliminate the effects of materializing the entire
result set by converting this to an ASK query:

ASK WHERE {
  GRAPH <data:output> { ?s <temp:typeByDomain> ?d }
  MINUS { GRAPH <data:input> { ?s a ?d } } }

This query takes about 5 minutes to complete, which is certainly better
than not completing at all but still slower than I would like. Is there any
way to tune or optimize TDB to better handle this query? As I mentioned, I
am using the default TDB configuration (just specifying --loc with an empty
directory to the fuseki-server script and accepting whatever it gives me).
From what I can tell in the online help, most of the performance tuning
relates to the ordering of triple patterns within a join. Are there any
other suggestions to try?

FWIW, here are the approximate cardinalities of the various query patterns
in my dataset:
?s ?p ?o: 532,000
?p rdfs:domain ?d: 200
{?p rdfs:domain ?d . ?s ?p ?o}: 62,000
{?s rdf:type ?d}: 37,000
{?p rdfs:domain ?d . ?s ?p ?o} MINUS { ?s rdf:type ?d }: 39,000

Thanks,
Alex
