On 23/02/2020 23:11, Steve Vestal wrote:
If I comment out the FILTER clause that prevents variable aliasing, the
query is processed almost immediately. The number of rows goes from 192
to 576, but it's fast.
Interesting. That does suggest it might actually be SPARQL rather than
inference that's the bottleneck. The materialization experiment will be
a test of that.
Though looking at your query I wonder if you need inference at all - we
can't see your data to be sure since the list doesn't allow attachments.
Have you tried without any inference? Do you know what inference you are
relying on?
What is the proper way to write a query when you
want a particular set of variables to have distinct solution values?
Not sure there is a better way in general. However, I wonder if you can
partition your query into subgroups, filter within the groups, then do a
simpler join on the results; that might reduce the combinatorics.
However, I don't understand your query nor the modelling (especially
around simplexConnect, which looks odd) so might be wrong about that.
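As a rough illustration of the shape I mean (the variable and property
names here are made up, since we can't see your data):

    // Hypothetical restructuring: enforce distinctness inside two
    // smaller groups, then join the groups and add the remaining test.
    String partitioned = String.join("\n",
        "PREFIX : <http://example/>",
        "SELECT ?a ?b ?c WHERE {",
        "  { SELECT ?a ?b WHERE { ?a :p ?x . ?b :p ?x . FILTER(?a != ?b) } }",
        "  { SELECT ?b ?c WHERE { ?b :q ?y . ?c :q ?y . FILTER(?b != ?c) } }",
        "  FILTER(?a != ?c)",
        "}");

Each subgroup then only enforces distinctness among its own variables,
so the pair-wise not-equals tests run over smaller intermediate results
before the final join.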
I speculated that when I iterated over the statements in the OntModel,
and the number went from a model size() of ~1500 to ~4700 iterated
statements, I was materializing the entire inference closure (which
was fast). Is there some other set of calls needed to do that?
The Jena inference engines support a mix of forward and backward
inference rules. The forward inference rules will run once and store all
the results. That's the growth you are probably seeing. That's then
efficient to query.
The backward rules are run on demand. They generally (this is
controllable) cache the results of the particular triple patterns that
are requested. Because they only cache against the specific patterns
("goals") they see, then depending on what order the goals come in you
can get cases where there's redundancy in those caches. Those caches
aren't particularly well indexed either. You can certainly query one way
and fill up one set of caches, but then a different query asks for
different patterns and more rules still need to fire.
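For completeness: if you build your own rule reasoner, the tabling
behaviour can be tuned; a rough sketch, assuming a hypothetical rule
file my.rules:

    import java.util.List;
    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
    import org.apache.jena.reasoner.rulesys.Rule;

    List<Rule> rules = Rule.rulesFromURL("file:my.rules"); // hypothetical rule file
    GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);
    reasoner.setMode(GenericRuleReasoner.HYBRID); // mix of forward and backward rules
    reasoner.tableAll();                          // cache ("table") all backward goals
    Model baseModel = ModelFactory.createDefaultModel(); // stand-in for your data
    InfModel inf = ModelFactory.createInfModel(reasoner, baseModel);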
*If* multiple overlapping caches in the backward rules are the issue
*then* materializing everything and not using inference after that can
help. It's a balance of whether you are going to query for most of the
data or just do a bunch of point probes. In the former case it's better
to work everything out once; in the latter case it's better to use
on-demand rules.
Your query pattern looks like it's going to touch everything.
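Materializing is just a copy into a plain model; a minimal sketch,
reusing the ontologyModel variable from your code below:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    // Copying pulls every (asserted and inferred) statement out of the
    // inference model once; later queries hit a plain in-memory model.
    Model snapshot = ModelFactory.createDefaultModel();
    snapshot.add(ontologyModel);
    // ... then run the SELECT against snapshot instead of ontologyModel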
Are there circumstances where it is faster to materialize the entire
closure and query a plain model than to query the inference model itself?
Yes, see earlier message, and above.
Dave
On 2/23/2020 3:33 PM, Dave Reynolds wrote:
The issue is not the performance of SPARQL but the performance of the
inference engines.
If you need some OWL inference then your best bet is OWLMicro.
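For example, a sketch (baseModel stands in for your data):

    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.ontology.OntModelSpec;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    Model baseModel = ModelFactory.createDefaultModel(); // stand-in for your data
    OntModel m = ModelFactory.createOntologyModel(
            OntModelSpec.OWL_MEM_MICRO_RULE_INF, baseModel); // OWLMicro rules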
If that's too slow to query directly then one option to try is to
materialize the entire inference closure and then query that. You can
do that by simply copying the inference model to a plain model.
If that's too slow then you'll need a higher performance third party
reasoner.
Dave
On 23/02/2020 18:57, Steve Vestal wrote:
I'm looking for suggestions on a SPARQL performance issue. My test
model has ~800 sentences, and processing of one select query takes about
25 minutes. The query is a basic graph pattern with 9 variables and 20
triples, plus a filter that forces distinct variables to have distinct
solutions using pair-wise not-equals constraints. No OPTIONAL clause or
anything else fancy.
I am issuing the query against an inference model. Most of the asserted
sentences are in imported models. If I iterate over all the statements
in the OntModel, I get ~1500 almost instantly. I experimented with
several of the reasoners.
Below is the basic control flow. The thing I found curious is that the
execSelect() method finishes almost instantly. It is the iteration over
the ResultSet that is taking all the time, apparently in the call to
selectResult.hasNext(). The result has 192 rows and 9 columns. The
results are provided in bursts of 8 rows each, with ~1 minute between
bursts.
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.RDFNode;

    OntModel ontologyModel = getMyOntModel(); // Tried various reasoners
    String selectQuery = getMySelectQuery();
    QueryExecution selectExec =
            QueryExecutionFactory.create(selectQuery, ontologyModel);
    ResultSet selectResult = selectExec.execSelect();
    while (selectResult.hasNext()) { // Time seems to be spent in hasNext
        QuerySolution selectSolution = selectResult.next();
        for (String var : getMyVariablesOfInterest()) {
            RDFNode varValue = selectSolution.get(var);
            // process varValue
        }
    }
Any suggestions would be appreciated.