With some advice from Dave, I made a copy of the OntModel that hopefully
materialized the full entailment closure:
// Copy every statement (asserted and inferred) out of the inference
// model into a plain in-memory model:
Model entailedModel = ModelFactory.createDefaultModel();
entailedModel.add(ontologyModel);
In less than one second, the results were:
Statements in ontology model: 1146
Entailed model org.apache.jena.rdf.model.impl.ModelCom size 4453
I ran the select query on this entailed model. It still takes about 25
minutes.
I see there is a chapter on Query Efficiency and Debugging in DuCharme's
book. Now seems like a good time for me to read that chapter.
Thanks for all the help.
On 2/24/2020 3:02 AM, Dave Reynolds wrote:
> On 23/02/2020 23:11, Steve Vestal wrote:
>> If I comment out the FILTER clause that prevents variable aliasing, the
>> query is processed almost immediately. The number of rows goes from 192
>> to 576, but it's fast.
>
> Interesting. That does suggest it might actually be SPARQL rather than
> inference that's the bottleneck. The materialization experiment will
> be a test of that.
>
> Though looking at your query I wonder if you need inference at all -
> we can't see your data to be sure since the list doesn't allow
> attachments.
> Have you tried without any inference? Do you know what inference you
> are relying on?
>
>> What is the proper way to write a query when you
>> want a particular set of variables to have distinct solution values?
>
> Not sure there is a better way in general. However, I wonder if you
> can partition your query into subgroups, filter within the groups,
> then do a simpler join on the results, along the lines of the sketch
> below. That might reduce the combinatorics.
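>
> For example, a rough sketch of the shape (the prefix, predicates, and
> variable names here are invented, since we can't see your query):
>
>     import org.apache.jena.query.*;
>     import org.apache.jena.rdf.model.*;
>
>     Model model = ModelFactory.createDefaultModel(); // your data here
>     String partitioned =
>         "PREFIX ex: <http://example.org/#>\n" +
>         "SELECT ?x ?y ?z WHERE {\n" +
>         "  { SELECT ?x ?y WHERE { ?x ex:p ?y . FILTER(?x != ?y) } }\n" +
>         "  { SELECT ?y ?z WHERE { ?y ex:q ?z . FILTER(?y != ?z) } }\n" +
>         "}";
>     try (QueryExecution qe = QueryExecutionFactory.create(partitioned, model)) {
>         ResultSet rs = qe.execSelect();
>         // each sub-select applies its own filters first; the outer
>         // pattern then joins the two smaller result sets on ?y
>     }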
>
> However, I don't understand your query or the modelling (especially
> around simplexConnect, which looks odd), so I might be wrong about
> that.
>
>> I speculated that when I iterated over the statements in the OntModel,
>> and the number went from a model size() of ~1500 to ~4700 iterated
>> statements, I was materializing the entire inference closure (which
>> was fast). Is there some other set of calls needed to do that?
>
> The Jena inference engines support a mix of forward and backward
> inference rules. The forward inference rules will run once and store
> all the results. That's the growth you are probably seeing. That's
> then efficient to query.
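>
> If you want to force that up-front work explicitly you can call
> prepare() on the InfModel, e.g. (a sketch, where baseModel stands in
> for your data):
>
>     import org.apache.jena.rdf.model.*;
>     import org.apache.jena.reasoner.ReasonerRegistry;
>
>     InfModel inf = ModelFactory.createInfModel(
>             ReasonerRegistry.getOWLMicroReasoner(), baseModel);
>     inf.prepare(); // runs the forward rules once and caches the results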
>
> The backward rules are run on demand. They generally (this is
> controllable) cache the results of the particular triple patterns that
> are requested. Because they only cache against the specific patterns
> ("goals") they see, then depending on what order the goals come in,
> you can get cases where there's redundancy in those caches. Those
> caches aren't particularly well indexed either. You can certainly
> query one way and fill up one set of caches, but then a different
> query asks for different patterns and more rules still need to fire.
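>
> (As an aside, if those tabled caches grow across queries, calling
> reset() on the InfModel will wipe them, at the cost of recomputing
> results later.)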
>
> *If* multiple overlapping caches in the backward rules is the issue
> *then* materializing everything and not using inference after that
> can help. It's a balance of whether you are going to query for most of
> the data or just do a bunch of point probes. In the former case it's
> better to work everything out once. In the latter case better to use
> on demand rules.
>
> Your query pattern looks like it's going to touch everything.
>
>> Are there circumstances where it is faster to materialize the entire
>> closure and query a plain model than to query the inference model
>> itself?
>
> Yes, see earlier message, and above.
>
> Dave
>
>> On 2/23/2020 3:33 PM, Dave Reynolds wrote:
>>> The issue is not the performance of SPARQL but the performance of
>>> the inference engines.
>>>
>>> If you need some OWL inference then your best bet is OWLMicro.
>>>
>>> If that's too slow to query directly then one option to try is to
>>> materialize the entire inference closure and then query that. You
>>> can do that by simply copying the inference model to a plain model,
>>> as sketched below.
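>>>
>>> A minimal sketch of that (baseModel stands in for your asserted
>>> data):
>>>
>>>     import org.apache.jena.ontology.*;
>>>     import org.apache.jena.rdf.model.*;
>>>
>>>     OntModel inf = ModelFactory.createOntologyModel(
>>>             OntModelSpec.OWL_MEM_MICRO_RULE_INF, baseModel);
>>>     Model plain = ModelFactory.createDefaultModel();
>>>     plain.add(inf); // copies asserted + inferred statements
>>>     // query 'plain' from here on; no reasoner in the loop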
>>>
>>> If that's too slow then you'll need a higher performance third party
>>> reasoner.
>>>
>>> Dave
>>>
>>> On 23/02/2020 18:57, Steve Vestal wrote:
>>>> I'm looking for suggestions on a SPARQL performance issue. My test
>>>> model has ~800 sentences, and processing of one select query takes
>>>> about 25 minutes. The query is a basic graph pattern with 9
>>>> variables and 20 triples, plus a filter that forces distinct
>>>> variables to have distinct solutions using pair-wise not-equals
>>>> constraints. No OPTIONAL clause or anything else fancy.
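>>>> (That is, one constraint per variable pair, e.g. FILTER(?v1 != ?v2),
>>>> FILTER(?v1 != ?v3), and so on - up to 36 pairs if all 9 variables
>>>> must be mutually distinct.)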
>>>>
>>>> I am issuing the query against an inference model. Most of the
>>>> asserted sentences are in imported models. If I iterate over all
>>>> the statements in the OntModel, I get ~1500 almost instantly. I
>>>> experimented with several of the reasoners.
>>>>
>>>> Below is the basic control flow. The thing I found curious is that
>>>> the execSelect() method finishes almost instantly. It is the
>>>> iteration over the ResultSet that is taking all the time, apparently
>>>> in the call to selectResult.hasNext(). The result has 192 rows and 9
>>>> columns. The results are provided in bursts of 8 rows each, with ~1
>>>> minute between bursts.
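>>>> (I assume the result set is evaluated lazily, so the actual pattern
>>>> matching and inference happen only as the iterator is pulled?)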
>>>>
>>>>     OntModel ontologyModel = getMyOntModel(); // Tried various reasoners
>>>>     String selectQuery = getMySelectQuery();
>>>>     QueryExecution selectExec =
>>>>         QueryExecutionFactory.create(selectQuery, ontologyModel);
>>>>     ResultSet selectResult = selectExec.execSelect();
>>>>     while (selectResult.hasNext()) { // Time seems to be spent in hasNext
>>>>         QuerySolution selectSolution = selectResult.next();
>>>>         for (String var : getMyVariablesOfInterest()) {
>>>>             RDFNode varValue = selectSolution.get(var);
>>>>             // process varValue
>>>>         }
>>>>     }
>>>>
>>>> Any suggestions would be appreciated.
>>>>
>>