Responses and questions inserted...
On 2/24/2020 3:02 AM, Dave Reynolds wrote:
> On 23/02/2020 23:11, Steve Vestal wrote:
>> If I comment out the FILTER clause that prevents variable aliasing, the
>> query is processed almost immediately. The number of rows goes from 192
>> to 576, but it's fast.
>
> Interesting. That does suggest it might actually be SPARQL rather than
> inference that's the bottleneck. The materialization experiment will
> be a test of that.
I earlier iterated over statements. To make sure that I fully
materialize all possible entailments, do I need to query for ?s ?p ?o?
Any suggestions on the most efficient way to do this materialization?
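For reference, a minimal sketch of what I mean by materializing, along
the lines of Dave's earlier suggestion of copying the inference model
into a plain model (getMyOntModel() is my helper from the earlier
posting; classes are from org.apache.jena.ontology and
org.apache.jena.rdf.model):

    OntModel ontologyModel = getMyOntModel();
    Model materialized = ModelFactory.createDefaultModel();
    // Model.add(Model) iterates every statement of the argument, which
    // should force the reasoner to derive the full closure, so no
    // explicit ?s ?p ?o query would be needed.
    materialized.add(ontologyModel);
    // Later queries run against 'materialized' with no reasoner attached.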
>
> Though looking at your query I wonder if you need inference at all -
> we can't see your data to be sure since the list doesn't allow
> attachments.
> Have you tried without any inference? Do you know what inference you
> are relying on?
I tried the following.
OntModelSpec.OWL_DL_MEM_RULE_INF
OntModelSpec.OWL_MEM_RULE_INF
OntModelSpec.OWL_LITE_MEM_TRANS_INF
OntModelSpec.OWL_LITE_MEM_RULES_INF
OntModelSpec.OWL_MEM_RDFS_INF
OntModelSpec.OWL_MEM_MICRO_RULE_INF
OntModelSpec.OWL_MEM
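In each case the spec was plugged in the same way, with only the
OntModelSpec constant changing between runs; a minimal sketch, with a
hypothetical document URI:

    OntModel m = ModelFactory.createOntologyModel(
            OntModelSpec.OWL_MEM_MICRO_RULE_INF);
    m.read("file:mytestmodel.owl");  // hypothetical URI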
I do need some reasoning, minimally chasing through some shallow type
hierarchies and transitive properties/predicates/roles. Without it, the
query very quickly returns nothing. I didn't keep a record, but I
vaguely recall that most if not all of the *_INF specs above gave a
non-empty result, and all non-empty results took about the same time.
This is a test case; actual models will be large enough that I expect
to need backwards/incremental reasoning. However,...
>
>> What is the proper way to write a query when you
>> want a particular set of variables to have distinct solution values?
>
> Not sure there is a better way in general. However, I wonder if you
> can partition your query into subgroups, filter within the groups,
> then do a simpler join on the results. That might reduce the
> combinatorics.
I had earlier thought briefly about a more general pre-fetch query that
would collect a set of asserted triples guaranteed to include all
triples of possible interest into a separate, (hopefully) much smaller
model, and then running my sequence of queries-with-reasoning against
that. Has this sort of thing been done successfully? What gave me pause
is that some triples derived from query results will need to be added
back into the original model, and I'm not sure how blank nodes would
play into that. Also, pre-fetch models in practice would likely be no
smaller than this test case model.
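For concreteness, the sort of pre-fetch I had in mind; a sketch only,
with a hypothetical class and namespace, and ignoring the blank-node
round-trip problem just mentioned:

    String prefetch =
        "PREFIX ex: <http://example.org/>\n" +            // hypothetical namespace
        "CONSTRUCT { ?s ?p ?o }\n" +
        "WHERE { ?s ?p ?o . ?s a ex:ThingOfInterest }";   // hypothetical class
    // Run against the plain asserted model, not the inference model.
    Model subset = QueryExecutionFactory
            .create(prefetch, assertedModel)              // assertedModel: my base data
            .execConstruct();
    // Then attach a reasoner to just the (hopefully) smaller subset.
    OntModel small = ModelFactory.createOntologyModel(
            OntModelSpec.OWL_MEM_MICRO_RULE_INF, subset);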
In one or two earlier postings, mention was made of Pellet being more
efficient and complete in some cases. My impression is that a Pellet
reasoner is not bundled with Jena, and I would have to find and install
one myself (although the Protege wiki mentions one is available in
Jena). Is that correct? A general web search turned up a number of
sources, e.g., Openllet, Mindswap, Stardog. Does anyone have a
recommendation, and a link to a site with the master version that is
compatible with Jena 3 and has a reasonably clear and smooth install?
Are any of the other OWL reasoners out there packaged for use with
Jena?
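For what it's worth, my understanding from the old Pellet 2.x Jena
bindings (an assumption on my part, not something I have tested against
Jena 3) is that a Pellet-style reasoner plugs in as just another
OntModelSpec:

    // Assumed API, following the original Pellet 2.x Jena bindings
    // (org.mindswap.pellet.jena); the Openllet fork reportedly keeps
    // the same pattern under new package names.
    OntModel pelletModel = ModelFactory.createOntologyModel(
            PelletReasonerFactory.THE_SPEC);
    pelletModel.read("file:mytestmodel.owl");  // hypothetical URI

If that is still how it works, swapping reasoners should be cheap to try.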
>
> However, I don't understand your query nor the modelling (especially
> around simplexConnect, which looks odd) so might be wrong about that.
>
>> I speculated that when I iterated over the statements in the OntModel
>> and the number went from a model size() of ~1500 to ~4700 iterated
>> statements, I was materializing the entire inference closure (which
>> was fast). Is there some other set of calls needed to do that?
>
> The Jena inference engines support a mix of forward and backward
> inference rules. The forward inference rules will run once and store
> all the results. That's the growth you are probably seeing, and it's
> then efficient to query.
>
> The backward rules are run on-demand. They generally (this is
> controllable) cache the results of the particular triple patterns that
> are requested. Because they cache only against the specific patterns
> ("goals") they see, then depending on what order the goals come in you
> can get cases where there's redundancy in those caches. Those caches
> aren't particularly well indexed either. You can certainly query one
> way and fill up one set of caches, but then a different query asks for
> different patterns and more rules still need to fire.
>
> *If* multiple overlapping caches in the backward rules is the issue
> *then* materializing everything and not using inference after that
> can help. It's a balance of whether you are going to query for most of
> the data or just do a bunch of point probes. In the former case it's
> better to work everything out once. In the latter case better to use
> on demand rules.
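That distinction helps. To check my understanding of the hybrid mode, a
toy GenericRuleReasoner sketch (hypothetical vocabulary and rules,
nothing from my actual model):

    // One forward rule (runs once, results stored) and one backward
    // rule (runs on demand, with goal caching).
    String rules =
        "[fwd: (?a <urn:ex:feeds> ?b), (?b <urn:ex:feeds> ?c)" +
        "      -> (?a <urn:ex:reaches> ?c)]\n" +
        "[bwd: (?b <urn:ex:fedBy> ?a) <- (?a <urn:ex:feeds> ?b)]";
    GenericRuleReasoner grr = new GenericRuleReasoner(Rule.parseRules(rules));
    grr.setMode(GenericRuleReasoner.HYBRID);
    InfModel inf = ModelFactory.createInfModel(grr, assertedModel);

Is that a reasonable mental model for what the OWL rule reasoners do
internally?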
>
> Your query pattern looks like it's going to touch everything.
>
>> Are there circumstances where it is faster to materialize the entire
>> closure and query a plain model than to query the inference model
>> itself?
>
> Yes, see earlier message, and above.
>
> Dave
>
>> On 2/23/2020 3:33 PM, Dave Reynolds wrote:
>>> The issue is not performance of SPARQL but performance of the
>>> inference engines.
>>>
>>> If you need some OWL inference then your best bet is OWLMicro.
>>>
>>> If that's too slow to query directly then one option to try is to
>>> materialize the entire inference closure and then query that. You can
>>> do that by simply copying the inference model to a plain model.
>>>
>>> If that's too slow then you'll need a higher performance third party
>>> reasoner.
>>>
>>> Dave
>>>
>>> On 23/02/2020 18:57, Steve Vestal wrote:
>>>> I'm looking for suggestions on a SPARQL performance issue. My test
>>>> model has ~800 sentences, and processing of one select query takes
>>>> about
>>>> 25 minutes. The query is a basic graph pattern with 9 variables
>>>> and 20
>>>> triples, plus a filter that forces distinct variables to have distinct
>>>> solutions using pair-wise not-equals constraints. No OPTIONAL clause
>>>> or anything else fancy.
>>>>
>>>> I am issuing the query against an inference model. Most of the
>>>> asserted
>>>> sentences are in imported models. If I iterate over all the
>>>> statements
>>>> in the OntModel, I get ~1500 almost instantly. I experimented with
>>>> several of the reasoners.
>>>>
>>>> Below is the basic control flow. The thing I found curious is that
>>>> the execSelect() method finishes almost instantly. It is the
>>>> iteration over the ResultSet that is taking all the time, seemingly
>>>> in the call to selectResult.hasNext(). The result has 192 rows and 9
>>>> columns. The results are provided in bursts of 8 rows each, with ~1
>>>> minute between bursts.
>>>>
>>>>     OntModel ontologyModel = getMyOntModel();  // Tried various reasoners
>>>>     String selectQuery = getMySelectQuery();
>>>>     QueryExecution selectExec =
>>>>         QueryExecutionFactory.create(selectQuery, ontologyModel);
>>>>     ResultSet selectResult = selectExec.execSelect();
>>>>     while (selectResult.hasNext()) {  // Time seems to be spent in hasNext
>>>>         QuerySolution selectSolution = selectResult.next();
>>>>         for (String var : getMyVariablesOfInterest()) {
>>>>             RDFNode varValue = selectSolution.get(var);
>>>>             // process varValue
>>>>         }
>>>>     }
>>>>
>>>> Any suggestions would be appreciated.
>>>>
>>