Hi,
On 12/07/16 05:13, Niels Andersen wrote:
Dear Jena user community,
We are using the general purpose rule engine to be able to only specify the
rules that we need in our model. We have read and are familiar with
https://jena.apache.org/documentation/inference/
Is there a document that describes the guidelines for creating high performance
rules in Jena?
Not that I'm aware of.
Specifically we are interested in:
* What is it that makes inferencing costly? Is it the time it takes to
run the query (on a large model with millions of triples) or the time it takes
to generate new triples?
Rules are generally expensive because they can easily interact to create
a very large space of possible results; it's quite easy to get
exponential growth.
For forward rules you are materializing the whole set of results.
The forward engine works by keeping track of all the partially matched
rules, so that when a new triple is deduced you drop that triple into the
network and see which rules can now fire as a result. Thus you avoid
running the full rule antecedents as queries each time, but you have the
cost of keeping and searching a lot of state representing the partially
matched rules. So I would guess that cost dominates over the simple
generation of the result triples, and there isn't much querying, as
such, going on.
Whereas the backward rules are essentially exploring the search space,
and each step in that involves issuing new (triple pattern) queries and
keeping track of how far you are through the search space. So again
I expect the state tracking dominates over generating the final
triples, but now there are a lot of queries going on.
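To make the two modes concrete, here is the same (made-up) deduction
written first as a forward and then as a backward rule in the
GenericRuleReasoner rule syntax; the eg: prefix and the grandparent
example are purely illustrative:

```
@prefix eg: <http://example.org/>.

# Forward form: fires as soon as both parent triples are present,
# materializing the grandparent triple immediately.
[gpF: (?a eg:parent ?b) (?b eg:parent ?c) -> (?a eg:grandparent ?c)]

# Backward form: nothing is computed until a query asks for
# eg:grandparent; then the body patterns are run as subgoals.
[gpB: (?a eg:grandparent ?c) <- (?a eg:parent ?b) (?b eg:parent ?c)]
```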
o Examples of good and bad rules
I'm not sure that's answerable in the abstract, it depends on what your
rules are supposed to do.
o What type of performance should we expect?
* How to use backwards and hybrid rules (including specific examples).
We cannot seem to get the rule to trigger when querying through the Jena API.
If you want some non-trivial examples then look at the rule sets for
OWL/RDFS reasoning (in resources/etc/). For example rdfs.rules is a
brute force expression of (most of) RDFS and can be run in either
forward or backward mode, whereas rdfs-fb.rules is the same but using
hybrid rules to achieve a different performance trade-off.
[The real rules used are the rdfs-fb-tgc-*.rules, where "tgc" means that
they assume use of the transitive reasoner (transitive graph closure).]
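A hybrid rule in that style is a forward rule whose conclusion asserts a
backward rule, so the schema is processed once, eagerly, while instance
reasoning stays query-driven. A sketch along the lines of what
rdfs-fb.rules does for rdfs:domain (the rule name is illustrative):

```
# For each rdfs:domain declaration found in the data, install a
# backward rule that derives the rdf:type triple only when a query
# actually asks for it.
[rdfs2-hybrid: (?p rdfs:domain ?c) -> [(?x rdf:type ?c) <- (?x ?p ?y)]]
```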
* Is it possible to implement asynchronous rule execution so that the
reasoner does not hold up the triple store when the reasoner is triggered on
the next select statement?
For backward rules you could create an InfGraph for each query
(assuming each query arrives on a different thread), though that's not
helpful if you use any tabling.
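A rough Java sketch of that per-query approach, assuming each query
thread builds its own InfModel over the shared base data (the rule,
class name, and example URIs are all hypothetical):

```java
import java.util.List;
import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.rulesys.*;

public class PerQueryInf {
    // Hypothetical backward rule; any non-recursive rule set works
    // the same way.
    static final List<Rule> RULES = Rule.parseRules(
        "[gp: (?a <http://example.org/grandparent> ?c) <- " +
        "     (?a <http://example.org/parent> ?b) " +
        "     (?b <http://example.org/parent> ?c)]");

    // Build a fresh InfModel over the shared base data for each query
    // (one per thread), so every query owns its backward-engine state.
    static InfModel freshInfModel(Model base) {
        GenericRuleReasoner reasoner = new GenericRuleReasoner(RULES);
        reasoner.setMode(GenericRuleReasoner.BACKWARD);
        return ModelFactory.createInfModel(reasoner, base);
    }

    public static void main(String[] args) {
        Model base = ModelFactory.createDefaultModel();
        Property parent = base.createProperty("http://example.org/parent");
        Property grandparent = base.createProperty("http://example.org/grandparent");
        Resource a = base.createResource("http://example.org/a");
        Resource b = base.createResource("http://example.org/b");
        Resource c = base.createResource("http://example.org/c");
        base.add(a, parent, b).add(b, parent, c);

        InfModel inf = freshInfModel(base); // one of these per query thread
        System.out.println(inf.contains(a, grandparent, c)); // derived: true
    }
}
```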
For forward rules the question doesn't make much sense: once the
forward engine has run, further queries won't trigger any more
reasoning until the data changes.
* What type of hardware configuration is recommended for fast reasoner
execution?
Fast (!) and lots of memory.
What can be done to increase the parallelism of the reasoner?
Rewrite it from scratch!
* How do we create a persistent inferred model that can still be
processed using the reasoner (i.e. persisted triples can be deleted by the
reasoner at a later stage)? Right now the reasoner runs every time we start the
system, it could take minutes to infer all the triples.
If your data doesn't change much and you are using forward inference,
then use the reasoner offline: run it once, in memory, take all the
results and put them in a graph in the persistent store.
If the data changes, schedule a new rebuild.
The current Jena rule engines are really not that well adapted to life
with a persistent store.
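That offline pattern can be sketched roughly like this (Java, the rule
and URIs are illustrative, and the "store" model is in-memory to keep
the sketch self-contained; with TDB you would add the deductions to the
persisted model instead):

```java
import java.util.List;
import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.rulesys.*;

public class OfflineMaterialise {
    // Run the forward engine once, in memory, and hand back just the
    // deduced triples, ready to be added to a persistent graph.
    static Model materialise(Model data, List<Rule> rules) {
        GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);
        reasoner.setMode(GenericRuleReasoner.FORWARD);
        InfModel inf = ModelFactory.createInfModel(reasoner, data);
        inf.prepare(); // force the rules to fire now rather than lazily
        return inf.getDeductionsModel();
    }

    public static void main(String[] args) {
        Model data = ModelFactory.createDefaultModel();
        Property parent = data.createProperty("http://example.org/parent");
        Property grandparent = data.createProperty("http://example.org/grandparent");
        Resource a = data.createResource("http://example.org/a");
        Resource b = data.createResource("http://example.org/b");
        Resource c = data.createResource("http://example.org/c");
        data.add(a, parent, b).add(b, parent, c);

        List<Rule> rules = Rule.parseRules(
            "[gp: (?x <http://example.org/parent> ?y) " +
            "     (?y <http://example.org/parent> ?z) " +
            "  -> (?x <http://example.org/grandparent> ?z)]");

        // Base data plus materialized deductions, stored together.
        Model store = ModelFactory.createDefaultModel();
        store.add(data).add(materialise(data, rules));
        System.out.println(store.contains(a, grandparent, c)); // true
        // On restart, query "store" directly; no reasoner is needed
        // until the data changes, then rerun materialise().
    }
}
```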
Dave