Thanks for the fast response!
I have a set of disk-based binary SDAI repositories which are based on
ISO 10303 parts 11/21/25/27, otherwise known as the EXPRESS/STEP/SDAI parts. In
particular my files are IFC2x3 files, which can be over 1GB. However, after
processing into an SDAI binary I typically see a size reduction, e.g. a 1.4GB
STEP file becomes a 1GB SDAI repository. If I convert the STEP file into TDB I
get over 100M quads and a 50GB folder. Multiplied by 1000s of similar sized STEP
files...
Typically only a small subset of the STEP file needs to be queried, but
sometimes other parts need to be queried too. Hence the on-demand caching and
DatasetGraphInMemory. The aim is that in the find methods I check a cache and,
in the case of a cache miss, call the native SDAI find methods based on the node
URIs, call the add methods for the minted tuples, and then pass the call on to
the super find. The underlying SDAI repositories are static, so once a
subject is cached no other work is required.
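A minimal sketch of that cache-on-miss shim, in plain Java so it runs without Jena or an SDAI toolkit. Everything in it is a hypothetical stand-in: OnDemandStore, its nested Triple class, the nativeLookup function (playing the role of the native SDAI find) and the nativeCalls counter are assumptions for illustration, not Jena or SDAI APIs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in types: none of these are Jena or SDAI classes.
final class OnDemandStore {
    /** A triple of plain strings; in a find pattern, null means "any". */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // Plays the role of the in-memory dataset that answers finds.
    private final List<Triple> store = new ArrayList<>();
    // Subjects whose tuples have already been minted.
    private final Set<String> loadedSubjects = new HashSet<>();
    // Plays the role of the native SDAI find for one subject URI.
    private final Function<String, List<Triple>> nativeLookup;
    // Exposed only so the effect of the cache is observable.
    int nativeCalls = 0;

    OnDemandStore(Function<String, List<Triple>> nativeLookup) {
        this.nativeLookup = nativeLookup;
    }

    List<Triple> find(String s, String p, String o) {
        // Cache check: only a concrete, not-yet-seen subject triggers the
        // native lookup. What to do for a wildcard subject is left open here.
        if (s != null && loadedSubjects.add(s)) {
            nativeCalls++;
            store.addAll(nativeLookup.apply(s)); // the "add" step for minted tuples
        }
        // The "super find" step: answer from what is now held in memory.
        List<Triple> out = new ArrayList<>();
        for (Triple t : store) {
            boolean match = (s == null || s.equals(t.s))
                    && (p == null || p.equals(t.p))
                    && (o == null || o.equals(t.o));
            if (match) out.add(t);
        }
        return out;
    }
}
```

Because the source is static, a subject is looked up natively at most once; a second find on the same subject is answered from memory alone. Expiry and a per-repository tuple limit are not shown.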
As DatasetGraphInMemory is described in its comments as giving very fast quad
and triple access, it seemed a logical place to extend. The shim cache would be
set to expire entries and limit the total number of tuples per repository. This
is currently deployed on a device with 256GB of RAM.
In the bigger picture I have a service very similar to Fuseki which allows
SPARQL requests to be made against Datasets which are either TDB or SDAI cache
backed.
What was DatasetGraphInMemory created for..? ;-)
Dick
-------- Original message --------
From: "A. Soroka" <[email protected]>
Date: 12/02/2016 6:21 pm (GMT+00:00)
To: [email protected]
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
DatasetGraphInMemory
I wrote the DatasetGraphInMemory code, but I suspect your question may be
better answered by other folks who are more familiar with Jena's DatasetGraph
implementations, or may actually not have anything to do with DatasetGraph (see
below for why). I will try to give some background information, though.
There are several paths by which DatasetGraphInMemory can be performing
finds, but they come down to two places in the code, QuadTable::find and
TripleTable::find, and in default operation, the concrete forms:
https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
for Quads and
https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
for Triples. Those methods are reused by all the differently-ordered indexes
within Hex- or TriTable, each of which will answer a find by selecting an
appropriately-ordered index based on the fixed and variable slots in the find
pattern and using the concrete methods above to stream tuples back.
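To make the index selection concrete, here is a minimal sketch in plain Java. It is not Jena's code: the class IndexChooser, the six orderings listed, and the "longest fixed prefix" rule are all assumptions chosen for illustration. The idea is that an index ordered so that the fixed slots come first can answer the pattern by walking a prefix rather than scanning.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: picks an index ordering for a quad find pattern.
final class IndexChooser {
    // Six slot orderings, assumed here for illustration
    // (G = graph, S = subject, P = predicate, O = object).
    static final List<String> ORDERINGS =
            Arrays.asList("GSPO", "GOPS", "SPOG", "OSGP", "PGSO", "OPSG");

    /** Counts how many leading slots of the ordering are fixed in the pattern. */
    static int boundPrefix(String ordering, String bound) {
        int n = 0;
        while (n < ordering.length() && bound.indexOf(ordering.charAt(n)) >= 0) {
            n++;
        }
        return n;
    }

    /**
     * bound holds the letters of the fixed slots, e.g. "GO" when graph and
     * object are concrete. Returns the ordering with the longest fixed prefix;
     * ties go to the ordering listed first.
     */
    static String choose(String bound) {
        String best = ORDERINGS.get(0);
        for (String candidate : ORDERINGS) {
            if (boundPrefix(candidate, bound) > boundPrefix(best, bound)) {
                best = candidate;
            }
        }
        return best;
    }
}
```

With these assumptions, a pattern that fixes only the subject selects SPOG, and one that fixes graph and object selects GOPS.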
As to why you are seeing your methods called in some places and not in others,
note that DatasetGraphBaseFind features methods like findInDftGraph(),
findInSpecificNamedGraph(), findInAnyNamedGraphs() etc. and that these are the
methods that DatasetGraphInMemory is implementing. DSGInMemory does not make a
selection between those methods; that is done by DatasetGraphBaseFind. So that
is where you will find the logic that should answer your question.
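The shape of that selection can be sketched as follows. This is a simplified illustration, not the DatasetGraphBaseFind source: the class FindDispatch, the DEFAULT_GRAPH marker string and the exact conditions are assumptions, and the real class has further cases (the "etc." above) that are omitted here. Only the three method names come from Jena.

```java
// Hypothetical sketch of routing a find by its graph slot.
final class FindDispatch {
    // Assumed marker for the default graph; not Jena's actual constant.
    static final String DEFAULT_GRAPH = "urn:x-example:default-graph";

    /**
     * Returns the name of the method a find would be routed to, based only
     * on the graph slot. A null graph stands for a wildcard.
     */
    static String route(String graph) {
        if (DEFAULT_GRAPH.equals(graph)) {
            return "findInDftGraph";
        }
        if (graph != null) {
            return "findInSpecificNamedGraph";
        }
        // Wildcard graph: simplified here to the named-graph case only.
        return "findInAnyNamedGraphs";
    }
}
```

The point for a shim is that the base class picks the target from the graph slot, so an override placed on only one entry point can be bypassed by queries that arrive through another.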
Can you say a little more about your use case? You seem to have some efficient
representation in memory of your data (I hope it is in-memory; otherwise it is
a very bad choice to subclass DSGInMemory) and you want to create tuples on the
fly as queries are received. That is really not at all what DSGInMemory is for
(DSGInMemory is using map structures for indexing and in default mode, uses
persistent data structures to support transactionality). I am wondering whether
you might not be much better served by tapping into Jena at a different place,
perhaps implementing the Graph SPI directly. Or, if reusing DSGInMemory is the
right choice, just implementing Quad- and TripleTable and using the constructor
DatasetGraphInMemory(final QuadTable i, final TripleTable t).
---
A. Soroka
The University of Virginia Library
> On Feb 12, 2016, at 12:58 PM, Dick Murray <[email protected]> wrote:
>
> Hi.
>
> Does anyone know the "find" paths through DatasetGraphInMemory please?
>
> For example if I extend DatasetGraphInMemory and override
> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on "select
> * where {?s ?p ?o}" however if I override the other
> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
> calling (but as I type I'm guessing it's optimised to return the HexTable
> nodes...).
>
> Would I be better off overriding HexTable and TriTable classes find methods
> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
> one of these methods?
>
> I need to know the root find methods so that I can shim them to create
> triples/quads before they perform the find.
>
> I need to create Triples/Quads on demand (because a bulk load would create
> ~100M triples but only ~1000 are ever queried) and the source binary form
> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
> than quads.
>
> Regards Dick Murray.