Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Joint Fri, 12 Feb 2016 11:42:43 -0800

    
Thanks for the fast response!
 I have a set of disk-based binary SDAI repositories which are based on 
ISO 10303 parts 11/21/25/27, otherwise known as the EXPRESS/STEP/SDAI parts. In 
particular my files are IFC2x3 files, which can be over 1 GB. However, after 
processing into an SDAI binary I typically see a size reduction, e.g. a 1.4 GB 
STEP file becomes a 1 GB SDAI repository. If I convert the STEP file into TDB I 
get over 100M quads and a 50 GB folder. Multiply that by 1000s of similarly 
sized STEP files...
Typically only a small subset of the STEP file needs to be queried, but 
sometimes other parts need to be queried. Hence the on-demand caching and 
DatasetGraphInMemory. The aim is that in the find methods I check a cache and, 
in the case of a cache miss, call the native SDAI find methods based on the 
node URIs, call the add methods for the minted tuples, and then pass the call 
on to the super find. The underlying SDAI repositories are static, so once a 
subject is cached no other work is required.
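
The load-on-miss flow described above can be sketched in plain Java. This is only an illustration under stated assumptions: SimpleTriple, LazyTripleStore and the nativeLookup function are hypothetical names standing in for Jena's classes and the native SDAI find, and a linear scan stands in for the real indexed find.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for a triple; this is not Jena's Triple class.
final class SimpleTriple {
    final String s, p, o;
    SimpleTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

// Materialises the triples for a subject the first time that subject is queried.
class LazyTripleStore {
    // Stands in for the native SDAI lookup; assumed to return all triples for a subject.
    private final Function<String, List<SimpleTriple>> nativeLookup;
    private final Set<String> loadedSubjects = new HashSet<>();
    private final List<SimpleTriple> triples = new ArrayList<>();
    int nativeCalls = 0; // exposed only so the behaviour can be observed

    LazyTripleStore(Function<String, List<SimpleTriple>> nativeLookup) {
        this.nativeLookup = nativeLookup;
    }

    // A null argument is a wildcard. Only a concrete subject triggers loading.
    List<SimpleTriple> find(String s, String p, String o) {
        if (s != null && loadedSubjects.add(s)) {   // add() returns true on a cache miss
            nativeCalls++;
            triples.addAll(nativeLookup.apply(s));  // the "add" step for minted tuples
        }
        List<SimpleTriple> result = new ArrayList<>();
        for (SimpleTriple t : triples) {            // the "super find" step, as a linear scan
            if (matches(s, t.s) && matches(p, t.p) && matches(o, t.o)) {
                result.add(t);
            }
        }
        return result;
    }

    private static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }
}
```

Because the source data is static, a subject that has been loaded once never needs to be loaded again; the sketch does not attempt to handle a wildcard subject, which would need a different loading strategy.
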
As DatasetGraphInMemory is described in its comments as offering very fast quad 
and triple access, it seemed a logical place to extend. The shim cache would be 
set to expire entries and limit the total number of tuples per repository. This 
is currently deployed on a device with 256 GB of RAM.
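
One simple way to bound such a cache is a least-recently-used map built on java.util.LinkedHashMap. The sketch below is an assumption about how the limit could be enforced, not a description of the deployed shim: BoundedSubjectCache is a hypothetical name, it caps the number of cached entries rather than the number of tuples, and it does not implement time-based expiry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache keyed by subject URI; evicts the least recently used entry.
class BoundedSubjectCache<V> extends LinkedHashMap<String, V> {
    private final int maxEntries;

    BoundedSubjectCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, which gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxEntries;
    }
}
```

An evicted subject would also need its tuples removed from the backing dataset, which this sketch leaves out.
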
In the bigger picture I have a service very similar to Fuseki which allows 
SPARQL requests to be made against Datasets which are either TDB or SDAI cache 
backed.
What was DatasetGraphInMemory created for..? ;-)
Dick

-------- Original message --------
From: "A. Soroka" <[email protected]> 
Date: 12/02/2016  6:21 pm  (GMT+00:00) 
To: [email protected] 
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
DatasetGraphInMemory 

I wrote the DatasetGraphInMemory  code, but I suspect your question may be 
better answered by other folks who are more familiar with Jena's DatasetGraph 
implementations, or may actually not have anything to do with DatasetGraph (see 
below for why). I will try to give some background information, though.

There are several paths by which DatasetGraphInMemory can perform finds, but 
they come down to two places in the code, QuadTable::find and 
TripleTable::find, and in default operation, the concrete forms:

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100

for Quads and

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99

for Triples. Those methods are reused by all the differently-ordered indexes 
within Hex- or TriTable, each of which will answer a find by selecting an 
appropriately-ordered index based on the fixed and variable slots in the find 
pattern and using the concrete methods above to stream tuples back.
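
To illustrate the idea of picking an index from the fixed slots of a pattern, here is a minimal sketch for the triple case. It assumes three index orderings named SPO, POS and OSP; the class name, method and selection rules are hypothetical and are not taken from Jena's code.

```java
// Hypothetical chooser: picks the ordering whose leading slots are all fixed,
// so the lookup can descend the index by prefix instead of scanning.
class TripleIndexChooser {
    static String choose(boolean sFixed, boolean pFixed, boolean oFixed) {
        if (sFixed && pFixed) return "SPO"; // S,P prefix (also covers S,P,O all fixed)
        if (pFixed && oFixed) return "POS"; // P,O prefix
        if (oFixed && sFixed) return "OSP"; // O,S prefix
        if (sFixed) return "SPO";
        if (pFixed) return "POS";
        if (oFixed) return "OSP";
        return "SPO"; // nothing fixed: any ordering works, the result is a full scan
    }
}
```

With these three orderings every combination of fixed slots forms a prefix of at least one index, which is why three orderings are enough for triples.
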

As to why you are seeing your methods called in some places and not in others: 
DatasetGraphBaseFind features methods like findInDftGraph(), 
findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are the 
methods that DatasetGraphInMemory implements. DSGInMemory does not make a 
selection between those methods; that is done by DatasetGraphBaseFind. So that 
is where you will find the logic that should answer your question.
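
The division of labour described here, where a base class selects a hook and the subclass only implements the hooks, can be sketched as follows. This is a simplified illustration and not Jena's actual code: FindDispatchSketch, the DEFAULT_GRAPH marker, the use of null as a wildcard and the String-based signatures are all assumptions, and the real class may treat a wildcard graph differently.

```java
import java.util.List;

// Hypothetical base class: find() chooses a hook, subclasses supply the hooks.
abstract class FindDispatchSketch {
    // Assumed marker for the default graph; not a real Jena constant.
    static final String DEFAULT_GRAPH = "urn:x-sketch:default-graph";

    // A null graph is treated as a wildcard in this sketch.
    public final List<String> find(String g, String s, String p, String o) {
        if (DEFAULT_GRAPH.equals(g)) {
            return findInDftGraph(s, p, o);
        }
        if (g == null) {
            return findInAnyNamedGraphs(s, p, o);
        }
        return findInSpecificNamedGraph(g, s, p, o);
    }

    protected abstract List<String> findInDftGraph(String s, String p, String o);

    protected abstract List<String> findInSpecificNamedGraph(String g, String s, String p, String o);

    protected abstract List<String> findInAnyNamedGraphs(String s, String p, String o);
}
```

In a design like this, a breakpoint placed in an overridden find() is only hit by callers that enter through find(); callers that invoke a hook or another entry point directly bypass it, so a shim is more reliable when placed in the hooks themselves.
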

Can you say a little more about your use case? You seem to have some efficient 
representation in memory of your data (I hope it is in-memory; otherwise it is 
a very bad choice to subclass DSGInMemory) and you want to create tuples on the 
fly as queries are received. That is really not at all what DSGInMemory is for 
(DSGInMemory is using map structures for indexing and in default mode, uses 
persistent data structures to support transactionality). I am wondering whether 
you might not be much better served by tapping into Jena at a different place, 
perhaps implementing the Graph SPI directly. Or, if reusing DSGInMemory is the 
right choice, just implementing Quad- and TripleTable and using the constructor 
DatasetGraphInMemory(final QuadTable i, final TripleTable t).

---
A. Soroka
The University of Virginia Library

> On Feb 12, 2016, at 12:58 PM, Dick Murray <[email protected]> wrote:
> 
> Hi.
> 
> Does anyone know the "find" paths through DatasetGraphInMemory please?
> 
> For example if I extend DatasetGraphInMemory and override
> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on "select
> * where {?s ?p ?o}" however if I override the other
> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
> calling (but as I type I'm guessing it's optimised to return the HexTable
> nodes...).
> 
> Would I be better off overriding HexTable and TriTable classes find methods
> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
> one of these methods?
> 
> I need to know the root find methods so that I can shim them to create
> triples/quads before they perform the find.
> 
> I need to create Triples/Quads on demand (because a bulk load would create
> ~100M triples but only ~1000 are ever queried) and the source binary form
> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
> than quads.
> 
> Regards Dick Murray.
