Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

A. Soroka Fri, 12 Feb 2016 15:08:19 -0800

Okay, I’m more confident at this point that you’re not well served by 
DatasetGraphInMemory, which has very strong assumptions about the speedy 
reachability of data. DSGInMemory was built for situations when all of the data 
is in core memory and multithreaded access is important. If you have a lot of 
core memory and can load the data fully, you might want to use it, but that 
doesn’t sound at all like your case. As for what the right extension point 
is, I will need to defer to committers or more experienced devs, but I think 
you may need to look at DatasetGraph from a more close-to-the-metal point of 
view. TDB extends DatasetGraphTriplesQuads directly, for example.
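
To make the on-demand idea concrete without tying it to any particular Jena 
class, here is a minimal sketch of per-subject lazy materialization in plain 
Java. Everything in it is a hypothetical stand-in: LazyTripleStore, the T 
class, and nativeLookup (which plays the role of a native SDAI find) are not 
Jena or SDAI API, and the sketch ignores expiry, size limits, and thread 
safety.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Toy model only: plain Java, no Jena or SDAI classes involved. */
class LazyTripleStore {
    /** Hypothetical stand-in for a triple of node URIs. */
    static final class T {
        final String s, p, o;
        T(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    private final Map<String, List<T>> cache = new HashMap<>();
    private final Function<String, List<T>> nativeLookup; // assumed native find
    int nativeCalls = 0; // exposed so the caching effect is observable

    LazyTripleStore(Function<String, List<T>> nativeLookup) {
        this.nativeLookup = nativeLookup;
    }

    /** Returns the triples for a subject, minting them on the first request only. */
    List<T> find(String subject) {
        List<T> hit = cache.get(subject);
        if (hit == null) {
            // Cache miss: ask the native store once and remember the result.
            nativeCalls++;
            hit = new ArrayList<>(nativeLookup.apply(subject));
            cache.put(subject, hit);
        }
        return hit;
    }

    public static void main(String[] args) {
        LazyTripleStore store = new LazyTripleStore(
            s -> List.of(new T(s, "urn:x-toy:type", "urn:x-toy:Wall")));
        store.find("urn:x-toy:wall1");
        store.find("urn:x-toy:wall1"); // served from the cache
        System.out.println(store.nativeCalls); // prints 1
    }
}
```

Because the underlying repositories are static, the sketch never invalidates 
an entry; a real shim would still need eviction to bound memory.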


Can you tell us a bit more about your full scenario? I don’t know much about 
STEP (apologies if others here do). Is there a canonical RDF formulation? What 
kinds of queries are you going to be using with this data? How quickly are 
users going to need to switch contexts between datasets?

---
A. Soroka
The University of Virginia Library

> On Feb 12, 2016, at 2:44 PM, Joint <[email protected]> wrote:
> 
> 
> 
> Thanks for the fast response!
> I have a set of disk-based binary SDAI repositories, which are based on 
> ISO 10303 parts 11/21/25/27, otherwise known as the EXPRESS/STEP/SDAI parts. 
> In particular, my files are IFC2x3 files, which can be over 1GB. However, 
> after processing into an SDAI binary I typically see a size reduction, e.g. a 
> 1.4GB STEP file becomes a 1GB SDAI repository. If I convert the STEP file 
> into TDB I get over 100M quads and a 50GB folder. Multiplied by 1000s of 
> similar sized STEP files...
> Typically only a small subset of the STEP file needs to be queried, but 
> sometimes other parts need to be queried. Hence the on-demand caching and 
> DatasetGraphInMemory. The aim is that in the find methods I check a cache 
> and, in the case of a cache miss, call the native SDAI find methods based on 
> the node URIs, call the add methods for the minted tuples, and then pass the 
> call on to the super find. The underlying SDAI repositories are static, so 
> once a subject is cached no other work is required.
> As DatasetGraphInMemory is commented as offering very fast quad and triple 
> access, it seemed a logical place to extend. The shim cache would be set to 
> expire entries and limit the total number of tuples per repository. This is 
> currently deployed on a device with 256GB of RAM.
> In the bigger picture I have a service very similar to Fuseki which allows 
> SPARQL requests to be made against Datasets which are either TDB or SDAI 
> cache backed.
> What was DatasetGraphInMemory created for..? ;-)
> Dick
> 
> -------- Original message --------
> From: "A. Soroka" <[email protected]> 
> Date: 12/02/2016  6:21 pm  (GMT+00:00) 
> To: [email protected] 
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
> DatasetGraphInMemory 
> 
> I wrote the DatasetGraphInMemory code, but I suspect your question may be 
> better answered by other folks who are more familiar with Jena's DatasetGraph 
> implementations, or may actually not have anything to do with DatasetGraph 
> (see below for why). I will try to give some background information, though.
> 
> There are several paths by which DatasetGraphInMemory can perform finds, 
> but they come down to two places in the code, QuadTable::find and 
> TripleTable::find, and in default operation, the concrete forms:
> 
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
> 
> for Quads and
> 
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
> 
> for Triples. Those methods are reused by all the differently-ordered indexes 
> within Hex- or TriTable, each of which will answer a find by selecting an 
> appropriately-ordered index based on the fixed and variable slots in the find 
> pattern and using the concrete methods above to stream tuples back.
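
As an illustration of choosing an index from the fixed and variable slots of 
a pattern, the sketch below picks the ordering whose leading slots are all 
bound for as long as possible. This is a generic longest-bound-prefix 
heuristic, not a copy of the Jena code: the IndexChooser class is 
hypothetical, and the three orderings SPO, POS and OSP are assumed purely for 
the example.

```java
/** Toy model only: picks a triple index ordering for a find pattern. */
class IndexChooser {
    /** Orderings assumed for illustration; a real table may offer others. */
    static final String[] ORDERS = {"SPO", "POS", "OSP"};

    /** Each flag says whether that slot of the pattern is fixed (true) or variable. */
    static String choose(boolean sBound, boolean pBound, boolean oBound) {
        String best = ORDERS[0];
        int bestLen = -1;
        for (String order : ORDERS) {
            // Count how many leading slots of this ordering are fixed.
            int len = 0;
            for (char slot : order.toCharArray()) {
                boolean bound = slot == 'S' ? sBound : slot == 'P' ? pBound : oBound;
                if (!bound) break;
                len++;
            }
            // Keep the first ordering with the longest fixed prefix.
            if (len > bestLen) {
                best = order;
                bestLen = len;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false, true));   // prints OSP
        System.out.println(choose(false, true, false));  // prints POS
    }
}
```

A fully variable pattern has no fixed prefix in any ordering, so the sketch 
simply falls back to the first ordering and scans it.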
> 
> As to why you are seeing your methods called in some places and not in 
> others: DatasetGraphBaseFind features methods like findInDftGraph(), 
> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are 
> the methods that DatasetGraphInMemory implements. DSGInMemory does not 
> make a selection between those methods; that is done by DatasetGraphBaseFind. 
> So that is where you will find the logic that should answer your question.
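
The shape of that arrangement can be pictured with a toy model: a base class 
owns the routing, and subclasses only supply the hook methods. The hook names 
below are the ones mentioned above, but the routing rules, the string-based 
graph names, the DEFAULT sentinel and the classes ToyBaseFind and 
RecordingFind are simplified assumptions for illustration, not the actual 
Jena source.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: the base class routes, subclasses implement the hooks. */
abstract class ToyBaseFind {
    /** Hypothetical sentinel naming the default graph; null means "any graph". */
    static final String DEFAULT = "urn:x-toy:default";

    List<String> find(String g, String s, String p, String o) {
        if (DEFAULT.equals(g)) return findInDftGraph(s, p, o);
        return findNG(g, s, p, o);
    }

    /** Second entry point: callers may come in here without touching find(). */
    List<String> findNG(String g, String s, String p, String o) {
        if (g == null) return findInAnyNamedGraphs(s, p, o);
        return findInSpecificNamedGraph(g, s, p, o);
    }

    protected abstract List<String> findInDftGraph(String s, String p, String o);
    protected abstract List<String> findInSpecificNamedGraph(String g, String s, String p, String o);
    protected abstract List<String> findInAnyNamedGraphs(String s, String p, String o);
}

/** Records which hook was reached, so the routing is observable. */
class RecordingFind extends ToyBaseFind {
    final List<String> calls = new ArrayList<>();

    @Override protected List<String> findInDftGraph(String s, String p, String o) {
        calls.add("findInDftGraph");
        return new ArrayList<>();
    }
    @Override protected List<String> findInSpecificNamedGraph(String g, String s, String p, String o) {
        calls.add("findInSpecificNamedGraph");
        return new ArrayList<>();
    }
    @Override protected List<String> findInAnyNamedGraphs(String s, String p, String o) {
        calls.add("findInAnyNamedGraphs");
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        RecordingFind f = new RecordingFind();
        f.find(DEFAULT, null, null, null);
        f.findNG(null, null, null, null); // bypasses find() entirely
        System.out.println(f.calls);
    }
}
```

In this model a breakpoint in an overridden find() would never fire for the 
second call, whereas a shim placed in the hook methods sees both entry points.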
> 
> Can you say a little more about your use case? You seem to have some 
> efficient representation in memory of your data (I hope it is in-memory; 
> otherwise it is a very bad choice to subclass DSGInMemory) and you want to 
> create tuples on the fly as queries are received. That is really not at all 
> what DSGInMemory is for (DSGInMemory is using map structures for indexing and 
> in default mode, uses persistent data structures to support 
> transactionality). I am wondering whether you might not be much better served 
> by tapping into Jena at a different place, perhaps implementing the Graph SPI 
> directly. Or, if reusing DSGInMemory is the right choice, just implementing 
> Quad- and TripleTable and using the constructor DatasetGraphInMemory(final 
> QuadTable i, final TripleTable t).
> 
> ---
> A. Soroka
> The University of Virginia Library
> 
>> On Feb 12, 2016, at 12:58 PM, Dick Murray <[email protected]> wrote:
>> 
>> Hi.
>> 
>> Does anyone know the "find" paths through DatasetGraphInMemory please?
>> 
>> For example if I extend DatasetGraphInMemory and override
>> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on "select
>> * where {?s ?p ?o}" however if I override the other
>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
>> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
>> calling (but as I type I'm guessing it's optimised to return the HexTable
>> nodes...).
>> 
>> Would I be better off overriding HexTable and TriTable classes find methods
>> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
>> one of these methods?
>> 
>> I need to know the root find methods so that I can shim them to create
>> triples/quads before they perform the find.
>> 
>> I need to create Triples/Quads on demand (because a bulk load would create
>> ~100M triples but only ~1000 are ever queried) and the source binary form
>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
>> than quads.
>> 
>> Regards Dick Murray.
>
