Hi.
The quick full scenario is a distributed DaaS which supports queries, updates,
transforms and bulkloads. Andy Seaborne knows some of the detail because I
spoke to him previously. We achieve multiple writes by having parallel
Datasets, both traditional TDB and on demand in memory. Writes are sent to a
free dataset, free being not in a write transaction. That's a simplistic
overview...
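
A minimal sketch of the "free dataset" selection, in plain Java with no Jena
types. WritableSlot and FreeDatasetPicker are hypothetical names used only for
illustration; the real system presumably relies on Jena transactions rather
than a simple flag.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for one writable dataset; not a Jena class. */
final class WritableSlot {
    final String name;
    // true while a write transaction is open on this slot
    private final AtomicBoolean inWrite = new AtomicBoolean(false);

    WritableSlot(String name) {
        this.name = name;
    }

    /** Atomically claims the slot for writing; false if a write is already open. */
    boolean tryBeginWrite() {
        return inWrite.compareAndSet(false, true);
    }

    /** Marks the write as finished so the slot is free again. */
    void endWrite() {
        inWrite.set(false);
    }
}

final class FreeDatasetPicker {
    /** Returns the first slot that is not in a write, already claimed for the caller. */
    static Optional<WritableSlot> claimFree(List<WritableSlot> slots) {
        for (WritableSlot slot : slots) {
            if (slot.tryBeginWrite()) {
                return Optional.of(slot);
            }
        }
        return Optional.empty(); // every slot is busy; the caller must wait or retry
    }
}
```

A caller would claim a slot, perform the write, and call endWrite() in a
finally block so the slot is always released.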
Queries are handled by a dataset proxy which builds a dynamic dataset based on
the graph URIs. For example the graph URI urn:Iungo:all causes the proxy find
method to issue the query to all known Datasets and return the union of
results. Various dataset proxies exist, some load TDBs, others load TTL files
into graphs, others dynamically create tuples. What they have in common is that
they are all presented as Datasets backed by DatasetGraph. Thus a SPARQL query can
result in multiple Datasets being loaded to satisfy the query.
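
The urn:Iungo:all fan-out can be sketched roughly as follows, again in plain
Java. QuadSource and UnionProxy are hypothetical names, quads are represented
as plain strings, and the pattern is a simple predicate; none of this is the
Jena DatasetGraph API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Stream;

/** Hypothetical stand-in for a DatasetGraph: anything that can answer a find. */
interface QuadSource {
    Stream<String> find(Predicate<String> pattern);
}

final class UnionProxy {
    static final String ALL = "urn:Iungo:all";

    // Known sources keyed by graph URI, kept in registration order.
    private final Map<String, QuadSource> sources = new LinkedHashMap<>();

    void register(String graphUri, QuadSource source) {
        sources.put(graphUri, source);
    }

    /** Routes a find to one source, or to every source when the graph URI is ALL. */
    Stream<String> find(String graphUri, Predicate<String> pattern) {
        if (ALL.equals(graphUri)) {
            // Union of results: query every source and drop duplicate quads.
            return sources.values().stream()
                    .flatMap(source -> source.find(pattern))
                    .distinct();
        }
        QuadSource source = sources.get(graphUri);
        return source == null ? Stream.empty() : source.find(pattern);
    }
}
```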
Nodes can be preloaded which then load Datasets to satisfy finds. This way the
system can be scaled to handle increased workloads. Also specific nodes can be
targeted to specific hardware.
When a graph URI is encountered the proxy can interpret its structure. So
urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository
foo to be dynamically loaded into memory along with the quads which are
required to satisfy the find.
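
As an illustration of interpreting the URI structure, here is a small parser
for the urn:Iungo:sdai/foo/bar form. SdaiGraphRef is a hypothetical name, and
the strict "exactly repository/model" rule is an assumption made for the
sketch.

```java
import java.util.Optional;

/** Hypothetical parsed form of an SDAI graph URI: repository plus model. */
final class SdaiGraphRef {
    static final String PREFIX = "urn:Iungo:sdai/";

    final String repository;
    final String model;

    private SdaiGraphRef(String repository, String model) {
        this.repository = repository;
        this.model = model;
    }

    /** Parses urn:Iungo:sdai/<repository>/<model>; empty for any other shape. */
    static Optional<SdaiGraphRef> parse(String graphUri) {
        if (graphUri == null || !graphUri.startsWith(PREFIX)) {
            return Optional.empty();
        }
        // The limit of -1 keeps trailing empty segments so "foo/" is rejected.
        String[] parts = graphUri.substring(PREFIX.length()).split("/", -1);
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new SdaiGraphRef(parts[0], parts[1]));
    }
}
```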
Typically a group of people will be working on a set of data so the first to
query will load the dataset then it will be accessed multiple times. There will
be an initial dynamic load of data which will tail off with some additional
loading over time.
Based on your description the DatasetGraphInMemory would seem to match the
dynamic load requirement. How did you foresee it being loaded? Is there a large
overhead to using the add methods?
A typical scenario would be to search all SDAI repositories for some key
information then load detailed information in some, continuing to drill down.
Hope this helps.
I'm going to extend the hex and tri tables and run some tests. I've already
shimmed the DGTriplesQuads so the actual caching code already exists and should
be easy to hook on.
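
For anyone following along, the shim idea is roughly the following, stripped
of all Jena types. OnDemandStore and its nativeLoader are hypothetical
stand-ins for the cache and the native SDAI find; expiry and per-repository
tuple limits are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical on-demand store: mints tuples for a subject on first find. */
final class OnDemandStore {
    // Each tuple is {subject, predicate, object}.
    private final List<String[]> tuples = new ArrayList<>();
    // Subjects whose tuples have already been minted and added.
    private final Set<String> loadedSubjects = new HashSet<>();
    // Stand-in for the native SDAI lookup that produces tuples for a subject.
    private final Function<String, List<String[]>> nativeLoader;
    // Counts cache misses, exposed only so the behaviour can be observed.
    int loaderCalls = 0;

    OnDemandStore(Function<String, List<String[]>> nativeLoader) {
        this.nativeLoader = nativeLoader;
    }

    /** Loads the subject on a cache miss, then answers from the in-memory tuples. */
    synchronized List<String[]> findBySubject(String subject) {
        if (loadedSubjects.add(subject)) { // add() returns true only on first sight
            loaderCalls++;
            tuples.addAll(nativeLoader.apply(subject));
        }
        return tuples.stream()
                .filter(t -> t[0].equals(subject))
                .collect(Collectors.toList());
    }
}
```

Because the underlying repositories are static, a subject never needs to be
reloaded once it has been seen, so a simple "seen" set is enough here.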
Dick
-------- Original message --------
From: "A. Soroka" <[email protected]>
Date: 12/02/2016 11:07 pm (GMT+00:00)
To: [email protected]
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
DatasetGraphInMemory
Okay, I’m more confident at this point that you’re not well served by
DatasetGraphInMemory, which has very strong assumptions about the speedy
reachability of data. DSGInMemory was built for situations when all of the data
is in core memory and multithreaded access is important. If you have a lot of
core memory and can load the data fully, you might want to use it, but that
doesn’t sound at all like your case. Otherwise, as far as what the right
extension point is, I will need to defer to committers or more experienced
devs, but I think you may need to look at DatasetGraph from a more
close-to-the-metal point. TDB extends DatasetGraphTriplesQuads directly, for
example.
Can you tell us a bit more about your full scenario? I don’t know much about
STEP (sorry if others do)— is there a canonical RDF formulation? What kinds of
queries are you going to be using with this data? How quickly are users going
to need to switch contexts between datasets?
---
A. Soroka
The University of Virginia Library
> On Feb 12, 2016, at 2:44 PM, Joint <[email protected]> wrote:
>
>
>
> Thanks for the fast response!
> I have a set of disk based binary SDAI repositories which are based on
> ISO10303 parts 11/21/25/27 otherwise known as the EXPRESS/STEP/SDAI parts. In
> particular my files are IFC2x3 files which can be +1Gb. However after
> processing into a SDAI binary I typically see a size reduction e.g. 1.4Gb STEP
> file becomes a 1Gb SDAI repository. If I convert the STEP file into TDB I get
> +100M quads and a 50Gb folder. Multiplied by 1000's of similar sized STEP
> files...
> Typically only a small subset of the STEP file needs to be queried but
> sometimes other parts need to be queried. Hence the on demand caching and
> DatasetGraphInMemory. The aim is that in the find methods I check a cache and
> call the native SDAI find methods based on the node URIs in the case of a
> cache miss, calling the add methods for the minted tuples, then passing on
> the call to the super find. The underlying SDAI repositories are static so
> once a subject is cached no other work is required.
> As the DatasetGraphInMemory is commented as very fast quad and triple access
> it seemed a logical place to extend. The shim cache would be set to expire
> entries and limit the total number of tuples per repository. This is
> currently deployed on a 256Gb ram device.
> In the bigger picture I have a service very similar to Fuseki which allows
> SPARQL requests to be made against Datasets which are either TDB or SDAI
> cache backed.
> What was DatasetGraphInMemory created for..? ;-)
> Dick
>
> -------- Original message --------
> From: "A. Soroka" <[email protected]>
> Date: 12/02/2016 6:21 pm (GMT+00:00)
> To: [email protected]
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
> DatasetGraphInMemory
>
> I wrote the DatasetGraphInMemory code, but I suspect your question may be
> better answered by other folks who are more familiar with Jena's DatasetGraph
> implementations, or may actually not have anything to do with DatasetGraph
> (see below for why). I will try to give some background information, though.
>
> There are several paths by which DatasetGraphInMemory can perform finds,
> but they come down to two places in the code, QuadTable::find and
> TripleTable::find, and in default operation, the concrete forms:
>
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>
> for Quads and
>
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>
> for Triples. Those methods are reused by all the differently-ordered indexes
> within Hex- or TriTable, each of which will answer a find by selecting an
> appropriately-ordered index based on the fixed and variable slots in the find
> pattern and using the concrete methods above to stream tuples back.
>
> As to why you are seeing your methods called in some places and not in
> others, DatasetGraphBaseFind features methods like findInDftGraph(),
> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are
> the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
> make a selection between those methods; that is done by DatasetGraphBaseFind.
> So that is where you will find the logic that should answer your question.
>
> Can you say a little more about your use case? You seem to have some
> efficient representation in memory of your data (I hope it is in-memory—
> otherwise it is a very bad choice to subclass DSGInMemory) and you want to
> create tuples on the fly as queries are received. That is really not at all
> what DSGInMemory is for (DSGInMemory is using map structures for indexing and
> in default mode, uses persistent data structures to support
> transactionality). I am wondering whether you might not be much better served
> by tapping into Jena at a different place, perhaps implementing the Graph SPI
> directly. Or, if reusing DSGInMemory is the right choice, just implementing
> Quad- and TripleTable and using the constructor DatasetGraphInMemory(final
> QuadTable i, final TripleTable t).
>
> ---
> A. Soroka
> The University of Virginia Library
>
>> On Feb 12, 2016, at 12:58 PM, Dick Murray <[email protected]> wrote:
>>
>> Hi.
>>
>> Does anyone know the "find" paths through DatasetGraphInMemory please?
>>
>> For example if I extend DatasetGraphInMemory and override
>> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on "select
>> * where {?s ?p ?o}" however if I override the other
>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
>> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
>> calling (but as I type I'm guessing it's optimised to return the HexTable
>> nodes...).
>>
>> Would I be better off overriding HexTable and TriTable classes find methods
>> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
>> one of these methods?
>>
>> I need to know the root find methods so that I can shim them to create
>> triples/quads before they perform the find.
>>
>> I need to create Triples/Quads on demand (because a bulk load would create
>> ~100M triples but only ~1000 are ever queried) and the source binary form
>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
>> than quads.
>>
>> Regards Dick Murray.
>