On 03/03/16 10:47, Joint wrote:


Hi Andy.
I implemented the entire SPI at the DatasetGraph and Graph level. It got to the
point where I had overridden more methods than not. In addition, writes are not
supported, and contains() methods which call find(ANY, ANY, ANY) play havoc with
an on-demand triple caching algorithm! ;-) I'm using the TriTable because it
fits, and quads are spoofed via a triple-to-quad iterator.
I have a set of filters and handlers which the find triple is compared against:
the find is either passed straight to the TriTable, if the triple has been
handled before, or it is passed to the appropriate handler, which adds the
triples to the TriTable and then calls the find. As the underlying data is a
tree, a cache depth can be set which allows related triples to be cached. The
cache can also be preloaded with common triples, e.g. ANY rdf:type ?.
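For example, the preload step looks roughly like this (illustrative only:
SdaiSource and mintTriples() stand in for the native accessor, and the exact
transaction discipline on the table is glossed over):

import org.apache.jena.graph.Node;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.sparql.core.mem.TriTable;
import org.apache.jena.vocabulary.RDF;

// Rough illustration of preloading common triples (here, all rdf:type
// statements) into a TriTable before queries arrive. SdaiSource and
// mintTriples() are placeholders, not real project code.
public class CachePreload {
    public static TriTable preload(SdaiSource source) {
        TriTable triples = new TriTable();
        triples.begin(ReadWrite.WRITE);  // TupleTable is a TransactionalComponent
        source.mintTriples(Node.ANY, RDF.type.asNode(), Node.ANY)
              .forEach(triples::add);
        triples.commit();
        triples.end();
        return triples;
    }
}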
Would you consider a generic version for the Jena code base?

Sure - if it is a general capability, then it would be good to have a framework for writing read-only adapters to external data.

<insert stuff about tests documentation etc>

I'm still unclear as to why you aren't hooking into one subclass of DatasetGraph, but maybe the code will make that clearer.

    Andy



Dick

-------- Original message --------
From: Andy Seaborne <[email protected]>
Date: 18/02/2016  6:31 pm  (GMT+00:00)
To: [email protected]
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
   DatasetGraphInMemory

Hi,

I'm not seeing how tapping into the implementation of
DatasetGraphInMemory is going to help (though the details

As well as the DatasetGraphMap approach, one other thought that occurred
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
implementation.

It loads, and clears, the mapped graph on demand, and passes the find()
call through to the now-loaded data.
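Something like this, as a minimal sketch (ExternalSource and loadInto() are
made-up names for whatever does the loading; clearing/eviction is left out):

import java.util.Iterator;
import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphWrapper;
import org.apache.jena.sparql.core.Quad;

// Wrapper that makes sure the requested graph is materialised in the
// underlying dataset before delegating the find to it.
public class OnDemandDatasetGraph extends DatasetGraphWrapper {
    private final DatasetGraph base;
    private final ExternalSource source;   // hypothetical external data accessor

    public OnDemandDatasetGraph(DatasetGraph base, ExternalSource source) {
        super(base);
        this.base = base;
        this.source = source;
    }

    @Override
    public Iterator<Quad> find(Node g, Node s, Node p, Node o) {
        // Load the named graph on first touch, then answer from the
        // now-loaded data. The other find() variants would need the same shim.
        if (g != null && g.isURI() && !base.containsGraph(g))
            source.loadInto(base, g);       // hypothetical loader
        return super.find(g, s, p, o);
    }
}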

        Andy

On 16/02/16 17:42, A. Soroka wrote:
Based on your description the DatasetGraphInMemory would seem to match the
dynamic load requirement. How did you foresee it being loaded? Is there a large
overhead to using the add methods?

No, I certainly did not mean to give that impression, and I don’t think it is 
entirely accurate. DSGInMemory was definitely not at all meant for dynamic 
loading. That doesn’t mean it can’t be used that way, but that was not in the 
design, which assumed that all tuples take about the same amount of time to
access and that all tuples of the same type come from the same implementation
(a QuadTable and a TripleTable).

The overhead of mutating a dataset is mostly inside the implementations of 
TupleTable that are actually used to store tuples. You should be aware that 
TupleTable extends TransactionalComponent, so if you want to use it to create 
some kind of connection to your storage, you will need to make that connection 
fully transactional. That doesn’t sound at all trivial in your case.

At this point it seems to me that extending DatasetGraphMap (and implementing 
GraphMaker and Graph instead of TupleTable) might be a more appropriate design 
for your work. You can put dynamic loading behavior in Graph (or a GraphView 
subtype) just as easily as in TupleTable subtypes. Are there reasons around the 
use of transactionality in your work that demand the particular semantics 
supported by DSGInMemory?
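A rough sketch of what I mean by putting the dynamic loading in Graph
(SdaiModel and mintTriples() are placeholders for your native access; caching
and transactions are ignored here):

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;
import org.apache.jena.graph.impl.GraphBase;
import org.apache.jena.util.iterator.ExtendedIterator;
import org.apache.jena.util.iterator.WrappedIterator;

// Read-only Graph that mints triples from the external source on each find.
// A real version would cache results instead of re-querying every time, and
// a GraphMaker (or DatasetGraphMap subclass) would hand one of these out per
// graph name.
public class SdaiBackedGraph extends GraphBase {
    private final SdaiModel model;   // hypothetical native SDAI handle

    public SdaiBackedGraph(SdaiModel model) {
        this.model = model;
    }

    @Override
    protected ExtendedIterator<Triple> graphBaseFind(Triple pattern) {
        // getMatchX() returns null for ANY slots in the pattern.
        Node s = pattern.getMatchSubject();
        Node p = pattern.getMatchPredicate();
        Node o = pattern.getMatchObject();
        return WrappedIterator.create(model.mintTriples(s, p, o).iterator());
    }
}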

---
A. Soroka
The University of Virginia Library

On Feb 13, 2016, at 5:18 AM, Joint <[email protected]> wrote:



Hi.
The quick full scenario is a distributed DaaS which supports queries, updates,
transforms and bulk loads. Andy Seaborne knows some of the detail because I
spoke to him previously. We achieve multiple writes by having parallel
Datasets, both traditional TDB and on-demand in-memory. Writes are sent to a
free dataset, "free" meaning not currently in a write transaction. That's a
simplistic overview...
Queries are handled by a dataset proxy which builds a dynamic dataset based on
the graph URIs. For example, the graph URI urn:Iungo:all causes the proxy find
method to issue the query to all known Datasets and return the union of the
results. Various dataset proxies exist: some load TDBs, others load TTL files
into graphs, others dynamically create tuples. The common thing is that they
are all presented as Datasets backed by a DatasetGraph. Thus a SPARQL query can
result in multiple Datasets being loaded to satisfy the query.
Nodes can be preloaded, which then load Datasets to satisfy finds. This way the
system can be scaled to handle increased workloads. Specific nodes can also be
targeted at specific hardware.
When a graph URI is encountered the proxy can interpret its structure. So
urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository
foo to be dynamically loaded into memory, along with the quads which are
required to satisfy the find.
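In outline the proxy find looks something like this (illustrative names only;
Registry, allDatasets() and datasetFor() are placeholders, not the real code):

import java.util.Iterator;
import org.apache.jena.atlas.iterator.Iter;
import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.Quad;

// Simplified illustration of the proxy dispatch described above; Registry and
// its methods are placeholders, not Jena (or Iungo) API.
public class ProxyFind {
    private final Registry registry;   // hypothetical registry of known Datasets

    public ProxyFind(Registry registry) { this.registry = registry; }

    public Iterator<Quad> find(Node g, Node s, Node p, Node o) {
        if (g != null && g.isURI() && "urn:Iungo:all".equals(g.getURI())) {
            // Fan the pattern out to every known dataset and union the results.
            return registry.allDatasets().stream()
                    .flatMap(dsg -> Iter.asStream(dsg.find(Node.ANY, s, p, o)))
                    .iterator();
        }
        // Otherwise interpret the graph URI (e.g. urn:Iungo:sdai/foo/bar) and
        // route to the dataset responsible for it, loading it if necessary.
        return registry.datasetFor(g).find(g, s, p, o);
    }
}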
Typically a group of people will be working on a set of data, so the first
query will load the dataset and then it will be accessed multiple times. There
will be an initial dynamic load of data which will tail off, with some
additional loading over time.
Based on your description the DatasetGraphInMemory would seem to match the
dynamic load requirement. How did you foresee it being loaded? Is there a large
overhead to using the add methods?
A typical scenario would be to search all SDAI repositories for some key
information, then load detailed information in some of them, continuing to
drill down.
Hope this helps.
I'm going to extend the hex and tri tables and run some tests. I've already
shimmed the DGTriplesQuads, so the actual caching code already exists and
should be easy to hook on.
Dick

-------- Original message --------
From: "A. Soroka" <[email protected]>
Date: 12/02/2016  11:07 pm  (GMT+00:00)
To: [email protected]
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
DatasetGraphInMemory

Okay, I’m more confident at this point that you’re not well served by 
DatasetGraphInMemory, which has very strong assumptions about the speedy 
reachability of data. DSGInMemory was built for situations when all of the data 
is in core memory and multithreaded access is important. If you have a lot of 
core memory and can load the data fully, you might want to use it, but that 
doesn’t sound at all like your case. Otherwise, as far as what the right 
extension point is, I will need to defer to committers or more experienced 
devs, but I think you may need to look at DatasetGraph from a more
close-to-the-metal point of view. TDB extends DatasetGraphTriplesQuads
directly, for example.

Can you tell us a bit more about your full scenario? I don’t know much about 
STEP (sorry if others do)— is there a canonical RDF formulation? What kinds of 
queries are you going to be using with this data? How quickly are users going 
to need to switch contexts between datasets?

---
A. Soroka
The University of Virginia Library

On Feb 12, 2016, at 2:44 PM, Joint <[email protected]> wrote:



Thanks for the fast response!
I have a set of disk-based binary SDAI repositories which are based on
ISO 10303 parts 11/21/25/27, otherwise known as the EXPRESS/STEP/SDAI parts. In
particular my files are IFC2x3 files, which can be +1GB. However, after
processing into an SDAI binary I typically see a size reduction, e.g. a 1.4GB
STEP file becomes a 1GB SDAI repository. If I convert the STEP file into TDB I
get +100M quads and a 50GB folder. Multiply that by 1000s of similarly sized
STEP files...
Typically only a small subset of the STEP file needs to be queried, but
sometimes other parts need to be queried. Hence the on-demand caching and
DatasetGraphInMemory. The aim is that in the find methods I check a cache and,
in the case of a cache miss, call the native SDAI find methods based on the
node URIs, calling the add methods for the minted tuples and then passing the
call on to the super find. The underlying SDAI repositories are static, so once
a subject is cached no other work is required.
As the DatasetGraphInMemory is commented as providing very fast quad and triple
access, it seemed a logical place to extend. The shim cache would be set to
expire entries and limit the total number of tuples per repository. This is
currently deployed on a device with 256GB of RAM.
In the bigger picture I have a service very similar to Fuseki which allows
SPARQL requests to be made against Datasets which are backed by either TDB or
the SDAI cache.
What was DatasetGraphInMemory created for...? ;-)
Dick

-------- Original message --------
From: "A. Soroka" <[email protected]>
Date: 12/02/2016  6:21 pm  (GMT+00:00)
To: [email protected]
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
DatasetGraphInMemory

I wrote the DatasetGraphInMemory code, but I suspect your question may be 
better answered by other folks who are more familiar with Jena's DatasetGraph 
implementations, or may actually not have anything to do with DatasetGraph (see 
below for why). I will try to give some background information, though.

There are several paths by which DatasetGraphInMemory can be performing
finds, but they come down to two places in the code, QuadTable::find and
TripleTable::find, and in default operation, the concrete forms:

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100

for Quads and

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99

for Triples. Those methods are reused by all the differently-ordered indexes 
within Hex- or TriTable, each of which will answer a find by selecting an 
appropriately-ordered index based on the fixed and variable slots in the find 
pattern and using the concrete methods above to stream tuples back.

As to why you are seeing your methods called in some places and not in others,
DatasetGraphBaseFind features methods like findInDftGraph(),
findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are the
methods that DatasetGraphInMemory is implementing. DSGInMemory does not make a
selection between those methods — that is done by DatasetGraphBaseFind. So that
is where you will find the logic that should answer your question.
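Very schematically (a paraphrase, not the actual source; details such as the
union graph and exactly how the default graph is handled are elided):

// Roughly what DatasetGraphBaseFind.find(g, s, p, o) does with the graph slot;
// read the real class for the exact rules.
if (g == null || Quad.isDefaultGraph(g))
    return findInDftGraph(s, p, o);              // default graph pattern
if (Node.ANY.equals(g))
    return findInAnyNamedGraphs(s, p, o);        // GRAPH ?g { ... }
return findInSpecificNamedGraph(g, s, p, o);     // one concrete graph node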

Can you say a little more about your use case? You seem to have some efficient 
representation in memory of your data (I hope it is in-memory— otherwise it is 
a very bad choice to subclass DSGInMemory) and you want to create tuples on the 
fly as queries are received. That is really not at all what DSGInMemory is for
(DSGInMemory uses map structures for indexing and, in default mode, uses
persistent data structures to support transactionality). I am wondering whether 
you might not be much better served by tapping into Jena at a different place, 
perhaps implementing the Graph SPI directly. Or, if reusing DSGInMemory is the 
right choice, just implementing Quad- and TripleTable and using the constructor 
DatasetGraphInMemory(final QuadTable i, final TripleTable t).
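For the last option, a very rough sketch (this assumes TriTable can be
subclassed as-is; SdaiSource, alreadyCached() and mintTriples() are
placeholders, and transaction handling is only flagged, not solved):

import java.util.stream.Stream;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;
import org.apache.jena.sparql.core.mem.DatasetGraphInMemory;
import org.apache.jena.sparql.core.mem.HexTable;
import org.apache.jena.sparql.core.mem.TriTable;

// A TriTable that mints triples from the native store on a cache miss before
// answering the find.
public class OnDemandTriTable extends TriTable {
    private final SdaiSource source;   // hypothetical native SDAI accessor

    public OnDemandTriTable(SdaiSource source) {
        this.source = source;
    }

    @Override
    public Stream<Triple> find(Node s, Node p, Node o) {
        if (!source.alreadyCached(s, p, o)) {
            // Add the minted tuples to the in-memory indexes. Caveat: with the
            // default persistent-map tables, adds made during a read
            // transaction are unlikely to survive past it, so the transaction
            // story needs real thought here.
            source.mintTriples(s, p, o).forEach(this::add);
        }
        return super.find(s, p, o);
    }
}

// Usage: new DatasetGraphInMemory(new HexTable(), new OnDemandTriTable(source));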

---
A. Soroka
The University of Virginia Library

On Feb 12, 2016, at 12:58 PM, Dick Murray <[email protected]> wrote:

Hi.

Does anyone know the "find" paths through DatasetGraphInMemory please?

For example, if I extend DatasetGraphInMemory and override
DatasetGraphBaseFind.find(Node, Node, Node, Node), a breakpoint in it is hit on
"select * where {?s ?p ?o}"; however, if I override the other
DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p ?o}}"
does not trigger a breakpoint, i.e. I don't know what method it's calling (but
as I type I'm guessing it's optimised to return the HexTable nodes...).

Would I be better off overriding the HexTable and TriTable classes' find
methods when I create the DatasetGraphInMemory? Are all finds guaranteed to end
in one of these methods?

I need to know the root find methods so that I can shim them to create
triples/quads before they perform the find.

I need to create Triples/Quads on demand (because a bulk load would create
~100M triples but only ~1000 are ever queried) and the source binary form
is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
than quads.

Regards Dick Murray.




