Just a point of info: I'm _pretty_ sure that we're talking about TripleTable, 
not TriTable. TriTable is an impl class (implementing TripleTable) that uses 
three TripleTables to index, well, triples. TripleTable (and its sibling, 
QuadTable) are the interfaces that, I think, we are interested in possibly 
generalizing and making more public.

As Andy knows, I tried hard to unify Triple- and QuadTable under a supertype 
TupleTable, but the fact is that Java doesn't really do variable arity very 
well and we didn't want to mess with very core types like Quad or Triple, so 
the method dealing with tuples by elements (::find) stayed in the 
specialization, but methods dealing with the tuple as a whole (e.g. ::add) got 
pushed up. I think Andy has done a nice job below bringing everything together 
in a simple, straightforward way. org.apache.jena.sparql.core.mem could be 
rewritten very quickly to use this instead of the current types, if that's any 
evidence.
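
To make that split concrete, here is a rough sketch. It is not the actual Jena source: the Triple and Quad records (plain strings standing in for Node, null standing in for Node.ANY) and SimpleTripleTable are made up for illustration. The only point is that whole-tuple methods fit in a generic supertype, while find needs a fixed arity and so stays in each specialization.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Stream;

class TupleTableSketch {
    public static void main(String[] args) {
        SimpleTripleTable triples = new SimpleTripleTable();
        triples.add(new Triple("s1", "p", "o1"));
        triples.add(new Triple("s2", "p", "o2"));
        // null plays the role of Node.ANY in this sketch
        System.out.println(triples.find(null, "p", null).count()); // prints 2
    }
}

// Hypothetical stand-ins for Jena's Triple and Quad; strings stand in for Node.
record Triple(String s, String p, String o) {}
record Quad(String g, String s, String p, String o) {}

/** Whole-tuple operations do not depend on arity, so they can live in the supertype. */
interface TupleTable<T> {
    void add(T tuple);
    void delete(T tuple);
}

/** Element-wise matching needs one parameter per slot, so find stays here. */
interface TripleTable extends TupleTable<Triple> {
    Stream<Triple> find(String s, String p, String o);
}

interface QuadTable extends TupleTable<Quad> {
    Stream<Quad> find(String g, String s, String p, String o);
}

/** Toy unindexed implementation, only here to exercise the interfaces. */
class SimpleTripleTable implements TripleTable {
    private final Set<Triple> tuples = new LinkedHashSet<>();

    @Override public void add(Triple t) { tuples.add(t); }
    @Override public void delete(Triple t) { tuples.remove(t); }

    @Override public Stream<Triple> find(String s, String p, String o) {
        return tuples.stream()
            .filter(t -> matches(s, t.s()) && matches(p, t.p()) && matches(o, t.o()));
    }

    private static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }
}
```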


---
A. Soroka
The University of Virginia Library

> On Mar 10, 2016, at 7:08 AM, Andy Seaborne <[email protected]> wrote:
> 
> Hi Dick,
> 
> Thanks for the details.
> 
> So TriTable is used as the internal implementation of a caching read-only 
> graph and you're using the loop form for GRAPH (and often the loop is one URI 
> - i.e. directed to one part of the data).  Using TriTable is because it's a 
> convenient triple storage for the use case.
> 
> The two interesting pieces to Jena:
> 
> 1/ support for writing dynamic adapters
> 
> 2/ a graph (DatasetGraph) implementation that more clearly has an interface 
> for storage.
> 
> 
> On the latter: I've come across this before and sketched this interface.
> 
> It's nothing more than a first pass sketch.  Is this the sort of thing that 
> might work for your use case? (a graph storage version with quads over the 
> top as a subcase):
> 
> interface StorageRDF {
>    default void add(Triple triple) { .... }
>    default void add(Quad quad)     { .... }
> 
>    default void delete(Triple triple)  { .... }
>    default void delete(Quad quad)      { .... }
> 
>    void add(Node s, Node p, Node o) ;
>    void add(Node g, Node s, Node p, Node o) ;
> 
>    void delete(Node s, Node p, Node o) ;
>    void delete(Node g, Node s, Node p, Node o) ;
> 
>    /** Delete all triples matching a {@code find}-like pattern */
>    void removeAll(Node s, Node p, Node o) ;
>    /** Delete all quads matching a {@code find}-like pattern */
>    void removeAll(Node g, Node s, Node p, Node o) ;
> 
>    // NB Quads
>    Stream<Quad>   findDftGraph(Node s, Node p, Node o) ;
>    Stream<Quad>   findUnionGraph(Node s, Node p, Node o) ;
>    Stream<Quad>   find(Node g, Node s, Node p, Node o) ;
>    // For findUnion.
>    Stream<Quad>   findDistinct(Node g, Node s, Node p, Node o) ;
> 
>    // triples
>    Stream<Triple> find(Node s, Node p, Node o) ;
> 
> //    default Stream<Triple> find(Node s, Node p, Node o) {
> //        return findDftGraph(s,p,o).map(Quad::asTriple) ;
> //    }
> 
> //    Iterator<Quad>   findUnionGraph(Node s, Node p, Node o) ;
> //    Iterator<Quad>   find(Node g, Node s, Node p, Node o) ;
> 
> 
>    // contains
> 
>    default boolean contains(Node s, Node p, Node o)
>    { return find(s,p,o).findAny().isPresent() ; }
>    default boolean contains(Node g, Node s, Node p, Node o)
>    { return find(g,s,p,o).findAny().isPresent() ; }
> 
>    // Prefixes ??
> }
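
A cut-down toy along the lines of the sketch above, to show how the default methods could lean on a small abstract core. Everything here is an assumption made for illustration: MiniStorage, MemStorage, the DEFAULT_GRAPH marker, routing triples to that marker, and the bodies of the add defaults (which are elided in the sketch) are all made up. Only the contains-via-find bodies and the Quad::asTriple mapping are taken from it. Strings stand in for Node and null stands in for Node.ANY.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class StorageSketch {
    public static void main(String[] args) {
        MemStorage store = new MemStorage();
        store.add("s", "p", "o");          // lands in the default graph
        store.add("g1", "s", "p", "o2");   // lands in named graph g1
        System.out.println(store.contains("s", "p", null));         // true
        System.out.println(store.contains("s", "p", "o2"));         // false: o2 is only in g1
        System.out.println(store.contains("g1", null, null, "o2")); // true
    }
}

record Triple(String s, String p, String o) {}

record Quad(String g, String s, String p, String o) {
    Triple asTriple() { return new Triple(s, p, o); }
}

interface MiniStorage {
    // Hypothetical marker for the default graph, used only in this sketch.
    String DEFAULT_GRAPH = "urn:x-sketch:default-graph";

    // The abstract core: one add and one find over quads.
    void add(String g, String s, String p, String o);
    Stream<Quad> find(String g, String s, String p, String o);

    // Assumed delegation for the convenience forms.
    default void add(String s, String p, String o) { add(DEFAULT_GRAPH, s, p, o); }
    default void add(Triple t) { add(t.s(), t.p(), t.o()); }
    default void add(Quad q) { add(q.g(), q.s(), q.p(), q.o()); }

    // Triples are the default-graph quads with the graph slot dropped.
    default Stream<Triple> find(String s, String p, String o) {
        return find(DEFAULT_GRAPH, s, p, o).map(Quad::asTriple);
    }

    default boolean contains(String s, String p, String o) {
        return find(s, p, o).findAny().isPresent();
    }
    default boolean contains(String g, String s, String p, String o) {
        return find(g, s, p, o).findAny().isPresent();
    }
}

class MemStorage implements MiniStorage {
    private final List<Quad> quads = new ArrayList<>();

    @Override public void add(String g, String s, String p, String o) {
        Quad q = new Quad(g, s, p, o);
        if (!quads.contains(q)) quads.add(q);   // keep set semantics
    }

    @Override public Stream<Quad> find(String g, String s, String p, String o) {
        return quads.stream().filter(q ->
            matches(g, q.g()) && matches(s, q.s()) && matches(p, q.p()) && matches(o, q.o()));
    }

    private static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }
}
```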
> 
> 
> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2
> also has the companion DatasetGraphStorage.
> 
>    Andy
> 
> 
> 
> On 04/03/16 12:03, Dick Murray wrote:
>> LOL. The perils of a succinct update with no detail!
>> 
>> I understand the Jena SPI supports read/writes via transactions and I also
>> know that the wrapper classes provide a best effort for some of the
>> overridden methods which do not always sit well when materializing triples.
>> For example DatasetGraphBase provides public boolean containsGraph(Node
>> graphNode) {return contains(graphNode, Node.ANY, Node.ANY, Node.ANY);}
>> which results in a call to DatasetGraphBaseFind public Iterator<Quad>
>> find(Node g, Node s, Node p, Node o) which might end up with something
>> being called in DatasetGraphInMemory depending on what has been extended
>> and overridden. This causes a problem for me because I shim the finds to
>> decide whether the triples have been materialized before calling the
>> overridden find. After extending DatasetGraphTriples and
>> DatasetGraphInMemory I realised that I had overridden most of the methods
>> so I stopped and implemented DatasetGraph and Transactional.
>> 
>> In my scenario the underlying data (a vendor agnostic format to get
>> AutoCAD, Bentley, etc to work together) is never changed so the
>> DatasetGraph need not support writes. Whilst we need to provide semantic
>> access to these files, they result in ~100M triples each if transformed,
>> there are 1000's of files, they can change multiple times per day and the
>> various disciplines typically only require a subset of triples.
>> 
>> That said in my DatasetGraph implementation if you call
>> begin(ReadWrite.WRITE) it throws a UOE. The same is true for the Graph
>> implementation in that it does not support external writes (throws UOE) but
>> does implement writes internally (via TriTable) because it needs to write
>> the materialized triples to answer the find.
>> 
>> So if we take
>> 
>> select ?s
>> where {graph <urn:iungo:iso/10303/22/repository/r/model/m> {?s a
>> <urn:iungo:iso/10303/11/schema/s/entity/e>}}
>> 
>> Jena via the SPARQL query engine will perform the following abridged
>> process.
>> 
>>    - Jena begins a DG read transaction.
>>    - Jena calls DG find(<urn:iungo:iso/10303/22/repository/r/model/m>, ANY,
>>    a <urn:iungo:iso/10303/11/schema/s/entity/e>).
>>    - DG will;
>>       - check if the repository r has been loaded, i.e. matching the
>>       repository name URI spec fragment to a repository file on disk and
>>       loading it into the SDAI session.
>>       - check if the model m has been loaded, i.e. matching the model name
>>       URI spec fragment to a repository model and loading it into the SDAI
>>       session.
>>          - If we have just loaded the SDAI model check if there is any pre
>>          caching to be done which is just a set of find triples which are
>>          handled as per the normal find detailed following.
>>       - We now have a G which wraps the SDAI model and uses TriTable to
>>       hold materialized triples.
>>    - DG will now call G.find(ANY, a
>>    <urn:iungo:iso/10303/11/schema/s/entity/e>).
>>    - G will check the find triple against a set of already materialized
>>    find triples and if it misses;
>>       - G will search a set of triple handles which know how to materialize
>>       triples for a given find triple and if found;
>>          - G begins a TriTable write transaction and for {ANY, a
>>          <urn:iungo:iso/10303/11/schema/s/entity/e>} (i.e the DG & G are
>>          READ but the G TriTable is WRITE);
>>             - Check the find triples again we might have been in a race for
>>             the find triple and lost...
>>             - Load the correct Java class for entity e which involves
>>             minting the FQCN using the schema s and entity e e.g. ifc2x3
>>             and ifcslab become org.jsdai.ifc2x3.ifcslab.
>>             - Use this to call the SDAI method findInstances(Class<?
>>             extends Entity> entityClass) which returns zero or more SDAI
>>             entities from which we;
>>                - Query the ifc2x3 schema to list the explicit Entity
>>                attributes and for each we add a triple to TriTable e.g.
>>                ifcslab:ifcorganization =
>>                {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>>                <urn:iungo:iso/10303/11/schema/ifc2x3/entity/ifcslab/attribute/ifcorganization>
>>                <urn:iungo:iso/10303/21/repository/r/model/m/instance/1>}
>>                - In addition we add the triple
>>                {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100> a
>>                <urn:iungo:iso/10303/11/schema/s/entity/e>}.
>>                - If we are creating linked triples (i.e. max depth > 1)
>>                then for each attribute which has a SDAI entity instance
>>                value call the appropriate handle to create the triples.
>>             - G commits the TriTable write transaction (make the triples
>>             visible before we update the find triples!).
>>          - G updates the find triples to include;
>>             - {ANY, a <urn:iungo:iso/10303/11/schema/s/entity/e>}
>>             - {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
>>             ANY ANY}
>>             - Repeat the above for any linked triples created.
>>          - The TriTable now contains the triples required to answer the
>>          find triple.
>>    - G will return TriTable.find(ANY, a
>>    <urn:iungo:iso/10303/11/schema/s/entity/e>)
>>    - Jena ends the DG read transaction.
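
A compressed sketch of the find path described in the steps above, as I read it. None of these types are Jena's or SDAI's: Pattern, Handle, TypeHandle and OnDemandGraph are hypothetical, a concurrent set stands in for TriTable, and a synchronized block stands in for the TriTable write transaction. The linked-triple (depth > 1) step and the extra {instance ANY ANY} bookkeeping are left out to keep it short.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

class OnDemandSketch {
    public static void main(String[] args) {
        TypeHandle handle = new TypeHandle();
        OnDemandGraph g = new OnDemandGraph(List.of(handle));
        System.out.println(g.find(null, "rdf:type", "entity/e").count()); // 1, via the handle
        System.out.println(g.find(null, "rdf:type", "entity/e").count()); // 1, from the table
        System.out.println(handle.calls);                                 // 1
    }
}

record Triple(String s, String p, String o) {}

/** A find triple; null plays the role of ANY. */
record Pattern(String s, String p, String o) {
    boolean matches(Triple t) {
        return (s == null || s.equals(t.s()))
            && (p == null || p.equals(t.p()))
            && (o == null || o.equals(t.o()));
    }
}

/** Knows how to materialize triples for some find patterns. */
interface Handle {
    boolean accepts(Pattern pattern);
    List<Triple> materialize(Pattern pattern);
}

/** Toy handle: mints one instance for any {ANY, rdf:type, entity} pattern. */
class TypeHandle implements Handle {
    int calls = 0;
    @Override public boolean accepts(Pattern p) {
        return p.s() == null && "rdf:type".equals(p.p()) && p.o() != null;
    }
    @Override public List<Triple> materialize(Pattern p) {
        calls++;
        return List.of(new Triple("instance/100", "rdf:type", p.o()));
    }
}

class OnDemandGraph {
    private final Set<Pattern> materialized = ConcurrentHashMap.newKeySet();
    private final Set<Triple> table = ConcurrentHashMap.newKeySet(); // stands in for TriTable
    private final List<Handle> handles;
    private final Object writeLock = new Object();

    OnDemandGraph(List<Handle> handles) { this.handles = handles; }

    Stream<Triple> find(String s, String p, String o) {
        if (s == null && p == null && o == null)
            throw new UnsupportedOperationException("find(ANY, ANY, ANY)");
        Pattern pattern = new Pattern(s, p, o);
        if (!materialized.contains(pattern)) {
            for (Handle h : handles) {
                if (h.accepts(pattern)) { materialize(h, pattern); break; }
            }
        }
        // Handle miss or already materialized: just ask the table.
        return table.stream().filter(pattern::matches);
    }

    private void materialize(Handle h, Pattern pattern) {
        synchronized (writeLock) {                      // plays the write transaction
            if (materialized.contains(pattern)) return; // we lost the race
            table.addAll(h.materialize(pattern));       // make the triples visible first
            materialized.add(pattern);                  // then record the find triple
        }
    }
}
```

The ordering inside materialize mirrors the "commit before updating the find triples" remark: a reader that sees the pattern recorded can rely on the triples already being in the table.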
>> 
>> 
>> Some find triples will result in the appropriate handle being called
>> (handle hit) which will create triples. Others will handle miss and be
>> passed on to the TriTable find (e.g. no triples created and TriTable will
>> return nothing). A few will result in a UOE, {ANY, ANY, ANY} being an
>> example, because it is ambiguous: does it mean create all of the triples
>> (+100M) or all of the currently created triples (which relies on having
>> already queried what you need)? Currently we only UOE on {ANY ANY ANY}, and
>> is it really useful to ask this find?
>> 
>> Hope that clears up the "writes are not supported" (the underlying data is
>> read only) and why the TupleTable subtypes are not problematic. I could
>> have held the created triples per find triple but that wouldn't scale with
>> duplication plus why recreate the wheel when if I'm not mistaken TriTable
>> uses the dexx collection giving subsequent HAMT advantages which is what a
>> high performance in memory implementation requires. The solution is working
>> and compared to a fully transformed TDB is giving the correct results. To
>> do might include timing out the G when they have not been accessed for a
>> period of time...
>> 
>> Finally having written the wrapper I thought it wouldn't be used anywhere
>> else but subsequently it was used to abstract an existing system where
>> adhoc semantic access was required and it's lined up to do a similar task on
>> two other data silos. Hence the question to Andy regarding a Jena cached
>> SPI package.
>> 
>> Thanks again for your help Adam/Andy.
>> 
>> Dick.
>> 
>> 
>> 
>> On 4 March 2016 at 01:36, A. Soroka <[email protected]> wrote:
>> 
>>> I’m confused about two of your points here. Let me separate them out so we
>>> can discuss them easily.
>>> 
>>> 1) "writes are not supported”:
>>> 
>>> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
>>> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
>>> and DatasetGraph are the basic abstractions implemented by Jena’s own
>>> out-of-the-box implementations of RDF storage. Can you explain what you
>>> mean by this?
>>> 
>>> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
>>> triple caching algorithm”:
>>> 
>>> The subtypes of TupleTable with which you are working have exactly the
>>> same kinds of find() methods. Why are they not problematic in that context?
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>>> On Mar 3, 2016, at 5:47 AM, Joint <[email protected]> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Andy.
>>>> I implemented the entire SPI at the DatasetGraph and Graph level. It got
>>> to the point where I had overridden more methods than not. In addition
>>> writes are not supported and contains methods which call find(ANY, ANY,
>>> ANY) play havoc with an on demand triple caching algorithm! ;-) I'm using
>>> the TriTable because it fits and quads are spoofed via triple to quad
>>> iterator.
>>>> I have a set of filters and handles which the find triple is compared
>>> against and either passed straight to the TriTable if the triple has been
>>> handled before or it's passed to the appropriate handle which adds the
>>> triples to the TriTable then calls the find. As the underlying data is a
>>> tree a cache depth can be set which allows related triples to be cached.
>>> Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
>>>> Would you consider a generic version for the Jena code base?
>>>> 
>>>> 
>>>> Dick
>>>> 
>>>> -------- Original message --------
>>>> From: Andy Seaborne <[email protected]>
>>>> Date: 18/02/2016  6:31 pm  (GMT+00:00)
>>>> To: [email protected]
>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>>>  DatasetGraphInMemory
>>>> 
>>>> Hi,
>>>> 
>>>> I'm not seeing how tapping into the implementation of
>>>> DatasetGraphInMemory is going to help (through the details
>>>> 
>>>> As well as the DatasetGraphMap approach, one other thought that occurred
>>>> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
>>>> implementation.
>>>> 
>>>> It loads, and clears, the mapped graph on-demand, and passes the find()
>>>> call through to the now-setup data.
>>>> 
>>>>       Andy
>>>> 
>>>> On 16/02/16 17:42, A. Soroka wrote:
>>>>>> Based on your description the DatasetGraphInMemory would seem to match
>>> the dynamic load requirement. How did you foresee it being loaded? Is there
>>> a large over head to using the add methods?
>>>>> 
>>>>> No, I certainly did not mean to give that impression, and I don’t think
>>> it is entirely accurate. DSGInMemory was definitely not at all meant for
>>> dynamic loading. That doesn’t mean it can’t be used that way, but that was
>>> not in the design, which assumed that all tuples take about the same amount
>>> of time to access and that all of the same type are coming from the same
>>> implementation (in a QuadTable and a TripleTable).
>>>>> 
>>>>> The overhead of mutating a dataset is mostly inside the implementations
>>> of TupleTable that are actually used to store tuples. You should be aware
>>> that TupleTable extends TransactionalComponent, so if you want to use it to
>>> create some kind of connection to your storage, you will need to make that
>>> connection fully transactional. That doesn’t sound at all trivial in your
>>> case.
>>>>> 
>>>>> At this point it seems to me that extending DatasetGraphMap (and
>>> implementing GraphMaker and Graph instead of TupleTable) might be a more
>>> appropriate design for your work. You can put dynamic loading behavior in
>>> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
>>> Are there reasons around the use of transactionality in your work that
>>> demand the particular semantics supported by DSGInMemory?
>>>>> 
>>>>> ---
>>>>> A. Soroka
>>>>> The University of Virginia Library
>>>>> 
>>>>>> On Feb 13, 2016, at 5:18 AM, Joint <[email protected]> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi.
>>>>>> The quick full scenario is a distributed DaaS which supports queries,
>>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail
>>> because I spoke to him previously. We achieve multiple writes by having
>>> parallel Datasets, both traditional TDB and on demand in memory. Writes are
>>> sent to a free dataset, free being not in a write transaction. That's a
>>> simplistic overview...
>>>>>> Queries are handled by a dataset proxy which builds a dynamic dataset
>>> based on the graph URIs. For example the graph URI urn:Iungo:all causes the
>>> proxy find method to issue the query to all known Datasets and return the
>>> union of results. Various dataset proxies exist, some load TDBs, others
>>> load TTL files into graphs, others dynamically create tuples. The common
>>> thing being they are all presented as Datasets backed by DatasetGraph. Thus
>>> a SPARQL query can result in multiple Datasets being loaded to satisfy the
>>> query.
>>>>>> Nodes can be preloaded which then load Datasets to satisfy finds. This
>>> way the system can be scaled to handle increased work loads. Also specific
>>> nodes can be targeted to specific hardware.
>>>>>> When a graph URI is encountered the proxy can interpret its
>>> structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
>>> SDAI repository foo to be dynamically loaded into memory along with the
>>> quads which are required to satisfy the find.
>>>>>> Typically a group of people will be working on a set of data so the
>>> first to query will load the dataset then it will be accessed multiple
>>> times. There will be an initial dynamic load of data which will tail off
>>> with some additional loading over time.
>>>>>> Based on your description the DatasetGraphInMemory would seem to match
>>> the dynamic load requirement. How did you foresee it being loaded? Is there
>>> a large over head to using the add methods?
>>>>>> A typical scenario would be to search all SDAI repositories for some
>>> key information then load detailed information in some, continuing to drill
>>> down.
>>>>>> Hope this helps.
>>>>>> I'm going to extend the hex and tri tables and run some tests. I've
>>> already shimmed the DGTriplesQuads so the actual caching code already exists
>>> and should be easy to hook on.
>>>>>> Dick
>>>>>> 
>>>>>> -------- Original message --------
>>>>>> From: "A. Soroka" <[email protected]>
>>>>>> Date: 12/02/2016  11:07 pm  (GMT+00:00)
>>>>>> To: [email protected]
>>>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>> DatasetGraphInMemory
>>>>>> 
>>>>>> Okay, I’m more confident at this point that you’re not well served by
>>> DatasetGraphInMemory, which has very strong assumptions about the speedy
>>> reachability of data. DSGInMemory was built for situations when all of the
>>> data is in core memory and multithreaded access is important. If you have a
>>> lot of core memory and can load the data fully, you might want to use it,
>>> but that doesn’t sound at all like your case. Otherwise, as far as what the
>>> right extension point is, I will need to defer to committers or more
>>> experienced devs, but I think you may need to look at DatasetGraph from a
>>> more close-to-the-metal point. TDB extends DatasetGraphTriplesQuads
>>> directly, for example.
>>>>>> 
>>>>>> Can you tell us a bit more about your full scenario? I don’t know much
>>> about STEP (sorry if others do)— is there a canonical RDF formulation? What
>>> kinds of queries are you going to be using with this data? How quickly are
>>> users going to need to switch contexts between datasets?
>>>>>> 
>>>>>> ---
>>>>>> A. Soroka
>>>>>> The University of Virginia Library
>>>>>> 
>>>>>>> On Feb 12, 2016, at 2:44 PM, Joint <[email protected]> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks for the fast response!
>>>>>>>     I have a set of disk based binary SDAI repositories which are
>>> based on ISO10303 parts 11/21/25/27 otherwise known as the
>>> EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can
>>> be +1Gb. However after processing into a SDAI binary I typically see a size
>>> reduction e.g. 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert
>>> the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by
>>> 1000's of similar sized STEP files...
>>>>>>> Typically only a small subset of the STEP file needs to be queried
>>> but sometimes other parts need to be queried. Hence the on demand caching
>>> and DatasetGraphInMemory. The aim is that in the find methods I check a
>>> cache and call the native SDAI find methods based on the node URI's in the
>>> case of a cache miss, calling the add methods for the minted tuples, then
>>> passing on the call to the super find. The underlying SDAI repositories are
>>> static so once a subject is cached no other work is required.
>>>>>>> As the DatasetGraphInMemory is commented as very fast quad and triple
>>> access it seemed a logical place to extend. The shim cache would be set to
>>> expire entries and limit the total number of tuples per repository. This
>>> is currently deployed on a 256Gb ram device.
>>>>>>> In the bigger picture I have a service very similar to Fuseki which
>>> allows SPARQL requests to be made against Datasets which are either TDB or
>>> SDAI cache backed.
>>>>>>> What was DatasetGraphInMemory created for..? ;-)
>>>>>>> Dick
>>>>>>> 
>>>>>>> -------- Original message --------
>>>>>>> From: "A. Soroka" <[email protected]>
>>>>>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
>>>>>>> To: [email protected]
>>>>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>> DatasetGraphInMemory
>>>>>>> 
>>>>>>> I wrote the DatasetGraphInMemory  code, but I suspect your question
>>> may be better answered by other folks who are more familiar with Jena's
>>> DatasetGraph implementations, or may actually not have anything to do with
>>> DatasetGraph (see below for why). I will try to give some background
>>> information, though.
>>>>>>> 
>>>>>>> There are several paths by which DatasetGraphInMemory can be
>>> performing finds, but they come down to two places in the code,
>>> QuadTable::find and TripleTable::find, and in default operation, the
>>> concrete forms:
>>>>>>> 
>>>>>>> 
>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>>>>>>> 
>>>>>>> for Quads and
>>>>>>> 
>>>>>>> 
>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>>>>>>> 
>>>>>>> for Triples. Those methods are reused by all the differently-ordered
>>> indexes within Hex- or TriTable, each of which will answer a find by
>>> selecting an appropriately-ordered index based on the fixed and variable
>>> slots in the find pattern and using the concrete methods above to stream
>>> tuples back.
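
One way to picture "selecting an appropriately-ordered index based on the fixed and variable slots" is the sketch below. It is a guess at the idea, not the actual Jena code: Slot, IndexForm, fixedPrefix and chooseFor are names made up here, and the rule used (prefer the ordering whose leading slots are all fixed, for the longest run) is an assumption about what "appropriate" means.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class IndexChoiceSketch {
    public static void main(String[] args) {
        // find(ANY, p, o): predicate and object are fixed, subject is variable
        System.out.println(IndexForm.chooseFor(EnumSet.of(Slot.PREDICATE, Slot.OBJECT))); // POS
    }
}

enum Slot { SUBJECT, PREDICATE, OBJECT }

/** Three differently-ordered indexes over the same triples. */
enum IndexForm {
    SPO(Slot.SUBJECT, Slot.PREDICATE, Slot.OBJECT),
    POS(Slot.PREDICATE, Slot.OBJECT, Slot.SUBJECT),
    OSP(Slot.OBJECT, Slot.SUBJECT, Slot.PREDICATE);

    private final List<Slot> order;

    IndexForm(Slot... order) { this.order = List.of(order); }

    /** Number of leading slots of this ordering that the find pattern fixes. */
    int fixedPrefix(Set<Slot> fixed) {
        int n = 0;
        for (Slot slot : order) {
            if (!fixed.contains(slot)) break;
            n++;
        }
        return n;
    }

    /** Picks the ordering with the longest fixed prefix; ties fall back to the earliest form. */
    static IndexForm chooseFor(Set<Slot> fixed) {
        IndexForm best = SPO;
        for (IndexForm form : values()) {
            if (form.fixedPrefix(fixed) > best.fixedPrefix(fixed)) best = form;
        }
        return best;
    }
}
```

A longer fixed prefix means the lookup can walk further into a nested map before it has to start streaming, which is why a pattern with only the object fixed would go to OSP rather than SPO.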
>>>>>>> 
>>>>>>> As to why you are seeing your methods called in some places and not
>>> in others, DatasetGraphBaseFind features methods like findInDftGraph(),
>>> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are
>>> the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
>>> make a selection between those methods— that is done by
>>> DatasetGraphBaseFind. So that is where you will find the logic that should
>>> answer your question.
>>>>>>> 
>>>>>>> Can you say a little more about your use case? You seem to have some
>>> efficient representation in memory of your data (I hope it is in-memory—
>>> otherwise it is a very bad choice to subclass DSGInMemory) and you want to
>>> create tuples on the fly as queries are received. That is really not at all
>>> what DSGInMemory is for (DSGInMemory is using map structures for indexing
>>> and in default mode, uses persistent data structures to support
>>> transactionality). I am wondering whether you might not be much better
>>> served by tapping into Jena at a different place, perhaps implementing the
>>> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
>>> implementing Quad- and TripleTable and using the constructor
>>> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
>>>>>>> 
>>>>>>> ---
>>>>>>> A. Soroka
>>>>>>> The University of Virginia Library
>>>>>>> 
>>>>>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <[email protected]>
>>> wrote:
>>>>>>>> 
>>>>>>>> Hi.
>>>>>>>> 
>>>>>>>> Does anyone know the "find" paths through DatasetGraphInMemory
>>> please?
>>>>>>>> 
>>>>>>>> For example if I extend DatasetGraphInMemory and override
>>>>>>>> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on
>>> "select
>>>>>>>> * where {?s ?p ?o}" however if I override the other
>>>>>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g
>>> {?s ?p
>>>>>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method
>>> it's
>>>>>>>> calling (but as I type I'm guessing it's optimised to return the
>>> HexTable
>>>>>>>> nodes...).
>>>>>>>> 
>>>>>>>> Would I be better off overriding HexTable and TriTable classes find
>>> methods
>>>>>>>> when I create the DatasetGraphInMemory? Are all finds guaranteed to
>>> end in
>>>>>>>> one of these methods?
>>>>>>>> 
>>>>>>>> I need to know the root find methods so that I can shim them to
>>> create
>>>>>>>> triples/quads before they perform the find.
>>>>>>>> 
>>>>>>>> I need to create Triples/Quads on demand (because a bulk load would
>>> create
>>>>>>>> ~100M triples but only ~1000 are ever queried) and the source binary
>>> form
>>>>>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M
>>> quads)
>>>>>>>> than quads.
>>>>>>>> 
>>>>>>>> Regards Dick Murray.
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
