Re: Persistent Model Implementation

2018-10-23 Thread Daan Reid

Hi,

It may not fit your use case precisely, but we've had some success 
combining the event-sourcing pattern with caching to create datasets with 
history. Using Jena's deltas lets us persist only the changesets.
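To make the event-sourcing idea concrete, here is a minimal sketch with triples simplified to plain strings; every name here (`EventSourcedGraph`, `Delta`, `commit`, `atVersion`) is illustrative and is not the jena-es or Jena API:

```java
import java.util.*;

// Sketch of event-sourcing an RDF graph: instead of storing each version
// in full, store only changesets (added/removed triples) and replay the
// log to materialise any historical version.
public class EventSourcedGraph {
    public record Delta(Set<String> additions, Set<String> deletions) {}

    private final List<Delta> log = new ArrayList<>();

    public void commit(Set<String> additions, Set<String> deletions) {
        log.add(new Delta(Set.copyOf(additions), Set.copyOf(deletions)));
    }

    // Replay the first `version` changesets to rebuild that state.
    public Set<String> atVersion(int version) {
        Set<String> state = new HashSet<>();
        for (Delta d : log.subList(0, version)) {
            state.removeAll(d.deletions());
            state.addAll(d.additions());
        }
        return state;
    }

    public static void main(String[] args) {
        EventSourcedGraph g = new EventSourcedGraph();
        g.commit(Set.of(":s :p 1", ":s :p 2"), Set.of());
        g.commit(Set.of(":s :p 3"), Set.of(":s :p 1"));
        System.out.println(g.atVersion(1).contains(":s :p 1")); // true
        System.out.println(g.atVersion(2).contains(":s :p 1")); // false: deleted in v2
    }
}
```

The storage cost is proportional to the size of the changes, not to the number of versions times the dataset size, which is the point of persisting only deltas.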


This may also be of general interest to the list.

https://github.com/drugis/jena-es

Regards,

Daan Reid

On 22-10-18 12:49, Kevin Dreßler wrote:

Thanks for your quick reply!


On 22. Oct 2018, at 12:19, ajs6f  wrote:

The TIM dataset implementation [1] is backed by persistent data structures (for the 
confused: "persistent" here is meant in the sense of immutable [2] -- it 
has nothing to do with disk storage). However, nothing there goes beyond the 
Node/Triple/Graph/DatasetGraph SPI -- the underlying structures aren't exposed and can't 
be reused by clients.


This looks interesting, but I don't think it actually matches my use case. 
However, I think I would want a transactional commit in my implementation to 
improve performance, so that I could collect a set of statements and only create 
a new immutable instance of the model when committing them all together, 
instead of after each single statement.
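The batching idea described above could be sketched as follows -- a hypothetical buffer that accumulates changes and produces one new immutable snapshot per commit. Triples are simplified to strings and none of these names are Jena API:

```java
import java.util.*;

// Sketch of transactional batching: mutations accumulate in mutable
// buffers, and a new immutable snapshot is created only on commit(),
// not after every single statement.
public class BatchingModel {
    private final Set<String> snapshot;   // current immutable state
    private final Set<String> pendingAdds = new HashSet<>();
    private final Set<String> pendingRemoves = new HashSet<>();

    public BatchingModel(Set<String> initial) { this.snapshot = Set.copyOf(initial); }

    public void add(String t)    { pendingRemoves.remove(t); pendingAdds.add(t); }
    public void remove(String t) { pendingAdds.remove(t); pendingRemoves.add(t); }

    // One new immutable instance per commit, regardless of batch size.
    public BatchingModel commit() {
        Set<String> next = new HashSet<>(snapshot);
        next.removeAll(pendingRemoves);
        next.addAll(pendingAdds);
        return new BatchingModel(next);
    }

    public Set<String> triples() { return snapshot; }

    public static void main(String[] args) {
        BatchingModel m0 = new BatchingModel(Set.of(":s :p 1"));
        m0.add(":s :p 2");
        m0.add(":s :p 3");
        m0.remove(":s :p 1");
        BatchingModel m1 = m0.commit(); // one snapshot for the whole batch
        System.out.println(m1.triples().contains(":s :p 2")); // true
        System.out.println(m1.triples().contains(":s :p 1")); // false
    }
}
```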


This sounds like an interesting and powerful use case, although I'm not sure 
how easily it could be accomplished within the current API. For one thing, we 
don't have a good way of distinguishing mutable and immutable models in Jena's 
type system right now.

Are the "k new Models" both adding and removing triples? If they're just adding 
triples, perhaps a clever wrapper might work.


Both addition and deletion of triples are possible. But the wrapper idea is nice 
and might actually work for both addition and deletion, as I could try to cache 
the set of Statements that have been deleted as long as this cache's size stays 
under x% of the base model's size.
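One way the threshold-based wrapper could look, as a sketch over plain string sets (the class name, the 5% threshold value, and all method names are made up for illustration):

```java
import java.util.*;

// Sketch of the wrapper idea: a view over a shared, effectively immutable
// base model that records additions and deletions on the side, and only
// materialises a fresh copy once the deletion cache grows past a fraction
// ("x%") of the base size.
public class DeletionCachingView {
    static final double THRESHOLD = 0.05; // the "x%" from the discussion

    private Set<String> base;             // shared, treated as immutable
    private final Set<String> added = new HashSet<>();
    private final Set<String> deleted = new HashSet<>();

    public DeletionCachingView(Set<String> base) { this.base = base; }

    public void add(String t)    { deleted.remove(t); if (!base.contains(t)) added.add(t); }
    public void delete(String t) { added.remove(t); if (base.contains(t)) deleted.add(t); }

    public boolean contains(String t) {
        return added.contains(t) || (base.contains(t) && !deleted.contains(t));
    }

    // Compact into a private copy when the deletion cache exceeds
    // THRESHOLD of the base size, bounding lookup overhead.
    public void maybeCompact() {
        if (deleted.size() > THRESHOLD * base.size()) {
            Set<String> copy = new HashSet<>(base);
            copy.removeAll(deleted);
            copy.addAll(added);
            base = copy;
            deleted.clear();
            added.clear();
        }
    }

    public static void main(String[] args) {
        Set<String> base = new HashSet<>();
        for (int i = 0; i < 100; i++) base.add(":s :p " + i);
        DeletionCachingView v = new DeletionCachingView(base);
        for (int i = 0; i < 10; i++) v.delete(":s :p " + i); // 10% > 5% threshold
        v.maybeCompact();
        System.out.println(v.contains(":s :p 0"));  // false
        System.out.println(v.contains(":s :p 50")); // true
    }
}
```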


Otherwise, have you tried using an intermediating caching setup, wherein 
statements that are copied are routed through a cache that prevents 
duplication? I believe Andy deployed a similar technique for some of the TDB 
loading code and saw great improvement therefrom.


I just started researching this, so I haven't done anything in this direction yet. 
Do you believe the wrapper / caching approach would be feasible with the 
current API? I am not very familiar with Jena's implementations, but from my 
experience with the API it seems that every RDFNode has a reference to the 
model from which it was retrieved (if any). So in order not to violate API 
contracts, I think I would also need to wrap each resource upon retrieval so 
that it points to the wrapper model instead of the base model?


ajs6f

[1] https://jena.apache.org/documentation/rdf/datasets.html
[2] https://en.wikipedia.org/wiki/Persistent_data_structure


On Oct 22, 2018, at 12:08 PM, Kevin Dreßler  wrote:

Hello everyone,

I have an application using Jena where I frequently have to create copies of 
Models in order to then process them individually, i.e. all triples of one 
source Model are added to k new Models which are then mutated.

For larger Models this obviously takes some time and, more relevant for me, 
creates a considerable amount of memory pressure.
However, with a Model implementation based on persistent data structures I 
could eliminate most of these issues, since the amount of data changed is 
typically under 5% of the overall Model size.

Has anyone ever done something like this before, i.e. is anyone aware of 
immutable Model implementations with structural sharing? If not, what would 
be your advice on how to approach implementing this in one's own code base?
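The essence of structural sharing for this copy-then-mutate workload can be sketched in a few lines: each of the k "copies" holds a reference to the same base set plus its own small delta, so nothing is duplicated up front. All names here are invented for illustration and are not an existing Jena facility:

```java
import java.util.*;

// Sketch of structural sharing: k derived "copies" of a model each keep a
// reference to the same unmodified base set plus their own private delta,
// instead of k full physical copies.
public class SharedCopy {
    private final Set<String> base;              // shared, never mutated
    private final Set<String> adds = new HashSet<>();
    private final Set<String> dels = new HashSet<>();

    public SharedCopy(Set<String> base) { this.base = base; }

    public void add(String t)    { dels.remove(t); adds.add(t); }
    public void remove(String t) { adds.remove(t); dels.add(t); }

    public boolean contains(String t) {
        return adds.contains(t) || (base.contains(t) && !dels.contains(t));
    }

    public static void main(String[] args) {
        Set<String> base = new HashSet<>(List.of(":a", ":b", ":c"));
        SharedCopy c1 = new SharedCopy(base);
        SharedCopy c2 = new SharedCopy(base);  // same base, no copying cost
        c1.remove(":a");
        System.out.println(c1.contains(":a")); // false
        System.out.println(c2.contains(":a")); // true: copies stay independent
    }
}
```

With changes typically under 5% of the model, each copy's memory footprint is roughly 5% of a full clone, at the price of one extra lookup per contains().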

Best regards,
Kevin





Re: Splitting data into graphs vs datasets

2018-03-22 Thread Daan Reid
I would say that using separate datasets is a good idea if you have sets 
of graphs that just don't belong together. The dataset as an 
organisational, abstract container is an excellent idea, in my opinion.


Regards,

Daan

On 22-03-18 11:22, Mikael Pesonen wrote:
Ok, it seems that using many datasets is not a good idea. I had no bias and 
am not having any issues with speed; I just wanted to see what the best way 
to go is.


On 21.3.2018 20:48, ajs6f wrote:
  Those sure are good reasons for using named graphs. But what about 
using different datasets too?
Consider that you may not be seeing such reasons because it may not 
actually be as good an idea.


Here's another reason to prefer graphs: There is a standard management 
HTTP API for named graphs: SPARQL Graph Store. There is no equivalent 
for datasets, so each product rolls its own. That's not good for 
flexibility if you have to move products.
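To illustrate why the standard protocol matters for portability: a Graph Store Protocol request is just plain HTTP against a single endpoint, so the same request works against any conforming store. The sketch below builds such a request with the JDK's HttpClient types; the endpoint URL and graph name are made up:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch of a SPARQL Graph Store Protocol request: a named graph is
// replaced with HTTP PUT to the store's data endpoint, identified by a
// ?graph= query parameter. No vendor-specific API involved.
public class GraphStoreRequest {
    public static HttpRequest putGraph(String endpoint, String graphUri, String turtle) {
        String url = endpoint + "?graph="
                + URLEncoder.encode(graphUri, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "text/turtle")
                .PUT(HttpRequest.BodyPublishers.ofString(turtle))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = putGraph("http://localhost:3030/ds/data",
                "http://example.org/g1", ":s :p :o .");
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Sending it with `HttpClient.newHttpClient().send(req, ...)` against a Fuseki (or any other GSP-compliant) endpoint would replace the named graph's contents; GET, POST, and DELETE on the same URL pattern cover retrieval, merging, and removal.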


As for performance, that will depend radically on the implementation. 
Jena TIM, for example, uses hashing for its indexes, so the 
difference between having a lot of quads in a dataset and a few isn't 
likely to be that much. Other impls will vary.


Are you sure that performance is going to be improved by separating 
out datasets? (I.e. is that the measured bottleneck?) Are you now 
having problems with queries accidentally querying data they shouldn't 
see, and can your queries be rewritten to fix that (which might also 
improve performance)? (Jena has a permissions framework that can 
secure information down to the individual triple.)


ajs6f

On Mar 21, 2018, at 6:35 AM, Mikael Pesonen 
 wrote:



Those sure are good reasons for using named graphs. But what about 
using different datasets too?


btw, I couldn't find info on how to run many datasets with Fuseki. Is 
it just one dataset per Fuseki process? The -loc parameter for 
fuseki-server.jar?


Br

On 20.3.2018 14:22, Martynas Jusevičius wrote:
Provenance. With named graphs, it's easier to track where data came 
from: who imported it, when, etc. You can also have meta-graphs about 
other graphs.

Also editing and updating data. You can load named graph contents (of 
smallish size) in an editor, make changes and then store a new version 
in the same graph. You probably would not want to do this with a large 
default graph.

On Tue, Mar 20, 2018 at 1:16 PM, Mikael Pesonen wrote:


Hi,

I'm using Fuseki GSP, and so far have put all data into one default
dataset and using graphs to split it.

If I'm right, there would be benefits to using more than one dataset:
- better performance: each query is done inside a dataset, so less data = 
faster query
- protection of data: can't "accidentally" query data from other datasets

Downsides:
- combining data from various datasets is a heavier task

Is this correct? Any other things that should be considered?

Thank you

--
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and
Writer's Tools - Text Tools - E-books and M-books

Mikael Pesonen
System Engineer

e-mail: mikael.peso...@lingsoft.fi
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10

FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A

FI-20100 Turku
FINLAND








Re: Example code

2018-03-19 Thread Daan Reid
I'm not sure whether what you're after is something that corresponds 
one-to-one between RDF triples and what the user sees, or something more 
functional, but if you want to see an example of using Jena as the data 
store for an application, it might be interesting to look at our system 
ADDIS:

https://addis.drugis.org/

We use an RDF data store to persist clinical trial study data in a 
structured manner, then retrieve it with SPARQL and render it in 
AngularJS. Code is here:

https://github.com/drugis/addis-core

Resultset to Java:
https://github.com/drugis/addis-core/blob/master/src/main/java/org/drugis/addis/trialverse/service/impl/QueryResultMappingServiceImpl.java

Jena graph to and from usable frontend data objects in JS:
https://github.com/drugis/addis-core/blob/master/src/main/webapp/resources/app/js/outcome/outcomeService.js

Note that the system is somewhat large by now so extracting simple 
examples isn't easy.


Regards,

Daan Reid
http://drugis.org

On 19-03-18 08:31, David Moss wrote:

That is certainly a way to get data from a SPARQL endpoint to display in a 
terminal window.
It does not store the data locally or put it into a user-friendly GUI control, 
however. Looks like I might have to roll my own and face the music publicly 
if I'm doing it wrong.

I think real-world examples of how to use Jena in a user friendly program are 
essential to advancing the semantic web.
Thanks for considering my question.

DM

On 19/3/18, 4:19 pm, "Laura Morales" <laure...@mail.com> wrote:

 As far as I know the only way to query Jena remotely is via HTTP. So, install Fuseki and 
then send a traditional HTTP GET/POST request to it with two parameters, "query" and 
"format". For example:
 
 $ curl --data "query=..." --data "format=json" http://your-endpoint.org
  
  
 
 Sent: Sunday, March 18, 2018 at 11:26 PM

 From: "David Moss" <admo...@gmail.com>
 To: users@jena.apache.org
 Subject: Re: Example code
 
 On 18/3/18, 6:24 pm, "Laura Morales" <laure...@mail.com> wrote:
 
 >> For example, when using data from a SPARQL endpoint, what is the accepted

 >> way to retrieve it, store it locally and make it available through user
 >> interface controls?
 
 >Make a query that returns a jsonld document.
 
 How? Do you have some example code showing how this query is retrieved, dealt with locally and made available to an end user through a GUI control?

 What I am looking for here is a bridge between what experts glean from 
reading Javadoc and what ordinary people need to use Jena within a GUI based 
application.
 
 I see this kind of example as the missing link that prevents anyone other than experts from using Jena.

 So long as easy-to-follow examples of how to get from an RDF triplestore 
to information displayed on a screen in a standard GUI way are missing, Jena 
will remain a plaything for expert enthusiasts.
 
 DM
 





Delta using lexical equality causing issues

2017-07-07 Thread Daan Reid

Hi all,

After upgrading to Jena 3.2 recently, we encountered the issue 
referenced here:

https://issues.apache.org/jira/browse/JENA-1370

In short, when a graph Delta adds a triple with a double-valued property 
that is semantically the same as, but lexically different from, a triple 
already in the graph, the result is what is in my opinion incorrect 
behaviour: the triple is placed in the deletions list and not in the 
additions list, in essence removing it from the graph even though it 
should still be in there.


So, for example, if I take a Delta whose base graph contains `:s :p 
-1.70e+00 .`, and then clear it and add the triple `:s :p -1.7E0 .`, the 
resulting Delta will have a set of deletions containing the original triple, 
and an empty list of additions.


As far as I can tell, this is because Delta uses .contains() to check its 
additions and deletions, and for the volatile GraphMem we use, the 
literals are checked for lexical rather than semantic equality, which 
causes inconsistencies.
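The lexical-versus-value distinction at the heart of this can be illustrated with plain Java's BigDecimal, which behaves analogously to typed RDF literals here. This is only an analogy, not Jena's actual literal-equality code:

```java
import java.math.BigDecimal;

// Analogy for lexical vs. value equality: BigDecimal.equals() compares
// the exact representation (including scale), while compareTo() compares
// the numeric value -- much like "-1.70e+00" and "-1.7E0" are different
// lexical forms of the same xsd:double value.
public class LexicalVsValue {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("-1.70"); // scale 2
        BigDecimal b = new BigDecimal("-1.7");  // scale 1
        System.out.println(a.equals(b));         // false: lexically different
        System.out.println(a.compareTo(b) == 0); // true: same numeric value
    }
}
```

A contains() check built on the first kind of equality will fail to recognise the second triple as already present, which matches the Delta misbehaviour described above.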


I would appreciate any and all help with this.

Regards,

Daan Reid
Drugis project - https://drugis.org