Re: more predictable Turtle output

2017-08-24 Thread Élie Roux

On 02/08/2017 at 18:39, Élie Roux wrote:
>> Subclassing ShellGraph and overriding methods like
>> writePredicateObjectList would be my approach. Too much is private
>> to really subclass - you'll need to copy the class at the moment.
>> Along with registration, then at least you can have both in the
>> same JVM and it helps testing.
>
> Thank you very much for your advice, I've started doing that and it
> seems to work just fine, I'll report if I encounter problems.


Following up on this: I developed a small library that gives us
stable Turtle output (at least to some extent, but we do not currently
need more). It's available at

https://github.com/BuddhistDigitalResourceCenter/jena-stable-turtle

and on Maven. Ideas/contributions welcome!

Thank you,
--
Elie


Re: more predictable Turtle output

2017-08-02 Thread Élie Roux
> Subclassing ShellGraph and overriding methods like
> writePredicateObjectList would be my approach. Too much is private to
> really subclass - you'll need to copy the class at the moment.
> Along with registration, then at least you can have both in the same
> JVM and it helps testing.

Thank you very much for your advice, I've started doing that and it
seems to work just fine, I'll report if I encounter problems.

Thank you,
--
Elie


Re: more predictable Turtle output

2017-08-02 Thread Andy Seaborne



On 02/08/17 13:31, Élie Roux wrote:
> On 02/08/2017 at 14:13, Jean-Marc Vanel wrote:
>> Élie,
>>
>> I would use N-Triples format, sorted in alphanumerical order.
>
> Thank you very much for your answer! I thought about this approach but I
> see two problems:
>
> - N-Triples is hard to read, and I would prefer having my data stored as
> Turtle for readability
>
> - more importantly, this will still output a lot of diff noise because
> blank node IDs will change randomly (and will not keep the same order)


Only if you reload the file ... in which case it is a different blank node.

The NT writer uses the internal label for the blank node, so if the blank
node label is changing, that suggests the file is being reloaded.


This is most serious for subjects because they will be wildly far apart 
whereas (block writer) triples are locally grouped. Sorting by subject 
would need to define the comparison based on something - maybe a primary 
key value?


Dumping a TDB database (which is N-Quads) shows the label is stable if
the source is stable.


Andy



> Thank you,


Re: more predictable Turtle output

2017-08-02 Thread Andy Seaborne

See also JENA-1262

1/
If you want predictable output, you may be better off starting from
TurtleWriterBlocks, not the full pretty writer. Or the flat writer
TurtleWriterFlat, which is N-Triples+prefixes. Which is easier depends
on how extensive the changes are (and how big the data is).
TurtleWriterBlocks passes out chunks of same-subject triples.


In practice, all triples with the same subject come in one chunk because 
of indexing.  Also, if the data does not change, I think all writers are 
deterministic and produce the same output from run to run.


It's a balance of reducing the effort needed, prettiness, and stability. 
They are not independent choices!


2/
There is no need to change RIOTLib -- create your own writer and 
register it. ExRIOT_out3 has an example of adding a writer.


3/
The pretty writer is in ShellGraph - accTriples is only used in a few
places and does not really drive the pretty writer output.
In particular, the order is lost because the collection of triples is
further worked on, including sorting by predicate.  See
writePredicateObjectList.


Subclassing ShellGraph and overriding methods like
writePredicateObjectList would be my approach. Too much is private to
really subclass - you'll need to copy the class at the moment. Along
with registration, then at least you can have both in the same JVM and
it helps testing.


4/
I have seen a writer (not open source) that applies a form of the
Floyd-Warshall algorithm to sort subjects and get some stability -
connected nodes tend to come out together, so localising git diffs.
Quite space hungry, quite complicated.


5/ The JSON-LD output comes from an external library.
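
For point 4/, the writer in question is not public, so how it turns
Floyd-Warshall distances into a subject order is unknown. The following
is only an illustrative guess, using the Java standard library: it
computes all-pairs shortest path lengths over the "subject links to
subject" graph, then emits subjects greedily, always picking the
not-yet-written subject closest to the one just written. The class and
method names (SubjectOrdering, distances, order) and the greedy step are
assumptions made for the sketch. The n-by-n matrix also shows why such
an approach is space hungry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from Jena or from the writer mentioned above.
class SubjectOrdering {
    // Large enough to mean "unreachable", small enough that INF + INF does not overflow.
    static final int INF = Integer.MAX_VALUE / 2;

    // Floyd-Warshall: all-pairs shortest path lengths, treating links as undirected.
    // links[i][j] == true means some triple connects subject i and subject j.
    static int[][] distances(boolean[][] links) {
        int n = links.length;
        int[][] d = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    d[i][j] = 0;
                } else {
                    d[i][j] = (links[i][j] || links[j][i]) ? 1 : INF;
                }
            }
        }
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (d[i][k] + d[k][j] < d[i][j]) {
                        d[i][j] = d[i][k] + d[k][j];
                    }
                }
            }
        }
        return d;
    }

    // Greedy ordering (an assumption): start from index 0 of the pre-sorted
    // subjects, then repeatedly take the unwritten subject nearest to the
    // previous one. Ties go to the lower index, which keeps the result
    // deterministic when the input array is sorted.
    static List<String> order(String[] sortedSubjects, boolean[][] links) {
        int n = sortedSubjects.length;
        int[][] d = distances(links);
        boolean[] written = new boolean[n];
        List<String> out = new ArrayList<>();
        int current = 0;
        for (int step = 0; step < n; step++) {
            if (step > 0) {
                int best = -1;
                for (int j = 0; j < n; j++) {
                    if (!written[j] && (best == -1 || d[current][j] < d[current][best])) {
                        best = j;
                    }
                }
                current = best;
            }
            written[current] = true;
            out.add(sortedSubjects[current]);
        }
        return out;
    }
}
```

With subjects a, b, c, d where a links to c and c links to d, the sketch
writes a, c, d and only then the unconnected b, so connected subjects
stay next to each other.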

Andy

On 02/08/17 12:43, Élie Roux wrote:
> Hello,
>
> I'm currently trying to solve a problem I have in Turtle: I would like
> my output to stay stable, so that it can live in a git repository without
> generating too much diff noise every time the data is regenerated. One
> example would be something like:
>
> bdr:G844  a  :Place ;
>     :placeContains  bdr:G1183 , bdr:G229 , bdr:G2CN10883 , bdr:G3478 ,
>                     bdr:G3JT12502 , bdr:G4885 .
>
> for which I have no guarantee that the list will stay in the same order
> if the same model is serialized again. I could turn it into a list:
>
> bdr:G844  a  :Place ;
>     :placeContains  ( bdr:G1183 bdr:G229 bdr:G2CN10883 bdr:G3478
>                       bdr:G3JT12502 bdr:G4885 ) .
>
> but that changes my data model, and I don't really need that, as I care
> about the order only in serialized documents, not in the dataset itself.
>
> I can hack the output of JSON-LD to do this kind of thing, but with
> Turtle this looks impossible.
>
> I realize that Turtle doesn't guarantee order and I have no problem with
> that. I'm also aware that introducing this kind of sorting will always
> have caveats.
>
> But I still think it would be a tremendous help for some users if this
> kind of sorting was possible. The way I propose to do so is by
> introducing the possibility for the user to provide a Comparator
> and optionally pass it to org.apache.jena.riot.system.RIOTLib, that
> would change the behavior of accTriples() accordingly. That would allow
> the current behavior not to change at all, and the new behavior to be
> used only by users who would implement a Comparator and thus
> know what they're doing and what the limitations of this exercise are.
>
> I'm ready to write the code if the idea is considered a good one, but
> would like some opinion first. So what do you think?
>
> Thank you,


Re: more predictable Turtle output

2017-08-02 Thread Élie Roux

On 02/08/2017 at 14:13, Jean-Marc Vanel wrote:
> Élie,
>
> I would use N-Triples format, sorted in alphanumerical order.


Thank you very much for your answer! I thought about this approach but I
see two problems:

- N-Triples is hard to read, and I would prefer having my data stored as
Turtle for readability

- more importantly, this will still output a lot of diff noise because
blank node IDs will change randomly (and will not keep the same order)

Thank you,
--
Elie


Re: more predictable Turtle output

2017-08-02 Thread Jean-Marc Vanel
Élie,

I would use N-Triples format, sorted in alphanumerical order.
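
A minimal sketch of that idea, assuming the data has already been
written out as N-Triples with one statement per line. The class and
method names (SortedNTriples, sortLines) are hypothetical and only the
Java standard library is used; nothing here is Jena API. Blank node
labels that differ between runs would still sort differently, so this
only stabilises output whose labels are themselves stable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Jena.
class SortedNTriples {

    // Returns the statements in plain String order, dropping blank lines,
    // comment lines and exact duplicates. Assumes one triple per line.
    static List<String> sortLines(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The example.org IRIs are made up for the demonstration.
        List<String> input = Arrays.asList(
                "<http://example.org/G844> <http://example.org/placeContains> <http://example.org/G229> .",
                "<http://example.org/G844> <http://example.org/placeContains> <http://example.org/G1183> .",
                "",
                "<http://example.org/G1183> <http://example.org/label> \"x\" .");
        sortLines(input).forEach(System.out::println);
    }
}
```

Because the sort is on the raw line, statements with the same subject
end up adjacent and in a fixed order, which is what keeps diffs small.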

2017-08-02 13:43 UTC+02:00, Élie Roux wrote:
> Hello,
>
> I'm currently trying to solve a problem I have in Turtle: I would like
> my output to stay stable, so that it can live in a git repository without
> generating too much diff noise every time the data is regenerated. One
> example would be something like:
>
> bdr:G844  a  :Place ;
>     :placeContains  bdr:G1183 , bdr:G229 , bdr:G2CN10883 , bdr:G3478 ,
>                     bdr:G3JT12502 , bdr:G4885 .
>
> for which I have no guarantee that the list will stay in the same order
> if the same model is serialized again. I could turn it into a list:
>
> bdr:G844  a  :Place ;
>  :placeContains   ( bdr:G1183 bdr:G229 bdr:G2CN10883 bdr:G3478
> bdr:G3JT12502 bdr:G4885 ) .
>
> but that changes my data model, and I don't really need that, as I care
> about the order only in serialized documents, not in the dataset itself.
>
> I can hack the output of JSON-LD to do this kind of thing, but with
> Turtle this looks impossible.
>
> I realize that Turtle doesn't guarantee order and I have no problem with
> that. I'm also aware that introducing this kind of sorting will always
> have caveats.
>
> But I still think it would be a tremendous help for some users if this
> kind of sorting was possible. The way I propose to do so is by
> introducing the possibility for the user to provide a Comparator
> and optionally pass it to org.apache.jena.riot.system.RIOTLib, that
> would change the behavior of accTriples() accordingly. That would allow
> the current behavior not to change at all, and the new behavior to be
> used only by users who would implement a Comparator and thus
> know what they're doing and what the limitations of this exercise are.
>
> I'm ready to write the code if the idea is considered a good one, but
> would like some opinion first. So what do you think?
>
> Thank you,
> --
> Elie
>


-- 
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui


more predictable Turtle output

2017-08-02 Thread Élie Roux

Hello,

I'm currently trying to solve a problem I have in Turtle: I would like
my output to stay stable, so that it can live in a git repository without
generating too much diff noise every time the data is regenerated. One
example would be something like:

bdr:G844  a  :Place ;
    :placeContains  bdr:G1183 , bdr:G229 , bdr:G2CN10883 , bdr:G3478 ,
                    bdr:G3JT12502 , bdr:G4885 .

for which I have no guarantee that the list will stay in the same order
if the same model is serialized again. I could turn it into a list:

bdr:G844  a  :Place ;
:placeContains   ( bdr:G1183 bdr:G229 bdr:G2CN10883 bdr:G3478
bdr:G3JT12502 bdr:G4885 ) .

but that changes my data model, and I don't really need that, as I care
about the order only in serialized documents, not in the dataset itself.

I can hack the output of JSON-LD to do this kind of thing, but with
Turtle this looks impossible.

I realize that Turtle doesn't guarantee order and I have no problem with
that. I'm also aware that introducing this kind of sorting will always
have caveats.

But I still think it would be a tremendous help for some users if this
kind of sorting was possible. The way I propose to do so is by
introducing the possibility for the user to provide a Comparator
and optionally pass it to org.apache.jena.riot.system.RIOTLib, that
would change the behavior of accTriples() accordingly. That would allow
the current behavior not to change at all, and the new behavior to be
used only by users who would implement a Comparator and thus
know what they're doing and what the limitations of this exercise are.
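
A rough sketch of what such a user-supplied Comparator could look like,
using only the Java standard library. SimpleTriple is a hypothetical
stand-in for an RDF triple (it is not Jena's Triple class), and
accumulate() is an invented name that only imitates the role of
accTriples(); its real signature is not reproduced here. Passing null
leaves the incoming order untouched, which mirrors the goal of not
changing the current behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names exist in Jena.
class StableTripleOrder {

    // Stand-in for an RDF triple, holding the lexical form of each term.
    static final class SimpleTriple {
        final String subject;
        final String predicate;
        final String object;

        SimpleTriple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // One possible ordering: subject, then predicate, then object,
    // each compared as a plain string.
    static final Comparator<SimpleTriple> BY_SPO =
            Comparator.comparing((SimpleTriple t) -> t.subject)
                    .thenComparing(t -> t.predicate)
                    .thenComparing(t -> t.object);

    // Returns a copy of the triples; a null comparator keeps the incoming order.
    static List<SimpleTriple> accumulate(List<SimpleTriple> triples,
                                         Comparator<SimpleTriple> order) {
        List<SimpleTriple> acc = new ArrayList<>(triples);
        if (order != null) {
            acc.sort(order);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<SimpleTriple> triples = Arrays.asList(
                new SimpleTriple("bdr:G844", ":placeContains", "bdr:G4885"),
                new SimpleTriple("bdr:G844", ":placeContains", "bdr:G1183"),
                new SimpleTriple("bdr:G844", ":placeContains", "bdr:G229"));
        for (SimpleTriple t : accumulate(triples, BY_SPO)) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```

Comparing lexical forms is the simplest choice; a real comparator would
have to decide how to order blank nodes and literals, which is where the
caveats mentioned above come in.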

I'm ready to write the code if the idea is considered a good one, but
would like some opinion first. So what do you think?

Thank you,
--
Elie