Re: more predictable Turtle output
On 02/08/2017 at 18:39, Élie Roux wrote:
>> Subclassing ShellGraph and overriding methods like writePredicateObjectList would be my approach. Too much is private to really subclass - you'll need to copy the class at the moment. Along with registration, you can then at least have both in the same JVM, which helps testing.
>
> Thank you very much for your advice, I've started doing that and it seems to work just fine. I'll report if I encounter problems.

Following up on this: I developed a small library that gives us a stable Turtle output (at least to some extent, but we do not currently need more). It's available on https://github.com/BuddhistDigitalResourceCenter/jena-stable-turtle and on Maven. Ideas/contributions welcome!

Thank you,
--
Elie
Re: more predictable Turtle output
> Subclassing ShellGraph and overriding methods like writePredicateObjectList would be my approach. Too much is private to really subclass - you'll need to copy the class at the moment. Along with registration, you can then at least have both in the same JVM, which helps testing.

Thank you very much for your advice, I've started doing that and it seems to work just fine. I'll report if I encounter problems.

Thank you,
--
Elie
Re: more predictable Turtle output
On 02/08/17 13:31, Élie Roux wrote:
> On 02/08/2017 at 14:13, Jean-Marc Vanel wrote:
>> Élie, I would use N-Triples format, sorted in alphanumerical order.
>
> Thank you very much for your answer! I thought about this approach but I see two problems:
> - N-Triples is hardly readable, and I would prefer having my data stored as Turtle for readability
> - more importantly, this will still output a lot of diff noise because blank node IDs will change randomly (and will not keep the same order)

Only if you reload the file, in which case it is a different blank node. The N-Triples writer uses the internal label for the blank node, so if the blank node label is changing, that suggests the file is being reloaded.

This is most serious for subjects, because they will be wildly far apart, whereas (with the block writer) triples are locally grouped. Sorting by subject would need to define the comparison based on something - maybe a primary key value?

Dumping a TDB database (which is N-Quads) shows the label is stable if the source is stable.

Andy

> Thank you,
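To illustrate the "primary key" idea from this message, here is a minimal sketch, assuming each blank-node subject carries some stable key value (for example an identifier literal). It is plain Java on strings, not Jena code; `KeyedSubjectOrder`, `sortByKey` and the `place-N` keys are hypothetical names made up for the example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Jena): orders subjects by a stable key. */
final class KeyedSubjectOrder {

    /**
     * Returns the subjects sorted by their key value. A subject without a
     * key falls back to its own label, so IRIs still sort predictably.
     */
    static List<String> sortByKey(Collection<String> subjects, Map<String, String> keyOf) {
        List<String> sorted = new ArrayList<>(subjects);
        sorted.sort(Comparator.comparing((String s) -> keyOf.getOrDefault(s, s)));
        return sorted;
    }

    public static void main(String[] args) {
        // Two loads of the same file: the blank node labels differ, the keys do not.
        Map<String, String> firstLoad = new LinkedHashMap<>();
        firstLoad.put("_:b0", "place-1");
        firstLoad.put("_:b1", "place-2");

        Map<String, String> secondLoad = new LinkedHashMap<>();
        secondLoad.put("_:x9", "place-1");
        secondLoad.put("_:a3", "place-2");

        // Both loads list the "place-1" subject first, whatever its label is.
        System.out.println(sortByKey(firstLoad.keySet(), firstLoad));
        System.out.println(sortByKey(secondLoad.keySet(), secondLoad));
    }
}
```

Sorting on the label alone would put `_:a3` before `_:x9` in the second load and so swap the two blocks; sorting on the key keeps them in place. The labels themselves would still differ in the output, so this only stabilises the order.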
Re: more predictable Turtle output
See also JENA-1262.

1/ If you want predictable output, you may be better off starting from TurtleWriterBlocks, not the full pretty writer, or from the flat writer TurtleWriterFlat, which is N-Triples plus prefixes. Whether that's easier depends on how extensive the changes are (and how big the data is).

The block writer passes out chunks of same-subject triples. In practice, all triples with the same subject come in one chunk because of indexing.

Also, if the data does not change, I think all writers are deterministic and produce the same output from run to run.

It's a balance of reducing the effort needed, prettiness, and stability. They are not independent choices!

2/ There is no need to change RIOTLib - create your own writer and register it. ExRIOT_out3 has an example of adding a writer.

3/ The pretty writer is in ShellGraph - accTriples is only used in a few places and does not really drive the pretty writer output. In particular, the order is lost because the collection of triples is further worked on, including sorting by predicate. See writePredicateObjectList.

Subclassing ShellGraph and overriding methods like writePredicateObjectList would be my approach. Too much is private to really subclass - you'll need to copy the class at the moment. Along with registration, you can then at least have both in the same JVM, which helps testing.

4/ I have seen a writer (not open source) that applies a form of the Floyd-Warshall algorithm to sort subjects and get some stability - connected nodes tend to come out together, so localising git diffs. Quite space hungry, quite complicated.

5/ The JSON-LD output comes from an external library.

Andy

On 02/08/17 12:43, Élie Roux wrote:
> Hello,
>
> I'm currently trying to solve a problem I have in Turtle: I would like my output to stay stable, so that it can live in a git repository without generating too much diff noise every time the data is regenerated. One example would be something like:
>
> bdr:G844 a :Place ;
>     :placeContains bdr:G1183 , bdr:G229 , bdr:G2CN10883 , bdr:G3478 ,
>         bdr:G3JT12502 , bdr:G4885 .
>
> for which I have no guarantee that the list will stay in the same order if the same model is serialized again. I could turn it into a list:
>
> bdr:G844 a :Place ;
>     :placeContains ( bdr:G1183 bdr:G229 bdr:G2CN10883 bdr:G3478
>         bdr:G3JT12502 bdr:G4885 ) .
>
> but that changes my data model, and I don't really need that, as I care about the order only in serialized documents, not in the dataset itself.
>
> I can hack the output of JSON-LD to do this kind of thing, but with Turtle this looks impossible.
>
> I realize that Turtle doesn't guarantee order and I have no problem with that. I'm also aware that introducing this kind of sorting will always have caveats.
>
> But I still think it would be a tremendous help for some users if this kind of sorting were possible. The way I propose to do so is by introducing the possibility for the user to provide a Comparator and optionally pass it to org.apache.jena.riot.system.RIOTLib, which would change the behavior of accTriples() accordingly. That would allow the current behavior not to change at all, and the new behavior to be used only by users who would implement a Comparator and thus know what they're doing and what the limitations of this exercise are.
>
> I'm ready to write the code if the idea is considered a good one, but would like some opinion first. So what do you think?
>
> Thank you,
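As a rough picture of what a deterministic replacement for a method like writePredicateObjectList has to do, here is a minimal sketch. It is not the ShellGraph code and does not use the Jena API: terms are plain, already-abbreviated strings, and `StableSubjectBlock`, `write` and the "rdf:type first, then alphabetical" ordering are assumptions chosen for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-alone sketch: writes one subject block in a fixed order. */
final class StableSubjectBlock {

    /** Puts "a" (rdf:type) first, then every other predicate in string order. */
    static final Comparator<String> PREDICATE_ORDER = (p1, p2) -> {
        boolean type1 = p1.equals("a");
        boolean type2 = p2.equals("a");
        if (type1 != type2) {
            return type1 ? -1 : 1;
        }
        return p1.compareTo(p2);
    };

    /**
     * Each element of predicateObjects is a {predicate, object} pair.
     * Predicates and objects are both sorted, so input order is irrelevant.
     */
    static String write(String subject, List<String[]> predicateObjects) {
        if (predicateObjects.isEmpty()) {
            return "";
        }
        Map<String, TreeSet<String>> grouped = new TreeMap<>(PREDICATE_ORDER);
        for (String[] po : predicateObjects) {
            grouped.computeIfAbsent(po[0], k -> new TreeSet<>()).add(po[1]);
        }
        StringBuilder out = new StringBuilder(subject);
        String separator = " ";
        for (Map.Entry<String, TreeSet<String>> entry : grouped.entrySet()) {
            out.append(separator)
               .append(entry.getKey())
               .append(' ')
               .append(String.join(" , ", entry.getValue()));
            separator = " ;\n    ";
        }
        return out.append(" .\n").toString();
    }

    public static void main(String[] args) {
        List<String[]> pairs = java.util.Arrays.asList(
                new String[] {":placeContains", "bdr:G229"},
                new String[] {"a", ":Place"},
                new String[] {":placeContains", "bdr:G1183"});
        System.out.print(write("bdr:G844", pairs));
    }
}
```

A real writer would compare RDF nodes rather than strings and would have to decide how blank nodes, literals and nested structures are ordered, which is where most of the caveats mentioned in this thread come from.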
Re: more predictable Turtle output
On 02/08/2017 at 14:13, Jean-Marc Vanel wrote:
> Élie, I would use N-Triples format, sorted in alphanumerical order.

Thank you very much for your answer! I thought about this approach but I see two problems:
- N-Triples is hardly readable, and I would prefer having my data stored as Turtle for readability
- more importantly, this will still output a lot of diff noise because blank node IDs will change randomly (and will not keep the same order)

Thank you,
--
Elie
Re: more predictable Turtle output
Élie, I would use N-Triples format, sorted in alphanumerical order.

2017-08-02 13:43 UTC+02:00, Élie Roux:
> Hello,
>
> I'm currently trying to solve a problem I have in Turtle: I would like my output to stay stable, so that it can live in a git repository without generating too much diff noise every time the data is regenerated. One example would be something like:
>
> bdr:G844 a :Place ;
>     :placeContains bdr:G1183 , bdr:G229 , bdr:G2CN10883 , bdr:G3478 ,
>         bdr:G3JT12502 , bdr:G4885 .
>
> for which I have no guarantee that the list will stay in the same order if the same model is serialized again. I could turn it into a list:
>
> bdr:G844 a :Place ;
>     :placeContains ( bdr:G1183 bdr:G229 bdr:G2CN10883 bdr:G3478
>         bdr:G3JT12502 bdr:G4885 ) .
>
> but that changes my data model, and I don't really need that, as I care about the order only in serialized documents, not in the dataset itself.
>
> I can hack the output of JSON-LD to do this kind of thing, but with Turtle this looks impossible.
>
> I realize that Turtle doesn't guarantee order and I have no problem with that. I'm also aware that introducing this kind of sorting will always have caveats.
>
> But I still think it would be a tremendous help for some users if this kind of sorting were possible. The way I propose to do so is by introducing the possibility for the user to provide a Comparator and optionally pass it to org.apache.jena.riot.system.RIOTLib, which would change the behavior of accTriples() accordingly. That would allow the current behavior not to change at all, and the new behavior to be used only by users who would implement a Comparator and thus know what they're doing and what the limitations of this exercise are.
>
> I'm ready to write the code if the idea is considered a good one, but would like some opinion first. So what do you think?
>
> Thank you,
> --
> Elie

--
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me
Déductions SARL - Consulting, services, training, Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
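The sorted N-Triples suggestion in this message can be sketched without any RDF library, since N-Triples puts exactly one triple on each line. The following is a minimal illustration, assuming the data has already been serialized as N-Triples text; `SortedNTriples` and `sortLines` are hypothetical names, and the ordering is Java's plain string order rather than anything defined by the N-Triples specification.

```java
import java.util.TreeSet;

/** Hypothetical helper (not part of Jena): sorts the lines of an N-Triples document. */
final class SortedNTriples {

    /** Returns the input with blank lines dropped, duplicate lines removed and lines sorted. */
    static String sortLines(String ntriples) {
        // One triple per line, so reordering lines does not change the graph.
        TreeSet<String> lines = new TreeSet<>();
        for (String line : ntriples.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String data = "<http://example.org/b> <http://example.org/p> \"2\" .\n"
                    + "<http://example.org/a> <http://example.org/p> \"1\" .\n";
        System.out.print(sortLines(data));
    }
}
```

This gives a stable file only as long as every term is written the same way on each run; blank node labels that change between runs would still move lines around.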
more predictable Turtle output
Hello,

I'm currently trying to solve a problem I have in Turtle: I would like my output to stay stable, so that it can live in a git repository without generating too much diff noise every time the data is regenerated. One example would be something like:

bdr:G844 a :Place ;
    :placeContains bdr:G1183 , bdr:G229 , bdr:G2CN10883 , bdr:G3478 ,
        bdr:G3JT12502 , bdr:G4885 .

for which I have no guarantee that the list will stay in the same order if the same model is serialized again. I could turn it into a list:

bdr:G844 a :Place ;
    :placeContains ( bdr:G1183 bdr:G229 bdr:G2CN10883 bdr:G3478
        bdr:G3JT12502 bdr:G4885 ) .

but that changes my data model, and I don't really need that, as I care about the order only in serialized documents, not in the dataset itself.

I can hack the output of JSON-LD to do this kind of thing, but with Turtle this looks impossible.

I realize that Turtle doesn't guarantee order and I have no problem with that. I'm also aware that introducing this kind of sorting will always have caveats.

But I still think it would be a tremendous help for some users if this kind of sorting were possible. The way I propose to do so is by introducing the possibility for the user to provide a Comparator and optionally pass it to org.apache.jena.riot.system.RIOTLib, which would change the behavior of accTriples() accordingly. That would allow the current behavior not to change at all, and the new behavior to be used only by users who would implement a Comparator and thus know what they're doing and what the limitations of this exercise are.

I'm ready to write the code if the idea is considered a good one, but would like some opinion first. So what do you think?

Thank you,
--
Elie
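The shape of the proposal above, an optional Comparator that leaves the default behavior untouched, might look roughly like the following. This is only a sketch on plain strings, not the RIOTLib or accTriples() code; `ObjectListWriter` and `writeObjects` are hypothetical names, and a real version would compare RDF terms rather than strings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of an opt-in ordering hook for an object list. */
final class ObjectListWriter {

    /**
     * Joins the objects of one predicate into a Turtle object list.
     * A null comparator means "keep the order the caller supplied",
     * so existing callers see no change in behavior.
     */
    static String writeObjects(List<String> objects, Comparator<String> order) {
        List<String> copy = new ArrayList<>(objects);
        if (order != null) {
            copy.sort(order);
        }
        return String.join(" , ", copy);
    }

    public static void main(String[] args) {
        List<String> objects = java.util.Arrays.asList("bdr:G3478", "bdr:G1183", "bdr:G229");
        // Default: order is whatever the model iteration produced.
        System.out.println(writeObjects(objects, null));
        // Opt-in: the user decides what "sorted" means.
        System.out.println(writeObjects(objects, Comparator.naturalOrder()));
    }
}
```

Keeping the comparator optional confines the caveats of sorting to the users who ask for it, which is the trade-off argued for in the message.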