Hi Sebastian,
On 14/07/2022 05:27, Sebastian Trueg wrote:
> Hello,
> trying to consistently get nice git diffs for my turtle files I want to
> normalize pull requests. To that end I simply re-write the turtle files
> via RDFDataMgr. Sadly the result is not stable, at least the order of
> some objects changes from run to run.
> Is there a way to ensure that serializing the same set of triples
> (parsed from different formats) always results in the exact same output?
Not currently. It would be nice to have, and there are a few stable
writers around, but none has been contributed.
The W3C "RDF Dataset Canonicalization and Hash Working Group"
https://w3c.github.io/rch-wg-charter/
is about to start.
A derivative of the Turtle blocks format would be a good starting point.
Or do you also want all the "pretty" forms, such as nested [ ] in the
object position and lists "( ... )"? A "usually same output" writer could
reuse the core of the pretty printer, class ShellGraph (specifically
listSubjects()). Contributions welcome.
Having nested [ ] means a small change in the graph can lead to a big
change in the output.
The fun begins with blank nodes: reparsing the same file produces a
different graph, because the blank nodes get fresh labels. Different
labels hash differently, so a change to the blank nodes changes the
iteration order of everything in a hash-based index.
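To illustrate the point about ordering and blank nodes, here is a minimal sketch in plain Java (no Jena classes). It assumes the triples have already been written out as one N-Triples statement per string; the class and method names (StableNTriples, canonicalLines) are made up for this example. Sorting the lines makes the output independent of iteration order for ground triples, but it does nothing about blank node labels, which is exactly where a real canonicalization algorithm is needed.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class StableNTriples {

    // Return the distinct, non-empty lines in lexicographic order.
    // Assumption: each input string is one complete N-Triples statement.
    public static List<String> canonicalLines(Collection<String> ntLines) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String line : ntLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                sorted.add(trimmed);
            }
        }
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        // Same ground triples, delivered in a different order by two runs.
        List<String> run1 = List.of(
            "<http://example/s> <http://example/p> <http://example/o2> .",
            "<http://example/s> <http://example/p> <http://example/o1> .");
        List<String> run2 = List.of(
            "<http://example/s> <http://example/p> <http://example/o1> .",
            "<http://example/s> <http://example/p> <http://example/o2> .");
        // Sorting removes the dependence on iteration order.
        System.out.println(canonicalLines(run1).equals(canonicalLines(run2))); // true

        // Same structure, but the blank node got a different label on reparse.
        List<String> parse1 = List.of("_:b0 <http://example/p> \"x\" .");
        List<String> parse2 = List.of("_:b1 <http://example/p> \"x\" .");
        // Sorting alone cannot tell that these are the same graph.
        System.out.println(canonicalLines(parse1).equals(canonicalLines(parse2))); // false
    }
}
```

So a sort is enough for data without blank nodes, and a diff of the sorted lines stays small when the graph changes a little.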
There is work-in-progress on a new memory graph implementation, focused
on speed and memory efficiency. That doesn't mean we can't also have
another graph implementation that has a consistent return order for
Graph.find().
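As a rough idea of what "consistent return order" could mean, here is a small sketch in plain Java. It is not Jena's Graph interface: OrderedTripleIndex, its Triple class, and the use of plain strings for RDF terms are all assumptions made for illustration. The triples live in a TreeSet ordered by subject, predicate, object, so find() returns matches in the same order regardless of insertion order. It trades speed for determinism (every find is a full scan).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class OrderedTripleIndex {

    // Hypothetical minimal triple: plain strings stand in for RDF terms.
    public static final class Triple {
        public final String s, p, o;
        public Triple(String s, String p, String o) {
            this.s = s; this.p = p; this.o = o;
        }
        @Override public String toString() {
            return s + " " + p + " " + o + " .";
        }
    }

    // Total order on triples: subject, then predicate, then object.
    private static final Comparator<Triple> SPO =
        Comparator.comparing((Triple t) -> t.s)
                  .thenComparing(t -> t.p)
                  .thenComparing(t -> t.o);

    private final TreeSet<Triple> triples = new TreeSet<>(SPO);

    public void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    // Return matching triples in SPO order; null acts as a wildcard.
    public List<Triple> find(String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : triples) {
            if ((s == null || s.equals(t.s))
                    && (p == null || p.equals(t.p))
                    && (o == null || o.equals(t.o))) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        OrderedTripleIndex index = new OrderedTripleIndex();
        index.add("<s2>", "<p>", "<o>");
        index.add("<s1>", "<p>", "<o2>");
        index.add("<s1>", "<p>", "<o1>");
        // Always printed in SPO order, whatever the insertion order was.
        for (Triple t : index.find(null, "<p>", null)) {
            System.out.println(t);
        }
    }
}
```

Blank nodes still need stable labels before an ordering like this helps, so it addresses only half of the problem.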
Output from a TDB database is consistent until the next update occurs.
Andy
> I found https://github.com/buda-base/jena-stable-turtle but that seems
> to not be compatible with 4.5.0.
> Thanks,
> Sebastian