Hi!

> My view is that this tool should be extremely cautious when it sees new data
> structures or fields. The tool should certainly not continue to output
> facts without some indication that something is suspect, and preferably
> should refuse to produce output under these circumstances.

I don't think I agree. I find tools that are too picky about details that are not important to me hard to use, and I'd very much prefer a tool where I am in control of which information I need and which I don't.

> What can happen if the tool instead continues to operate without complaint
> when new data structures are seen? Consider what would happen if the tool
> was written for a version of Wikidata that didn't have rank, i.e., claim
> objects did not have a rank name/value pair. If ranks were then added,
> consumers of the output of the tool would have no way of distinguishing
> deprecated information from other information.

Ranks are a bit unusual because adding ranks is not just an informational change, it's a semantic change. It introduces a kind of statement that has different semantics from the rest. Of course, such a change needs to be communicated. It's as if I made a format change, "each string beginning with the letter X needs to be read backwards", but didn't tell the clients. Of course this is a breaking change, since it changes semantics. What I was talking about are changes that don't break semantics, and the majority of additions are just that.

> Of course this is an extreme case. Most changes to the Wikidata JSON dump
> format will not cause such severe problems. However, given the current
> situation with how the Wikidata JSON dump format can change, the tool cannot
> determine whether any particular change will affect the meaning of what it
> produces. Under these circumstances it is dangerous for a tool that
> extracts information from the Wikidata JSON dump to continue to produce
> output when it sees new data structures.

The tool cannot. It's not possible to write a tool that would derive semantics just from the JSON dump, or even detect semantic changes. Semantic changes can be anywhere; they don't have to take the form of an additional field. They can also be a change in the meaning of a field, or in its format, or datatype, etc.
Of course the tool cannot know that; people should know it and communicate it. Again, that's why I think we need to distinguish changes that break semantics from changes that don't, and make the tools robust against the latter, but not the former, because that's impossible. For dealing with the former, there is a known and widely used solution: format versioning.

> This does make consuming tools sensitive to changes to the Wikidata JSON
> dump format that are "non-breaking". To overcome this problem there should
> be a way for tools to distinguish changes to the Wikidata JSON dump format
> that do not change the meaning of existing constructs in the dump from those
> that can. Consuming tools can then continue to function without problems
> for the former kind of change.

As I said, format versioning. Maybe even semver or some suitable modification of it. RDF exports, BTW, already carry a version. Maybe JSON exports should too.

-- 
Stas Malyshev
smalys...@wikimedia.org
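
To illustrate the format-versioning idea discussed above, here is a minimal Python sketch of how a consuming tool might behave. It rests on assumptions: the JSON dump does not carry a version today, so the version strings, the `SUPPORTED` constant, and the helper names `parse_version`, `can_consume` and `extract_claim` are all hypothetical. The rule shown is the usual semver reading: a different major version signals a semantic (breaking) change and is refused, while unknown fields arriving within the same major version are simply ignored.

```python
from typing import Tuple

# Hypothetical: the format version this consumer was written against.
SUPPORTED = "1.2.0"


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def can_consume(dump_version: str, supported: str = SUPPORTED) -> bool:
    """Accept any dump with the same major version as the supported one.

    Under semver, only a major bump may change the meaning of existing
    constructs, so minor and patch differences are treated as safe.
    """
    return parse_version(dump_version)[0] == parse_version(supported)[0]


def extract_claim(claim: dict) -> dict:
    """Keep only the fields this consumer understands; ignore the rest."""
    known = ("id", "rank", "mainsnak")  # assumed set of fields of interest
    return {key: claim[key] for key in known if key in claim}


if __name__ == "__main__":
    print(can_consume("1.3.0"))  # same major version: accepted
    print(can_consume("2.0.0"))  # major bump: refused
    print(extract_claim({"id": "Q1$x", "rank": "normal", "newfield": 1}))
```

With this split, a non-breaking addition such as `newfield` does not stop the tool, while a semantic change is communicated by people through the major version rather than guessed from the data.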