daniel added subscribers: Denny, vrandezo, Halfak.
daniel added a comment.

What I'm getting at is that folks have until now been studying article or other page content. Sure, there hasn't been other content available for them to examine, but I imagine that a vast majority of folks will still be interested primarily in article content and how it changes over time, as opposed to, say, also considering reverts of various structured data entries for media files. And folks looking at article reverts and expecting to just pick those up will get a bunch of extra entries if they rely on the rev sha1 once other slots have content in them.

In my experience, dump analysis that is interested in reverts typically doesn't care about the content at all. It analyzes how often reverts happen, how they happen, who does them, and how long content that is later reverted (and thus assumed to be "bad") remained visible to the public. All this is on the revision level and would break if we changed the semantics of the <sha1> tag. But we shouldn't guess how people use the hash; we should ask them. The trouble with that is that it takes time.
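To make the revision-level use of the hash concrete, here is a minimal sketch of identity-revert detection as I understand the common approach: a revision counts as a revert if its hash equals that of a recent earlier revision, and everything in between is treated as reverted. The `Revision` tuple, the `find_identity_reverts` name and the `radius` window are hypothetical and chosen for illustration only; they are not taken from any particular tool.

```python
from typing import List, NamedTuple, Tuple


class Revision(NamedTuple):
    """Hypothetical minimal view of one <revision> entry in a dump."""
    rev_id: int
    user: str
    sha1: str  # value of the revision-level <sha1> tag


def find_identity_reverts(
    revisions: List[Revision], radius: int = 15
) -> List[Tuple[Revision, Revision, List[Revision]]]:
    """Return (reverting, reverted_to, reverted) triples.

    A revision is treated as a revert when its hash matches that of one
    of the previous `radius` revisions and at least one other revision
    lies in between. Only hashes are compared, never content.
    """
    reverts = []
    for i, rev in enumerate(revisions):
        window_start = max(0, i - radius)
        # Walk backwards so the most recent matching revision wins.
        for j in range(i - 1, window_start - 1, -1):
            if revisions[j].sha1 == rev.sha1:
                reverted = revisions[j + 1:i]
                if reverted:  # skip consecutive duplicates (no-op edits)
                    reverts.append((rev, revisions[j], reverted))
                break
    return reverts


history = [
    Revision(1, "alice", "aaa"),
    Revision(2, "vandal", "bbb"),
    Revision(3, "bob", "aaa"),  # same hash as revision 1
]
result = find_identity_reverts(history)
```

In this sketch, revision 3 is reported as a revert to revision 1 with revision 2 as the reverted edit. If the hash covered only the main slot, an edit that touched only another slot would look like a no-op here; if it covers all slots, the same code would also report reverts of non-article content.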

But a quick search on Google Scholar turns up a few familiar names, like Denny Vrandecic, Aaron Halfaker, Luca de Alfaro. Denny in particular seems like a good candidate to provide insights, as an author of "Revisiting reverts: accurate revert detection in Wikipedia".

@Denny @vrandezo @Halfak, what's your take on this? Should the <sha1> tag in dumps continue to match the main slot's content, or continue to match the revision's entire content? We will have to break one of these two assumptions...

As to where to put the hash: In my opinion, the content hash that is a sha1 of the serialized content should be an attribute, just like the byte size. It relates to the serialized blob, and should thus be attached to the <text> tag.
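As a rough illustration of that placement, the sketch below builds a <text> element that carries both the byte size and a hash of the serialized blob as attributes. The `sha1` attribute name and the `build_text_element` helper are assumptions for illustration, and the hash is shown as a plain hex digest for simplicity, which may not match the encoding the dumps actually use.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_text_element(serialized: str) -> ET.Element:
    """Build a <text> element describing one serialized content blob.

    Both attributes are derived from the same UTF-8 bytes, so they
    describe the blob itself rather than the revision as a whole.
    """
    blob = serialized.encode("utf-8")
    elem = ET.Element("text")
    elem.set("bytes", str(len(blob)))
    # Hypothetical attribute name; hex digest used for simplicity.
    elem.set("sha1", hashlib.sha1(blob).hexdigest())
    elem.text = serialized
    return elem


elem = build_text_element("hello")
xml_fragment = ET.tostring(elem, encoding="unicode")
```

With this layout, each slot's <text> would carry its own content hash, leaving the revision-level <sha1> tag free to keep whichever meaning is decided on.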


TASK DETAIL
https://phabricator.wikimedia.org/T199121
