Halfak added a subscriber: FaFlo. Halfak added a comment.
Seems to me that each slot should have its own sha1. There's a huge amount of research on non-article content in Wikipedia. I imagine that analysts will be interested in identity reverts (the most common type of revert, which is detectable with sha1 histories) in main and any other slots. Having one sha1 represent some concatenation of all the slots isn't as useful as having a sha1 per slot. If I wanted to look for identity reverts across all slots, then I'd simply concatenate the sha1s myself during my analysis.
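To make the per-slot argument concrete, here is a minimal sketch of identity revert detection over sha1 histories. Everything in it is illustrative: the slot name "meta", the toy revision texts, and the helper names are assumptions, the checksums are hex digests rather than whatever encoding the dumps use, and real analyses usually also limit how far back they look for a matching revision.

```python
import hashlib


def sha1_hex(text):
    """Hex SHA-1 of a slot's text (hypothetical helper)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def combined_sha1(slot_sha1s):
    """Derive one checksum from per-slot checksums during analysis.

    Slots are ordered by name so the result is deterministic.
    """
    joined = "".join(slot_sha1s[name] for name in sorted(slot_sha1s))
    return hashlib.sha1(joined.encode("ascii")).hexdigest()


def identity_reverts(checksums):
    """Return (reverting_index, reverted_to_index) pairs.

    A revision counts as an identity revert when its checksum matches an
    earlier revision and at least one revision lies in between.
    """
    last_seen = {}
    reverts = []
    for i, checksum in enumerate(checksums):
        if checksum in last_seen and last_seen[checksum] < i - 1:
            reverts.append((i, last_seen[checksum]))
        last_seen[checksum] = i
    return reverts


# Toy history: the main slot is vandalised and restored, while the
# (hypothetical) "meta" slot changes in the restoring revision.
history = [
    {"main": "good text", "meta": "x"},
    {"main": "vandalism", "meta": "x"},
    {"main": "good text", "meta": "y"},
]
per_revision = [{slot: sha1_hex(text) for slot, text in rev.items()} for rev in history]

main_reverts = identity_reverts([rev["main"] for rev in per_revision])
meta_reverts = identity_reverts([rev["meta"] for rev in per_revision])
all_slot_reverts = identity_reverts([combined_sha1(rev) for rev in per_revision])

print(main_reverts, meta_reverts, all_slot_reverts)
```

In this toy history the main-slot revert is only visible in the per-slot checksums. The combined checksum can still be derived from them when needed, but not the other way around.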
As for where the sha1s exist in the XML, I'm not sure I have a strong opinion there. It's hard to work out what is being proposed from this enormous Phab task. But from the wiki page, I see <sha1> tags in each <content> "slot" and that makes sense to me. I'm not very worried about having the <sha1> tag because there's already a substantial change to the content structure being proposed.
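For what it's worth, reading a <sha1> out of each <content> element would be simple on the consumer side. A rough sketch, assuming a made-up fragment: only the <content> and <sha1> element names come from the wiki page as I read it, while <revision>, <role>, and the checksum values are placeholders and not the proposed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element names other than <content> and <sha1>
# are assumptions for illustration only.
SAMPLE = """
<revision>
  <content><role>main</role><sha1>aaa111</sha1></content>
  <content><role>meta</role><sha1>bbb222</sha1></content>
</revision>
"""


def slot_sha1s(revision):
    """Map each slot's role to the sha1 found inside its <content> element."""
    return {
        content.findtext("role"): content.findtext("sha1")
        for content in revision.findall("content")
    }


sample_slots = slot_sha1s(ET.fromstring(SAMPLE))
print(sample_slots)
```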
This is maybe beside the point, but I have many issues with the claims of Flock et al.'s Revisiting Reverts paper, which seems to be driving research practice away from the use of checksums. In my expert opinion, checksum-based revert detection will remain an important measurement strategy for a long time to come, while fine-grained content persistence approaches like those developed by me and by Flock et al. will grow in parallel and not supersede the use of sha1s. I think that @FaFlo, @Denny, and I could have a lot more words about this, so I'd encourage taking that off task if y'all are interested in discussing the future of revert detection generally. For the purposes of this task, I think we can agree on sha1s having lasting value for analytics and research broadly.
