[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-12 Thread Halfak
Halfak added a subscriber: mako.
Halfak added a comment.

Aha! I don't think anyone is generally comparing the content text to a specific checksum for any reason (except some old studies by @mako to check whether the checksums were consistent historically; they aren't). So, I'm a fan of the combined checksum showing up in place of the main slot's checksum, especially if this is already how it is implemented in the DB with rev_sha1.

TASK DETAIL
https://phabricator.wikimedia.org/T199121

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, Halfak
Cc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread Halfak
Halfak added a subscriber: FaFlo.
Halfak added a comment.

Seems to me that each slot should have its own sha1. There's a huge amount of research on non-article content in Wikipedia. I imagine that analysts will be interested in identity reverts (the most common type of revert, and one that is detectable with sha1 histories) in main and any other slots. Having one sha1 represent some concatenation of all the slots isn't as useful as having a sha1 per slot. If I wanted to look for identity reverts across all slots, then I'd simply concatenate the sha1s myself during my analysis.
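
To make the per-slot argument concrete, here is a minimal sketch of identity-revert detection over sha1 histories. It is an illustration only, not how any existing dump tool works: the function names, the `history` structure, and the slot names "main" and "data" are all hypothetical, and plain hex SHA-1 digests stand in for whatever encoding the dumps actually use.

```python
import hashlib


def sha1_hex(text):
    """Hex SHA-1 of a string (stand-in for a slot or revision checksum)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_identity_reverts(checksums):
    """Return (reverting_index, reverted_to_index) pairs.

    A revision counts as an identity revert when its checksum equals that of
    an earlier revision and at least one different revision lies in between.
    """
    last_seen = {}
    reverts = []
    for i, checksum in enumerate(checksums):
        j = last_seen.get(checksum)
        if j is not None and i - j > 1:
            reverts.append((i, j))
        last_seen[checksum] = i
    return reverts


def combined_checksum(slot_checksums):
    """Combine per-slot checksums (hypothetical scheme: sorted role:sha1 lines)."""
    joined = "\n".join(
        f"{role}:{slot_checksums[role]}" for role in sorted(slot_checksums)
    )
    return sha1_hex(joined)


# Hypothetical page history with two slots per revision.
history = [
    {"main": "A", "data": "x"},
    {"main": "B", "data": "x"},  # main slot changed
    {"main": "A", "data": "x"},  # main slot restored
    {"main": "A", "data": "y"},  # data slot changed
    {"main": "A", "data": "x"},  # data slot restored
]

per_slot = [{role: sha1_hex(text) for role, text in rev.items()} for rev in history]

main_reverts = find_identity_reverts([rev["main"] for rev in per_slot])
data_reverts = find_identity_reverts([rev["data"] for rev in per_slot])
all_reverts = find_identity_reverts([combined_checksum(rev) for rev in per_slot])

print(main_reverts, data_reverts, all_reverts)
```

With per-slot checksums the analyst can derive the combined view, but not the other way around: a single combined checksum cannot tell a revert of the main slot apart from a revert of some other slot.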

As for where the sha1s exist in the XML, I'm not sure I have a strong opinion there. It's hard to work out what is being proposed from this enormous Phab task. But from the wiki page, I see <sha1> tags in each "slot", and that makes sense to me. I'm not very worried about having the tag, because there's already a substantial change to the content structure being proposed.

This is maybe beside the point, but I have many issues with the claims of Flock et al.'s Revisiting Reverts paper, which seems to be driving research practice away from the use of checksums. In my expert opinion, checksum-based revert detection will remain an important measurement strategy for a long time, while fine-grained content persistence approaches like those developed by myself and Flock et al. will grow in parallel and not supersede the use of sha1s. I think that @FaFlo, @Denny, and I could have a lot more words about this, so I'd encourage taking that off task if y'all are interested in discussing the future of revert detection generally. For the purposes of this task, I think we can agree on sha1s having lasting value for analytics and research broadly.


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added subscribers: Denny, vrandezo, Halfak.
daniel added a comment.

What I'm getting at is that folks have until now been studying article or other page content. Sure, there hasn't been other content available for them to examine, but I imagine that the vast majority of folks will still be interested primarily in article content and how it changes over time, as opposed to, say, also considering reverts of various structured data entries for media files. And folks looking at article reverts and expecting to just pick those up will get a bunch of extra entries if they rely on the rev sha1 once other slots have content in them.

In my experience, dump analysis that is interested in reverts typically doesn't care about the content at all. It analyzes how often reverts happen, how they happen, who does them, and how long content that is later reverted (and thus assumed to be "bad") remained visible to the public. All this is on the revision level and would break if we changed the semantics of the <sha1> tag. But we shouldn't guess how people use the hash; we should ask them. The trouble with that is that it takes time.

But a quick search on Google Scholar turns up a few familiar names, like Denny Vrandecic, Aaron Halfaker, and Luca de Alfaro. Denny in particular seems like a good candidate to provide insights, as an author of "Revisiting reverts: accurate revert detection in Wikipedia".

@Denny @vrandezo @Halfak, what's your take on this? Should the <sha1> tag in dumps continue to match the main slot's content, or continue to match the revision's entire content? We will have to break one of these two assumptions...

As to where to put the hash: in my opinion, the content hash that is a sha1 of the serialized content should be an attribute, just like the byte size. It relates to the serialized blob, and should thus be attached to the <text> tag.
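
As a rough sketch of the "hash as an attribute, just like the byte size" idea: both values are computed from the same serialized blob and attached to the element that carries it. The element name `text` and the attribute names `bytes` and `sha1` are assumptions made for illustration, the helper `make_text_element` is hypothetical, and a plain hex digest is used rather than any particular dump encoding.

```python
import hashlib
import xml.etree.ElementTree as ET


def make_text_element(serialized):
    """Build an element whose attributes describe the serialized blob."""
    blob = serialized.encode("utf-8")
    elem = ET.Element("text")  # assumed element name
    elem.set("bytes", str(len(blob)))  # size of the serialized blob in bytes
    elem.set("sha1", hashlib.sha1(blob).hexdigest())  # hash of the same blob
    elem.text = serialized
    return elem


elem = make_text_element("Hello, wörld")
print(ET.tostring(elem, encoding="unicode"))
```

Keeping both values on the same element makes it explicit that they describe the serialized bytes of one slot, not the revision as a whole.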


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a subscriber: tstarling.
daniel added a comment.

@ArielGlenn Thanks for setting this up!

First thing I noticed: I think it would be a bad idea to expose the numeric IDs of content models and slot roles. These numeric IDs are strictly internal to the storage layer. They are not even to be used anywhere in the application logic. The numeric IDs are an implementation detail of a normalization mechanism in the storage layer, and should in no way be part of any public interface.
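
A small sketch of why names travel better than numeric IDs. It assumes two hypothetical wikis whose storage layers happened to assign different numbers to the same role and model names; the tables, the names "data" and "json", and the helper functions are invented for illustration and do not reflect the actual MediaWiki schema.

```python
# Hypothetical normalization tables: the same names, different internal IDs.
WIKI_A = {
    "roles": {"main": 1, "data": 2},
    "models": {"wikitext": 1, "json": 2},
}
WIKI_B = {
    "roles": {"data": 1, "main": 2},
    "models": {"json": 1, "wikitext": 2},
}


def invert(name_to_id):
    """Turn a name -> id table into an id -> name table."""
    return {number: name for name, number in name_to_id.items()}


def export_slot(row, wiki):
    """Translate internal numeric IDs to names before anything leaves the wiki."""
    return {
        "role": invert(wiki["roles"])[row["role_id"]],
        "model": invert(wiki["models"])[row["model_id"]],
    }


def import_slot(exported, wiki):
    """Resolve names against the importing wiki's own internal IDs."""
    return {
        "role_id": wiki["roles"][exported["role"]],
        "model_id": wiki["models"][exported["model"]],
    }


row_on_a = {"role_id": 2, "model_id": 2}  # a "data" slot holding "json" on wiki A
exported = export_slot(row_on_a, WIKI_A)
row_on_b = import_slot(exported, WIKI_B)

print(exported, row_on_b)
```

Had the export carried `role_id: 2` directly, wiki B would have read it as "main"; exporting names avoids that mismatch entirely.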

In my opinion, the same should have been true for namespace IDs, btw. Exposing their numeric IDs introduces incompatibilities between wikis that would otherwise simply not exist. I recall @tstarling also arguing in that direction.


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a subscriber: awight.
ArielGlenn added a comment.

Adding @awight as an interested party (who works on e.g. the mw vagrant dumps role).


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a subscriber: hoo.
ArielGlenn added a comment.

@hoo I am adding you on the guess that you will want to weigh in on the new schema. If this is outside your interest, go ahead and take yourself off.


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-02 Thread ArielGlenn
ArielGlenn added a subscriber: brion.
ArielGlenn added a comment.

Adding @brion as someone who knows these schemas well (thanks in advance!)