[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread ArielGlenn
ArielGlenn added a comment.


  In T199121#5684397 , 
@daniel wrote:
  
  > In T199121#5684250 , 
@ArielGlenn wrote:
  >
  >> 
  
  ...
  
  >> https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/473222/ which I have 
not looked at for even a second (sorry!)
  >
  > Ah, right - that would allow people to try out 0.11 in Special:Export 
before we make it the default. It doesn't prevent us from generating dumps in 
0.11.
  > The big question is - do we need to provide both for a while, so people 
have time to adjust to 0.11? It's technically a breaking change.
  
  If only we had enough hardware/short enough runtimes that we *could* provide 
them both for awhile. But as things are, no go on that. We'll just switch when 
the time comes.
  
  > On the other hand, most consumers wouldn't even notice. I feel we could 
just switch over. Who owns that decision?
  
  As usual no one has given anyone any decision making powers so I'll claim it 
:-P  I should make a task, announce it on the mailing lists, give  a 2 month 
window, and then we should flip the switch. Given the various holidays coming 
up I guess that makes it end of Jan, though I would vote for letting folks get 
back from the staff meeting first.

TASK DETAIL
  https://phabricator.wikimedia.org/T199121

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn
Cc: Nuria, mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, 
JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, 
ArielGlenn, daniel, darthmon_wmde, DannyS712, Nandana, kostajh, Lahi, Gq86, 
Ramsey-WMF, GoranSMilovanovic, RazeSoldier, Lunewa, QZanden, V4switch, 
LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, D3r1ck01, Scott_WUaS, Izno, 
SBisson, Wong128hk, gnosygnu, Perhelion, Wikidata-bugs, matthiasmullie, aude, 
GWicke, Bawolff, jayvdb, fbstj, Fabrice_Florin, santhosh, Jdforrester-WMF, 
Ladsgroup, Matanya, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread daniel
daniel added a comment.


  In T199121#5684250 , 
@ArielGlenn wrote:
  
  > In T199121#5684237 , 
@daniel wrote:
  >
  >> In T199121#5683594 , 
@ArielGlenn wrote:
  >>
  >>> Not yet. There are unmerged patches yet, and we'll need to announce well 
in advance before rolling out the change.
  >>
  >> I am not aware of any unmerged patches blocking the use of the 0.11 
schema. Can you point me to them?
  >
  > https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/473222/ which I have 
not looked at for even a second (sorry!)
  
  Ah, right - that would allow people to try out 0.11 in Special:Export before 
we make it the default. It doesn't prevent us from generating dumps in 0.11.
  
  The big question is - do we need to provide both for a while, so people have 
time to adjust to 0.11? It's technically a breaking change.
  
  On the other hand, most consumers wouldn't even notice. I feel we could just 
switch over. Who owns that decision?

TASK DETAIL
  https://phabricator.wikimedia.org/T199121

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, daniel
Cc: Nuria, mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, 
JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, 
ArielGlenn, daniel, darthmon_wmde, DannyS712, Nandana, kostajh, Lahi, Gq86, 
Ramsey-WMF, GoranSMilovanovic, RazeSoldier, Lunewa, QZanden, V4switch, 
LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, D3r1ck01, Scott_WUaS, Izno, 
SBisson, Wong128hk, gnosygnu, Perhelion, Wikidata-bugs, matthiasmullie, aude, 
GWicke, Bawolff, jayvdb, fbstj, Fabrice_Florin, santhosh, Jdforrester-WMF, 
Ladsgroup, Matanya, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread ArielGlenn
ArielGlenn added a comment.


  In T199121#5684237 , 
@daniel wrote:
  
  > In T199121#5683594 , 
@ArielGlenn wrote:
  >
  >> In T199121#5682911 , 
@Nuria wrote:
  >>
  >>> I see this ticket is resolved but the dumps on commons have version  
version="0.10" since from this ticket i gather that the dumps that contain 
those slots are version=11 , are those being produced?
  >>
  >> Not yet. There are unmerged patches yet, and we'll need to announce well 
in advance before rolling out the change.
  >
  > I am not aware of any unmerged patches blocking the use of the 0.11 schema. 
Can you point me to them?
  
  https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/473222/ which I have not 
looked at for even a second (sorry!)

TASK DETAIL
  https://phabricator.wikimedia.org/T199121

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn
Cc: Nuria, mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, 
JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, 
ArielGlenn, daniel, darthmon_wmde, DannyS712, Nandana, kostajh, Lahi, Gq86, 
Ramsey-WMF, GoranSMilovanovic, RazeSoldier, Lunewa, QZanden, V4switch, 
LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, D3r1ck01, Scott_WUaS, Izno, 
SBisson, Wong128hk, gnosygnu, Perhelion, Wikidata-bugs, matthiasmullie, aude, 
GWicke, Bawolff, jayvdb, fbstj, Fabrice_Florin, santhosh, Jdforrester-WMF, 
Ladsgroup, Matanya, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread daniel
daniel added a comment.


  In T199121#5683594 , 
@ArielGlenn wrote:
  
  > In T199121#5682911 , 
@Nuria wrote:
  >
  >> I see this ticket is resolved but the dumps on commons have version  
version="0.10" since from this ticket i gather that the dumps that contain 
those slots are version=11 , are those being produced?
  >
  > Not yet. There are unmerged patches yet, and we'll need to announce well in 
advance before rolling out the change.
  
  I am not aware of any unmerged patches blocking the use of the 0.11 schema. 
Can you point me to them?

TASK DETAIL
  https://phabricator.wikimedia.org/T199121

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, daniel
Cc: Nuria, mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, 
JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, 
ArielGlenn, daniel, darthmon_wmde, DannyS712, Nandana, kostajh, Lahi, Gq86, 
Ramsey-WMF, GoranSMilovanovic, RazeSoldier, Lunewa, QZanden, V4switch, 
LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, D3r1ck01, Scott_WUaS, Izno, 
SBisson, Wong128hk, gnosygnu, Perhelion, Wikidata-bugs, matthiasmullie, aude, 
GWicke, Bawolff, jayvdb, fbstj, Fabrice_Florin, santhosh, Jdforrester-WMF, 
Ladsgroup, Matanya, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-21 Thread ArielGlenn
ArielGlenn added a comment.


  In T199121#5682911 , 
@Nuria wrote:
  
  > I see this ticket is resolved but the dumps on commons have version  
version="0.10" since from this ticket i gather that the dumps that contain 
those slots are version=11 , are those being produced?
  
  Not yet. There are unmerged patches yet, and we'll need to announce well in 
advance before rolling out the change.

TASK DETAIL
  https://phabricator.wikimedia.org/T199121

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn
Cc: Nuria, mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, 
JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, 
ArielGlenn, daniel, darthmon_wmde, DannyS712, Nandana, kostajh, Lahi, Gq86, 
Ramsey-WMF, GoranSMilovanovic, RazeSoldier, Lunewa, QZanden, V4switch, 
LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, D3r1ck01, Scott_WUaS, Izno, 
SBisson, Wong128hk, gnosygnu, Perhelion, Wikidata-bugs, matthiasmullie, aude, 
GWicke, Bawolff, jayvdb, fbstj, Fabrice_Florin, santhosh, Jdforrester-WMF, 
Ladsgroup, Matanya, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-29 Thread ArielGlenn
ArielGlenn added a comment.
I've updated the rfc proposal removing all 'use=optional' markup as described above. This should probably get brought up during last call at this week's TechCom to make sure everyone's ok with it.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-22 Thread ArielGlenn
ArielGlenn added a comment.
I've added the Sha1 as an attribute of the Text element in all slots; this means removing it as an element from the Content element. I think the new markup reflects these changes correctly, please double-check though.

As to 'optional' or 'required', I'd like to remove all notations of 'optional' and add a comment near the top of the schema reminding folks that attributes by default are optional.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-19 Thread kchapman
kchapman added a comment.
TechCom has moved Last Call to end Wednesday 31 October 2pm PST (21:00 UTC, 22:00 CET). This is due to no TechCom meeting next week to vote on final approval.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, kchapmanCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-18 Thread daniel
daniel added a comment.
Looks good to me!TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-18 Thread ArielGlenn
ArielGlenn added a comment.
OK, it looks like everyone's weighed in. so I'll suggest:


  308722154
  ...
  text/x-wiki
 <-- contains sha1 of content in main slot
  a9kdtqq3buy5tribez2u0ad4b6fdxq2   <-- revision sha1

I'm not excited about the attribute but I like it better than an extra tag.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-16 Thread Halfak
Halfak added a comment.
For clarity, I was originally advocating that we didn't combine any hashes and that instead we provided a  tag in each of the  slots.  I now see that we're going to make a mess in favor of backwards compatibility.  So there will continue to be a  tag and  at the top of the  tree anyway.  (Note that this now differs from the mwapi structure that is forcing everything into "content" and throwing loud warnings if you expect text to appear where it used to be).   In that case, I'd like the structure of the mediawiki database to be reflected here.  E.g. 's  should reflect the rev_sha1 field.

As for the sha1 of the main slot ('s ), I don't have a strong opinion.  Attributes and tags require the same code complexity to deal with when processing the XML dumps.  If you decide on a tag, I do think that "content-sha1" is needlessly ambiguous.  It could be "text-sha1" to reflect the name of the  tag or maybe "main-sha1" to reflect the concept of the "main slot".TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, HalfakCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-15 Thread daniel
daniel added a comment.
As the above conversation seems to converge on having the combined hash in the revision level  tag, this raises the question where to put the main slot's content hash. I personally prefer to use an attribute on the  tag, but we could also go for  or .TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-15 Thread daniel
daniel added a comment.
@Halfak to clarify - you originally said the top level  tag should be the hash of the main slot's content, but you now let yourself be convinced that it's more useful to have the combined revision hash there?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-12 Thread mako
mako added a comment.
I think it would be more surprising to have the same SHA1 for two different (in any way) revisions than it would to have a SHA1 that reflects something other than the SHA1 of the primary/historical content field. So I guess I like the revision-level 
 reflecting the hash of all parts of the revision.

As @Halfak sugguests, checksums in the dumps currently aren't always reproducible from the text of all dumps (admittedly, these exceptions are rare and this fact is not widely known) so it would have been unwise to have been counting on this in the past.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, makoCc: mako, FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-10 Thread Tgr
Tgr added a comment.
@Halfak the questions is: given that XML readers are somewhat flexible, so a script that was written some time ago and has no knowledge of MCR will still be able to read MCR dumps and see all the fields it expects, and it will quietly ignore non-main slots as unknown fields, what should such a script see at the location where it looks for the sha1 of the revision?

We could put the revision sha1 there, in which case it would be an accurate flag for reverts but would not match the actual sha1 of the "revision content" (as far as such a legacy script understands, "revision content" is just the main slot) when there are additional slots. Also diffs that the script expects to be non-empty (since the sha1 is different) might in fact be empty (since there is no change in the main slot).

Or we could put the main slot sha1 there, in which case it would continue to be the sha1 of the "revision content" (as a legacy script understands it) and the sha1 would change if and only if the "revision content" (main slot) changes. But there could be two successive revisions with the same sha1 (which wasn't possible before) and in general there would be lots and lots of revisions with the same sha1 (when someone edits a slot but not the main revision) so the sha1 would be rather useless for revert detection.

The question is, which is less likely to break or confuse old clients? Now that I wrote this up it seems pretty obvious that the first option (using the combined sha1 in the place of the old sha1) is superior.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, TgrCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-10 Thread Halfak
Halfak added a comment.
re. the "revision hash", it seems that this has already been determined so I'm not sure what other insights I might give.  But FWIW, the combined rev_sha1 seems very crazy :)

If there is combined rev_sha1 that is built by any strategy (crazy or not) from the database, then it should probably also exist in the XML.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, HalfakCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-10 Thread ArielGlenn
ArielGlenn added a comment.

In T199121#4643642, @daniel wrote:
Re the ìd`attribute being optional or not: turns out, it's optional already: The "use" attribute of the  element in an xml schema is indeed optional, and its default value is "optional" :) The fact that this is declared for all other attributes but omitted for id is confusing, though, and should be fixed.


On the topic of 'optional' vs 'required' attributes, we have 13 attributes in the 0.10 schema, 2 of which are required, 7 of which are marked as optional explicitly and 4 of which are unmarked and so optional anyways. If we are going to clean things up we should either mark them all or remove 'use="optional"' from all. The mixture we currently have is subject to misinterpretation, as you point out.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment.
What is the "revision hash"?

A combined hash the identifies the content of the slot across all revisions. It's stored in rev_sha1 in the database, and used by stuff on labs to detect manual reverts. It's also available from the API as rvprop=sha1.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread Halfak
Halfak added a comment.
where do we put the revision hash?

What is the "revision hash"?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, HalfakCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread Tgr
Tgr added a comment.
Just add a  tag?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, TgrCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment.
Ok, so if we use the  tag on the revision level for the main tag hash, where do we put the revision hash? Or do we just ignore it? Note that the revision has is in the database, and is exposed via the API. It's not made up for the purpose of the dumps.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread FaFlo
FaFlo added a comment.
Of course checksums make lot of of sense for countless use cases, including many in research (mentioned paper was never intended to make a sweeping point to the contrary, but yes, discussion for another time).   
And I think MCR is awesome, JFTR.

Regarding the question at hand, I would feel that "this-is-the-text-tag-hash semantics" would provide more value (from my research-bubble perspective), even if it breaks convention. Then, every single part has an individual sha (just as before with the single text part, now there are just more parts). I subscribe to the we-are-breaking-it-anyway view.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, FaFloCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment.
Well, we are trying to keep compatibility for most clients that just ignore stuff they don't know. They would still be able to process the dumps as before. Removing the  tag would be cleaner and safer, but it would also be a hard B/C break. So, IF we keep it, should we keep it with the this-is-the-revision-hash semantics? or with the this-is-the-text-tag-hash semantics?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread Halfak
Halfak added a comment.
Oh.  I guess my sense is to drop the existing  in favor of the 's related to individual content slots.   It doesn't make sense anymore, and we're breaking the schema anyway to add new content slots.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, HalfakCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment.
@Halfak having per-slot hashes is not controversial, the question is what to do with the  tag that currently exists on the revision level. If we make this the has of the main slot, we will break the assumption that two revisions that have the same hash there have the same content, and thus constitute a revert. Any consumer making that assumption would then miss-detect reverts.

To me, it seems better to use the existing  for the combined revision-level hash, and have a new hash attribute in each of the  tags, because this would avoid braking the "same hash, same content, must be a revert" assumption. But it would break the "this hash is a hash of the bytes the text text tag" assumption. But I don't know of any consumer that actually relies on this.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: FaFlo, Halfak, vrandezo, Denny, kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread ArielGlenn
ArielGlenn added a comment.

In T199121#4643036, @daniel wrote:



...

Anything using the existing revision level sha1 for revert detection will miss-detect a revert (or a null-edit) for *all* revisions that did not affect the main slot.  While analysis on the slot level may be useful, existing analysis is on the revision level (by definition - slots are new). So it seems reasonable to keep revision-level semantics intact.

Whatever we do, we should definitely include both hashes (main slot and revision), to make the distinction obvious, and the path forward clear, and give consumers the option to change their code to consistently do the thing they want. Only if both hashes are available can we support both kinds of consumers - those that focus on the revision as a whole, and those that focus on individual slots.

Yes, I think we do need and want both sha1 values in the output.

A more philosophical reason to maintain the revision level logic: the "content of the revision" is the combined content of all slots. The intended semantics is not "slots are optional miscellany bits and pieces". The intended semantics is "the revision's content consists of multiple slots".

So, in essense: we would break the assumption that two revisions that have the same hash have the same content. We'd also break consistency with revision hashes reported the API.

I understand and agree with the idea that a reivision consists of the content of all of its slots.
What I'm getting at is that folks have until now been studying article or other page content. Sure, there hasn't been other content available for them to examine, but I imagine that a vast majority of folks will still be interested primarily in article content and how it changes over time, as opposed to , say, considering reverts also of various structured data entries for media files. And folks looking at article reverts and expecting to just pick those up will get a bunch of extra entries if they rely on the rev sha1 once other slots have content in them. It's not unreasonable to ask folks who are interested in the content *also* of other slots to check those sha1s specifically, or to look at a new entry which contains *specifically* the rev sha1.

The problem is that there's no nested level where we can put another sha1 tag. We could introduce a new element revsha1, though I don't like it much. We could move both sha1s into attributes of their respective tags (text and revision), which I like even less.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment.
Re the ìd`attribute being optional or not: turns out, it's optional already: The "use" attribute of the  element in an xml schema is indeed optional, and its default value is "optional" :) The fact that this is declared for all other attributes but omitted for id is confusing, though, and should be fixed.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment.
I can't imagine any script is going to care about the revision sha1 as separate from the sha1 of the content of each slot separately; if you want to know if an edit was reverted, you can look at the sha1 of the individual slots, and I imagine that much analysis will focus on slot-level activity rather than the set of slots for a given revision.

Anything using the existing revision level sha1 for revert detection will miss-detect a revert (or a null-edit) for *all* revisions that did not affect the main slot.  While analysis on the slot level may be useful, existing analysis is on the revision level (by definition - slots are new). So it seems reasonable to keep revision-level semantics intact.

Whatever we do, we should definitely include both hashes (main slot and revision), to make the distinction obvious, and the path forward clear, and give consumers the option to change their code to consistently do the thing they want. Only if both hashes are available can we support both kinds of consumers - those that focus on the revision as a whole, and those that focus on individual slots.

A more philosophical reason to maintain the revision level logic: the "content of the revision" is the combined content of all slots. The intended semantics is not "slots are optional miscellany bits and pieces". The intended semantics is "the revision's content consists of multiple slots".TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment.
(Sorry for the near-stream-of-consciousness updates here, just trying to Get **It Done.) To get the sha1 discussion started:

I can't imagine any script is going to care about the revision sha1 as separate from the sha1 of the content  of each slot separately; if you want to know if an edit was reverted, you can look at the sha1 of the individual slots, and I imagine that much analysis will focus on slot-level activity rather than the set of slots for a given revision. Until we have content in multiple slots, the value will be the same, so it, like everything else, will become an issue next month when Structured Data on Commons goes live. For that reason I'd put the main slot content sha1 in the revision sha1 tag and not even bother with the computed revision sha1 described above.

Please poke holes in this so that we can reach agreement as soon as possible :-)TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment.
Note that text location and id attrs might be omitted in the case where the text is deleted (in practice, this is what the code does), and yet id has never been marked as optional, so I am not marking location as optional either.

I'd rather argue that the spec should be fix to reflect reality. I see no reason to continue spreading a lie ;)TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment.
https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps#Schema This has now been updated.


In T199121#4641212, @daniel wrote:
Side note re the id attribute becoming optional: hasn't it always be (formally) optional, because it was only emitted for stubs?


Note that text location and id attrs might be omitted in the case where the text is deleted (in practice, this is what the code does), and yet id has never been marked as optional, so I am not marking location as optional either. It might not pay to look too closely at the innards of these things, rather like sausages...

Next up is dealing with the revision vs main slot sha1.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment.
Side note re the id attribute becoming optional: hasn't it always be (formally) optional, because it was only emitted for stubs?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment.
I'll make the changes agreed upon in last night's meeting to the RFC a bit later today and will note here when they are done.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment.
Following up on the deduplication issue raised above:

The main concern about bloat with the new schema, as I understand it, is that it may be common for only one slot's content to change, in which case we don't want to write out the content for the other slots. This is easy enough to deal with, by just keeping the information for the last revision's slots and texts around when writing out the metadata (stubs) or the full dump in the case of Special:Export; if the slot origin of the new content is the slot revision id we read/request and write the content, otherwise we know it's a duplicate and we write an empty text tag which contains, say, , akin to the tag we generate in stubs for texts which have deleted content, i.e. .  There's no need to do full deduplication.

This does mean we have entries for  content, origin (required so that the dumps consumer knows which revision contains the full copy of the content), role, model, format, sha1 for every slot's content regardless, or about 250 extra bytes per entry. For 52 million pages, as we currently have on wikidatawiki, that adds up to 13G or so uncompressed, which is pretty reasonable.

I also agree that this should be version 12, but with an eye on growth in size (and slowness) of dumps production; if we have a bunch of bots inserting structured data on Commons and there's a lot of bloat all of a sudden, version 12 might need to be moved up earlier than we plan.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment.
Quick addendum to @Tgr's last point:

in theory a lot of resources could be saved if identical slot contents are only written out once (they will be a very frequent occurrence due to reverts)

Reverts are not a new problem, and not the largest problem. Inherited slots are: If a page has three slots, but only one of the is frequently edited, the content of the other two will be copied into the output over and over, for every revision. In the database, this is deduplicated using the "inheritance" mechanism, but in the dump, no such mechanism exists yet.


not so much in the actual dump files, as the compression there takes care of duplicate content anyway, but it would mean less text to process, both in the dump infrastructure and for reusers. Text IDs / blob URLs can be assumed to be deduplicated, but they also cannot be published without checking that they correspond to a (visible) revision. That seems like a hard problem; can we do something about it?


One approach I could image ins:


always output the location attribute (instead of only using it in stub mode and omitting it in full mode)
when generating output for a given page, keep a ledger of the content location already emitted for that page.
when encountering a content location that was already emitted, only emit a sub version of the  tag, instead of the full content.


For a single page, the total number of different locations should be small enough to not cause a problem with the size of the ledger. But if we want to make sure, the ledger can be made "leaky", allowing for false negatives. This way, deduplication may not be perfect, but duplicate output would still be rare, while enforcing  a cap on the ledger size. This approach is used by Wikibase when emitting RDF dumps, it has an implementation of such a "leaky ledger" in HashDedupeBag, using truncated hashes as keys.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread Tgr
Tgr added a comment.
Some issues that we did not have time to fully discuss during the meeting:


sha1 B/C. There are two candidates for the old sha1 field: the sha1 of the main slot and the sha1 of the full revision (which is computed as taking the base36 sha1 of slot 1, concatenating the raw value of slot 2, taking the sha1 of that, concatenating slot 3, taking the sha1 of that etc). When there is just the main slot, the two coincide, so no B/C issue with that. When there are multiple slots, which is the more B/C friendly option? sha1 is mainly used for revert detection, but how would an application that is only aware of the main slot define reverts?
Backfilling the id. Legacy clients expect a numerical text id for the main slot; when we switch to external storage, we'll have an URL instead (which will be exposed as a different attribute). Can we provide some id still so old clients don't break? The actual value of the id is internal detail and not relevant to clients; what's relevant is that it should either be unique or (preferably) hash-like (ie. only main slots with the same text have the same id). In theory we could use some kind of content hash (first N digits of the base10 sha1, for example) but it will not fit into a 32 bit integer (current max text ID is about 900 million; fake IDs should avoid any range that can be, or could be within reasonable time, a text ID; max value for a signed int32 is around 2 billion) - is that likely enough to break clients to make the whole idea moot?
in theory a lot of resources could be saved if identical slot contents are only written out once (they will be a very frequent occurrence due to reverts) - not so much in the actual dump files, as the compression there takes care of duplicate content anyway, but it would mean less text to process, both in the dump infrastructure and for reusers. Text IDs / blob URLs can be assumed to be deduplicated, but they also cannot be published without checking that they correspond to a (visible) revision. That seems like a hard problem; can we do something about it?
TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, TgrCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-03 Thread kchapman
kchapman added a comment.
TechCom hosted an IRC meeting on this today:


minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.html
log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.log.html
TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, kchapmanCc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-01 Thread ArielGlenn
ArielGlenn added a comment.
This will be discussed at the TechCom meeting Wednesday, October 3rd at 2pm PST(21:00 UTC, 23:00 CET). The announcement was sent to Wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090881.html

I have also sent email about it to the xmldatadumps-l mailing list.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-27 Thread daniel
daniel added a comment.
I'm reluctant to change anything about the formatting of the main slot this round. There are tools that convert xml dumps to sql suitable for import, and the text id may be used by some of these as a convenience for constructing the text table entry.

Such tools will have to be rewritten to support the new MCR schema anyway...TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-27 Thread ArielGlenn
ArielGlenn added a comment.

In T199121#4619450, @daniel wrote:



...

Sounds mostly good, except that we have to bump the XML schema when we want to do T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production). If we made id optional & deprecated right away, we wouldn't have to do that. Perhaps this is one of the questions to answer during the RFC discussion.

Let's talk about it there, indeed. I'd prefer to be explicit about when the 'id' attribute is going away, rather than marking it optional indefinitely.

Also, I don't see why location should be optional. Why not just always output it? Afterall, it's what we want people to use in the future, why not make it easy for them to use it now? Though I'm not sure who actually consumes these attributes, and for what purpose (other than generating a full dump from a stub dump). The values are purely internal.

I'm reluctant to change anything about the formatting of the main slot this round. There are tools that convert xml dumps to sql suitable for import, and the text id may be used by some of these as a convenience for constructing the text table entry.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread daniel
daniel added a comment.
I've added the location attr as optional for the main slot, with the intent that it gets used when we have some other schema than 'tt' in play. The next version would have the id attr as optional + deprecated for the main slot, only permitted to be written for blobs with the 'tt' schema, and the location attr as mandatory. How does that sound?

Sounds mostly good, except that we have to bump the XML schema when we want to do T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production). If we made id optional & deprecated right away, we wouldn't have to do that. Perhaps this is one of the questions to answer during the RFC discussion.

Also, I don't see why location should be optional. Why not just always output it? Afterall, it's what we want people to use in the future, why not make it easy for them to use it now? Though I'm not sure who actually consumes these attributes, and for what purpose. The values are purely internal.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread ArielGlenn
ArielGlenn added a comment.
I wasn't happy with the different formats for the same attribute name either, using a different name for the content text id is grand!  Using anyURI makes for a simpler schema too.

I've added the location attr as optional for the main slot, with the intent that it gets used when we have some other schema than 'tt' in play. The next version would have the id attr as optional + deprecated for the main slot, only permitted to be written for blobs with the 'tt' schema, and the location attr as mandatory.  How does that sound?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread daniel
daniel added a comment.
Hi Ariel! Thanks for updating. Sorry for the confusion about the address format, I thought you knew this was already in. This is what the content_address field contains. I tried to explain the idea in August, see T199121#4529353.

Main slot still only gets an id number for text id (this is required for backwards compat)

This would break as soon as we start addressing external store entries directly - at that point, there will be no corresponding rows in the text table any more.

Perhaps this can be solved by using a new "address" attribute for the full address, and provide the old id attribute on the main slot's text tag (or even all text tags) if possible (that is, if the address is a "tt:" address). The id should then be marked deprecated and optional (or should become optional in the next version of the spec - but then we'd have to fail the export when encountering a non-tt address).

Leaving the id attribute as int, and introducing a new address attribute for the full address seems cleaner in any case. Using different formats for the id attribute of the text tag depending on where the text tag is used seems very confusing.

Is that pattern regex in the schema ok?

It's a bit too restrictive. One thing that should definitely work there are addresses pointing to external store, which requires flags to be embedded in the address, resulting in something like this: ex:DB://cluster5/7633454/89?utf8. To provide maximum  flexibility, I'd just go for full URIs per rfc3986.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread ArielGlenn
ArielGlenn added a comment.
I wish I'd had a pointer to the changes in Storage/Blobstore.php and SqlBlobStore.php earlier, didn't realize this new form of text addressing was baked in already. Anyways, I've revised the draft accordingly, sorry for the long delay. Main slot still only gets an id number for text id (this is required for backwards compat); text id in content element gets schema:location format. Is that pattern regex in the schema ok?

Per slot suppression should still be ok even for 'main'; if the main slot is suppressed then we want to output a text element  just as we do now if the entire revision is suppressed.

Hopefully that addresses the last of the concerns. Please let me know if you see anything amiss.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-21 Thread daniel
daniel added a comment.
I just realized that the proposed dump format is still using numeric text IDs. That cannot be guaranteed to work, text blobs are now identified by URL-like blob addresses: "tt:12345" is the address of text row 12345, and we may start using "ext:DB:..." for ExternalStore soon.

So, instead of   we need to use . The numeric form could still be supported for backwards compatibility, with the prefix "tt:" being assumed if none is given.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-19 Thread ArielGlenn
ArielGlenn added a comment.
Sorry, I've been trying to get a couple other things off my plate. Will be coming back to this shortly.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-19 Thread daniel
daniel added a comment.
I'd really like to move this forward. Ideally, we'd get the new dump format into the 1.32 release. @ArielGlenn is there anything holding this back? Can we have an IRC discussion on this soon?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, D3r1ck01, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, Wesalius, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-27 Thread daniel
daniel added a comment.
we'd have to have someplace for third-party installations to store their revision text, and whatever that mechanism is (not necessarily an external store), would be supported along with the external store. We also need to support installations that do not enable MCR. What that all looks like should be dealt with then. Until then, it's ok to continue to use the text ids as is.

Not enabling MCR is not a thing - it's not necessarily use MCR, but 1.32 will use the MCR storage schema, with content addresses instead of text ID, for all storage. The old schema will be supported  for B/C, but I'd want it to be dropped in 1.33. rev_text_id will then no longer exist (and if it exists, it will be 0). The content address is an integral part of the new storage schema as deployed, the live sites will start relying on it soon, though rev_text_id will still be written for B/C until nothing uses it anymore.

So, for the new schema, I'd highly recommend to remove any mention of a numeric text ID, in favor of an opaque content address. See https://gerrit.wikimedia.org/r/c/mediawiki/core/+/443608/14 for the patch that makes the dump infrastructure use addresses instead of text IDs. Review would be appreciated, since that patch blocks the SDC project. (the patch currently fails CI, but that failure is inherited from the parent patch - working on it).TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-27 Thread ArielGlenn
ArielGlenn added a comment.
That was very helpful, thanks.

Okay, here's my take.  At some point in the future (unknown when), we might lose the text table; we'd have to have someplace for third-party installations to store their revision text, and whatever that mechanism is (not necessarily an external store), would be supported along with the external store. We also need to support installations that do not enable MCR. What that all looks like should be dealt with then. Until then, it's ok to continue to use the text ids as is.

Because there's some doubt about which element a given parser may grab if it expects the element to be unique (see T199121#4494050), I'd rather avoid having multiple text elements and use the safer, if less satisfying, mixed format for now, of an unwrapped text element tag formatted the same old way, and new content tags which will be unlooked for and hopefully ignored by existing tools.This also avoids duplicating revision information into the text attributes for the main slot for the sake of backwards compatibility.   I agree that when we come up to another breaking change in the future, we should revisit the mixed format.

It's true that most information we provide about the content element really will be attributes of the content in a given slot and not pieces of data somehow bundled up as part of the content, so it would more properly be represented as attributes rather an as revision child elements. We could move the attributes into the content tag thus:



for first-pass ('stub') dumps, or like this:

stuff goes here...
... 


for second-pass (revision content) dumps.
In the case that the revision has been suppressed/deleted, we could produce the following:



for either pass of the dumps. We would do this for each slot (except 'main', which would be formatted as a plain old boring text element in the way currently done), since suppression/deletion is at the revision level only, not per slot.

What do folks think about the above?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-24 Thread daniel
daniel added a comment.
Question: what is address="tt:12345" in the alternative proposal here https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps ?

That's the replacement for the text id, for use with stub dumps. We now have a blob store that resolves url-style addresses. The idea is the same as for text id, and the "tt" prefix means "text table", so "tt:1234" is the same as text id 1234. We may in the future however have different storage mechanism. The most obvious would be to link to externalstore directly and bypass the text table.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, stebsco, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-23 Thread ArielGlenn
ArielGlenn added a comment.
I've updated the draft RFC to include solt role name instead of slot role id. I'm still thinking about the multiple text element with attributes, though leaning against it.

Question: what is address="tt:12345"  in the alternative proposal here https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps ?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, stebsco, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-14 Thread daniel
daniel added a comment.
question about possible changes to slot role names

Slot role names are public identifiers, they cannot change, ever. Same for model names.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-14 Thread ArielGlenn
ArielGlenn added a comment.
I've updated the draft RFC to remove the 'final' schema, leaving the 'transitional' schema as the new schema proposal; I've munged the 'header changes' section leaving my question about possible changes to slot role names in there for comment. Still thinking about @daniel's proposal (a series of text tags with attributes).TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread ArielGlenn
ArielGlenn added a comment.
Thanks for the input so far!

I'm just as happy not to expose content model ids. Relying on slot role names exclusively while hiding the ids makes me a little nervous, only because I fear those names may be changed in the future from time to time. Dumps processors would see such a change without realizing that the slot id is in fact the same (and so the change is only cosmetic). Cn someone say a little bit more about these names?

My main reason for the two stage format was that I hate the mixed format approach (main text is formatted this way but the rest is formatted differently). If folks think avoiding this not worth the headache of having two format changes, I can drop the final format.

I'm still thinking about whether multiple text tags for the remaining slots, as per @daniel, is desirable or not.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread brion
brion added a comment.
As for role ids -- perhaps we should primarily use the names, not the numbers, in the  bit. It's analogous to a page's  reference (a primary identifier) not to its  or  (which are provided informatively if you want to repro the database exactly, but can be freely discarded when doing imports and such).TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, brionCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread brion
brion added a comment.
Ok, proposed transitional schema looks like it imports cleanly via importDump (which uses same code path as Special:Import). The proposed final schema, however, imports a revision with empty text (and throws a notice on Undefined index: text in /vagrant/mediawiki/includes/import/WikiImporter.php on line 886).

I know it feels less elegant, but I would recommend sticking with the back-compatible 'transitional' schema and not doing a second transition to minimize impact on interop issues.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, brionCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread brion
brion added a comment.
My concern with the two-step transition idea is that some consumers may not update on a reliable schedule, or may not be able to do so easily. For instance, if people are using Special:Export on one wiki and Special:Import'ing those pages on another that's *not* a Wikimedia-hosted site, it's more likely to be an older version of MediaWiki.

Should test Special:Import on the proposed transitional schema to make sure what happens. :)TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, brionCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment.
SAX parsers would typically take the last and get confused

True, that's a risk - they would then be using the wrong slot.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread Tgr
Tgr added a comment.

In T199121#4493904, @daniel wrote:
And I don't see how it's better than the multiple-text-tags-with-attributes approach.


How safe is the assumption that an XML parsing library that expects a tag to be unique will just ignore extra occurrences and return the first tag (as opposed to, say, returning the last one or raising an error)? I imagine DOM parsers would typically take the first and SAX parsers would typically take the last and get confused, and multi-gigabyte documents are more likely to be parsed in a SAX fashion.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, TgrCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment.
CR-aware clients will maybe need very slightly less elegant parsing code, which does not seem like a bad trade-off.

It's an option, but it feels like we'd be shifting the paint to consumers, who have to maintain that duplicate logic forever. And I don't see how it's better than the multiple-text-tags-with-attributes approach.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread Tgr
Tgr added a comment.
Or just drop the idea of a transitional format and keep the main slot one level higher forever (or at least until the next change that really needs to be a BC break). Old clients will not break, MCR-aware clients will maybe need very slightly less elegant parsing code, which does not seem like a bad trade-off.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, TgrCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment.
Introducing a  tag that wraps a  tag and additional tags for meta-data looks nice and clean, but it's also completely incompatible with existing consumers.

Just generating multiple  tags, and addition any new info as attributes on them, would allow existing consumer code that follows the "grab the content of the (first) text tag" to continue functioning.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment.
I'm not sure the transitional format is useful. What client would use it, and what for? A client that wants to process the transitional format would need to implement handling for two different XML structures, and actually use both in parallel. A client that still relies on the old format doesn't benefit from the new tags begin present.

I see two ways for fully backwards compatible transition:


duplicate the contents of the main slot, using the old and the new format.
generate old-style and new-style dumps in parallel.


But both  are probably impractical, because they need too much space.

So, in practice, I suggest to make a hard breaking change  immediately, forcing clients that want to process new dumps to use the new structure. This could be made easier though, see my next comment.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread ArielGlenn
ArielGlenn added a comment.
The draft at https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps is ready for a first round of comments by people on this ticket (or people just following along). Anything from 'that schema is wrong' to 'why did you use that color for diffs' to 'this wording is confusing', whatever you think is useful.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, kostajh, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-08 Thread daniel
daniel added a comment.
@ArielGlenn I'm happy to be a co-author, I just can't invest much time into this right now. Thanks for taking care of this!TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-08 Thread ArielGlenn
ArielGlenn added a comment.
@daniel I am about to steal a bunch of your preliminary work and comments in order to craft a workable proposal (see the link in the task description now); for this reason you are listed as a co-author of the RFC. If you would rather not, please say so and I'll just give a ton of credit in the document itself.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment.
Because MCR content on Commons, and specifically the metadata storage piece, is set to go live on October 1st, and we likely will barely have an RFC out by that time if we are lucky, we will not be giving adoptees much time to convert their existing utilities to use the new schema. So I am initially leaning towards something like this:


Initial schema change plus code just to get it out the door -- something that looks backwards compatible for main slot only, and existing tools may skip the 'extra' fields containing information about content in other slots
In another 4 months, let's say, a second schema change goes out which treats all slots and all content identically, requiring all downstream dumps users to have completed conversion of their tools by that date. This prejudges the final schema here and assumes we want all content and all slots to have the same format in the dumps; if people feel strongly otherwise, I'm open to discussion.


What do people think about an approach like this?TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment.
Making clear here the correspondence between revisions, slots, content, text, and comparing that to the previous setup with just revisions and text.

Until now: page -> one or more revisions, each revision linked to exactly one entry in text table, each entry in text table linked to one piece of content in external storage (multiple pieces of content might be stored together, or they might be stored as diffs against an earlier piece of content but let's skip over that).
One entry in the text table is not generally be used for more than one revision but it does happen (example: newNullRevision created when a page is protected).

MCR: page -> one or more revisions, each revision linked to one or more entries in slot table, each entry in slot table linked to exactly one piece of content in external storage (same caveats as above).
What happens to the text table?  Does it become unused as all content is migrated into the MCR tables?
One entry in the content table can be used for more than one slot. This may be common, as content in a specific slot may not be changed by edit of a revision. An entry in the slot table can be linked to only one revision.

Please comment if this does not match your understanding of plain or MCR implementations.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment.
Some initial comments/questions on slot-roles, content_models tables:

We could dump the slot_roles and content_models tables separately if we wanted to; the mechanism for adding those is easy enough. But we might also/instead want to add them to the header which appears both in current revision content and in stubs metadata dumps. This header currently contains the mapping between namespace ids and names, and it might not be a bad idea for someone to have these model and role mappings too. But maybe they aren't needed, since they never appear in any content directly, as opposed to namespace names, which appear in page title links and so on.

Is it likely that:


the names may vary in their meaning from one wiki to another (content model A means one thing on commons and another on wikidata)?
the names may change over time on one wiki (content role id nnn was called A and now is called B)?
TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment.
I'm adding here the tables and fields that need to be part of the dumps, both for export and for import, so everyone is on the same page.

slots:


slot_revision_id
slot_role_id
slot_content_id
slot_origin


content:


content_id
content_size
content_sha1
content_model  (id, not name)
content_address


slot_roles:


role_id
role_name


content_models;


model_id
model_name


As far as I know these are all the additional fields and tables. @daniel please correct me if I'm wrong.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-02 Thread daniel
daniel added a comment.

In T199121#4437696, @ArielGlenn wrote:
This is surely the wrong place to drop a link, but it can always be moved. Have some thoughts about MCR, two-pass dumps, and the XML schema: https://www.mediawiki.org/wiki/User:ArielGlenn/MCR_and_dumps


It's totally the right place :) See me response at https://www.mediawiki.org/wiki/User_talk:ArielGlenn/MCR_and_dumpsTASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlenn, danielCc: Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-02 Thread ArielGlenn
ArielGlenn added a comment.
Notes on timing: it i expected that commons MCR writes (metadata for media) will be happening by Oct 1, so it would be really really nice to have an rfc approved and code written by then. That's a pretty short time frame given that it's summer vacation right now, but otherwise data about some media won't show up n the dujmps.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-02 Thread ArielGlenn
ArielGlenn added a comment.
Existing proposals (which were the occasion for my comments at the link above): https://www.mediawiki.org/wiki/Multi-Content_Revisions/DumpsTASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, gnosygnu, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-07-19 Thread ArielGlenn
ArielGlenn added a comment.
This is surely the wrong place to drop a link, but it can always be moved. Have some thoughts about MCR, two-pass dumps, and the XML schema: https://www.mediawiki.org/wiki/User:ArielGlenn/MCR_and_dumpsTASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: ArielGlennCc: Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-07-09 Thread daniel
daniel added a comment.
Note: Not part of the MVP for SDC.TASK DETAILhttps://phabricator.wikimedia.org/T199121EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: danielCc: Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF, Cparle, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, JJMC89, Agabi10, Susannaanas, SBisson, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, GWicke, jayvdb, Ricordisamoa, fbstj, Lydia_Pintscher, Fabrice_Florin, Raymond, santhosh, Jdforrester-WMF, Steinsplitter, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs