[Wikidata-bugs] [Maniphest] [Commented On] T74348: Wikidata dumps contain old-style serialization.
ArielGlenn added a comment. Great news! TASK DETAIL https://phabricator.wikimedia.org/T74348 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: hoo, ArielGlenn Cc: JanZerebecki, Jimkont, Wikidata-bugs, Tobi_WMDE_SW, jayvdb, Svick, ArielGlenn, Ricordisamoa, mark, Lydia_Pintscher, jeremyb-phone, daniel, Manybubbles, hoo, RobH, aude, faidon, fgiunchedi, Dzahn, jeremyb, chasemp ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
daniel added a comment. The double-check didn't turn anything up either. The dump seems to be clean.
Jimkont added a comment. other examples of old serializations can be found here: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/json/JsonWikiParser.scala#L62-L67
daniel added a comment. @Jimkont: broken serialization of empty lists is a separate issue, unrelated to unconverted old-style serializations.
daniel added a comment. I'm now running the following on tool labs to find old serializations:
daniel@tools-bastion-01:/public/dumps/public/wikidatawiki/20150330$ bzgrep ',&quot;entity&quot;:&quot;[qQpP][0-9]*&quot;\}' wikidatawiki-20150330-pages-meta-history.xml.bz2 | tee ~/wikidatawiki-20150330-pages-meta-history.bad-serialization.txt
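For readers without `bzgrep` at hand, the same scan can be sketched in Python. This is only an illustration, assuming the dump stores JSON inside the XML with quotes escaped as `&quot;`; the function name `find_suspect_lines` is hypothetical and not part of any dump tooling.

```python
import bz2
import re
from typing import Iterator, Tuple

# Same idea as the bzgrep pattern in the comment above: an "entity" key as the
# last member of a JSON object, with quotes XML-escaped as &quot;.
OLD_STYLE_TAIL = re.compile(r',&quot;entity&quot;:&quot;[qQpP][0-9]*&quot;\}')


def find_suspect_lines(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line of a .bz2 dump that matches."""
    # Stream the file so the full history dump never has to fit in memory.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as dump:
        for number, line in enumerate(dump, start=1):
            if OLD_STYLE_TAIL.search(line):
                yield number, line.rstrip("\n")
```

Because the pattern requires a comma before the `entity` key and a closing brace right after its value, a redirect of the form `{&quot;entity&quot;:...,&quot;redirect&quot;:...}` is not reported.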
daniel added a comment. @JanZerebecki: Redirects are serialized like this:
{"entity":"Q23","redirect":"Q42"}
Old style serialization ends with this:
,"entity":"q207"}
So, if you grep for `,&quot;entity&quot;}`, you should find only old style serializations. Also, old style serialization will contain `&quot;label&quot;:{`, while new style should contain `&quot;labels&quot;:{` (using label//s//, plural).
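The rules of thumb in this comment can be turned into a small classifier. The sketch below is an assumption-laden illustration, not Wikibase code: it takes one revision text as it appears in the XML dump (quotes escaped as `&quot;`), and the function name and the returned labels are made up for this example.

```python
import html
import json


def classify_serialization(escaped_text: str) -> str:
    """Return 'redirect', 'old', 'new' or 'unknown' for one revision text."""
    try:
        # Undo the XML escaping (&quot; becomes ") before parsing the JSON.
        data = json.loads(html.unescape(escaped_text))
    except ValueError:
        return "unknown"
    if not isinstance(data, dict):
        return "unknown"
    # Redirects carry an "entity" key too, so they must be checked first.
    if "redirect" in data:
        return "redirect"
    # Old style: singular "label" and a trailing "entity" key.
    if "label" in data or "entity" in data:
        return "old"
    # New style: plural "labels".
    if "labels" in data:
        return "new"
    return "unknown"
```

Checking keys on the parsed object rather than grepping for `&quot;label&quot;:{` avoids false hits on discussion pages that merely quote such a string.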
daniel added a comment. Btw, if someone can tell me where to find a full history dump of wikidata, I'd be happy to check this myself. The annoying part here is to download and store the behemoth...
hoo added a comment. @Daniel: Could you have a quick look at this? Looks fixed to me, but I think you're the only one who can tell for sure.
daniel added a comment. For redirects, the encoding {&quot;entity&quot;} is correct. There is no old encoding for redirects; entity redirects didn't exist when we used the old serialization format. So, searching for &quot;entity&quot; is not a good indicator for detecting old-style serialization.
ArielGlenn added a comment. Is anyone looking at the redirects serialization?
ArielGlenn added a comment. OK, I no longer feel as stupid. The number of items with the 'entity' format is small in comparison to the total number of entities; we would expect the opposite if old revisions were being kept as is. And as I said, I had checked with local testing that the export transform is indeed being called and changing the content. So I had a look at the problematic entries. It turns out that all but 27 are of the form
<text xml:space="preserve">{&quot;entity&quot;:&quot;Q547932&quot;,&quot;redirect&quot;:&quot;Q6150957&quot;}</text>
so I guess serializing of redirects needs work. I checked that newly added redirects are dumped with this format. The few remaining matches are likely discussions that happen to include the string; I spot checked some and found that to be the case.
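The breakdown described here (redirects versus everything else among the `&quot;entity&quot;` matches) could be reproduced roughly as follows. This is a sketch under the assumption that each match sits on a single `<text ...>...</text>` line, as in the example in the comment; `tally_matches` and the two bucket names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# A revision whose whole text is an entity redirect, as it appears in the XML.
REDIRECT_TEXT = re.compile(
    r'<text[^>]*>\{&quot;entity&quot;:&quot;[QP][0-9]+&quot;,'
    r'&quot;redirect&quot;:&quot;[QP][0-9]+&quot;\}</text>'
)


def tally_matches(lines: Iterable[str]) -> Counter:
    """Count lines mentioning &quot;entity&quot;, split into redirects and the rest."""
    counts: Counter = Counter()
    for line in lines:
        if "&quot;entity&quot;" not in line:
            continue  # not a match at all, so grep would not have reported it
        kind = "redirect" if REDIRECT_TEXT.search(line) else "other"
        counts[kind] += 1
    return counts
```

Whatever lands in the `other` bucket still needs a manual look, since it may be a discussion page quoting the string rather than an unconverted revision.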
ArielGlenn added a comment. Um, "with this format" means new redirects are dumped with {&quot;entity&quot; ... etc.
hoo added a comment.
In https://phabricator.wikimedia.org/T74348#1069658, @ArielGlenn wrote:
> right. this is what you want; the old style 'entity' is gone, the new style 'descriptions' is present. or am I missing something?
To me it seems like the old style entity is still present.
ArielGlenn added a comment. right. this is what you want; the old style 'entity' is gone, the new style 'descriptions' is present. or am I missing something?
hoo added a comment.
In https://phabricator.wikimedia.org/T74348#1059331, @ArielGlenn wrote:
> Hello? Any wikidata dumps consumers on this ticket? Otherwise I'll ask in xmldatadumps-l.
In https://phabricator.wikimedia.org/T74348#768660, @daniel wrote:
> Bumping to critical, since it may result in data loss for clients that cannot process the old style format. We really do not want them to implement that, we changed for a reason...
> Btw: In order to check for old style serializations, grep for &quot;entity&quot;. To detect new style serialization, check for &quot;descriptions&quot; (plural).
hoo@tools-dev:~$ grep -c '&quot;entity&quot;' wikidatawiki-20150207-pages-articles.xml
129630
:(
Lydia_Pintscher added a comment. @hoo: could you have a look?
hoo added a comment.
In https://phabricator.wikimedia.org/T74348#1062351, @Lydia_Pintscher wrote:
> @hoo: could you have a look?
Just kicked off the download of a dump, I'll verify some old revisions once that's done (later today).
ArielGlenn added a comment. Hello? Any wikidata dumps consumers on this ticket? Otherwise I'll ask in xmldatadumps-l.
ArielGlenn added a comment. I ran a series of tests locally and also checked production output. I can verify that the transform is actually applied; the output looks good to me for prefetch or from the database, but a consumer of the data should probably look at it for 5 seconds to verify that the output format is the way you want it.
ArielGlenn added a comment. Thanks for the patch! I will check it out in the next couple of days. I'm really sorry for the long delay; I've been out for medical reasons and am now trying to get caught up on everything.
hoo added a comment.
In T74348#787697, @Lydia_Pintscher wrote:
> Can I please have a status update on this? Do we know why it is happening?
As far as I know the problem is that during dump creation content from the last dump is being scraped in case nothing changed. That's probably fine for wikitext, but of course that bypasses our on-the-fly serialization change.
ArielGlenn added a comment. Old revisions are indeed read from the old dump, as long as the length of the revision text is correct. And indeed this is a necessity; the db servers cannot handle requests for all revisions anew, and even if they could, the dumps would take many times longer to generate as well. The only thing that can be done is a manual run of the specific pass without prefetch, which will take... as long as it takes. I need to check with Sean (DBA) about it before doing so.
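The prefetch behaviour described in this comment can be pictured with a toy model. Everything below is hypothetical (the names, the in-memory mapping standing in for the previous dump, and the use of character length as the check); it is not the actual dump code, only a sketch of why text reused from the old dump skips any conversion done at database read time unless a transform is also applied on the prefetch path.

```python
from typing import Callable, Mapping, Optional, Tuple


def revision_text(
    rev_id: int,
    expected_length: int,
    previous_dump: Mapping[int, str],
    fetch_from_db: Callable[[int], str],
    export_transform: Optional[Callable[[str], str]] = None,
) -> Tuple[str, str]:
    """Return (text, source) for one revision, preferring the previous dump."""
    cached = previous_dump.get(rev_id)
    if cached is not None and len(cached) == expected_length:
        # Prefetch: reuse the old dump's text and spare the database servers.
        text, source = cached, "prefetch"
    else:
        # Missing or length mismatch: fall back to a (costly) database read.
        text, source = fetch_from_db(rev_id), "database"
    if export_transform is not None:
        # Applying the transform here covers both paths, prefetch included.
        text = export_transform(text)
    return text, source
```

With `export_transform` left as `None`, a revision served from `previous_dump` comes back exactly as it was written in the earlier dump, which is the situation hoo describes in the comment above.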