Re: [Wikidata-l] Broken JSON in XML dumps
On 27.02.2015 at 12:33, Dimitris Kontokostas wrote:
> Standard XML MW format has existed for a long time and is supported by
> existing software.
> IMHO both XML and JSON dumps should be treated with the same priority

They should, in fact, be using the same code.

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
___
Wikidata-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Broken JSON in XML dumps
On 27.02.2015 at 15:33, Jan Zerebecki wrote:
> On 2015-02-27 09:11, Markus Kroetzsch wrote:
>> Since the JSON dumps and EntityData exports are (largely) free of
>> errors, there is already code for fixing this problem. Maybe we could
>> just use this.
>
> Tracked in: https://phabricator.wikimedia.org/T64188 Replace old
> serialization code in lib with datamodel serialization

Actually... the XML dump should already be using the new code, the same code in fact that generates the JSON files. The problem is that old revisions get taken directly from old dumps and do not get re-serialized. I thought we had worked around this, but replicating the elaborate setup WMF uses for generating dumps is hard, so testing is a pain...

I think we need a new ticket, or to re-open an old ticket, for this.

> Also: https://phabricator.wikimedia.org/T73349 Fix empty map
> serialization behaviour

I thought we fixed this? Not everywhere, I assume.

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Re: [Wikidata-l] Broken JSON in XML dumps
On 27.02.2015 at 08:01, Lukas Benedix wrote:
> AFAIK there is no php involved in the dump process (python?)

I thought so too for a while, but it's all PHP as far as I can tell, using the standard WikiExporter class (though in very strange and wonderful ways, have a look at dumpTextPass).

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Re: [Wikidata-l] Broken JSON in XML dumps
On 27.02.2015 17:47, Lydia Pintscher wrote:
> On Thu, Feb 26, 2015 at 2:52 PM, Markus Kroetzsch
> wrote:
>> Hi,
>>
>> It's that time of the year again when I am sending a reminder that we still
>> have broken JSON in the dump files ;-). As usual, the problem is that empty
>> maps {} are serialized wrongly as empty lists []. I am not sure if there is
>> any open bug that tracks this, so I am sending an email. There was one, but
>> it was closed [1].
>>
>> As you know (I had sent an email a while ago), there are some remaining
>> problems of this kind in the JSON dump, and also in the live exported JSON,
>> e.g.,
>>
>> https://www.wikidata.org/wiki/Special:EntityData/Q4383128.json
>> (uses [] as a value for snaks: this item has a reference with an empty list
>> of snaks, which is an error by itself)
>>
>> However, the situation is considerably worse in the XML dumps, which have
>> seen less usage since we have JSON, but as it turns out are still preferred
>> by some users. Surprisingly (to me), the JSON content in the XML dumps is
>> still not the same as in the JSON dumps. A large part of the records in the
>> XML dump is broken because of the map-vs-list issue.
>>
>> For example, the latest dump of current revisions [2] has countless
>> instances of the problem. The first is in the item Q3261 (empty list for
>> claims), but you can easily find more by grepping for things like
>>
>> "claims":[]
>>
>> It seems that all empty maps are serialized wrongly in this dump (aliases,
>> descriptions, claims, ...). In contrast, the site's export simply omits the
>> key of empty maps entirely, see
>>
>> https://www.wikidata.org/wiki/Special:EntityData/Q3261.json
>>
>> The JSON in the JSON dumps is the same.
>>
>> Cheers,
>>
>> Markus
>>
>> [1] https://github.com/wmde/WikibaseDataModelSerialization/issues/77
>> [2]
>> http://dumps.wikimedia.org/wikidatawiki/20150207/wikidatawiki-20150207-pages-meta-current.xml.bz2
>
> Sorry Markus. This was still on my agenda but I've been pushing this
> off for too long. I'll bring it up in our planning meeting next
> Wednesday. If you could open a ticket for it on Phabricator that'd be
> awesome.

Done: https://phabricator.wikimedia.org/T91117
Markus

> As for general issues with dumps not being generated and so on:
> Unfortunately the whole Wikimedia dumps infrastructure has a bus
> factor of 1 and this became an issue over the last months.
> Improvements for the whole Wikimedia dumps infrastructure are being
> tracked at https://phabricator.wikimedia.org/T88991 and the Wikidata
> specific improvements are tracked at
> https://phabricator.wikimedia.org/T88728 If you have issues that are
> not there yet please do file them.
>
> Cheers
> Lydia
Re: [Wikidata-l] Broken JSON in XML dumps
On 2015-02-27 09:11, Markus Kroetzsch wrote:
> Since the JSON dumps and EntityData exports are (largely) free of
> errors, there is already code for fixing this problem. Maybe we could
> just use this.

Tracked in: https://phabricator.wikimedia.org/T64188 Replace old
serialization code in lib with datamodel serialization

Also: https://phabricator.wikimedia.org/T73349 Fix empty map
serialization behaviour

--
Regards,
Jan Zerebecki
Re: [Wikidata-l] Broken JSON in XML dumps
On Thu, Feb 26, 2015 at 2:52 PM, Markus Kroetzsch
wrote:
> Hi,
>
> It's that time of the year again when I am sending a reminder that we still
> have broken JSON in the dump files ;-). As usual, the problem is that empty
> maps {} are serialized wrongly as empty lists []. I am not sure if there is
> any open bug that tracks this, so I am sending an email. There was one, but
> it was closed [1].
>
> As you know (I had sent an email a while ago), there are some remaining
> problems of this kind in the JSON dump, and also in the live exported JSON,
> e.g.,
>
> https://www.wikidata.org/wiki/Special:EntityData/Q4383128.json
> (uses [] as a value for snaks: this item has a reference with an empty list
> of snaks, which is an error by itself)
>
> However, the situation is considerably worse in the XML dumps, which have
> seen less usage since we have JSON, but as it turns out are still preferred
> by some users. Surprisingly (to me), the JSON content in the XML dumps is
> still not the same as in the JSON dumps. A large part of the records in the
> XML dump is broken because of the map-vs-list issue.
>
> For example, the latest dump of current revisions [2] has countless
> instances of the problem. The first is in the item Q3261 (empty list for
> claims), but you can easily find more by grepping for things like
>
> "claims":[]
>
> It seems that all empty maps are serialized wrongly in this dump (aliases,
> descriptions, claims, ...). In contrast, the site's export simply omits the
> key of empty maps entirely, see
>
> https://www.wikidata.org/wiki/Special:EntityData/Q3261.json
>
> The JSON in the JSON dumps is the same.
>
> Cheers,
>
> Markus
>
>
> [1] https://github.com/wmde/WikibaseDataModelSerialization/issues/77
> [2]
> http://dumps.wikimedia.org/wikidatawiki/20150207/wikidatawiki-20150207-pages-meta-current.xml.bz2
Sorry Markus. This was still on my agenda but I've been pushing this
off for too long. I'll bring it up in our planning meeting next
Wednesday. If you could open a ticket for it on Phabricator that'd be
awesome.
As for general issues with dumps not being generated and so on:
Unfortunately the whole Wikimedia dumps infrastructure has a bus
factor of 1 and this became an issue over the last months.
Improvements for the whole Wikimedia dumps infrastructure are being
tracked at https://phabricator.wikimedia.org/T88991 and the Wikidata
specific improvements are tracked at
https://phabricator.wikimedia.org/T88728 If you have issues that are
not there yet please do file them.
Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as a charitable
organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
Re: [Wikidata-l] Broken JSON in XML dumps
Standard XML MW format has existed for a long time and is supported by existing software.
IMHO both XML and JSON dumps should be treated with the same priority.

Best,
Dimitris

On Fri, Feb 27, 2015 at 10:19 AM, Markus Kroetzsch wrote:
> On 26.02.2015 21:40, Martynas Jusevičius wrote:
>> Looks like someone hasn't learned the lesson:
>> https://www.mail-archive.com/[email protected]/msg02588.html
>
> No, this post is unrelated. The cause of the problem was correctly
> analysed by Stas.
>
> Markus
>
>> On Thu, Feb 26, 2015 at 9:27 PM, Lukas Benedix wrote:
>>> I second this!
>>>
>>> btw: what is the status of the problem with the missing dumps with
>>> history? (latest available from November 2014)
>>>
>>> Lukas
>>>
>>> On Thu 26.02.2015 at 14:52, Markus Kroetzsch wrote:
>>>> Hi,
>>>>
>>>> It's that time of the year again when I am sending a reminder that we
>>>> still have broken JSON in the dump files ;-). As usual, the problem is
>>>> that empty maps {} are serialized wrongly as empty lists []. I am not
>>>> sure if there is any open bug that tracks this, so I am sending an
>>>> email. There was one, but it was closed [1].
>>>>
>>>> As you know (I had sent an email a while ago), there are some remaining
>>>> problems of this kind in the JSON dump, and also in the live exported
>>>> JSON, e.g.,
>>>>
>>>> https://www.wikidata.org/wiki/Special:EntityData/Q4383128.json
>>>> (uses [] as a value for snaks: this item has a reference with an empty
>>>> list of snaks, which is an error by itself)
>>>>
>>>> However, the situation is considerably worse in the XML dumps, which
>>>> have seen less usage since we have JSON, but as it turns out are still
>>>> preferred by some users. Surprisingly (to me), the JSON content in the
>>>> XML dumps is still not the same as in the JSON dumps. A large part of
>>>> the records in the XML dump is broken because of the map-vs-list issue.
>>>>
>>>> For example, the latest dump of current revisions [2] has countless
>>>> instances of the problem. The first is in the item Q3261 (empty list for
>>>> claims), but you can easily find more by grepping for things like
>>>>
>>>> "claims":[]
>>>>
>>>> It seems that all empty maps are serialized wrongly in this dump
>>>> (aliases, descriptions, claims, ...). In contrast, the site's export
>>>> simply omits the key of empty maps entirely, see
>>>>
>>>> https://www.wikidata.org/wiki/Special:EntityData/Q3261.json
>>>>
>>>> The JSON in the JSON dumps is the same.
>>>>
>>>> Cheers,
>>>>
>>>> Markus
>>>>
>>>> [1] https://github.com/wmde/WikibaseDataModelSerialization/issues/77
>>>> [2]
>>>> http://dumps.wikimedia.org/wikidatawiki/20150207/wikidatawiki-20150207-pages-meta-current.xml.bz2

--
Kontokostas Dimitris
Re: [Wikidata-l] Broken JSON in XML dumps
On 26.02.2015 21:40, Martynas Jusevičius wrote:
> Looks like someone hasn't learned the lesson:
> https://www.mail-archive.com/[email protected]/msg02588.html

No, this post is unrelated. The cause of the problem was correctly analysed by Stas.

Markus

> On Thu, Feb 26, 2015 at 9:27 PM, Lukas Benedix wrote:
>> I second this!
>>
>> btw: what is the status of the problem with the missing dumps with
>> history? (latest available from November 2014)
>>
>> Lukas
>>
>> On Thu 26.02.2015 at 14:52, Markus Kroetzsch wrote:
>>> Hi,
>>>
>>> It's that time of the year again when I am sending a reminder that we
>>> still have broken JSON in the dump files ;-). As usual, the problem is
>>> that empty maps {} are serialized wrongly as empty lists []. I am not
>>> sure if there is any open bug that tracks this, so I am sending an
>>> email. There was one, but it was closed [1].
>>>
>>> As you know (I had sent an email a while ago), there are some remaining
>>> problems of this kind in the JSON dump, and also in the live exported
>>> JSON, e.g.,
>>>
>>> https://www.wikidata.org/wiki/Special:EntityData/Q4383128.json
>>> (uses [] as a value for snaks: this item has a reference with an empty
>>> list of snaks, which is an error by itself)
>>>
>>> However, the situation is considerably worse in the XML dumps, which
>>> have seen less usage since we have JSON, but as it turns out are still
>>> preferred by some users. Surprisingly (to me), the JSON content in the
>>> XML dumps is still not the same as in the JSON dumps. A large part of
>>> the records in the XML dump is broken because of the map-vs-list issue.
>>>
>>> For example, the latest dump of current revisions [2] has countless
>>> instances of the problem. The first is in the item Q3261 (empty list for
>>> claims), but you can easily find more by grepping for things like
>>>
>>> "claims":[]
>>>
>>> It seems that all empty maps are serialized wrongly in this dump
>>> (aliases, descriptions, claims, ...). In contrast, the site's export
>>> simply omits the key of empty maps entirely, see
>>>
>>> https://www.wikidata.org/wiki/Special:EntityData/Q3261.json
>>>
>>> The JSON in the JSON dumps is the same.
>>>
>>> Cheers,
>>>
>>> Markus
>>>
>>> [1] https://github.com/wmde/WikibaseDataModelSerialization/issues/77
>>> [2]
>>> http://dumps.wikimedia.org/wikidatawiki/20150207/wikidatawiki-20150207-pages-meta-current.xml.bz2
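
Until the dumps agree, a consumer has to cope with three encodings of the same empty map: `{}`, the erroneous `[]`, and a missing key (what the site's export does). A minimal Python sketch of a tolerant accessor follows. The helper name `get_map` is hypothetical and this is not part of any Wikidata tooling; the sample documents are hand-written stand-ins, not real dump records.

```python
import json


def get_map(entity, key):
    """Return the map stored under key, treating a missing key and [] as {}."""
    value = entity.get(key)
    if value is None or value == []:
        return {}
    if not isinstance(value, dict):
        # Anything else (e.g. a non-empty list) is a different kind of error.
        raise TypeError(f"{key!r} should be a map, got {type(value).__name__}")
    return value


# Hand-written stand-ins for the two encodings described in the thread.
from_xml_dump = json.loads('{"id":"Q3261","claims":[]}')
from_site_export = json.loads('{"id":"Q3261"}')

print(get_map(from_xml_dump, "claims"))     # {}
print(get_map(from_site_export, "claims"))  # {}
```

Raising on a non-empty list keeps the workaround narrow: only the known empty-map case is papered over.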
Re: [Wikidata-l] Broken JSON in XML dumps
Hi Stas,
Since the JSON dumps and EntityData exports are (largely) free of
errors, there is already code for fixing this problem. Maybe we could
just use this.
Cheers,
Markus
On 27.02.2015 01:06, Stas Malyshev wrote:
> Hi!
>
>> It's that time of the year again when I am sending a reminder that we
>> still have broken JSON in the dump files ;-). As usual, the problem is
>> that empty maps {} are serialized wrongly as empty lists []. I am not
>
> This seems to be a consequence of using json_encode(), which does
> serialize empty arrays as [], unless given the JSON_FORCE_OBJECT option.
> Unfortunately, this option would make all lists into objects (maps) so
> we can't just use it directly. So probably the best way would be to just
> drop the empty property? Unless it'd break something else.
> Another trick would be to put there "new stdClass" instead of empty
> array - that would encode to {}.
Re: [Wikidata-l] Broken JSON in XML dumps
AFAIK there is no php involved in the dump process (python?)
There was a mail that announced a switch to the new serialisation format
in July 2014
[https://lists.wikimedia.org/pipermail/wikidata-l/2014-July/004225.html]
And some other mails addressing the JSON problem in Sep. 2014
[https://lists.wikimedia.org/pipermail/wikidata-tech/2014-September/000586.html]
[https://lists.wikimedia.org/pipermail/wikidata-tech/2014-September/000576.html]
Is there any specific point where a volunteer can help fix
this issue? (I would love to see a consistent dump with history…)
Lukas
On Fri 27.02.2015 at 01:06, Stas Malyshev wrote:
> Hi!
>
>> It's that time of the year again when I am sending a reminder that we
>> still have broken JSON in the dump files ;-). As usual, the problem is
>> that empty maps {} are serialized wrongly as empty lists []. I am not
>
> This seems to be a consequence of using json_encode(), which does
> serialize empty arrays as [], unless given the JSON_FORCE_OBJECT option.
> Unfortunately, this option would make all lists into objects (maps) so
> we can't just use it directly. So probably the best way would be to just
> drop the empty property? Unless it'd break something else.
> Another trick would be to put there "new stdClass" instead of empty
> array - that would encode to {}.
>
Re: [Wikidata-l] Broken JSON in XML dumps
Hi!
> It's that time of the year again when I am sending a reminder that we
> still have broken JSON in the dump files ;-). As usual, the problem is
> that empty maps {} are serialized wrongly as empty lists []. I am not
This seems to be a consequence of using json_encode(), which does
serialize empty arrays as [], unless given the JSON_FORCE_OBJECT option.
Unfortunately, this option would make all lists into objects (maps) so
we can't just use it directly. So probably the best way would be to just
drop the empty property? Unless it'd break something else.
Another trick would be to put there "new stdClass" instead of empty
array - that would encode to {}.
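
The two workarounds above can be sketched outside PHP. The Python below only illustrates the idea and is not the Wikibase serialization code. It assumes an entity arrives as a plain dictionary in which a map-valued field may wrongly hold an empty list. `MAP_FIELDS` lists only the keys named in this thread, and both function names are hypothetical.

```python
import json

# Fields that should be maps. Only the keys named in this thread are
# listed here; the complete set is an assumption left to the reader.
MAP_FIELDS = ("aliases", "descriptions", "claims")


def drop_empty_maps(entity):
    """Option 1: omit map-valued keys whose value is empty."""
    return {
        key: value
        for key, value in entity.items()
        if not (key in MAP_FIELDS and value in ([], {}))
    }


def force_empty_maps(entity):
    """Option 2: replace an empty list in a map-valued field with {}."""
    fixed = dict(entity)
    for key in MAP_FIELDS:
        if key in fixed and fixed[key] == []:
            fixed[key] = {}
    return fixed


# Hand-written example record, not taken from a real dump.
broken = {"id": "Q3261", "claims": [], "aliases": []}
print(json.dumps(drop_empty_maps(broken)))   # {"id": "Q3261"}
print(json.dumps(force_empty_maps(broken)))  # {"id": "Q3261", "claims": {}, "aliases": {}}
```

Both functions touch only the named fields, so real lists elsewhere in the record stay lists, which is the property JSON_FORCE_OBJECT lacks.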
--
Stas Malyshev
[email protected]
Re: [Wikidata-l] Broken JSON in XML dumps
Looks like someone hasn't learned the lesson:
https://www.mail-archive.com/[email protected]/msg02588.html

On Thu, Feb 26, 2015 at 9:27 PM, Lukas Benedix wrote:
> I second this!
>
> btw: what is the status of the problem with the missing dumps with
> history? (latest available from November 2014)
>
> Lukas
>
> On Thu 26.02.2015 at 14:52, Markus Kroetzsch wrote:
>> Hi,
>>
>> It's that time of the year again when I am sending a reminder that we
>> still have broken JSON in the dump files ;-). As usual, the problem is
>> that empty maps {} are serialized wrongly as empty lists []. I am not
>> sure if there is any open bug that tracks this, so I am sending an
>> email. There was one, but it was closed [1].
>>
>> As you know (I had sent an email a while ago), there are some remaining
>> problems of this kind in the JSON dump, and also in the live exported
>> JSON, e.g.,
>>
>> https://www.wikidata.org/wiki/Special:EntityData/Q4383128.json
>> (uses [] as a value for snaks: this item has a reference with an empty
>> list of snaks, which is an error by itself)
>>
>> However, the situation is considerably worse in the XML dumps, which
>> have seen less usage since we have JSON, but as it turns out are still
>> preferred by some users. Surprisingly (to me), the JSON content in the
>> XML dumps is still not the same as in the JSON dumps. A large part of
>> the records in the XML dump is broken because of the map-vs-list issue.
>>
>> For example, the latest dump of current revisions [2] has countless
>> instances of the problem. The first is in the item Q3261 (empty list for
>> claims), but you can easily find more by grepping for things like
>>
>> "claims":[]
>>
>> It seems that all empty maps are serialized wrongly in this dump
>> (aliases, descriptions, claims, ...). In contrast, the site's export
>> simply omits the key of empty maps entirely, see
>>
>> https://www.wikidata.org/wiki/Special:EntityData/Q3261.json
>>
>> The JSON in the JSON dumps is the same.
>>
>> Cheers,
>>
>> Markus
>>
>> [1] https://github.com/wmde/WikibaseDataModelSerialization/issues/77
>> [2]
>> http://dumps.wikimedia.org/wikidatawiki/20150207/wikidatawiki-20150207-pages-meta-current.xml.bz2
Re: [Wikidata-l] Broken JSON in XML dumps
I second this!
btw: what is the status of the problem with the missing dumps with
history? (latest available from November 2014)
Lukas
On Thu 26.02.2015 at 14:52, Markus Kroetzsch wrote:
> Hi,
>
> It's that time of the year again when I am sending a reminder that we
> still have broken JSON in the dump files ;-). As usual, the problem is
> that empty maps {} are serialized wrongly as empty lists []. I am not
> sure if there is any open bug that tracks this, so I am sending an
> email. There was one, but it was closed [1].
>
> As you know (I had sent an email a while ago), there are some remaining
> problems of this kind in the JSON dump, and also in the live exported
> JSON, e.g.,
>
> https://www.wikidata.org/wiki/Special:EntityData/Q4383128.json
> (uses [] as a value for snaks: this item has a reference with an empty
> list of snaks, which is an error by itself)
>
> However, the situation is considerably worse in the XML dumps, which
> have seen less usage since we have JSON, but as it turns out are still
> preferred by some users. Surprisingly (to me), the JSON content in the
> XML dumps is still not the same as in the JSON dumps. A large part of
> the records in the XML dump is broken because of the map-vs-list issue.
>
> For example, the latest dump of current revisions [2] has countless
> instances of the problem. The first is in the item Q3261 (empty list for
> claims), but you can easily find more by grepping for things like
>
> "claims":[]
>
> It seems that all empty maps are serialized wrongly in this dump
> (aliases, descriptions, claims, ...). In contrast, the site's export
> simply omits the key of empty maps entirely, see
>
> https://www.wikidata.org/wiki/Special:EntityData/Q3261.json
>
> The JSON in the JSON dumps is the same.
>
> Cheers,
>
> Markus
>
>
> [1] https://github.com/wmde/WikibaseDataModelSerialization/issues/77
> [2]
> http://dumps.wikimedia.org/wikidatawiki/20150207/wikidatawiki-20150207-pages-meta-current.xml.bz2
>
>
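
The grep suggested in the quoted message can be turned into a rough count of affected fields without unpacking the dump. The Python below is a sketch under stated assumptions, not an official tool. It assumes the JSON inside the XML dump may be entity-escaped (`&quot;`), so each line is unescaped before matching. It looks only for the three keys named in this thread, and the function names are hypothetical.

```python
import bz2
import html
import re

# Matches a map-valued key followed by an empty list, e.g. "claims":[]
# The key names are the ones mentioned in this thread.
EMPTY_LIST = re.compile(r'"(aliases|descriptions|claims)":\s*\[\]')


def count_empty_lists(lines):
    """Count, per key, how often an empty list appears where a map is expected."""
    counts = {}
    for line in lines:
        # Assumption: JSON embedded in XML may be entity-escaped (&quot;).
        for match in EMPTY_LIST.finditer(html.unescape(line)):
            key = match.group(1)
            counts[key] = counts.get(key, 0) + 1
    return counts


def scan_dump(path):
    """Stream a .xml.bz2 dump line by line without unpacking it to disk."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        return count_empty_lists(dump)


# Hand-written sample lines, not copied from a real dump.
sample = [
    '<text>{&quot;id&quot;:&quot;Q3261&quot;,&quot;claims&quot;:[]}</text>',
    '<text>{"id":"Q1","claims":{"P31":[]},"aliases":[]}</text>',
]
print(count_empty_lists(sample))  # {'claims': 1, 'aliases': 1}
```

Calling `scan_dump` on the file referenced as [2] would stream the whole dump. A regex count like this is only a first diagnostic, since it cannot tell which entity a match belongs to.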
