[Wikidata-bugs] [Maniphest] T264850: Categorylinks dump might have some problem with the encoding

2020-10-11 Thread ArielGlenn
ArielGlenn removed projects: Wikidata, Wikidata-Query-Service, Analytics. TASK DETAIL https://phabricator.wikimedia.org/T264850

2020-10-09 Thread marcmiquel
marcmiquel added a comment. Thank you @ArielGlenn and @Lucas_Werkmeister_WMDE. To explain what I am doing (https://pastebin.com/kPrwQ0Lb): I first collect all the categories from the page dump and put them into dictionaries. Then, I am parsing the categorylinks

2020-10-09 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment. The encoding looks correct in my terminal:
$ curl -s https://dumps.wikimedia.org/rowiki/20201001/rowiki-20201001-categorylinks.sql.gz | gunzip | sed 's/),(/),\n(/g' | grep -aF Dansuri_rom
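For anyone who wants to repeat this check from a script rather than a terminal, here is a rough Python sketch of the same idea: decompress the dump, decode each line as UTF-8, split the INSERT values on `),(` as the sed expression does, and keep the rows containing a search string. The function name `find_rows` and the sample row are made up for illustration. The split is as naive as the sed one and can cut a row in the wrong place if a value itself contains `),(`.

```python
import gzip
import io


def find_rows(gz_stream, needle):
    """Yield value tuples that contain `needle` from a gzipped SQL dump.

    `gz_stream` is a path or a binary file object. Lines are decoded as
    UTF-8; undecodable bytes are replaced so that a bad byte is visible
    in the output instead of raising an exception.
    """
    with gzip.open(gz_stream, mode="rb") as fh:
        for raw_line in fh:
            line = raw_line.decode("utf-8", errors="replace")
            # Same naive split as sed 's/),(/),\n(/g'.
            for row in line.split("),("):
                if needle in row:
                    yield row


# Hypothetical sample data, not taken from the real dump.
sample = (
    "INSERT INTO `categorylinks` VALUES "
    "(1,'Exemplu_\u00e2ne\u0219ti','x'),(2,'Other','y');\n"
)
sample_gz = io.BytesIO(gzip.compress(sample.encode("utf-8")))
matches = list(find_rows(sample_gz, "\u00e2ne\u0219ti"))
print(matches)
```

If the matching rows print with readable diacritics here, the bytes in the file are valid UTF-8 and any escaping is introduced later, by whatever displays or re-encodes the text.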

2020-10-08 Thread ArielGlenn
ArielGlenn added a comment. In T264850#6531377, @Milimetric wrote:
> @ArielGlenn is this something you'd know about or know who to point me to?
I think the wdqs folks are going to be your best bet; I've added the project. Looks

2020-10-08 Thread ArielGlenn
ArielGlenn added a comment.
$ echo -n ânești | od -t x1
000 c3 a2 6e 65 c8 99 74 69
You appear to be seeing a string representation of the non-ASCII characters as hex bytes, i.e. xc3 xa2 ne xc8 x99 ti. What command are you using to display the text in the file, and on what
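The same check can be reproduced in Python, which also shows where the escaped form comes from. This is only a sketch: the variable names and the helper `unescape_utf8` are invented here, and it assumes the escaped text uses `\x` followed by two hex digits, as Python's own bytes repr does.

```python
# "ânești" written with escapes so the source file encoding does not matter.
name = "\u00e2ne\u0219ti"

raw = name.encode("utf-8")
hex_dump = raw.hex(" ")    # space-separated hex bytes, like od -t x1
escaped = repr(raw)        # the bytes shown as an escaped string
round_trip = raw.decode("utf-8")

print(hex_dump)
print(escaped)
print(round_trip)


def unescape_utf8(text):
    r"""Turn text holding literal \xNN escapes back into the UTF-8 string.

    Hypothetical helper: it assumes `text` is pure ASCII and that every
    escape stands for one byte of a UTF-8 sequence.
    """
    as_bytes = text.encode("ascii").decode("unicode_escape").encode("latin-1")
    return as_bytes.decode("utf-8")


print(unescape_utf8(r"\xc3\xa2ne\xc8\x99ti"))
```

Seeing the escaped form usually means that something printed a bytes object, or its repr, instead of decoding it as UTF-8 first; the underlying data is the same eight bytes either way.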

2020-10-08 Thread ArielGlenn
ArielGlenn added projects: Wikidata-Query-Service, Dumps-Generation. Restricted Application added a project: Wikidata.