[Wikidata-bugs] [Maniphest] T264850: Categorylinks dump might have some problem with the encoding

2020-10-11 Thread ArielGlenn
ArielGlenn removed projects: Wikidata, Wikidata-Query-Service, Analytics.

TASK DETAIL
  https://phabricator.wikimedia.org/T264850

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: JAllemandou, ArielGlenn
Cc: Lucas_Werkmeister_WMDE, ArielGlenn, Milimetric, Aklapper, marcmiquel, 
Strainu, jannee_e, Lunewa, gnosygnu, CBogen, Akuckartz, 4748kitoko, 
darthmon_wmde, Nandana, Namenlos314, Akovalyov, Lahi, Gq86, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, terrrydactyl, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] T264850: Categorylinks dump might have some problem with the encoding

2020-10-09 Thread marcmiquel
marcmiquel added a comment.


  Thank you @ArielGlenn and @Lucas_Werkmeister_WMDE,
  
  To explain what I am doing ( https://pastebin.com/kPrwQ0Lb ):
  
  First, I collect all the categories from the page dump and put them into 
some dictionaries.
  Then I parse the categorylinks dump and add the page_ids that these 
categories contain.
  
  The problem is with the category titles that contain these special 
characters. 
  The first dump seems to work, but the second one shows these hex bytes.
  
  Perhaps it has to do with how the second dump must be opened or read, but 
I cannot find a way to read it as 'utf-8'. I just added a print('error') and 
I see many of them.
  What could I do? Thanks.
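
  One possible way to read such a dump without hitting decode errors is 
sketched below. It is not taken from the pastebin script. It assumes, without 
confirmation from this thread, that the failures come from bytes outside the 
title field that are not valid UTF-8. The helper name iter_dump_lines and the 
sample row are hypothetical.

```python
import gzip
import io


def iter_dump_lines(source):
    """Yield text lines from a gzipped dump, replacing undecodable bytes."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of
    # raising UnicodeDecodeError, so valid UTF-8 titles still decode.
    with gzip.open(source, mode="rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line


# Synthetic row for illustration only: a title from this thread followed by
# a made-up field holding bytes that are not valid UTF-8.
sample = (
    "(1,'Dansuri_rom\u00e2ne\u0219ti','".encode("utf-8")
    + b"\xff\xfe"
    + b"','page'),\n"
)
lines = list(iter_dump_lines(io.BytesIO(gzip.compress(sample))))
print(lines[0])
```

  gzip.open accepts a file name as well as a file object, so the same helper 
could be pointed at a downloaded .sql.gz file.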



[Wikidata-bugs] [Maniphest] T264850: Categorylinks dump might have some problem with the encoding

2020-10-09 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.


  The encoding looks correct in my terminal:
  
$ curl -s https://dumps.wikimedia.org/rowiki/20201001/rowiki-20201001-categorylinks.sql.gz \
    | gunzip | sed 's/),(/),\n(/g' | grep -aF Dansuri_rom

(750456,'Dansuri_românești','GEGQ?)K1/)CMQK9GEGQ?)K1KEA*C1NO9%ܾ','2010-08-04 16:36:40','Populare','uca-ro-u-kn','subcat'),
(770750,'Dansuri_românești','+K*Q   /)CM','2012-03-01 17:39:10','','uca-ro-u-kn','page'),
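
  For readers working in Python rather than the shell, the sed and grep steps 
above can be approximated as follows. This is a rough sketch, not a SQL 
parser: the function name rows_containing and the sample statement are made 
up for illustration.

```python
def rows_containing(statement, needle):
    """Split a multi-row INSERT on '),(' and keep rows containing needle."""
    # Same naive split as the sed expression: it would also break a row
    # whose own field happens to contain the sequence '),('.
    rows = statement.replace("),(", "),\n(").split("\n")
    return [row for row in rows if needle in row]


# Synthetic statement for illustration; the table name and values are made up.
stmt = (
    "INSERT INTO `t` VALUES "
    "(1,'Alpha','page'),(2,'Dansuri_rom','subcat'),(3,'Beta','page');"
)
matches = rows_containing(stmt, "Dansuri_rom")
print(matches)
```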



[Wikidata-bugs] [Maniphest] T264850: Categorylinks dump might have some problem with the encoding

2020-10-08 Thread ArielGlenn
ArielGlenn added a comment.


  In T264850#6531377, @Milimetric wrote:
  
  > @ArielGlenn is this something you'd know about or know who to point me to?
  
  I think the wdqs folks are going to be your best bet; I've added the project. 
It looks like a simple text encoding error, but I'd like to know exactly which 
tools were used to display the text before saying that for sure.



[Wikidata-bugs] [Maniphest] T264850: Categorylinks dump might have some problem with the encoding

2020-10-08 Thread ArielGlenn
ArielGlenn added a comment.


echo -n ânești  | od -t x1
000 c3 a2 6e 65 c8 99 74 69
  
  You appear to be seeing a string representation of the non-ASCII characters 
as hex bytes, i.e. \xc3\xa2ne\xc8\x99ti. What command are you using to 
display the text in the file, and on what platform?
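
  The difference can be reproduced in Python, assuming (this is a guess, the 
thread does not confirm it) that the script prints a bytes object rather than 
a decoded string. The byte values are the ones from the od output above.

```python
# The UTF-8 bytes reported by od for the tail of the category title.
raw = bytes([0xC3, 0xA2, 0x6E, 0x65, 0xC8, 0x99, 0x74, 0x69])

# The repr of a bytes object shows non-ASCII bytes as \x escapes.
as_repr = repr(raw)

# Decoding the same bytes as UTF-8 gives the readable text.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

  If that guess is right, the bytes in the file are fine and only the 
display step needs a decode("utf-8").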



[Wikidata-bugs] [Maniphest] T264850: Categorylinks dump might have some problem with the encoding

2020-10-08 Thread ArielGlenn
ArielGlenn added projects: Wikidata-Query-Service, Dumps-Generation.
Restricted Application added a project: Wikidata.
