Re: [Xmldatadumps-l] Encoding issue in the last ZH dump

2013-01-08 Thread Ariel T. Glenn
The issue is that the bad character was added in 2004, before there were aggressive checks for that sort of thing; see https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385 Garbage in, garbage out...

Re: [Xmldatadumps-l] Encoding issue in the last ZH dump

2013-01-08 Thread Federico Leva (Nemo)
Ariel T. Glenn, 08/01/2013 09:26:
> The issue is that the bad character was added in 2004, see
> https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385

I've requested removal and revdeletion:

[Xmldatadumps-l] Encoding issue in the last ZH dump

2013-01-05 Thread Mathieu Poumeyrol
All, I've been struggling to track this down for a few hours. This file is a SQL dump, and its header says it is UTF-8:

http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz

but:

$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8
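
For anyone who wants to reproduce this kind of check without isutf8, here is a minimal Python sketch of the same idea: decode the bytes strictly as UTF-8 and report where the first failure occurs. The function names (first_invalid_utf8, context) are made up for illustration and are not part of any dump tooling. It assumes the decompressed file fits in memory, and its offsets and line numbers follow Python's conventions (0-based byte offset, 1-based line), which may not match what isutf8 prints.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int, str]]:
    """Return (byte_offset, line_number, reason) for the first invalid
    UTF-8 sequence in data, or None if everything decodes cleanly.

    byte_offset is 0-based; line_number is 1-based (hypothetical helper).
    """
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError as err:
        # Count the newlines before the failure to get a line number.
        line_number = data.count(b"\n", 0, err.start) + 1
        return err.start, line_number, err.reason
    return None


def context(data: bytes, offset: int, width: int = 16) -> bytes:
    """Return the raw bytes around offset, for eyeballing the damage."""
    start = max(0, offset - width)
    return data[start:offset + width]


# Small in-memory demonstration: valid Chinese text followed by a stray byte.
sample = b"ok\n" + "中文".encode("utf-8") + b"\xff" + b"tail\n"
found = first_invalid_utf8(sample)
if found is not None:
    offset, line_number, reason = found
    print(f"line {line_number}, byte offset {offset}: {reason}")
    print(context(sample, offset))
```

To run it against the dump, one would read the decompressed file in binary mode (open(path, "rb").read()) and pass the bytes to first_invalid_utf8; the bytes returned by context can then be compared with the row in the SQL file.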