https://bugzilla.wikimedia.org/show_bug.cgi?id=29564
Web browser: ---
Bug #: 29564
Summary: Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
Product: MediaWiki extensions
Version: any
Platform: All
OS/Version: All
Status: NEW
Severity: major
Priority: Unprioritized
Component: LiquidThreads
AssignedTo: [email protected]
ReportedBy: [email protected]
CC: [email protected]
Classification: Unclassified
The export of one discussion thread (page ID 803932 in huwiki_p):
https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)
contains invalid UTF-8 in the thread poster's signature; the byte sequence appears to be truncated.
A hexdump of the export page shows:
00000be0  74 3b 67 72 65 65 6e 26  71 75 6f 74 3b 20 66 61  |t;green&quot; fa|
00000bf0  63 65 3d 26 71 75 6f 74  3b 4c 75 63 69 64 61 20  |ce=&quot;Lucida |
00000c00  63 61 6c 6c 69 67 72 61  70 68 79 26 71 75 6f 74  |calligraphy&quot|
00000c10  3b 26 67 74 3b ce 93 ce  bf cf 85 ce b2 ce b2 ce  |;&gt;...........|
00000c20  bf cf 82 20 ce 98 ce b9  ce bb ce bf ce 3c 2f 54  |... .........</T|
00000c30  68 72 65 61 64 53 69 67  6e 61 74 75 72 65 3e 0a  |hreadSignature>.|
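The 0xCE byte at offset 0x00000c2c is the lead byte of a two-byte UTF-8 sequence
and must be followed by one continuation byte (0x80 to 0xBF). Instead it is
followed by 0x3C ('<'), the start of the closing </ThreadSignature> tag.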
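The byte-level check can be reproduced with a short script. This is a minimal Python sketch, not part of the original report: the function name `find_invalid_utf8` is hypothetical, and the sample bytes are copied from the hexdump, starting at offset 0xc15 (the first byte after `&gt;`) and ending at 0xc2f.

```python
def find_invalid_utf8(data: bytes):
    """Return (index, byte value) of the first byte that fails to decode
    as UTF-8, or None if the whole buffer is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Bytes copied from the hexdump; BASE is the file offset of the first one.
BASE = 0xC15
sample = bytes.fromhex(
    "ce93cebfcf85ceb2ceb2ce"      # offsets 0xc15 to 0xc1f
    "bfcf8220ce98ceb9cebbcebfce"  # offsets 0xc20 to 0xc2c
    "3c2f54"                      # offsets 0xc2d to 0xc2f: "</T"
)

result = find_invalid_utf8(sample)
if result is not None:
    index, value = result
    print(f"invalid byte 0x{value:02X} at file offset 0x{BASE + index:08x}")
```

The script relies only on Python's strict UTF-8 decoder, so it reports the first offending byte and nothing about how the truncation happened.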
The XML dump process fails silently. The last page in these dumps:
http://download.wikimedia.org/huwiki/20110531/huwiki-20110531-pages-articles.xml.bz2
http://download.wikimedia.org/huwiki/20110614/huwiki-20110614-pages-articles.xml.bz2
is page ID 803931; no XML follows it, so the whole dump is not well-formed XML.
The truncated output still gets compressed via bzip2, though.
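Because the bzip2 container itself is intact, a truncated dump only shows up when the XML inside is parsed. The following is a rough Python sketch of such a check, under stated assumptions: the names `scan_dump` and `local` are hypothetical, and the code assumes the usual MediaWiki export layout in which each `<page>` element has a direct `<id>` child. The demo uses a small in-memory document rather than a real dump.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def scan_dump(stream):
    """Parse a dump incrementally.

    Returns (last_complete_page_id, error), where error is None when the
    document is well-formed and a parser message otherwise.
    """
    last_id = None
    try:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local(elem.tag) == "page":
                # The page ID is the first direct <id> child of <page>.
                for child in elem:
                    if local(child.tag) == "id":
                        last_id = int(child.text)
                        break
                elem.clear()  # free memory for pages already seen
    except ET.ParseError as err:
        return last_id, str(err)
    return last_id, None


# Demo: a document cut off in the middle of its second page, then
# compressed. The compression succeeds even though the XML is incomplete.
truncated = (
    b"<mediawiki><page><title>A</title><id>803931</id></page>\n"
    b"<page><title>B</title><id>803932</id>"
)
compressed = bz2.compress(truncated)

with bz2.open(io.BytesIO(compressed), "rb") as stream:
    last_id, error = scan_dump(stream)
print("last complete page:", last_id)
print("parse error:", error)
```

For a real dump file, `bz2.open(path, "rb")` with the file's path in place of the in-memory buffer should work the same way, since `iterparse` only needs an object with a `read` method.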
This problem was reported on the pywikipedia mailing list by Bináris:
http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/11335