https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

       Web browser: ---
             Bug #: 29564
           Summary: Bad UTF-8 in ThreadSignature breaks huwiki XML dumps
                    and Special:Export
           Product: MediaWiki extensions
           Version: any
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: Unprioritized
         Component: LiquidThreads
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]
    Classification: Unclassified


Export of one of the discussion threads (this is page ID 803932 in huwiki_p):

https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)

contains invalid (apparently truncated) UTF-8 in the thread poster's signature.

Hexdump of the export page reveals:

00000be0  74 3b 67 72 65 65 6e 26  71 75 6f 74 3b 20 66 61  |t;green&quot; fa|
00000bf0  63 65 3d 26 71 75 6f 74  3b 4c 75 63 69 64 61 20  |ce=&quot;Lucida |
00000c00  63 61 6c 6c 69 67 72 61  70 68 79 26 71 75 6f 74  |calligraphy&quot|
00000c10  3b 26 67 74 3b ce 93 ce  bf cf 85 ce b2 ce b2 ce  |;&gt;...........|
00000c20  bf cf 82 20 ce 98 ce b9  ce bb ce bf ce 3c 2f 54  |... .........</T|
00000c30  68 72 65 61 64 53 69 67  6e 61 74 75 72 65 3e 0a  |hreadSignature>.|

The 0xCE byte at offset 0x00000c2c is a UTF-8 lead byte and must be followed
by a continuation byte (0x80-0xBF) to form a valid sequence; here it is
followed directly by 0x3C ('<').
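
To double-check the hexdump reading, here is a minimal Python sketch that
feeds the bytes from offsets 0x0c10 to 0x0c2f to the strict UTF-8 decoder and
reports where decoding first fails. The helper name first_utf8_error and the
constant DUMP_OFFSET are illustrative only; they are not part of MediaWiki or
LiquidThreads.

```python
# Bytes copied from the hexdump rows starting at 0x0c10 and 0x0c20.
DUMP_OFFSET = 0x0C10
chunk = bytes.fromhex(
    "3b 26 67 74 3b ce 93 ce bf cf 85 ce b2 ce b2 ce "
    "bf cf 82 20 ce 98 ce b9 ce bb ce bf ce 3c 2f 54"
)


def first_utf8_error(data: bytes):
    """Return (index, reason) of the first UTF-8 decoding error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start, exc.reason
    return None


error = first_utf8_error(chunk)
if error is not None:
    index, reason = error
    print(f"bad byte 0x{chunk[index]:02x} at dump offset "
          f"0x{DUMP_OFFSET + index:08x}: {reason}")
```

The same helper can be run over a whole saved export page to locate any other
damaged signatures.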

The XML dump process fails silently. The last page in these dumps:

http://download.wikimedia.org/huwiki/20110531/huwiki-20110531-pages-articles.xml.bz2

http://download.wikimedia.org/huwiki/20110614/huwiki-20110614-pages-articles.xml.bz2

is page ID 803931. Nothing follows it, so the whole dump is not well-formed
XML.

The truncated output still gets compressed with bzip2, though.
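
A rough way to detect such a truncated dump is to stream it through an XML
parser and note the last complete page and whether parsing ended cleanly. The
sketch below uses only the Python standard library; the function name
last_page_id is hypothetical, and it assumes the usual export layout in which
each <page> element has a direct <id> child.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def last_page_id(stream):
    """Return (id of last complete <page>, True if the XML was well-formed)."""
    last_id = None
    try:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            # Only direct children are inspected, so revision ids are ignored.
            for child in elem:
                if local_name(child.tag) == "id" and child.text:
                    last_id = int(child.text)
                    break
            elem.clear()  # drop the page's children to limit memory use
    except ET.ParseError:
        # Raised on premature end of input or on invalid bytes.
        return last_id, False
    return last_id, True


# Small in-memory demonstration with a deliberately truncated document.
sample = io.BytesIO(
    b"<mediawiki><page><title>A</title><id>803931</id></page>"
)
print(last_page_id(sample))
```

For a real dump, the stream could be opened with
bz2.open("huwiki-20110614-pages-articles.xml.bz2", "rb") and passed to the
same function.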

This problem was reported on the pywikipedia mailing list by Bináris:

http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/11335

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
