https://bugzilla.wikimedia.org/show_bug.cgi?id=68724

Bawolff (Brian Wolff) <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GWToolset uploads files     |GWToolset should assume non
                   |with unknown characters in  |unicode characters are
                   |title                       |windows-1252 not iso 8859-1

--- Comment #1 from Bawolff (Brian Wolff) <[email protected]> ---
Ok. What happened is that the data was originally in a character set called
windows-1252. In that character set "œ" is encoded as 0x9C. Somewhere along the
line, it got converted to UTF-8, but the conversion assumed that the original
data was in a character set called iso-8859-1. That character set uses 0x9C to
mean "STRING TERMINATOR" (U+009C), which is an invisible control character.

So the end result is the image had a title in MW with 0xC2 0x9C which is the
UTF-8 code for "STRING TERMINATOR", instead of 0xC5 0x93 which is the UTF-8
code for LATIN SMALL LIGATURE OE.
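
To make the byte arithmetic above concrete, here is a minimal Python sketch. It
only uses the standard codecs ("cp1252" is Python's name for windows-1252,
"latin-1" for iso-8859-1). The variable names and the repair() helper are made
up for illustration and are not part of GWToolset.

```python
# The single byte as it appeared in the original windows-1252 data.
raw = b"\x9c"

# Decoding with the right and the wrong legacy character set.
as_cp1252 = raw.decode("cp1252")    # U+0153 LATIN SMALL LIGATURE OE
as_latin1 = raw.decode("latin-1")   # U+009C STRING TERMINATOR (invisible)

# Re-encoding each result as UTF-8 gives the two byte sequences from the comment.
good_utf8 = as_cp1252.encode("utf-8")   # b"\xc5\x93"
bad_utf8 = as_latin1.encode("utf-8")    # b"\xc2\x9c"


def repair(text):
    """Hypothetical helper: undo a latin-1 mis-decoding of windows-1252 data.

    Only safe when every character in `text` fits in one latin-1 byte and
    that byte is defined in cp1252; otherwise an exception is raised.
    """
    return text.encode("latin-1").decode("cp1252")


print(good_utf8.hex(), bad_utf8.hex())
print(repair(as_latin1) == as_cp1252)
```

The round trip in repair() works here because iso-8859-1 maps every byte to the
code point with the same number, so the original byte can be recovered and
decoded again with the intended character set.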

-----

It's hard to tell at which step the error occurred. If the conversion error
happened in the csv->xml transformation, then it's not GWToolset's fault. If the
error occurred in the xml->upload step, then it would be. Could you upload
the relevant csv and xml files as attachments? (Copying and pasting into
Bugzilla comments messes with the encoding.)
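
One rough way to narrow down the step, once the raw files are available, is to
check each file's bytes for C1 control characters (U+0080 to U+009F), which
rarely appear in real text. This is only a sketch: the diagnose() function and
its messages are hypothetical, and it assumes the xml file is meant to be UTF-8.

```python
def c1_controls(text):
    """Return (index, code point) pairs for C1 control characters in text."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if 0x80 <= ord(ch) <= 0x9F]


def diagnose(data):
    """Classify raw file bytes (hypothetical helper, heuristic only)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0x9C byte is not valid UTF-8, so the file is probably still
        # in a legacy character set such as windows-1252.
        return "not valid UTF-8"
    hits = c1_controls(text)
    if hits:
        # Valid UTF-8 containing e.g. U+009C suggests an earlier step already
        # mis-decoded the data as iso-8859-1.
        return "C1 controls found: %s" % hits
    return "clean UTF-8"


print(diagnose(b"c\x9cur"))                   # raw windows-1252 bytes
print(diagnose("c\u009cur".encode("utf-8")))  # already mis-converted
print(diagnose("c\u0153ur".encode("utf-8")))  # correctly converted
```

If the xml file already reports C1 controls, the damage happened before the
upload step. If it is clean, the xml->upload step is the more likely culprit.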

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
