https://bugzilla.wikimedia.org/show_bug.cgi?id=68724
Bawolff (Brian Wolff) <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GWToolset uploads files     |GWToolset should assume non
                   |with unknown characters in  |unicode characters are
                   |title                       |windows-1252 not iso 8859-1

--- Comment #1 from Bawolff (Brian Wolff) <[email protected]> ---
Ok. What happened is that the data was originally in a character set called
windows-1252. In that character set "œ" is encoded as 0x9C. Somewhere along
the line it got converted to UTF-8, but the conversion assumed that the
original data was in a character set called iso-8859-1. That character set
uses 0x9C to mean "STRING TERMINATOR", which is an invisible control
character. So the end result is that the image had a title in MW containing
0xC2 0x9C, which is the UTF-8 encoding of "STRING TERMINATOR", instead of
0xC5 0x93, which is the UTF-8 encoding of LATIN SMALL LIGATURE OE.

-----

It's hard to tell at which step the error occurred. If the conversion error
happened in the CSV->XML transformation, then it's not GWToolset's fault. If
the error occurred in the XML->upload step, then it would be.

Could you maybe upload the relevant CSV and XML files as attachments?
(Copying and pasting into Bugzilla comments messes with the encoding.)
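The byte-level mix-up described in the comment can be reproduced with a short Python sketch. This is only an illustration, not GWToolset code: the helper names `decode_both` and `repair` and the sample title fragment are made up for the example, and the repair step assumes the text was misread exactly once as iso-8859-1.

```python
def decode_both(raw):
    """Decode the same bytes as windows-1252 and as iso-8859-1.

    Hypothetical helper for illustration; returns (intended, misread).
    """
    return raw.decode("windows-1252"), raw.decode("iso-8859-1")


def repair(misread):
    """Undo a single iso-8859-1 misread of windows-1252 data.

    Hypothetical helper. It raises UnicodeDecodeError for the few bytes
    that windows-1252 leaves undefined (for example 0x81), and
    UnicodeEncodeError if the text contains characters above U+00FF.
    """
    return misread.encode("iso-8859-1").decode("windows-1252")


# 0x9C is the windows-1252 byte for LATIN SMALL LIGATURE OE.
intended, misread = decode_both(b"\x9c")

print(repr(intended), intended.encode("utf-8").hex())  # 'œ' c593
print(repr(misread), misread.encode("utf-8").hex())    # '\x9c' c29c

# A made-up title fragment that went through the wrong conversion.
print(repair("c\x9cur"))  # cœur
```

Running the same check over the CSV and the XML separately would show which file already contains 0xC2 0x9C, and therefore at which step the wrong character set was assumed.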
