Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?
Note to self: look into the code that orders text (collation) in MediaWiki; that has to be a fun one :-)

--
-- ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?
Tei wrote:
> Note to self: look into the code that orders text (collation) in MediaWiki; that has to be a fun one :-)

There is none. Sorting is done by the database. In the default compatibility mode, binary collation is used, that is, a byte-by-byte comparison of the UTF-8 encoded data. Which sucks. But we are stuck with it until MySQL gets proper Unicode support.

If you set up the database to use proper UTF-8, collation is a bit better (though still not configurable, I think). But it crashes hard if you try to store characters that are outside the Basic Multilingual Plane (Gothic letters, some obscure Chinese characters, ...), which is why this is not used on Wikipedia.

-- daniel
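
As a rough illustration of what byte-by-byte comparison means in practice, the Python sketch below sorts a few titles by their raw UTF-8 bytes and checks whether a character lies outside the Basic Multilingual Plane. The function names and sample titles are made up for this example; it only mimics the ordering and is not what MySQL or MediaWiki actually run.

```python
def binary_sort(titles):
    """Sort titles the way a binary collation would: by raw UTF-8 bytes."""
    return sorted(titles, key=lambda t: t.encode("utf-8"))


def outside_bmp(ch):
    """True if the code point is above U+FFFF, i.e. it needs 4 bytes in UTF-8."""
    return ord(ch) > 0xFFFF


# Hypothetical sample titles, chosen to show case and accent effects.
titles = ["apple", "Österreich", "Zebra", "zebra"]
print(binary_sort(titles))

gothic_ahsa = "\U00010330"  # GOTHIC LETTER AHSA, outside the BMP
print(outside_bmp(gothic_ahsa), len(gothic_ahsa.encode("utf-8")))
print(outside_bmp("中"), len("中".encode("utf-8")))
```

With byte ordering, "Zebra" sorts before "apple" (0x5A is less than 0x61), and "Österreich" lands after "zebra" because its first UTF-8 byte is 0xC3. The Gothic letter needs four bytes in UTF-8, while "中" needs three and stays inside the BMP.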
Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?
On Wed, Mar 11, 2009 at 6:14 AM, Daniel Kinzler dan...@brightbyte.de wrote:
> There is none. Sorting is done by the database. In the default compatibility mode, binary collation is used, that is, a byte-by-byte comparison of the UTF-8 encoded data. Which sucks. But we are stuck with it until MySQL gets proper Unicode support.

And until we upgrade to that version. MySQL 4 doesn't have *any* Unicode support -- or any character encoding support, in fact. Everything is binary.

But we don't have to wait on MySQL. We would just have to store a Unicode sortkey in cl_sortkey instead of the actual Unicode characters. This would require an implementation of a Unicode sorting algorithm in MediaWiki. It could be language-specific or whatever you want.
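
To make the sortkey idea a bit more concrete, here is a minimal Python sketch of a key whose plain byte order already gives a friendlier ordering. It is an assumption-laden toy, not the Unicode Collation Algorithm and not anything MediaWiki ships: the function name is hypothetical, and the case-folding and accent-stripping rules are just one simple choice.

```python
import unicodedata


def make_sortkey(title: str) -> bytes:
    """Build a byte string that sorts sensibly under binary comparison.

    Hypothetical helper: case is folded and combining marks are dropped to
    form a primary key; the original title is appended as a tiebreaker.
    """
    decomposed = unicodedata.normalize("NFKD", title.casefold())
    primary = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The NUL separator keeps a shorter primary key ahead of a longer one.
    return primary.encode("utf-8") + b"\x00" + title.encode("utf-8")


# The database would store make_sortkey(title) and keep comparing bytes.
titles = ["Zebra", "apple", "Österreich", "zebra"]
print(sorted(titles, key=make_sortkey))
```

Because the key is computed before storage, the database can keep its binary comparison while the ordering rules live in application code, where they could be made language-specific.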
Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?
On Fri, Mar 6, 2009 at 3:54 PM, Daniel Kinzler dan...@brightbyte.de wrote:
> Again: never mind what it is declared as, it *is* UTF-8. MySQL may, however, automatically convert it on the way to the client or dump program. To prevent that, tell MySQL that the encoding of your client is latin1. Confusing? Hell yeah :)

The best way is to use VARBINARY or BINARY.
Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?
jida...@jidanni.org wrote:
> Say, e.g., api.php?action=query&list=logevents looks fine, but when I look at the same table in an SQL dump, the Chinese utf8 is just a latin1 jumble. How can I convert such strings back to utf8? I can't find the place where MediaWiki converts them back and forth.

It doesn't. It's already UTF-8; only MySQL thinks it's not. This is because MySQL doesn't support UTF-8 before 5.0, and even in 5.0 and later the support is flaky. So MediaWiki (by default) tells MySQL that the data is latin1 and treats it as binary. Whether you see it as a jumble depends entirely on the program you view it with.

This is a nasty hack, and it may cause corruption when importing/exporting dumps. Be careful about it.

-- daniel
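
If the jumble comes from the UTF-8 bytes having been decoded as latin1 somewhere along the way, it can usually be reversed by turning the characters back into the original bytes and decoding those as UTF-8. A minimal Python sketch, assuming that is what happened to the dump; the function name is hypothetical, and the cp1252 step is an assumption based on MySQL's latin1 following Windows-1252 in the 0x80-0x9F range:

```python
def latin1_jumble_to_utf8(jumbled: str) -> str:
    """Undo a UTF-8-read-as-latin1 mix-up (hypothetical helper).

    Each character is mapped back to the single byte it came from, trying
    cp1252 first and falling back to ISO-8859-1 for the code points that
    cp1252 cannot encode. The recovered bytes are then decoded as UTF-8.
    """
    raw = bytearray()
    for ch in jumbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            raw += ch.encode("latin-1")
    return raw.decode("utf-8")


# Simulate the jumble: UTF-8 bytes of a Chinese string misread as latin1.
original = "中文"
jumbled = original.encode("utf-8").decode("latin-1")
print(latin1_jumble_to_utf8(jumbled) == original)
```

This only helps with text that was mis-decoded exactly once. If bytes were lost or replaced during an export, decoding raises an error instead of silently producing wrong text, which is the safer failure mode for a dump.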