On 05/23/2013 11:31 PM, [email protected] wrote:
> Hi,
>
> I'm testing a new rendering option for the <math /> element and had
> problems storing MathML elements in the database field math_mathml,
> which is of type text.
> The MathML elements contain a wide range of Unicode characters, like
> INVISIBLE TIMES, which is encoded as 0xE2 0x81 0xA2 in UTF-8, or even
> 4-byte characters like MATHEMATICAL BOLD CAPITAL A (0xF0 0x9D 0x90 0x80).
> In some rare cases I had problems retrieving the stored value correctly
> from MySQL.
> To fix that problem I'm now using the PHP functions utf8_encode/decode,
> which is not a very intuitive solution.
> Do you know a better method to solve this issue without changing the
> database layout?
>
> Best
> Physikerwelt

If you use MySQL, when you installed MediaWiki (or created the table),
did you choose the "UTF-8" option instead of "binary"? The underlying
MySQL character set is "utf8"[1], which does not support characters
above U+FFFF (four-byte characters). This is mentioned in the web
installer (message 'config-charset-help'):

> In binary mode, MediaWiki stores UTF-8 text to the database in binary
> fields. This is more efficient than MySQL's UTF-8 mode, and allows
> you to use the full range of Unicode characters. In UTF-8 mode, MySQL
> will know what character set your data is in, and can present and
> convert it appropriately, but it will not let you store characters
> above the Basic Multilingual Plane[2].

MySQL 5.5 did introduce a new "utf8mb4" character set, which does
support four-byte characters; however, MediaWiki does not currently
support that option (now filed as bug 48767).

The WMF of course has to use the 'binary' option (actually, UTF-8 stored
in latin1 columns, as mentioned in bug 32217) to allow storage of all
sorts of obscure characters from different languages.

utf8_encode()/utf8_decode() work around the problem because they replace
byte values 80 to FF with two-byte characters from U+0080 to U+00FF
(encoded as C2 80 to C3 BF), and the 'utf8' option does allow those
characters.

[1]: https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html
[2]: http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes

-- 
Wikipedia user PleaseStand
http://en.wikipedia.org/wiki/User:PleaseStand
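
To illustrate the byte-level explanation in the reply above, here is a minimal sketch in Python. The helpers php_utf8_encode(), php_utf8_decode() and max_bytes_per_char() are hypothetical names introduced only for this example; they approximate PHP's utf8_encode()/utf8_decode() by treating the input as ISO-8859-1 (Latin-1) bytes, and they are not part of MediaWiki or PHP.

```python
# Sketch only: approximates PHP's utf8_encode()/utf8_decode() in Python.
# All function names here are hypothetical, chosen for this illustration.

def php_utf8_encode(data: bytes) -> bytes:
    """Treat every input byte as a Latin-1 character and re-encode as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def php_utf8_decode(data: bytes) -> bytes:
    """Inverse step: decode UTF-8 and map characters back to single bytes.

    Anything that does not fit into Latin-1 becomes b'?'.
    """
    return data.decode("utf-8", errors="replace").encode("latin-1", errors="replace")


def max_bytes_per_char(utf8_data: bytes) -> int:
    """Length in bytes of the longest UTF-8 sequence in the data."""
    return max(len(ch.encode("utf-8")) for ch in utf8_data.decode("utf-8"))


# The two characters named in the original question.
invisible_times = "\u2062".encode("utf-8")   # E2 81 A2 (three bytes)
bold_capital_a = "\U0001D400".encode("utf-8")  # F0 9D 90 80 (four bytes)

# After the extra encoding step, each original byte in the range 80..FF
# has become a two-byte sequence, so no sequence is longer than two bytes.
wrapped = php_utf8_encode(bold_capital_a)
print(wrapped.hex(" "))
print(max_bytes_per_char(bold_capital_a), max_bytes_per_char(wrapped))

# Decoding restores the original four-byte sequence.
restored = php_utf8_decode(wrapped)
print(restored == bold_capital_a)
```

The wrapped value contains only characters up to U+00FF, which is why a column using MySQL's three-byte "utf8" character set accepts it; the cost is that the stored data is double-encoded and takes more space than the original.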
