On 05/23/2013 11:31 PM, [email protected] wrote:
> Hi,
> 
> I'm testing a new rendering option for the <math /> element and had
> problems storing MathML elements in the database field
> math_mathml, which is of type text.
> The MathML elements contain a wide range of Unicode characters, such as
> INVISIBLE TIMES, which is encoded as 0xE2 0x81 0xA2 in UTF-8, or even 4-byte
> characters like MATHEMATICAL BOLD CAPITAL A (0xF0 0x9D 0x90 0x80).
> In some rare cases I had problems retrieving the stored value correctly from
> MySQL.
> To fix that problem I'm now using the PHP functions utf8_encode()/utf8_decode(),
> which is not a very intuitive solution.
> Do you know a better method to solve this issue without changing the
> database layout?
> 
> Best
> Physikerwelt

If you use MySQL: when you installed MediaWiki (or created the table),
did you choose the "UTF-8" option instead of "binary"? The underlying
MySQL character set is "utf8"[1], which stores at most three bytes per
character and therefore does not support characters above U+FFFF
(those that need four bytes in UTF-8).
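
To check whether a given string would hit that limit, something like the
following Python sketch could be used. It is only an illustration, not part
of MediaWiki; the helper name needs_utf8mb4 is made up for this example.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies above U+FFFF (outside the BMP)."""
    # Hypothetical helper: such characters take four bytes in UTF-8,
    # which MySQL's three-byte "utf8" character set cannot store.
    return any(ord(ch) > 0xFFFF for ch in text)


invisible_times = "\u2062"   # INVISIBLE TIMES
bold_a = "\U0001D400"        # MATHEMATICAL BOLD CAPITAL A

# UTF-8 encodings of the two characters from the original question.
invisible_times_bytes = invisible_times.encode("utf-8")  # 3 bytes
bold_a_bytes = bold_a.encode("utf-8")                    # 4 bytes

for name, ch in (("INVISIBLE TIMES", invisible_times),
                 ("MATHEMATICAL BOLD CAPITAL A", bold_a)):
    print(name, ch.encode("utf-8").hex(" "), needs_utf8mb4(ch))
```

Only the second character is outside the Basic Multilingual Plane, so only
it would be affected by the "utf8" limit.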

This is mentioned in the web installer (message 'config-charset-help'):

> In binary mode, MediaWiki stores UTF-8 text to the database in binary
> fields. This is more efficient than MySQL's UTF-8 mode, and allows
> you to use the full range of Unicode characters. In UTF-8 mode, MySQL
> will know what character set your data is in, and can present and
> convert it appropriately, but it will not let you store characters
> above the Basic Multilingual Plane[2].

MySQL 5.5 did introduce a new "utf8mb4" character set, which does
support four-byte characters; however, MediaWiki does not currently
support that option (now filed as bug 48767).

The WMF of course has to use the 'binary' option (actually, UTF-8 stored
in latin1 columns, as mentioned in bug 32217) to allow storage
of all sorts of obscure characters from different languages.

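utf8_encode()/utf8_decode() work around the problem because utf8_encode()
treats its input as ISO-8859-1 and replaces each byte value from 80 to FF
with the two-byte UTF-8 encoding of the corresponding character U+0080 to
U+00FF (C2 80 to C3 BF), and the 'utf8' option does allow those
characters. utf8_decode() reverses that mapping when the value is read back.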
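
The effect can be reproduced outside PHP. The Python sketch below only
approximates the two PHP functions (the names php_utf8_encode and
php_utf8_decode are made up for this example) by going through Latin-1:

```python
def php_utf8_encode(raw: bytes) -> bytes:
    """Approximate PHP's utf8_encode(): read bytes as ISO-8859-1, emit UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def php_utf8_decode(encoded: bytes) -> bytes:
    """Approximate PHP's utf8_decode(): read UTF-8, emit ISO-8859-1 bytes.

    Characters outside Latin-1 become "?", as they do in PHP.
    """
    return encoded.decode("utf-8").encode("latin-1", errors="replace")


# MATHEMATICAL BOLD CAPITAL A as real UTF-8: F0 9D 90 80 (four bytes).
original = "\U0001D400".encode("utf-8")

# After the extra encoding step, each original byte is a separate
# character in the range U+0080..U+00FF, stored as two bytes.
wrapped = php_utf8_encode(original)
restored = php_utf8_decode(wrapped)

print(original.hex(" "))
print(wrapped.hex(" "))
print(restored == original)
```

The wrapped value is twice as long and no longer meaningful as text to
MySQL, but every character in it fits into the three-byte "utf8" set,
which is why the round trip survives.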

[1]: https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html
[2]: http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
-- 
Wikipedia user PleaseStand
http://en.wikipedia.org/wiki/User:PleaseStand
