On Tue, May 2, 2017 at 8:05 PM, Jaime Crespo <[email protected]> wrote: > On Tue, May 2, 2017 at 9:24 PM, Brian Wolff <[email protected]> wrote: > >> . >> > >> > On the latest discussions, there are proposals to increase the minimum >> > mediawiki requirements to MySQL/MariaDB 5.5 and allow binary or utf8mb4 >> > (not utf8, 3 byte utf8), https://phabricator.wikimedia.org/T161232. >> Utf8mb4 >> > should be enough for most uses (utf8 will not allow for emojis, for >> > example), although I am not up to date with the latest unicode standard >> > changes and MySQL features supporting them. >> > >> >> I dont know about mysql, but in unicode emojis are like any other astral >> character, and utf-8 can encode them in 4 bytes*. >> > > I am sorry I wasn't clear before, MySQL's utf8 IS NOT international > standard generally known as UTF-8, it is a bastardization of 3-byte max > UTF-8. MySQL's utf8mb4 is UTF-8: > > Proof: > > ``` > mysql> use test > Database changed > mysql> CREATE TABLE test (a char(1) CHARSET utf8, b char(1) CHARSET > utf8mb4, c binary(4)); > Query OK, 0 rows affected (0.02 sec) > > mysql> SET NAMES utf8mb4; > Query OK, 0 rows affected (0.00 sec) > > mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9'); > Query OK, 1 row affected, 1 warning (0.00 sec) > > mysql> SHOW WARNINGS; > +---------+------+--------------------------------------------------------------------+ > | Level | Code | Message > | > +---------+------+--------------------------------------------------------------------+ > | Warning | 1366 | Incorrect string value: '\xF0\x9F\x92\xA9' for column > 'a' at row 1 | > +---------+------+--------------------------------------------------------------------+ > 1 row in set (0.01 sec) > > mysql> SELECT * FROM test; > +------+------+------+ > | a | b | c | > +------+------+------+ > | ? | 💩 | 💩 | -- you will need an emoji-compatible font here > +------+------+------+ > 1 row in set (0.00 sec) > > mysql> SELECT hex(a), hex(b), hex(c) FROM test; > +--------+----------+----------+ > | hex(a) | hex(b) | hex(c) | > +--------+----------+----------+ > | 3F | F09F92A9 | F09F92A9 | > +--------+----------+----------+ > 1 row in set (0.00 sec) > ``` > > To avoid truncations: > > ``` > mysql> set sql_mode='TRADITIONAL'; -- > https://phabricator.wikimedia.org/T108255 > Query OK, 0 rows affected (0.00 sec) > > mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9'); > ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x92\xA9' for column > 'a' at row 1 > ``` > > More info at: > https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html vs. > https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html > > > -- > Jaime Crespo > <http://wikimedia.org> > _______________________________________________ > Wikitech-l mailing list > [email protected] > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Oh my bad, I had misread your previous email. I thought you were talking about emoiji's in utf8mb4. -- Brian _______________________________________________ Wikitech-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikitech-l
