On Mon, Apr 30, 2018 at 9:05 AM, Jaime Crespo <[email protected]> wrote:

> * Support "real" (4-byte) UTF-8: utf8mb4 in MySQL/MariaDB (default in the
> latest versions) and start deprecating "fake"  (3-byte) UTF-8: utf8
>

MediaWiki currently doesn't even try to support UTF-8 in MySQL. The core
MySQL schema specifically uses "varbinary" and "blob" types for almost
everything.

Ideally we'd change that, but see below.


> * Check code works as intended in "strict" mode (default in the latest
> versions), at least regarding testing
>

While it's not actually part of "strict mode" (I think), I note that
MariaDB 10.1.32 (tested on db1114) with ONLY_FULL_GROUP_BY still seems to
have the issues described in
https://phabricator.wikimedia.org/T108255#2415773.


> Anomie- I think you were thinking on (maybe?) abstracting schema for
> mediawiki- fixing the duality of binary (defining sizes in bytes) vs. UTF-8
> (defining sizes in characters) would be an interesting problem to solve-
> the duality is ok, what I mean is being able to store radically different
> size of contents based on that setting.
>

That would be an interesting problem to solve, but doing so may be
difficult. We have a number of fields that are currently defined as
varbinary(255) and are fully indexed (i.e. not using a prefix).

   - Just changing them to varchar(255) using utf8mb4 makes the index
   exceed MySQL's length limit for an indexed column.
   - Changing them to varchar(191) to keep within the length limit breaks
   content in primarily-ASCII languages that is taking advantage of the
   existing 255-byte limit to store more than 191 codepoints.
   - Using a prefix index forces ORDER BY on the column to use a filesort.
   - Or the column length limit can be raised if your installation jumps
   through some hoops, which seem to be the default in 5.7.7 but not before:
   innodb_large_prefix
   <https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_large_prefix>
   set to ON, innodb_file_format
   <https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_file_format>
   set to "Barracuda", innodb_file_per_table
   <https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_file_per_table>
   set to ON, and tables created with ROW_FORMAT=DYNAMIC or COMPRESSED. I
   don't know what MariaDB might have as defaults or requirements in which
   versions.
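
To make the arithmetic behind the first two options concrete, here is a
rough sketch. The 767-byte and 3072-byte figures are my recollection of the
InnoDB index key limits (without and with the large-prefix settings) and are
assumptions, not something stated above; the function names are made up for
illustration.

```python
# Worst-case bytes per character for each character set.
MAX_BYTES_PER_CHAR = {"binary": 1, "utf8": 3, "utf8mb4": 4}

# Assumed InnoDB index key limits in bytes (not taken from this thread):
# 767 without the large-prefix settings, 3072 with them.
LIMIT_DEFAULT = 767
LIMIT_LARGE_PREFIX = 3072


def index_bytes(length, charset):
    """Worst-case bytes needed to fully index a column of `length` units."""
    return length * MAX_BYTES_PER_CHAR[charset]


def fits(length, charset, limit=LIMIT_DEFAULT):
    """True if a full (non-prefix) index on the column stays within `limit`."""
    return index_bytes(length, charset) <= limit


if __name__ == "__main__":
    for length, charset in [(255, "binary"), (255, "utf8mb4"), (191, "utf8mb4")]:
        print(length, charset, index_bytes(length, charset), fits(length, charset))
```

Under those assumed limits, varchar(255) in utf8mb4 needs up to 1020 bytes
and does not fit, while varchar(191) needs at most 764 and does.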

The ideal, I suppose, would be to require those hoops be jumped through in
order for utf8mb4 mode to be enabled. Then a lot of code in MediaWiki would
have to vary based on that mode flag to enforce limits on bytes versus
codepoints.
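
As a minimal sketch of what "vary based on that mode flag" could look like
(in Python purely for illustration; the function name and the utf8mb4_mode
flag are hypothetical, not existing MediaWiki code):

```python
def fits_column(value, limit, utf8mb4_mode):
    """Check a string against a column limit.

    In utf8mb4 mode the limit counts codepoints (varchar semantics);
    otherwise it counts UTF-8 encoded bytes (varbinary semantics).
    """
    if utf8mb4_mode:
        # len() on a Python str counts Unicode codepoints.
        return len(value) <= limit
    return len(value.encode("utf-8")) <= limit


if __name__ == "__main__":
    ascii_title = "a" * 255  # 255 bytes, 255 codepoints
    cjk_title = "日" * 100    # 300 bytes, 100 codepoints
    print(fits_column(ascii_title, 255, utf8mb4_mode=False))
    print(fits_column(ascii_title, 191, utf8mb4_mode=True))
    print(fits_column(cjk_title, 255, utf8mb4_mode=False))
    print(fits_column(cjk_title, 191, utf8mb4_mode=True))
```

The ASCII case is the one that breaks when moving from varbinary(255) to
varchar(191), and the CJK case is the one that benefits.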

BTW, for anyone reading this who's interested, the task for that schema
abstraction idea is https://phabricator.wikimedia.org/T191231.

-- 
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
