Mark,

On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) <
gm...@kennel17.co.uk> wrote:

> Hi all,
>
> I seem to recall that a long, long time ago MediaWiki was using UTF-8
> internally but storing the data in 'latin1' fields in MySQL.
>
> I notice that there is now the option to use either 'utf8' or 'binary'
> columns (via the $wgDBmysql5 setting), and the default appears to be
> 'binary'.[1]
>

I can give you some general information about the MySQL side of things.

'utf8' in MySQL is a 3-byte subset of UTF-8. "Real" UTF-8 is called utf8mb4
in MySQL. While this may sound silly, keep in mind that emojis and other
characters beyond the Basic Multilingual Plane were probably more
theoretical than practical 10-15 years ago, and variable-length string
performance was not good in those early MySQL versions.
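
To make the 3-byte limit concrete, here is a minimal Python sketch. The
helper names (utf8_len, fits_mysql_utf8) are made up for illustration and
are not part of MySQL or MediaWiki; the check simply assumes that MySQL's
'utf8' accepts exactly the characters of the Basic Multilingual Plane.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """Hypothetical check: True if every character is inside the BMP.

    Assumption: MySQL's 3-byte 'utf8' stores exactly the code points up to
    U+FFFF; anything above that needs 4 bytes and therefore utf8mb4.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    # ASCII letter, accented letter, euro sign, emoji (U+1F600)
    for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X} needs {utf8_len(ch)} byte(s)")
```

A check like this could be run over existing text to estimate whether any
stored data would be affected by the choice between utf8 and utf8mb4.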

I know there was some conversion pain in the past, but right now, in order
to be as compatible as possible, binary collation is used almost everywhere
on WMF servers (there may be some old text that was not converted, but this
is true for most live data/metadata databases that I have seen). MediaWiki
only requires MySQL 5.0, and using binary strings makes it possible to
support collations and charsets that are only available in the latest
MySQL/MariaDB versions.

In the latest discussions, there are proposals to raise the minimum
MediaWiki requirement to MySQL/MariaDB 5.5 and to allow binary or utf8mb4
(not utf8, the 3-byte utf8): https://phabricator.wikimedia.org/T161232.
utf8mb4 should be enough for most uses (utf8 will not allow emojis, for
example), although I am not up to date with the latest Unicode standard
changes and the MySQL features supporting them.

> I've come across an old project which followed MediaWiki's lead (literally
> - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1
> tables.  I need to upgrade it to a more modern data infrastructure, but I'm
> hesitant to simply switch to 'utf8' without understanding the reasons for
> this initial implementation decision.
>

I strongly suggest going for utf8mb4 if you are on MySQL >= 5.5, and binary
only if you have some special needs that utf8mb4 doesn't cover. InnoDB
variable-length performance has been "fixed" in the newest InnoDB versions,
and it is the recommended default nowadays.
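
For the specific case of UTF-8 data sitting in latin1 tables, the main risk
when moving to utf8mb4 is double encoding. The following is only a sketch
of the byte-level effect in Python, not a migration procedure: the function
names are hypothetical, and Python's "latin-1" codec stands in for MySQL's
latin1, which is a simplification (MySQL's latin1 is closer to
Windows-1252).

```python
def naive_convert(stored: bytes) -> bytes:
    """Simulate a charset conversion that trusts the 'latin1' label.

    Each stored byte is treated as one latin1 character and re-encoded as
    UTF-8, so every byte >= 0x80 turns into two bytes (mojibake).
    """
    return stored.decode("latin-1").encode("utf-8")


def reinterpret(stored: bytes) -> str:
    """Keep the stored bytes untouched and only read them as UTF-8."""
    return stored.decode("utf-8")


if __name__ == "__main__":
    original = "caf\u00e9 \U0001F600"
    stored = original.encode("utf-8")  # what the application wrote
    print(reinterpret(stored) == original)
    print(naive_convert(stored).decode("utf-8") == original)
```

The point of the sketch is that the stored bytes are already valid UTF-8,
so the conversion has to change how the bytes are labelled rather than
transcode them. Test whichever method you pick on a copy of the data first.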

Cheers,
-- 
Jaime Crespo
<http://wikimedia.org>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l