On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) <
[email protected]> wrote:

> I seem to recall that a long, long time ago MediaWiki was using UTF-8
> internally but storing the data in 'latin1' fields in MySQL.
>

Indeed. See $wgLegacyEncoding
<https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding> (and T128149
<https://phabricator.wikimedia.org/T128149>/T155529
<https://phabricator.wikimedia.org/T155529>).
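For illustration only (this is not MediaWiki code): storing UTF-8 bytes in a latin1 column produces mojibake when the driver decodes them as latin1, and the classic round trip to recover the text looks like this in Python:

```python
# Hypothetical sketch of the latin1-stored-UTF-8 situation.
# UTF-8 bytes written into a latin1 column come back mis-decoded;
# re-encoding as latin1 recovers the original UTF-8 bytes.
original = "naïve"
utf8_bytes = original.encode("utf-8")       # the bytes actually stored
mojibake = utf8_bytes.decode("latin1")      # what a latin1 connection returns
recovered = mojibake.encode("latin1").decode("utf-8")
print(recovered)  # 'naïve'
```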


> I notice that there is now the option to use either 'utf8' or 'binary'
> columns (via the $wgDBmysql5 setting), and the default appears to be
> 'binary'.[1]

I've come across an old project which followed MediaWiki's lead (literally
> - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1
> tables.  I need to upgrade it to a more modern data infrastructure, but I'm
> hesitant to simply switch to 'utf8' without understanding the reasons for
> this initial implementation decision.
>

MySQL's utf8 charset stores at most three bytes per character (i.e. BMP
only), so it's not a good idea to use it. utf8mb4 should work in theory. I
think the only reason we don't use it is inertia (compatibility problems
with old MySQL versions, lack of testing with MediaWiki, and the difficulty
of migrating huge Wikimedia datasets).
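To make the BMP point concrete (a standalone Python illustration, not anything from MediaWiki): characters outside the Basic Multilingual Plane need four bytes in UTF-8, which is exactly what MySQL's three-byte utf8 cannot store but utf8mb4 can:

```python
# MySQL's utf8 caps characters at 3 bytes, so it covers only the BMP.
bmp_char = "é"      # U+00E9, inside the BMP
astral_char = "😀"  # U+1F600, outside the BMP

print(len(bmp_char.encode("utf-8")))     # 2 bytes: fits in MySQL utf8
print(len(astral_char.encode("utf-8")))  # 4 bytes: needs utf8mb4
```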
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l