On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) <[email protected]> wrote:
> I seem to recall that a long, long time ago MediaWiki was using UTF-8
> internally but storing the data in 'latin1' fields in MySQL.

Indeed. See $wgLegacyEncoding
<https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding> (and T128149
<https://phabricator.wikimedia.org/T128149> / T155529
<https://phabricator.wikimedia.org/T155529>).

> I notice that there is now the option to use either 'utf8' or 'binary'
> columns (via the $wgDBmysql5 setting), and the default appears to be
> 'binary'.[1]
>
> I've come across an old project which followed MediaWiki's lead (literally
> - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1
> tables. I need to upgrade it to a more modern data infrastructure, but I'm
> hesitant to simply switch to 'utf8' without understanding the reasons for
> this initial implementation decision.

MySQL's utf8 charset uses at most three bytes per character (i.e. it can
only store the Basic Multilingual Plane), so it's not a good idea to use
it. utf8mb4 should work in theory. I think the only reason we don't use it
is inertia (compatibility problems with old MySQL versions, lack of testing
with MediaWiki, and the difficulty of migrating huge Wikimedia datasets).

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
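[Editor's note: a minimal Python sketch, not part of the original message, illustrating the BMP limitation discussed above — characters beyond U+FFFF need four bytes in real UTF-8, which MySQL's three-byte utf8 charset cannot store:]

```python
# Characters inside the BMP fit in at most 3 UTF-8 bytes; astral
# characters (code points above U+FFFF) need 4 bytes, which MySQL's
# legacy 'utf8' charset rejects but 'utf8mb4' accepts.
bmp_char = "é"      # U+00E9, inside the BMP
astral_char = "😀"  # U+1F600, outside the BMP

print(len(bmp_char.encode("utf-8")))     # 2 bytes -> fits in MySQL utf8
print(len(astral_char.encode("utf-8")))  # 4 bytes -> needs utf8mb4
print(ord(astral_char) > 0xFFFF)         # True: beyond the BMP
```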
