Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread Tei
On 2 May 2017 at 19:10, Mark Clements (HappyDog) wrote:

> Hi all,
>
> I seem to recall that a long, long time ago MediaWiki was using UTF-8
> internally but storing the data in 'latin1' fields in MySQL.
>

I remember an old thread from 2009:

https://lists.gt.net/wiki/wikitech/160875




-- 
--
ℱin del ℳensaje.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread Tim Starling
On 03/05/17 03:10, Mark Clements (HappyDog) wrote:
> Can anyone confirm that MediaWiki used to behave in this manner, and
> if so why?

In MySQL 4.0, MySQL didn't really have character sets; it only had
collations. Text was stored as 8-bit clean binary, and was only
interpreted as a character sequence when compared to other text fields
for collation purposes. There was no UTF-8 collation, so we stored
UTF-8 text in text fields with the default (latin1) collation.

> If it was due to MySQL bugs, does anyone know in what version these
> were fixed?

IIRC it was fixed in MySQL 4.1 with the introduction of proper
character sets.

To migrate such a database, you need to do an ALTER TABLE to switch
the relevant fields from latin1 to the "binary" character set. If you
ALTER TABLE directly to utf8, you'll end up with "mojibake", since the
text will be incorrectly interpreted as latin1 and converted to
Unicode. This is unrecoverable; you have to restore from a backup if
this happens.

I think it is possible to then do an ALTER TABLE to switch from binary
to utf8, but it's been a while since I tested that.
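
A sketch of the two-step conversion described above, on a hypothetical table and column (the names are illustrative, not MediaWiki's actual schema; utf8mb4 is used in the final step rather than MySQL's 3-byte utf8, per the advice elsewhere in this thread):

```sql
-- The column is declared latin1 but actually holds UTF-8 bytes.

-- Step 1: reinterpret as raw bytes. latin1 -> BLOB copies the bytes
-- unchanged, so nothing is converted.
ALTER TABLE page_content MODIFY body MEDIUMBLOB;

-- Step 2: declare the bytes as what they really are. BLOB -> utf8mb4
-- also copies the bytes unchanged; only the declared charset changes.
ALTER TABLE page_content
  MODIFY body MEDIUMTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- What NOT to do: a direct latin1 -> utf8 ALTER converts each latin1
-- "character" to its Unicode equivalent, turning the stored UTF-8 byte
-- sequences into mojibake. Irreversible without a backup.
```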

-- Tim Starling



Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread bawolff
On Tue, May 2, 2017 at 8:05 PM, Jaime Crespo wrote:
> I am sorry I wasn't clear before, MySQL's utf8 IS NOT international
> standard generally known as UTF-8, it is a bastardization of 3-byte max
> UTF-8. MySQL's utf8mb4 is UTF-8.
> [...]

Oh, my bad; I had misread your previous email. I thought you were
talking about emojis in utf8mb4.

--
Brian


Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread Jaime Crespo
On Tue, May 2, 2017 at 9:24 PM, Brian Wolff  wrote:

> .
> >
> > On the latest discussions, there are proposals to increase the minimum
> > mediawiki requirements to MySQL/MariaDB 5.5 and allow binary or utf8mb4
> > (not utf8, 3 byte utf8), https://phabricator.wikimedia.org/T161232.
> Utf8mb4
> > should be enough for most uses (utf8 will not allow for emojis, for
> > example), although I am not up to date with the latest unicode standard
> > changes and MySQL features supporting them.
> >
>
> I dont know about mysql, but in unicode emojis are like any other astral
> character, and utf-8 can encode them in 4 bytes*.
>

I am sorry I wasn't clear before: MySQL's utf8 is NOT the international
standard generally known as UTF-8; it is a bastardization of 3-byte-max
UTF-8. MySQL's utf8mb4 is UTF-8:

Proof:

```
mysql> use test
Database changed
mysql> CREATE TABLE test (a char(1) CHARSET utf8, b char(1) CHARSET
utf8mb4, c binary(4));
Query OK, 0 rows affected (0.02 sec)

mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> SHOW WARNINGS;
+---------+------+--------------------------------------------------------------------+
| Level   | Code | Message                                                            |
+---------+------+--------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x92\xA9' for column 'a' at row 1 |
+---------+------+--------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT * FROM test;
+------+------+------+
| a    | b    | c    |
+------+------+------+
| ?    | 💩   | 💩   |  -- you will need an emoji-compatible font here
+------+------+------+
1 row in set (0.00 sec)

mysql> SELECT hex(a), hex(b), hex(c) FROM test;
+--------+----------+----------+
| hex(a) | hex(b)   | hex(c)   |
+--------+----------+----------+
| 3F     | F09F92A9 | F09F92A9 |
+--------+----------+----------+
1 row in set (0.00 sec)
```

To avoid truncations:

```
mysql> set sql_mode='TRADITIONAL'; -- https://phabricator.wikimedia.org/T108255
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9');
ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x92\xA9' for column
'a' at row 1
```

More info at:
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html vs.
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html


-- 
Jaime Crespo


Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread Brian Wolff
.
>
> On the latest discussions, there are proposals to increase the minimum
> mediawiki requirements to MySQL/MariaDB 5.5 and allow binary or utf8mb4
> (not utf8, 3 byte utf8), https://phabricator.wikimedia.org/T161232.
Utf8mb4
> should be enough for most uses (utf8 will not allow for emojis, for
> example), although I am not up to date with the latest unicode standard
> changes and MySQL features supporting them.
>

I don't know about MySQL, but in Unicode emojis are like any other astral
character, and UTF-8 can encode them in 4 bytes*.

* There are some technicalities in that some emoji are actually multiple
code points that join together to render one glyph (for example, country
flags take 2 codepoints to make a flag, for a total of 8 bytes, and there
are modifier characters that simply change the skin colour of an emoji,
etc.); however, this should not affect anything at the DB layer, as they
are just treated as multiple characters.
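
Assuming a MySQL >= 5.5 session with a utf8mb4 connection, the codepoint/byte distinction above can be checked directly (the flag emoji is just an illustrative input):

```sql
SET NAMES utf8mb4;

-- A flag is two regional-indicator codepoints; each needs 4 bytes in
-- utf8mb4, so the single visible glyph is 2 characters / 8 bytes.
SELECT CHAR_LENGTH('🇪🇸') AS codepoints,  -- 2
       LENGTH('🇪🇸')      AS bytes;       -- 8
```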

--
Brian

Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread Jaime Crespo
Mark,

On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) <
gm...@kennel17.co.uk> wrote:

> Hi all,
>
> I seem to recall that a long, long time ago MediaWiki was using UTF-8
> internally but storing the data in 'latin1' fields in MySQL.
>
> I notice that there is now the option to use either 'utf8' or 'binary'
> columns (via the $wgDBmysql5 setting), and the default appears to be
> 'binary'.[1]
>

I can provide some general information about the MySQL side of things.

'utf8' in MySQL is 3-byte UTF-8; "real" UTF-8 is called utf8mb4 in MySQL.
While this may sound silly, consider that emojis and characters beyond the
Basic Multilingual Plane were probably more theoretical than practical
10-15 years ago, and variable-length string performance was not good in
those early MySQL versions.

I know there was some conversion pain in the past, but right now, in order
to be as compatible as possible, binary collation is used almost everywhere
on WMF servers (there may be some old text not yet converted, but this is
true for most live data/metadata databases that I have seen). MediaWiki
only requires MySQL 5.0, and using binary strings lets it support
collations and charsets that are only available in the latest MySQL/MariaDB
versions.

In the latest discussions, there are proposals to increase the minimum
MediaWiki requirements to MySQL/MariaDB 5.5 and allow binary or utf8mb4
(not utf8, the 3-byte utf8), https://phabricator.wikimedia.org/T161232.
Utf8mb4 should be enough for most uses (utf8 will not allow emojis, for
example), although I am not up to date with the latest Unicode standard
changes and the MySQL features supporting them.

> I've come across an old project which followed MediaWiki's lead (literally
> - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1
> tables.  I need to upgrade it to a more modern data infrastructure, but I'm
> hesitant to simply switch to 'utf8' without understanding the reasons for
> this initial implementation decision.
>

I strongly suggest going with utf8mb4 if MySQL >= 5.5, and binary only if
you have some special needs that utf8mb4 doesn't cover. InnoDB
variable-length performance has been "fixed" in the newest InnoDB versions,
and utf8mb4 is the recommended default nowadays.
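
A minimal sketch of that recommended setup on MySQL/MariaDB >= 5.5 (the table is hypothetical; VARCHAR(191) is a common workaround because 191 x 4 bytes stays under the 767-byte index-key limit of older InnoDB row formats):

```sql
SET sql_mode = 'TRADITIONAL';  -- reject bad values instead of truncating

CREATE TABLE notes (
  id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(191) NOT NULL,   -- indexable even on older InnoDB formats
  body  MEDIUMTEXT,
  KEY title_idx (title)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
```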

Cheers,
-- 
Jaime Crespo


Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread Gergő Tisza
On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) <
gm...@kennel17.co.uk> wrote:

> I seem to recall that a long, long time ago MediaWiki was using UTF-8
> internally but storing the data in 'latin1' fields in MySQL.
>

Indeed. See $wgLegacyEncoding (and T128149/T155529).


> I notice that there is now the option to use either 'utf8' or 'binary'
> columns (via the $wgDBmysql5 setting), and the default appears to be
> 'binary'.[1]



I've come across an old project which followed MediaWiki's lead (literally
> - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1
> tables.  I need to upgrade it to a more modern data infrastructure, but I'm
> hesitant to simply switch to 'utf8' without understanding the reasons for
> this initial implementation decision.
>

utf8 uses at most three bytes per character (i.e. BMP only), so it's not a
good idea to use it. utf8mb4 should work in theory. I think the only reason
we don't use it is inertia (compatibility problems with old MySQL versions;
lack of testing with MediaWiki; difficulty of migrating huge Wikimedia
datasets).

Re: [Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread James Hare
I thought MediaWiki, by default, stored data as binary blobs, rather than
text in a particular declared encoding?

On May 2, 2017 at 10:11:38 AM, Mark Clements (HappyDog) (
gm...@kennel17.co.uk) wrote:

Hi all,

I seem to recall that a long, long time ago MediaWiki was using UTF-8
internally but storing the data in 'latin1' fields in MySQL.

I notice that there is now the option to use either 'utf8' or 'binary'
columns (via the $wgDBmysql5 setting), and the default appears to be
'binary'.[1]

I've come across an old project which followed MediaWiki's lead (literally
-
it cites MediaWiki as the reason) and stores its UTF-8 data in latin1
tables. I need to upgrade it to a more modern data infrastructure, but I'm
hesitant to simply switch to 'utf8' without understanding the reasons for
this initial implementation decision.

Can anyone confirm that MediaWiki used to behave in this manner, and if so
why?

If it was due to MySQL bugs, does anyone know in what version these were
fixed?

Finally, is current best-practice to use 'binary' or 'utf-8' for this? Why
does MediaWiki make this configurable?

I have a very good understanding of character encodings and have no
problems with performing whatever migrations are necessary - and the code
itself is fully utf-8 compliant except for the database layer - but I'm
just trying to understand the design choices (or technical limitations)
that resulted in MediaWiki handling character encodings in this manner.

- Mark Clements (HappyDog)

[1] https://www.mediawiki.org/wiki/Manual:$wgDBmysql5




[Wikitech-l] Historical use of latin1 fields in MySQL

2017-05-02 Thread Mark Clements (HappyDog)

Hi all,

I seem to recall that a long, long time ago MediaWiki was using UTF-8 
internally but storing the data in 'latin1' fields in MySQL.


I notice that there is now the option to use either 'utf8' or 'binary' 
columns (via the $wgDBmysql5 setting), and the default appears to be 
'binary'.[1]


I've come across an old project which followed MediaWiki's lead (literally - 
it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 
tables.  I need to upgrade it to a more modern data infrastructure, but I'm 
hesitant to simply switch to 'utf8' without understanding the reasons for 
this initial implementation decision.


Can anyone confirm that MediaWiki used to behave in this manner, and if so 
why?


If it was due to MySQL bugs, does anyone know in what version these were 
fixed?


Finally, is current best-practice to use 'binary' or 'utf-8' for this?  Why 
does MediaWiki make this configurable?


I have a very good understanding of character encodings and have no problems 
with performing whatever migrations are necessary - and the code itself is 
fully utf-8 compliant except for the database layer - but I'm just trying to 
understand the design choices (or technical limitations) that resulted in 
MediaWiki handling character encodings in this manner.


- Mark Clements (HappyDog)

[1] https://www.mediawiki.org/wiki/Manual:$wgDBmysql5 



