Hi,

I tried using mwdumper (latest SVN revision 57818)
to import jawiki-20090927-pages-articles.xml [1]
into MySQL, but I got an error:

Data too long for column 'rev_comment'

The problem is that the xml file contains a revision
comment that is 257 bytes long, but the column
accepts at most 255 bytes.

At first I was stumped as to how this could happen,
but then I found that on the Wikipedia page, the
comment ends with the byte 'e3', while in the
xml file it ends with 'ef bf bd'. See [2] for details.

I think the cause is something like this:

- Comments are truncated to 255 bytes when they
are stored.

- In this case, this means that a three-byte UTF-8
sequence is cut off after its first byte (hex value e3),
so the comment ends with an invalid one-byte UTF-8
sequence.

- The dump process has to generate valid UTF-8
(otherwise, most XML parsers wouldn't accept
the file), so it replaces the invalid one-byte UTF-8
sequence with the 'replacement character' U+FFFD,
which has the three-byte UTF-8 sequence 'ef bf bd'.
See [3].

- In this case, the comment grows from 255 bytes
to 257 bytes.
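
The steps above can be reproduced with a minimal Python sketch. The
sample comment is made up (it is not the actual jawiki comment), and
the byte slicing only stands in for whatever MediaWiki and the dump
process really do:

```python
# Hypothetical comment: 2 ASCII bytes plus 85 three-byte characters
# (U+3042, UTF-8 'e3 81 82'), 257 bytes in total.
comment = ("ab" + "\u3042" * 85).encode("utf-8")

# Assumed storage step: a plain cut after 255 bytes, which splits the
# last character after its first byte (e3).
stored = comment[:255]

# Assumed dump step: invalid sequences are replaced by U+FFFD,
# which encodes as 'ef bf bd'.
dumped = stored.decode("utf-8", errors="replace").encode("utf-8")

print(len(stored), hex(stored[-1]))      # 255 0xe3
print(len(dumped), dumped[-3:].hex())    # 257 efbfbd
```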

How can this be fixed? I think MediaWiki should make sure
that a comment contains only valid UTF-8 sequences,
even when it is truncated. This may mean that it
has to be truncated to fewer than 255 bytes.
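
As a rough sketch of such a truncation (in Python rather than
MediaWiki's PHP; the function name truncate_utf8 is hypothetical, and
the input is assumed to be valid UTF-8 to begin with), the cut can be
moved back to the nearest character boundary:

```python
def truncate_utf8(data: bytes, limit: int = 255) -> bytes:
    """Cut valid UTF-8 data to at most `limit` bytes without splitting a character."""
    if len(data) <= limit:
        return data
    end = limit
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    # character, so move back until the cut sits in front of a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

# Made-up sample: 2 ASCII bytes plus 85 three-byte characters (257 bytes).
sample = ("ab" + "\u3042" * 85).encode("utf-8")
short = truncate_utf8(sample)
print(len(short))  # 254
```

With this sample the result is 254 bytes instead of 255, and it still
decodes cleanly as UTF-8.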

Alternatively, the dump process could drop invalid
UTF-8 sequences instead of replacing them.
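
In Python terms this is the difference between the "replace" and
"ignore" error handlers when decoding. The snippet below is only an
illustration with a made-up comment, not the dump code itself:

```python
# Made-up stored comment: 254 valid bytes followed by a lone lead byte e3.
stored = ("ab" + "\u3042" * 84).encode("utf-8") + b"\xe3"

# Replacing the invalid byte with U+FFFD makes the comment longer ...
replaced = stored.decode("utf-8", errors="replace").encode("utf-8")
# ... while dropping it keeps the comment within the 255-byte limit.
dropped = stored.decode("utf-8", errors="ignore").encode("utf-8")

print(len(stored), len(replaced), len(dropped))  # 255 257 254
```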

Yet another fix: mwdumper should make sure
that a comment is at most 255 bytes long and
truncate it if necessary.

More details can be found at [2].

Bye,
Christopher

[1] http://download.wikimedia.org/jawiki/20090927/jawiki-20090927-pages-articles.xml.bz2
[2] http://en.wikipedia.org/wiki/User:Chrisahn/CommentTooLong
[3] http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65520

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
