Hi, I tried using mwdumper (latest SVN revision 57818) to import jawiki-20090927-pages-articles.xml [1] into MySQL, but I got an error:
Data too long for column 'rev_comment'

The problem is that the XML file contains a revision comment that is 257 bytes long, but the column accepts at most 255 bytes. At first I was stumped as to how this could happen, but then I found that on the Wikipedia page, the comment ends with the byte 'e3', while in the XML file it ends with 'ef bf bd'. See [2] for details.

I think the cause is something like this:

- Comments are truncated to 255 bytes when they are stored.
- In this case, this means that a three-byte UTF-8 sequence is cut off after its first byte (hex value e3), so the comment ends with an incomplete, and therefore invalid, one-byte UTF-8 sequence.
- The dump process has to generate valid UTF-8 (otherwise, most XML parsers wouldn't accept the file), so it replaces the invalid one-byte sequence with the 'replacement character' U+FFFD, which is encoded as the three-byte UTF-8 sequence 'ef bf bd'. See [3].
- As a result, the comment grows from 255 bytes to 257 bytes (254 valid bytes plus the three-byte replacement character).

How to fix this? I think MediaWiki should make sure that a comment contains only valid UTF-8 sequences, even when it is truncated. This may mean that it has to be truncated to fewer than 255 bytes. Alternatively, the dump process could drop invalid UTF-8 sequences instead of replacing them. Yet another fix: mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary.

More details can be found at [2].

Bye,
Christopher

[1] http://download.wikimedia.org/jawiki/20090927/jawiki-20090927-pages-articles.xml.bz2
[2] http://en.wikipedia.org/wiki/User:Chrisahn/CommentTooLong
[3] http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65520

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
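P.S. To illustrate the mechanism described above, here is a minimal Python sketch. It is not MediaWiki or mwdumper code; the sample comment is made up (not the actual jawiki revision comment), and the helper names truncate_bytes and truncate_utf8 are hypothetical. It only shows how a byte-wise cut at 255 bytes followed by replacement of the invalid tail yields 257 bytes, and one possible way to truncate without splitting a character, assuming the input is valid UTF-8 to begin with.

```python
# Sketch only: made-up sample data, hypothetical helper names.
LIMIT = 255  # byte limit of the rev_comment column, as stated in the mail


def truncate_bytes(raw: bytes, limit: int = LIMIT) -> bytes:
    """Naive truncation: cut at the byte limit, possibly mid-character."""
    return raw[:limit]


def truncate_utf8(raw: bytes, limit: int = LIMIT) -> bytes:
    """Truncate to at most `limit` bytes without splitting a UTF-8 sequence.

    Assumes `raw` is valid UTF-8, so the only invalid bytes after the cut
    are the incomplete sequence at the very end, which "ignore" drops.
    """
    if len(raw) <= limit:
        return raw
    return raw[:limit].decode("utf-8", errors="ignore").encode("utf-8")


# Two ASCII bytes followed by three-byte characters, so that byte 255
# is the first byte (e3) of a three-byte sequence.
sample = ("ab" + "あ" * 100).encode("utf-8")

# Step 1: storage truncates byte-wise; the comment now ends with a lone e3.
stored = truncate_bytes(sample)

# Step 2: a dump step that must emit valid UTF-8 replaces the invalid
# tail with U+FFFD, which is encoded as ef bf bd.
dumped = stored.decode("utf-8", errors="replace").encode("utf-8")

# Alternative: truncate on a character boundary instead.
safe = truncate_utf8(sample)

print(len(stored), hex(stored[-1]))   # 255 0xe3
print(len(dumped), dumped[-3:].hex()) # 257 efbfbd
print(len(safe))                      # 254
```

The same boundary-aware truncation could be applied in any of the three places mentioned above (when the comment is stored, when the dump is written, or when mwdumper imports it); which one is appropriate is a design decision I can't make from the outside.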
