Chris Snyder wrote:
> Dirty secret - MySQL latin-1 tables will happily store and retrieve
> utf-8 data. They won't sort it correctly, though I believe they will
> sort it consistently.
>
> So even if your MySQL was compiled without unicode support, you can
> put utf-8 in and get utf-8 out.
>
> Of course, if you're going to take the trouble to convert, you should
> do it right.
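Chris's point can be sketched without a MySQL server at all: a latin-1 column treats what you send as opaque 8-bit data, so UTF-8 bytes round-trip untouched, but ORDER BY compares raw bytes and so matches no human sort order. A minimal Python simulation of that behavior (illustrative only):

```python
# Simulate a MySQL latin-1 column: it stores opaque 8-bit data.
stored = "Zürich".encode("utf-8")      # client sends UTF-8 bytes
inside = stored.decode("latin-1")      # server "sees" latin-1: 'ZÃ¼rich'
returned = inside.encode("latin-1")    # the same bytes come back out

assert returned == stored              # round-trip is lossless
assert returned.decode("utf-8") == "Zürich"

# But byte-wise comparison sorts multi-byte characters after all
# ASCII letters -- consistently, as Chris says, just not correctly:
words = ["Zürich", "Zz"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_bytes)   # ['Zz', 'Zürich'] -- ü (0xC3 0xBC) lands after z (0x7A)
```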
In fact, this is a dirty secret about PHP and the "Unix Way"; to a large extent, systems that are 8-bit clean will process UTF-8 data correctly without modifications... Except when they don't.
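One quick way to see why 8-bit-clean code usually gets away with it: UTF-8 never reuses ASCII byte values inside a multi-byte character, so byte-oriented operations keyed on ASCII delimiters come out right, while anything that equates bytes with characters breaks. A small Python sketch:

```python
raw = "naïve,café,über".encode("utf-8")

# Splitting on an ASCII comma is safe: UTF-8 lead and continuation
# bytes all fall in 0x80-0xF4 and can never collide with 0x2C.
fields = raw.split(b",")
assert [f.decode("utf-8") for f in fields] == ["naïve", "café", "über"]

# ...but a byte-counting strlen() lies about character length:
print(len(raw), len(raw.decode("utf-8")))   # 18 bytes vs 15 characters
```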

Unfortunately, that's also the case with Perl, Java, .NET and other systems that have complex "Unicode support": Unicode is such a complicated thing that those systems inevitably implement it with errors, and there you're often really screwed, because you never get to see the raw byte stream.

I remember a system where the language and database were chosen because their documentation said they "supported Unicode", but in practice all kinds of strange transformations were going on behind our backs... One day I actually looked at the database in the SQL monitor and found the whole thing was double-encoded.
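That kind of double-encoding is easy to reproduce, and fortunately it's mechanically reversible as long as nothing was truncated along the way. A sketch of how it happens and how to undo it (Python here purely for illustration):

```python
text = "café"
utf8 = text.encode("utf-8")                      # b'caf\xc3\xa9'

# Some layer mistakes those UTF-8 bytes for latin-1 and re-encodes:
double = utf8.decode("latin-1").encode("utf-8")  # b'caf\xc3\x83\xc2\xa9'
print(double.decode("utf-8"))                    # 'cafÃ©' -- the mojibake in the SQL monitor

# The repair runs the mistake backwards:
repaired = double.decode("utf-8").encode("latin-1").decode("utf-8")
assert repaired == text
```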

There's also the issue that there really is no single "Unicode sort order" that entirely makes sense. For instance, languages such as German and Swedish sort the same characters in different orders. I'm currently working on a system that is predominantly English but contains many named entities with latinoid characters: an "English" sort order might well place the Polish dark L (Ł, the l with a stroke through it) after Z, but Poles sort the dark L right after the clear L, and most English speakers expect that too, since we commonly squash dark L -> clear L in words like "Stanislaw."

Japanese is harder still: named entities sort phonetically, which means you need to keep a furigana (phonetic) representation side by side with the conventional kanji representation... In this age of statistical machine translation, I think kanji -> furigana conversion could be largely automated, but there are always words that can't be read phonetically out of context, like

"read"
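To make the Polish example concrete: a default code-point sort really does file Ł (U+0141) after Z (U+005A), while even a crude accent-stripping key (my own approximation here, not a real collator -- production systems should use ICU or locale-aware collation) files it with L:

```python
import unicodedata

def naive_polish_key(s: str) -> str:
    # Ł/ł have no NFD decomposition, so map them by hand...
    s = s.replace("Ł", "L").replace("ł", "l")
    # ...then strip combining accents: ó -> o, ź -> z, etc.
    s = "".join(c for c in unicodedata.normalize("NFD", s)
                if not unicodedata.combining(c))
    return s.casefold()

cities = ["Zakopane", "Łódź", "Lublin"]
print(sorted(cities))                        # Ł lands after Z: ['Lublin', 'Zakopane', 'Łódź']
print(sorted(cities, key=naive_polish_key))  # Ł files with L: ['Łódź', 'Lublin', 'Zakopane']
```

This key loses real Polish distinctions (ó properly sorts between o and p, for example), which is exactly the point: every shortcut encodes one language's expectations.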

The "real" character encoding that you find web documents in is most closely described as ISO-latin-1 and UTF-8 interspersed at random, no matter what a document's official charset claims. There are just too many cases where characters come in through form fields and other sources that aren't well controlled. Yes, ~you~ should publish good clean UTF-8, but if you're scraping on a large scale you'll find lots of crazy stuff that doesn't quite match what's in the books... and it's helpful to look at the byte stream in those cases.
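One pragmatic trick when scraping is to attempt a strict UTF-8 decode first and fall back to latin-1, since strict UTF-8 validation rejects stray latin-1 high bytes almost immediately, while latin-1 decoding accepts any byte. A rough sketch (the function name is mine):

```python
def sniff_decode(raw: bytes) -> str:
    """Best-effort decode for wild web data: UTF-8 if it validates,
    otherwise treat the bytes as latin-1 (which never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

assert sniff_decode("café".encode("utf-8")) == "café"    # UTF-8 validates
assert sniff_decode("café".encode("latin-1")) == "café"  # b'caf\xe9' fails UTF-8, falls back
```

This won't untangle a page where the two encodings are genuinely mixed within one document; for that you really do have to look at the byte stream, chunk by chunk.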
_______________________________________________
New York PHP Users Group Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk

http://www.nyphp.org/Show-Participation
