Chris Snyder wrote:
> Dirty secret - MySQL latin-1 tables will happily store and retrieve
> utf-8 data. They won't sort it correctly, though I believe they will
> sort it consistently.
>
> So even if your MySQL was compiled without unicode support, you can
> put utf-8 in and get utf-8 out.
>
> Of course, if you're going to take the trouble to convert, you should
> do it right.
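Chris's point can be sketched without a MySQL server at all: a latin-1 column treats what you send as opaque 8-bit data, so UTF-8 bytes round-trip untouched, but ORDER BY compares raw bytes and so matches no human sort order. A minimal Python simulation of that behavior (illustrative only):

```python
# Simulate a MySQL latin-1 column: it stores opaque 8-bit data.
stored = "Zürich".encode("utf-8")      # client sends UTF-8 bytes
inside = stored.decode("latin-1")      # server "sees" latin-1: 'ZÃ¼rich'
returned = inside.encode("latin-1")    # the same bytes come back out

assert returned == stored              # round-trip is lossless
assert returned.decode("utf-8") == "Zürich"

# But byte-wise comparison sorts multi-byte characters after all
# ASCII letters -- consistently, as Chris says, just not correctly:
words = ["Zürich", "Zz"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_bytes)   # ['Zz', 'Zürich'] -- ü (0xC3 0xBC) lands after z (0x7A)
```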
In fact, this is a dirty secret about PHP and the "Unix Way"; to a large extent, systems that are 8-bit clean will process UTF-8 data correctly without modifications... Except when they don't.
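One quick way to see why 8-bit-clean code usually gets away with it: UTF-8 never reuses ASCII byte values inside a multi-byte character, so byte-oriented operations keyed on ASCII delimiters come out right, while anything that equates bytes with characters breaks. A small Python sketch:

```python
raw = "naïve,café,über".encode("utf-8")

# Splitting on an ASCII comma is safe: UTF-8 lead and continuation
# bytes all fall in 0x80-0xF4 and can never collide with 0x2C.
fields = raw.split(b",")
assert [f.decode("utf-8") for f in fields] == ["naïve", "café", "über"]

# ...but a byte-counting strlen() lies about character length:
print(len(raw), len(raw.decode("utf-8")))   # 18 bytes vs 15 characters
```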

Unfortunately, that's also the case with Perl, Java, .NET and other systems that have complex "Unicode support": Unicode is such a complicated thing that those systems inevitably implement it with errors, and there you're often really screwed, because you never get to see the raw byte stream.

I remember a system where the language and database were chosen because their documentation said they "supported Unicode", but in practice all kinds of strange transformations were going on behind our backs... One day I actually looked at the database in the SQL monitor and found the whole thing was double-encoded.
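That kind of double-encoding is easy to reproduce, and fortunately it's mechanically reversible as long as nothing was truncated along the way. A sketch of how it happens and how to undo it (Python here purely for illustration):

```python
text = "café"
utf8 = text.encode("utf-8")                      # b'caf\xc3\xa9'

# Some layer mistakes those UTF-8 bytes for latin-1 and re-encodes:
double = utf8.decode("latin-1").encode("utf-8")  # b'caf\xc3\x83\xc2\xa9'
print(double.decode("utf-8"))                    # 'cafÃ©' -- the mojibake in the SQL monitor

# The repair runs the mistake backwards:
repaired = double.decode("utf-8").encode("latin-1").decode("utf-8")
assert repaired == text
```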

There's also the issue that there really is no single "Unicode sort order" that entirely makes sense. For instance, languages such as German and Swedish sort the same characters in different orders. I'm currently working on a system that is predominantly English but contains many named entities with latinoid characters: an "English" sort order might well place the Polish dark L (Ł, the l with a stroke through it) after Z, but Poles sort the dark L right after the clear L, and most English speakers expect that too, since we commonly squash dark L -> clear L in words like "Stanislaw."

Japanese is harder still: named entities sort phonetically, which means you need to keep a furigana (phonetic) representation side by side with the conventional kanji representation... In this age of statistical machine translation, I think kanji -> furigana conversion could be largely automated, but there are always words that can't be read phonetically out of context, like

"read"
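To make the Polish example concrete: a default code-point sort really does file Ł (U+0141) after Z (U+005A), while even a crude accent-stripping key (my own approximation here, not a real collator -- production systems should use ICU or locale-aware collation) files it with L:

```python
import unicodedata

def naive_polish_key(s: str) -> str:
    # Ł/ł have no NFD decomposition, so map them by hand...
    s = s.replace("Ł", "L").replace("ł", "l")
    # ...then strip combining accents: ó -> o, ź -> z, etc.
    s = "".join(c for c in unicodedata.normalize("NFD", s)
                if not unicodedata.combining(c))
    return s.casefold()

cities = ["Zakopane", "Łódź", "Lublin"]
print(sorted(cities))                        # Ł lands after Z: ['Lublin', 'Zakopane', 'Łódź']
print(sorted(cities, key=naive_polish_key))  # Ł files with L: ['Łódź', 'Lublin', 'Zakopane']
```

This key loses real Polish distinctions (ó properly sorts between o and p, for example), which is exactly the point: every shortcut encodes one language's expectations.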

The "real" character encoding that you find web documents in is most closely described as ISO-latin-1 and UTF-8 interspersed at random, no matter what a document's official charset claims. There are just too many cases where characters come in through form fields and other sources that aren't well controlled. Yes, ~you~ should publish good clean UTF-8, but if you're scraping on a large scale you'll find lots of crazy stuff that doesn't quite match what's in the books... and it's helpful to look at the byte stream in those cases.
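One pragmatic trick when scraping is to attempt a strict UTF-8 decode first and fall back to latin-1, since strict UTF-8 validation rejects stray latin-1 high bytes almost immediately, while latin-1 decoding accepts any byte. A rough sketch (the function name is mine):

```python
def sniff_decode(raw: bytes) -> str:
    """Best-effort decode for wild web data: UTF-8 if it validates,
    otherwise treat the bytes as latin-1 (which never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

assert sniff_decode("café".encode("utf-8")) == "café"    # UTF-8 validates
assert sniff_decode("café".encode("latin-1")) == "café"  # b'caf\xe9' fails UTF-8, falls back
```

This won't untangle a page where the two encodings are genuinely mixed within one document; for that you really do have to look at the byte stream, chunk by chunk.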
_______________________________________________
New York PHP Users Group Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk

http://www.nyphp.org/Show-Participation
