On Thu, May 6, 2010 at 1:26 PM, Paul A Houle <p...@devonianfarm.com> wrote:
> There's also the issue that there really is no "Unicode Sort Order" that
> entirely makes sense. For instance, languages such as German and Swedish
> sort the same characters in a different order. I'm currently working on a
> system that is predominantly English but contains many named entity names
> with latinoid characters: the sort order for "English" might well sort the
> Polish "Dark L" (the l with a line through it) after Z, but Poles sort
> "Dark L" after "Clear L" and most en-speakers will expect that too, since
> we commonly squash Dark L -> Clear L in words like "Stanislaw." Japanese
> people sort named entities phonetically, which means you need to keep a
> furigana (phonetic) representation side by side with the conventional kanji
> representation... In this age of statistical language translation, I think
> kanji -> furigana translation could be largely automated, but there are
> always 'words' that can't be read phonetically out of context

Oh great, locale-specific sorting. To what extent are there Unix tools to
help you deal with that?

When it comes to internationalization, it's always something. At least
with UTF-8 the encoding can be consistent if you control the inputs (i.e.,
you're not scraping or accepting mishmash from other systems).

_______________________________________________
New York PHP Users Group Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk

http://www.nyphp.org/Show-Participation
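
To make the "Dark L" point from the quoted message concrete, here is a rough sketch of the difference between sorting by raw code point and sorting by a language-specific alphabet. Everything in it is an assumption for illustration: the names `POLISH_ALPHABET`, `RANK`, and `polish_key` are made up, the sample names are invented, and a hand-rolled key like this only stands in for what a real collation library or locale-aware tool would do.

```python
# Hypothetical illustration only: a hand-rolled sort key for Polish.
# The Polish alphabet in its conventional order; note that "ł" comes
# directly after "l" rather than after "z".
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def polish_key(word):
    """Build a sort key from each letter's position in the Polish alphabet.

    Characters outside the alphabet (q, v, x, digits, punctuation) get a
    rank past the end, then fall back to code point order among themselves.
    """
    return [(RANK.get(ch, len(POLISH_ALPHABET)), ch) for ch in word.lower()]


names = ["Zając", "Łukasz", "Lech", "Marek"]

# Plain sorted() compares code points, so "Ł" (U+0141) lands after "Z".
by_codepoint = sorted(names)

# With the alphabet-based key, "Ł" sorts right after "L".
by_polish = sorted(names, key=polish_key)

print(by_codepoint)
print(by_polish)
```

The same list comes out in two different orders depending on which rule is applied, which is the gist of the complaint above: the "right" order depends on whose expectations you are sorting for, not on the encoding.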