Re: Unicode::Collate question
Sadahiro Tomoyuki wrote:

>> So I guess I need a Lingua::XX::Sort module for each language I operate
>> on; in my original posting I was misled to believe that Unicode::Collate
>> would be the tool to use.
>>
>> Thanks to all for the useful links provided in this thread.
>
> As far as I found, CPAN provides at least five modules
> for collation localized for a specific natural language:
> [package name, language name, encoding]
>
> No::Sort, Norwegian, ISO-8859-1
> http://search.cpan.org/~gaas/Norge-1.07/
>
> Cz::Sort, Czech, ISO-8859-2
> http://search.cpan.org/~janpaz/Cstools-3.42/
>
> Lingua::Klingon::Collate, Klingon, ASCII/EBCDIC (Perl native)
> http://search.cpan.org/~pne/Lingua-Klingon-Collate-1.01/
>
> Lingua::JA::Sort::JIS, Japanese, UTF-8
> http://search.cpan.org/~sadahiro/Lingua-JA-Sort-JIS-0.04/
>
> ShiftJIS::Collate, Japanese, Shift-JIS
> http://search.cpan.org/~sadahiro/ShiftJIS-Collate-1.02/
>
> Regards,
> SADAHIRO Tomoyuki

Has anyone had a look at the OpenI18N/ICU locale data? The locales there are all UTF-8 and have Java rule-based collation data, so they *might* be useful for creating a more comprehensive (and accurate) set of sort modules.

The downside is that this data is pretty rough ATM, but it does seem to be improving slowly.

I guess Perl 6 is going to use ICU as the basis for I18N - sure hope the APIs are easier though :)

Cheers
--
Rich
[EMAIL PROTECTED]
Re: perlunicode comment - when Unicode does not happen
Jarkko Hietaniemi wrote:

> Incidentally, if anyone is interested in helping in getting a new locale
> standard (one can never have too many :-), the CLDR project can always
> use extra eyeballs. CLDR? Common Locale Data Repository:
> http://oss.software.ibm.com/cvs/icu/~checkout~/locale/CLDR_status.html

I've spent a bit of time building a locale framework around the CLDR data, but it's not ready yet. The CLDR stuff is definitely useful, but patchy and downright wrong in parts - it's the best freely available data out there ATM though.

Good to see the ICU stuff going into Perl 6 - hope it will be relatively easy to use though - I hate the APIs!

Anyway, I hope to get something out early(ish) in the new year if anyone is interested - I've already talked to a couple of folks, but more would always be useful!

Cheers,
--
Rich
[EMAIL PROTECTED]
How to use Unicode::Collate in multilinguage apps?
Hello

How should collation be handled in multitasking, multilingual applications - in particular forking servers such as apache/mod_perl based web apps?

I can assume the following:

1) I'll know the preferred language via an RFC 2616 language tag.
2) All data will be UTF-8 encoded Unicode.
3) The required language may differ for each request.

I guess Unicode::Collate is the way to go, so can I simply have one Unicode::Collate instance per process using the default allkeys.txt table file? Will that give sensible results for most (all?) languages, or do I need to customise the collator on the fly when more 'exotic' (for want of a better word) languages are requested? Are there other reasons, such as size and/or performance issues, why the default allkeys.txt file may not be the way to go?

I must stress that I'm ok with most aspects of i18n/l10n - it's specifically the correct use of Unicode::Collate in multitasking apps that I'm interested in.

Suggestions would be welcome - even more so if they don't involve having to know the TR10 docs inside out!

Cheers,
--
Rich
[EMAIL PROTECTED]
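For reference, the "one Unicode::Collate instance per process" setup asked about above might look like the minimal sketch below. It uses only Unicode::Collate->new and its sort method with the default table; the sample strings are made up for illustration. It shows what the untailored default ordering does, not whether that ordering is acceptable for any particular language.

```perl
use strict;
use warnings;
use utf8;                       # this source contains literal non-ASCII characters
use Unicode::Collate;

# One collator per process, built once from the default table (allkeys.txt).
# Construction is the expensive step, so it belongs outside the request loop.
my $collator = Unicode::Collate->new;

# Hypothetical sample data; a real server would sort decoded request data.
my @words  = ('f', 'é', 'B', 'a', 'e', 'b', 'A');
my @sorted = $collator->sort(@words);

# A plain codepoint sort, for comparison.
my @by_codepoint = sort @words;

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', @sorted), "\n";          # a A b B e é f
print join(' ', @by_codepoint), "\n";    # A B a b e f é
```

The default table already handles case and accents more sensibly than a codepoint sort, but it knows nothing about language-specific rules, which is the gap the replies in this thread address.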
Re: How to use Unicode::Collate in multilinguage apps?
Sadahiro Tomoyuki wrote:

> I write Unicode::Collate::Locale (tentatively) for linguistic tailoring
> of UCA. To use it, Unicode::Collate should search allkeys.txt
> from any directories in @INC (at present it searches table files
> only under the directory where it is located.)
> So Unicode::Collate::Locale should require Unicode::Collate 0.40 or later,
> which is not released yet, but a prerelease is available as shown below.
>
> [tarball]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale-0.01.tar.gz
> [doc]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale.html
>    Sorry, now tailoring of only a few languages is implemented.
>    It may be enhanced sooner or later...
>
> [prerelease] This will be released *after* Perl 5.8.4 (or its RC) is out.
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-0.40.tar.gz

Thank you and Jarkko for your replies. I now realise that some per-language tailoring would be needed for sensible results.

Unicode::Collate::Locale seems like the kind of thing I was looking for, and any tailoring is better than none :)

Using the multi-lingual server scenario I was initially discussing, would one of the following usages be correct (yes, it's just pseudocode and exists in a world where no errors ever occur!):

1)

    my %collators;

    for ( $server_loop )
    {
        my $lang_tag = Server->requested_lang_tag;

        # Build a collator the first time a language is seen, then reuse it.
        my $collator = $collators{$lang_tag}
            ||= Unicode::Collate::Locale->new(locale => $lang_tag);

        ...
    }

2)

    my $prev_lang = '';
    my $collator;

    for ( $server_loop )
    {
        my $lang_tag = Server->requested_lang_tag;

        # Rebuild the collator only when the language changes between requests.
        unless ( $lang_tag eq $prev_lang )
        {
            $prev_lang = $lang_tag;
            $collator  = Unicode::Collate::Locale->new(locale => $lang_tag);
        }

        ...
    }

Which would be the preferred way of handling this (or are both wrong)?

Again, thanks for your replies.
--
Rich
[EMAIL PROTECTED]
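Option 1 above can be turned into something runnable along the following lines. This is only a sketch, and it assumes the constructor interface of the Unicode::Collate::Locale that was later released (new(locale => ...)), which may differ from the 0.01 prerelease discussed here. The list of language tags stands in for the server loop and is purely hypothetical.

```perl
use strict;
use warnings;
use utf8;                       # this source contains literal non-ASCII characters
use Unicode::Collate::Locale;

my %collators;    # language tag => collator, filled lazily within one process

# Return the cached collator for a tag, building it on first use.
sub collator_for {
    my ($lang_tag) = @_;
    return $collators{$lang_tag}
        ||= Unicode::Collate::Locale->new(locale => $lang_tag);
}

# Hypothetical stand-in for the server loop: one language tag per "request".
my @names = ('z', 'ö', 'a');
my %result;
for my $lang_tag (qw(sv en sv)) {
    my $collator = collator_for($lang_tag);
    $result{$lang_tag} = join ' ', $collator->sort(@names);
}

binmode STDOUT, ':encoding(UTF-8)';
print "sv: $result{sv}\n";    # sv: a z ö   (Swedish puts ö after z)
print "en: $result{en}\n";    # en: a ö z   (no tailoring, ö sorts with o)
```

Compared with option 2, the hash avoids rebuilding a collator whenever consecutive requests alternate between languages, at the cost of keeping one object per requested tag alive in each process.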
Re: How to use Unicode::Collate in multilinguage apps?
Sadahiro Tomoyuki wrote:

> On Mon, 29 Mar 2004 23:44:00 +0100
> Rich <[EMAIL PROTECTED]> wrote:
>
>> Using the multi-lingual server scenario I was initially discussing, would
>> one of the following usages be correct (yes, it's just pseudocode and
>> exists in a world where no errors ever occur!):
>
> Though I have not worked with any multitasking application,
> I suppose a possible snag is the size of DUCET (the file named
> allkeys.txt), which should cause slow construction of
> a collator and large memory use for storage.

Yes, the size of allkeys.txt is an issue - I did a data dump of a Unicode::Collate instance and it's pretty big!

>> 1)
>>
>> my %collators;
>>
>> for ( $server_loop )
>> {
>>    my $lang_tag = Server->requested_lang_tag;
>>
>>    my $collator = $collators{$lang_tag}
>>       ||= Unicode::Collate::Locale->new(locale => $lang_tag);
>>
>>    ...
>> }
>
> 1) creates a new collator if the $lang_tag value is new.
> Say the old one was 'en' (English) and the new one was 'it'
> (Italian): Unicode::Collate::Locale->new will return a default collator
> each time. I.e. $collators{en} and $collators{it} work the same, but memory
> is not shared.

Good point!

> When Unicode::Collate->new is called, all the data generated by parsing
> a table file are stored in the collator, which is a blessed hash.
> The reason, as I thought, is that if (a part of) the newly created data
> were stored in other places, say, in a cache in the package namespace
> (e.g. something like %Unicode::Collate::Cache), it might cause some
> problem with users outside the package handling memory in the cache.
>
> I think perhaps it should be possible for a user to determine
> whether two (or more) $lang_tag values create the same collator or not.
>
> my $lang_tag = Server->requested_lang_tag;
> my $canonical = Unicode::Collate::Locale::canonical_name($lang_tag);
>
> # if $canonical is same as an old one, the collator for it should be
> # same. After seeing if $canonical is new, a collator can be created.
> # The function name leaves room for reconsideration.

Yes, makes sense, but I'm starting to wonder if Unicode::Collate is too heavyweight a solution. Perhaps something based around Sort::ArbBiLex might produce good enough results for most languages.

Thanks for the reply
--
Rich
[EMAIL PROTECTED]
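The idea of sharing one collator between tags that resolve to the same tailoring could be sketched as follows. Note that canonical_name is only a proposal in the message above; this sketch instead assumes the getlocale method of the later released Unicode::Collate::Locale, which reports the tailoring actually used ('default' when there is none). Unlike the proposed function, it has to build a collator once per new tag before it can tell whether that tag duplicates an existing one, so it saves memory rather than construction time. The tags used are hypothetical examples.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);
use Unicode::Collate::Locale;

my %by_tag;          # requested tag  => collator
my %by_canonical;    # tailoring name => collator (one object per tailoring)

sub collator_for {
    my ($lang_tag) = @_;
    return $by_tag{$lang_tag} if $by_tag{$lang_tag};

    # Build once to learn which tailoring this tag resolves to.
    my $fresh     = Unicode::Collate::Locale->new(locale => $lang_tag);
    my $canonical = $fresh->getlocale;    # 'default' when no tailoring exists

    # Keep the first collator seen for each tailoring; later duplicates
    # go out of scope and are freed.
    my $shared = $by_canonical{$canonical} ||= $fresh;
    return $by_tag{$lang_tag} = $shared;
}

my $en = collator_for('en');
my $it = collator_for('it');
my $sv = collator_for('sv');

printf "en -> %s, it -> %s, sv -> %s\n",
    $en->getlocale, $it->getlocale, $sv->getlocale;
printf "en and it share one object: %s\n",
    (refaddr($en) == refaddr($it) ? 'yes' : 'no');
```

Whether 'en' and 'it' actually end up sharing an object depends on which tailorings the installed module version provides, so the script reports it rather than assuming it.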