Re: Unicode::Collate question

2003-12-04 Thread Rich
Sadahiro Tomoyuki wrote:

> 
>> So I guess I need a Lingua::XX::Sort module for each language I operate
>> on; in my original posting I was misled to believe that Unicode::Collate
>> would be the tool to use.
>> 
>> Thanks to all for the useful links provided in this thread.
> 
> As far as I found, CPAN provides at least five modules
> for collation localized for a specific natural language:
> [package name, language name, encoding]
> 
> No::Sort, Norwegian, ISO-8859-1
> http://search.cpan.org/~gaas/Norge-1.07/
> 
> Cz::Sort, Czech, ISO-8859-2
> http://search.cpan.org/~janpaz/Cstools-3.42/
> 
> Lingua::Klingon::Collate, Klingon, ASCII/EBCDIC (Perl native)
> http://search.cpan.org/~pne/Lingua-Klingon-Collate-1.01/
> 
> Lingua::JA::Sort::JIS, Japanese, UTF-8
> http://search.cpan.org/~sadahiro/Lingua-JA-Sort-JIS-0.04/
> 
> ShiftJIS::Collate, Japanese, Shift-JIS
> http://search.cpan.org/~sadahiro/ShiftJIS-Collate-1.02/
> 
> Regards,
> SADAHIRO Tomoyuki

Has anyone had a look at the OpenI18N/ICU locale data?

The locales there are all UTF-8 and have Java rule-based collation data, so
they *might* be useful for creating a more comprehensive (and accurate) set
of sort modules. The downside is that this data is pretty rough ATM, but it
does seem to be improving slowly.

I guess p6 is going to use ICU as the basis for I18N - sure hope the APIs
are easier though :)

Cheers
-- 
Rich
[EMAIL PROTECTED]


Re: perlunicode comment - when Unicode does not happen

2003-12-23 Thread Rich
Jarkko Hietaniemi wrote:

> Incidentally, if anyone is interested in helping in getting a new locale
> standard (one can never have too many :-), the CLDR project can always
> use extra eyeballs.  CLDR?  Common Locale Data Repository:
> http://oss.software.ibm.com/cvs/icu/~checkout~/locale/CLDR_status.html

I've spent a bit of time building a locale framework around the CLDR data,
but it's not ready yet. The CLDR stuff is definitely useful, but patchy and
downright wrong in parts - it's the best freely available data out there ATM
though. Good to see the ICU stuff going into P6 - hope it will be relatively
easy to use though - I hate the APIs!

Anyway, hope to get something out early(ish) in the new year if anyone is
interested - I've already talked to a couple of folks, but more would
always be useful!

Cheers,
-- 
Rich
[EMAIL PROTECTED]


How to use Unicode::Collate in multilinguage apps?

2004-03-26 Thread Rich
Hello

How should collation be handled in multitasking, multilingual applications -
in particular forking servers such as apache/mod_perl based web apps?

I can assume the following:

1) I'll know the preferred language via an RFC 2616 language tag.
2) All data will be utf8 encoded Unicode.
3) The required language may differ for each request.

I guess Unicode::Collate is the way to go, so can I simply have one
Unicode::Collate instance per process using the default allkeys.txt table
file? 

Will that give sensible results for most (all?) languages, or do I need to
customise the collator on the fly when more 'exotic' (for want of a better
word) languages are requested? Are there other reasons, such as size and/or
performance issues, why the default allkeys.txt file may not be the way to
go?

I must stress that I'm ok with most aspects of i18n/l10n - it's specifically
the correct use of Unicode::Collate in multitasking apps that I'm
interested in.

Suggestions would be welcome - even more so if they don't involve having to
know the TR10 docs inside out!

Cheers,
-- 
Rich
[EMAIL PROTECTED]


Re: How to use Unicode::Collate in multilinguage apps?

2004-03-30 Thread Rich
Sadahiro Tomoyuki wrote:



> I have written Unicode::Collate::Locale (tentatively) for linguistic
> tailoring of UCA. To use it, Unicode::Collate should search for allkeys.txt
> in any of the directories in @INC (at present it searches for table files
> only under the directory where it is installed).
> So Unicode::Collate::Locale will require Unicode::Collate 0.40 or later,
> which is not released yet, but a prerelease is available as shown below.
> 
> [tarball]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale-0.01.tar.gz
> [doc]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale.html
> Sorry, tailoring is implemented for only a few languages so far.
> It may be enhanced sooner or later...
> 
> [prerelease] This will be released *after* Perl 5.8.4 (or its RC) is out.
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-0.40.tar.gz

Thank you and Jarkko for your replies.

I now realise that some per-language tailoring would be needed for sensible
results. Unicode::Collate::Locale seems like the kind of thing I was
looking for, and any tailoring is better than none :)
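
To illustrate why tailoring matters, here is a minimal sketch comparing the
default collator with a Swedish one. It assumes the interface of
Unicode::Collate::Locale as later released alongside Unicode::Collate (the
"locale" constructor argument), which may differ from the 0.01 prerelease
discussed here; the word list is made up.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

# Made-up sample words: "\x{F6}l" is "ol" with a diaeresis on the o.
my @words = ("zon", "\x{F6}l", "ost");

# Default collator (no tailoring) versus a Swedish-tailored one.
# Assumption: the released Unicode::Collate::Locale accepts locale => 'sv'.
my $default = Unicode::Collate->new;
my $swedish = Unicode::Collate::Locale->new(locale => 'sv');

my @by_default = $default->sort(@words);
my @by_swedish = $swedish->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @by_default\n";
print "swedish: @by_swedish\n";
```

With the default table the o-diaeresis word is ordered alongside plain o,
whereas Swedish tailoring treats that letter as a separate one after z.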

Using the multi-lingual server scenario I was initially discussing, would
one of the following usages be correct (yes, it's just pseudocode and
exists in a world where no errors ever occur!):

1)

 my %collators;

 for ( $server_loop )
 {
   my $lang_tag = Server->requested_lang_tag;

   # Build a collator the first time a tag is seen, then reuse it.
   my $collator = $collators{$lang_tag}
     ||= Unicode::Collate::Locale->new(locale => $lang_tag);

   ...
 }


2)

  my $prev_lang = '';
  my $collator;

  for ( $server_loop )
  {
    my $lang_tag = Server->requested_lang_tag;

    # Rebuild the collator only when the language changes between requests.
    unless ( $lang_tag eq $prev_lang )
    {
      $prev_lang = $lang_tag;
      $collator  = Unicode::Collate::Locale->new(locale => $lang_tag);
    }

    ...
  }


Which would be the preferred way of handling this (or are both wrong)?

Again, thanks for your replies.
-- 
Rich
[EMAIL PROTECTED]


Re: How to use Unicode::Collate in multilinguage apps?

2004-03-31 Thread Rich
Sadahiro Tomoyuki wrote:

> On Mon, 29 Mar 2004 23:44:00 +0100
> Rich <[EMAIL PROTECTED]> wrote:
> 
>> Using the multi-lingual server scenario I was initially discussing, would
>> one of the following usages be correct (yes, it's just pseudocode and
>> exists in a world where no errors ever occur!):
> 
> Though I have not worked with any multitasking application,
> I suppose a possible snag is the size of DUCET (the file named
> allkeys.txt), which makes constructing a collator slow and
> storing it memory-hungry.

Yes, the size of allkeys.txt is an issue - I did a Data dump of a
Unicode::Collate instance and it's pretty big!
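
A rough way to put a number on the construction cost is to time a single
call to the constructor. This is only a sketch: the figure depends heavily
on the Unicode::Collate version and the machine, so no particular result is
implied.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use Unicode::Collate;

# Time one construction of the default collator (default table, no options).
my $start    = [gettimeofday];
my $collator = Unicode::Collate->new;
my $elapsed  = tv_interval($start);

printf "Unicode::Collate->new took %.3f seconds\n", $elapsed;
```

If the cost turns out to matter, constructing the collator once at startup,
before the server forks, is one option worth measuring.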

>> 1)
>> 
>>  my %collators;
>> 
>>  for ( $server_loop )
>>  {
>>    my $lang_tag = Server->requested_lang_tag;
>> 
>>    my $collator = $collators{$lang_tag}
>>      ||= Unicode::Collate::Locale->new(locale => $lang_tag);
>> 
>>    ...
>>  }
> 
> 1) creates a new collator whenever the $lang_tag value is new.
> Say the old one was 'en' (English) and the new one is 'it'
> (Italian): Unicode::Collate::Locale->new will return a default collator
> each time. I.e. $collators{en} and $collators{it} behave the same, but
> their memory is not shared.

Good point!

> When Unicode::Collate->new is called, all the data generated by parsing
> a table file are stored in the collator, which is a blessed hash.
> The reason, as I thought, is that if (a part of) the newly created data
> were stored elsewhere, say, in a cache in the package namespace
> (e.g. something like %Unicode::Collate::Cache), users outside the package
> might have trouble managing the memory held by that cache.
> 
> I think perhaps it should be possible for a user to determine
> whether two (or more) $lang_tag values create the same collator or not.
> 
> my $lang_tag = Server->requested_lang_tag;
> my $canonical = Unicode::Collate::Locale::canonical_name($lang_tag);
> 
> # if $canonical is same as an old one, the collator for it should be
> # same. After seeing if $canonical is new, a collator can be created.
> # The function name leaves room for reconsideration.
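
The idea above can be sketched as a two-level cache. The canonical_name
function is only a proposal in this thread, so the sketch substitutes the
getlocale method found in later released versions of
Unicode::Collate::Locale; that substitution, and the helper name
collator_for, are assumptions. Because getlocale is an object method, a
collator still has to be built once per new tag before it can be compared.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my %by_tag;     # requested tag         => collator
my %by_locale;  # effective locale name => collator shared by all its tags

# Hypothetical helper: return a cached collator for a language tag.
sub collator_for {
    my ($tag) = @_;
    return $by_tag{$tag} ||= do {
        my $fresh = Unicode::Collate::Locale->new(locale => $tag);
        # Assumption: getlocale reports the tailoring actually in effect.
        my $name  = $fresh->getlocale;
        # Keep the first collator seen for this name; $fresh is dropped
        # when an equivalent one is already cached.
        $by_locale{$name} ||= $fresh;
    };
}

for my $tag (qw(en it sv)) {
    printf "%s -> %s\n", $tag, collator_for($tag)->getlocale;
}
```

Tags that resolve to the same effective locale end up pointing at one
shared object, so the memory for the table is held once per distinct
tailoring, not once per tag.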

Yes, makes sense, but I'm starting to wonder if Unicode::Collate is too
heavyweight a solution. Perhaps something based around Sort::ArbBiLex might
produce good enough results for most languages.

Thanks for the reply
-- 
Rich
[EMAIL PROTECTED]