Re: Localized alphabetical order

2011-04-22 Thread Bently Preece
Thanks.  I get it now.

I meet with our language experts again on Monday.  I'll ask them about
submitting localization info to the CLDR.

Thanks again.

-Ben

On Fri, Apr 22, 2011 at 2:44 PM, Robert Muir wrote:

> On Fri, Apr 22, 2011 at 3:09 PM, Bently Preece wrote:
> > What if there is no standard localization already?  The case I'm
> > specifically interested in is Ojibwe.
> >
>
> this is standard? to sort a field with a specific locale, you have to
> tell it the locale you want. if you use the ICU implementation you get
> support for more locales, it's just that simple. The JRE has fewer
> available locales because its internationalization and localization
> support lags behind ICU.
>
> On the other hand ICU keeps current with both the Unicode Standard and
> locale data in CLDR (http://unicode.org/cldr), which is why it
> supports more.
>
> I noticed there is no locale for your language in CLDR, not even under
> development it appears (http://unicode.org/cldr/apps/survey).
>
> So if your language (Ojibwe) has special sort rules, I recommend
> making the collation rules and using a custom collator as specified
> here:
> http://wiki.apache.org/solr/UnicodeCollation#Sorting_text_with_custom_rules
>
> for your "base collator" you just need to use the root locale
> (new Locale("")) and your rules will be a delta from that.
>
> Separately, if these sort rules are well-defined/standardized for this
> language, and you get them working, you might want to then consider
> contributing them to CLDR.
>


Re: Localized alphabetical order

2011-04-22 Thread Robert Muir
On Fri, Apr 22, 2011 at 3:09 PM, Bently Preece wrote:
> What if there is no standard localization already?  The case I'm
> specifically interested in is Ojibwe.
>

this is standard? to sort a field with a specific locale, you have to
tell it the locale you want. if you use the ICU implementation you get
support for more locales, it's just that simple. The JRE has fewer
available locales because its internationalization and localization
support lags behind ICU.

On the other hand ICU keeps current with both the Unicode Standard and
locale data in CLDR (http://unicode.org/cldr), which is why it
supports more.

I noticed there is no locale for your language in CLDR, not even under
development it appears (http://unicode.org/cldr/apps/survey).

So if your language (Ojibwe) has special sort rules, I recommend
making the collation rules and using a custom collator as specified
here: 
http://wiki.apache.org/solr/UnicodeCollation#Sorting_text_with_custom_rules

for your "base collator" you just need to use the root locale
(new Locale("")) and your rules will be a delta from that.

Separately, if these sort rules are well-defined/standardized for this
language, and you get them working, you might want to then consider
contributing them to CLDR.
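
A minimal sketch of the "rules as a delta from a base collator" idea, assuming the JDK's java.text.RuleBasedCollator rather than the ICU class that the Solr integration uses. The class name TailoredCollatorSketch and the single rule "& H < '''" are hypothetical illustrations, not verified Ojibwe collation rules; the rule only moves the ASCII apostrophe so that it sorts after h/H and before i.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TailoredCollatorSketch {

    // Hypothetical tailoring: treat the apostrophe as a letter placed right
    // after H. In the rule syntax, ''' is a quoted literal apostrophe.
    static final String APOSTROPHE_RULE = "& H < '''";

    static RuleBasedCollator buildCollator() {
        // Base collator for the root locale (the JDK returns a RuleBasedCollator).
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.ROOT);
        try {
            // The tailoring is appended to the base rules, so it acts as a delta.
            return new RuleBasedCollator(base.getRules() + APOSTROPHE_RULE);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
    }

    static List<String> sortWords(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(buildCollator());
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortWords(List.of("ni", "n'", "nh")));
    }
}
```

The reset point is H rather than h because the JDK parser inserts the new entry directly after the reset character, and uppercase H follows lowercase h in the default rules. ICU rule syntax is similar but not identical, so the rule string may need adjusting there.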


Re: Localized alphabetical order

2011-04-22 Thread Bently Preece
What if there is no standard localization already?  The case I'm
specifically interested in is Ojibwe.

So should I really be researching how the JRE does localization instead of
Solr?


On Fri, Apr 22, 2011 at 2:01 PM, Robert Muir wrote:

> On Fri, Apr 22, 2011 at 2:37 PM, Bently Preece wrote:
> > Thank you.  This looks like the right direction.
> >
> > I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of
> > ICUCollationField.  So ... I'd implement a subclass of ICUCollationField,
> > and use that as the fieldtype in schema.xml.  And this means - what? - that
> > I'd also implement a custom SortField to be returned by
> > MyCollationField.getSortField(...), which would also require me to write a
> > custom FieldComparator?  Am I on the right track?
>
> no, you don't have to write any code in either case:
>
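> solr 3.1:
>
> <fieldType name="..." class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.ICUCollationKeyFilterFactory" locale="..."
>             strength="secondary"/>
>   </analyzer>
> </fieldType>
>
> solr 4.0:
>
> <fieldType name="..." class="solr.ICUCollationField" locale="..."
>            strength="secondary"/>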
>
> then just copyField or whatever to get your data in there.
>


Re: Localized alphabetical order

2011-04-22 Thread Robert Muir
On Fri, Apr 22, 2011 at 2:37 PM, Bently Preece wrote:
> Thank you.  This looks like the right direction.
>
> I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of
> ICUCollationField.  So ... I'd implement a subclass of ICUCollationField,
> and use that as the fieldtype in schema.xml.  And this means - what? - that
> I'd also implement a custom SortField to be returned by
> MyCollationField.getSortField(...), which would also require me to write a
> custom FieldComparator?  Am I on the right track?

no, you don't have to write any code in either case:

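solr 3.1:

<fieldType name="..." class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUCollationKeyFilterFactory" locale="..."
            strength="secondary"/>
  </analyzer>
</fieldType>

solr 4.0:

<fieldType name="..." class="solr.ICUCollationField" locale="..."
           strength="secondary"/>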

then just copyField or whatever to get your data in there.
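
To illustrate what a "secondary" strength setting means in practice, here is a small sketch using the JDK's java.text.Collator as a stand-in for the ICU collator that Solr would use. The class name StrengthSketch is hypothetical, and the JDK strength constants are assumed to behave analogously to the ICU setting: at secondary strength, accent differences count but case differences do not.

```java
import java.text.Collator;
import java.util.Locale;

final class StrengthSketch {

    static Collator secondaryCollator() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // SECONDARY: base letters and accents are significant, case is not.
        collator.setStrength(Collator.SECONDARY);
        // Decompose precomposed letters such as i-macron into letter + mark.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    public static void main(String[] args) {
        Collator collator = secondaryCollator();
        // Differ only by case.
        System.out.println(collator.compare("Mika", "mika"));
        // Differ only by the macron on the i (U+012B).
        System.out.println(collator.compare("mika", "m\u012Bka"));
    }
}
```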


Re: Localized alphabetical order

2011-04-22 Thread Bently Preece
Thank you.  This looks like the right direction.

I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of
ICUCollationField.  So ... I'd implement a subclass of ICUCollationField,
and use that as the fieldtype in schema.xml.  And this means - what? - that
I'd also implement a custom SortField to be returned by
MyCollationField.getSortField(...), which would also require me to write a
custom FieldComparator?  Am I on the right track?

Do you know an example of another language which has already done this sort
of thing?

Really, thanks for your help.

-Ben

On Fri, Apr 22, 2011 at 11:41 AM, Peter Keegan wrote:

> On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece wrote:
>
> > As someone who's new to Solr/Lucene, I'm having trouble finding information
> > on sorting results in localized alphabetical order. I've ineffectively
> > searched the wiki and the mail archives.
> >
> > I'm thinking for example about Hawai'ian, where mīka (with an i-macron)
> > comes after mika (i without the macron) but before miki (also without the
> > macron), or about Welsh, where the digraphs (ch, dd, etc.) are treated as
> > single letters, or about Ojibwe, where the apostrophe ' is a letter which
> > sorts between h and i.
> >
> > How do non-English languages typically handle this?
> >
> > -Ben
> >
>


Re: Localized alphabetical order

2011-04-22 Thread Peter Keegan
On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece wrote:

> As someone who's new to Solr/Lucene, I'm having trouble finding information
> on sorting results in localized alphabetical order. I've ineffectively
> searched the wiki and the mail archives.
>
> I'm thinking for example about Hawai'ian, where mīka (with an i-macron)
> comes after mika (i without the macron) but before miki (also without the
> macron), or about Welsh, where the digraphs (ch, dd, etc.) are treated as
> single letters, or about Ojibwe, where the apostrophe ' is a letter which
> sorts between h and i.
>
> How do non-English languages typically handle this?
>
> -Ben
>


Re: Localized alphabetical order

2011-04-22 Thread Robert Muir
please see http://wiki.apache.org/solr/UnicodeCollation

In general the idea is similar to how this is handled in databases:
you can index collation keys into a sort field at analysis time, then
you just do a standard Solr sort.
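
The collation-key idea can be sketched outside Solr with the JDK's java.text classes. This is only an illustration of the principle, not Solr's implementation: the class CollationKeySketch and its method names are hypothetical. Each value is turned into a binary key once ("index time"), and sorting afterwards only compares bytes, with no locale logic involved.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Locale;

final class CollationKeySketch {

    // Plain unsigned byte-wise comparison, as a binary sort on stored keys would do.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static List<String> sortByStoredKeys(List<String> values, Collator collator) {
        // "Index time": compute and store one collation key per value.
        Map<String, byte[]> storedKeys = new HashMap<>();
        for (String value : values) {
            storedKeys.put(value, collator.getCollationKey(value).toByteArray());
        }
        // "Query time": sort using only the stored bytes.
        List<String> sorted = new ArrayList<>(values);
        sorted.sort((x, y) -> compareUnsigned(storedKeys.get(x), storedKeys.get(y)));
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        System.out.println(sortByStoredKeys(List.of("miki", "m\u012Bka", "mika"), collator));
    }
}
```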

However, I am not sure if your JRE provides a "haw" Locale for the
Hawaiian language.
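
One way to check this on a given JRE is to list the locales its collator supports. The sketch below assumes the JDK's java.text.Collator; the class and method names are hypothetical, and the result for "haw" depends on the JRE in use, so no particular answer is claimed here.

```java
import java.text.Collator;
import java.util.Locale;

final class LocaleSupportSketch {

    // Returns true if the running JRE advertises collation data for the
    // given ISO language code (for example "haw" for Hawaiian).
    static boolean jreHasCollationFor(String languageCode) {
        for (Locale locale : Collator.getAvailableLocales()) {
            if (locale.getLanguage().equals(languageCode)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("haw: " + jreHasCollationFor("haw"));
        System.out.println("cy:  " + jreHasCollationFor("cy"));
    }
}
```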

Because of this, it's probably better to use the ICU collation
integration 
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUCollationKeyFilterFactory),
because ICU definitely supports this locale and has collation rules
for it.

On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece wrote:
> As someone who's new to Solr/Lucene, I'm having trouble finding information
> on sorting results in localized alphabetical order. I've ineffectively
> searched the wiki and the mail archives.
>
> I'm thinking for example about Hawai'ian, where mīka (with an i-macron)
> comes after mika (i without the macron) but before miki (also without the
> macron), or about Welsh, where the digraphs (ch, dd, etc.) are treated as
> single letters, or about Ojibwe, where the apostrophe ' is a letter which
> sorts between h and i.
>
> How do non-English languages typically handle this?
>
> -Ben
>


Localized alphabetical order

2011-04-22 Thread Ben Preece
As someone who's new to Solr/Lucene, I'm having trouble finding 
information on sorting results in localized alphabetical order. I've 
ineffectively searched the wiki and the mail archives.


I'm thinking for example about Hawai'ian, where mīka (with an i-macron) 
comes after mika (i without the macron) but before miki (also without 
the macron), or about Welsh, where the digraphs (ch, dd, etc.) are 
treated as single letters, or about Ojibwe, where the apostrophe ' is a 
letter which sorts between h and i.
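
The macron case above can be reproduced with a few lines of Java, shown here as a sketch that assumes the JDK's root-locale collator (not a Hawaiian-specific one). The class MacronOrderSketch is hypothetical. It contrasts a plain string sort, which compares raw character values, with a collator sort, which treats the macron as a secondary difference on the letter i.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class MacronOrderSketch {

    // Plain String ordering: compares UTF-16 code units, so i-macron (U+012B)
    // lands after every unaccented ASCII letter.
    static List<String> codeUnitOrder(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(null); // natural ordering of String
        return sorted;
    }

    // Collator ordering: base letters are compared first, accents second.
    static List<String> collatedOrder(List<String> words) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = List.of("miki", "m\u012Bka", "mika");
        System.out.println(codeUnitOrder(words));
        System.out.println(collatedOrder(words));
    }
}
```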


How do non-English languages typically handle this?

-Ben