Re: Localized alphabetical order
Thanks. I get it now. I meet with our language experts again on Monday. I'll ask them about submitting localization info to the CLDR.

Thanks again.

-Ben

On Fri, Apr 22, 2011 at 2:44 PM, Robert Muir wrote:
> [...]
> Separately, if these sort rules are well-defined/standardized for this
> language, and you get them working, you might want to then consider
> contributing them to CLDR.
Re: Localized alphabetical order
On Fri, Apr 22, 2011 at 3:09 PM, Bently Preece wrote:
> What if there is no standard localization already? The case I'm
> specifically interested in is Ojibwe.

This is standard? To sort a field with a specific locale, you have to tell it the locale you want. If you use the ICU implementation you get support for more locales; it's just that simple. The JRE has fewer available locales because its internationalization and localization support lags behind ICU.

ICU, on the other hand, keeps current with both the Unicode Standard and the locale data in CLDR (http://unicode.org/cldr), which is why it supports more.

I noticed there is no locale for your language in CLDR, and apparently not even one under development (http://unicode.org/cldr/apps/survey).

So if your language (Ojibwe) has special sort rules, I recommend writing the collation rules and using a custom collator as specified here:
http://wiki.apache.org/solr/UnicodeCollation#Sorting_text_with_custom_rules

For your "base collator" you just need to use "new Locale()", and your rules will be a delta from that.

Separately, if these sort rules are well-defined/standardized for this language, and you get them working, you might then want to consider contributing them to CLDR.
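The "rules as a delta from a base collator" idea can be sketched with the JDK's own java.text.RuleBasedCollator. This is not the ICU class the wiki page uses, and the two rule syntaxes are similar but not identical. The class and method names below are made up for illustration, Locale.ROOT is assumed as the base locale, and the single rule is only a guess based on the description of the Ojibwe apostrophe earlier in this thread, not a vetted Ojibwe tailoring.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Solr, Lucene, or ICU.
class OjibweCollationSketch {

    // Assumed tailoring: treat the apostrophe as a letter between h and i.
    // In java.text rule syntax a literal apostrophe is written as three
    // apostrophes (quote, character, quote). The reset is on "H" so the new
    // letter is placed after both case variants of h.
    static final String TAILORING = "& H < '''";

    static RuleBasedCollator build() {
        // Base collator: the root-locale rules shipped with the JDK.
        RuleBasedCollator base =
                (RuleBasedCollator) Collator.getInstance(Locale.ROOT);
        try {
            // The custom rules are appended as a delta to the base rules.
            return new RuleBasedCollator(base.getRules() + TAILORING);
        } catch (ParseException e) {
            throw new IllegalStateException("Bad collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator c = build();
        // Negative results mean the left string sorts first.
        System.out.println("h vs ': " + c.compare("h", "'"));
        System.out.println("' vs i: " + c.compare("'", "i"));
    }
}
```

For Solr itself, the wiki page linked above describes building the rules with ICU and pointing the collation factory at them; the sketch only shows the tailoring idea in isolation.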
Re: Localized alphabetical order
What if there is no standard localization already? The case I'm specifically interested in is Ojibwe.

So should I really be researching how the JRE does localization instead of Solr?

On Fri, Apr 22, 2011 at 2:01 PM, Robert Muir wrote:
> [...]
> no, you don't have to write any code in either case:
>
> solr 3.1: [schema.xml snippet lost in archiving; only the fragment strength="secondary"/> survives]
>
> solr 4.0: [schema.xml snippet lost in archiving; only the fragment strength="secondary"/> survives]
>
> then just copyField or whatever to get your data in there.
Re: Localized alphabetical order
On Fri, Apr 22, 2011 at 2:37 PM, Bently Preece wrote:
> Thank you. This looks like the right direction.
>
> I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of
> ICUCollationField. So ... I'd implement a subclass of ICUCollationField,
> and use that as the fieldtype in schema.xml. And this means - what? - that
> I'd also implement a custom SortField to be returned by
> MyCollationField.getSortField(...), which would also require me to write a
> custom FieldComparator? Am I on the right track?

No, you don't have to write any code in either case:

solr 3.1: [schema.xml snippet lost in archiving; only the fragment strength="secondary"/> survives]

solr 4.0: [schema.xml snippet lost in archiving; only the fragment strength="secondary"/> survives]

Then just copyField or whatever to get your data in there.
Re: Localized alphabetical order
Thank you. This looks like the right direction.

I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of ICUCollationField. So ... I'd implement a subclass of ICUCollationField, and use that as the fieldtype in schema.xml. And this means - what? - that I'd also implement a custom SortField to be returned by MyCollationField.getSortField(...), which would also require me to write a custom FieldComparator? Am I on the right track?

Do you know an example of another language which has already done this sort of thing?

Really, thanks for your help.

-Ben

On Fri, Apr 22, 2011 at 11:41 AM, Peter Keegan wrote:
> On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece wrote:
> > [...]
> > How do non-English languages typically handle this?
Re: Localized alphabetical order
Please see http://wiki.apache.org/solr/UnicodeCollation

In general the idea is similar to how this is handled in databases: you index collation keys into a sort field at analysis time, then you just do a standard Solr sort.

However, I am not sure whether your JRE provides a "haw" Locale for the Hawaiian language. Because of this, it's probably better to use the ICU collation integration (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUCollationKeyFilterFactory), because ICU definitely supports this locale and has collation rules for it.

On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece wrote:
> [...]
> How do non-English languages typically handle this?
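To make the collation-key idea from this reply concrete, here is a small sketch using the JDK's java.text.Collator. The Solr collation filters do the equivalent at index time; this only illustrates the principle. The class name, helper methods, and sample words are assumptions for illustration, and the root locale is used because the JRE may not ship a "haw" locale. Canonical decomposition is switched on so that the precomposed i-macron is compared as i plus a macron.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Solr or Lucene.
class CollationKeySketch {

    // Root-locale collator that distinguishes accents but ignores case.
    static Collator rootCollator() {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(Collator.SECONDARY);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    // Turn each word into a collation key, sort the keys, and read the
    // original strings back. Comparing keys is what a precomputed sort
    // field does, instead of running the collator for every comparison.
    static List<String> sortByKey(Collator collator, List<String> words) {
        List<CollationKey> keys = new ArrayList<>();
        for (String w : words) {
            keys.add(collator.getCollationKey(w));
        }
        Collections.sort(keys); // CollationKey is Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey k : keys) {
            sorted.add(k.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // "m\u012Bka" is mika with an i-macron, as in the original question.
        List<String> words = Arrays.asList("miki", "m\u012Bka", "mika");
        System.out.println(sortByKey(rootCollator(), words));
    }
}
```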
Localized alphabetical order
As someone who's new to Solr/Lucene, I'm having trouble finding information on sorting results in localized alphabetical order. I've ineffectively searched the wiki and the mail archives.

I'm thinking for example about Hawaiian, where mīka (with an i-macron) comes after mika (i without the macron) but before miki (also without the macron), or about Welsh, where the digraphs (ch, dd, etc.) are treated as single letters, or about Ojibwe, where the apostrophe ' is a letter which sorts between h and i.

How do non-English languages typically handle this?

-Ben