[Solr Wiki] Update of "UnicodeCollation" by RobertMuir

Apache Wiki Wed, 02 Dec 2009 20:23:04 -0800

The "UnicodeCollation" page has been changed by RobertMuir.
http://wiki.apache.org/solr/UnicodeCollation

--------------------------------------------------

New page:
= Unicode Collation =
<!> [[Solr1.5]]

== Overview ==
[[http://en.wikipedia.org/wiki/Unicode_collation_algorithm|Unicode Collation]] 
is a method to sort text in a language-sensitive way. It is primarily intended 
for sorting, but can also be used for advanced search purposes.

Unicode Collation in Solr is fast because all the work is done at index time. 
For more information, see the 
[[http://lucene.apache.org/solr/api/org/apache/solr/analysis/CollationKeyFilterFactory.html|Javadocs]].

<<TableOfContents>>

== Sorting text for a specific language ==
In the example below, text will be sorted according to the default German rules 
provided by Java. The rules for sorting German in Java are defined in a package 
called a Java Locale.

Locales are typically defined as a combination of language and country, but you 
can specify just the language if you want. For example, if you specify "de" as 
the language, you will get sorting that works well for the German language. If 
you specify "de" as the language and "CH" as the country, you will get German 
sorting specifically tailored for Switzerland.
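
To make the language/country combination concrete, here is a minimal sketch in plain Java of how such a pair maps onto a Locale and a Collator. The class name and the `collatorFor` helper are hypothetical and exist only for illustration; this is not how Solr itself is implemented.

```java
import java.text.Collator;
import java.util.Locale;

class LocaleCollators {
    // Hypothetical helper: builds a Collator from a language code
    // and an optional country code (pass null for "language only").
    static Collator collatorFor(String language, String country) {
        Locale locale = (country == null)
                ? new Locale(language)
                : new Locale(language, country);
        return Collator.getInstance(locale);
    }

    public static void main(String[] args) {
        Collator german = collatorFor("de", null);
        Collator swissGerman = collatorFor("de", "CH");

        // Plain String comparison uses raw character codes,
        // so the uppercase "Zug" sorts before the lowercase "apfel".
        System.out.println("apfel".compareTo("Zug") > 0);

        // A collator compares the letters themselves,
        // so "apfel" sorts before "Zug" in both locales.
        System.out.println(german.compare("apfel", "Zug") < 0);
        System.out.println(swissGerman.compare("apfel", "Zug") < 0);
    }
}
```

The same two values are what you would put in the filter's language and country settings.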

You can see a list of supported Locales 
[[http://java.sun.com/j2se/1.5.0/docs/guide/intl/locale.doc.html#util-text|here]].

{{{
<!-- define a field type for German collation -->
<fieldType name="collatedGERMAN" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language="de"
        strength="primary"
    />
  </analyzer>
</fieldType>
...
<!-- define a field to store the German collated manufacturer names -->
<field name="manuGERMAN" type="collatedGERMAN" indexed="true" stored="false" />
...
<!-- copy the text to this field. we could create French, English, Spanish 
versions too, and sort differently for different users! -->
<copyField source="manu" dest="manuGERMAN"/>
}}}
In the example above, the strength is set to "primary". The strength of the 
collation determines how "picky" the sort order will be, and its exact effect 
depends upon the language. For example, in English, "primary" strength ignores 
differences in case and accents.

For more information, see the 
[[http://java.sun.com/j2se/1.5.0/docs/api/java/text/Collator.html|Collator javadocs]].
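
The effect of strength is easiest to see by comparing a few strings directly. The sketch below uses the JDK Collator for English. The class name and the `compareAt` helper are made up for this example, and canonical decomposition is switched on as an assumption so that accented characters are handled consistently.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthDemo {
    // Hypothetical helper: compares two strings with an English
    // collator set to the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String accented = "R\u00e9sum\u00e9"; // "Resume" with acute accents

        // PRIMARY: only the base letters matter.
        System.out.println(compareAt(Collator.PRIMARY, "resume", accented) == 0);

        // SECONDARY: accents matter, case still does not.
        System.out.println(compareAt(Collator.SECONDARY, "resume", "RESUME") == 0);
        System.out.println(compareAt(Collator.SECONDARY, "resume", accented) != 0);

        // TERTIARY: case matters as well.
        System.out.println(compareAt(Collator.TERTIARY, "resume", "Resume") != 0);
    }
}
```

For a sort field, "primary" is usually the forgiving choice; pick a higher strength only if users expect case or accents to affect the order.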

== Sorting text for multiple languages ==
There are two approaches to supporting multiple languages:

 * If there are only a few languages, consider defining a collated field for each language and using copyField.
 * If there are very many languages, an alternative is to use the "Unicode default" collator.

The Unicode default, or "ROOT" Locale, has rules that are designed to work well 
in general for most languages. To use it, simply define the language as the 
empty string.

This Unicode default sort is still significantly more advanced than the 
standard Solr sort, which orders strings by their raw character codes.

{{{
<fieldType name="collatedROOT" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language=""
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}
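As a rough illustration of what the root collator buys you, the sketch below sorts a small mixed-language list twice: once with the natural String order and once with a JDK collator built from an empty language code. The class and method names are hypothetical, and the primary strength and canonical decomposition settings are assumptions chosen to mirror the field type above.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class RootCollatorSort {
    // Sorts a copy of the list with the root-locale collator.
    static List<String> sortWithRootCollator(List<String> words) {
        Collator root = Collator.getInstance(new Locale("")); // empty language code
        root.setStrength(Collator.PRIMARY);
        root.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> sorted = new ArrayList<String>(words);
        Collections.sort(sorted, root);
        return sorted;
    }

    // Sorts a copy of the list by raw character codes.
    static List<String> sortNaturally(List<String> words) {
        List<String> sorted = new ArrayList<String>(words);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // "Zebra", "apple", German "Aerger" with A-umlaut, French "ecole" with e-acute
        List<String> words = Arrays.asList("Zebra", "apple", "\u00c4rger", "\u00e9cole");
        System.out.println(sortNaturally(words));        // uppercase and accents misplaced
        System.out.println(sortWithRootCollator(words)); // alphabetical by base letter
    }
}
```
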
== Sorting text with custom rules ==
For advanced usage, you can define your own set of rules that determine how the 
sorting takes place. It's easiest not to start from scratch, but instead to take 
existing rules that are close to what you want, and "tailor" or customize them.

In the example below, we create a custom ruleset for German known as DIN 
5007-2. This ruleset treats German umlauts differently: for example, it 
treats ö as equivalent to oe.

For more information, see the 
[[http://java.sun.com/j2se/1.5.0/docs/api/java/text/RuleBasedCollator.html|RuleBasedCollator javadocs]].

The example code below shows how to create a custom ruleset and dump it to a 
file.

{{{
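import java.io.FileOutputStream;
import java.io.OutputStream;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;
import org.apache.commons.io.IOUtils; // from Apache Commons IO

// the wrapper class name is arbitrary
public class DumpCustomRules {
  public static void main(String[] args) throws Exception {
    // get the default rules for Germany
    // these are called DIN 5007-1 sorting
    RuleBasedCollator baseCollator =
        (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));

    // define some tailorings, to make it DIN 5007-2 sorting.
    // For example, this makes ö equivalent to oe
    String DIN5007_2_tailorings =
        "& ae , a\u0308 & AE , A\u0308"
      + "& oe , o\u0308 & OE , O\u0308"
      + "& ue , u\u0308 & UE , U\u0308";

    // append the tailorings to the default rules, and dump the result to a String
    RuleBasedCollator tailoredCollator =
        new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
    String tailoredRules = tailoredCollator.getRules();

    // write these to a file, be sure to use UTF-8 encoding!!!
    OutputStream out = new FileOutputStream("/solr_home/conf/customRules.dat");
    try {
      IOUtils.write(tailoredRules, out, "UTF-8");
    } finally {
      out.close(); // always release the file handle
    }
  }
}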
}}}
This file of rules can now be used for custom collation in Solr.

{{{
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}
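Because the rules file must be UTF-8, it can be worth checking that the rules survive a write/read round trip before pointing Solr at the file. The sketch below does this with only the JDK (no Commons IO). The class name, the `roundTrip` helper, and the temporary file are all hypothetical; the final comparison is printed rather than asserted, since its result depends on the rules your JDK ships.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

class RulesRoundTrip {
    // Writes the rules to a temporary file as UTF-8, reads them back,
    // and returns what was read.
    static String roundTrip(String rules) {
        try {
            Path file = Files.createTempFile("customRules", ".dat");
            try {
                Files.write(file, rules.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        RuleBasedCollator base =
            (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));
        String tailorings =
            "& ae , a\u0308 & AE , A\u0308"
          + "& oe , o\u0308 & OE , O\u0308"
          + "& ue , u\u0308 & UE , U\u0308";
        String rules = base.getRules() + tailorings;

        String reloaded = roundTrip(rules);
        System.out.println("rules unchanged: " + rules.equals(reloaded));

        // Rebuild a collator from the reloaded rules and compare two words.
        RuleBasedCollator collator = new RuleBasedCollator(reloaded);
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.compare("T\u00f6ne", "toene"));
    }
}
```
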
== Searching ==
For advanced use cases, Collation can be used for search as well, on a 
tokenized field.

In the example below, we use the same custom German rules defined above on a 
tokenized field. As with a stemmer, the output tokens are not human-readable, 
but equivalent input tokens produce identical output tokens, so they match 
for search purposes.

{{{
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}

Below is an example of what this would look like for two words that should 
match with this collator: Töne and toene.

'''org.apache.solr.analysis.StandardTokenizerFactory'''
||<tablewidth="" tableclass="analysis"style="text-align: center;" |1>term position ||<class="debugdata">1 ||<class="debugdata">2 ||
||<style="text-align: center;" |1>term text ||<class="debugdata">Töne ||<class="debugdata">toene ||
||<style="text-align: center;" |1>term type ||<class="debugdata"><ALPHANUM> ||<class="debugdata"><ALPHANUM> ||
||<style="text-align: center;" |1>source start,end ||<class="debugdata">0,4 ||<class="debugdata">5,10 ||
||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||

'''org.apache.solr.analysis.CollationKeyFilterFactory   {strength=primary, custom=customRules.dat}'''
||<tablewidth="" tableclass="analysis"style="text-align: center;" |1>term position ||<class="debugdata">1 ||<class="debugdata">2 ||
||<style="text-align: center;" |1>term text ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||
||<style="text-align: center;" |1>term type ||<class="debugdata"><ALPHANUM> ||<class="debugdata"><ALPHANUM> ||
||<style="text-align: center;" |1>source start,end ||<class="debugdata">0,4 ||<class="debugdata">5,10 ||
||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||

Please note that the strange output you see from the filter is really a binary 
collation key encoded in a special form.
What is important is that it is the same value for equivalent tokens as defined 
by that collator.
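
The same idea can be seen outside Solr with the JDK's own collation keys. This is only a sketch: it uses the stock English collator rather than the custom German rules, the class and helper names are hypothetical, and the raw key bytes are not the encoded form that the filter writes to the index.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeyDemo {
    // Hypothetical helper: returns the binary collation key of the text
    // at primary strength, using the English collator.
    static byte[] primaryKey(String text) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.getCollationKey(text).toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = primaryKey("resume");
        byte[] accented = primaryKey("R\u00e9sum\u00e9"); // with acute accents
        byte[] other = primaryKey("resumes");

        // Equivalent strings produce identical keys...
        System.out.println(Arrays.equals(plain, accented));
        // ...while a different word produces a different key.
        System.out.println(Arrays.equals(plain, other));
    }
}
```

The bytes themselves are meaningless to a reader; only their equality and their relative order matter.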
