[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by RobertMuir

Apache Wiki Thu, 24 Feb 2011 22:25:18 -0800

The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: add docs for icu analysis factories.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=109&rev2=110

--------------------------------------------------

      </analyzer>
    </fieldType>
  }}}
+ 
+ === solr.ICUTokenizerFactory ===
+ <!> [[Solr3.1]] Uses [[http://site.icu-project.org/|ICU]]'s text boundary analysis capabilities to tokenize text.
+ 
+ This tokenizer first identifies the writing system (the "Script") for each run of text within the document. It then tokenizes
+ the text according to rules or dictionaries, depending on the writing system. For example, if it encounters
+ Thai, it applies dictionary-based segmentation to split the Thai text (Thai does not put spaces between words).
+ 
+ ||'''Input String'''||'''Output Tokens'''||'''Script Attribute'''||
+ ||Testing บริษัทชื่อ נאסק"ר||Testing<<BR>>บริษัท<<BR>>ชื่อ<<BR>>נאסק"ר||Latin<<BR>>Thai<<BR>>Thai<<BR>>Hebrew||
+ 
+ {{{
+     <fieldType name="text_icu" class="solr.TextField" autoGeneratePhraseQueries="false">
+       <analyzer>
+         <tokenizer class="solr.ICUTokenizerFactory"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ Note: to use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
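
The script-identification step described for solr.ICUTokenizerFactory can be roughly illustrated outside Solr. The sketch below is not ICU's algorithm: it uses Python's standard library only, guesses the script from the first word of each character's Unicode name (an assumption made for illustration, since the standard library does not expose the Script property), and does no dictionary-based segmentation. The helper names `script_of` and `script_runs` are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Crude script guess: the first word of the character's Unicode name.

    Returns None for non-letters (marks, punctuation, digits), which then
    inherit the script of the run they appear in.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0].capitalize() if name else None


def script_runs(text):
    """Split text on whitespace and on script changes.

    Returns a list of (run_text, script) pairs.
    """
    runs = []
    current, script = "", None
    for ch in text:
        if ch.isspace():
            if current:
                runs.append((current, script))
            current, script = "", None
            continue
        s = script_of(ch)
        # A letter from a different script closes the current run.
        if s is not None and script is not None and s != script:
            runs.append((current, script))
            current, script = "", None
        current += ch
        script = script or s
    if current:
        runs.append((current, script))
    return runs


if __name__ == "__main__":
    for run, script in script_runs('Testing บริษัทชื่อ נאסק"ר'):
        print(script, run)
```

Unlike the ICU tokenizer, this sketch leaves the Thai run as a single piece, because splitting it into words requires a dictionary.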
  
  == TokenFilterFactories ==
  
@@ -699, +719 @@

  <<Anchor(CollationKeyFilterFactory)>>
  
  === solr.CollationKeyFilterFactory ===
- <!> [[Solr1.5]]
+ <!> [[Solr3.1]]
  
  A filter that lets one specify:
  
@@ -715, +735 @@

   1. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/CollationKeyFilter.html|Lucene's CollationKeyFilter javadocs]]
   1. UnicodeCollation
  
+ === solr.ICUCollationKeyFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter works like CollationKeyFilterFactory, except that it uses ICU for collation. This produces smaller and faster sort keys, and it supports more locales. See UnicodeCollation for more information; the same concepts apply.
+ 
+ The only configuration difference is that locales must be specified to this filter as RFC 3066 locale IDs.
+ 
+ {{{
+     <fieldType name="icu_sort_en" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.KeywordTokenizerFactory"/>
+         <filter class="solr.ICUCollationKeyFilterFactory" locale="en" strength="primary"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUNormalizer2FilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter normalizes text to a [[http://unicode.org/reports/tr15/|Unicode Normalization Form]].
+ 
+ {{{
+     <fieldType name="normalized" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ These are the supported normalization forms: 
+ {{{
+ NFC: name="nfc" mode="compose"
+ NFD: name="nfc" mode="decompose"
+ NFKC: name="nfkc" mode="compose"
+ NFKD: name="nfkc" mode="decompose"
+ NFKC_Casefold: name="nfkc_cf" mode="compose"
+ }}}
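
To see what the four standard forms listed above do to a string, here is a small sketch using Python's standard `unicodedata` module rather than ICU. It only illustrates the Unicode forms themselves and is not how the Solr filter is implemented; the helper name `forms` is hypothetical.

```python
import unicodedata


def forms(text):
    """Return the text in each of the four standard normalization forms."""
    return {name: unicodedata.normalize(name, text)
            for name in ("NFC", "NFD", "NFKC", "NFKD")}


if __name__ == "__main__":
    # U+00E9 is a precomposed e-acute; U+FB01 is the "fi" ligature.
    for sample in ("\u00e9", "\ufb01"):
        for name, value in forms(sample).items():
            print(name, [hex(ord(c)) for c in value])
```

The canonical forms (NFC, NFD) only compose or decompose accents, while the compatibility forms (NFKC, NFKD) also replace characters such as ligatures with their plain equivalents.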
+ 
+ NFKC_Casefold (nfkc_cf) applies the Unicode Case-Folding algorithm in conjunction with NFKC normalization. Unicode Case-Folding is more than lowercasing: for example, it handles cases like ß/SS. Behind the scenes this is its own form (nfkc_cf), in which both algorithms have been recursively computed across all of Unicode offline, so that it is an efficient single-pass algorithm.
+ For practical purposes, this means you can use this factory with nfkc_cf as a better substitute for the combined behavior of LowerCaseFilter and NFKC normalization.
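
The difference between lowercasing and case folding can be checked with Python's standard library. The sketch below only approximates nfkc_cf by chaining NFKC normalization and `str.casefold()`; the real NFKC_Casefold form is a single precomputed mapping and also handles details this approximation ignores. The helper name `approx_nfkc_casefold` is hypothetical.

```python
import unicodedata


def approx_nfkc_casefold(text):
    """Approximate NFKC_Casefold: NFKC, then case folding, then NFKC again.

    The second normalization pass is there because case folding can
    produce text that is no longer in normalized form.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


if __name__ == "__main__":
    for word in ("Stra\u00dfe", "STRASSE"):
        print(word, "->", word.lower(), "/", approx_nfkc_casefold(word))
```

With plain lowercasing the two spellings stay different, whereas case folding maps both to the same string, which is what makes them match at search time.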
+ 
+ If you want to do more advanced normalization (e.g. apply a filter to work 
only on a subset of Unicode), see the javadocs.
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUFoldingFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter implements a custom Unicode normalization form that applies the foldings specified in [[http://www.unicode.org/reports/tr30/tr30-4.html|UTR#30]] in addition to NFKC_Casefold.
+ 
+ {{{
+     <fieldType name="folded" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUFoldingFilterFactory"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ This means NFKC normalization, Unicode case folding, and search term folding (removing accents, etc.) have been recursively computed across all of Unicode offline, so that it is an efficient single pass through the string.
+ For practical purposes, this means you can use this factory as a better substitute for the combined behavior of ASCIIFoldingFilter, LowerCaseFilter, and ICUNormalizer2Filter.
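
As a rough picture of what "search term folding" means, the following sketch strips accents and folds case with Python's standard library. It is only an approximation made for illustration: the foldings in UTR#30 cover much more than combining marks, and the ICU filter does all of this in one precomputed pass rather than in three steps. The helper name `approx_fold` is hypothetical.

```python
import unicodedata


def approx_fold(text):
    """Approximate search folding: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() is non-zero for combining marks such as accents.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFKC", stripped.casefold())


if __name__ == "__main__":
    for word in ("R\u00e9sum\u00e9", "RESUME", "Stra\u00dfe"):
        print(word, "->", approx_fold(word))
```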
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUTransformFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter applies [[http://userguide.icu-project.org/transforms/general|ICU Transforms]] to text.
+ 
+ Currently the filter supports only System transforms (or compounds consisting of them); custom rulesets are not yet supported.
+ 
+ {{{
+     <fieldType name="transformed" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ You can see a list of the supported System transforms by going to [[http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html|this link]], clicking the drop-down, and scrolling down to System.
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+
