[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by RobertMuir

Apache Wiki Tue, 18 May 2010 09:35:16 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: move this stuff to language analysis.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=79&rev2=80

--------------------------------------------------

    </analyzer>
  </fieldtype>
  }}}
- 
- ==== solr.PorterStemFilterFactory ====
- 
- Creates `org.apache.lucene.analysis.PorterStemFilter`.
- 
- Standard Lucene implementation of the 
[[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm]], a 
normalization process that removes common endings from words.
- 
-   Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
- 
- Note: This filter differs very slightly from the "Porter" algorithm available in 
`solr.SnowballPorterFilter`, because it departs in a few places from the published 
algorithm.
- For more details, see the section "Points of difference from the published 
algorithm" on the [[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm page]].
- 
- <<Anchor(EnglishPorterFilter)>>
- ==== solr.EnglishPorterFilterFactory ====
- 
- Creates `solr.EnglishPorterFilter`.
- 
- Creates an 
[[http://snowball.tartarus.org/algorithms/english/stemmer.html|English Porter2 
stemmer]] from the Java classes generated from a 
[[http://snowball.tartarus.org/|Snowball]] specification.
- 
- A customized protected word list may be specified with the "protected" 
attribute in the schema. Any words in the protected word list will not be 
modified by the stemmer.
- 
- A 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/protwords.txt|sample
 Solr protwords.txt with comments]] can be found in the Source Repository.
- 
- {{{
- <fieldtype name="myfieldtype" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" 
/>
-   </analyzer>
- </fieldtype>
- }}}
- 
- 
- <<Anchor(SnowballPorterFilter)>>
- ==== solr.SnowballPorterFilterFactory ====
- 
- Creates `org.apache.lucene.analysis.SnowballPorterFilter`.
- 
- Creates an 
[[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]] 
from the Java classes generated from a 
[[http://snowball.tartarus.org/|Snowball]] specification.  The language 
attribute is used to specify the language of the stemmer.
- {{{
- <fieldtype name="myfieldtype" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.SnowballPorterFilterFactory" language="German" />
-   </analyzer>
- </fieldtype>
- }}}
- 
- Valid values for the language attribute (the factory instantiates the Snowball 
stemmer class named language + "Stemmer"):
-  * [[http://snowball.tartarus.org/algorithms/danish/stemmer.html|Danish]]
-  * [[http://snowball.tartarus.org/algorithms/dutch/stemmer.html|Dutch]]
-  * 
[[http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html|Kp]]: 
The Kraaij-Pohlmann stemming algorithm for Dutch.
-  * [[http://snowball.tartarus.org/algorithms/porter/stemmer.html|Porter]]: 
The original Porter stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/english/stemmer.html|English]]: 
The Porter2 stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/lovins/stemmer.html|Lovins]]: 
The early Lovins stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/finnish/stemmer.html|Finnish]]
-  * [[http://snowball.tartarus.org/algorithms/french/stemmer.html|French]]
-  * [[http://snowball.tartarus.org/algorithms/german/stemmer.html|German]]
-  * [[http://snowball.tartarus.org/algorithms/german2/stemmer.html|German2]]: 
A variation of the German algorithm with handling to allow ä, ö and ü to be 
represented by ae, oe and ue
-  * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
-  * 
[[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
-  * 
[[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
-  * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
-  * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
-  * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
-  * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
-  * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]
- 
- <!> Gotchas:
-  * Although the Lovins stemmer is described as faster than Porter/Porter2, 
in practice it is much slower in Solr, because it is implemented using reflection.
-  * Neither the Lovins nor the Finnish stemmer produces correct output (as of 
Solr 1.4), due to a 
[[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in 
Snowball]].
-  * The Turkish stemmer expects properly lowercased terms for correct output, 
but `LowerCaseFilterFactory` does not lowercase Turkish correctly. See 
[[https://issues.apache.org/jira/browse/LUCENE-2102|LUCENE-2102]] and 
[[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
-  * The stemmers are sensitive to diacritics. Think carefully before removing 
these with something like `ASCIIFoldingFilterFactory` before stemming, as this 
could cause unwanted results. For example, `résumé` will not be stemmed by the 
Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match 
with `resumed`, `resuming`, etc. The differences can be more pronounced for 
non-English stemmers.
- 
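The diacritics gotcha above can be illustrated with a minimal sketch of an analyzer chain that runs the stemmer before folding accents. The field type name `text_stem_then_fold` is hypothetical, and this ordering is only one option, assuming you want the stemmer to see terms with their original diacritics:

{{{
<fieldtype name="text_stem_then_fold" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stem first, so the stemmer sees "résumé" with its diacritics intact -->
    <filter class="solr.SnowballPorterFilterFactory" language="Porter"/>
    <!-- Fold accents only after stemming (assumed ordering, not a recommendation) -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldtype>
}}}

With this ordering an unaccented `resume` in the input is still stemmed to `resum`, so the accented and unaccented spellings may not match each other. It is worth checking both forms against your own data before settling on a chain.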
  
  <<Anchor(WordDelimiterFilter)>>
  ==== solr.WordDelimiterFilterFactory ====
