Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: move this stuff to language analysis.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=79&rev2=80

--------------------------------------------------

  </analyzer>
  </fieldtype>
  }}}
-
- ==== solr.PorterStemFilterFactory ====
-
- Creates `org.apache.lucene.analysis.PorterStemFilter`.
-
- Standard Lucene implementation of the [[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm]], a normalization process that removes common endings from words.
-
- Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
-
- Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`, in that it deviates slightly from the published algorithm.
- For more details, see the section "Points of difference from the published algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].
-
- <<Anchor(EnglishPorterFilter)>>
- ==== solr.EnglishPorterFilterFactory ====
-
- Creates `solr.EnglishPorterFilter`.
-
- Creates an [[http://snowball.tartarus.org/algorithms/english/stemmer.html|English Porter2 stemmer]] from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification.
-
- A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by the stemmer.
-
- A [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/protwords.txt|sample Solr protwords.txt with comments]] can be found in the Source Repository.
-
- {{{
- <fieldtype name="myfieldtype" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
-   </analyzer>
- </fieldtype>
- }}}
-
-
- <<Anchor(SnowballPorterFilter)>>
- ==== solr.SnowballPorterFilterFactory ====
-
- Creates `org.apache.lucene.analysis.SnowballPorterFilter`.
-
- Creates an [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]] from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification. The language attribute is used to specify the language of the stemmer.
- {{{
- <fieldtype name="myfieldtype" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.SnowballPorterFilterFactory" language="German" />
-   </analyzer>
- </fieldtype>
- }}}
-
- Valid values for the language attribute (creates the snowball stemmer class language + "Stemmer"):
-  * [[http://snowball.tartarus.org/algorithms/danish/stemmer.html|Danish]]
-  * [[http://snowball.tartarus.org/algorithms/dutch/stemmer.html|Dutch]]
-  * [[http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html|Kp]]: The Kraaij-Pohlmann stemming algorithm for Dutch.
-  * [[http://snowball.tartarus.org/algorithms/porter/stemmer.html|Porter]]: The original Porter stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/english/stemmer.html|English]]: The Porter2 stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/lovins/stemmer.html|Lovins]]: The early Lovins stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/finnish/stemmer.html|Finnish]]
-  * [[http://snowball.tartarus.org/algorithms/french/stemmer.html|French]]
-  * [[http://snowball.tartarus.org/algorithms/german/stemmer.html|German]]
-  * [[http://snowball.tartarus.org/algorithms/german2/stemmer.html|German2]]: A variation of the German algorithm with handling to allow ä, ö and ü to be represented by ae, oe and ue.
-  * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
-  * [[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
-  * [[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
-  * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
-  * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
-  * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
-  * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
-  * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]
-
- <!> Gotchas:
-  * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it is much slower in Solr, as it is implemented using reflection.
-  * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
-  * The Turkish stemmer expects properly lowercased terms for correct output, but `LowerCaseFilterFactory` does not lowercase Turkish correctly. See [[https://issues.apache.org/jira/browse/LUCENE-2102|LUCENE-2102]] and [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
-  * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match with `resumed`, `resuming`, etc. The differences can be more profound for non-English stemmers.
-
  <<Anchor(WordDelimiterFilter)>>
  ==== solr.WordDelimiterFilterFactory ====
