Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: add romanian/turkish, with turkish gotcha, and provide an example for diacritics.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=75&rev2=76

--------------------------------------------------
   * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
   * [[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
   * [[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
+  * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
   * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
   * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
   * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
+  * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]

  <!> Gotchas:
   * Although the Lovins stemmer is described as faster than Porter/Porter2, practically it is much slower in Solr, as it is implemented using reflection.
   * Neither the Lovins nor the Finnish stemmer produce correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]]
-  * The Non-English stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results.
+  * The Turkish stemmer expects properly lowercased terms for correct output, but `LowerCaseFilterFactory` does not lowercase turkish correctly. See [[https://issues.apache.org/jira/browse/LUCENE-2102|LUCENE-2102]] and [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
+  * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match with `resumed`, `resuming`, etc. The differences can be more profound for non-english stemmers.
+ <<Anchor(WordDelimiterFilter)>>
  ==== solr.WordDelimiterFilterFactory ====
