[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by RobertMuir

Apache Wiki Fri, 05 Feb 2010 07:48:27 -0800

The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: add romanian/turkish, with turkish gotcha, and 
provide an example for diacritics.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=75&rev2=76

--------------------------------------------------

   * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
   * [[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
   * [[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
+  * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
   * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
   * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
   * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
+  * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]
  
  <!> Gotchas:
   * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it is much slower in Solr because it is implemented using reflection.
   * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
-  * The Non-English stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results.
+  * The Turkish stemmer expects properly lowercased terms for correct output, but `LowerCaseFilterFactory` does not lowercase Turkish correctly. See [[https://issues.apache.org/jira/browse/LUCENE-2102|LUCENE-2102]] and [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
+  * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match `resumed`, `resuming`, etc. The differences can be more pronounced for non-English stemmers.
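
The Turkish lowercasing gotcha can be illustrated outside Solr. The following is a minimal Python sketch, not Solr or Lucene code: `lowercase_default` and `lowercase_turkish` are hypothetical helper names, and the Turkish variant only assumes the two dotted/dotless I mappings described in the background link.

```python
def lowercase_default(term: str) -> str:
    """Locale-independent lowercasing, similar in spirit to a generic lowercase filter."""
    return term.lower()


def lowercase_turkish(term: str) -> str:
    """Hypothetical helper: apply the two Turkish-specific I mappings, then lowercase.

    U+0130 (capital I with dot above) maps to plain "i",
    and plain capital "I" maps to U+0131 (dotless i).
    """
    return term.replace("\u0130", "i").replace("I", "\u0131").lower()


word = "I\u015eIK"  # Turkish word for "light", written with dotless capital I
print(lowercase_default(word))  # generic rule turns I into dotted i
print(lowercase_turkish(word))  # Turkish rule turns I into dotless i (U+0131)
```

The two results differ in every position where a capital `I` appeared, so a stemmer that expects Turkish-lowercased input would see a different term from the one the generic rule produces.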
+ 
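
The diacritics gotcha can be sketched the same way. This Python example is an illustration under stated assumptions: `fold_diacritics` only approximates accent removal via Unicode decomposition (it is not `ASCIIFoldingFilterFactory`), and `toy_stem` is a hypothetical three-suffix stripper standing in for a real stemmer, chosen only so that it reproduces the `resum` outputs mentioned above.

```python
import unicodedata


def fold_diacritics(term: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a stemmer: strips one of three ASCII suffixes.

    This is NOT the Porter algorithm; it only mimics the outputs in the example.
    """
    for suffix in ("ing", "ed", "e"):
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term


accented = "r\u00e9sum\u00e9"  # the accented noun from the example

# Stemming first: the accented final letter is not an ASCII "e", so nothing is stripped.
print(toy_stem(accented))

# Folding first: the noun becomes "resume" and is then conflated with the verb forms.
print(toy_stem(fold_diacritics(accented)))
print(toy_stem("resumed"), toy_stem("resuming"))
```

With folding placed before stemming, the noun and the verb forms end up as the same indexed term, which is the unwanted match the gotcha warns about.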
  
  <<Anchor(WordDelimiterFilter)>>
  ==== solr.WordDelimiterFilterFactory ====
