[Solr Wiki] Update of "LanguageAnalysis" by RobertMuir

Apache Wiki Tue, 18 May 2010 09:34:02 -0700

The "LanguageAnalysis" page has been changed by RobertMuir.
The comment on this change is: first cut at improving this documentation.
http://wiki.apache.org/solr/LanguageAnalysis

--------------------------------------------------

New page:
= Language Analysis =

== Overview ==

This page describes some of the language-specific analysis components available 
in Solr. These components can be used to improve search results for specific 
languages.

Please look at 
[[AnalyzersTokenizersTokenFilters|AnalyzersTokenizersTokenFilters]] for other 
analysis components you can use in combination with these components.

<<TableOfContents>>

=== By language ===
==== Arabic ====
Solr provides support for the 
[[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]] stemming 
algorithm, and Lucene includes an example stopword list.

This algorithm defines both character normalization and stemming, so these are 
split into two filters to provide more flexibility.

{{{
...
  <filter class="solr.ArabicNormalizationFilterFactory"/>
  <filter class="solr.ArabicStemFilterFactory"/>
...
}}}

Example set of Arabic 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Brazilian Portuguese ====
Solr includes a modified version of the Snowball Portuguese algorithm for 
Brazilian Portuguese, and Lucene includes an example stopword list. This 
stemmer handles diacritical marks differently than the European Portuguese 
stemmer.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.BrazilianStemFilterFactory"/>
... 
}}}

Example set of Brazilian Portuguese 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java|stopwords]]
 (Look for BRAZILIAN_STOP_WORDS)

==== Bulgarian ====
<!> [[Solr3.1]]

Solr includes a light stemmer for Bulgarian, following this 
[[http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf|algorithm]], and Lucene 
includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.BulgarianStemFilterFactory"/>
...
}}}

Example set of Bulgarian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Chinese, Japanese, Korean ====
Lucene provides support for these languages with CJKTokenizer, which indexes 
bigrams and does some character folding of full-width forms.

{{{
   <tokenizer class="solr.CJKTokenizerFactory"/>
...
}}}

<!> Note: Be sure to use PositionFilter at query-time (only) as these languages 
do not use spaces between words. 
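
A minimal sketch of how this note could be applied, using separate index-time and query-time analyzers. The field type name {{{text_cjk}}} is hypothetical, and the sketch assumes {{{solr.PositionFilterFactory}}} is available in your Solr version:

{{{
<fieldtype name="text_cjk" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <!-- query-time only: keeps the query parser from building a phrase query out of the bigrams -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldtype>
}}}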

==== Czech ====
<!> [[Solr3.1]]

Solr includes a light stemmer for Czech, following this 
[[http://portal.acm.org/citation.cfm?id=1598600|algorithm]], and Lucene 
includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.CzechStemFilterFactory"/>
...
}}}

Example set of Czech 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java|stopwords]]
 (Look for CZECH_STOP_WORDS)

==== Danish ====
Solr includes support for stemming Danish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Danish" />
...
}}}

Example set of Danish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Dutch ====
Solr includes two stemmers for Dutch via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Dutch" />
...
}}}

An alternative stemmer (Kraaij-Pohlmann) can be used by specifying the language 
as "Kp".
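
For illustration, the Kraaij-Pohlmann variant differs from the example above only in the language attribute:

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Kp" />
...
}}}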

Example set of Dutch 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== English ====
Solr includes two stemmers for English, the original Porter stemmer via 
{{{solr.PorterStemFilterFactory}}}, and the Porter2 stemmer via 
{{{solr.SnowballPorterFilterFactory}}}, as well as an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
...
}}}

<!> Note: The standard {{{PorterStemFilterFactory}}} is significantly faster 
than {{{solr.SnowballPorterFilterFactory}}}.
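
If you prefer the Porter2 algorithm despite the speed difference, a sketch of the equivalent chain would be:

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- "English" selects the Porter2 algorithm -->
  <filter class="solr.SnowballPorterFilterFactory" language="English" />
...
}}}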

Larger example set of English 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]

==== Finnish ====
Solr includes support for stemming Finnish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Finnish" />
...
}}}

Example set of Finnish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== French ====
Solr includes support for stemming French via 
{{{solr.SnowballPorterFilterFactory}}}, removing elisions via 
ElisionFilterFactory, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory"/>
  <!-- do word delimiter, etc here -->
  <filter class="solr.SnowballPorterFilterFactory" language="French" />
...
}}}

Example set of French 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: It's probably best to use the ElisionFilter before 
WordDelimiterFilter. This will prevent very slow phrase queries.

==== German ====
Solr includes support for stemming German with three different algorithms: two 
via {{{solr.SnowballPorterFilterFactory}}}, and one via 
{{{solr.GermanStemFilterFactory}}}, and Lucene includes an example stopword 
list.

With the {{{solr.SnowballPorterFilterFactory}}} you can supply two different 
language attributes: "German" and "German2". German2 is just a modified version 
of German that handles the umlaut characters differently: for example, it treats 
"ü" as "ue" in most contexts.

The {{{solr.GermanStemFilterFactory}}} instead uses a different 
[[http://www.inf.fu-berlin.de/inst/pubs/tr-b-99-16.abstract.html|algorithm]].

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2" />
...
}}}
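
As a sketch, the third algorithm mentioned above would be selected by swapping in {{{solr.GermanStemFilterFactory}}}, which takes no language attribute:

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.GermanStemFilterFactory"/>
...
}}}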

Example set of German 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Greek ====
Solr includes support for stemming Greek following this 
[[http://people.dsv.su.se/~hercules/papers/Ntais_greek_stemmer_thesis_final.pdf|algorithm]]
 <!> [[Solr3.1]], as well as support for case/diacritics-insensitive search via 
{{{solr.GreekLowerCaseFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.GreekLowerCaseFilterFactory"/>
  <filter class="solr.GreekStemFilterFactory"/>
...
}}}

Example set of Greek 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory

==== Hindi ====
<!> [[Solr3.1]]

Solr includes support for stemming Hindi following this 
[[http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf|algorithm]],
 support for common spelling differences via 
{{{solr.HindiNormalizationFilterFactory}}} following this 
[[http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.3fe5b38c-02ee-41ce-9a8f-3e745670be32.pdf|algorithm]],
 support for encoding differences via 
{{{solr.IndicNormalizationFilterFactory}}} following this 
[[http://ldc.upenn.edu/myl/IndianScriptsUnicode.html|algorithm]], and Lucene 
includes an example stopword list.

{{{
...
  <filter class="solr.IndicNormalizationFilterFactory"/>
  <filter class="solr.HindiNormalizationFilterFactory"/>
  <filter class="solr.HindiStemFilterFactory"/>
...
}}}

Example set of Hindi 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Hungarian ====

Solr includes support for stemming Hungarian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
...
}}}

Example set of Hungarian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Indonesian ====
<!> [[Solr3.1]]

Solr includes support for stemming Indonesian (Bahasa Indonesia) following this 
[[http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf|algorithm]],
 and Lucene includes an example stopword list.

You can set the stemDerivational attribute to false to only stem inflectional 
suffixes, for a lighter approach.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
...
}}}

Example set of Indonesian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]

==== Italian ====
Solr includes support for stemming Italian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Italian" />
...
}}}

Example set of Italian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Norwegian ====
Solr includes support for stemming Norwegian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Norwegian" />
...
}}}

Example set of Norwegian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Persian / Farsi ====
Solr includes support for normalizing Persian via 
{{{solr.PersianNormalizationFilterFactory}}}, and Lucene includes an example 
stopword list.

{{{
...
  <filter class="solr.ArabicNormalizationFilterFactory"/>
  <filter class="solr.PersianNormalizationFilterFactory"/>
...
}}}

Example set of Persian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]]

<!> Note: WordDelimiterFilter does not split on joiners by default. You can 
solve this by using ArabicLetterTokenizerFactory, which does, or by using a 
custom WordDelimiterFilterFactory which supplies a customized charTypeTable to 
WordDelimiterFilter. In either case, consider using PositionFilter at 
query-time (only), as the QueryParser does not consider joiners and could 
create unwanted phrase queries.
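
One way the first option in this note might look is sketched below. The field type name {{{text_fa}}} is hypothetical, and the placement of PositionFilter at the end of the query-time chain is an assumption rather than a tested recommendation:

{{{
<fieldtype name="text_fa" class="solr.TextField">
  <analyzer type="index">
    <!-- splits on joiners, unlike the default WordDelimiterFilter behavior -->
    <tokenizer class="solr.ArabicLetterTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ArabicLetterTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <!-- query-time only: avoids unwanted phrase queries -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldtype>
}}}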

==== Portuguese ====
Solr includes support for stemming Portuguese via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
...
}}}

Example set of Portuguese 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Romanian ====
Solr includes support for stemming Romanian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
...
}}}

Example set of Romanian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Russian ====
Solr includes support for stemming Russian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Russian" />
...
}}}

Example set of Russian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Spanish ====
Solr includes support for stemming Spanish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
...
}}}

Example set of Spanish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

==== Swedish ====
Solr includes support for stemming Swedish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Swedish" />
...
}}}

Example set of Swedish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Thai ====
Solr includes support for breaking Thai text into words via 
{{{solr.ThaiWordFilterFactory}}}

{{{
...
  <filter class="solr.ThaiWordFilterFactory"/>
...
}}}

<!> Note: Be sure to use PositionFilter at query-time (only) as this language 
does not use spaces between words.
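
A possible full field type following this note is sketched here. The name {{{text_th}}} and the choice of {{{solr.StandardTokenizerFactory}}} as the tokenizer are assumptions for illustration:

{{{
<fieldtype name="text_th" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
    <!-- query-time only -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldtype>
}}}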

==== Turkish ====
Solr includes support for stemming Turkish via 
{{{solr.SnowballPorterFilterFactory}}}, as well as support for case-insensitive 
search via {{{solr.TurkishLowerCaseFilterFactory}}} <!> [[Solr3.1]], and Lucene 
includes an example stopword list.

{{{
...
  <filter class="solr.TurkishLowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Turkish" />
...
}}}

Example set of Turkish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!> 
[[Solr3.1]]

=== Not yet Integrated ===

The following languages have explicit support in Lucene, but it is not yet 
integrated into Solr. If you need to support these languages you might find 
this information useful in the meantime.

==== Chinese, Japanese, Korean ====

Lucene provides support for Chinese word segmentation (SentenceTokenizer, 
WordTokenFilter) in a separate jar file (lucene-analyzers-smartcn.jar). This 
component includes a large dictionary and segments Chinese text into words using 
a Hidden Markov Model.

<!> [[Lucene3.1]]

Additionally, Lucene provides support for matching between Traditional and 
Simplified Chinese and for matching between Hiragana and Katakana 
(ICUTransformFilter) in a separate jar file (lucene-icu.jar).

<!> Note: Be sure to use PositionFilter at query-time (only) as this language 
does not use spaces between words.

==== Polish ====
<!> [[Lucene3.1]]

Lucene provides support for Polish stemming (StempelFilter) in a separate jar 
file (lucene-analyzers-stempel.jar). This component includes an algorithmic 
stemmer with tables for Polish.

==== Lao, Myanmar, Khmer ====
<!> [[Lucene3.1]]

Lucene provides support for segmenting these languages into syllables 
(ICUTokenizer) in a separate jar file (lucene-icu.jar).

<!> Note: Be sure to use PositionFilter at query-time (only) as these languages 
do not use spaces between words. 

=== My language is not listed!!! ===

Your language might work anyway. A first step is to start with the "textgen" 
type in the example schema. Remember, things like stemming and stopwords aren't 
mandatory for the search to work, only optional language-specific improvements.

If you have problems (your language is highly-inflectional, etc), you might 
want to try using an n-gram approach as an alternative.
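
A rough sketch of such an n-gram field type follows. The name {{{text_ngram}}} and the gram sizes (2 to 4) are illustrative assumptions, not tuned values; larger ranges increase index size:

{{{
<fieldtype name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits every character n-gram of length 2 to 4 for each token -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
  </analyzer>
</fieldtype>
}}}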

=== Tokenization ===

In general most languages don't require special tokenization (and will work 
just fine with Whitespace + WordDelimiterFilter), so you can safely tailor the 
English "text" example schema definition to fit.
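
As a starting point, a stripped-down field type along these lines might look like the following. The name {{{text_generic}}} is hypothetical and the WordDelimiterFilter settings shown are only one plausible combination:

{{{
<fieldtype name="text_generic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits on intra-word delimiters and case changes -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- add a language-specific stopword filter or stemmer here if desired -->
  </analyzer>
</fieldtype>
}}}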

=== Ignoring Case ===

In most cases LowerCaseFilterFactory is sufficient. 
However, some languages have special casing properties, and these have their 
own filters:

 * TurkishLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory 
for the Turkish language. It includes special handling for 
[[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|dotted and dotless I]].
 * GreekLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for 
the Greek language. It removes Greek diacritics and has special handling for 
the Greek final sigma.

=== Ignoring Diacritics ===

Some languages use diacritics, but people are not always consistent about 
typing them in queries or documents.

If you are using a stemmer, most stemmers (especially Snowball stemmers) are a 
bit forgiving about diacritics, and these are handled on a language-specific 
basis.

For Latin-script writing systems, you can remove all diacritics with 
ASCIIFoldingFilterFactory. But this might not be the best approach for your 
language: for example, you may want ü to match ue for German. In this case it 
is better not to use ASCIIFoldingFilter before stemming, but instead to use the 
"German2" stemmer first, which has language-specific handling for this case.

For some languages in non-Latin writing systems (Arabic, Greek, Hindi, 
Persian), there are filters to support the idea of "diacritics-insensitive 
search" already included in Solr. These filters are described above under the 
relevant languages.
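
Returning to the Latin-script German case above, the ordering could be sketched as follows. Whether to fold at all after stemming is a judgment call; the folding filter is shown here only to illustrate its position in the chain:

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2" />
  <!-- optional: fold any remaining diacritics only after stemming -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
...
}}}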

For other languages, the ASCIIFoldingFilterFactory won't do the foldings that 
you need. One solution is to use the ICUFoldingFilter <!> [[Lucene3.1]], which 
implements a [[http://unicode.org/reports/tr30/tr30-4.html|similar idea]] 
across all of Unicode. Unfortunately, this filter is not yet integrated into 
Solr, so for now you must make the factory yourself.

=== Stopwords ===

Stopwords affect Solr in three ways: relevance, performance, and resource 
utilization.

From a relevance perspective, these extremely high-frequency terms tend to 
throw off the scoring algorithm, and you won't get very good results if you 
leave them in. At the same time, if you remove them, you can return bad results 
when the stopword is actually important.

From a performance perspective, if you keep stopwords, some queries 
(especially phrase queries) can be very slow.

From a resource utilization perspective, if you keep stopwords, the index is 
much larger than if you remove them.

One tradeoff you can make if you have the disk space: You can use 
CommonGramsFilter/CommonGramsQueryFilter instead of StopFilter. This solves the 
relevance and performance problems, at the expense of even more resource 
utilization, because it will form bigrams of stopwords to their adjacent words.
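
A sketch of this tradeoff in a schema is shown below. The field type name {{{text_commongrams}}} and the file name {{{stopwords.txt}}} are assumptions; the same word list should be used at index and query time:

{{{
<fieldtype name="text_commongrams" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keeps stopwords and also indexes bigrams of each stopword with its neighbors -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldtype>
}}}

To simply remove stopwords instead, the two CommonGrams filters would be replaced by a single {{{solr.StopFilterFactory}}} with the same {{{words}}} and {{{ignoreCase}}} attributes.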

=== Stemming ===

Stemming can help improve relevance, but it can also hurt.

There is no general rule for whether or not to stem: It depends not only on the 
language, but also on the properties of your documents and queries.

In general, if the language is highly inflectional, it's worth evaluating, as it 
might bring a significant improvement. Some annoyances caused by stemming can 
then be handled with tuning: See {{{CustomizingStemming}}} below.

==== Notes about solr.PorterStemFilterFactory ====

Porter stemmer for the English language.

Standard Lucene implementation of the 
[[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm]], a 
normalization process that removes common endings from words.

  Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".

Note: This differs very slightly from the "Porter" algorithm available in 
`solr.SnowballPorterFilterFactory`, because it deviates slightly from the 
published algorithm.
For more details, see the section "Points of difference from the published 
algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].

This is the fastest stemmer for English: approximately twice as fast as using 
SnowballPorterFilterFactory.

<<Anchor(SnowballPorterFilter)>>
==== Notes about solr.SnowballPorterFilterFactory ====

Creates `org.apache.lucene.analysis.snowball.SnowballFilter`.

Creates a [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball 
stemmer]] from the Java classes generated from a 
[[http://snowball.tartarus.org/|Snowball]] specification.  The language 
attribute is used to specify the language of the stemmer.
{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldtype>
}}}

Valid values for the language attribute (creates the snowball stemmer class 
language + "Stemmer"):
 * [[http://snowball.tartarus.org/algorithms/danish/stemmer.html|Danish]]
 * [[http://snowball.tartarus.org/algorithms/dutch/stemmer.html|Dutch]]
 * [[http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html|Kp]]: 
The Kraaij-Pohlmann stemming algorithm for Dutch.
 * [[http://snowball.tartarus.org/algorithms/porter/stemmer.html|Porter]]: The 
original Porter stemming algorithm for English.
 * [[http://snowball.tartarus.org/algorithms/english/stemmer.html|English]]: 
The Porter2 stemming algorithm for English.
 * [[http://snowball.tartarus.org/algorithms/lovins/stemmer.html|Lovins]]: The 
early Lovins stemming algorithm for English.
 * [[http://snowball.tartarus.org/algorithms/finnish/stemmer.html|Finnish]]
 * [[http://snowball.tartarus.org/algorithms/french/stemmer.html|French]]
 * [[http://snowball.tartarus.org/algorithms/german/stemmer.html|German]]
 * [[http://snowball.tartarus.org/algorithms/german2/stemmer.html|German2]]: A 
variation of the German algorithm with handling to allow ä, ö and ü to be 
represented by ae, oe and ue
 * [[http://snowball.tartarus.org/algorithms/hungarian/stemmer.html|Hungarian]]
 * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
 * [[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
 * [[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
 * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
 * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
 * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
 * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
 * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]

<!> Gotchas:
 * Although the Lovins stemmer is described as faster than Porter/Porter2, 
practically it is much slower in Solr, as it is implemented using reflection.
 * Neither the Lovins nor the Finnish stemmer produces correct output (as of 
Solr 1.4), due to a 
[[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in 
Snowball]].
 * The Turkish stemmer requires special lowercasing. You should use 
TurkishLowerCaseFilter instead of LowerCaseFilter with this language. See 
[[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
 * The stemmers are sensitive to diacritics. Think carefully before removing 
these with something like `ASCIIFoldingFilterFactory` before stemming, as this 
could cause unwanted results. For example, `résumé` will not be stemmed by the 
Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match 
with `resumed`, `resuming`, etc. The differences can be more profound for 
non-English stemmers.

<<Anchor(CustomizingStemming)>>
=== Customizing Stemming ===

Sometimes a stemmer might not do what you want out of the box. For example, you 
might be happy with the results on average, but have a few particular cases 
(such as product names or similar) where it annoys you or actually hurts your 
search results.

The components below allow you to fine-tune the stemming process by preventing 
words from being stemmed at all, or by overriding the stemming algorithm with 
custom mappings.

==== solr.KeywordMarkerFilterFactory ====
<!> [[Solr3.1]]

Protects words from being modified by stemmers.

A customized protected word list may be specified with the "protected" 
attribute in the schema. Any words in the protected word list will not be 
modified by any stemmer in Solr.

A 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/protwords.txt|sample
 Solr protwords.txt with comments]] can be found in the Source Repository.

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldtype>
}}}

==== solr.StemmerOverrideFilterFactory ====
<!> [[Solr3.1]]

Overrides stemming algorithms, by applying a custom mapping, then protecting 
these terms from being modified by stemmers.

A customized mapping of words to stems, in a tab-separated file, can be 
specified with the "dictionary" attribute in the schema.  Words in this mapping 
will be stemmed to the stems from the file, and will not be further changed by 
any stemmer.

A 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/test/test-files/solr/conf/stemdict.txt|sample
 stemdict.txt with comments]] can be found in the Source Repository.

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldtype>
}}}

<<Anchor(Decompounding)>>
=== Decompounding ===

Decompounding can improve search results for some languages. At the same time, 
it can increase the time it takes to index and search, as well as increase the 
index size itself.

Solr provides dictionary-based decompounding support via 
solr.DictionaryCompoundWordTokenFilterFactory. This factory allows you to 
provide a dictionary, along with some settings (min/max subword size, etc), to 
break compound words into pieces.
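
A minimal sketch of such a configuration follows. The field type name {{{text_decompound}}}, the dictionary file name {{{compound-dictionary.txt}}}, and the size settings are illustrative assumptions that would need tuning for your language and data:

{{{
<fieldtype name="text_decompound" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the dictionary file lists the subwords that compounds may be split into, one per line -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="compound-dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
  </analyzer>
</fieldtype>
}}}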

One alternative is to use n-gram tokenization so that the search is less 
sensitive to compound words.

TODO: Add support for Lucene's hyphenation grammar-based decompounding and 
document it here.
