Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "LanguageAnalysis" page has been changed by iorixxx: http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=24&rev2=25 Comment: Turkish stopwords URL was corrected

= Language Analysis =
This page describes some of the language-specific analysis components available in Solr. These components can be used to improve search results for specific languages.

Please look at AnalyzersTokenizersTokenFilters for other analysis components you can use in combination with these components.

NOTE: This page is mostly '''obsolete'''. The [[http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml|Solr Example]] now contains configurations for various languages as fieldTypes (text_XX). This is synchronized with the support from Lucene.

<<TableOfContents>>

== By language ==

=== Arabic ===
Solr provides support for the [[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]] stemming algorithm, and Lucene includes an example stopword list.
{{{
...
<filter class="solr.ArabicStemFilterFactory"/>
...
}}}
Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Armenian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Armenian" />
...
}}}
Example set of Armenian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Basque ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Basque" />
...
}}}
Example set of Basque [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Brazilian Portuguese ===
{{{
...
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
...
}}}
Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Bulgarian ===
{{{
...
<filter class="solr.BulgarianStemFilterFactory"/>
...
}}}
Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Catalan ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
...
}}}
Example set of Catalan [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Chinese, Japanese, Korean ===
{{{
...
<tokenizer class="solr.CJKTokenizerFactory"/>
...
}}}
<!> [[Solr3.1]] Alternatively, for Simplified Chinese, Solr provides support for Chinese word segmentation via {{{solr.SmartChineseWordTokenFilterFactory}}} in the analysis-extras contrib module. This component includes a large dictionary and segments Chinese text into words with a Hidden Markov Model.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.

To use the default setup, with fallback to the English Porter stemmer for English words, use:
{{{
<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
}}}
Or, to configure your own analysis setup, use the SmartChineseSentenceTokenizerFactory along with your custom filter setup. The sentence tokenizer tokenizes on sentence boundaries, and the SmartChineseWordTokenFilter breaks this up further into words.
{{{
<analyzer>
  <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
  ...
  <filter class="solr.PositionFilterFactory" />
</analyzer>
}}}
<!> Note: Be sure to use [[AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory|PositionFilter]] at query-time (only) as these languages do not use spaces between words.

=== Czech ===
<!> [[Solr3.1]]
{{{
...
<filter class="solr.CzechStemFilterFactory"/>
...
}}}
Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Danish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Danish" />
...
}}}
Example set of Danish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Dutch ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" />
...
}}}
An alternative stemmer (Kraaij-Pohlmann) can be used by specifying the language as "Kp".
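The Kraaij-Pohlmann variant is selected the same way as the other Snowball stemmers, via the language attribute. A minimal sketch (surrounding analyzer lines elided, as in the other examples on this page):
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Kp" />
...
}}}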
Example set of Dutch [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== English ===
{{{
...
<filter class="solr.PorterStemFilterFactory"/>
...
}}}
<!> Note: The standard {{{PorterStemFilterFactory}}} is significantly faster than {{{solr.SnowballPorterFilterFactory}}}.

Larger example set of English [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]

=== Finnish ===
Solr includes two stemmers for Finnish: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.FinnishLightStemFilterFactory}}}. Lucene includes an example stopword list.
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Finnish" />
...
}}}
Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== French ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="French" />
...
}}}
Example set of French [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This will prevent very slow phrase queries.

=== Galician ===
{{{
...
<filter class="solr.GalicianStemFilterFactory"/>
...
}}}
Example set of Galician [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== German ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
...
}}}
Example set of German [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Greek ===
{{{
...
<filter class="solr.GreekStemFilterFactory"/>
...
}}}
Example set of Greek [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory

=== Hebrew ===
{{{
...
<tokenizer class="solr.ICUTokenizerFactory"/>
...
}}}
Example set of Hebrew [[http://wiki.korotkin.co.il/Hebrew_stopwords|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Hindi ===
<!> [[Solr3.1]]
{{{
...
<filter class="solr.HindiStemFilterFactory"/>
...
}}}
Example set of Hindi [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Hungarian ===
Solr includes two stemmers for Hungarian: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.HungarianLightStemFilterFactory}}}. Lucene includes an example stopword list.
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
...
}}}
Example set of Hungarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Indonesian ===
{{{
...
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
...
}}}
Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]

=== Italian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Italian" />
...
}}}
Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Lao, Myanmar, Khmer ===
Lucene provides support for segmenting these languages into syllables with {{{solr.ICUTokenizerFactory}}} in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.

<!> Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words.

=== Norwegian ===
Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list. Since <!> [[Solr3.6]] you can also use {{{solr.NorwegianLightStemFilterFactory}}} for a lighter variant, or {{{solr.NorwegianMinimalStemFilterFactory}}}, which attempts to normalize plural endings only. These two are simple rule-based stemmers that do not handle exceptions or irregular forms.
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Norwegian" />
...
}}}
Example set of Norwegian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Persian ===
{{{
...
<filter class="solr.PersianNormalizationFilterFactory"/>
...
}}}
Example set of Persian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: WordDelimiterFilter does not split on joiners by default. You can solve this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider using PositionFilter at query-time (only), as the QueryParser does not consider joiners and could create unwanted phrase queries.

=== Polish ===
<!> [[Solr3.1]]

Lucene provides support for Polish stemming via {{{solr.StempelPolishStemFilterFactory}}} in the analysis-extras contrib module. This component includes an algorithmic stemmer with tables for Polish. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
{{{
...
<filter class="solr.StempelPolishStemFilterFactory"/>
...
}}}
Example set of Polish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Portuguese ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
...
}}}
Example set of Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Romanian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
...
}}}
Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Russian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Russian" />
...
}}}
Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Spanish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
...
}}}
Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Swedish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Swedish" />
...
}}}
Example set of Swedish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.
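As with several other languages above, Lucene 3.1 added a lighter, less aggressive stemmer for Swedish. Assuming {{{solr.SwedishLightStemFilterFactory}}} is available in your Solr version (check your release before relying on it), a sketch of such a setup:
{{{
...
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SwedishLightStemFilterFactory"/>
...
}}}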
=== Thai ===
{{{
...
<filter class="solr.ThaiWordFilterFactory"/>
...
}}}
<!> Note: Be sure to use PositionFilter at query-time (only) as this language does not use spaces between words.

=== Turkish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Turkish" />
...
}}}
Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/lang/stopwords_tr.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!> [[Solr3.1]]

== My language is not listed!!! ==
Your language might work anyway. A first step is to start with the "textgen" type in the example schema. Remember, things like stemming and stopwords aren't mandatory for search to work; they are only optional language-specific improvements. If you have problems (your language is highly inflectional, etc.), you might want to try an n-gram approach as an alternative.

== Other Tips ==

=== Tokenization ===
In general, most languages don't require special tokenization (and will work just fine with Whitespace + WordDelimiterFilter), so you can safely tailor the English "text" example schema definition to fit.

=== Ignoring Case ===
In most cases LowerCaseFilterFactory is sufficient. However, some languages have special casing properties, and these have their own filters:

 * TurkishLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Turkish language. It includes special handling for [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|dotted and dotless I]].
 * GreekLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Greek language. It removes Greek diacritics and has special handling for the Greek final sigma.

=== Ignoring Diacritics ===
Some languages use diacritics, but people are not always consistent about typing them in queries or documents. If you are using a stemmer, most stemmers (especially the Snowball stemmers) are a bit forgiving about diacritics, and these are handled on a language-specific basis.

For other languages, the ASCIIFoldingFilterFactory won't do the foldings that you need. One solution is to use {{{solr.analysis.ICUFoldingFilterFactory}}} <!> [[Solr3.1]], which implements a [[http://unicode.org/reports/tr30/tr30-4.html|similar idea]] across all of Unicode.

=== Stopwords ===
Stopwords affect Solr in three ways: relevance, performance, and resource utilization. From a relevance perspective, these extremely high-frequency terms tend to throw off the scoring algorithm, and you won't get very good results if you leave them in. At the same time, if you remove them, you can return bad results when the stopword is actually important.

One tradeoff you can make if you have the disk space: you can use CommonGramsFilter/CommonGramsQueryFilter instead of StopFilter. This solves the relevance and performance problems, at the expense of even more resource utilization, because it forms bigrams of stopwords with their adjacent words.

=== Stemming ===
Stemming can help improve relevance, but it can also hurt. There is no general rule for whether or not to stem: it depends not only on the language, but also on the properties of your documents and queries.

Lucene/Solr provides different stemmers, and for some languages you may have multiple choices. Some are algorithmic, others are dictionary-based.
The Snowball stemmers rely on algorithms and are considered fairly aggressive, but for many languages (see above) Solr provides alternatives that are less aggressive. In many situations a lighter approach yields better relevance: often "less is more". The light stemmers typically target the most common noun/adjective inflections, and perhaps a few derivational suffixes. The minimal stemmers are even more conservative and may only remove plural endings. The new Hunspell stemmers are both dictionary- and rule-based, and may provide tighter stemming than Snowball for some languages.

<!> [[Solr3.5]] The Hunspell stemmers are configured through the HunspellStemFilterFactory combined with a dictionary and an affix file. Hunspell supports 99 languages.

==== Notes about solr.PorterStemFilterFactory ====
Porter stemmer for the English language. Standard Lucene implementation of the [[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm]], a normalization process that removes common endings from words.

 . Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".

Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`, in that it deviates slightly from the published algorithm. For more details, see the section "Points of difference from the published algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].

Porter is approximately twice as fast as SnowballPorterFilterFactory.

KStem is considerably faster than SnowballPorterFilterFactory.
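For English, KStem can be dropped in wherever a Porter-style stemmer would go. A minimal sketch, assuming {{{solr.KStemFilterFactory}}} is available in your Solr version (the field type name here is only illustrative):
{{{
<fieldtype name="text_kstem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldtype>
}}}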
<<Anchor(SnowballPorterFilter)>>

==== Notes about solr.SnowballPorterFilterFactory ====
Creates `org.apache.lucene.analysis.SnowballPorterFilter`. Creates a [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]] from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification. The language attribute is used to specify the language of the stemmer.
{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    ...
  </analyzer>
</fieldtype>
}}}
Valid values for the language attribute (creates the snowball stemmer class language + "Stemmer"):

 * [[http://snowball.tartarus.org/algorithms/armenian/stemmer.html|Armenian]] <!> [[Lucene3.1]]
 * [[http://snowball.tartarus.org/algorithms/basque/stemmer.html|Basque]] <!> [[Lucene3.1]]
 * [[http://snowball.tartarus.org/algorithms/catalan/stemmer.html|Catalan]] <!> [[Lucene3.1]]
 ...
 * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]

<!> Gotchas:

 * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it is much slower in Solr, as it is implemented using reflection.
 * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
 * The Turkish stemmer requires special lowercasing. You should use TurkishLowerCaseFilter instead of LowerCaseFilter with this language. See [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
 * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match with `resumed`, `resuming`, etc. The differences can be more profound for non-English stemmers.
<<Anchor(CustomizingStemming)>>

=== Customizing Stemming ===
Sometimes a stemmer might not do what you want out of the box. For example, you might be happy with the results on average, but have a few particular cases (such as product names or similar) where it annoys you or actually hurts your search results. The components below allow you to fine-tune the stemming process by preventing words from being stemmed at all, or by overriding the stemming algorithm with custom mappings.
{{{
...
</analyzer>
</fieldtype>
}}}

==== solr.StemmerOverrideFilterFactory ====
<!> [[Solr3.1]]
{{{
...
</analyzer>
</fieldtype>
}}}

<<Anchor(Decompounding)>>

=== Decompounding ===
Decompounding can improve search results for some languages. At the same time, it can increase the time it takes to index and search, as well as increase the index size itself. Solr provides dictionary-based decompounding support via solr.DictionaryCompoundWordTokenFilterFactory. This factory allows you to provide a dictionary, along with some settings (min/max subword size, etc.), to break compound words into pieces.
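A minimal sketch of such a configuration (the field type name, the dictionary filename, and the size settings here are only illustrative; tune them for your language):
{{{
<fieldtype name="text_compound" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compound-dictionary.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
</fieldtype>
}}}
The dictionary is a plain word list (one word per line); the filter emits the original compound plus any dictionary subwords it finds within it.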