[Solr Wiki] Update of "LanguageAnalysis" by RobertMuir

Apache Wiki Wed, 02 Mar 2011 19:10:40 -0800

The "LanguageAnalysis" page has been changed by RobertMuir.
The comment on this change is: add hy, ca, eu, gl and updates for 
analysis-extras contrib.
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=10&rev2=11

--------------------------------------------------

  
  Example set of Arabic 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
+ === Armenian ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes support for stemming Armenian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.SnowballPorterFilterFactory" language="Armenian" />
+ ...
+ }}}
+ 
+ Example set of Armenian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
+ 
+ === Basque ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes support for stemming Basque via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.SnowballPorterFilterFactory" language="Basque" />
+ ...
+ }}}
+ 
+ Example set of Basque 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
+ 
  === Brazilian Portuguese ===
  Solr includes a modified version of the Snowball Portuguese algorithm for 
Brazilian Portuguese, and Lucene includes an example stopword list. This 
stemmer handles diacritical marks differently than the European Portuguese 
stemmer.
  
@@ -34, +62 @@

  ... 
  }}}
  
- Example set of Brazilian Portuguese 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java|stopwords]]
 (Look for BRAZILIAN_STOP_WORDS)
+ Example set of Brazilian Portuguese 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  === Bulgarian ===
  <!> [[Solr3.1]]
@@ -49, +77 @@

  }}}
  
  Example set of Bulgarian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
+ 
+ === Catalan ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes support for stemming Catalan via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
+ ...
+ }}}
+ 
+ Example set of Catalan 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  === Chinese, Japanese, Korean ===
  Lucene provides support for these languages with CJKTokenizer, which indexes 
bigrams and does some character folding of full-width forms.
  
+ <!> [[Solr3.1]] Alternatively, for Simplified Chinese, Solr provides Chinese 
word segmentation via {{{solr.SmartChineseWordTokenFilterFactory}}} in the 
analysis-extras contrib module. This component includes a large dictionary and 
segments Chinese text into words using a Hidden Markov Model. To use this 
filter, see solr/contrib/analysis-extras/README.txt for instructions on which 
jars you need to add to your SOLR_HOME/lib.
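+ 
+ A minimal sketch of such an analyzer chain, assuming the analysis-extras jars 
are already in SOLR_HOME/lib. The 
{{{solr.SmartChineseSentenceTokenizerFactory}}} name is an assumption (the 
sentence tokenizer that the word filter is normally paired with); check the 
README for the exact class names in your release.
+ 
+ {{{
+ ...
+   <!-- assumed factory: splits the text into sentences -->
+   <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
+   <!-- segments each sentence into words -->
+   <filter class="solr.SmartChineseWordTokenFilterFactory"/>
+ ...
+ }}}
+ 
+ The bigram-based CJKTokenizer configuration, by comparison: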
+ 
  {{{
     <tokenizer class="solr.CJKTokenizerFactory"/>
  ...
@@ -72, +116 @@

  ...
  }}}
  
- Example set of Czech 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java|stopwords]]
 (Look for CZECH_STOP_WORDS)
+ Example set of Czech 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  === Danish ===
  Solr includes support for stemming Danish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
@@ -151, +195 @@

  
  <!> Note: It's probably best to use the ElisionFilter before 
WordDelimiterFilter. This will prevent very slow phrase queries.
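  
  A sketch of that ordering, with the rest of the chain elided. Attributes 
such as an articles file are omitted here on the assumption that the defaults 
are acceptable.
  
  {{{
  ...
    <!-- strip elided articles first -->
    <filter class="solr.ElisionFilterFactory"/>
    <!-- then split on intra-word delimiters -->
    <filter class="solr.WordDelimiterFilterFactory"/>
  ...
  }}}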
  
+ === Galician ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes a stemmer for Galician following this 
[[http://bvg.udc.es/recursos_lingua/stemming.jsp|algorithm]], and Lucene 
includes an example stopword list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.GalicianStemFilterFactory"/>
+ ...
+ }}}
+ 
+ Example set of Galician 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
+ 
  === German ===
  Solr includes support for stemming German with five different algorithms: two 
via {{{solr.SnowballPorterFilterFactory}}}, one via 
{{{solr.GermanStemFilterFactory}}}, a lightweight stemmer <!> [[Solr3.1]] via 
{{{solr.GermanLightStemFilterFactory}}}, and an even less aggressive approach 
<!> [[Solr3.1]] via {{{solr.GermanMinimalStemFilterFactory}}}. Lucene includes 
an example stopword list.
  
@@ -241, +299 @@

  
  Example set of Italian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
+ === Lao, Myanmar, Khmer ===
+ <!> [[Solr3.1]]
+ 
+ Lucene provides support for segmenting these languages into syllables with 
{{{solr.ICUTokenizerFactory}}} in the analysis-extras contrib module. To use 
this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on 
which jars you need to add to your SOLR_HOME/lib.
+ 
+ <!> Note: Be sure to use PositionFilter at query-time (only) as these 
languages do not use spaces between words. 
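+ 
+ A minimal sketch of a field type along these lines, assuming the 
analysis-extras jars are installed. The field type name {{{text_icu}}} is 
hypothetical; {{{solr.PositionFilterFactory}}} appears in the query analyzer 
only, as the note above suggests.
+ 
+ {{{
+ <fieldType name="text_icu" class="solr.TextField">
+   <analyzer type="index">
+     <tokenizer class="solr.ICUTokenizerFactory"/>
+   </analyzer>
+   <analyzer type="query">
+     <tokenizer class="solr.ICUTokenizerFactory"/>
+     <!-- query-time only: sets the position increment of tokens after the first to 0 -->
+     <filter class="solr.PositionFilterFactory"/>
+   </analyzer>
+ </fieldType>
+ }}}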
+ 
  === Norwegian ===
  Solr includes support for stemming Norwegian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
  
@@ -265, +330 @@

  ...
  }}}
  
- Example set of Persian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]]
+ Example set of Persian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: WordDelimiterFilter does not split on joiners by default. You can 
solve this by using ArabicLetterTokenizerFactory, which does, or by using a 
custom WordDelimiterFilterFactory which supplies a customized charTypeTable to 
WordDelimiterFilter. In either case, consider using PositionFilter at 
query-time (only), as the QueryParser does not consider joiners and could 
create unwanted phrase queries.
  
+ === Polish ===
+ <!> [[Solr3.1]]
+ 
+ Lucene provides support for Polish stemming via 
{{{solr.StempelPolishStemFilterFactory}}} in the analysis-extras contrib 
module. This component includes an algorithmic stemmer with tables for Polish.
+ To use this filter, see solr/contrib/analysis-extras/README.txt for 
instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.StempelPolishStemFilterFactory"/>
+ ...
+ }}}
+ 
+ Example set of Polish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
+ 
  === Portuguese ===
- Solr includes three stemmers for Portuguese: one via 
{{{solr.SnowballPorterFilterFactory}}}, an alternative stemmer <!> [[Solr3.1]] 
via {{{solr.PortugueseLightStemFilterFactory}}}, and an even less aggressive 
approach <!> [[Solr3.1]] via {{{solr.PortugueseMinimalStemFilterFactory}}}. 
Lucene includes an example stopword list.
+ Solr includes four stemmers for Portuguese: one via 
{{{solr.SnowballPorterFilterFactory}}}, an alternative stemmer <!> [[Solr3.1]] 
via {{{solr.PortugueseStemFilterFactory}}}, a lighter stemmer <!> [[Solr3.1]] 
via {{{solr.PortugueseLightStemFilterFactory}}}, and an even less aggressive 
approach <!> [[Solr3.1]] via {{{solr.PortugueseMinimalStemFilterFactory}}}. 
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -355, +435 @@

  Example set of Turkish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory 
<!> [[Solr3.1]]
- 
- == Not yet Integrated ==
- 
- The following languages have explicit support in Lucene, but it is not yet 
integrated into Solr. If you need to support these languages you might find 
this information useful in the meantime.
- 
- === Chinese, Japanese, Korean ===
- 
- Lucene provides support for Chinese word segmentation (SentenceTokenizer, 
WordTokenFilter) in a separate jar file (lucene-analyzers-smartcn.jar). This 
component includes a large dictionary and segments Chinese text into words with 
the Hidden Markov Model.
- 
- <!> [[Lucene3.1]]
- 
- Additionally, Lucene provides support for matching between Traditional and 
Simplified Chinese and for matching between Hiragana and Katakana 
(ICUTransformFilter) in a separate jar file (lucene-icu.jar).
- 
- <!> Note: Be sure to use PositionFilter at query-time (only) as this language 
does not use spaces between words.
- 
- === Polish ===
- <!> [[Lucene3.1]]
- 
- Lucene provides support for Polish stemming (StempelFilter) in a separate jar 
file (lucene-analyzers-stempel.jar). This component includes an algorithmic 
stemmer with tables for Polish.
- 
- === Lao, Myanmar, Khmer ===
- <!> [[Lucene3.1]]
- 
- Lucene provides support for segmenting these languages into syllables 
(ICUTokenizer) in a separate jar file (lucene-icu.jar).
- 
- <!> Note: Be sure to use PositionFilter at query-time (only) as these 
languages do not use spaces between words. 
  
  == My language is not listed!!! ==
  
@@ -464, +518 @@

  }}}
  
  Valid values for the language attribute (creates the snowball stemmer class 
language + "Stemmer"):
+  * [[http://snowball.tartarus.org/algorithms/armenian/stemmer.html|Armenian]] 
<!> [[Lucene3.1]]
+  * [[http://snowball.tartarus.org/algorithms/basque/stemmer.html|Basque]] <!> 
[[Lucene3.1]]
+  * [[http://snowball.tartarus.org/algorithms/catalan/stemmer.html|Catalan]] 
<!> [[Lucene3.1]]
   * [[http://snowball.tartarus.org/algorithms/danish/stemmer.html|Danish]]
   * [[http://snowball.tartarus.org/algorithms/dutch/stemmer.html|Dutch]]
   * 
[[http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html|Kp]]: 
The Kraaij-Pohlmann stemming algorithm for Dutch.
