[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by YonikSeeley

Apache Wiki Tue, 25 Jul 2006 13:46:58 -0700

The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

The comment on the change is:
configurable stemmer languages, latin1 filter

------------------------------------------------------------------------------
  }}}
  
  '''Note:''' Due to performance concerns, this implementation does not utilize 
`org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses Java 
reflection to stem every word. 
+ 
+ ==== solr.SnowballPorterFilterFactory ====
+ 
+ Creates `org.apache.lucene.analysis.SnowballPorterFilter`.
+ 
+ Creates a stemmer from the Java classes generated from a 
[http://snowball.tartarus.org/ Snowball] specification; for English this is the 
[http://snowball.tartarus.org/algorithms/english/stemmer.html Porter2 stemmer].  
The `language` attribute specifies the language of the stemmer.
+ {{{
+ <fieldtype name="myfieldtype" class="solr.TextField">
+   <analyzer>
+     <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
+     <filter class="solr.SnowballPorterFilterFactory" language="German" />
+   </analyzer>
+ </fieldtype>
+ }}}
+ 
+ Valid values for the `language` attribute (each value selects the Snowball 
stemmer class named language + "Stemmer", e.g. `GermanStemmer`):
+  * Danish
+  * Dutch
+  * English
+  * Finnish
+  * French
+  * German2
+  * German
+  * Italian
+  * Kp
+  * Lovins
+  * Norwegian
+  * Porter
+  * Portuguese
+  * Russian
+  * Spanish
+  * Swedish
+ 
  
  ==== solr.WordDelimiterFilterFactory ====
  
@@ -358, +391 @@

     * Many thousands of documents containing the term "text:TV"
     * A few hundred documents containing the term "text:Television"
  
- A query for `text:TV` will expand into `(text:TV text:Television)` and the 
lower docFreq for `text:Television` will give the documents that match 
"Television" a much higher score then docs that match "TV" comparably -- which 
may be somewhat counter intuative to the client.  Index time expansion (or 
reduction) will result in the same idf for all documents regardless of which 
term the orriginal text contained.
+ A query for `text:TV` will expand into `(text:TV text:Television)` and the 
lower docFreq for `text:Television` will give the documents that match 
"Television" a much higher score than docs that match "TV" comparably -- which 
may be somewhat counterintuitive to the client.  Index time expansion (or 
reduction) will result in the same idf for all documents regardless of which 
term the original text contained.
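+ 
+ As a rough worked example, assuming Lucene's default similarity (where idf is 
1 + ln(numDocs / (docFreq + 1))) and purely hypothetical counts of 1,000,000 
documents in total, 50,000 containing "TV" and 300 containing "Television":
+ {{{
+ \mathrm{idf}(\text{TV})         = 1 + \ln\frac{1000000}{50000 + 1} \approx 4.0
+ \mathrm{idf}(\text{Television}) = 1 + \ln\frac{1000000}{300 + 1}   \approx 9.1
+ }}}
+ Under those assumed counts, a match on "Television" carries more than twice 
the idf of a match on "TV", which is the score skew described above.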
  
  ==== solr.RemoveDuplicatesTokenFilterFactory ====
  
@@ -366, +399 @@

  
  Filters out any tokens which are at the same logical position in the 
tokenstream as a previous token with the same text.  This can arise in a number 
of situations depending on what the "up stream" token filters are -- notably 
when stemming synonyms with similar roots.  It is useful to remove the 
duplicates to prevent `idf` inflation at index time, or `tf` inflation (in a 
!MultiPhraseQuery) at query time.
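  
  A minimal sketch of where this filter might sit in an analyzer chain, 
assuming a synonym filter followed by a stemmer.  The field type name 
`text_dedup` and the file name `synonyms.txt` are placeholders chosen for 
illustration:
  {{{
  <fieldtype name="text_dedup" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.EnglishPorterFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>
  }}}
  Placing the filter last means that when a synonym and the original token 
stem to the same text at the same position, only one of them is kept.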
  
+ 
+ ==== solr.ISOLatin1AccentFilterFactory ====
+ 
+ Creates `org.apache.lucene.analysis.ISOLatin1AccentFilter`.
+ 
+ Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by 
their unaccented equivalent.
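+ 
+ A minimal configuration sketch; the field type name `text_noaccent` is a 
placeholder chosen for illustration:
+ {{{
+ <fieldtype name="text_noaccent" class="solr.TextField">
+   <analyzer>
+     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
+     <filter class="solr.ISOLatin1AccentFilterFactory"/>
+   </analyzer>
+ </fieldtype>
+ }}}
+ With a chain like this, a token such as "café" would be expected to come out 
as "cafe", so accented and unaccented spellings match each other.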
+
