
The following page has been changed by HossMan:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

The comment on the change is:
initial import from CNET wiki page PI/AnalyzersTokenizersTokenFilters

New page:
= Analyzers, Tokenizers, and Token Filters =

/!\ :TODO: /!\ Package names are all probably wrong and need to be fixed

When a document comes in, individual fields are subject to the analyzing and 
tokenizing filters that can transform the data in the fields: for example, 
removing blank spaces, stripping HTML markup, stemming, or removing a 
particular character and replacing it with another. At index time as well 
as at query time you may need to perform some of the above or similar operations. 
For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] 
transformation (a type of phonetic hashing) on a string to enable a search based 
upon the word and upon its 'sound-alikes'.  

'''Note:''' 
Before continuing with this doc, it's recommended that you read the following 
sections in [http://lucenebook.com/search Lucene In Action]: 
 * 1.5.3 : Analyzer
 * Chapter 4.0 through 4.7 at least 

Try searches for "analyzer", "token", and "stemming".

[[TableOfContents]]


== Stemming ==

Two types of stemming are available to you:
   * [http://tartarus.org/~martin/PorterStemmer/ Porter] or Reduction stemming: 
a transforming algorithm that reduces any of the forms of a word, such as 
"runs, running, ran", to its elemental root, e.g., "run". Porter stemming must 
be performed ''both'' at insertion time and at query time.
   * Expansion stemming: takes a root word and 'expands' it to all of its 
various forms; can be used ''either'' at insertion time ''or'' at query 
time. 
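The contrast between the two approaches can be sketched in a few lines of Python. This is an illustrative toy only, not the actual Porter algorithm; `toy_reduce`, `toy_expand`, and the suffix list are made up for this example:

```python
# Toy "reduction" stemmer: strip a few common English suffixes.
# (The real Porter algorithm uses a much more careful multi-step rule set.)
def toy_reduce(word):
    # Check "ning" before "ing" so "running" -> "run", not "runn".
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    # Irregular forms like "ran" need a dictionary; simple suffix
    # stripping cannot reduce them.
    return word

# Toy "expansion" stemming works the other way: map a root word
# to all of its known forms (e.g., to expand a query term).
EXPANSIONS = {"run": ["run", "runs", "running"]}

def toy_expand(root):
    return EXPANSIONS.get(root, [root])
```

Reduction must be applied on both sides so index and query terms collapse to the same root; expansion only needs to happen on one side, since one side already contains every form.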

== Analyzers ==

Analyzers are components that pre-process input text at index time and/or at  
search time. Because a search string has to be processed the same way that the 
indexed text was processed, ''it is important to use a query-time Analyzer that 
is compatible with the Analyzer used at index time. Using incompatible 
Analyzers will likely produce poor or empty search results.''

The Analyzer class is an abstract class, but Lucene comes with a few concrete 
Analyzers that pre-process their input in different ways. If you need to 
pre-process input text and queries in a way that is not provided by any of 
Lucene's built-in Analyzers, you will need to implement a custom Analyzer.  

== Tokens and Token Filters ==

An analyzer splits up a text field into tokens that the field is indexed by. An 
Analyzer is normally implemented by creating a '''Tokenizer''' that splits up a 
stream (normally a single field value) into a series of tokens. These tokens 
are then passed through Token Filters that add, change, or remove tokens. The 
field is then indexed by the resulting token stream.
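Conceptually, the pipeline looks like the following Python sketch. This is a hypothetical illustration of the data flow, not Lucene's actual `TokenStream` API; all function names here are made up:

```python
# A tokenizer turns a character stream into tokens.
def whitespace_tokenizer(text):
    return text.split()

# Token filters transform a token stream into another token stream.
def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    return [t for t in tokens if t not in stopwords]

# An "analyzer" chains one tokenizer with zero or more filters, in order.
def analyze(text, tokenizer, filters):
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

tokens = analyze("The Quick Fox", whitespace_tokenizer,
                 [lowercase_filter, stop_filter])
# tokens is now ["quick", "fox"]
```

The field is indexed by whatever comes out the end of the chain, which is why filter order matters (e.g., lowercasing before stop-word removal so "The" matches "the").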

== Specifying an Analyzer in the schema ==

A Solr schema.xml file allows two methods for specifying the way a text field 
is analyzed. (Normally only fieldtypes of `solr.TextField` will have Analyzers 
explicitly specified in the schema): 

  1.  Specifying the '''class name''' of an Analyzer — anything extending 
org.apache.lucene.analysis.Analyzer. [[BR]] Example: [[BR]] {{{
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
}}}
  1.  Specifying a '''Tokenizer''' followed by a list of optional !TokenFilters 
that are applied in the listed order. Factories that can create the tokenizers 
or token filters are used to avoid the overhead of creation via reflection. 
[[BR]] Example: [[BR]] {{{
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
}}}

=== TokenizerFactories ===

Solr provides the following  !TokenizerFactories (Tokenizers and !TokenFilters):

==== solr.LetterTokenizerFactory ====

Creates `org.apache.lucene.analysis.LetterTokenizer`. 

Creates tokens consisting of strings of contiguous letters. Any non-letter 
characters will be discarded.

  Example: `"I can't" ==> "I", "can", "t"` 
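The behavior can be approximated with a regular expression. This is a simplification (the real class also recognizes non-ASCII letters), and `letter_tokenize` is a name made up for this sketch:

```python
import re

# Keep maximal runs of letters; everything else is discarded.
def letter_tokenize(text):
    return re.findall(r"[A-Za-z]+", text)

letter_tokenize("I can't")
# -> ['I', 'can', 't']
```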

==== solr.WhitespaceTokenizerFactory ====

Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.

Creates tokens by splitting the character stream on whitespace. 

==== solr.LowerCaseTokenizerFactory ====

Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.

Creates tokens by lowercasing all letters and dropping non-letters.

  Example: `"I can't" ==> "i", "can", "t"`

==== solr.StandardTokenizerFactory ====

Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.

A good general purpose tokenizer that strips many extraneous characters and 
sets token types to meaningful values.  Token types are only useful for 
subsequent token filters that are type-aware.  The StandardFilter is the only 
Lucene filter that utilizes token type.
   
Some token types are number, alphanumeric, email, acronym, URL, etc.

  Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", 
APOSTROPHE:"can't"`

==== solr.HTMLStripWhitespaceTokenizerFactory ====

Strips HTML from the input stream and passes the result to a 
!WhitespaceTokenizer.

HTML stripping features:
 * The input need not be an HTML document as only constructs that look like 
HTML will be removed.
 * Removes HTML/XML tags while keeping the content
   * Attributes within tags are also removed, and attribute quoting is optional.
 * Removes XML processing instructions: <?foo bar?>
 * Removes XML comments
 * Removes XML elements starting with <! and ending with > 
 * Removes contents of <script> and <style> elements.
   * Handles XML comments inside these elements (normal comment processing 
won't always work)
   * Replaces numeric character entities references like &#65; or &#x7f;
     * The terminating ';' is optional if the entity reference is followed by 
whitespace.
   * Replaces all [http://www.w3.org/TR/REC-html40/sgml/entities.html named 
character entity references].
     * &nbsp; is replaced with a space instead of 0xa0
     * terminating ';' is mandatory to avoid false matches on something like 
"Alpha&Omega Corp" 

HTML stripping examples:

|| my <a href="www.foo.bar">link</a> || my link ||
|| <?xml?><br>hello<!--comment--> || hello ||
|| hello<script><-- f('<--internal--></script>'); --></script> || hello ||
|| if a<b then print a; || if a<b then print a; ||
|| hello <td height=22 nowrap align="left"> || hello ||
|| a&lt;b &#65 Alpha&Omega &Omega; || a<b A Alpha&Omega Ω ||
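A greatly simplified sketch of the tag-stripping and entity-replacement behavior, using only the Python standard library. The real tokenizer is far more robust (script/style contents, optional semicolons, attribute quoting, etc.), and `naive_html_strip` is a name invented for this example:

```python
import html
import re

def naive_html_strip(text):
    # Drop anything that looks like a complete tag, comment, or
    # processing instruction (requires a closing '>', so bare "a<b"
    # in plain text is left alone).
    text = re.sub(r"<[^>]*>", "", text)
    # Replace named and numeric character entity references.
    # Note: html.unescape maps &nbsp; to U+00A0, whereas the Solr
    # tokenizer substitutes a plain space.
    return html.unescape(text)

naive_html_strip('my <a href="www.foo.bar">link</a>')
# -> 'my link'
```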


==== solr.HTMLStripStandardTokenizerFactory ====

Strips HTML from the input stream and passes the result to a !StandardTokenizer.

See solr.HTMLStripWhitespaceTokenizerFactory for details on HTML stripping.

=== TokenFilterFactories ===

==== solr.StandardFilterFactory ====

Creates `org.apache.lucene.analysis.standard.StandardFilter`.

Removes dots from acronyms and 's from the end of tokens. Works only on typed 
tokens, i.e., those produced by !StandardTokenizer or equivalent.

  Example of !StandardTokenizer followed by !StandardFilter:
     `"I.B.M. cat's can't" ==> "IBM", "cat", "can't"`
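The per-token behavior can be sketched as follows; this is an illustrative simplification keyed on the token types shown above, and `standard_filter_token` is a name made up for this sketch:

```python
# Only typed tokens are affected, which is why this filter must
# follow a tokenizer (like StandardTokenizer) that assigns types.
def standard_filter_token(token, token_type):
    if token_type == "ACRONYM":
        return token.replace(".", "")     # "I.B.M." -> "IBM"
    if token_type == "APOSTROPHE" and token.endswith("'s"):
        return token[:-2]                 # "cat's" -> "cat"
    return token                          # "can't" is left alone
```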

==== solr.LowerCaseFilterFactory ====

Creates `org.apache.lucene.analysis.LowerCaseFilter`.

Lowercases the letters in each token. Leaves non-letter tokens alone.

  Example: `"I.B.M.", "Solr" ==> "i.b.m.", "solr"`.

==== solr.StopFilterFactory ====

Creates `org.apache.lucene.analysis.StopFilter`.

Discards common words.

The default English stop words are:
{{{
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
}}}
 
A customized stop word list may be specified with the "words" attribute in the 
schema. The file referenced by the words parameter will be loaded by the 
!ClassLoader and hence must be in the classpath.

{{{
<fieldtype name="teststop" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory"/> 
     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
   </analyzer>
</fieldtype>
}}}

==== solr.LengthFilterFactory ====

Creates `solr.LengthFilter`.

Filters out those tokens ''not'' having length min through max inclusive.
{{{
<fieldtype name="lengthfilt" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="5" />
  </analyzer>
</fieldtype>
}}}
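The effect of the filter is a straightforward bound check on token length; a one-line sketch (with a hypothetical function name):

```python
# Keep only tokens whose length falls in [min_len, max_len].
def length_filter(tokens, min_len, max_len):
    return [t for t in tokens if min_len <= len(t) <= max_len]

length_filter(["a", "bb", "ccc", "dddddd"], 2, 5)
# -> ['bb', 'ccc']
```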

==== solr.PorterStemFilterFactory ====

Creates `org.apache.lucene.analysis.PorterStemFilter`.

Standard Lucene implementation of the     
[http://tartarus.org/~martin/PorterStemmer/ Porter Stemming Algorithm], a 
normalization process that removes common endings from words.

  Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".

==== solr.EnglishPorterFilterFactory ====

Creates `solr.EnglishPorterFilter`.

Creates an [http://snowball.tartarus.org/algorithms/english/stemmer.html 
English Porter2 stemmer] from the Java classes generated from a 
[http://snowball.tartarus.org/ Snowball] specification. 

A customized protected word list may be specified with the "protected" 
attribute in the schema. The file referenced will be loaded by the !ClassLoader 
and hence must be in the classpath. Any words in the protected word list will 
not be modified (stemmed).

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/> 
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
  </analyzer>
</fieldtype>
}}}

'''Note:''' Due to performance concerns, this implementation does not utilize 
`org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses 
reflection to stem every word. 

==== solr.WordDelimiterFilterFactory ====

Creates `solr.analysis.WordDelimiterFilter`.

Splits words into subwords and performs optional transformations on subword 
groups.
Words are split into subwords with the following rules:
 * split on intra-word delimiters (by default, all non alpha-numeric 
characters).
   * `"Wi-Fi" -> "Wi", "Fi"`
 * split on case transitions
   * `"PowerShot" -> "Power", "Shot"`
 * split on letter-number transitions
   * `"SD500" -> "SD", "500"`
 * leading and trailing intra-word delimiters on each subword are ignored
   * `"//hello---there, 'dude'" -> "hello", "there", "dude"`
 * trailing "'s" are removed for each subword
   * `"O'Neil's" -> "O", "Neil"`
     * Note: this step isn't performed in a separate filter because of possible 
subword combinations.
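The splitting rules above (minus the catenation options and the trailing-"'s" handling) can be roughly approximated with a single regular expression; the real filter is considerably more involved, and `split_subwords` is a name invented for this sketch:

```python
import re

def split_subwords(word):
    # In order: runs of digits; runs of capitals not followed by a
    # lowercase letter (acronym-style runs like "SD" or "XL"); then an
    # optionally-capitalized lowercase run ("Power", "Shot", "wi").
    # Delimiters such as '-' simply fall between matches.
    return re.findall(r"\d+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word)
```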

There are a number of parameters that affect which tokens are generated and 
whether subwords are combined:
 * '''generateWordParts="1"''' causes parts of words to be generated:
   * `"PowerShot" => "Power" "Shot"`
 * '''generateNumberParts="1"''' causes number subwords to be generated:
   * `"500-42" => "500" "42"`
 * '''catenateWords="1"''' causes maximum runs of word parts to be catenated:
    * `"wi-fi" => "wifi"`
 * '''catenateNumbers="1"''' causes maximum runs of number parts to be catenated:
   * `"500-42" => "50042"`
 * '''catenateAll="1"''' causes all subword parts to be catenated:
   * `"wi-fi-4000" => "wifi4000"`

These parameters may be combined in any way.  
 * Example of generateWordParts="1" and  catenateWords="1":
   * `"PowerShot" -> 0:"Power", 1:"Shot" 1:"PowerShot"` [[BR]] (where 0,1,1 are 
token positions)
   * `"A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"`
   * `"Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 
2:"SuperDuperXL", 3:"500" 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"`

One use for !WordDelimiterFilter is to help match words with 
[:SolrRelevancyCookbook#IntraWordDelimiters:different delimiters].  One way of 
doing so is to specify `generateWordParts="1" catenateWords="1"` in the 
analyzer used for indexing, and `generateWordParts="1"` in the analyzer used 
for querying.  Given that the current !StandardTokenizer immediately removes 
many intra-word delimiters, it is recommended that this filter be used after a 
tokenizer that leaves them in place (such as !WhitespaceTokenizer). 

{{{
    <fieldtype name="subword" class="solr.TextField">
      <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"   
                generateNumberParts="1" 
                catenateWords="0"       
                catenateNumbers="0"     
                catenateAll="0"         
                />
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
      <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"   
                generateNumberParts="1" 
                catenateWords="1"       
                catenateNumbers="1"     
                catenateAll="0"         
                />
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
    </fieldtype>
}}}

==== solr.SynonymFilterFactory ====

Creates `SynonymFilter`.

Matches strings of tokens and replaces them with other strings of tokens.

 * The '''synonyms''' parameter names an external file defining the synonyms.
 * If '''ignoreCase''' is true, tokens are lowercased before checking 
equality.
 * If '''expand''' is true, a synonym will be expanded to all equivalent 
synonyms.  If it is false, all equivalent synonyms will be reduced to the first 
in the list.

Example usage in schema:
{{{
    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" 
                  ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>
}}}

Synonym file format:
{{{
# blank lines and lines starting with pound are comments.

#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS.  These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit

#Equivalent synonyms may be separated with commas and give
#no explicit mapping.  In this case the mapping behavior will
#be taken from the expand parameter in the schema.  This allows
#the same synonym file to be used in different synonym handling strategies.
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz

}}}
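Parsing one line of this format can be sketched in a few lines of Python. This is a minimal illustration only (Solr's own parser additionally handles multi-token matching, entry merging, and the expand option), and `parse_synonym_line` is a name made up here:

```python
def parse_synonym_line(line):
    """Return (inputs, outputs) for one mapping line, or None for
    blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if "=>" in line:
        # Explicit mapping: LHS sequences map to RHS alternatives.
        lhs, rhs = line.split("=>", 1)
    else:
        # Equivalent synonyms: every term maps to the whole group
        # (or to the first term, depending on the expand parameter).
        lhs = rhs = line
    def split(side):
        return [s.strip() for s in side.split(",") if s.strip()]
    return split(lhs), split(rhs)
```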
