Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by MichaelSembWever: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
------------------------------------------------------------------------------

== Overview ==

When a document is indexed, its individual fields are subject to analyzing and tokenizing filters that can transform and normalize the data in the fields: for example, removing blank spaces, stripping HTML markup, stemming, or removing a particular character and replacing it with another. You may need to perform these or similar operations at indexing time as well as at query time. For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.

The lists below provide an overview of '''''some''''' of the more heavily used Tokenizers and !TokenFilters provided by Solr "out of the box", along with tips and examples for using them. '''This list should by no means be considered the complete list of all Analysis classes available in Solr!''' In addition to new classes being added on an ongoing basis, you can load your own custom Analysis code as a [wiki:SolrPlugins Plugin].
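To make the Soundex idea above concrete, a field type along the following lines could be declared in schema.xml. (This is an illustrative sketch: the field type name "phonetic_name" is made up, and it assumes a Solr version that ships `solr.PhoneticFilterFactory`.)

{{{
<fieldtype name="phonetic_name" class="solr.TextField">
  <analyzer>
    <!-- split on whitespace, lowercase, then add phonetic codes
         alongside the original tokens (inject="true") -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  </analyzer>
</fieldtype>
}}}

Because the same chain runs at index and query time, "Smith" and "Smyth" reduce to the same phonetic code and so match each other as 'sound-alikes'.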
For a more complete list of what Tokenizers and !TokenFilters come out of the box, please consult the [http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html javadocs] for the analysis package. If you have any tips/tricks you'd like to mention about using any of these classes, please add them below.

'''Note:''' For a good background on Lucene Analysis, it's recommended that you read the following sections in [http://lucenebook.com/ Lucene In Action]:
 * 1.5.3 : Analyzer
 * Chapter 4.0 through 4.7 at least
Try searches for "analyzer", "token", and "stemming".

...

An analyzer splits up a text field into the tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.

The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase if a custom analyzer is used.

== Specifying an Analyzer in the schema ==

A Solr schema.xml file allows two methods for specifying the way a text field is analyzed. (Normally only field types of `solr.TextField` will have Analyzers explicitly specified in the schema.)

 1. Specifying the '''class name''' of an Analyzer (anything extending org.apache.lucene.analysis.Analyzer). [[BR]] Example: [[BR]]
{{{
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
}}}

...

==== solr.LetterTokenizerFactory ====

Creates `org.apache.lucene.analysis.LetterTokenizer`. Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.

Example: `"I can't" ==> "I", "can", "t"`

[[Anchor(WhitespaceTokenizer)]]
==== solr.WhitespaceTokenizerFactory ====

Creates `org.apache.lucene.analysis.WhitespaceTokenizer`. Creates tokens by splitting on whitespace.

==== solr.LowerCaseTokenizerFactory ====

...

Creates `org.apache.lucene.analysis.standard.StandardTokenizer`. A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware; the !StandardFilter is currently the only Lucene filter that utilizes token types. Some token types are number, alphanumeric, email, acronym, URL, etc.

Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`

...

 * Attributes within tags are also removed, and attribute quoting is optional.
 * Removes XML processing instructions: <?foo bar?>
 * Removes XML comments
 * Removes XML elements starting with <! and ending with >
 * Removes the contents of <script> and <style> elements.
  * Handles XML comments inside these elements (normal comment processing won't always work)
 * Replaces numeric character entity references like {{{&#65;}}} or {{{&#x7f;}}}
  * The terminating ';' is optional if the entity reference is followed by whitespace.
 * Replaces all [http://www.w3.org/TR/REC-html40/sgml/entities.html named character entity references].
  * {{{&nbsp;}}} is replaced with a space instead of 0xa0.
  * The terminating ';' is mandatory, to avoid false matches on something like "Alpha&Omega Corp".

HTML stripping examples:

...

<!> ["Solr1.2"]

Creates `org.apache.solr.analysis.TrimFilter`. Trims whitespace at either end of a token.

Example: `" Kittens! ", "Duck" ==> "Kittens!", "Duck"`

Optionally, the "updateOffsets" attribute will update the start and end position offsets.

[[Anchor(StopFilter)]]

...

{{{
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
}}}

A customized stop word list may be specified with the "words" attribute in the schema. Optionally, the "ignoreCase" attribute may be used to ignore the case of tokens when comparing to the stopword list.

{{{
<fieldtype name="teststop" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
   </analyzer>
</fieldtype>
}}}

...

Creates `solr.EnglishPorterFilter`. Creates an [http://snowball.tartarus.org/algorithms/english/stemmer.html English Porter2 stemmer] from the Java classes generated from a [http://snowball.tartarus.org/ Snowball] specification.
A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by the stemmer.

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
   </analyzer>
</fieldtype>
}}}

'''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses Java reflection to stem every word.

[[Anchor(SnowballPorterFilter)]]
==== solr.SnowballPorterFilterFactory ====

...

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="German" />
   </analyzer>
</fieldtype>
}}}

...

 * '''catenateAll="1"''' causes all subword parts to be catenated:
  * `"wi-fi-4000" => "wifi4000"`

These parameters may be combined in any way.
 * Example of generateWordParts="1" and catenateWords="1":
  * `"PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"` [[BR]] (where 0,1,1 are token positions)
  * `"A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"`
  * `"Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"`

One use for !WordDelimiterFilter is to help match words with [:SolrRelevancyCookbook#IntraWordDelimiters:different delimiters]. One way of doing so is to specify `generateWordParts="1" catenateWords="1"` in the analyzer used for indexing, and `generateWordParts="1"` in the analyzer used for querying. Given that the current !StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as !WhitespaceTokenizer).

{{{
<fieldtype name="subword" class="solr.TextField">
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
    </analyzer>
</fieldtype>
}}}

...

{{{
#and replace with all alternatives on the RHS. These types of mappings
#ignore the expand parameter in the schema.

#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit

#Equivalent synonyms may be separated with commas and give
...
}}}

...

A !ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

For example, the sentence "please divide this sentence into shingles" might be tokenized into the shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".

|| '''arg''' || '''value''' ||
...

{{{
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
}}}

[[Anchor(PositionFilterFactory)]]
==== solr.PositionFilterFactory ====

<!> ["Solr1.4"]

Creates [http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/position/PositionFilter.html org.apache.lucene.analysis.position.PositionFilter].

A !PositionFilter manipulates the position of tokens in the stream. It sets the positionIncrement of all tokens to the configured "positionIncrement" value, except for the first token, which retains its original positionIncrement.

|| '''arg''' || '''value''' ||
|| positionIncrement || default 0 ||

{{{
<filter class="solr.PositionFilterFactory" />
}}}

One example use is when exact-match hits are wanted for ''any'' shingle within the query. (This was done at http://sesam.no to replace three proprietary 'FAST Query-Matching servers' with two open-source Solr indexes; background reading at [http://sesat.no/howto-solr-query-evaluation.html sesat] and on the [http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 mailing list].)
In the query, all words and shingles need to be placed at the same position, so that all shingles are treated as synonyms of each other.

With only the !ShingleFilter, the shingles generated are synonyms only of the first term in each shingle group. For example, the query "abcd efgh ijkl" results in a query like:

("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")

where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh". !ShingleFilter does not offer a way to alter this behaviour.

Using the !PositionFilter in combination makes it possible to make all shingles synonyms of each other. Such a configuration could look like:

{{{
<fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
        <filter class="solr.PositionFilterFactory" />
    </analyzer>
</fieldType>
}}}
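As an informal sketch (not actual Solr debug output) of what this configuration achieves at query time, using the same position:"token" notation as the !WordDelimiterFilter examples above: the !ShingleFilter emits all unigrams and shingles, and the !PositionFilter then moves every token after the first to the same position, so the whole group behaves as one set of synonyms:

{{{
"abcd efgh ijkl" =>
0:"abcd", 0:"abcd efgh", 0:"abcd efgh ijkl",
0:"efgh", 0:"efgh ijkl",
0:"ijkl"
}}}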
