Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "AnalyzersTokenizersTokenFilters" page has been changed by HossMan. The comment on this change is: some wording clean up and section reorg. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=95&rev2=96 -------------------------------------------------- = Analyzers, Tokenizers, and Token Filters = - == Overview == + When a document is indexed, its individual fields are subject to the analyzing and tokenizing filters that can transform and normalize the data in the fields. For example: removing blank spaces, removing HTML code, stemming, or removing a particular character and replacing it with another. At indexing time as well as at query time you may need to do some of the above or similar operations. For example, you might perform a [[http://en.wikipedia.org/wiki/Soundex|Soundex]] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'. The lists below provide an overview of '''''some''''' of the more heavily used Tokenizers and !TokenFilters provided by Solr "out of the box" along with tips/examples of using them. '''This list should by no means be considered the "complete" list of all Analysis classes available in Solr!''' In addition to new classes being added on an ongoing basis, you can load your own custom Analysis code as a [[SolrPlugins|Plugin]]. @@ -18, +18 @@ Try searches for "analyzer", "token", and "stemming". <<TableOfContents>> + + = High Level Concepts = == Stemming == There are four types of stemming strategies: @@ -32, +34 @@ On wildcard and fuzzy searches, no text analysis is performed on the search word. - The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.
+ Most Solr users define custom Analyzers for their text field types consisting of zero or more Char Filter Factories, one Tokenizer Factory, and zero or more Token Filter Factories; but it is also possible to configure a field type to use a concrete Analyzer implementation. + The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase when a configuration based analyzer is used. + - == Char Filters == + === Char Filters === <!> [[Solr1.4]] - Char Filter is a component that pre-processes input characters. It can be chained like as Token Filters and placed in front of a Tokenizer. Char Filters can add, change, or remove characters without worrying about fault of Token offsets. + A Char Filter is a component that pre-processes input characters (consuming and producing a character stream); it can add, change, or remove characters while preserving character position information. + Char Filters can be chained. - == Tokens and Token Filters == - An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits-up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream. - The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase if a custom analyzer is used. + === Tokenizers === + A Tokenizer splits up a stream of characters (from each individual field value) into a series of tokens. + + There can only be one Tokenizer in each Analyzer. + + === Token Filters === + + Tokens produced by the Tokenizer are passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
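As an illustrative sketch of how these three phases chain together (the sample input and the particular factories are chosen only for this example; exact token boundaries depend on the tokenizer used), a single field value might flow through an analyzer like this:

{{{
Input value:  "<b>I.B.M. Cats</b>"
CharFilter   (solr.HTMLStripCharFilterFactory) => "I.B.M. Cats"
Tokenizer    (solr.StandardTokenizerFactory)   => "I.B.M.", "Cats"
TokenFilter  (solr.LowerCaseFilterFactory)     => "i.b.m.", "cats"
}}}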
+ + - == Specifying an Analyzer in the schema == + === Specifying an Analyzer in the schema === A Solr schema.xml file allows two methods for specifying the way a text field is analyzed. (Normally only field types of `solr.TextField` will have Analyzers explicitly specified in the schema): 1. Specifying the '''class name''' of an Analyzer — anything extending org.apache.lucene.analysis.Analyzer. <<BR>> Example: <<BR>> @@ -57, +69 @@ {{{ <fieldtype name="text" class="solr.TextField"> <analyzer> + <charFilter class="solr.HTMLStripCharFilterFactory"/> + <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> @@ -66, +80 @@ </fieldtype> }}} - Any Analyzer, !TokenizerFactory, or !TokenFilterFactory may be specified using its full class name with package -- just make sure they are in Solr's classpath when you start your appserver. Classes in the `org.apache.solr.analysis.*` package can be referenced using the short alias `solr.*`. + Any Analyzer, !CharFilterFactory, !TokenizerFactory, or !TokenFilterFactory may be specified using its full class name with package -- just make sure they are in Solr's classpath when you start your appserver. Classes in the `org.apache.solr.analysis.*` package can be referenced using the short alias `solr.*`. - If you want to use custom Tokenizers or !TokenFilters, you'll need to write a very simple factory that subclasses !BaseTokenizerFactory or !BaseTokenFilterFactory, something like this... + If you want to use custom !CharFilters, Tokenizers, or !TokenFilters, you'll need to write a very simple factory that subclasses !BaseCharFilterFactory, !BaseTokenizerFactory, or !BaseTokenFilterFactory, something like this...
{{{ public class MyCustomFilterFactory extends BaseTokenFilterFactory { @@ -77, +91 @@ } } }}} + + = Notes On Specific Factories = - === CharFilterFactories === + == CharFilterFactories == <!> [[Solr1.4]] - ==== Example ==== - {{{ - <fieldType name="charfilthtmlmap" class="solr.TextField"> - <analyzer> - <charFilter class="solr.HTMLStripCharFilterFactory"/> - <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> - <tokenizer class="solr.WhitespaceTokenizerFactory"/> - </analyzer> - </fieldType> - }}} - ==== solr.MappingCharFilterFactory ==== + === solr.MappingCharFilterFactory === Creates `org.apache.lucene.analysis.MappingCharFilter`. - ==== solr.PatternReplaceCharFilterFactory ==== + === solr.PatternReplaceCharFilterFactory === Creates `org.apache.solr.analysis.PatternReplaceCharFilter`. Applies a regex pattern to the string in the char stream, replacing match occurrences with the specified replacement string. - ==== solr.HTMLStripCharFilterFactory ==== + === solr.HTMLStripCharFilterFactory === Creates `org.apache.solr.analysis.HTMLStripCharFilter`. `HTMLStripCharFilter` strips HTML from the input stream and passes the result to either another `CharFilter` or a `Tokenizer`. HTML stripping features: @@ -125, +131 @@ - - === TokenizerFactories === + == TokenizerFactories == Solr provides the following !TokenizerFactories (Tokenizers and !TokenFilters): - ==== solr.KeywordTokenizerFactory ==== + === solr.KeywordTokenizerFactory === Creates `org.apache.lucene.analysis.core.KeywordTokenizer`. Treats the entire field as a single token, regardless of its content. . Example: `"http://example.com/I-am+example?Text=-Hello" ==> "http://example.com/I-am+example?Text=-Hello"` - ==== solr.LetterTokenizerFactory ==== + === solr.LetterTokenizerFactory === Creates `org.apache.lucene.analysis.LetterTokenizer`. Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.
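An illustrative example of the letters-only rule (the input string is made up for this sketch; apostrophes and digits are non-letters and so split or drop tokens):

. Example: `"I can't read 2 books" ==> "I", "can", "t", "read", "books"`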
@@ -145, +150 @@ <<Anchor(WhitespaceTokenizer)>> - ==== solr.WhitespaceTokenizerFactory ==== + === solr.WhitespaceTokenizerFactory === Creates `org.apache.lucene.analysis.WhitespaceTokenizer`. Creates tokens of characters separated by splitting on whitespace. - ==== solr.LowerCaseTokenizerFactory ==== + === solr.LowerCaseTokenizerFactory === Creates `org.apache.lucene.analysis.LowerCaseTokenizer`. Creates tokens by lowercasing all letters and dropping non-letters. @@ -159, +164 @@ <<Anchor(StandardTokenizer)>> - ==== solr.StandardTokenizerFactory ==== + === solr.StandardTokenizerFactory === Creates `org.apache.lucene.analysis.standard.StandardTokenizer`. A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The !StandardFilter is currently the only Lucene filter that utilizes token types. @@ -170, +175 @@ <<Anchor(HTMLStripWhitespaceTokenizer)>> - ==== solr.HTMLStripWhitespaceTokenizerFactory ==== + === solr.HTMLStripWhitespaceTokenizerFactory === Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer. See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping. - ==== solr.HTMLStripStandardTokenizerFactory ==== + === solr.HTMLStripStandardTokenizerFactory === Strips HTML from the input stream and passes the result to a !StandardTokenizer. See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping. - ==== solr.PatternTokenizerFactory ==== + === solr.PatternTokenizerFactory === Breaks text at the specified regular expression pattern. For example, you have a list of terms, delimited by a semicolon and zero or more spaces: `mice; kittens; dogs`. @@ -194, +199 @@ }}} See the javadoc for details. 
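To contrast several of the tokenizers above, here is an illustrative trace of one made-up input run through each (token boundaries follow the descriptions above):

{{{
Input: "Hello, World-Wide Web!"
solr.WhitespaceTokenizerFactory => "Hello,", "World-Wide", "Web!"
solr.LetterTokenizerFactory     => "Hello", "World", "Wide", "Web"
solr.LowerCaseTokenizerFactory  => "hello", "world", "wide", "web"
}}}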
- === TokenFilterFactories === + == TokenFilterFactories == + + <<Anchor(StandardFilter)>> - - ==== solr.StandardFilterFactory ==== + === solr.StandardFilterFactory === Creates `org.apache.lucene.analysis.standard.StandardFilter`. Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens, i.e., those produced by !StandardTokenizer or equivalent. @@ -207, +213 @@ <<Anchor(LowerCaseFilter)>> - ==== solr.LowerCaseFilterFactory ==== + === solr.LowerCaseFilterFactory === Creates `org.apache.lucene.analysis.LowerCaseFilter`. Lowercases the letters in each token. Leaves non-letter tokens alone. @@ -216, +222 @@ <<Anchor(TrimFilter)>> - ==== solr.TrimFilterFactory ==== + === solr.TrimFilterFactory === <!> [[Solr1.2]] Creates `org.apache.solr.analysis.TrimFilter`. @@ -229, +235 @@ <<Anchor(StopFilter)>> - ==== solr.StopFilterFactory ==== + === solr.StopFilterFactory === Creates `org.apache.lucene.analysis.StopFilter`. Discards common words. @@ -255, +261 @@ }}} <<Anchor(CommonGramsFilter)>> - ==== solr.CommonGramsFilterFactory ==== + === solr.CommonGramsFilterFactory === Creates `org.apache.solr.analysis.CommonGramsFilter`. <!> [[Solr1.4]] Makes shingles (e.g. the_cat) by combining common tokens (usually the same as the stop words list) and regular tokens. CommonGramsFilter is useful for issuing phrase queries (e.g. "the cat") that contain stop words. Normally phrases containing stop words would not match their intended target and instead, the query "the cat" would match all documents containing "cat", which can be undesirable behavior. Phrase query slop (e.g., "the cat"~2) will not function as intended because common grams are indexed as shingled tokens that are adjacent to each other (i.e. the_cat is indexed as a single term). The CommonGramsQueryFilter converts the phrase query "the cat" into the single term query the_cat.
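As an illustrative sketch (assuming "the" is in the common words list), CommonGramsFilter leaves the original tokens in place and adds the shingled token alongside them:

{{{
Input tokens:             "the", "cat"
CommonGramsFilter output: "the", "the_cat", "cat"
}}}

At index time this allows the phrase "the cat" to match via the single term the_cat.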
@@ -274, +280 @@ <<Anchor(EdgeNGramFilter)>> - ==== solr.EdgeNGramFilterFactory ==== + === solr.EdgeNGramFilterFactory === Creates `org.apache.solr.analysis.EdgeNGramTokenFilter`. By default, creates n-grams from the beginning edge of an input token. @@ -297, +303 @@ }}} <<Anchor(KeepWordFilter)>> - ==== solr.KeepWordFilterFactory ==== + === solr.KeepWordFilterFactory === Creates `org.apache.solr.analysis.KeepWordFilter`. <!> [[Solr1.3]] Keeps only words that are on a list. This is the inverse behavior of StopFilterFactory. The word file format is identical. @@ -311, +317 @@ }}} <<Anchor(LengthFilter)>> - ==== solr.LengthFilterFactory ==== + === solr.LengthFilterFactory === Creates `solr.LengthFilter`. Filters out those tokens *not* having length min through max inclusive. @@ -326, +332 @@ }}} <<Anchor(WordDelimiterFilter)>> - ==== solr.WordDelimiterFilterFactory ==== + === solr.WordDelimiterFilterFactory === Creates `solr.analysis.WordDelimiterFilter`. Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules: @@ -426, +432 @@ custom character categories. An example file is in subversion [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/test/test-files/solr/conf/wdftypes.txt|here]]. <<Anchor(SynonymFilter)>> - ==== solr.SynonymFilterFactory ==== + === solr.SynonymFilterFactory === Creates `SynonymFilter`. Matches strings of tokens and replaces them with other strings of tokens. @@ -492, +498 @@ <<Anchor(RemoveDuplicatesTokenFilter)>> - ==== solr.RemoveDuplicatesTokenFilterFactory ==== + === solr.RemoveDuplicatesTokenFilterFactory === Creates `org.apache.solr.analysis.RemoveDuplicatesTokenFilter`. Filters out any tokens which are at the same logical position in the tokenstream as a previous token with the same text. This can arise in a number of situations depending on what the "up stream" token filters are -- notably when stemming synonyms with similar roots.
It is useful to remove the duplicates to prevent `idf` inflation at index time, or `tf` inflation (in a !MultiPhraseQuery) at query time. <<Anchor(ISOLatin1AccentFilter)>> - ==== solr.ISOLatin1AccentFilterFactory ==== + === solr.ISOLatin1AccentFilterFactory === Creates `org.apache.lucene.analysis.ISOLatin1AccentFilter`. Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Note that this is deprecated in favor of !ASCIIFoldingFilterFactory. <<Anchor(ASCIIFoldingFilterFactory)>> - ==== solr.ASCIIFoldingFilterFactory ==== + === solr.ASCIIFoldingFilterFactory === Creates `org.apache.lucene.analysis.ASCIIFoldingFilter`. Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. @@ -519, +525 @@ <<Anchor(PhoneticFilterFactory)>> - ==== solr.PhoneticFilterFactory ==== + === solr.PhoneticFilterFactory === <!> [[Solr1.2]] Creates `org.apache.solr.analysis.PhoneticFilter`. @@ -538, +544 @@ }}} <<Anchor(ShingleFilterFactory)>> - ==== solr.ShingleFilterFactory ==== + === solr.ShingleFilterFactory === <!> [[Solr1.3]] Creates [[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html|org.apache.lucene.analysis.shingle.ShingleFilter]]. @@ -561, +567 @@ }}} <<Anchor(PositionFilterFactory)>> - ==== solr.PositionFilterFactory ==== + === solr.PositionFilterFactory === <!> [[Solr1.4]] Creates [[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/position/PositionFilter.html|org.apache.lucene.analysis.position.PositionFilter]].
@@ -618, +624 @@ }}} <<Anchor(ReversedWildcardFilterFactory)>> - ==== solr.ReversedWildcardFilterFactory ==== + === solr.ReversedWildcardFilterFactory === <!> [[Solr1.4]] A filter that reverses tokens to provide faster leading wildcard and prefix queries. Add this filter to the index analyzer, but not the query analyzer. The standard Solr query parser (SolrQuerySyntax) will use this to reverse wildcard and prefix queries to improve performance (for example, translating myfield:*foo into myfield:oof*). To avoid collisions and false matches, reversed tokens are indexed with a prefix that should not otherwise appear in indexed text. @@ -627, +633 @@ <<Anchor(CollationKeyFilterFactory)>> - ==== solr.CollationKeyFilterFactory ==== + === solr.CollationKeyFilterFactory === <!> [[Solr1.5]] A filter that lets one specify: