Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "AnalyzersTokenizersTokenFilters" page has been changed by HossMan. The comment on this change is: some wording clean up and section reorg. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=95&rev2=96 -------------------------------------------------- = Analyzers, Tokenizers, and Token Filters = - == Overview == + When a document is indexed, its individual fields are subject to the analyzing and tokenizing filters that can transform and normalize the data in the fields. For example: removing blank spaces, removing HTML code, stemming, or removing a particular character and replacing it with another. At indexing time as well as at query time you may need to do some of the above or similar operations. For example, you might perform a [[http://en.wikipedia.org/wiki/Soundex|Soundex]] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'. The lists below provide an overview of '''''some''''' of the more heavily used Tokenizers and !TokenFilters provided by Solr "out of the box" along with tips/examples of using them. '''This list should by no means be considered the "complete" list of all Analysis classes available in Solr!''' In addition to new classes being added on an ongoing basis, you can load your own custom Analysis code as a [[SolrPlugins|Plugin]]. @@ -18, +18 @@ Try searches for "analyzer", "token", and "stemming". <<TableOfContents>> + + = High Level Concepts = == Stemming == There are four types of stemming strategies: @@ -32, +34 @@ On wildcard and fuzzy searches, no text analysis is performed on the search word. - The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.
+ Most Solr users define custom Analyzers for their text field types consisting of zero or more Char Filter Factories, one Tokenizer Factory, and zero or more Token Filter Factories; but it is also possible to configure a field type to use a concrete Analyzer implementation. + The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase when a configuration based analyzer is used. + - == Char Filters == + === Char Filters === <!> [[Solr1.4]] - Char Filter is a component that pre-processes input characters. It can be chained like as Token Filters and placed in front of a Tokenizer. Char Filters can add, change, or remove characters without worrying about fault of Token offsets. + A Char Filter is a component that pre-processes input characters (consuming and producing a character stream); it can add, change, or remove characters while preserving character position information. + Char Filters can be chained. - == Tokens and Token Filters == - An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits-up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream. - The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase if a custom analyzer is used. + === Tokenizers === + A Tokenizer splits up a stream of characters (from each individual field value) into a series of tokens. + + There can only be one Tokenizer in each Analyzer. + + === Token Filters === + + Tokens produced by the Tokenizer are passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
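As an illustrative sketch of how these three phases chain together (the sample input and the particular factories are chosen only for this example; exact token boundaries depend on the tokenizer used), a single field value might flow through an analyzer like this:

{{{
Input value:  "<b>I.B.M. Cats</b>"
CharFilter   (solr.HTMLStripCharFilterFactory) => "I.B.M. Cats"
Tokenizer    (solr.StandardTokenizerFactory)   => "I.B.M.", "Cats"
TokenFilter  (solr.LowerCaseFilterFactory)     => "i.b.m.", "cats"
}}}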
+ + - == Specifying an Analyzer in the schema == + === Specifying an Analyzer in the schema === A Solr schema.xml file allows two methods for specifying the way a text field is analyzed. (Normally only field types of `solr.TextField` will have Analyzers explicitly specified in the schema): 1. Specifying the '''class name''' of an Analyzer — anything extending org.apache.lucene.analysis.Analyzer. <<BR>> Example: <<BR>> @@ -57, +69 @@ {{{ <fieldtype name="text" class="solr.TextField"> <analyzer> + <charFilter class="solr.HTMLStripCharFilterFactory"/> + <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> @@ -66, +80 @@ </fieldtype> }}} - Any Analyzer, !TokenizerFactory, or !TokenFilterFactory may be specified using its full class name with package -- just make sure they are in Solr's classpath when you start your appserver. Classes in the `org.apache.solr.analysis.*` package can be referenced using the short alias `solr.*`. + Any Analyzer, !CharFilterFactory, !TokenizerFactory, or !TokenFilterFactory may be specified using its full class name with package -- just make sure they are in Solr's classpath when you start your appserver. Classes in the `org.apache.solr.analysis.*` package can be referenced using the short alias `solr.*`. - If you want to use custom Tokenizers or !TokenFilters, you'll need to write a very simple factory that subclasses !BaseTokenizerFactory or !BaseTokenFilterFactory, something like this... + If you want to use custom !CharFilters, Tokenizers, or !TokenFilters, you'll need to write a very simple factory that subclasses !BaseCharFilterFactory, !BaseTokenizerFactory, or !BaseTokenFilterFactory, something like this...
{{{ public class MyCustomFilterFactory extends BaseTokenFilterFactory { @@ -77, +91 @@ } } }}} + + = Notes On Specific Factories = - === CharFilterFactories === + == CharFilterFactories == <!> [[Solr1.4]] - ==== Example ==== - {{{ - <fieldType name="charfilthtmlmap" class="solr.TextField"> - <analyzer> - <charFilter class="solr.HTMLStripCharFilterFactory"/> - <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> - <tokenizer class="solr.WhitespaceTokenizerFactory"/> - </analyzer> - </fieldType> - }}} - ==== solr.MappingCharFilterFactory ==== + === solr.MappingCharFilterFactory === Creates `org.apache.lucene.analysis.MappingCharFilter`. - ==== solr.PatternReplaceCharFilterFactory ==== + === solr.PatternReplaceCharFilterFactory === Creates `org.apache.solr.analysis.PatternReplaceCharFilter`. Applies a regex pattern to the string in the char stream, replacing match occurrences with the specified replacement string. - ==== solr.HTMLStripCharFilterFactory ==== + === solr.HTMLStripCharFilterFactory === Creates `org.apache.solr.analysis.HTMLStripCharFilter`. `HTMLStripCharFilter` strips HTML from the input stream and passes the result to either another `CharFilter` or a `Tokenizer`. HTML stripping features: @@ -125, +131 @@ - - === TokenizerFactories === + == TokenizerFactories == Solr provides the following !TokenizerFactories (Tokenizers and !TokenFilters): - ==== solr.KeywordTokenizerFactory ==== + === solr.KeywordTokenizerFactory === Creates `org.apache.lucene.analysis.core.KeywordTokenizer`. Treats the entire field as a single token, regardless of its content. . Example: `"http://example.com/I-am+example?Text=-Hello" ==> "http://example.com/I-am+example?Text=-Hello"` - ==== solr.LetterTokenizerFactory ==== + === solr.LetterTokenizerFactory === Creates `org.apache.lucene.analysis.LetterTokenizer`. Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.
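An illustrative example of the letters-only rule (the input string is made up for this sketch; apostrophes and digits are non-letters and so split or drop tokens):

. Example: `"I can't read 2 books" ==> "I", "can", "t", "read", "books"`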
@@ -145, +150 @@ <<Anchor(WhitespaceTokenizer)>> - ==== solr.WhitespaceTokenizerFactory ==== + === solr.WhitespaceTokenizerFactory === Creates `org.apache.lucene.analysis.WhitespaceTokenizer`. Creates tokens of characters separated by splitting on whitespace. - ==== solr.LowerCaseTokenizerFactory ==== + === solr.LowerCaseTokenizerFactory === Creates `org.apache.lucene.analysis.LowerCaseTokenizer`. Creates tokens by lowercasing all letters and dropping non-letters. @@ -159, +164 @@ <<Anchor(StandardTokenizer)>> - ==== solr.StandardTokenizerFactory ==== + === solr.StandardTokenizerFactory === Creates `org.apache.lucene.analysis.standard.StandardTokenizer`. A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The !StandardFilter is currently the only Lucene filter that utilizes token types. @@ -170, +175 @@ <<Anchor(HTMLStripWhitespaceTokenizer)>> - ==== solr.HTMLStripWhitespaceTokenizerFactory ==== + === solr.HTMLStripWhitespaceTokenizerFactory === Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer. See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping. - ==== solr.HTMLStripStandardTokenizerFactory ==== + === solr.HTMLStripStandardTokenizerFactory === Strips HTML from the input stream and passes the result to a !StandardTokenizer. See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping. - ==== solr.PatternTokenizerFactory ==== + === solr.PatternTokenizerFactory === Breaks text at the specified regular expression pattern. For example, you have a list of terms, delimited by a semicolon and zero or more spaces: `mice; kittens; dogs`. @@ -194, +199 @@ }}} See the javadoc for details. 
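To contrast several of the tokenizers above, here is an illustrative trace of one made-up input run through each (token boundaries follow the descriptions above):

{{{
Input: "Hello, World-Wide Web!"
solr.WhitespaceTokenizerFactory => "Hello,", "World-Wide", "Web!"
solr.LetterTokenizerFactory     => "Hello", "World", "Wide", "Web"
solr.LowerCaseTokenizerFactory  => "hello", "world", "wide", "web"
}}}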
- === TokenFilterFactories === + == TokenFilterFactories == + + <<Anchor(StandardFilter)>> - - ==== solr.StandardFilterFactory ==== + === solr.StandardFilterFactory === Creates `org.apache.lucene.analysis.standard.StandardFilter`. Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens, i.e., those produced by !StandardTokenizer or equivalent. @@ -207, +213 @@ <<Anchor(LowerCaseFilter)>> - ==== solr.LowerCaseFilterFactory ==== + === solr.LowerCaseFilterFactory === Creates `org.apache.lucene.analysis.LowerCaseFilter`. Lowercases the letters in each token. Leaves non-letter tokens alone. @@ -216, +222 @@ <<Anchor(TrimFilter)>> - ==== solr.TrimFilterFactory ==== + === solr.TrimFilterFactory === <!> [[Solr1.2]] Creates `org.apache.solr.analysis.TrimFilter`. @@ -229, +235 @@ <<Anchor(StopFilter)>> - ==== solr.StopFilterFactory ==== + === solr.StopFilterFactory === Creates `org.apache.lucene.analysis.StopFilter`. Discards common words. @@ -255, +261 @@ }}} <<Anchor(CommonGramsFilter)>> - ==== solr.CommonGramsFilterFactory ==== + === solr.CommonGramsFilterFactory === Creates `org.apache.solr.analysis.CommonGramsFilter`. <!> [[Solr1.4]] Makes shingles (e.g. the_cat) by combining common tokens (usually the same as the stop words list) and regular tokens. CommonGramsFilter is useful for issuing phrase queries (e.g. "the cat") that contain stop words. Normally phrases containing stop words would not match their intended target and instead, the query "the cat" would match all documents containing "cat", which can be undesirable behavior. Phrase query slop (e.g., "the cat"~2) will not function as intended because common grams are indexed as shingled tokens that are adjacent to each other (i.e. the_cat is indexed as a single term). The CommonGramsQueryFilter converts the phrase query "the cat" into the single term query the_cat.
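As an illustrative sketch (assuming "the" is in the common words list), CommonGramsFilter leaves the original tokens in place and adds the shingled token alongside them:

{{{
Input tokens:             "the", "cat"
CommonGramsFilter output: "the", "the_cat", "cat"
}}}

At index time this allows the phrase "the cat" to match via the single term the_cat.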
@@ -274, +280 @@ <<Anchor(EdgeNGramFilter)>> - ==== solr.EdgeNGramFilterFactory ==== + === solr.EdgeNGramFilterFactory === Creates `org.apache.solr.analysis.EdgeNGramTokenFilter`. By default, creates n-grams from the beginning edge of an input token. @@ -297, +303 @@ }}} <<Anchor(KeepWordFilter)>> - ==== solr.KeepWordFilterFactory ==== + === solr.KeepWordFilterFactory === Creates `org.apache.solr.analysis.KeepWordFilter`. <!> [[Solr1.3]] Keeps only words that are on a list. This is the inverse behavior of StopFilterFactory. The word file format is identical. @@ -311, +317 @@ }}} <<Anchor(LengthFilter)>> - ==== solr.LengthFilterFactory ==== + === solr.LengthFilterFactory === Creates `solr.LengthFilter`. Filters out those tokens *not* having length min through max inclusive. @@ -326, +332 @@ }}} <<Anchor(WordDelimiterFilter)>> - ==== solr.WordDelimiterFilterFactory ==== + === solr.WordDelimiterFilterFactory === Creates `solr.analysis.WordDelimiterFilter`. Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules: @@ -426, +432 @@ custom character categories. An example file is in subversion [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/test/test-files/solr/conf/wdftypes.txt|here]]. <<Anchor(SynonymFilter)>> - ==== solr.SynonymFilterFactory ==== + === solr.SynonymFilterFactory === Creates `SynonymFilter`. Matches strings of tokens and replaces them with other strings of tokens. @@ -492, +498 @@ <<Anchor(RemoveDuplicatesTokenFilter)>> - ==== solr.RemoveDuplicatesTokenFilterFactory ==== + === solr.RemoveDuplicatesTokenFilterFactory === Creates `org.apache.solr.analysis.RemoveDuplicatesTokenFilter`. Filters out any tokens which are at the same logical position in the tokenstream as a previous token with the same text. This can arise in a number of situations depending on what the "up stream" token filters are -- notably when stemming synonyms with similar roots.
It is useful to remove the duplicates to prevent `idf` inflation at index time, or `tf` inflation (in a !MultiPhraseQuery) at query time. <<Anchor(ISOLatin1AccentFilter)>> - ==== solr.ISOLatin1AccentFilterFactory ==== + === solr.ISOLatin1AccentFilterFactory === Creates `org.apache.lucene.analysis.ISOLatin1AccentFilter`. Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Note that this is deprecated in favor of !ASCIIFoldingFilterFactory. <<Anchor(ASCIIFoldingFilterFactory)>> - ==== solr.ASCIIFoldingFilterFactory ==== + === solr.ASCIIFoldingFilterFactory === Creates `org.apache.lucene.analysis.ASCIIFoldingFilter`. Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. @@ -519, +525 @@ <<Anchor(PhoneticFilterFactory)>> - ==== solr.PhoneticFilterFactory ==== + === solr.PhoneticFilterFactory === <!> [[Solr1.2]] Creates `org.apache.solr.analysis.PhoneticFilter`. @@ -538, +544 @@ }}} <<Anchor(ShingleFilterFactory)>> - ==== solr.ShingleFilterFactory ==== + === solr.ShingleFilterFactory === <!> [[Solr1.3]] Creates [[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html|org.apache.lucene.analysis.shingle.ShingleFilter]]. @@ -561, +567 @@ }}} <<Anchor(PositionFilterFactory)>> - ==== solr.PositionFilterFactory ==== + === solr.PositionFilterFactory === <!> [[Solr1.4]] Creates [[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/position/PositionFilter.html|org.apache.lucene.analysis.position.PositionFilter]].
@@ -618, +624 @@ }}} <<Anchor(ReversedWildcardFilterFactory)>> - ==== solr.ReversedWildcardFilterFactory ==== + === solr.ReversedWildcardFilterFactory === <!> [[Solr1.4]] A filter that reverses tokens to provide faster leading wildcard and prefix queries. Add this filter to the index analyzer, but not the query analyzer. The standard Solr query parser (SolrQuerySyntax) will use this to reverse wildcard and prefix queries to improve performance (for example, translating myfield:*foo into myfield:oof*). To avoid collisions and false matches, reversed tokens are indexed with a prefix that should not otherwise appear in indexed text. @@ -627, +633 @@ <<Anchor(CollationKeyFilterFactory)>> - ==== solr.CollationKeyFilterFactory ==== + === solr.CollationKeyFilterFactory === <!> [[Solr1.5]] A filter that lets one specify: