Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by MichaelSembWever: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
------------------------------------------------------------------------------

== Overview ==

When a document is indexed, its individual fields are subject to analyzing and tokenizing filters that can transform and normalize the data in the fields: for example, removing blank spaces, stripping HTML markup, stemming, or removing a particular character and replacing it with another. You may need to perform these or similar operations at indexing time as well as at query time. For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.

The lists below provide an overview of '''''some''''' of the more heavily used Tokenizers and !TokenFilters provided by Solr "out of the box", along with tips and examples for using them. '''This list should by no means be considered the complete list of all Analysis classes available in Solr!''' In addition to new classes being added on an ongoing basis, you can load your own custom Analysis code as a [wiki:SolrPlugins Plugin].
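To make the Soundex idea above concrete, a field type along the following lines could be declared in schema.xml. (This is an illustrative sketch: the field type name "phonetic_name" is made up, and it assumes a Solr version that ships `solr.PhoneticFilterFactory`.)

{{{
<fieldtype name="phonetic_name" class="solr.TextField">
  <analyzer>
    <!-- split on whitespace, lowercase, then add phonetic codes
         alongside the original tokens (inject="true") -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  </analyzer>
</fieldtype>
}}}

Because the same chain runs at index and query time, "Smith" and "Smyth" reduce to the same phonetic code and so match each other as 'sound-alikes'.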
For a more complete list of what Tokenizers and !TokenFilters come out of the box, please consult the [http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html javadocs] for the analysis package. If you have any tips/tricks you'd like to mention about using any of these classes, please add them below.

'''Note:''' For a good background on Lucene Analysis, it's recommended that you read the following sections in [http://lucenebook.com/ Lucene In Action]:
 * 1.5.3 : Analyzer
 * Chapter 4.0 through 4.7 at least
Try searches for "analyzer", "token", and "stemming".

...

An analyzer splits up a text field into the tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.

The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase if a custom analyzer is used.

== Specifying an Analyzer in the schema ==

A Solr schema.xml file allows two methods for specifying the way a text field is analyzed. (Normally only field types of `solr.TextField` will have Analyzers explicitly specified in the schema.)

 1. Specifying the '''class name''' of an Analyzer (anything extending org.apache.lucene.analysis.Analyzer). [[BR]] Example: [[BR]]
{{{
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
}}}

...

==== solr.LetterTokenizerFactory ====

Creates `org.apache.lucene.analysis.LetterTokenizer`. Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.

Example: `"I can't" ==> "I", "can", "t"`

[[Anchor(WhitespaceTokenizer)]]
==== solr.WhitespaceTokenizerFactory ====

Creates `org.apache.lucene.analysis.WhitespaceTokenizer`. Creates tokens by splitting on whitespace.

==== solr.LowerCaseTokenizerFactory ====

...

Creates `org.apache.lucene.analysis.standard.StandardTokenizer`. A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware; the !StandardFilter is currently the only Lucene filter that utilizes token types. Some token types are number, alphanumeric, email, acronym, URL, etc.

Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`

...

 * Attributes within tags are also removed, and attribute quoting is optional.
 * Removes XML processing instructions: <?foo bar?>
 * Removes XML comments
 * Removes XML elements starting with <! and ending with >
 * Removes the contents of <script> and <style> elements.
  * Handles XML comments inside these elements (normal comment processing won't always work)
 * Replaces numeric character entity references like {{{&#65;}}} or {{{&#x7f;}}}
  * The terminating ';' is optional if the entity reference is followed by whitespace.
 * Replaces all [http://www.w3.org/TR/REC-html40/sgml/entities.html named character entity references].
  * {{{&nbsp;}}} is replaced with a space instead of 0xa0.
  * The terminating ';' is mandatory, to avoid false matches on something like "Alpha&Omega Corp".

HTML stripping examples:

...

<!> ["Solr1.2"]

Creates `org.apache.solr.analysis.TrimFilter`. Trims whitespace at either end of a token.

Example: `" Kittens! ", "Duck" ==> "Kittens!", "Duck"`

Optionally, the "updateOffsets" attribute will update the start and end position offsets.

[[Anchor(StopFilter)]]

...

{{{
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
}}}

A customized stop word list may be specified with the "words" attribute in the schema. Optionally, the "ignoreCase" attribute may be used to ignore the case of tokens when comparing to the stopword list.

{{{
<fieldtype name="teststop" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
   </analyzer>
</fieldtype>
}}}

...

Creates `solr.EnglishPorterFilter`. Creates an [http://snowball.tartarus.org/algorithms/english/stemmer.html English Porter2 stemmer] from the Java classes generated from a [http://snowball.tartarus.org/ Snowball] specification.
A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by the stemmer.

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
   </analyzer>
</fieldtype>
}}}

'''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses Java reflection to stem every word.

[[Anchor(SnowballPorterFilter)]]
==== solr.SnowballPorterFilterFactory ====

...

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="German" />
   </analyzer>
</fieldtype>
}}}

...

 * '''catenateAll="1"''' causes all subword parts to be catenated:
  * `"wi-fi-4000" => "wifi4000"`

These parameters may be combined in any way.
 * Example of generateWordParts="1" and catenateWords="1":
  * `"PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"` [[BR]] (where 0,1,1 are token positions)
  * `"A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"`
  * `"Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"`

One use for !WordDelimiterFilter is to help match words with [:SolrRelevancyCookbook#IntraWordDelimiters:different delimiters]. One way of doing so is to specify `generateWordParts="1" catenateWords="1"` in the analyzer used for indexing, and `generateWordParts="1"` in the analyzer used for querying. Given that the current !StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as !WhitespaceTokenizer).

{{{
<fieldtype name="subword" class="solr.TextField">
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
    </analyzer>
</fieldtype>
}}}

...

{{{
#and replace with all alternatives on the RHS. These types of mappings
#ignore the expand parameter in the schema.

#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit

#Equivalent synonyms may be separated with commas and give
...
}}}

...

A !ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

For example, the sentence "please divide this sentence into shingles" might be tokenized into the shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".

|| '''arg''' || '''value''' ||
...

{{{
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
}}}

[[Anchor(PositionFilterFactory)]]
==== solr.PositionFilterFactory ====

<!> ["Solr1.4"]

Creates [http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/position/PositionFilter.html org.apache.lucene.analysis.position.PositionFilter].

A !PositionFilter manipulates the position of tokens in the stream. It sets the positionIncrement of all tokens to the configured "positionIncrement" value, except for the first token, which retains its original positionIncrement.

|| '''arg''' || '''value''' ||
|| positionIncrement || default 0 ||

{{{
<filter class="solr.PositionFilterFactory" />
}}}

One example use is when exact-match hits are wanted for ''any'' shingle within the query. (This was done at http://sesam.no to replace three proprietary 'FAST Query-Matching servers' with two open-source Solr indexes; background reading at [http://sesat.no/howto-solr-query-evaluation.html sesat] and on the [http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 mailing list].)
In the query, all words and shingles need to be placed at the same position, so that all shingles are treated as synonyms of each other.

With only the !ShingleFilter, the shingles generated are synonyms only of the first term in each shingle group. For example, the query "abcd efgh ijkl" results in a query like:

("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")

where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh". !ShingleFilter does not offer a way to alter this behaviour.

Using the !PositionFilter in combination makes it possible to make all shingles synonyms of each other. Such a configuration could look like:

{{{
<fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
        <filter class="solr.PositionFilterFactory" />
    </analyzer>
</fieldType>
}}}
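As an informal sketch (not actual Solr debug output) of what this configuration achieves at query time, using the same position:"token" notation as the !WordDelimiterFilter examples above: the !ShingleFilter emits all unigrams and shingles, and the !PositionFilter then moves every token after the first to the same position, so the whole group behaves as one set of synonyms:

{{{
"abcd efgh ijkl" =>
0:"abcd", 0:"abcd efgh", 0:"abcd efgh ijkl",
0:"efgh", 0:"efgh ijkl",
0:"ijkl"
}}}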
