Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

------------------------------------------------------------------------------
= Analyzers, Tokenizers, and Token Filters =
- /!\ :TODO: /!\ Package names are all probably wrong and need fixed
- 
- When a document comes in, individual fields are subject to the analyzing and tokenizing filters that can transform the data in the fields. For example — removing blank spaces, removing html code, stemming, removing a particular character and replacing it with another. At collection time as well as at query time you may need to do some of the above or similiar operations. For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] transformation (a type of phonic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.
+ When a document is indexed, its individual fields are subject to the analyzing and tokenizing filters that can transform and normalize the data in the fields. For example: removing blank spaces, removing HTML code, stemming, removing a particular character and replacing it with another. At indexing time as well as at query time you may need to do some of the above or similar operations. For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.
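As a rough illustration of such a phonetic transformation in a Solr field type (a sketch only: the factory name {{{solr.PhoneticFilterFactory}}} and its {{{encoder}}} attribute are assumptions taken from later Solr releases and may not exist in the version this page describes), an analysis chain might look like:

{{{
<!-- sketch: tokenize on whitespace, then hash each token Soundex-style -->
<fieldtype name="phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- assumed filter name/attribute; verify against your Solr release -->
    <filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
  </analyzer>
</fieldtype>
}}}

With such a chain applied at both index and query time, "Smith" and "Smyth" would hash to the same token and therefore match each other.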
'''Note:'''
- Before continuing with this doc, it's recommended that you read the following sections in [http://lucenebook.com/search Lucene In Action]:
+ For a good background on Lucene analysis, it's recommended that you read the following sections in [http://lucenebook.com/search Lucene In Action]:
 * 1.5.3 : Analyzer
 * Chapter 4.0 through 4.7 at least

@@ -18, +16 @@

== Stemming ==
- Two types of stemming are available to you:
+ There are two types of stemming strategies:
 * [http://tartarus.org/~martin/PorterStemmer/ Porter] or Reduction stemming — A transforming algorithm that reduces any of the forms of a word such as "runs, running, ran" to its elemental root, e.g., "run". Porter stemming must be performed ''both'' at insertion time and at query time.
 * Expansion stemming — Takes a root word and 'expands' it to all of its various forms — can be used ''either'' at insertion time ''or'' at query time.

== Analyzers ==
- Analyzers are components that pre-process input text at index time and/or at search time. Because a search string has to be processed the same way that the indexed text was processed, ''it is important to use the same Analyzer for both indexing and searching. Not using the same Analyzer will likely result in invalid search results.'' /!\ :TODO: /!\ this isn't really true.. rephrase -YCS
+ Analyzers are components that pre-process input text at index time and/or at search time. It's important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.
- The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to implement a custom Analyzer.
+ The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.

== Tokens and Token Filters ==
- An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits-up a stream (normally a single field value) into a series of tokens. These tokens are then passed through Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
+ An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
+ 
+ The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase if a custom analyzer is used.

== Specifying an Analyzer in the schema ==

@@ -41, +41 @@

    <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
  </fieldtype>
}}}
- 1. Specifing a '''Tokenizer''' followed by a list of optional !TokenFilters that are applied in the listed order. Factories that can create the tokenizers or token filters are used to avoid the overhead of creation via reflection. [[BR]] Example: [[BR]] {{{
+ 1. Specifying a '''Tokenizer''' followed by a list of optional !TokenFilters that are applied in the listed order. Factories that can create the tokenizers or token filters are used to prepare configuration for the tokenizer or filter and to avoid the overhead of creation via reflection.
[[BR]] Example: [[BR]] {{{
  <fieldtype name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>

@@ -83, +83 @@

Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
- A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The StandardFilter is the only Lucene filter that utilizes token type.
+ A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The StandardFilter is currently the only Lucene filter that utilizes token type.

Some token types are number, alphanumeric, email, acronym, URL, etc.

@@ -102, +102 @@

 * Removes XML elements starting with <! and ending with >
 * Removes contents of <script> and <style> elements.
 * Handles XML comments inside these elements (normal comment processing won't always work)
- * Replaces numeric character entities references like A or 
+ * Replaces numeric character entity references like {{{A}}} or {{{}}}
 * The terminating ';' is optional if the entity reference is followed by whitespace.
 * Replaces all [http://www.w3.org/TR/REC-html40/sgml/entities.html named character entity references].
 * &nbsp; is replaced with a space instead of 0xa0

@@ -122, +122 @@

Strips HTML from the input stream and passes the result to a !StandardTokenizer.
- See solr.HTMLStripWhitespaceTokenizerFactory for details on HTML stripping.
+ See {{{solr.HTMLStripWhitespaceTokenizerFactory}}} for details on HTML stripping.

=== TokenFilterFactories ===

@@ -158, +158 @@

  "they", "this", "to", "was", "will", "with"
}}}
+ A customized stop word list may be specified with the "words" attribute in the schema.
- A customized stop word list may be specified with the "words" attribute in the schema.
The file referenced by the words parameter will be loaded by the !ClassLoader and hence must be in the classpath.
- 
- 
{{{
<fieldtype name="teststop" class="solr.TextField">

@@ -199, +197 @@

Creates an [http://snowball.tartarus.org/algorithms/english/stemmer.html English Porter2 stemmer] from the Java classes generated from a [http://snowball.tartarus.org/ Snowball] specification.
- A customized protected word list may be specified with the "protected" attribute in the schema. The file referenced will be loaded by the !ClassLoader and hence must be in the classpath. Any words in the protected word list will not be modified (stemmed).
+ A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by the stemmer.

A [http://svn.apache.org/repos/asf/incubator/solr/trunk/example/conf/protwords.txt sample SOLR protwords.txt with comments] can be found in the Source Repository.

@@ -212, +210 @@

</fieldtype>
}}}
- '''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses reflection to stem every word.
+ '''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses Java reflection to stem every word.

==== solr.WordDelimiterFilterFactory ====
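A typical configuration might look like the following sketch. The attribute names ({{{generateWordParts}}}, {{{catenateWords}}}, etc.) are taken from Solr example schemas and should be verified against the release you are running:

{{{
<!-- sketch: split tokens on intra-word delimiters such as hyphens,
     case changes, and letter/digit boundaries, so that e.g.
     "Wi-Fi" can match "wifi" and "wi fi" -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="0"/>
}}}

The generate* attributes control whether the sub-parts of a split token are emitted, and the catenate* attributes control whether the joined form is also emitted; using the same settings at index and query time keeps the two token streams compatible.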
