Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

------------------------------------------------------------------------------
= Analyzers, Tokenizers, and Token Filters =
- /!\ :TODO: /!\ Package names are all probably wrong and need fixed
- 
- When a document comes in, individual fields are subject to the analyzing and tokenizing filters that can transform the data in the fields. For example — removing blank spaces, removing html code, stemming, removing a particular character and replacing it with another. At collection time as well as at query time you may need to do some of the above or similiar operations. For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] transformation (a type of phonic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.
+ When a document is indexed, its individual fields are subject to the analyzing and tokenizing filters that can transform and normalize the data in the fields. For example: removing blank spaces, removing HTML code, stemming, removing a particular character and replacing it with another. At indexing time as well as at query time you may need to do some of the above or similar operations. For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.
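As a rough illustration of such a phonetic transformation in a Solr field type (a sketch only: the factory name {{{solr.PhoneticFilterFactory}}} and its {{{encoder}}} attribute are assumptions taken from later Solr releases and may not exist in the version this page describes), an analysis chain might look like:

{{{
<!-- sketch: tokenize on whitespace, then hash each token Soundex-style -->
<fieldtype name="phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- assumed filter name/attribute; verify against your Solr release -->
    <filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
  </analyzer>
</fieldtype>
}}}

With such a chain applied at both index and query time, "Smith" and "Smyth" would hash to the same token and therefore match each other.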
'''Note:'''
- Before continuing with this doc, it's recommended that you read the following sections in [http://lucenebook.com/search Lucene In Action]:
+ For a good background on Lucene analysis, it's recommended that you read the following sections in [http://lucenebook.com/search Lucene In Action]:
 * 1.5.3 : Analyzer
 * Chapter 4.0 through 4.7 at least

@@ -18, +16 @@

== Stemming ==
- Two types of stemming are available to you:
+ There are two types of stemming strategies:
 * [http://tartarus.org/~martin/PorterStemmer/ Porter] or Reduction stemming — A transforming algorithm that reduces any of the forms of a word such as "runs, running, ran" to its elemental root, e.g., "run". Porter stemming must be performed ''both'' at insertion time and at query time.
 * Expansion stemming — Takes a root word and 'expands' it to all of its various forms — can be used ''either'' at insertion time ''or'' at query time.

== Analyzers ==
- Analyzers are components that pre-process input text at index time and/or at search time. Because a search string has to be processed the same way that the indexed text was processed, ''it is important to use the same Analyzer for both indexing and searching. Not using the same Analyzer will likely result in invalid search results.'' /!\ :TODO: /!\ this isn't really true.. rephrase -YCS
+ Analyzers are components that pre-process input text at index time and/or at search time. It's important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.
- The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to implement a custom Analyzer.
+ The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.

== Tokens and Token Filters ==
- An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits-up a stream (normally a single field value) into a series of tokens. These tokens are then passed through Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
+ An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
+ 
+ The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase if a custom analyzer is used.

== Specifying an Analyzer in the schema ==

@@ -41, +41 @@

    <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
  </fieldtype>
}}}
- 1. Specifing a '''Tokenizer''' followed by a list of optional !TokenFilters that are applied in the listed order. Factories that can create the tokenizers or token filters are used to avoid the overhead of creation via reflection. [[BR]] Example: [[BR]] {{{
+ 1. Specifying a '''Tokenizer''' followed by a list of optional !TokenFilters that are applied in the listed order. Factories that can create the tokenizers or token filters are used to prepare configuration for the tokenizer or filter and to avoid the overhead of creation via reflection.
[[BR]] Example: [[BR]] {{{
  <fieldtype name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>

@@ -83, +83 @@

Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
- A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The StandardFilter is the only Lucene filter that utilizes token type.
+ A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The StandardFilter is currently the only Lucene filter that utilizes token type.

Some token types are number, alphanumeric, email, acronym, URL, etc.

@@ -102, +102 @@

 * Removes XML elements starting with <! and ending with >
 * Removes contents of <script> and <style> elements.
 * Handles XML comments inside these elements (normal comment processing won't always work)
- * Replaces numeric character entities references like A or 
+ * Replaces numeric character entity references like {{{A}}} or {{{}}}
 * The terminating ';' is optional if the entity reference is followed by whitespace.
 * Replaces all [http://www.w3.org/TR/REC-html40/sgml/entities.html named character entity references].
 * &nbsp; is replaced with a space instead of 0xa0

@@ -122, +122 @@

Strips HTML from the input stream and passes the result to a !StandardTokenizer.
- See solr.HTMLStripWhitespaceTokenizerFactory for details on HTML stripping.
+ See {{{solr.HTMLStripWhitespaceTokenizerFactory}}} for details on HTML stripping.

=== TokenFilterFactories ===

@@ -158, +158 @@

  "they", "this", "to", "was", "will", "with"
}}}
+ A customized stop word list may be specified with the "words" attribute in the schema.
- A customized stop word list may be specified with the "words" attribute in the schema.
The file referenced by the words parameter will be loaded by the !ClassLoader and hence must be in the classpath.
- 
- 
{{{
<fieldtype name="teststop" class="solr.TextField">

@@ -199, +197 @@

Creates an [http://snowball.tartarus.org/algorithms/english/stemmer.html English Porter2 stemmer] from the Java classes generated from a [http://snowball.tartarus.org/ Snowball] specification.
- A customized protected word list may be specified with the "protected" attribute in the schema. The file referenced will be loaded by the !ClassLoader and hence must be in the classpath. Any words in the protected word list will not be modified (stemmed).
+ A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by the stemmer.

A [http://svn.apache.org/repos/asf/incubator/solr/trunk/example/conf/protwords.txt sample SOLR protwords.txt with comments] can be found in the Source Repository.

@@ -212, +210 @@

</fieldtype>
}}}
- '''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses reflection to stem every word.
+ '''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses Java reflection to stem every word.

==== solr.WordDelimiterFilterFactory ====
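A typical configuration might look like the following sketch. The attribute names ({{{generateWordParts}}}, {{{catenateWords}}}, etc.) are taken from Solr example schemas and should be verified against the release you are running:

{{{
<!-- sketch: split tokens on intra-word delimiters such as hyphens,
     case changes, and letter/digit boundaries, so that e.g.
     "Wi-Fi" can match "wifi" and "wi fi" -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="0"/>
}}}

The generate* attributes control whether the sub-parts of a split token are emitted, and the catenate* attributes control whether the joined form is also emitted; using the same settings at index and query time keeps the two token streams compatible.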
