[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by HossMan

Apache Wiki Fri, 14 Apr 2006 17:16:28 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The following page has been changed by HossMan:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

------------------------------------------------------------------------------
   1. The Lucene !QueryParser tokenizes on white space before giving any text 
to the Analyzer, so if a person searches for the words `sea biscit` the 
analyzer will be given the words "sea" and "biscit" separately, and will not 
know that they match a synonym.
   1. Phrase searching (i.e. `"sea biscit"`) will cause the !QueryParser to pass 
the entire string to the analyzer, but if the !SynonymFilter is configured to 
expand the synonyms, then when the !QueryParser gets the resulting list of 
tokens back from the Analyzer, it will construct a !MultiPhraseQuery that will 
not have the desired effect.  This is because of the limited mechanism 
available for the Analyzer to indicate that two terms occupy the same position: 
there is no way to indicate that a "phrase" occupies the same position as a 
term.  For our example the resulting !MultiPhraseQuery would be `"(sea | sea | 
seabiscuit) (biscuit | biscit)"` which would not match the simple case of 
"seabiscuit" occurring in a document.
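
The position limitation described above can be illustrated with a small Python sketch. This is not Lucene code: `group_by_position` and `matches_phrase` are hypothetical helpers, and the token list is an assumed rendering of what an expanding synonym filter might emit for `"sea biscit"`, where an increment of 0 means "same position as the previous token".

```python
# Toy model of position increments; NOT the Lucene API.
# Each token is (term, position_increment). An increment of 0 stacks the
# term on the previous position; anything greater starts a new position
# (gaps larger than 1 are ignored in this simplified model).

def group_by_position(tokens):
    """Collapse a token stream into a list of per-position term lists."""
    positions = []
    for term, increment in tokens:
        if increment > 0 or not positions:
            positions.append([])
        positions[-1].append(term)
    return positions

def matches_phrase(doc_terms, positions):
    """True if some run of consecutive doc terms fits every position."""
    width = len(positions)
    for start in range(len(doc_terms) - width + 1):
        if all(doc_terms[start + i] in positions[i] for i in range(width)):
            return True
    return False

# Assumed analyzer output for the phrase "sea biscit" with expansion on.
tokens = [
    ("sea", 1), ("sea", 0), ("seabiscuit", 0),
    ("biscuit", 1), ("biscit", 0),
]
positions = group_by_position(tokens)

print(positions)
print(matches_phrase(["seabiscuit"], positions))
print(matches_phrase(["sea", "biscuit"], positions))
```

Because the single term "seabiscuit" can only be stacked on the first position, the query always needs two consecutive positions, so a document containing only "seabiscuit" cannot match in this model.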
  
+ Even when you aren't worried about multi-word synonyms, idf differences still 
make index time synonyms a good idea. Consider the following scenario:
+ 
+    * An index with a "text" field, which at query time uses the 
!SynonymFilter with the synonym `TV, Television` and `expand="true"`
+    * Many thousands of documents containing the term "text:TV"
+    * A few hundred documents containing the term "text:Television"
+ 
+ A query for `text:TV` will expand into `(text:TV text:Television)` and the 
lower docFreq for `text:Television` will give the documents that match 
"Television" a much higher score than comparable docs that match "TV", which 
may be somewhat counterintuitive to the client.  Index time expansion (or 
reduction) will result in the same idf for all documents regardless of which 
term the original text contained.
+
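
The idf effect described above can be made concrete with a rough Python sketch. It assumes the classic Lucene-style formula `idf = 1 + ln(numDocs / (docFreq + 1))`; the index size and document counts are invented for illustration, and the index-time case assumes no document contains both terms.

```python
import math

def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index statistics, chosen only to mirror the scenario.
NUM_DOCS = 100_000
TV_DOCS = 20_000          # "many thousands" of docs containing text:TV
TELEVISION_DOCS = 300     # "a few hundred" containing text:Television

# Query-time expansion: each term keeps its own docFreq.
idf_tv = idf(TV_DOCS, NUM_DOCS)
idf_television = idf(TELEVISION_DOCS, NUM_DOCS)

# Index-time expansion: every doc with either term is indexed with both,
# so both terms end up with the same docFreq (assuming disjoint sets).
expanded_doc_freq = TV_DOCS + TELEVISION_DOCS
idf_tv_expanded = idf(expanded_doc_freq, NUM_DOCS)
idf_television_expanded = idf(expanded_doc_freq, NUM_DOCS)

print("query-time:", idf_tv, idf_television)
print("index-time:", idf_tv_expanded, idf_television_expanded)
```

With query-time expansion the rarer term carries a noticeably larger idf, while with index-time expansion both terms share one docFreq and therefore one idf.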
