RE: bi-grams for common terms - any analyzers do that?

Burton-West, Tom Mon, 27 Sep 2010 10:27:11 -0700

Hi Jonathan,

>> I'm afraid I'm having trouble understanding   "if the analyzer returns more 
>> than one position back from a "queryparser token"


>>I'm not sure if "the queryparser forms a phrase query without explicit phrase 
>>quotes" is a problem for me, I had no idea it happened until now, never 
>>noticed, and still don't really understand in what circumstances it happens.

The problem I had was with the Boolean query "l'art AND historie": the 
WordDelimiterFilter tokenized "l'art" as two tokens, "l" at position 1 and 
"art" at position 2, so the queryparser decided this meant a phrase query for 
"l" followed immediately by "art".  See
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance 
for details.

This happens whenever a token filter splits one token into more than one 
token at different positions, for example a filter that splits foo-bar into 
"foo" "bar".  The exception is SynonymFilter or something like it.  
SynonymFilter does not really "split" one token into multiple tokens; given 
one token of input, it outputs all the synonyms of the term, and all of those 
tokens have the same position attribute (see: 
http://www.lucidimagination.com/search/document/CDRG_ch05_5.6.19?q=synonym%20filter).

So, for example, for the string "the small thing", if you had a synonym list 
for small:
small=>tiny,teeny

input:
position|1   |2    |3
token   |the |small|thing

would output:

position|1   |2    |2    |2    |3
token   |the |small|tiny |teeny|thing

In this case, when the queryparser gets back "small", "tiny" and "teeny", they 
all have the same position, so they are not turned into a phrase query.
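
To make the position bookkeeping concrete, here is a minimal Python sketch of 
the synonym case. It is only a toy model, not Lucene's SynonymFilter: the 
function name expand_synonyms and the dict-based synonym list are assumptions 
made up for illustration.

```python
def expand_synonyms(words, synonyms):
    """Toy model of a synonym filter (hypothetical, not the Lucene class).

    Returns (term, position) pairs. Every synonym is emitted at the SAME
    position as the word it was generated from.
    """
    tokens = []
    for position, word in enumerate(words, start=1):
        tokens.append((word, position))
        for synonym in synonyms.get(word, []):
            # Same position as the original word: no new position is used.
            tokens.append((synonym, position))
    return tokens


tokens = expand_synonyms(["the", "small", "thing"],
                         {"small": ["tiny", "teeny"]})
for term, position in tokens:
    print(position, term)
```

The three tokens small, tiny and teeny all come out at position 2, which 
matches the table above.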

For "l'art":

input:
position|1
token   |l'art

output:
position|1    |2
token   |l    |art

In this case there are two tokens with different positions, so the 
queryparser treats them as a phrase query.
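
The rule described above can be sketched in a few lines of Python. This is a 
simplified model of the decision, not the actual queryparser code: the names 
split_on_punctuation and query_type are hypothetical, and the splitter is 
only a rough stand-in for what WordDelimiterFilter does to "l'art".

```python
import re


def split_on_punctuation(text, start=1):
    """Toy splitter (an assumption, not WordDelimiterFilter itself).

    Breaks text on non-alphanumeric characters and gives each part its
    own, increasing position.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", text) if p]
    return [(part, start + i) for i, part in enumerate(parts)]


def query_type(tokens):
    """Classify the (term, position) pairs returned for ONE chunk of
    query text, following the rule described in this message."""
    if len(tokens) <= 1:
        return "term"
    distinct_positions = {position for _, position in tokens}
    if len(distinct_positions) == 1:
        return "boolean"   # e.g. synonyms stacked at one position
    return "phrase"        # more than one position came back


print(query_type(split_on_punctuation("l'art")))
print(query_type([("small", 2), ("tiny", 2), ("teeny", 2)]))
```

Under these assumptions "l'art" is classified as a phrase, because "l" and 
"art" sit at positions 1 and 2, while the stacked synonyms are not.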

Tom Burton-West
