On 4/16/2014 8:37 PM, Bob Laferriere wrote:
>> I am seeing odd behavior from WordDelimiterFilterFactory  (WDFF) when
>> used in conjunction with StandardTokenizerFactory (STF).

<snip>

>> I see the following results for the document of “wi-fi”:
>>  
>> Index: “wi”, “fi”
>> Query: “wi”,”fi”,”wifi”
>>  
>> The documentation seems to indicate that I should see the same results
>> in either case as the WDFF is handling the generation of word parts.
>> But the concatenate of words does not seem to work with a
>> StandardTokenizer?

The standard tokenizer splits on punctuation and discards it, so by the
time the token stream reaches WDFF, "wi-fi" is already two separate
tokens ("wi" and "fi") and there is no delimiter left for WDFF to split
on or catenate across.  If you want catenateWords to produce "wifi",
use a tokenizer that leaves the hyphen intact, such as
WhitespaceTokenizerFactory.  The following page links to a Unicode
document that explains how the standard tokenizer decides where to
break:

http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
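As a rough sketch (field and type names here are just illustrative, not
from your schema), an analyzer chain like this lets WDFF see the whole
hyphenated token at index time:

```xml
<!-- Sketch of an index-time analyzer that lets WDFF see the hyphen.
     Names are illustrative, not from the original schema. -->
<fieldType name="text_wdff" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Whitespace tokenizer leaves "wi-fi" as a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDFF splits on the hyphen and, with catenateWords="1",
         also emits the joined form "wifi" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With that chain, "wi-fi" should index as "wi", "fi", and "wifi".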

If you use the Analysis page in the Solr admin UI, you can see how the
analysis works at each step.

https://cwiki.apache.org/confluence/display/solr/Analysis+Screen

Thanks,
Shawn
