Typically, the whitespace tokenizer is the best choice when the word
delimiter filter will be used.
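
Something along these lines, for example (a minimal sketch; the field
type name and the WDFF parameters are illustrative, tune them for your
data) keeps "wi-fi" intact through tokenization so WDFF can both split
and catenate it:

<fieldType name="text_ws_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- whitespace tokenizer keeps "wi-fi" as a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDFF splits it into "wi" and "fi" and, with catenateWords="1",
         also emits the joined term "wifi" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>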
-- Jack Krupansky
-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, April 16, 2014 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: WordDelimiterFilterFactory and StandardTokenizer
On 4/16/2014 8:37 PM, Bob Laferriere wrote:
I am seeing odd behavior from WordDelimiterFilterFactory (WDFF) when
used in conjunction with StandardTokenizerFactory (STF).
<snip>
I see the following results for the document "wi-fi":
Index: "wi", "fi"
Query: "wi", "fi", "wifi"
The documentation seems to indicate that I should see the same results
in either case, as the WDFF is handling the generation of word parts.
But the concatenation of words does not seem to work with a
StandardTokenizer?
The standard tokenizer breaks things up at punctuation and discards it,
so by the time the token stream reaches WDFF there's nothing left for
the filter to split or catenate: "wi-fi" has already become the two
separate tokens "wi" and "fi". The following page links to a Unicode
document that explains how it all works:
http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
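
To make that concrete, here is a hypothetical field type (the name and
parameters are just for illustration) where the hyphen is consumed
before WDFF ever runs:

<fieldType name="text_std_wdf" class="solr.TextField">
  <analyzer>
    <!-- "wi-fi" becomes "wi", "fi" here; the hyphen is gone -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- receives two separate tokens, so catenateWords="1"
         has nothing to join back into "wifi" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
  </analyzer>
</fieldType>

Swap in solr.WhitespaceTokenizerFactory and WDFF receives "wi-fi" as a
single token, splits it, and also emits "wifi".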
If you use the Analysis page in the Solr admin UI, you can see how the
analysis works at each step.
https://cwiki.apache.org/confluence/display/solr/Analysis+Screen
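
The same token-by-token breakdown is also available over HTTP from the
field analysis request handler, if you prefer the command line (the
core name and field type here are assumptions, adjust for your setup):

curl "http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=text_std_wdf&analysis.fieldvalue=wi-fi&wt=json"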
Thanks,
Shawn