Typically, the whitespace tokenizer is the best choice when the word delimiter filter will be used.
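For example, a field type along these lines keeps "wi-fi" intact until the
filter can split and catenate it (a minimal sketch; the field type name and
the specific generateWordParts/catenateWords settings are illustrative, not
from this thread):

  <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- the whitespace tokenizer leaves "wi-fi" as a single token -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- WDFF can then split it into "wi" and "fi" and, with
           catenateWords="1", also emit the joined form "wifi" -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>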

-- Jack Krupansky

-----Original Message----- From: Shawn Heisey
Sent: Wednesday, April 16, 2014 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: WordDelimiterFilterFactory and StandardTokenizer

On 4/16/2014 8:37 PM, Bob Laferriere wrote:
I am seeing odd behavior from WordDelimiterFilterFactory (WDFF) when
used in conjunction with StandardTokenizerFactory (STF).

<snip>

I see the following results for a document containing “wi-fi”:

Index: “wi”, “fi”
Query: “wi”, “fi”, “wifi”

The documentation seems to indicate that I should see the same results
in either case, since the WDFF is handling the generation of word
parts. But the catenation of words does not seem to work with a
StandardTokenizer?

The standard tokenizer breaks things up at punctuation, so by the time
the token stream reaches WDFF, there's nothing left for the filter to
do.  The following page links to a Unicode document that explains how
it all works:

http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
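
To illustrate with a sketch (not the actual schema from this thread,
which was snipped): with a chain like the one below, the hyphen is
consumed before WDFF ever runs, so there is nothing for it to split or
catenate.

  <analyzer>
    <!-- StandardTokenizer splits "wi-fi" at the hyphen and emits
         "wi" and "fi" as two separate tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- WDFF receives two plain single-word tokens; neither contains
         a delimiter to split on, and catenateWords only joins parts
         that WDFF itself split from one original token, so "wifi" is
         never produced -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
  </analyzer>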

If you use the Analysis page in the Solr admin UI, you can see how the
analysis works at each step.

https://cwiki.apache.org/confluence/display/solr/Analysis+Screen

Thanks,
Shawn
