Typically, the whitespace tokenizer is the best choice when the word
delimiter filter will be used.
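
Something along these lines, for example (a minimal sketch; the field
type name and the WDFF parameters are illustrative, tune them for your
data) keeps "wi-fi" intact through tokenization so WDFF can both split
and catenate it:

<fieldType name="text_ws_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- whitespace tokenizer keeps "wi-fi" as a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDFF splits it into "wi" and "fi" and, with catenateWords="1",
         also emits the joined term "wifi" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>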
-- Jack Krupansky
-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, April 16, 2014 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: WordDelimiterFilterFactory and StandardTokenizer
On 4/16/2014 8:37 PM, Bob Laferriere wrote:
I am seeing odd behavior from WordDelimiterFilterFactory (WDFF) when
used in conjunction with StandardTokenizerFactory (STF).
<snip>
I see the following results for the document "wi-fi":
Index: "wi", "fi"
Query: "wi", "fi", "wifi"
The documentation seems to indicate that I should see the same results
in either case, as the WDFF is handling the generation of word parts.
But the concatenation of words does not seem to work with a
StandardTokenizer?
The standard tokenizer breaks things up at punctuation and discards it,
so by the time the token stream reaches WDFF there's nothing left for
the filter to split or catenate: "wi-fi" has already become the two
separate tokens "wi" and "fi". The following page links to a Unicode
document that explains how it all works:
http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
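
To make that concrete, here is a hypothetical field type (the name and
parameters are just for illustration) where the hyphen is consumed
before WDFF ever runs:

<fieldType name="text_std_wdf" class="solr.TextField">
  <analyzer>
    <!-- "wi-fi" becomes "wi", "fi" here; the hyphen is gone -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- receives two separate tokens, so catenateWords="1"
         has nothing to join back into "wifi" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
  </analyzer>
</fieldType>

Swap in solr.WhitespaceTokenizerFactory and WDFF receives "wi-fi" as a
single token, splits it, and also emits "wifi".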
If you use the Analysis page in the Solr admin UI, you can see how the
analysis works at each step.
https://cwiki.apache.org/confluence/display/solr/Analysis+Screen
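
The same token-by-token breakdown is also available over HTTP from the
field analysis request handler, if you prefer the command line (the
core name and field type here are assumptions, adjust for your setup):

curl "http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=text_std_wdf&analysis.fieldvalue=wi-fi&wt=json"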
Thanks,
Shawn