WordDelimiterFilterFactory and StandardTokenizer

Bob Laferriere Wed, 16 Apr 2014 19:38:39 -0700

I am seeing odd behavior from WordDelimiterFilterFactory (WDFF) when used in conjunction with StandardTokenizerFactory (STF).

If I use the following configuration:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <!-- Index-time analysis: StandardTokenizer, then WDFF -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- Query-time analysis: WhitespaceTokenizer, then WDFF -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I see the following tokens for a document containing “wi-fi”:

Index: “wi”, “fi”
Query: “wi”, “fi”, “wifi”

The documentation seems to indicate that I should see the same results in both cases, since WDFF is what generates the word parts. But catenation of words (catenateWords="1") does not seem to work with StandardTokenizer. If I switch the index analyzer to WhitespaceTokenizerFactory, I get the following:

Index: “wi”, “fi”, “wifi”

I checked all of the documentation and did not find any indication of a conflict when using WDFF with STF, as opposed to WDFF with WhitespaceTokenizer. I assume this is because STF splits on the hyphen before the token stream reaches the filter chain, so WDFF never sees “wi-fi” as a single token?
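
To make that assumption concrete, here is a rough Python sketch of the two chains. It is only a toy model: the function names are made up for illustration, none of them are Solr or Lucene APIs, and the real StandardTokenizer and WDFF handle many more cases than a hyphen. It simply assumes that a StandardTokenizer-like step discards the hyphen, while a whitespace step leaves “wi-fi” intact for the delimiter filter to split and catenate.

```python
import re


def standard_like_tokenize(text):
    # Toy stand-in for StandardTokenizer (assumption): keep runs of
    # letters/digits and discard punctuation such as the hyphen.
    return re.findall(r"[A-Za-z0-9]+", text)


def whitespace_tokenize(text):
    # Toy stand-in for WhitespaceTokenizer: split on whitespace only.
    return text.split()


def toy_word_delimiter(tokens):
    # Toy stand-in for WDFF with generateWordParts=1 and catenateWords=1:
    # split each token on hyphens and, if that yields more than one part,
    # also emit the parts joined together.
    out = []
    for token in tokens:
        parts = [p for p in token.split("-") if p]
        out.extend(parts)
        if len(parts) > 1:
            out.append("".join(parts))
    return out


index_tokens = toy_word_delimiter(standard_like_tokenize("wi-fi"))
query_tokens = toy_word_delimiter(whitespace_tokenize("wi-fi"))
print("Index:", index_tokens)
print("Query:", query_tokens)
```

In this model the index side yields only the two parts, because the delimiter step receives “wi” and “fi” as separate tokens and has nothing left to catenate, which matches the output I am seeing.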

_______________________________________
Robert J. Laferriere
Director of Software Technology, Corporate Information Services
Chief Software Architect

Direct Supply . 6767 N Industrial Rd  Milwaukee, WI 53223
office 414-760-5833 . mobile 414-721-1092 . fax 877-282-5285
blaferri...@directs.com .  www.directsupply.com
