I am seeing odd behavior from WordDelimiterFilterFactory (WDFF) when it is used in conjunction with StandardTokenizerFactory (STF).
 
If I use the following configuration:
 
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
 
 
I see the following results for a document containing “wi-fi”:
 
Index: “wi”, “fi”
Query: “wi”, “fi”, “wifi”
 
The documentation seems to indicate that I should see the same results in both cases, since the WDFF handles the generation of word parts. But the catenation of words (catenateWords="1") does not seem to work with the StandardTokenizer? If I switch the index analyzer to use WhitespaceTokenizerFactory, I get the following:
 
Index: “wi”, “fi”, “wifi”
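
For reference, here is a sketch of the index analyzer as used in that second test. It assumes the tokenizer is the only change and that the filter chain is identical to the one in the configuration above:

```xml
<!-- Index analyzer for the second test (sketch).
     Assumption: only the tokenizer differs from the original index analyzer;
     every filter and attribute below is copied unchanged from it. -->
<analyzer type="index">
  <!-- Whitespace tokenization keeps "wi-fi" as a single token, so the hyphen is still present when WDFF runs -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

With this analyzer the index and query sides share the same tokenizer, which matches the identical token output shown for both.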
 
I checked all the documentation and did not find any indication of a conflict between using WDFF with STF versus WDFF with WhitespaceTokenizerFactory. I assume this is because STF splits on the hyphen first, before passing tokens to the filter chain?
 
_______________________________________
Robert J. Laferriere 
Director of Software Technology, Corporate Information Services
Chief Software Architect

Direct Supply . 6767 N Industrial Rd  Milwaukee, WI  53223
office 414-760-5833 . mobile 414-721-1092 . fax 877-282-5285
