Hello all, We're running into a weird issue with Word Delimiter and apostrophes. For a text field that uses the out of the box field definition:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.jodange.solr.KStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.jodange.solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

(Note that com.jodange.solr.KStemFilterFactory is a backport of KStem for Solr 1.4 that we hacked together.)

The phrase "That is not true!” Ms. Johnson’s jaw dropped." generates two tokens for 'johnson'. It looks like WordDelimiter is splitting on the apostrophe in "Johnson's", emitting the token 'johnson' for the left part and both the tokens 's' and 'johnsons' for the right part. Stemming later reduces 'johnsons' to 'johnson', so 'johnson' ends up in the token stream twice. That makes things difficult if you're searching for Johnson and Johnson!
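One possible way to keep WordDelimiter from splitting on the apostrophe is to strip apostrophes from each token before it runs. The fragment below is a minimal, untested sketch of that idea, assuming the stock solr.PatternReplaceFilterFactory is available in this Solr version. The pattern is an assumption: it covers both the straight apostrophe and the typographic one (’) that appears in the sample phrase.

```xml
<!-- Sketch only: placement and pattern are assumptions, not a verified fix. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<!-- Remove straight and typographic apostrophes so "Johnson’s" reaches WordDelimiter as "Johnsons". -->
<filter class="solr.PatternReplaceFilterFactory" pattern="['’]" replacement="" replace="all"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
```

The same filter would presumably need to go into the query analyzer as well so that both sides produce matching tokens. A likely side effect is that contractions are collapsed too (for example "don't" becomes "dont").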
Here's an image of the analysis happening: http://imgur.com/BUuNT

Two questions:

1. I would have expected the catenated token to show up at the same position as the left-hand token, since they begin with the same letters. Does that not make sense?
2. Does it make sense to filter out apostrophes prior to WordDelimiter to prevent this from happening, or will that cause other issues?

Thanks,

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game