Word Delimiter issue

Michael Della Bitta Tue, 31 Jul 2012 10:03:51 -0700

Hello all,

We're running into an odd issue with WordDelimiterFilter and apostrophes,
on a text field that uses the out-of-the-box field definition:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.jodange.solr.KStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.jodange.solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

(Note that com.jodange.solr.KStemFilterFactory is a backport of KStem
to Solr 1.4 that we hacked together.)

The phrase "That is not true!” Ms. Johnson’s jaw dropped." generates
two tokens for 'johnson'. It looks like WordDelimiter is splitting on
the apostrophe in "Johnson’s", emitting the token 'johnson' for the
left part and both 's' and the catenated token 'johnsons' for the
right part. Stemming later reduces 'johnsons' to 'johnson', so the
same term ends up in the token stream twice.

Which is kind of difficult if you're searching for Johnson and Johnson!

Here's an image of the analysis happening:

http://imgur.com/BUuNT

Two questions:

1. I would have expected the catenated token to show up at the same
position as the left hand side token, since they begin with the same
letters. Does that not make sense?

2. Does it make sense to filter out apostrophes prior to WordDelimiter
to prevent this from happening, or will that cause other issues?
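
To make question 2 concrete, here is a rough, untested sketch of the kind
of filtering meant. It assumes solr.PatternReplaceFilterFactory is
available in this Solr 1.4 setup; the pattern (covering both the straight
and the curly apostrophe) and the placement right before WordDelimiter
are illustrative guesses, not a verified fix. The same filter would
presumably have to go into both the index and the query analyzer.

```xml
<!-- Untested sketch: strip apostrophes so WordDelimiter never sees them.
     The pattern and the placement are assumptions. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="['’]" replacement="" replace="all"/>
<!-- Existing index-time WordDelimiter settings, unchanged. -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
```

With this in place "Johnson’s" would reach WordDelimiter as the single
token "Johnsons", which the stemmer should then reduce to 'johnson'. The
obvious side effect is that contractions such as "don't" would become
"dont".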

Thanks,

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game
