I have a problem with a stemmed German field. The field definition:

<field name="description" type="text_splitting" indexed="true" stored="true" required="false" multiValued="false"/>
...
<fieldType name="text_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

When a search comes from our autosuggest component, we always append an asterisk to the word. So when somebody enters something like "Radbremszylinder" and pauses for a few milliseconds, the autosuggest list is filled with the results of searching for "Radbremszylinder*". This seemed to work quite well, but today we got a bug report from a customer for that exact word.

So I made an analysis for the word as "Field value (index)" and "Field value (query)", and it looked like this:

Field value (index)                  Field value (query)
ST   Radbremszylinder                WT   Radbremszylinder*
SF   Radbremszylinder                SF   Radbremszylinder*
WDF  Radbremszylinder                SF   Radbremszylinder*
LCF  radbremszylinder                WDF  Radbremszylinder
SKMF radbremszylinder                LCF  radbremszylinder
PSF  radbremszylind                  SKMF radbremszylinder

As you can see, the end results look very much alike. However, records containing that word in their "description" field aren't returned as results. Strangely enough, records containing "Radbremszylindern" (plural) are returned. Removing the asterisk from the end returns all records with "Radbremszylinder", just as we would expect. So the culprit is the asterisk at the end. As far as we can tell from the docs, an asterisk matches zero or more characters, which means that the literal word in front of the asterisk should itself match the query.

Digging further, we tried some variations, and it seems that searching for "Radbremszylind*" works: all records with any variation ("Radbremszylinder", "Radbremszylindern") are returned. So maybe there's a weird interaction with stemming?
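
To make that suspicion concrete, here is a small Python sketch of what we think might be happening. It is only a model, not Solr's actual code. It assumes the wildcard term is lowercased but not stemmed, and is then matched as a plain prefix against the stemmed terms in the index. The stemmed form of the singular comes from the analysis output above; the stemmed form of the plural is our assumption (that the Porter stemmer leaves the "-ern" ending untouched). The name `prefix_matches` is made up for this illustration.

```python
# Indexed terms as we believe the stemmer stores them.
# "radbremszylind" is taken from the analysis output above;
# "radbremszylindern" is an assumption (Porter seems to leave "-ern" alone).
indexed_terms = {
    "Radbremszylinder": "radbremszylind",
    "Radbremszylindern": "radbremszylindern",
}


def prefix_matches(query, terms):
    """Return the original words whose indexed term starts with the query prefix.

    Assumption: the wildcard term is lowercased but NOT stemmed.
    """
    needle = query.rstrip("*").lower()
    return sorted(word for word, term in terms.items() if term.startswith(needle))


matches_full = prefix_matches("Radbremszylinder*", indexed_terms)
matches_short = prefix_matches("Radbremszylind*", indexed_terms)

print(matches_full)   # only the plural: its indexed term still contains "er"
print(matches_short)  # both forms: the shorter prefix fits both indexed terms
```

If this model is right, it would reproduce exactly what we see: "Radbremszylinder*" finds only the plural, while "Radbremszylind*" finds both.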

Any ideas?
