I have a problem with a stemmed German field. The field definition is:
<field name="description" type="text_splitting" indexed="true"
       stored="true" required="false" multiValued="false"/>
...
<fieldType name="text_splitting" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
Our autosuggest component always appends an asterisk to the word being
typed. When somebody enters something like "Radbremszylinder" and pauses
for a few milliseconds, the autosuggest list is filled with the results
of searching for "Radbremszylinder*". This seemed to work quite well,
but today we got a bug report from a customer for that exact word.
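
For context, the query our autosuggest sends is built roughly like the
Python sketch below. The host, the handler path, the `rows` value and
the function name are placeholders for illustration, not our real
setup; only the field name and the appended asterisk are as described
above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; the real host and handler path will differ.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_suggest_url(user_input, field="description", rows=10):
    """Build the autosuggest request: the typed word plus a trailing asterisk."""
    term = user_input.strip()
    params = {"q": f"{field}:{term}*", "rows": rows}
    # urlencode percent-encodes the colon and the asterisk for the URL.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_suggest_url("Radbremszylinder")
    print(url)
    # Decoding the query string again shows the q parameter Solr receives.
    print(parse_qs(urlsplit(url).query)["q"])
```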
So I ran an analysis of the word as "Field value (index)" and as "Field
value (query)", and it looked like this:
Index                     Query
ST    Radbremszylinder    WT    Radbremszylinder*
SF    Radbremszylinder    SF    Radbremszylinder*
WDF   Radbremszylinder    SF    Radbremszylinder*
LCF   radbremszylinder    WDF   Radbremszylinder
SKMF  radbremszylinder    LCF   radbremszylinder
PSF   radbremszylind      SKMF  radbremszylinder
As you can see, the end results look very much alike. However, records
containing that word in their "description" field aren't returned.
Strangely enough, records containing "Radbremszylindern" (plural) are
returned. Removing the asterisk from the end returns all records with
"Radbremszylinder", just as we would expect, so the culprit is the
trailing asterisk. As far as we can tell from the docs, an asterisk
matches zero or more characters, which means that the literal word in
front of the asterisk should also match.
Digging further, we tried some variations, and searching for
"Radbremszylind*" works: all records with any variation
("Radbremszylinder", "Radbremszylindern") are returned. So maybe there's
a weird interaction with stemming?
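
To make the suspicion concrete, here is a small Python sketch of what I
think might be happening. It is a toy model, not Solr code. It assumes
that the wildcard term is only lowercased and then compared as a prefix
against the stemmed terms in the index, without being stemmed itself.
The term "radbremszylind" is the PSF output from the analysis above;
that "radbremszylindern" is indexed unchanged is my assumption, and the
document labels are made up.

```python
# Toy model of the index: stemmed term -> documents containing it.
# "radbremszylind" is the PSF output shown in the analysis above.
# Assumption: the stemmer leaves "radbremszylindern" unchanged.
INDEXED_TERMS = {
    "radbremszylind": ["doc with 'Radbremszylinder'"],
    "radbremszylindern": ["doc with 'Radbremszylindern'"],
}


def prefix_search(query):
    """Match a trailing-asterisk query against indexed terms.

    Assumption: the query term is lowercased but NOT stemmed.
    """
    prefix = query.rstrip("*").lower()
    hits = []
    for term in sorted(INDEXED_TERMS):
        if term.startswith(prefix):
            hits.extend(INDEXED_TERMS[term])
    return hits


if __name__ == "__main__":
    # Only the plural matches: "radbremszylind" is shorter than the prefix.
    print(prefix_search("Radbremszylinder*"))
    # Both forms match the shorter prefix.
    print(prefix_search("Radbremszylind*"))
```

If that assumption holds, it would reproduce exactly what we observe:
"Radbremszylinder*" finds only the plural form, while "Radbremszylind*"
finds both.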
Any ideas?