Problem with German hyphenated words not being found

Thomas Michael Engelke Thu, 11 Jun 2015 02:27:08 -0700

 Hey,

In German, you can string most nouns together with hyphens, like
this:


Industrie = industry
Anhänger = trailer

Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying "Industrieanhänger" from the "name"
field (name:Industrieanhänger), to make sure the index actually contains
the word. Our data is structured so that products are listed without the
hyphen.

Now, customers may use the hyphenated version as a search term
(e.g. "industrie-anhänger"), and of course we want them to find what
they are looking for. I've configured the WordDelimiterFilterFactory
with catenateWords="1", so that the word parts are also catenated into
a single token. An analysis of "Industrieanhänger" as index and
"industrie-anhänger" as query can be seen here [2].

The analysis shows that both word parts are found. However, querying
for "industrie-anhänger" yields no results; results appear only when
the hyphen is removed, as you can see here [3]. I'm not sure how to
proceed from here, because until now the results of the analysis have
always lined up with what I could see when querying. Here's the schema
definition for "text", the field type of the "name" field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="true">
 <analyzer type="index">
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="0" catenateAll="0"
preserveOriginal="1"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
maxSubwordSize="30" onlyLongestMatch="false"/>
 <filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
 <filter class="solr.GermanNormalizationFilterFactory"/>
 <filter class="solr.SnowballPorterFilterFactory" language="German2"
protected="protwords.txt"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 <analyzer type="query">
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="0" catenateAll="0"
preserveOriginal="1"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
maxSubwordSize="30" onlyLongestMatch="false"/> -->
 <filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
 <filter class="solr.GermanNormalizationFilterFactory"/>
 <filter class="solr.SnowballPorterFilterFactory" language="German2"
protected="protwords.txt"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>
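
To make the intended effect of the WordDelimiterFilterFactory settings
concrete, here is a rough Python sketch of the token expansion I expect
from generateWordParts="1", catenateWords="1" and preserveOriginal="1",
followed by lowercasing. It is a toy model, not Solr's implementation:
the function name split_and_catenate is made up for illustration, and
the sketch ignores token positions, splitOnCaseChange, splitOnNumerics,
compound splitting, stop words, normalization and stemming.

```python
import re


def split_and_catenate(token):
    """Toy model of the word-delimiter step (hypothetical helper, not Solr code)."""
    # Split on any run of non-word characters, e.g. the hyphen.
    parts = [p for p in re.split(r"[^\w]+", token) if p]
    tokens = [token]                    # preserveOriginal="1"
    if len(parts) > 1:
        tokens.extend(parts)            # generateWordParts="1"
        tokens.append("".join(parts))   # catenateWords="1"
    # LowerCaseFilterFactory runs after the word-delimiter filter.
    return [t.lower() for t in tokens]


query_tokens = split_and_catenate("industrie-anhänger")
index_tokens = split_and_catenate("Industrieanhänger")

# The catenated query token is the term that should overlap with the index.
print(set(query_tokens) & set(index_tokens))
```

In this simplified model the query side does produce the catenated
token that is present on the index side, which is consistent with what
the analysis screen shows. The sketch says nothing about token
positions or about how the query parser combines the tokens.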

I also thought the problem might be that the hyphen was not
URL-encoded, but replacing it with %2D didn't change the outcome (and
was probably wrong anyway).
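
As a quick check of the URL-encoding idea, the following Python snippet
(standard library only) shows that the hyphen is an unreserved
character that is normally left unescaped, and that "%2D" decodes back
to the same hyphen. It assumes the server percent-decodes the query
string before parsing it, which is the usual behaviour for HTTP
parameters; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

term = "industrie-anhänger"

# quote() leaves the hyphen alone and percent-encodes "ä" as UTF-8 bytes.
encoded = quote(term)
print(encoded)

# Escaping the hyphen by hand as %2D decodes to exactly the same string.
manual = encoded.replace("-", "%2D")
print(unquote(manual) == unquote(encoded) == term)
```

So both forms of the URL should reach the query parser as the same
search term, which matches the unchanged outcome.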

Any help is greatly appreciated. 

Links:
------
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
