@Erick: thank you for clarifying!

@Markus:
I feel like I'm not (or at least should not be :-)) the first person to run
into these challenges.

"You can solve this by adding manual rules to StemmerOverrideFilter, but due
to the compound nature of words, you would need to add it for all the mills"

After Googling I found this:
https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb
and saved http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt
as stemdict_nl.txt.

My fieldType definition is now:

    <fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Kp" protected="protwords_nl.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Kp" protected="protwords_nl.txt"/>
      </analyzer>
    </fieldType>
        
I trimmed stemdict_nl.txt for testing to just this:

aachen                        aach
aachener                      aachener

But a full-import now fails with an HTTP 500 error:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66)

Is my stemdict_nl.txt format incorrect?
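
If the factory expects a single tab between each word and its stem (an
assumption on my part, based on the exception being an index of 1 on what
looks like a split line), then the space-aligned columns copied from
diffs.txt would need converting. A minimal sketch in Python of how I could do
that; the function name to_tab_separated is hypothetical, not part of Solr:

```python
def to_tab_separated(lines):
    """Turn 'word<spaces>stem' lines into 'word<TAB>stem' lines.

    Assumes each useful line holds exactly two whitespace-separated
    tokens (word, then stem); anything else is skipped.
    """
    converted = []
    for line in lines:
        parts = line.split()  # splits on any run of spaces or tabs
        if len(parts) != 2:
            continue  # blank or malformed line, leave it out
        word, stem = parts
        converted.append(word + "\t" + stem)
    return converted


# The two test entries from my trimmed stemdict_nl.txt
sample = [
    "aachen                        aach",
    "aachener                      aachener",
]
result = to_tab_separated(sample)
for entry in result:
    print(repr(entry))
```

Writing the converted lines back to stemdict_nl.txt and reloading the core
would show whether the separator is really what triggers the exception.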

Also, do you have examples of the HyphenationCompoundWordTokenFilter or
AccentFoldingFilter? I can't find any.

I use Solr 4.3.1 btw, not sure if that matters.



