RE: Solr search engine configuration

Markus Jelsma Mon, 12 Mar 2018 16:06:06 -0700

Hello Peter,

StemmerOverrideFilterFactory wants tab-separated (\t) fields in its dictionary,
one word and its stem per line; that is probably the cause of the
ArrayIndexOutOfBoundsException you get. Regarding schema definitions, the
JavaDoc of each factory [1] lists a proper example. I recommend putting a
decompounder before the stemmer, and having an accent (or ICU) folding filter
as one of the last filters.
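
As a rough sketch of that ordering, an index analyzer could look something like
the fragment below. The file names nl_hyphenation.xml and compounds_nl.txt are
hypothetical placeholders, and the size limits are illustrative values, not
tuned recommendations; the other names are taken from your own definition.

```xml
<!-- Sketch only: hyphenator/dictionary file names and size limits are assumptions -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Decompounder runs before any stemming -->
  <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
          hyphenator="nl_hyphenation.xml" dictionary="compounds_nl.txt"
          minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
          onlyLongestMatch="true"/>
  <!-- Manual exceptions first, then the KP stemmer for everything else -->
  <filter class="solr.StemmerOverrideFilterFactory"
          dictionary="stemdict_nl.txt"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Kp"
          protected="protwords_nl.txt"/>
  <!-- Accent folding as one of the last filters -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
```

The query analyzer would normally mirror this chain so that both sides produce
the same terms.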


About the diff: it looks like KP (Kraaij-Pohlmann) output. It has the same
issues with whether a word needs a double or a single vowel in the root, and it
also shows issues with strong verbs/nouns (beveel/bevool). Using this list as
the dictionary amounts to the same thing as having KP configured, so you should
drop it and only list exceptions to the KP rules in the dict file. This is not
easy, so I recommend sticking to your domain's vocabulary.

Also, unless you have a very specific need for it, drop the StopFilter. Nobody
should want a StopFilter these days unless they can justify it. We use them
too, but only for very specific reasons and never for text search. You might
also want a WordDelimiterFilter as your first filter; look it up, you probably
want to have it.
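
For instance, assuming you keep the WhitespaceTokenizer, the start of the chain
could look roughly like this, with no StopFilter and a WordDelimiterFilter
directly after the tokenizer. The attribute values are only an illustration of
what the factory accepts, not a recommendation; check the factory JavaDoc for
what fits your data.

```xml
<!-- Sketch only: attribute values are illustrative assumptions -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- First filter: split on intra-word delimiters and case changes -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0"
          splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- remaining filters (decompounder, stemmer override, stemmer, folding) follow here -->
</analyzer>
```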

Markus

[1] 
https://lucene.apache.org/core/7_1_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html

 
 
-----Original message-----
> From:PeterKerk <petervdk...@hotmail.com>
> Sent: Monday 12th March 2018 23:16
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
> 
> @Erick: thank you for clarifying!
> 
> @Markus:
> I feel like I'm not (or at least should not be :-)) the first person to run
> into these challenges.
> 
> "You can solve this by adding manual rules to StemmerOverrideFilter, but due
> to the compound nature of words, you would need to add it for all the mills"
> 
> After Googling I found this:
> https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb
> and added http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt
> as stemdict_nl.txt
> 
> My new fieldType definition now is:
> 
> <fieldType name="searchtext_nl" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords_nl.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StemmerOverrideFilterFactory"
>             dictionary="stemdict_nl.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>             protected="protwords_nl.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords_nl.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StemmerOverrideFilterFactory"
>             dictionary="stemdict_nl.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>             protected="protwords_nl.txt"/>
>   </analyzer>
> </fieldType>
> 
> I trimmed stemdict_nl.txt for testing to just this:
> 
> aachen                        aach
> aachener                      aachener
> 
> But on full-import it throws an HTTP 500 error:
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at
> org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66)
> 
> Is my stemdict_nl.txt format incorrect?
> 
> And do you have examples of the HyphenationCompoundWordTokenFilter or
> AccentFoldingFilter? I can't find any.
> 
> I use Solr 4.3.1 btw, not sure if that matters.
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
