Hello Peter,

StemmerOverrideFilter wants \t separated fields (word, tab, stem); that is probably the cause of the ArrayIndexOutOfBoundsException you get. Regarding schema definitions, each factory's JavaDoc [1] lists a proper example. I recommend putting a decompounder before the stemmer, and having an accent (or ICU) folder as one of the last filters.
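
To make the ordering concrete, here is a rough, untested sketch of a filter chain along those lines. The stemdict_nl.txt and protwords_nl.txt names are taken from your own definition; compounds_nl.txt is a hypothetical word list you would have to supply yourself, and the subword sizes are placeholder values, not tuned recommendations. Treat it as a starting point and check each factory's JavaDoc for the attributes your Solr version supports.

```xml
<fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer without a type is used for both index and query. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Decompounder before the stemmer. compounds_nl.txt is a hypothetical
         dictionary of Dutch subwords; the sizes are placeholders. -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt"
            minWordSize="5" minSubwordSize="3" maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <!-- Each line of stemdict_nl.txt must be: word, one tab character, stem. -->
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"
            protected="protwords_nl.txt"/>
    <!-- Accent folding as one of the last filters. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Because lowercasing runs before the override filter and folding runs after it, the entries in the dict and protwords files have to be lowercase and keep their accents, since that is how the tokens look at that point in the chain.
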
About the diff: it looks like KP output, and it has the same issues with whether a word needs a double or a single vowel in the root. It also shows issues with strong verbs/nouns (beveel/bevool). Having this list seems the same as having KP configured, so you should drop it and list only the exceptions to the KP rules in the dict file. This is not easy, so I recommend staying within your domain's vocabulary.

Also, unless you have a very specific need for it, drop the StopFilter. Nobody these days should want a StopFilter unless they can justify it. We use them too, but only for very specific reasons and never for text search.

You might also want to have a WordDelimiterFilter as your first filter; look it up, you probably want it.

Markus

[1] https://lucene.apache.org/core/7_1_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html

-----Original message-----
> From:PeterKerk <petervdk...@hotmail.com>
> Sent: Monday 12th March 2018 23:16
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
>
> @Erick: thank you for clarifying!
>
> @Markus:
> I feel like I'm not (or at least should not be :-)) the first person to run
> into these challenges.
>
> "You can solve this by adding manual rules to StemmerOverrideFilter, but due
> to the compound nature of words, you would need to add it for all the mills"
>
> After Googling I found this:
> https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb
> and added http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt
> as stemdict_nl.txt
>
> My new fieldType definition now is:
>
> <fieldType name="searchtext_nl" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_nl.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StemmerOverrideFilterFactory"
> dictionary="stemdict_nl.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Kp"
> protected="protwords_nl.txt"></filter>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_nl.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StemmerOverrideFilterFactory"
> dictionary="stemdict_nl.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Kp"
> protected="protwords_nl.txt"></filter>
>   </analyzer>
> </fieldType>
>
> I trimmed stemdict_nl.txt for testing to just this:
>
> aachen aach
> aachener aachener
>
> But on full-import it throws a http 500 error:
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at
> org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66)
>
> Is my stemdict_nl.txt format incorrect?
>
> And do you have examples of the HyphenationCompoundWordTokenFilter or
> AccentFoldingFilter I can't find any.
>
> I use Solr 4.3.1 btw, not sure if that matters.