I'm trying to create a dictionary-based Norwegian lemmatizer, but for some odd reason I don't get any search results, even though the Analyzer in Solr Admin shows that it does the right thing. My lemmatizer works at query time if I have reindexed everything with another stemmer at index time, e.g. NorwegianMinimalStemmer.

Here's a screenshot of how it lemmatizes the Norwegian word "studenter" (masculine indefinite noun, plural - English: "students"). The stem is "student". So far so good:
http://folk.uio.no/erlendfg/solr/lemmatizer.png

But I get few or no results when I search for "studenter" compared to "student". If I switch to solr.NorwegianMinimalStemFilterFactory in schema.xml at index time and reindex everything, it works as it should:
<analyzer type="index">
  <!-- tokenizer and other filters omitted -->
  <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>
</analyzer>

What is wrong with my TokenFilter, and/or how can I debug this further? I have tried a lot of different things without any luck, for example explicitly converting everything to UTF-8 (the wordlist is in ISO-8859-1, but I'm reading it with the correct character set) and trimming all the words, but nothing helped. The byte sequence also seems to be correct for the stemmed word: my lemmatizer shows [73 74 75 64 65 6e 74], exactly the same as when I have configured NorwegianMinimalStemFilterFactory in schema.xml.
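
For illustration, the byte comparison described above could be reproduced outside Solr with something like the following. This is only a minimal sketch: the class and method names (TermBytes, utf8Hex) are hypothetical and not taken from the lemmatizer source. It decodes raw ISO-8859-1 bytes and prints the UTF-8 bytes of the result in the same bracketed hex style.

```java
import java.nio.charset.StandardCharsets;

public class TermBytes {
    // Decode raw wordlist bytes as ISO-8859-1, then hex-dump the UTF-8
    // encoding of the resulting string.
    static String utf8Hex(byte[] latin1Bytes) {
        String word = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        StringBuilder sb = new StringBuilder("[");
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 1) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        byte[] student = "student".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8Hex(student)); // [73 74 75 64 65 6e 74]

        // "øl": 'ø' is one byte (f8) in ISO-8859-1 but two (c3 b8) in UTF-8.
        byte[] oel = {(byte) 0xf8, 'l'};
        System.out.println(utf8Hex(oel)); // [c3 b8 6c]
    }
}
```

Dumping a term this way also makes stray trailing characters visible, which a printed string may hide.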

Here's the source code of my lemmatizer. Please note that it is not finished:
http://folk.uio.no/erlendfg/solr/

Here's the line in my wordlist which contains the word "studenter":
66235   student studenter       subst mask appell fl ub normert 700     3
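
For reference, a line in this format could be turned into a lookup entry roughly as follows. This is a sketch under assumptions: the column layout (column 2 = lemma, column 3 = inflected form, whitespace-separated) is inferred from the single example line above, and the names WordlistLine and addLine are hypothetical. If lemmas in the real wordlist can contain spaces, splitting on tabs only would be the safer choice.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class WordlistLine {
    // Parses one wordlist line and records inflected form -> lemma.
    // Assumed layout: whitespace-separated columns,
    // column 2 = lemma, column 3 = inflected form.
    static void addLine(String line, Map<String, Set<String>> formToLemmas) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length < 3) {
            return; // skip empty or malformed lines
        }
        String lemma = cols[1];
        String form = cols[2];
        // One inflected form may map to several lemmas, so keep a set.
        formToLemmas.computeIfAbsent(form, k -> new LinkedHashSet<>()).add(lemma);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> formToLemmas = new HashMap<>();
        addLine("66235\tstudent\tstudenter\tsubst mask appell fl ub normert\t700\t3",
                formToLemmas);
        System.out.println(formToLemmas.get("studenter")); // [student]
    }
}
```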

The following line returns the stem (input is "studenter"):
final String[] values = stemmer.stem(termAtt.buffer());

The rest of the code is in NorwegianLemmatizerFilter. If several stems are returned, they are all added.
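
One thing that may be worth checking in the stem() call above: in Lucene, CharTermAttribute.buffer() returns the internal char array, which can be longer than the current term. Only the first termAtt.length() characters are valid, and the array is reused between tokens. If stem() builds its lookup key from the whole array, leftovers from a longer previous token could make the lookup fail during indexing even though a single short input looks fine in the analysis view. The sketch below simulates this with plain Java; the dictionary and the two lookup methods are hypothetical stand-ins, not the actual lemmatizer code.

```java
import java.util.Collections;
import java.util.Map;

public class TermBufferDemo {
    // Hypothetical stand-in for the dictionary: inflected form -> lemma.
    static final Map<String, String> DICT =
            Collections.singletonMap("studenter", "student");

    // Builds the key from the whole array, as a stem(char[]) method might do.
    static String lookupWholeBuffer(char[] buffer) {
        return DICT.get(new String(buffer));
    }

    // Builds the key from the first `length` chars only, the valid part.
    static String lookupValidPart(char[] buffer, int length) {
        return DICT.get(new String(buffer, 0, length));
    }

    public static void main(String[] args) {
        // Simulate a reused term buffer: a longer previous token leaves
        // trailing characters behind when a shorter token is copied in.
        char[] buffer = "universitetet".toCharArray();
        String term = "studenter";
        term.getChars(0, term.length(), buffer, 0);
        int length = term.length();

        System.out.println(lookupWholeBuffer(buffer));       // null
        System.out.println(lookupValidPart(buffer, length)); // student
    }
}
```

In a real filter the equivalent of lookupValidPart would take termAtt.buffer() together with termAtt.length(), or simply use termAtt.toString().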

Erlend
