TokenFilter not working at index time

Erlend Garåsen Tue, 24 Jun 2014 05:00:46 -0700

I'm trying to create a Norwegian lemmatizer based on a dictionary, but for some odd reason I don't get any search results, even though the Analyzer in Solr Admin shows that it does the right thing. It works at query time if I have reindexed everything based on another stemmer, e.g. NorwegianMinimalStemmer.

Here's a screenshot of how it lemmatizes the Norwegian word "studenter" (masculine indefinite noun, plural; English: "students"). The stem is "student". So far so good:

http://folk.uio.no/erlendfg/solr/lemmatizer.png

But I get few or no results if I search for "studenter" compared to "student". If I switch to solr.NorwegianMinimalStemFilterFactory in schema.xml at index time and reindex everything, it works as it should:

<analyzer type="index">
  <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>

What is wrong with my TokenFilter, and how can I debug this further? I have tried a lot of different things without any luck, for example decoding everything explicitly to UTF-8 (the wordlist is in ISO-8859-1, but I'm reading it properly by setting the correct character set) and trimming all the words, neither of which helped. The byte sequence also seems to be correct for the stemmed word. My lemmatizer shows [73 74 75 64 65 6e 74], exactly the same as when I have configured NorwegianMinimalStemFilterFactory in schema.xml.
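
For reference, the byte sequence above can be reproduced outside Solr. The following is only a minimal sketch in plain Java; the class and method names (TermBytes, utf8Hex) are made up for illustration and are not part of the lemmatizer or of Lucene.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the lemmatizer: prints the UTF-8 bytes of a term.
class TermBytes {

    /** Returns the UTF-8 bytes of the text as space-separated lowercase hex. */
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The stem from the example above.
        System.out.println(utf8Hex("student")); // 73 74 75 64 65 6e 74

        // A single ISO-8859-1 byte (0xf8, "ø") decoded with the right charset
        // becomes two bytes once it is encoded as UTF-8.
        String decoded = new String(new byte[] {(byte) 0xf8}, StandardCharsets.ISO_8859_1);
        System.out.println(utf8Hex(decoded)); // c3 b8
    }
}
```

If the hex output for a stem differs between the two filters, the problem is in the characters themselves; if it is identical, as here, the difference is more likely in how the term is written back to the token.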

Here's the source code of my lemmatizer. Please note that it is not finished:

http://folk.uio.no/erlendfg/solr/

Here's the line in my wordlist which contains the word "studenter":
66235   student studenter       subst mask appell fl ub normert 700     3
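
A line like this can be turned into a lookup from inflected form to lemma. The sketch below is an assumption-laden illustration, not the code from the link above: it assumes the columns are separated by tabs or runs of whitespace, that column 2 is the lemma and column 3 the inflected form (inferred from this single sample line), and the names WordlistSketch and addLine are hypothetical. Splitting on whitespace would not handle multi-word entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: builds an "inflected form -> lemmas" map from wordlist lines.
class WordlistSketch {

    /** Adds one wordlist line to the map; lines with fewer than three columns are ignored. */
    static void addLine(Map<String, List<String>> lemmasByForm, String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length < 3) {
            return;
        }
        String lemma = cols[1]; // assumed: second column is the lemma
        String form = cols[2];  // assumed: third column is the inflected form
        List<String> lemmas = lemmasByForm.computeIfAbsent(form, k -> new ArrayList<>());
        if (!lemmas.contains(lemma)) {
            lemmas.add(lemma); // a form may map to several lemmas
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> lemmasByForm = new HashMap<>();
        addLine(lemmasByForm,
                "66235\tstudent\tstudenter\tsubst mask appell fl ub normert\t700\t3");
        System.out.println(lemmasByForm.get("studenter")); // [student]
    }
}
```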

The following line returns the stem (input is "studenter"):
final String[] values = stemmer.stem(termAtt.buffer());
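
One thing worth checking on this line: Lucene's CharTermAttribute.buffer() returns the internal char array, which is reused between tokens and can be longer than the current term; only the first length() characters are valid. Whether this is the cause here is an assumption, since the signature of stemmer.stem() is not shown. The sketch below simulates the effect in plain Java without Lucene; the class and method names are hypothetical.

```java
// Hypothetical simulation of a reused term buffer; no Lucene classes are involved.
class TermBufferSketch {

    /** Reads the whole array, including characters left over from an earlier, longer term. */
    static String wholeBuffer(char[] buffer) {
        return new String(buffer);
    }

    /** Reads only the valid part of the array, as given by the term length. */
    static String termOnly(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    /** Copies a term into the start of the shared buffer and returns its length. */
    static int fill(char[] buffer, String term) {
        term.getChars(0, term.length(), buffer, 0);
        return term.length();
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];
        fill(buffer, "universitetet");             // earlier, longer token
        int length = fill(buffer, "studenter");    // current token reuses the array

        System.out.println(termOnly(buffer, length));                        // studenter
        System.out.println(wholeBuffer(buffer).equals("studenter"));         // false
    }
}
```

With a real CharTermAttribute, the equivalent of termOnly would be new String(termAtt.buffer(), 0, termAtt.length()) or simply termAtt.toString(). A lookup on the whole buffer could still appear to work in the Analysis page, where a single short input is analyzed, and fail during indexing of longer documents.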

The rest of the code is in NorwegianLemmatizerFilter. If several stems are returned, they are all added.


Erlend
