Re: TokenFilter not working at index time

Erick Erickson Tue, 24 Jun 2014 08:34:55 -0700

Hmmm. It would help if you posted a couple of other
pieces of information.... BTW, if this is new code are you
considering donating it back? If so please open a JIRA so
we can track it, see: http://wiki.apache.org/solr/HowToContribute


But to your question:
First couple of things I'd do:
1> see what the admin/analysis page tells you happens.
2> attach &debug=query to your test case, see what the parsed
    query looks like.
3> use the admin/schema browser link for the field in question
   to see what actually makes it into the index. (Or use Luke or
   even the TermsComponent).

My bet is that 2 or 3 will show something unexpected which may
give you some clues.
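
For steps 2 and 3, the requests can be sketched roughly as below. This is
only an illustration: the host, the core name "collection1" and the field
name "text_no" are made-up placeholders, and the /terms handler is assumed
to be enabled in solrconfig.xml (it is in the stock example config, but
yours may differ).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DebugUrls {
    // Hypothetical values: replace with your own host, core and field.
    static final String BASE = "http://localhost:8983/solr/collection1";
    static final String FIELD = "text_no";

    // URL-encode a single query parameter value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 2: a query with debug=query, so the response shows the parsed query.
    static String debugQueryUrl(String term) {
        return BASE + "/select?q=" + enc(FIELD + ":" + term) + "&debug=query";
    }

    // Step 3: a TermsComponent request listing indexed terms with a given prefix.
    static String termsUrl(String prefix) {
        return BASE + "/terms?terms.fl=" + enc(FIELD)
                + "&terms.prefix=" + enc(prefix);
    }

    public static void main(String[] args) {
        System.out.println(debugQueryUrl("studenter"));
        System.out.println(termsUrl("student"));
    }
}
```

Comparing the terms listed for the prefix "student" with the parsed query
for "studenter" should show whether the index and the query agree on the
stem.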

Best,
Erick

On Tue, Jun 24, 2014 at 5:00 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
>
> I'm trying to create a Norwegian Lemmatizer based on a dictionary, but for
> some odd reason I don't get any search results even though the Analyzer in
> Solr Admin shows that it does the right thing. It works at query time if I
> have reindexed everything based on another stemmer, e.g.
> NorwegianMinimalStemmer.
>
> Here's a screenshot of how it lemmatizes the Norwegian word "studenter"
> (masculine indefinite noun, plural - English: "students"). The stem is
> "student". So far so good:
> http://folk.uio.no/erlendfg/solr/lemmatizer.png
>
> But I get no/few results if I search for "studenter" compared to "student".
> If I switch to solr.NorwegianMinimalStemFilterFactory in schema.xml at index
> time and reindex everything, it works as it should:
> <analyzer type="index">
>   <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>
>
> What is wrong with my TokenFilter and/or how can I debug this further? I
> have tried a lot of different things without any luck, for example decoding
> everything explicitly to UTF-8 (the wordlist is in ISO-8859-1, but I'm
> reading it properly by setting the correct character set) and trimming all
> the words, but neither helped. The byte sequence also seems to be correct
> for the stemmed word. My lemmatizer shows [73 74 75 64 65 6e 74], exactly
> the same as when I have configured NorwegianMinimalStemFilterFactory in
> schema.xml.
>
> Here's the source code of my lemmatizer. Please note that it is not
> finished:
> http://folk.uio.no/erlendfg/solr/
>
> Here's the line in my wordlist which contains the word "studenter":
> 66235   student studenter       subst mask appell fl ub normert 700     3
>
> The following line returns the stem (input is "studenter"):
> final String[] values = stemmer.stem(termAtt.buffer());
>
> The rest of the code is in NorwegianLemmatizerFilter. If several stems are
> returned, they are all added.
>
> Erlend
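
One thing that may be worth ruling out in the line quoted above, although
nothing in this thread confirms it is the cause: CharTermAttribute.buffer()
returns the whole backing char[], and only the first termAtt.length()
characters belong to the current token. When a buffer is reused, characters
from an earlier, longer token can remain after that point. The sketch below
simulates this with plain Java and no Lucene dependency; lookup() is a
hypothetical stand-in for the dictionary stemmer, whose code is not shown
here.

```java
public class BufferLengthDemo {
    // Hypothetical stand-in for the dictionary lookup done by the stemmer.
    static String lookup(String word) {
        return "studenter".equals(word) ? "student" : null;
    }

    // Simulates a reused term buffer: `previous` was stored first, then
    // `current` was copied over its start. Requires previous to be at
    // least as long as current.
    static char[] reusedBuffer(String previous, String current) {
        char[] buffer = previous.toCharArray();
        current.getChars(0, current.length(), buffer, 0);
        return buffer;
    }

    // Reads the entire backing array, including any stale trailing chars.
    static String readWholeBuffer(char[] buffer) {
        return new String(buffer);
    }

    // Reads only the valid region, as buffer() plus length() would allow.
    static String readValidRegion(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        String token = "studenter";
        char[] buffer = reusedBuffer("universitetet", token);

        // The whole array carries leftovers, so the lookup finds nothing.
        System.out.println(lookup(readWholeBuffer(buffer)));
        // Limiting the read to the token length finds the stem.
        System.out.println(lookup(readValidRegion(buffer, token.length())));
    }
}
```

In a real filter the equivalent would be to pass
new String(termAtt.buffer(), 0, termAtt.length()), or termAtt.toString(),
to the stemmer rather than the raw array.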
