On 24.06.14 17:33, Erick Erickson wrote:
Hmmm. It would help if you posted a couple of other
pieces of information.... BTW, if this is new code are you
considering donating it back? If so please open a JIRA so
we can track it, see: http://wiki.apache.org/solr/HowToContribute

All my other language improvements for the existing Norwegian stemmers have been donated back to Solr, so yes, if possible. I want to experiment a little bit before I open a ticket.

But to your question:
First couple of things I'd do:
1> see what the admin/analysis page tells you happens.

It shows correct results for both index and query. The lemmatizer is able to find the correct stem.

2> attach &debug=query to your test case, see what the parsed
     query looks like.

Seems to be OK. Remember that the problem is related to indexing, not querying. I have double-checked this by indexing all the documents with another stemmer and configuring my lemmatizer only for queries. Then everything works as it should. Here's the query. As you can see, "studentene" is stemmed to "student" for two fields (content_no and title_no), which is correct:

BoostedQuery(boost(+(title_en:studentene^10.0 | host:studentene^30.0 | content_en:studentene^0.1 | content_no:student^0.1 | title_no:student^10.0 | anchortext_partial:studentene^70.0 | subjectcode:studentene^100.0 | canonicalurl:studentene^5.0)~0.2 () () () () () (product(int(url_toplevel),const(5)))^20.0 (2.0/(1.0*float(int(url_levels))+1.0))^250.0 (product(float(docrank),const(10000)))^4.0 (1.0/(3.16E-11*float(ms(const(1403686863701),date(last_modified)))+1.0))^50.0 (product(int(url_landingpage),const(3)))^40.0,product(float(urlboost),map(query(language:no,def=0.0),0.0,0.0,1.0))))

3> use the admin/schema browser link for the field in question
    to see what actually makes it into the index. (Or use Luke or
    even the TermsComponent).

I haven't played around with this much, but it says "27" for "docs" if I select the field "content_no". Does this mean that there are only 27 documents in my index with data in this field? Then there is something really bad going on, because if I change to content_en, this number grows to 10526 (because another English stemmer is used for that field instead).

If I change to NorwegianMinimalStemFilter and reindex everything, the number grows to 28270.
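
To cross-check these per-field document counts outside the schema browser, one option is an open-ended range query with rows=0 and a look at numFound in the response. The following is only a minimal sketch that builds such request URLs; the host, port and core name (localhost:8983, collection1) are placeholders I made up, and the helper name is hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def field_doc_count_url(core_url: str, field: str) -> str:
    """Build a /select URL that matches every document with a term in `field`."""
    params = {
        "q": f"{field}:[* TO *]",  # open-ended range: any indexed value in the field
        "rows": 0,                  # only numFound is of interest, not the documents
        "wt": "json",
    }
    return f"{core_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical core URL; replace with the real one.
    core = "http://localhost:8983/solr/collection1"
    for name in ("content_no", "content_en"):
        print(field_doc_count_url(core, name))
```

Fetching each URL and comparing numFound for content_no and content_en should show the same kind of gap as the "docs" figures above, if the field really is mostly empty in the index.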

By writing out debugging info from my stemmer, I just figured out that only the documents' titles are being stemmed at index time, not the content itself. So I have found the root of the problem, but I'm not sure why the content field is omitted.

Erlend
