Re: TokenFilter not working at index time

Erlend Garåsen Thu, 26 Jun 2014 04:36:34 -0700

I found the root of the problem. This is very strange, but I guess someone can explain to me why this happens.


Take a look at the static block in my factory:
http://folk.uio.no/erlendfg/solr/NorwegianLemmatizerFilterFactory.java

static {
  ...
}

If I remove this block and return a dummy stem string in my NorwegianLemmatizer class instead, ALL the tokens are affected, not only those belonging to the title field:

--8<--
// final String stem = wordlist.get(word.trim());
final String stem = "tmp";
--8<--

To summarize my previous post: my stemmer works, but only for the title field. The content field is not affected.

So it is probably a bad idea to have such a static block in the factory class. The reason why I added it was to populate the hash table, which is also static.
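
For comparison, here is a minimal sketch of the alternative: the word list is kept in an instance field and loaded explicitly, with no static block and no static hash table. Everything in it is hypothetical (the class name WordlistSketch, the load() method, the tab-separated "word<TAB>stem" format) and it is not the code from the factory linked above. In a Solr factory the load would presumably happen per instance, for example from inform(ResourceLoader) if the factory implements ResourceLoaderAware, but that wiring is an assumption and is not shown. Whether this would make the content field behave is untested.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the wordlist lookup; not the actual NorwegianLemmatizer code. */
final class WordlistSketch {
    // Instance state instead of a static hash table filled from a static block.
    private final Map<String, String> wordlist;

    private WordlistSketch(Map<String, String> wordlist) {
        this.wordlist = wordlist;
    }

    /** Parses "word<TAB>stem" lines (an assumed format) into a read-only map. */
    static WordlistSketch load(Reader source) throws IOException {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    map.put(parts[0].trim(), parts[1].trim());
                }
            }
        }
        return new WordlistSketch(Collections.unmodifiableMap(map));
    }

    /** Returns the stem, or the trimmed word itself when it is not in the list. */
    String stem(String word) {
        String key = word.trim();
        String stem = wordlist.get(key);
        return stem != null ? stem : key;
    }

    public static void main(String[] args) throws IOException {
        WordlistSketch lemmatizer = load(new StringReader("studentene\tstudent\n"));
        System.out.println(lemmatizer.stem("studentene")); // prints "student"
        System.out.println(lemmatizer.stem("ukjent"));     // prints "ukjent"
    }
}
```

With this shape, a failure to load the list surfaces as an IOException at the point of the call rather than inside class initialization.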


Erlend

On 25.06.14 15:53, Erlend Garåsen wrote:

On 24.06.14 17:33, Erick Erickson wrote:

Hmmm. It would help if you posted a couple of other
pieces of information.... BTW, if this is new code are you
considering donating it back? If so please open a JIRA so
we can track it, see: http://wiki.apache.org/solr/HowToContribute


All my other language improvements for the existing Norwegian stemmers
have been donated back to Solr, so yes, if possible. I want to
experiment a little bit before I open a ticket.

But to your question:
First couple of things I'd do:
1> see what the admin/analysis page tells you happens.


Shows correct results for index and query. The lemmatizer is able to
find the correct stem.

2> attach &debug=query to your test case, see what the parsed
     query looks like.


Seems to be OK. Remember that the problem is related to indexing, not
querying. I have double-checked by indexing all the documents by another
stemmer and configured my lemmatizer only for queries. Then everything
works as it should. Here's the query. As you can see, "studentene" is
stemmed to "student" for two fields (content_no and title_no) which is
correct:

BoostedQuery(boost(+(title_en:studentene^10.0 | host:studentene^30.0 |
content_en:studentene^0.1 | content_no:student^0.1 |
title_no:student^10.0 | anchortext_partial:studentene^70.0 |
subjectcode:studentene^100.0 | canonicalurl:studentene^5.0)~0.2 () () ()
() () (product(int(url_toplevel),const(5)))^20.0
(2.0/(1.0*float(int(url_levels))+1.0))^250.0
(product(float(docrank),const(10000)))^4.0
(1.0/(3.16E-11*float(ms(const(1403686863701),date(last_modified)))+1.0))^50.0
(product(int(url_landingpage),const(3)))^40.0,product(float(urlboost),map(query(language:no,def=0.0),0.0,0.0,1.0))))

3> use the admin/schema browser link for the field in question
    to see what actually makes it into the index. (Or use Luke or
    even the TermsComponent).


I haven't played around with this much, but it says "27" for "docs" if I
select the field "content_no". Does this mean that there are only 27
documents in my index with data in this field? Then there is something
really bad going on, because if I change to content_en, this number
grows to 10526 (because another English stemmer is used for that field
instead).

If I change to NorwegianMinimalStemFilter and reindex everything, the
number grows to 28270.

By writing out debugging info from my stemmer, I just figured out that
only the document's titles are being stemmed at index time, not the
content itself. So I have found the root of the problem, but I'm not
sure why the field is omitted.

Erlend
