Hello,

I need to work with an external stemmer in Solr. This stemmer is accessible
as a COM object (Solr runs in Tomcat on Windows). I managed to integrate it
using the com4j library. I tried two scenarios:
1. Create a custom FilterFactory and Filter class. The external stemmer is
then invoked once for every token.
2. Create a custom TokenizerFactory that invokes the external stemmer once
for the entire search text, puts the result into a StringReader, and finally
returns new WhitespaceTokenizer(stringReader), so the stemmed text gets
tokenized by the whitespace tokenizer.

Looking at the search results, both scenarios appear to work from a
functional point of view. The first scenario, however, is too slow because
of the overhead of calling the external COM object for each token.
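
To illustrate one way the per-token cost of the first scenario might be
reduced, here is a minimal sketch of a small LRU cache in front of the
stemmer, so repeated tokens do not trigger a new COM call. It is plain Java
and not tied to Solr. The class name CachingStemmer, the maxEntries
parameter and the UnaryOperator that stands in for the com4j call are all
hypothetical, and whether caching helps enough depends on how often tokens
repeat.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical wrapper that memoizes an expensive per-token stemmer call. */
final class CachingStemmer {
    // Stands in for the external COM stemmer reached through com4j (assumption).
    private final UnaryOperator<String> delegate;
    private final Map<String, String> cache;
    private int delegateCalls = 0;

    CachingStemmer(UnaryOperator<String> delegate, final int maxEntries) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap drops the least recently used entry
        // once the cache grows beyond maxEntries.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached stem, calling the delegate only on a cache miss. */
    synchronized String stem(String token) {
        String stemmed = cache.get(token);
        if (stemmed == null) {
            stemmed = delegate.apply(token);
            delegateCalls++;
            cache.put(token, stemmed);
        }
        return stemmed;
    }

    /** Number of times the expensive delegate was actually invoked. */
    synchronized int delegateCalls() {
        return delegateCalls;
    }
}
```

A custom Filter could hold one shared instance of such a wrapper and call
stem(...) for each token instead of calling the COM object directly.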

The second scenario is much faster and also gives correct search results.
However, it causes problems with highlighting: sometimes errors are reported
(String out of Range), and in other cases I get incorrect highlight
fragments. I don't know all the details here, but this makes sense because
the text is changed before it is tokenized (I guess the token positions no
longer match the original text). Maybe my second scenario is totally insane?
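
To make the suspected position problem concrete, here is a minimal sketch of
a variant of the second scenario that still calls the stemmer only once, but
takes the offsets from the original text rather than from the stemmed text.
It is plain Java, not Solr or Lucene code. The names
OffsetPreservingStemming, Token and analyze are hypothetical, and it assumes
the stemmer returns exactly one whitespace-separated word per input word, in
the same order, which may not hold for the real COM stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: one batch stemmer call, offsets kept from the original text. */
final class OffsetPreservingStemming {

    /** A stemmed term plus the character span it covers in the ORIGINAL text. */
    static final class Token {
        final String term;
        final int start; // inclusive
        final int end;   // exclusive

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    static List<Token> analyze(String original, UnaryOperator<String> batchStemmer) {
        // Step 1: split the original text on whitespace and remember each span.
        List<int[]> spans = new ArrayList<>();
        StringBuilder joined = new StringBuilder();
        int i = 0;
        int n = original.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(original.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(original.charAt(i))) i++;
            if (i > start) {
                spans.add(new int[] {start, i});
                if (joined.length() > 0) joined.append(' ');
                joined.append(original, start, i);
            }
        }
        List<Token> tokens = new ArrayList<>();
        if (spans.isEmpty()) return tokens;

        // Step 2: a single call to the (external) stemmer for the whole text.
        String[] stems = batchStemmer.apply(joined.toString()).trim().split("\\s+");

        // Step 3: the pairing is only valid if the word count is unchanged (assumption).
        if (stems.length != spans.size()) {
            throw new IllegalStateException("stemmer changed the number of words");
        }
        for (int k = 0; k < stems.length; k++) {
            int[] span = spans.get(k);
            tokens.add(new Token(stems[k], span[0], span[1]));
        }
        return tokens;
    }
}
```

In a real tokenizer, the start and end values would be the offsets reported
for each token, so that a highlighter working on the stored original text
cuts fragments at the right places.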

Any ideas on how to overcome this, or any other suggestions on how to
realise this?

Cheers,

Jaco.

PS: I posted this message yesterday, but it didn't come through, so this is
the second try.
