Hello,

I need to work with an external stemmer, which is accessible as a COM
object. I managed to integrate this using the com4j library. I tried two
scenarios:
1. Create a custom FilterFactory and Filter class for this. The external
stemmer is then invoked once for every token.
2. Create a custom TokenizerFactory that invokes the external stemmer once
for the entire search text, puts the result into a StringReader, and
finally returns new WhitespaceTokenizer(stringReader), so the stemmed
text gets tokenized by the whitespace tokenizer.

Both scenarios appear to work from a functional point of view. The first
scenario, however, is too slow because of the overhead of calling the
external COM object once per token. The second scenario is much faster and
also gives correct search results. However, it causes problems with
highlighting: sometimes errors are reported (String out of Range), and in
other cases I get incorrect highlight fragments. I don't know all the
details of how highlighting works, but this makes sense to me: the
tokenizer sees the stemmed text rather than the original, so I guess the
token offsets no longer match the original text. Maybe my second scenario
is totally insane?
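
To illustrate what I suspect is happening, here is a minimal sketch in
plain Java. It uses no Lucene or Solr classes, and the stem() method is a
made-up stand-in for the COM stemmer (it only strips "ing" and "s" and maps
"mice" to "mouse"). It computes token offsets on the stemmed text and then
applies them to the original text, which is my assumption about what the
highlighter ends up doing:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetDrift {

    // Hypothetical stand-in for the external stemmer, for illustration only.
    static String stem(String word) {
        if (word.equals("mice")) return "mouse"; // a stem longer than the word
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);
        }
        if (word.endsWith("s") && word.length() > 3) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Scenario 2: stem the whole text first, one word at a time.
    static String stemText(String text) {
        StringBuilder sb = new StringBuilder();
        for (String w : text.split(" ")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(stem(w));
        }
        return sb.toString();
    }

    // Whitespace tokenization that records {start, end} offsets per token.
    static List<int[]> offsets(String text) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;
            if (i > start) out.add(new int[] {start, i});
        }
        return out;
    }

    // Cut a fragment out of the ORIGINAL text using the given offsets.
    static String fragment(String original, int[] span) {
        return original.substring(span[0], span[1]);
    }

    public static void main(String[] args) {
        String original = "running dogs barking";
        String stemmed = stemText(original);
        // Offsets come from the stemmed text but are applied to the original.
        for (int[] span : offsets(stemmed)) {
            System.out.println(span[0] + "-" + span[1]
                    + " -> '" + fragment(original, span) + "'");
        }
    }
}
```

For "running dogs barking" the stemmed text is "runn dog bark", and the
offsets taken from it select "runn", "ng " and "ogs " from the original,
which resembles the wrong fragments I see. With the single word "mice" the
stemmed token "mouse" ends at offset 5, past the end of the original, and
substring() throws a StringIndexOutOfBoundsException, which may be the
same kind of failure as the String out of Range errors.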

Any ideas on how to overcome this?

Cheers,

Jaco.
