Regarding stemmers, I ditched them altogether a long time ago in favor of a dictionary of the morphologies of all known words (for any given language). Looking up any word form in it returns the full set of forms for that word, including the correct stem.
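To give a rough idea, here is a minimal Python sketch of that kind of lookup. The tiny MORPHOLOGY table and the names build_index and lookup are made up for this example and are not taken from any particular library. A real dictionary would be far larger and would also have to deal with forms that belong to more than one stem.

```python
# Hypothetical toy data: each stem mapped to its known surface forms.
MORPHOLOGY = {
    "run": ["runs", "running", "ran"],
    "goose": ["geese"],
}


def build_index(morphology):
    """Map every lower-cased word form to (stem, set of all forms)."""
    index = {}
    for stem, forms in morphology.items():
        # The stem itself counts as one of the forms.
        entry = (stem, frozenset(forms) | {stem})
        for form in entry[1]:
            index[form.lower()] = entry
    return index


def lookup(word, index):
    """Return (stem, forms) for a known word form, or None if unknown."""
    return index.get(word.lower())


INDEX = build_index(MORPHOLOGY)

result = lookup("Ran", INDEX)
if result is not None:
    stem, forms = result
    print(stem, sorted(forms))
```

Irregular forms such as "ran" or "geese" need no special rules here, since they are simply entries in the table. The trade-off is that a word missing from the dictionary gets no stem at all.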
Works great. 100% of the time. Just a tip from me.

On Mon, 2010-04-19 at 00:36 -0800, MitchK wrote:
> Andy, I think it is important to know what a stemmer really is.
>
> It reduces words to their infinitives. Those infinitives are not always the
> real infinitive; however, for the system it is an infinitive,
> since all its derived forms can be reduced to the same form.
> That's a stemmer.
>
> According to this, there can't be one stemmer that works for every language,
> because every language has its own rules for how to reduce a word to its
> infinitive.
>
> If you apply a stemmer for English to a German document, the
> results might be unexpected. However, sometimes it still works well enough.
>
> Keep in mind that this is an algorithm. It is not important whether the
> created infinitive is the real infinitive. It is only important that most of
> the derived forms can be reduced to the same basic form. Please ask if
> something is not clear.
>
> KStem:
> The wiki[1] says that KStem is less aggressive than the standard stemmer.
> I guess this means that there are more rules for how to reduce a word
> to its infinitive, and that the results might therefore be better.
>
>
> [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
>
> Kind regards
> - Mitch