AHMET ARSLAN wrote:
If I analyse this field type in analysis.jsp, the following
are the results:
If I give "running", it stems the word to "run", which is fine.
If I give "machine", why does it stem to "machin"? Where does
this word come from?
If I give "revolutionary", it stems to "revolutionari"; I
thought it should stem to "revolution".
Stemmers used in Information Retrieval are not for human consumption. Reducing
"revolutionary" to "revolutionari" does not change the fact that the query
"revolutionary" will return documents containing "revolutionary", as long as the
same stemmer is applied at index time and at query time.
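
To illustrate the point, here is a minimal sketch in Python. The function toy_stem is a hypothetical stand-in that only imitates the two outputs quoted above; it is not the Snowball algorithm, and the tiny inverted index is not how Solr stores terms. It only shows that an "ugly" stem is harmless when both sides use the same function.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer (NOT the Snowball algorithm)."""
    word = word.lower()
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"   # revolutionary -> revolutionari
    if word.endswith("e") and len(word) > 2:
        return word[:-1]         # machine -> machin
    return word


def build_index(docs):
    """Map each stemmed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query with the same function that was used for indexing."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "a revolutionary machine",
    2: "the old machine",
    3: "nothing relevant here",
}
index = build_index(docs)

print(search(index, "revolutionary"))  # [1]
print(search(index, "machine"))        # [1, 2]
```

The index never contains the surface form "revolutionary", only "revolutionari", yet the query still finds document 1 because it goes through the same reduction.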
How does stemming work?
Does it reduce adverbs to verbs, etc., or do we have to
customize it?
Stemmers aim to remove inflectional suffixes from words. Snowball stemmers are
rule-based stemmers: rules and endings are defined, e.g. if a word ends in "s",
remove it.
apples -> apple
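
A rule-based suffix stripper of this kind can be sketched in a few lines of Python. The rule table below is loosely modeled on the plural-handling step of the classic Porter stemmer; it is an illustration only, not the code of any Snowball stemmer, and the names RULES and strip_suffix are made up for this example.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins,
# so longer suffixes must come before shorter ones.
RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "i"),    # ponies  -> poni
    ("ss", "ss"),    # glass   -> glass (protects "ss" from the next rule)
    ("s", ""),       # apples  -> apple
]


def strip_suffix(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["apples", "classes", "ponies", "glass", "run"]:
    print(w, "->", strip_suffix(w))
```

Note that "ponies" becomes "poni", which is not an English word. The rules only look at character endings and have no notion of what a valid word is, which is also why outputs like "machin" appear.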
Stemmers should actually return the stem of the word, which is neither an
adverb nor a verb (though it might coincide with one in some cases). This can
involve more than removing a suffix: it also includes prefixes, umlaut, word
concatenation, etc.
E.g. (I'm German, so my English examples might not be completely
correct, sorry.)
- the stem of "feet" is probably "foot-" (umlaut), which might also be
the stem of "barefoot" (concatenation or prefix?).
- however, the stem of "went" ("I went") is NOT "go-". Very common and
old verbs use, for some of their tenses, forms that originated from other
words (word stems). ("is" and "be" look like different stems as well.)
Citing http://en.wikipedia.org/wiki/Word_stem
"""
Some paradigms do not make use of the same stem throughout; this
phenomenon is called suppletion. An example of a suppletive paradigm is
the paradigm for the adjective good: its stem changes from good to the
bound morpheme bet-.
good (positive); better (comparative); best (superlative)
"""
IMHO, the stem of "revolutionary" should be revol- or something like
that. This would also cover "revolting". But this is probably not what
you want counted as an occurrence of the same term in your search
index.
Lemmatization might be preferred for search engines, as it takes
semantics (the meaning) into account.
http://en.wikipedia.org/wiki/Lemmatisation
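
As a rough illustration of the difference, a lemmatizer can be thought of as a lookup into a lexicon rather than a set of suffix rules. The sketch below uses a tiny hand-written dictionary (LEMMAS, an assumption made up for this example); real lemmatizers rely on large lexicons and usually on part-of-speech information as well.

```python
# Hypothetical mini-lexicon mapping irregular surface forms to their lemma.
LEMMAS = {
    "feet": "foot",
    "went": "go",
    "better": "good",
    "best": "good",
}


def lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["feet", "went", "best", "machine"]:
    print(w, "->", lemmatize(w))
```

A lookup like this can map "went" to "go" or "best" to "good", which no amount of suffix stripping will achieve, but it only works for words that are actually in the lexicon.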
Cheers,
Chantal
It will be difficult to customize the existing Snowball stemmers, I guess.
If you are looking for a less aggressive stemmer, then you can use KStem.
--
Chantal Ackermann