AHMET ARSLAN wrote:
If I analyse this field type in analysis.jsp, these are the results:
If I give "running", it stems the word to "run", which is fine.
If I give "machine", why does it stem to "machin"? Where does this
word come from?
If I give "revolutionary", it stems to "revolutionari"; I thought it
should stem to "revolution".

Stemmers used in Information Retrieval are not for human consumption. Reducing 
"revolutionary" to "revolutionari" does not change the fact that the query 
"revolutionary" will return documents containing "revolutionary", as long as the 
same stemmer is applied at index time and at query time.
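
To illustrate why the odd-looking stem does no harm, here is a minimal Python sketch. It is not Solr code: toy_stem, build_index, search and the sample documents are all made up for illustration, and the rules are far cruder than a real Snowball stemmer. The only point is that documents and queries pass through the same function, so they meet at the same stem.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical toy stemmer, much cruder than Snowball."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("y"):
        return word[:-1] + "i"
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("e") or word.endswith("s"):
        return word[:-1]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query with the same function used at index time."""
    return index.get(toy_stem(query), set())


docs = {
    1: "a revolutionary machine",
    2: "two revolutionaries",
    3: "many machines",
}
index = build_index(docs)

print(toy_stem("revolutionary"))               # revolutionari
print(sorted(search(index, "revolutionary")))  # [1, 2]
print(sorted(search(index, "machine")))        # [1, 3]
```

Nobody ever sees "revolutionari" or "machin" except in analysis.jsp; they are just internal keys of the index.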

How does stemming work?
Does it reduce adverbs to verbs etc., or do we have to
customize it?

Stemmers aim to remove inflectional suffixes from words. Snowball stemmers are 
rule-based stemmers: rules and endings are defined, e.g. if a word ends in "s", 
remove it: apples -> apple
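
As a rough sketch of what "rules and endings" means, the Python snippet below applies an ordered list of suffix rules modelled on step 1a of the original Porter algorithm. The names RULES and strip_plural are made up for this example, and a real Snowball stemmer runs many more steps with extra conditions, so treat this as an illustration only.

```python
# Ordered (suffix, replacement) pairs; the first matching suffix wins,
# so longer suffixes must come before shorter ones.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def strip_plural(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["apples", "caresses", "ponies", "caress", "run"]:
    print(w, "->", strip_plural(w))
```

The "ss" -> "ss" rule looks pointless, but it stops the final "s" rule from turning "caress" into "cares".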

Stemmers should actually return the stem of the word, which is neither an adverb nor a verb (though it might coincide with one in some cases). This can involve more than removing a suffix: it also covers prefixes, umlaut, word concatenation etc. E.g. (I'm German, so my English examples might not be completely correct, sorry.)
- The stem of "feet" is probably "foot-" (umlaut), which might also be the stem of "barefoot" (concatenation or prefix?).
- However, the stem of "went" ("I went") is NOT "go-". Very common, old verbs often have past/future tense forms that originated from other words (word stems). ("is" and "be" look like different stems as well.)

Citing http://en.wikipedia.org/wiki/Word_stem
"""
Some paradigms do not make use of the same stem throughout; this phenomenon is called suppletion. An example of a suppletive paradigm is the paradigm for the adjective good: its stem changes from good to the bound morpheme bet-.
good (positive); better (comparative); best (superlative)
"""

IMHO, the stem of "revolutionary" should be "revol-" or something like that. This would also cover "revolting". But that is probably not what you want counted as occurrences of the same term in your search index.

Lemmatization might be preferred for search engines, as it takes semantics (the meaning) into account.

http://en.wikipedia.org/wiki/Lemmatisation
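
To make the difference concrete, here is a tiny Python sketch of dictionary-based lemmatization. The table IRREGULAR_LEMMAS and the function lemmatize are hypothetical and hold only a handful of hand-picked entries; a real lemmatizer needs a full lexicon and usually the part of speech as well.

```python
# Hypothetical lookup table for irregular (suppletive or umlaut) forms.
IRREGULAR_LEMMAS = {
    "feet": "foot",
    "went": "go",
    "better": "good",
    "best": "good",
    "is": "be",
}


def lemmatize(word):
    """Return the dictionary form if known, else the lowercased word."""
    lower = word.lower()
    return IRREGULAR_LEMMAS.get(lower, lower)


for w in ["feet", "went", "best", "Machine"]:
    print(w, "->", lemmatize(w))
```

A suffix-stripping stemmer cannot map "went" to "go" because the two share no letters; a lookup like this can, at the cost of maintaining the dictionary.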


Cheers,
Chantal



It will be difficult to customize the existing Snowball stemmers, I guess.
If you are looking for a less aggressive stemmer, you can use KStem.




--
Chantal Ackermann
