Hi all.
I am trying to use Mahout for terminology extraction:
I have ~140 classes, each of which contains ~100 text documents.
The class categories are distinct but may overlap a bit.
I want to extract terms related to each class label. For example, for a
"dogs" category,
the terms "canine", "German Shepherd", and "bone" may be related to the
category.
What I have come up with so far is:
1. Learn a classifier using Mahout.
2. Look at the classifier's term weights - terms with high weights for a
class are candidates for representing that category.
I currently use only Naive Bayes, with ng=1 (unigrams only).
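To make step 2 concrete, here is a rough sketch of the kind of ranking I have in mind, in plain Python rather than Mahout's API. It scores each unigram by the smoothed log-odds log P(term | class) - log P(term | all other classes), which is one simple way to read Naive Bayes-style weights; the weights Mahout actually stores may be computed and normalized differently. The function name, the `alpha` smoothing parameter, and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def top_terms_per_class(docs_by_class, k=3, alpha=1.0):
    """Rank unigrams per class by smoothed log-odds against all other classes.

    docs_by_class: dict mapping a class label to a list of raw text documents.
    k: number of terms to return per class.
    alpha: additive (Laplace) smoothing constant; an illustrative choice.
    """
    # Term counts per class, using naive lowercase whitespace tokenization.
    counts = {
        label: Counter(tok for doc in docs for tok in doc.lower().split())
        for label, docs in docs_by_class.items()
    }

    # Corpus-wide term counts and vocabulary size.
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    vocab = sorted(total)
    v = len(vocab)
    n_total = sum(total.values())

    result = {}
    for label, cnt in counts.items():
        n_in = sum(cnt.values())   # tokens inside the class
        n_out = n_total - n_in     # tokens in every other class
        scores = {}
        for term in vocab:
            p_in = (cnt[term] + alpha) / (n_in + alpha * v)
            p_out = (total[term] - cnt[term] + alpha) / (n_out + alpha * v)
            scores[term] = math.log(p_in) - math.log(p_out)
        # Highest log-odds first; ties broken alphabetically for stable output.
        ranked = sorted(vocab, key=lambda t: (-scores[t], t))
        result[label] = ranked[:k]
    return result


if __name__ == "__main__":
    toy = {
        "dogs": ["the dog chewed a bone", "a canine loves the bone"],
        "cats": ["the cat chased a mouse", "a cat loves the yarn"],
    }
    for label, terms in top_terms_per_class(toy).items():
        print(label, terms)
```

Scoring against the other classes, instead of using the raw per-class probability, keeps frequent words such as "the" from ranking highly in every category. Note that with unigrams only, a two-word term like "German Shepherd" can never appear as a single candidate.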
My questions are:
a. Is this a reasonable approach to the problem at hand? Or does Mahout
have a better algorithm for this?
b. Which Mahout classifier is best for this? I chose Naive Bayes first
because its parameters have a simple interpretation.
Which other (stronger) classifiers also have this property?
TIA,
Yuval
