I have a few questions about classification in Mahout:
1) It is said that SGD is suited to small data sets and Naive Bayes to medium-sized ones. I have also learned that the Bayes classifier is meant for textual data as opposed to continuous data.
I am classifying around 10,000 news articles, all of which are textual (no continuous variables are used for the classification). In my opinion the data set is small, so should I use SGD or Naive Bayes, given that the data is textual?
2) Since multi-label classification is not supported in Mahout, I generated around 4,000 binary models using SGD. That way I know whether a particular news item belongs to one or more classes (e.g. "Apple sues Samsung" belongs to both "Apple" and "Samsung").
The problem I am facing is performance. It takes around 4 minutes to classify a single news item. The time does not grow linearly with the number of items (it does not take 8 minutes for 2 items, 12 minutes for 3, and so on), but I still want to improve the throughput. What I am doing is loading one model, classifying all N news items against it, then loading the next model and classifying the same N items again. With this approach it takes about 16 minutes to classify 1,000 news items of 75-100 words each. Is there a way to improve this further? (One option I am considering is using Hadoop for the classification task; is that possible?)
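To make that concrete, the batching loop has roughly this shape. The model paths and the pre-encoded news vectors are placeholders, and I am assuming ModelSerializer.readBinary is the right way to load a saved OnlineLogisticRegression (that is how I understood Mahout in Action):

    import java.io.FileInputStream;
    import java.util.List;

    import org.apache.mahout.classifier.sgd.ModelSerializer;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class BatchClassifySketch {
        // For each of the ~4,000 class models: load it once, score all N news
        // vectors against it, then move on to the next model.
        public static void classifyAll(List<String> modelPaths,
                                       List<Vector> newsVectors) throws Exception {
            for (String modelPath : modelPaths) {
                OnlineLogisticRegression model = ModelSerializer.readBinary(
                        new FileInputStream(modelPath), OnlineLogisticRegression.class);
                for (Vector news : newsVectors) {
                    double score = model.classifyScalar(news);
                    // store (class, news item, score), e.g. back into the database
                }
            }
        }
    }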
3) Where can I find good code examples or tutorials for training and testing a Mahout Naive Bayes classifier? There are lots of examples on the net, but they all use the command line. I need the Java code for Naive Bayes because my data set lives in a database rather than in files, and the command-line tools read the data set from files. The Mahout in Action book gives a good walkthrough of the SGD code, but not of Naive Bayes.
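For what it is worth, the closest I have pieced together for the classification side is something like the sketch below, based on the Mahout javadoc; the class and method names here are my assumptions and may not match the version I end up using:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
    import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
    import org.apache.mahout.math.Vector;

    public class NaiveBayesSketch {
        // Load a model that was trained elsewhere (e.g. by TrainNaiveBayesJob)
        // and score one already-vectorized document against it.
        public static Vector classify(String modelDir, Vector document) throws Exception {
            Configuration conf = new Configuration();
            NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelDir), conf);
            StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
            // classifyFull returns one score per label; the highest score wins
            return classifier.classifyFull(document);
        }
    }

What I still cannot find is code for the training side, in particular how to build the training vectors directly from my database rows instead of from SequenceFiles on disk.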
Thanks!