I have a few questions about classification in Mahout:
1) It is said that SGD is suited to small data sets and Naive Bayes to medium-sized ones. I have also learned that the Bayes classifier is meant for textual data as opposed to continuous data.
I am classifying around 10,000 news articles, all of which are textual (no continuous variables are used for the classification). In my opinion the data set is small, so should I use SGD or Naive Bayes, given that the data is textual?
2) Since multi-label classification is not supported in Mahout, I generated around 4,000 binary models using SGD. That way I know whether a particular news item belongs to one or more classes (e.g. "Apple sues Samsung" belongs to both "Apple" and "Samsung").
The problem I am facing is performance. It takes around 4 minutes to classify a single news item. The time does not grow linearly with the number of items (it does not take 8 minutes for 2 items, 12 minutes for 3, and so on), but I still want to improve the throughput. What I am doing is loading one model, classifying all N news items against it, then loading the next model and classifying the same N items again. With this approach it takes about 16 minutes to classify 1,000 news items of 75-100 words each. Is there a way to improve this further? (One option I am considering is using Hadoop for the classification task; is that possible?)
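To make that concrete, the batching loop has roughly this shape. The model paths and the pre-encoded news vectors are placeholders, and I am assuming ModelSerializer.readBinary is the right way to load a saved OnlineLogisticRegression (that is how I understood Mahout in Action):

    import java.io.FileInputStream;
    import java.util.List;

    import org.apache.mahout.classifier.sgd.ModelSerializer;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class BatchClassifySketch {
        // For each of the ~4,000 class models: load it once, score all N news
        // vectors against it, then move on to the next model.
        public static void classifyAll(List<String> modelPaths,
                                       List<Vector> newsVectors) throws Exception {
            for (String modelPath : modelPaths) {
                OnlineLogisticRegression model = ModelSerializer.readBinary(
                        new FileInputStream(modelPath), OnlineLogisticRegression.class);
                for (Vector news : newsVectors) {
                    double score = model.classifyScalar(news);
                    // store (class, news item, score), e.g. back into the database
                }
            }
        }
    }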
3) Where can I find good code examples or tutorials for training and testing a Mahout Naive Bayes classifier? There are lots of examples on the net, but they all use the command line. I need the Java code for Naive Bayes because my data set lives in a database rather than in files, and the command-line tools read the data set from files. The Mahout in Action book gives a good walkthrough of the SGD code, but not of Naive Bayes.
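For what it is worth, the closest I have pieced together for the classification side is something like the sketch below, based on the Mahout javadoc; the class and method names here are my assumptions and may not match the version I end up using:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
    import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
    import org.apache.mahout.math.Vector;

    public class NaiveBayesSketch {
        // Load a model that was trained elsewhere (e.g. by TrainNaiveBayesJob)
        // and score one already-vectorized document against it.
        public static Vector classify(String modelDir, Vector document) throws Exception {
            Configuration conf = new Configuration();
            NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelDir), conf);
            StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
            // classifyFull returns one score per label; the highest score wins
            return classifier.classifyFull(document);
        }
    }

What I still cannot find is code for the training side, in particular how to build the training vectors directly from my database rows instead of from SequenceFiles on disk.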
Thanks!