Hi, 

Thanks for the suggestion. Could you please elaborate on how I should implement 
this approach? Assume I have two significant labels, "SIG1" and "SIG2", and 
three less significant labels, "LESS1", "LESS2", and "LESS3". How should I 
re-label my datasets?

The approach I have been following so far is two-step training: the first step 
trains on the significant labels and the second step on the less significant 
labels. This produces two models, and when real production data arrives for 
classification, I compute the probability using both models and assign the 
final label based on those probability values. Do you think this approach has 
any issues? Can I augment it with your suggestion of negative training 
samples? Please advise.

Regards,
Anand.C

-----Original Message-----
From: Suneel Marthi [mailto:[email protected]] 
Sent: Monday, May 27, 2013 10:29 AM
To: [email protected]
Subject: Re: Handling unbalanced datasets in Mahout text classification

You could use some of the 80% datasets as negative training examples for the 
ones that lack sufficient training data. 
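One way this could be done, sketched below purely as an illustration (these are
not Mahout API calls, and the "OTHER" label name is an assumption): sample some
documents from the well-represented categories and relabel them with a
catch-all negative class, so the model trained on the sparse categories also
sees examples of what its labels do NOT look like.

```python
import random

def add_negative_examples(sparse_data, dense_data, n_negatives, seed=42):
    """sparse_data / dense_data: lists of (label, text) pairs.

    Returns sparse_data plus n_negatives documents sampled from the
    dense (majority) categories, relabeled as the catch-all 'OTHER'
    class. The fixed seed just keeps the sketch reproducible."""
    rng = random.Random(seed)
    negatives = [("OTHER", text)
                 for _, text in rng.sample(dense_data, n_negatives)]
    return sparse_data + negatives
```

The augmented list would then be fed to whatever training step builds the
model for the under-represented labels.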




________________________________
 From: "Chandra Mohan, Ananda Vel Murugan" <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Monday, May 27, 2013 12:50 AM
Subject: Handling unbalanced datasets in Mahout text classification
 

Hi,

I am using the Naïve Bayes algorithm implementation in Mahout for text 
classification. My training dataset is very unbalanced: it has 121 categories 
and about 200,000 training samples. Only a few categories are predominant, 
and together they constitute almost 80% of the dataset. The remaining 100+ 
categories have very little data; some contain just 3-4 samples. How do I 
handle unbalanced datasets in Mahout? Please suggest.

Regards,
Anand.C
