Re: Handling unbalanced datasets in Mahout text classification

Ted Dunning Mon, 27 May 2013 10:15:19 -0700

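Complementary Naive Bayes uses the negative signals to help in cases like
this: it estimates each class's term weights from the documents of all the
other classes, so a class with only a handful of examples still gets
weights backed by plenty of data.

See Rennie's papers.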

http://qwone.com/~jason/papers/sm-thesis.pdf
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_RennieSTK03.pdf
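
A minimal sketch of the complement idea from those papers, written in plain
Python rather than against Mahout's API. It leaves out the TF-IDF transforms
and weight normalization that the ICML paper also describes. The function
names, the toy documents, and the smoothing value alpha=1.0 are assumptions
made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_cnb(docs, alpha=1.0):
    """Train complement weights from docs, a list of (label, tokens) pairs.

    Returns {label: {term: log P(term | every class except label)}}.
    """
    per_class = defaultdict(Counter)
    for label, tokens in docs:
        per_class[label].update(tokens)

    # Term counts over the whole corpus; the complement of a class is
    # "whole corpus minus that class".
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    vocab = sorted(total)
    grand_total = sum(total.values())

    weights = {}
    for label, counts in per_class.items():
        complement_total = grand_total - sum(counts.values())
        denom = complement_total + alpha * len(vocab)
        weights[label] = {
            term: math.log((total[term] - counts[term] + alpha) / denom)
            for term in vocab
        }
    return weights


def classify_cnb(weights, tokens):
    """Pick the label whose complement explains the document worst."""
    counts = Counter(tokens)

    def complement_score(label):
        w = weights[label]
        # Terms never seen in training are ignored.
        return sum(n * w[t] for t, n in counts.items() if t in w)

    return min(weights, key=complement_score)


# Toy unbalanced corpus: two large classes and one class with one document.
docs = (
    [("SIG1", ["invoice", "payment", "due"])] * 6
    + [("SIG2", ["shipping", "delay", "order"])] * 5
    + [("LESS1", ["refund", "request"])]
)
weights = train_cnb(docs)
print(classify_cnb(weights, ["refund", "request"]))   # LESS1
print(classify_cnb(weights, ["invoice", "payment"]))  # SIG1
```

The weights for LESS1 here are estimated from the 11 documents that are not
LESS1, instead of from the single document that is, which is why the rare
class is less starved for data than it would be in standard Naive Bayes.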



On Sun, May 26, 2013 at 10:22 PM, Chandra Mohan, Ananda Vel Murugan <
[email protected]> wrote:

> Hi,
>
> Thanks for the suggestion. Can you please elaborate on how I should
> implement this approach? Assume I have two significant labels, "SIG1" and
> "SIG2", and three less significant labels, "LESS1", "LESS2" and "LESS3".
> How should I re-label my datasets?
>
> The approach I have been following so far is two-step training: the first
> step trains on the significant labels and the second on the less
> significant labels. This results in two models. When I get real production
> data for classification, I compute the probability using both models and
> assign the final label based on those probability values. Do you think
> this approach has any issues? Can I augment this approach with your
> suggestion of negative training samples? Please advise.
>
> Regards,
> Anand.C
>
> -----Original Message-----
> From: Suneel Marthi [mailto:[email protected]]
> Sent: Monday, May 27, 2013 10:29 AM
> To: [email protected]
> Subject: Re: Handling unbalanced datasets in Mahout text classification
>
> You could use some of the examples from the categories that make up 80% of
> the data as negative training examples for the categories that lack
> sufficient training data.
>
> ________________________________
> From: "Chandra Mohan, Ananda Vel Murugan" <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Monday, May 27, 2013 12:50 AM
> Subject: Handling unbalanced datasets in Mahout text classification
>
> Hi,
>
> I am using the Naïve Bayes implementation in Mahout for text
> classification. My training dataset is very unbalanced: it has 200000
> training examples in 121 categories. Only a few categories are
> predominant, and they constitute almost 80% of the dataset. The remaining
> 100+ categories have very few examples each; some contain just 3-4
> examples. How should I handle unbalanced datasets in Mahout? Please
> suggest.
>
> Regards,
> Anand.C
>
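
On the re-labeling question in the quoted messages: one possible reading of
the negative-examples suggestion is a one-vs-rest setup, where each sparse
label gets its own binary training set built from its few positives plus a
sample of everything else. The sketch below is plain Python and is only an
assumption about what was meant; the helper name, the "NOT_" label prefix,
and the 3:1 negative-to-positive ratio are hypothetical choices, not
anything Mahout prescribes.

```python
import random


def relabel_one_vs_rest(docs, target, negatives_per_positive=3, seed=0):
    """Build a binary training set for one sparse label.

    docs is a list of (label, tokens) pairs. Documents labeled `target`
    keep their label; a random sample of all other documents is relabeled
    "NOT_<target>" and used as negative examples.
    """
    positives = [(target, tokens) for label, tokens in docs if label == target]
    others = [tokens for label, tokens in docs if label != target]

    # Cap the negatives so the dominant classes do not swamp the positives.
    k = min(len(others), negatives_per_positive * len(positives))
    rng = random.Random(seed)
    negatives = [("NOT_" + target, tokens) for tokens in rng.sample(others, k)]
    return positives + negatives


docs = (
    [("SIG1", ["invoice", "payment"])] * 10
    + [("SIG2", ["shipping", "delay"])] * 8
    + [("LESS1", ["refund", "request"])] * 2
)
binary_set = relabel_one_vs_rest(docs, "LESS1")
print(len(binary_set))  # 8: 2 positives and 6 sampled negatives
```

Each such binary set would then be used to train a separate model for that
label, with the sampling ratio tuned on held-out data.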
