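Complementary Naive Bayes helps in cases like this. It uses the negative signals: each class's weights are estimated from the documents that do not belong to that class, so even a category with only a handful of examples gets estimates backed by plenty of data.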
See Rennie's papers.

http://qwone.com/~jason/papers/sm-thesis.pdf
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_RennieSTK03.pdf

On Sun, May 26, 2013 at 10:22 PM, Chandra Mohan, Ananda Vel Murugan <[email protected]> wrote:

> Hi,
>
> Thanks for the suggestion. Can you please elaborate on how I should
> implement this approach? Assuming I have two significant labels "SIG1" and
> "SIG2" and three less significant labels "LESS1","LESS2","LESS3". How
> should I re-label my datasets?
>
> The approach I was following so far is two step training. First step
> training for significant labels and second step for less significant
> labels. This results in two models and when I get the real production data
> for classification, I compute the probability using these two models. Based
> on the value of probability, I assign the final label. Do you think this
> approach has any issues? Can I augment this approach with your suggestion
> of negative training samples? Please advise.
>
> Regards,
> Anand.C
>
> -----Original Message-----
> From: Suneel Marthi [mailto:[email protected]]
> Sent: Monday, May 27, 2013 10:29 AM
> To: [email protected]
> Subject: Re: Handling unbalanced datasets in Mahout text classsification
>
> You could use some of the 80% datasets as negative training examples for
> the ones that lack sufficient training data.
>
> ________________________________
> From: "Chandra Mohan, Ananda Vel Murugan" <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Monday, May 27, 2013 12:50 AM
> Subject: Handling unbalanced datasets in Mahout text classsification
>
> Hi,
>
> I am using Naïve Bayes algorithm implementation in mahout for text
> classification. My training dataset is very unbalanced. There are 121
> categories in my training dataset. There are 200000 training datasets. Out
> of this only few categories are predominant and they constitute almost 80%
> of the dataset. Remaining 100+ categories have very less dataset. Some of
> the categories contain just 3-4 datasets. How to handle unbalanced datasets
> in Mahout? Please suggest.
>
> Regards,
> Anand.C
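
To make the Complementary Naive Bayes suggestion concrete, here is a minimal pure-Python sketch of the core idea from Rennie's paper: each class is scored by how well a document matches everything outside that class, and the class with the worst complement match wins. This is not Mahout's implementation. It leaves out the TF-IDF transform, length normalization, and weight normalization described in the paper. The function names and the tiny token lists are hypothetical; only the label names SIG1, SIG2, and LESS1 come from the thread.

```python
import math
from collections import Counter, defaultdict


def train_complement_nb(docs, alpha=1.0):
    """Train complement weights from (label, tokens) pairs.

    For every class c and term t the weight is
        log((count of t outside c + alpha) / (all tokens outside c + alpha * |V|))
    so a rare class is estimated from the (large) rest of the corpus.
    """
    per_class = defaultdict(Counter)  # term counts inside each class
    total = Counter()                 # term counts over the whole corpus
    for label, tokens in docs:
        per_class[label].update(tokens)
        total.update(tokens)

    vocab = sorted(total)
    grand_total = sum(total.values())

    weights = {}
    for label, counts in per_class.items():
        complement_total = grand_total - sum(counts.values())
        denom = complement_total + alpha * len(vocab)
        weights[label] = {
            term: math.log((total[term] - counts[term] + alpha) / denom)
            for term in vocab
        }
    return weights


def classify(weights, tokens):
    """Return the label whose complement matches the document worst."""
    freqs = Counter(tokens)

    def complement_score(label):
        w = weights[label]
        # Terms never seen in training are ignored.
        return sum(n * w[t] for t, n in freqs.items() if t in w)

    return min(weights, key=complement_score)


# Hypothetical, deliberately unbalanced toy corpus: LESS1 has a single document.
docs = [
    ("SIG1", ["invoice", "payment", "due"]),
    ("SIG1", ["invoice", "payment", "late"]),
    ("SIG1", ["payment", "due", "reminder"]),
    ("SIG1", ["invoice", "reminder", "payment"]),
    ("SIG2", ["server", "outage", "alert"]),
    ("SIG2", ["server", "restart", "alert"]),
    ("SIG2", ["outage", "alert", "network"]),
    ("LESS1", ["refund", "request"]),
]
weights = train_complement_nb(docs)

print(classify(weights, ["refund", "request", "payment"]))  # prints LESS1
```

In this toy run the rare label still wins even though the document contains "payment", a term that occurs only in the dominant class, because "refund" and "request" are almost absent from the complement of LESS1. No re-labeling of the training data is needed for this; the complement counts are derived from the existing labels.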
