You could use some of the examples from the predominant categories (the ones that make up about 80% of your data) as negative training examples for the categories that lack sufficient training data.
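
To make that a bit more concrete, here is a rough sketch of one way to do it: pick out the predominant categories, then build a small positive/negative training set for each rare category by sampling negatives from the predominant ones. This is plain Python and not Mahout code. The function names, the 0.8 coverage cutoff, and the 5:1 negative-to-positive ratio are all assumptions made for illustration, and the output would still need to be converted to whatever input format your trainer expects.

```python
import random
from collections import Counter


def predominant_labels(examples, coverage=0.8):
    """Return the most frequent labels that together cover `coverage` of the data.

    `examples` is a list of (label, text) pairs. The 0.8 default mirrors the
    "almost 80%" figure from the question and is only an assumption.
    """
    counts = Counter(label for label, _ in examples)
    total = sum(counts.values())
    chosen, covered = set(), 0
    for label, n in counts.most_common():
        if covered >= coverage * total:
            break
        chosen.add(label)
        covered += n
    return chosen


def build_binary_training_set(examples, rare_label, majority_labels,
                              negatives_per_positive=5, seed=0):
    """Build a positive/negative set for one rare category.

    Positives are all examples of `rare_label`; negatives are sampled without
    replacement from the predominant categories. The 5:1 ratio is a
    hypothetical starting point, not a recommended value.
    """
    rng = random.Random(seed)
    positives = [text for label, text in examples if label == rare_label]
    negative_pool = [text for label, text in examples if label in majority_labels]
    n_neg = min(len(negative_pool), negatives_per_positive * len(positives))
    negatives = rng.sample(negative_pool, n_neg)
    return [("pos", t) for t in positives] + [("neg", t) for t in negatives]
```

You would call `build_binary_training_set` once per rare category. Capping the number of negatives keeps each small category from being swamped by the large ones again.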
________________________________
From: "Chandra Mohan, Ananda Vel Murugan" <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Monday, May 27, 2013 12:50 AM
Subject: Handling unbalanced datasets in Mahout text classification

Hi,

I am using the Naïve Bayes implementation in Mahout for text classification. My training dataset is very unbalanced. It has 121 categories and 200,000 training examples. Only a few categories are predominant, and together they make up almost 80% of the dataset. The remaining 100+ categories have very few examples; some contain just 3-4. How should I handle unbalanced datasets in Mahout? Please suggest.

Regards,
Anand.C
