Thanks, I am unfamiliar with the approaches that you mentioned - will investigate. I forgot to mention that this is a multi-class classification problem. Each sample represents a page from a corpus of documents that have been scanned and had text extracted using OCR (thus noisy text).

Label | Samples | %
------+---------+------
C1    |  131613 | 97.71
C2    |     873 |  0.65
C3    |     830 |  0.62
C4    |     492 |  0.37
C5    |     456 |  0.34
C6    |     430 |  0.32

 - viraf
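[Editor's note: SMOTE and ADASYN operate on feature vectors rather than raw text, so one way to try them with a distribution like the above is to vectorize the pages first and resample outside OpenNLP. Below is a minimal sketch using scikit-learn and imbalanced-learn rather than OpenNLP; `pages` and `labels` are hypothetical placeholders for the OCR'd page texts and their class labels.]

    # Minimal resampling sketch (assumes: pip install scikit-learn imbalanced-learn).
    # `pages` is a list of OCR'd page strings, `labels` the matching class labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE     # ADASYN is a drop-in alternative
    from imblearn.pipeline import Pipeline

    pipeline = Pipeline([
        # SMOTE interpolates between neighbouring vectors, so text has to be
        # vectorized first (TF-IDF here, purely as an example).
        ("tfidf", TfidfVectorizer(min_df=5)),
        # The default strategy oversamples every minority class up to the
        # majority count; pass a dict to sampling_strategy to cap the synthetic
        # counts. k_neighbors=5 is safe since the smallest class has 430 pages.
        ("smote", SMOTE(k_neighbors=5, random_state=42)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipeline.fit(pages, labels)

[The resampling happens only at fit time; at prediction time the pipeline just vectorizes and classifies.]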
On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ <danrus...@gmail.com> wrote:

Have you considered using outlier detection methods? I’m not really an expert on this, but maybe you can define your majority class very well, and treat the other class as the outlier. Another option may be one-class classification (https://en.wikipedia.org/wiki/One-class_classification); SVDD is an example of this. Finally, you might want to look at data augmentation techniques. I am in the middle of some work using conditional GANs, but it is not working out so great for me at the moment. Let me know if any of these work out for you.

Daniel

> On Jul 3, 2019, at 10:22 AM, viraf.bankwa...@yahoo.com.INVALID wrote:
>
> I am trying document classification using OpenNLP; however, my data is highly
> unbalanced (the majority class is 97%). I recognize that I could randomly
> over/under sample the data set, and am reading up on SMOTE and ADASYN (not
> sure how to apply these to OpenNLP).
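[Editor's note: to make the one-class suggestion above concrete, here is a minimal sketch using scikit-learn's OneClassSVM, the nu-SVM formulation, which with an RBF kernel is closely related to SVDD. It fits a boundary around the majority class only and flags everything outside it as a candidate minority-class page. Again this is outside OpenNLP, and `pages` and `labels` are hypothetical placeholders.]

    # One-class sketch: learn a boundary around the majority class (C1) and
    # treat everything outside it as an outlier / candidate minority page.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    vectorizer = TfidfVectorizer(min_df=5)
    X = vectorizer.fit_transform(pages)

    # Train only on majority-class pages; nu roughly bounds the fraction of
    # training points allowed to fall outside the learned boundary.
    majority_rows = [i for i, y in enumerate(labels) if y == "C1"]
    ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
    ocsvm.fit(X[majority_rows])

    # +1 = looks like C1, -1 = outlier; the outliers could then be routed to a
    # second classifier trained only on the minority classes C2-C6.
    predictions = ocsvm.predict(X)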