You may have to run one class at a time and find a way to resolve cases where more than 1 class wants a document.

Daniel
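One way to read "run one class at a time" is a one-vs-rest setup: train a separate binary model per minority class, then settle disputes when several models claim the same page. The sketch below shows only the dispute step and is not OpenNLP code. The function name `resolve_label`, the 0.5 threshold, and the fallback to the majority class `C1` are all assumptions made for illustration; the scores are assumed to be the "yes" probabilities from each binary model.

```python
from typing import Dict


def resolve_label(scores: Dict[str, float],
                  threshold: float = 0.5,
                  fallback: str = "C1") -> str:
    """Pick one label from independent per-class binary scores.

    scores maps a class name to the probability that its own
    one-vs-rest model assigned to "this page belongs to me".
    """
    # Keep only the classes whose model claims the page.
    claimants = {label: p for label, p in scores.items() if p >= threshold}

    # Nobody wants the page: assume it belongs to the majority class.
    if not claimants:
        return fallback

    # Several classes want it: the most confident model wins.
    # On an exact tie, the class listed first in `scores` is returned.
    return max(claimants, key=claimants.get)


if __name__ == "__main__":
    print(resolve_label({"C2": 0.9, "C3": 0.7, "C4": 0.1}))
    print(resolve_label({"C2": 0.1, "C3": 0.2, "C4": 0.1}))
```

Taking the highest score is only the simplest rule. Scores from separately trained models are not necessarily comparable, so calibrating them or training a second-stage model on the disputed pages may work better.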
> On Jul 3, 2019, at 11:49 AM, viraf.bankwa...@yahoo.com.INVALID wrote:
>
> Thanks, I am unfamiliar with the approaches that you mentioned - will
> investigate. I forgot to mention that this is a multi-class classification
> problem. Each sample represents a page of a corpus of document that have
> been scanned and text extracted using OCR (thus noisy text)
>
> Label | Samples | %
> ------+---------+------
> C1    |  131613 | 97.71
> C2    |     873 |  0.65
> C3    |     830 |  0.62
> C4    |     492 |  0.37
> C5    |     456 |  0.34
> C6    |     430 |  0.32
>
> - viraf
>
>
> On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ
> <danrus...@gmail.com> wrote:
>
> Have you considered using outlier detection methods? I’m not really an
> expert on this, but maybe you can define your majority class very well, and
> the other class is the outlier. Another option may be one-sided
> classification (https://en.wikipedia.org/wiki/One-class_classification), SVDD
> is an example of this. Finally, you might want to look at data augmentation
> techniques. I am in the middle of some work using conditional GANs, but it
> is not working out so great for me at the moment.
>
> Let me know if any of these work out for you.
> Daniel
>
>
>> On Jul 3, 2019, at 10:22 AM, viraf.bankwa...@yahoo.com.INVALID wrote:
>>
>> I am trying document classification using OpenNLP however my data is highly
>> unbalanced (majority class is 97%). I recognize that I could randomly
>> over/under sample the data set, and am reading up on SMOTE and ADASYN (not
>> sure how to apply these to OpenNLP).
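For the random oversampling mentioned in the original question, one option that does not depend on the classifier at all is to duplicate minority-class lines in the training file before training. The sketch below assumes one sample per line with the label as the first whitespace-separated token; the function name `oversample` and the `target` parameter are hypothetical. SMOTE and ADASYN interpolate between numeric feature vectors, so they do not apply to raw text lines in this way.

```python
import random
from collections import defaultdict
from typing import List


def oversample(lines: List[str], target: int, seed: int = 0) -> List[str]:
    """Duplicate minority-class lines until each class has >= target samples.

    Assumes each line looks like "<label> <text>". Classes that already
    have at least `target` samples are left unchanged.
    """
    rng = random.Random(seed)

    # Group the training lines by their leading label token.
    by_label = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label = line.split(None, 1)[0]
        by_label[label].append(line)

    out = []
    for label, group in by_label.items():
        out.extend(group)
        shortfall = target - len(group)
        if shortfall > 0:
            # Sample with replacement from the existing lines of this class.
            out.extend(rng.choices(group, k=shortfall))

    # Shuffle so duplicated lines are not clustered together.
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    data = ["C1 invoice total due"] * 10 + ["C2 purchase order"] * 2
    balanced = oversample(data, target=5)
    print(len(balanced))
```

Apply this to the training split only. Duplicating lines before splitting off evaluation data would put copies of the same page on both sides and inflate the measured accuracy.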