or you may hook into the training step and give the very rare class a higher weight than the common class, so that occurrences of that rare class have a greater impact on the model parameters/weights.
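
To make the weighting idea a bit more concrete, here is a minimal sketch in plain Java that derives one weight per class from the label counts posted earlier in this thread. The formula total / (numClasses * count) is an assumption on my part (a common "balanced" heuristic), not something OpenNLP prescribes, and the names ClassWeights and balancedWeights are hypothetical. How the weights are fed into training is deliberately left out, since that depends on the trainer you hook into.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ClassWeights {

    /**
     * Inverse-frequency ("balanced") weights: total / (numClasses * count).
     * Assumed heuristic: rare classes get weights above 1, the majority
     * class gets a weight below 1.
     */
    static Map<String, Double> balancedWeights(Map<String, Integer> counts) {
        long total = 0;
        for (int n : counts.values()) {
            total += n;
        }
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double w = (double) total / ((double) counts.size() * e.getValue());
            weights.put(e.getKey(), w);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Label counts as reported in the thread.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("C1", 131613);
        counts.put("C2", 873);
        counts.put("C3", 830);
        counts.put("C4", 492);
        counts.put("C5", 456);
        counts.put("C6", 430);

        // Print one weight per label.
        balancedWeights(counts).forEach(
            (label, w) -> System.out.printf("%s -> %.3f%n", label, w));
    }
}
```

With these counts the majority class C1 ends up with a weight well below 1 and each rare class with a weight well above 1. One simple way to use such weights is to scale each sample's contribution to the parameter update by the weight of its class.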

Regards,
Tommaso

On Wed, 3 Jul 2019 at 17:51, Dan Russ <[email protected]> wrote:

> You may have to run one class at a time and find a way to resolve cases
> where more than 1 class wants a document.
> Daniel
>
> > On Jul 3, 2019, at 11:49 AM, [email protected] wrote:
> >
> > Thanks, I am unfamiliar with the approaches that you mentioned - will
> > investigate. I forgot to mention that this is a multi-class classification
> > problem. Each sample represents a page of a corpus of documents that have
> > been scanned and text extracted using OCR (thus noisy text)
> >
> > Label | Samples | %
> > ------+---------+------
> > C1    |  131613 | 97.71
> > C2    |     873 |  0.65
> > C3    |     830 |  0.62
> > C4    |     492 |  0.37
> > C5    |     456 |  0.34
> > C6    |     430 |  0.32
> >
> > - viraf
> >
> >
> > On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ
> > <[email protected]> wrote:
> >
> > Have you considered using outlier detection methods? I’m not really an
> > expert on this, but maybe you can define your majority class very well, and
> > the other class is the outlier. Another option may be one-class
> > classification (https://en.wikipedia.org/wiki/One-class_classification),
> > SVDD is an example of this. Finally, you might want to look at data
> > augmentation techniques. I am in the middle of some work using conditional
> > GANs, but it is not working out so great for me at the moment.
> >
> > Let me know if any of these work out for you.
> > Daniel
> >
> >
> >> On Jul 3, 2019, at 10:22 AM, [email protected] wrote:
> >>
> >> I am trying document classification using OpenNLP however my data is
> >> highly unbalanced (majority class is 97%). I recognize that I could
> >> randomly over/under sample the data set, and am reading up on SMOTE and
> >> ADASYN (not sure how to apply these to OpenNLP).
