Re: Document Classification with imbalanced data

Tommaso Teofili Thu, 18 Jul 2019 01:00:10 -0700

Or you may hook into the training step and give the very rare classes a
higher weight than the common class, so that occurrences of a rare class
have a higher impact when the model parameters/weights are updated.
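
As a rough illustration of the weighting idea, here is a minimal sketch that
assumes inverse-frequency ("balanced") weights, i.e. total / (numClasses *
classCount). That formula is only one common choice and an assumption on my
part. The names ClassWeights and balancedWeights are hypothetical, and no
OpenNLP API is used; how the weights are passed to the trainer depends on
the training code you hook into.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ClassWeights {

    // Inverse-frequency weights: total / (numClasses * count).
    // Rare classes get a weight above 1, the majority class below 1.
    static Map<String, Double> balancedWeights(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double w = (double) total / (counts.size() * (double) e.getValue());
            weights.put(e.getKey(), w);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Sample counts as posted further down in this thread.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("C1", 131613);
        counts.put("C2", 873);
        counts.put("C3", 830);
        counts.put("C4", 492);
        counts.put("C5", 456);
        counts.put("C6", 430);

        balancedWeights(counts).forEach(
                (label, w) -> System.out.printf("%s %.3f%n", label, w));
    }
}
```

If the trainer has no notion of per-sample weight, a crude approximation is
to repeat each rare-class sample in proportion to its weight.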


Regards,
Tommaso

On Wed, 3 Jul 2019 at 17:51, Dan Russ <[email protected]> wrote:

> You may have to run one class at a time and find a way to resolve cases
> where more than 1 class wants a document.
> Daniel
>
> > On Jul 3, 2019, at 11:49 AM, [email protected] wrote:
> >
> > Thanks, I am unfamiliar with the approaches that you mentioned - will
> > investigate.  I forgot to mention that this is a multi-class classification
> > problem.  Each sample represents a page from a corpus of documents that
> > have been scanned and had their text extracted using OCR (thus noisy text).
> > Label | Samples |     %
> > ------+---------+------
> > C1    |  131613 | 97.71
> > C2    |     873 |  0.65
> > C3    |     830 |  0.62
> > C4    |     492 |  0.37
> > C5    |     456 |  0.34
> > C6    |     430 |  0.32
> > - viraf
> >
> >
> >    On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ <
> [email protected]> wrote:
> >
> > Have you considered using outlier detection methods?  I’m not really an
> > expert on this, but maybe you can define your majority class very well, and
> > the other class is the outlier.  Another option may be one-class
> > classification (https://en.wikipedia.org/wiki/One-class_classification);
> > SVDD is an example of this.  Finally, you might want to look at data
> > augmentation techniques.  I am in the middle of some work using conditional
> > GANs, but it is not working out so great for me at the moment.
> >
> > Let me know if any of these work out for you.
> > Daniel
> >
> >
> >> On Jul 3, 2019, at 10:22 AM, [email protected] wrote:
> >>
> >> I am trying document classification using OpenNLP; however, my data is
> >> highly unbalanced (majority class is 97%).  I recognize that I could
> >> randomly over/under sample the data set, and am reading up on SMOTE and
> >> ADASYN (not sure how to apply these to OpenNLP).
>
>
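
For the random oversampling mentioned in the original question, here is a
minimal sketch that duplicates minority-class samples until every class
matches the largest one. The Oversampler class and the {label, text} array
layout are hypothetical, and nothing here is OpenNLP-specific; the result
would still have to be written out in whatever format the trainer reads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class Oversampler {

    // Each sample is a two-element array: {label, text}.
    // Returns a new shuffled list in which every class has as many
    // samples as the largest class; the input list is not modified.
    static List<String[]> oversample(List<String[]> samples, long seed) {
        // Group the samples by label.
        Map<String, List<String[]>> byLabel = new LinkedHashMap<>();
        for (String[] s : samples) {
            byLabel.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s);
        }

        // Find the size of the largest class.
        int max = 0;
        for (List<String[]> group : byLabel.values()) {
            max = Math.max(max, group.size());
        }

        // Top up each smaller class by drawing with replacement.
        Random rng = new Random(seed);
        List<String[]> out = new ArrayList<>(samples);
        for (List<String[]> group : byLabel.values()) {
            for (int i = group.size(); i < max; i++) {
                out.add(group.get(rng.nextInt(group.size())));
            }
        }

        Collections.shuffle(out, rng);
        return out;
    }
}
```

Resampling should be applied to the training split only, so that the
evaluation data keeps the real class distribution.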
