Re: Document Classification with imbalanced data

Dan Russ Wed, 03 Jul 2019 08:51:41 -0700

You may have to run one class at a time and find a way to resolve cases where 
more than 1 class wants a document.
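
A minimal sketch of the conflict-resolution step, reading "one class at a time" as one binary (one-vs-rest) model per minority class. This is not OpenNLP code: the function name `resolve`, the 0.5 threshold, and the fallback to the majority class C1 are all assumptions made for illustration. The scores would come from whatever binary models you train.

```python
from typing import Dict, Optional


def resolve(scores: Dict[str, float], threshold: float = 0.5,
            fallback: Optional[str] = "C1") -> Optional[str]:
    """Pick a single label from per-class binary scores.

    `scores` maps each class label to the probability that its own
    one-vs-rest model assigned to "this document belongs to me".
    """
    # Classes whose binary model claims the document.
    claimed = {label: p for label, p in scores.items() if p >= threshold}
    if not claimed:
        # No model wants the document: fall back to the majority
        # class (an assumption, adjust to your problem).
        return fallback
    # More than one model wants it: keep the most confident one.
    return max(claimed, key=claimed.get)


if __name__ == "__main__":
    # Hypothetical scores for one page from five binary models.
    page_scores = {"C2": 0.91, "C3": 0.72, "C4": 0.05, "C5": 0.10, "C6": 0.02}
    print(resolve(page_scores))
```

Taking the highest score only makes sense if the scores of the separate models are comparable, which is not guaranteed when each model is trained on its own. Calibrating them on held-out data first may help.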
Daniel


> On Jul 3, 2019, at 11:49 AM, viraf.bankwa...@yahoo.com.INVALID wrote:
> 
> Thanks, I am unfamiliar with the approaches that you mentioned - will 
> investigate.  I forgot to mention that this is a multi-class classification 
> problem.  Each sample represents a page from a corpus of documents that have 
> been scanned, with the text extracted using OCR (so the text is noisy).
> 
> Label | Samples |     %
> ------+---------+------
> C1    |  131613 | 97.71
> C2    |     873 |  0.65
> C3    |     830 |  0.62
> C4    |     492 |  0.37
> C5    |     456 |  0.34
> C6    |     430 |  0.32
> 
> - viraf
> 
> 
>    On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ 
> <danrus...@gmail.com> wrote:  
> 
> Have you considered using outlier detection methods?  I’m not really an 
> expert on this, but maybe you can define your majority class very well, and 
> the other class is the outlier.  Another option may be one-class 
> classification (https://en.wikipedia.org/wiki/One-class_classification), SVDD 
> is an example of this. Finally, you might want to look at data augmentation 
> techniques.  I am in the middle of some work using conditional GANs, but it 
> is not working out so great for me at the moment.
> 
> Let me know if any of these work out for you.
> Daniel
> 
> 
>> On Jul 3, 2019, at 10:22 AM, viraf.bankwa...@yahoo.com.INVALID wrote:
>> 
>> I am trying document classification using OpenNLP; however, my data is 
>> highly unbalanced (the majority class is 97%).  I recognize that I could 
>> randomly over/under sample the data set, and I am reading up on SMOTE and 
>> ADASYN (not sure how to apply these to OpenNLP).
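
For the random over/under sampling mentioned in the original question, here is a minimal sketch that rebalances the training samples before they are handed to any trainer. It is library-agnostic and does not call OpenNLP. The function name `rebalance` and the `target` parameter are hypothetical, and each sample is assumed to be a (label, text) pair.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (label, page text)


def rebalance(samples: List[Sample], target: int, seed: int = 0) -> List[Sample]:
    """Randomly under/over sample so every class ends up with `target` samples."""
    rng = random.Random(seed)

    # Group the samples by their label.
    by_label: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_label[sample[0]].append(sample)

    balanced: List[Sample] = []
    for group in by_label.values():
        if len(group) >= target:
            # Undersample: draw `target` samples without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Oversample: keep every original, then draw the
            # missing ones with replacement.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Toy data: 100 majority pages and 5 minority pages.
    data = [("C1", f"page {i}") for i in range(100)]
    data += [("C2", f"rare page {i}") for i in range(5)]
    out = rebalance(data, target=20)
    print(sum(1 for label, _ in out if label == "C1"),
          sum(1 for label, _ in out if label == "C2"))
```

Resampling should be applied to the training split only, so that evaluation still reflects the real class distribution. Note that plain oversampling just repeats the same noisy OCR pages, so it may encourage overfitting on the small classes.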
