Hi! Thanks for the reply. Yes, there is a massive imbalance. Out of the thousands of observations I have, only a small handful are actually positive observations in is_cat_1; the rest are in is_not_cat_1. In some cases the number of positives is 1.
For example: in one category, the only observation in is_cat_1 is "assault use reckless force or vi", and I have a bunch of observations in is_not_cat_1. The model gave the text "0099 usc 18 usc 2" a probability match over 90% for is_cat_1. Mind you, I expected this setup to be horrible; I actually expected this sort of text to get close to 100% in is_not_cat_1. What I really cannot explain is the overlap.

I did verify that the BoW produces the following features: ["bow=assault", "bow=use", "bow=recklass", "bow=force", "bow=or", "bow=vi"]

The only thing I could come up with is something like each string being broken apart into individual letters, but that wouldn't make sense. Or would it?

Thanks,
~Ben

On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <danrus...@gmail.com> wrote:

> Hi Ben,
> Are you sure that your training documents are formatted appropriately?
> Also, do you have a large imbalance in the # of training documents? If the
> text in the testing document is not in either CAT_1 or the OTHER_CAT, there
> will be a .5 assignment to each category (assuming equal documents so the
> prior doesn't change the value). A .5 assignment is like "I can't tell the
> two categories". You probably don't want to think of it as "You don't look
> like CAT_1, so you are NOT_CAT_1".
> Daniel
>
> > On Oct 17, 2018, at 1:14 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
> >
> > Hello all,
> >
> > I can't quite figure out how the Doccat MaxEnt modeling works. Here is my
> > setup:
> >
> > I have a set of training texts split into is_cat_1 and is_not_cat_1. I
> > train my model using the default bag-of-words model. I have a document
> > with no overlapping text with the texts that are in is_cat_1; it might
> > overlap with text in is_not_cat_1. Meaning, every single word in the
> > document I want to categorize does not appear in any of the model training
> > data in the is_cat_1 category.
> >
> > The result of the MaxEnt model for my document is a probability over 90%
> > that it fits into is_cat_1. Why is that?
> >
> > Thanks,
> > ~Ben
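[Editor's note: the behavior Dan describes, that a document whose bag-of-words features never appeared in training scores 0.5 per category under equal priors, can be illustrated with a small sketch. This is NOT OpenNLP's actual implementation; the feature naming (`bow=<token>`) mirrors what the thread quotes, and the weights here are entirely made up for illustration.]

```python
import math

def bow_features(text):
    # Emit one "bow=<token>" feature per whitespace token, mirroring the
    # feature strings quoted in the thread (hypothetical simplification).
    return ["bow=" + tok for tok in text.lower().split()]

def maxent_probs(features, weights, categories):
    # Each category's score is the sum of weights for the features it has
    # learned; features never seen in training contribute nothing.
    scores = [sum(weights.get((cat, f), 0.0) for f in features)
              for cat in categories]
    # Softmax turns the scores into probabilities over the categories.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

categories = ["is_cat_1", "is_not_cat_1"]
# Toy weights, as if learned only from the single positive document.
weights = {
    ("is_cat_1", "bow=assault"): 1.2,
    ("is_not_cat_1", "bow=assault"): -1.2,
}

# A test document sharing no tokens with the training data: every feature
# lookup misses, both scores are 0, and each category gets exactly 0.5 --
# Dan's "I can't tell the two categories" case.
probs = maxent_probs(bow_features("0099 usc 18 usc 2"), weights, categories)
```

In other words, under this model a >90% score for is_cat_1 cannot come from unseen tokens alone; something (feature overlap, formatting, or priors) must be contributing weight, which is why Dan asks about document formatting and imbalance.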