Hi! Thanks for the reply.

Yes, there is a massive imbalance. Out of the thousands of observations I
have, only a small handful are actually positive observations in
is_cat_1. The rest are in is_not_cat_1. In some cases, the number of
positives is 1.

For example:

In one category, the only observation in is_cat_1 is:
assault use reckless force or vi

I have a bunch of observations in is_not_cat_1. The model gave this
text

0099 usc 18 usc 2

a probability of over 90% of being in is_cat_1. Mind you, I expected this
setup to perform horribly. I actually expected this sort of text to get
close to 100% in is_not_cat_1, but what I really cannot explain is the
apparent overlap. I did verify that the BoW produces the following features
for the is_cat_1 text:
["bow=assault", "bow=use", "bow=reckless", "bow=force", "bow=or", "bow=vi"]

The only thing I could come up with is something like each string being
broken apart into individual letters but that wouldn't make sense. Or would
it?
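
One way to check the letter-splitting idea is to compare the two strings at
both levels. The following is a minimal Python sketch, not OpenNLP's actual
code: it assumes whitespace tokenization and the "bow=" prefix seen in the
feature list above, and the helper names (bow_features, shared_characters)
are made up for illustration.

```python
# Hypothetical helpers: this only imitates bag-of-words features,
# it is not how OpenNLP is implemented.

def bow_features(text):
    """Return the set of 'bow=<token>' features for whitespace-separated tokens."""
    return {"bow=" + token for token in text.split()}


def shared_characters(a, b):
    """Return the sorted non-whitespace characters that occur in both strings."""
    chars_a = {c for c in a if not c.isspace()}
    chars_b = {c for c in b if not c.isspace()}
    return sorted(chars_a & chars_b)


# The two texts quoted in this message.
cat_1_text = "assault use reckless force or vi"
new_text = "0099 usc 18 usc 2"

shared_words = bow_features(cat_1_text) & bow_features(new_text)
shared_chars = shared_characters(cat_1_text, new_text)

print("shared word features:", sorted(shared_words))
print("shared characters:", shared_chars)
```

At the word level the two strings share no features. At the character level
they share c, s and u, so letter-level splitting would at least produce some
overlap. Whether the model actually does that is a separate question that
this sketch does not answer.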

Thanks,
~Ben

On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <danrus...@gmail.com> wrote:

> Hi Ben,
>    Are you sure that your training documents are formatted appropriately?
> Also, do you have a large imbalance in the # of training documents?  If the
> text in the testing document is not in either CAT_1 or the OTHER_CAT, there
> will be a .5 assignment to each category (assuming equal documents so the
> prior doesn’t change the value).  A .5 assignment is like “I can’t tell the
> two categories”.  You probably don’t want to think of it as “You don’t look
> like CAT_1 so you are NOT_CAT_1”.
> Daniel
>
> > On Oct 17, 2018, at 1:14 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
> >
> > Hello all,
> >
> > I can't quite figure out how the Doccat MaxEnt modeling works. Here is my
> > setup:
> >
> > I have a set of training texts split into is_cat_1 and is_not_cat_1. I
> > train my model using the default bag of words model. I have a document
> > without any overlapping text with texts that are in is_cat_1. They might
> > overlap with text in is_not_cat_1. Meaning, every single word in the
> > document I want to categorize does not appear in any of the model training
> > data in the is_cat_1 category.
> >
> > The result of the MaxEnt model for my document is a probability over 90%
> > that it fits into the is_cat_1. Why is that?
> >
> > Thanks,
> > ~Ben
>
>
