Sure. Let me get a serialized version of the training dataset put together.
It shouldn't be too big. Thank you! This one is really confusing. Also, I
updated to UIMA 3.0 and OpenNLP 1.9, so maybe that changes things. I will
get that put together tonight. I will try to export it to a tag <space>
<text> format for training. That shouldn't be too hard.
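
It should end up looking something like this (one document per line, label
first, using the examples quoted below):

is_cat_1 assault use reckless force or vi
is_not_cat_1 0099 usc 18 usc 2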

Thank you!
~Ben

On Wed, Oct 17, 2018 at 4:33 PM Dan Russ <danrus...@gmail.com> wrote:

> Ok, so you are using the API. If you send out a small training example,
> I’ll train a DoccatModel and see what I get.
> Also send a few examples to test on.  We can compare the results.  It’s
> late in the day on the US East coast, so I may not be able to get to it
> until tomorrow.
> Daniel
>
>
> > On Oct 17, 2018, at 4:27 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
> >
> > I mean... not really? I store everything in a database. I created a
> > stream that reads my training data from the database, splits it up into
> > tokens, and creates DocumentSample objects. The DocumentSampleStream
> > interface is really easy to work with, and I really didn't have to
> > implement much.
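> >
> > Roughly like this (a simplified sketch; it assumes an open JDBC
> > ResultSet with "category" and "text" columns, which are placeholder
> > names for my actual schema):
> >
> > import java.io.IOException;
> > import java.sql.ResultSet;
> > import java.sql.SQLException;
> >
> > import opennlp.tools.doccat.DocumentSample;
> > import opennlp.tools.util.ObjectStream;
> >
> > public class DbSampleStream implements ObjectStream<DocumentSample> {
> >     private final ResultSet rs;
> >
> >     public DbSampleStream(ResultSet rs) { this.rs = rs; }
> >
> >     @Override
> >     public DocumentSample read() throws IOException {
> >         try {
> >             if (!rs.next()) {
> >                 return null;  // null signals the end of the stream
> >             }
> >             // Whitespace tokenization, matching what I do now
> >             String[] tokens = rs.getString("text").split("\\s+");
> >             return new DocumentSample(rs.getString("category"), tokens);
> >         } catch (SQLException e) {
> >             throw new IOException(e);
> >         }
> >     }
> >
> >     @Override
> >     public void reset() {
> >         throw new UnsupportedOperationException();  // or re-run the query
> >     }
> >
> >     @Override
> >     public void close() throws IOException {
> >         try { rs.close(); } catch (SQLException e) { throw new IOException(e); }
> >     }
> > }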
> >
> > I am quite confident that I created the DocumentSample objects and the
> > stream properly. I mean, it is reading in the tokens and I am training
> > models. For me, this was basically a unit test. I created a very simple
> > model with a single observation in is_cat_1 and a bunch of observations
> > in is_not_cat_1, ran a document without any overlap against the model,
> > and it predicted a probability over 90% of it belonging to is_cat_1.
> >
> > At that point, I posted this to the group. I really can't figure out
> > how that would be possible. I could even email the training data to
> > someone if they would like.
> >
> > Before that point, though, I think I am using 1.8.4. I can try
> > upgrading to the 1.9.0 release.
> >
> > Thanks,
> > ~Ben
> >
> > On Wed, Oct 17, 2018 at 4:00 PM Dan Russ <danrus...@gmail.com> wrote:
> >
> >> Really surprised.  Looking at the documentation, your training data
> >> should be in the following format. See
> >> https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training
> >>
> >> Is_cat_1 <text>
> >> Is_not_cat_1 <text>
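> >>
> >> That is one document per line, label first. With a file in that format
> >> you can also train from the command line, per the manual's example
> >> (the file names are placeholders):
> >>
> >> opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8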
> >>
> >> Is that how you formatted your data?
> >> Daniel
> >>
> >>> On Oct 17, 2018, at 3:50 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
> >>>
> >>> Hi! Thanks for the reply.
> >>>
> >>> Yes. There is a massive imbalance.  Out of the thousands of
> >>> observations I have, only a small handful are actually positive
> >>> observations in the is_cat_1. The rest are in the is_not_cat_1. In
> >>> some cases, the number of positives is 1.
> >>>
> >>> For example:
> >>>
> >>> In one category, the only observation in is_cat_1 is:
> >>> assault use reckless force or vi
> >>>
> >>> I have a bunch of observations in the is_not_cat_1. This model gave
> >>> this text
> >>>
> >>> 0099 usc 18 usc 2
> >>>
> >>> a probability match of over 90%. Mind you, I expected this setup to be
> >>> horrible. I actually expected this sort of text to get close to 100%
> >>> in the is_not_cat_1, but what I really cannot explain is this match,
> >>> given there is no overlap. I did verify that the BoW produces the
> >>> following features:
> >>> ["bow=assault", "bow=use", "bow=reckless", "bow=force", "bow=or",
> >>> "bow=vi"]
> >>>
> >>> The only thing I could come up with is something like each string
> >>> being broken apart into individual letters, but that wouldn't make
> >>> sense. Or would it?
> >>>
> >>> Thanks,
> >>> ~Ben
> >>>
> >>> On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <danrus...@gmail.com> wrote:
> >>>
> >>>> Hi Ben,
> >>>> Are you sure that your training documents are formatted
> >>>> appropriately? Also, do you have a large imbalance in the # of
> >>>> training documents? If the text in the testing document is not in
> >>>> either CAT_1 or the OTHER_CAT, there will be a .5 assignment to each
> >>>> category (assuming equal numbers of documents, so the prior doesn’t
> >>>> change the value).  A .5 assignment is like saying “I can’t tell the
> >>>> two categories apart”.  You probably don’t want to think of it as
> >>>> “you don’t look like CAT_1, so you are NOT_CAT_1”.
> >>>> Daniel
> >>>>
> >>>>> On Oct 17, 2018, at 1:14 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
> >>>>>
> >>>>> Hello all,
> >>>>>
> >>>>> I can't quite figure out how the Doccat MaxEnt modeling works. Here
> >>>>> is my setup:
> >>>>>
> >>>>> I have a set of training texts split into is_cat_1 and is_not_cat_1.
> >>>>> I train my model using the default bag-of-words model. I have a
> >>>>> document with no text overlapping the texts in is_cat_1. It might
> >>>>> overlap with text in is_not_cat_1. Meaning, every single word in the
> >>>>> document I want to categorize does not appear in any of the model
> >>>>> training data in the is_cat_1 category.
> >>>>>
> >>>>> The result of the MaxEnt model for my document is a probability of
> >>>>> over 90% that it fits into is_cat_1. Why is that?
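> >>>>>
> >>>>> For reference, my training and scoring code is roughly this
> >>>>> (simplified sketch; stream is my DocumentSample stream and
> >>>>> testTokens is the tokenized test document):
> >>>>>
> >>>>> import opennlp.tools.doccat.DoccatFactory;
> >>>>> import opennlp.tools.doccat.DoccatModel;
> >>>>> import opennlp.tools.doccat.DocumentCategorizerME;
> >>>>> import opennlp.tools.util.TrainingParameters;
> >>>>>
> >>>>> DoccatModel model = DocumentCategorizerME.train("en", stream,
> >>>>>     TrainingParameters.defaultParams(), new DoccatFactory());
> >>>>> DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
> >>>>> double[] probs = categorizer.categorize(testTokens);
> >>>>> System.out.println(categorizer.getBestCategory(probs));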
> >>>>>
> >>>>> Thanks,
> >>>>> ~Ben
> >>>>
> >>>>
> >>
> >>
>
>
