Sure. Let me get a serialized version of the training dataset put together; it shouldn't be too big. Thank you! This one is really confusing. Also, I upgraded to UIMA 3.0 and OpenNLP 1.9, so maybe that changes things. I will get that put together tonight and will try to export it to a tag <space> <text> format for training. That shouldn't be too hard.
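Roughly what I have in mind for the export is below (just a sketch, untested; the export() method, the samples list, and the output path stand in for my actual database code):

    import java.io.PrintWriter;
    import java.util.List;
    import opennlp.tools.doccat.DocumentSample;

    public class ExportSamples {
        // Write each DocumentSample as "<category> <token> <token> ..." on one line,
        // i.e. the tag <space> text layout the doccat trainer documentation describes.
        static void export(List<DocumentSample> samples, String path) throws Exception {
            try (PrintWriter out = new PrintWriter(path, "UTF-8")) {
                for (DocumentSample s : samples) {
                    out.println(s.getCategory() + " " + String.join(" ", s.getText()));
                }
            }
        }
    }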
Thank you!
~Ben

On Wed, Oct 17, 2018 at 4:33 PM Dan Russ <danrus...@gmail.com> wrote:

> Ok, so you are using the API. If you send out a small training example, I'll train a DocumentModel and see what I get. Also send a few examples to test on, and we can compare the results. It's late in the day on the US East coast, so I may not be able to get to it until tomorrow.
> Daniel
>
>> On Oct 17, 2018, at 4:27 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
>>
>> I mean... not really? I store everything in a database. I created a stream that reads my training data from the database, splits it up into tokens, and creates DocumentSample objects. The DocumentSampleStream interface is really easy to work with, and I didn't have to implement much.
>>
>> I am quite confident that I created the DocumentSample objects and the stream properly; it is reading in the tokens and I am training models. For me, this was basically a unit test. I created a very simple model with a single observation in is_cat_1 and a bunch of observations in is_not_cat_1, ran a document without any overlap against the model, and it predicted a probability over 90% of it belonging to is_cat_1.
>>
>> At that point, I posted this to the group. I really can't figure out how that would be possible. I could even email the training data to someone if they would like.
>>
>> Before that point, though, I think I am using 1.8.4. I can try upgrading to the 1.9.0 release.
>>
>> Thanks,
>> ~Ben
>>
>> On Wed, Oct 17, 2018 at 4:00 PM Dan Russ <danrus...@gmail.com> wrote:
>>
>>> Really surprised. Looking at the documentation, your training data should be in the following format (see https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training):
>>>
>>> Is_cat_1 <text>
>>> Is_not_cat_1 <text>
>>>
>>> Is that how you formatted your data?
>>> Daniel
>>>
>>>> On Oct 17, 2018, at 3:50 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
>>>>
>>>> Hi! Thanks for the reply.
>>>>
>>>> Yes, there is a massive imbalance. Out of the thousands of observations I have, only a small handful are actually positive observations in is_cat_1; the rest are in is_not_cat_1. In some cases, the number of positives is 1.
>>>>
>>>> For example, in one category the only observation in is_cat_1 is:
>>>>
>>>> assault use reckless force or vi
>>>>
>>>> I have a bunch of observations in is_not_cat_1. This model scored the text
>>>>
>>>> 0099 usc 18 usc 2
>>>>
>>>> as a probability match of over 90%. Mind you, I expected this setup to be horrible. I actually expected this sort of text to get close to 100% in is_not_cat_1, but what I really cannot explain is the overlap. I did verify that the BoW produces the following features:
>>>>
>>>> ["bow=assault", "bow=use", "bow=recklass", "bow=force", "bow=or", "bow=vi"]
>>>>
>>>> The only thing I could come up with is something like each string being broken apart into individual letters, but that wouldn't make sense. Or would it?
>>>>
>>>> Thanks,
>>>> ~Ben
>>>>
>>>> On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <danrus...@gmail.com> wrote:
>>>>
>>>>> Hi Ben,
>>>>> Are you sure that your training documents are formatted appropriately? Also, do you have a large imbalance in the # of training documents?
> If > >> the > >>>> text in the testing document is not in either CAT_1 or the OTHER_CAT, > >> there > >>>> will be a .5 assignment to each category (assuming equal documents so > >> the > >>>> prior doesn’t change the value). A .5 assignment is like “I can’t > tell > >> the > >>>> two categories”. You probably don’t want to think of it as “You don’t > >> look > >>>> like CAT_1 so you are NOT_CAT_1”. > >>>> Daniel > >>>> > >>>>> On Oct 17, 2018, at 1:14 PM, Benedict Holland < > >>>> benedict.m.holl...@gmail.com> wrote: > >>>>> > >>>>> Hello all, > >>>>> > >>>>> I can't quite figure out how the Doccat MaxEnt modeling works. Here > is > >> my > >>>>> setup: > >>>>> > >>>>> I have a set of training texts split into is_cat_1 and is_not_cat_1. > I > >>>>> train my model using the default bag of words model. I have a > document > >>>>> without any overlapping text with texts that are in is_cat_1. They > >> might > >>>>> overlap with text in is_not_cat_1. Meaning, every single word in the > >>>>> document I want to categorize does not appear in any of the model > >>>> training > >>>>> data in the is_cat_1 category. > >>>>> > >>>>> The result of the MaxEnt model for my document is a probability over > >> 90% > >>>>> that it fits into the is_cat_1. Why is that? > >>>>> > >>>>> Thanks, > >>>>> ~Ben > >>>> > >>>> > >> > >> > >