Ok, so you are using the API. If you send me a small training example, I'll
train a DocumentModel and see what I get. Also send a few examples to test
on, so we can compare the results. It's late in the day on the US East
coast, so I may not be able to get to it until tomorrow.

Daniel
> On Oct 17, 2018, at 4:27 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
>
> I mean... not really? I store everything in a database. I created a stream
> that reads my training data from the database, splits it up into tokens,
> and creates DocumentSample objects. The DocumentSampleStream interface is
> really easy to work with and I really didn't have to implement much.
>
> I am quite confident that I created the DocumentSample objects and the
> stream properly. I mean, it is reading in the tokens and I am training
> models. For me, this was basically a unit test. I created a very simple
> model with a single observation in is_cat_1 and a bunch of observations in
> is_not_cat_1, ran a document without any overlap against the model, and it
> predicted a probability over 90% of it belonging to is_cat_1.
>
> At that point, I posted this to the group. I really can't figure out how
> that would be possible. I could even email the training data to someone if
> they would like.
>
> Before that, though: I think I am using 1.8.4. I can try upgrading to the
> 1.9.0 release.
>
> Thanks,
> ~Ben
>
> On Wed, Oct 17, 2018 at 4:00 PM Dan Russ <danrus...@gmail.com> wrote:
>
>> Really surprised. Looking at the documentation, your training data should
>> be in the following format (see
>> https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training):
>>
>> Is_cat_1 <text>
>> Is_not_cat_1 <text>
>>
>> Is that how you formatted your data?
>> Daniel
>>
>>> On Oct 17, 2018, at 3:50 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
>>>
>>> Hi! Thanks for the reply.
>>>
>>> Yes, there is a massive imbalance. Out of the thousands of observations
>>> I have, only a small handful are actually positive observations in
>>> is_cat_1. The rest are in is_not_cat_1. In some cases, the number of
>>> positives is 1.
>>>
>>> For example, in one category, the only observation in is_cat_1 is:
>>>
>>> assault use reckless force or vi
>>>
>>> I have a bunch of observations in is_not_cat_1. This model matched the
>>> text
>>>
>>> 0099 usc 18 usc 2
>>>
>>> with a probability over 90%. Mind you, I expected this setup to be
>>> horrible. I actually expected this sort of text to get close to 100% in
>>> is_not_cat_1, but what I really cannot explain is the overlap. I did
>>> verify that the BoW produces the following features:
>>>
>>> ["bow=assault", "bow=use", "bow=reckless", "bow=force", "bow=or", "bow=vi"]
>>>
>>> The only thing I could come up with is something like each string being
>>> broken apart into individual letters, but that wouldn't make sense. Or
>>> would it?
>>>
>>> Thanks,
>>> ~Ben
>>>
>>> On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <danrus...@gmail.com> wrote:
>>>
>>>> Hi Ben,
>>>> Are you sure that your training documents are formatted appropriately?
>>>> Also, do you have a large imbalance in the number of training documents?
>>>> If the text in the testing document is not in either CAT_1 or the
>>>> OTHER_CAT, there will be a .5 assignment to each category (assuming an
>>>> equal number of documents, so the prior doesn't change the value). A .5
>>>> assignment is like "I can't tell the two categories apart". You probably
>>>> don't want to think of it as "you don't look like CAT_1, so you are
>>>> NOT_CAT_1".
>>>> Daniel
>>>>
>>>>> On Oct 17, 2018, at 1:14 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
>>>>>
>>>>> Hello all,
>>>>>
>>>>> I can't quite figure out how the Doccat MaxEnt modeling works. Here is
>>>>> my setup:
>>>>>
>>>>> I have a set of training texts split into is_cat_1 and is_not_cat_1. I
>>>>> train my model using the default bag-of-words model. I have a document
>>>>> with no overlapping text with the texts that are in is_cat_1. It might
>>>>> overlap with text in is_not_cat_1. Meaning: every single word in the
>>>>> document I want to categorize does not appear in any of the model
>>>>> training data in the is_cat_1 category.
>>>>>
>>>>> The result of the MaxEnt model for my document is a probability over
>>>>> 90% that it fits into is_cat_1. Why is that?
>>>>>
>>>>> Thanks,
>>>>> ~Ben
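
---

For concreteness, here is a minimal sketch of the kind of database-backed
sample stream Ben describes. The table and column names (training_docs,
category, body) are hypothetical stand-ins for whatever schema is actually
in use; the OpenNLP types (ObjectStream, DocumentSample,
WhitespaceTokenizer) are the real 1.8/1.9 API.

import java.io.IOException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;

/** Streams (category, text) rows out of a database as DocumentSamples. */
public class DbDocumentSampleStream implements ObjectStream<DocumentSample> {

  private final Statement stmt;
  private final ResultSet rows;

  public DbDocumentSampleStream(Connection conn) throws SQLException {
    stmt = conn.createStatement();
    // Hypothetical table and columns -- substitute your own schema.
    rows = stmt.executeQuery("SELECT category, body FROM training_docs");
  }

  @Override
  public DocumentSample read() throws IOException {
    try {
      if (!rows.next()) {
        return null; // null signals end-of-stream to the trainer
      }
      String category = rows.getString("category"); // e.g. "is_cat_1"
      // DocumentSample wants the document pre-split into tokens. Passing a
      // single unsplit string, or splitting into characters, is exactly the
      // kind of bug that would produce one-letter "bow=" features.
      String[] tokens =
          WhitespaceTokenizer.INSTANCE.tokenize(rows.getString("body"));
      return new DocumentSample(category, tokens);
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void reset() {
    throw new UnsupportedOperationException(
        "re-run the query to make another pass");
  }

  @Override
  public void close() throws IOException {
    try {
      rows.close();
      stmt.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}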
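And a sketch of the end-to-end sanity check the thread points toward: train
from a plain-text file in the one-sample-per-line "category <text>" format
from the manual, then categorize the problem document and print both the
exact features the default bag-of-words generator extracts and the full
outcome distribution. The file name train.txt is a placeholder.

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class DoccatSanityCheck {

  public static void main(String[] args) throws Exception {
    // train.txt holds one "category<whitespace>document text" pair per
    // line, e.g.:  is_cat_1 assault use reckless force or vi
    ObjectStream<String> lines = new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("train.txt")),
        StandardCharsets.UTF_8);
    ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

    // Note: defaultParams() uses a feature-count cutoff of 5, so terms seen
    // fewer than 5 times are ignored -- worth checking when a category has
    // only one training document.
    DoccatModel model = DocumentCategorizerME.train(
        "en", samples, TrainingParameters.defaultParams(),
        new DoccatFactory());
    samples.close();

    DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
    String[] tokens =
        WhitespaceTokenizer.INSTANCE.tokenize("0099 usc 18 usc 2");

    // Print the exact features the default BoW generator sees; a stream of
    // one-character features here would point at a tokenization bug.
    System.out.println(new BagOfWordsFeatureGenerator()
        .extractFeatures(tokens, Collections.emptyMap()));

    double[] outcomes = categorizer.categorize(tokens);
    for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
      System.out.printf("%s = %.4f%n",
          categorizer.getCategory(i), outcomes[i]);
    }
    System.out.println("best: " + categorizer.getBestCategory(outcomes));
  }
}

Printing the whole outcome vector, rather than only the best category,
makes it easier to see whether the model is genuinely confident or merely
reflecting the class prior from the imbalanced training set.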