OK, so you are using the API. If you send me a small training example, I'll 
train a DoccatModel and see what I get. 
Also send a few examples to test on, and we can compare the results. It's late 
in the day on the US East coast, so I may not be able to get to it until tomorrow.
Daniel


> On Oct 17, 2018, at 4:27 PM, Benedict Holland <benedict.m.holl...@gmail.com> 
> wrote:
> 
> I mean... not really? I store everything in a database. I created a stream
> that reads my training data from the database, splits it into tokens,
> and creates DocumentSample objects. The DocumentSampleStream interface is
> really easy to work with, and I didn't have to implement much.
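> 
> For what it's worth, the stream looks roughly like this (a minimal sketch;
> the row iterator and its {category, text} shape stand in for the real DB
> access):
> 
> // Sketch: adapt DB rows into DocumentSample objects.
> import java.util.Iterator;
> import opennlp.tools.doccat.DocumentSample;
> import opennlp.tools.util.ObjectStream;
> 
> public class DbDocumentSampleStream implements ObjectStream<DocumentSample> {
> 
>     private final Iterator<String[]> rows; // each row: {category, text}
> 
>     public DbDocumentSampleStream(Iterator<String[]> rows) {
>         this.rows = rows;
>     }
> 
>     @Override
>     public DocumentSample read() {
>         if (!rows.hasNext()) {
>             return null; // null signals end of stream
>         }
>         String[] row = rows.next();
>         String[] tokens = row[1].split("\\s+"); // simple whitespace tokenization
>         return new DocumentSample(row[0], tokens);
>     }
> 
>     @Override
>     public void reset() {
>         throw new UnsupportedOperationException("re-query the DB to reset");
>     }
> 
>     @Override
>     public void close() {
>         // release the DB cursor/connection here
>     }
> }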
> 
> I am quite confident that I created the DocumentSample objects and the
> stream properly. I mean, it is reading in the tokens and I am training
> models. For me, this was basically a unit test: I created a very simple
> model with a single observation in is_cat_1 and a bunch of observations in
> is_not_cat_1, ran a document with no overlapping words against the model,
> and it predicted a probability of over 90% that it belongs to is_cat_1.
> 
> At that point, I posted this to the group. I really can't figure out how
> that would be possible. I could even email the training data to someone if
> they would like.
> 
> Before that, though: I think I am using 1.8.4. I can try upgrading to the
> 1.9.0 release.
> 
> Thanks,
> ~Ben
> 
> On Wed, Oct 17, 2018 at 4:00 PM Dan Russ <danrus...@gmail.com> wrote:
> 
>> I'm really surprised. Looking at the documentation, your training data
>> should be in the following format (see
>> https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training
>> ):
>> 
>> Is_cat_1 <text>
>> Is_not_cat_1 <text>
>> 
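>> For reference, training from a file in that format looks roughly like this
>> (a sketch following the 1.9.0 manual; the file name is a placeholder):
>> 
>> // Sketch: train a doccat model from "category <text>" lines.
>> import java.io.File;
>> import java.nio.charset.StandardCharsets;
>> import opennlp.tools.doccat.*;
>> import opennlp.tools.util.*;
>> 
>> public class TrainDoccat {
>>     public static void main(String[] args) throws Exception {
>>         InputStreamFactory in =
>>             new MarkableFileInputStreamFactory(new File("train.txt"));
>>         ObjectStream<DocumentSample> samples = new DocumentSampleStream(
>>             new PlainTextByLineStream(in, StandardCharsets.UTF_8));
>>         DoccatModel model = DocumentCategorizerME.train("en", samples,
>>             TrainingParameters.defaultParams(), new DoccatFactory());
>>         // use or serialize "model" here
>>     }
>> }
>> 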
>> Is that how you formatted your data?
>> Daniel
>> 
>>> On Oct 17, 2018, at 3:50 PM, Benedict Holland <
>> benedict.m.holl...@gmail.com> wrote:
>>> 
>>> Hi! Thanks for the reply.
>>> 
>>> Yes. There is a massive imbalance. Out of the thousands of observations I
>>> have, only a small handful are actually positive observations in
>>> is_cat_1; the rest are in is_not_cat_1. In some cases, the number of
>>> positives is 1.
>>> 
>>> For example:
>>> 
>>> In one category, the only observation in is_cat_1 is:
>>> assault use reckless force or vi
>>> 
>>> I have a bunch of observations in is_not_cat_1. The model gave this
>>> text
>>> 
>>> 0099 usc 18 usc 2
>>> 
>>> a probability match of over 90%. Mind you, I expected this setup to be
>>> horrible. I actually expected this sort of text to get close to 100% in
>>> is_not_cat_1, but what I really cannot explain is the overlap. I did
>>> verify that the BoW produces the following features:
>>> ["bow=assault", "bow=use", "bow=reckless", "bow=force", "bow=or", "bow=vi"]
>>> 
>>> The only thing I could come up with is something like each string being
>>> broken apart into individual letters, but that wouldn't make sense. Or
>>> would it?
>>> 
>>> Thanks,
>>> ~Ben
>>> 
>>> On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <danrus...@gmail.com> wrote:
>>> 
>>>> Hi Ben,
>>>> Are you sure that your training documents are formatted appropriately?
>>>> Also, do you have a large imbalance in the number of training documents?
>>>> If the text in the testing document matches neither CAT_1 nor the
>>>> OTHER_CAT, there will be a .5 assignment to each category (assuming an
>>>> equal number of documents, so the prior doesn't change the value). A .5
>>>> assignment is like saying "I can't tell the two categories apart". You
>>>> probably don't want to think of it as "you don't look like CAT_1, so you
>>>> are NOT_CAT_1".
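>>>> 
>>>> (A sketch of the arithmetic: with no active features, each category's
>>>> score is exp(0) = 1, so the normalized probability is 1 / (1 + 1) = .5
>>>> for each category.)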
>>>> Daniel
>>>> 
>>>>> On Oct 17, 2018, at 1:14 PM, Benedict Holland <
>>>> benedict.m.holl...@gmail.com> wrote:
>>>>> 
>>>>> Hello all,
>>>>> 
>>>>> I can't quite figure out how the Doccat MaxEnt modeling works. Here is
>>>>> my setup:
>>>>> 
>>>>> I have a set of training texts split into is_cat_1 and is_not_cat_1. I
>>>>> train my model using the default bag-of-words model. I have a document
>>>>> without any text overlapping the texts in is_cat_1, though it might
>>>>> overlap with text in is_not_cat_1. Meaning, every single word in the
>>>>> document I want to categorize does not appear in any of the model
>>>>> training data in the is_cat_1 category.
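>>>>> 
>>>>> In code, the scoring step is roughly this (a sketch; "model" stands for
>>>>> my trained DoccatModel, and the tokens are placeholders for my test
>>>>> document):
>>>>> 
>>>>> // Sketch: score a tokenized document against the trained model.
>>>>> DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
>>>>> String[] docTokens = {"some", "unseen", "words"}; // placeholder tokens
>>>>> double[] outcomes = categorizer.categorize(docTokens);
>>>>> System.out.println(categorizer.getBestCategory(outcomes));
>>>>> System.out.println(java.util.Arrays.toString(outcomes)); // per-category probs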
>>>>> 
>>>>> The result of the MaxEnt model for my document is a probability of over
>>>>> 90% that it fits into is_cat_1. Why is that?
>>>>> 
>>>>> Thanks,
>>>>> ~Ben
>>>> 
>>>> 
>> 
>> 
