Document Categorizer Tool questions

Armando Perez Tue, 08 Aug 2017 10:32:44 -0700

OpenNLP Team,

I’m new to this field and have two questions about the Apache OpenNLP 
Document Categorizer Tool.


I have already created a document categorizer model called “en-doccat.bin” 
from a flat file called en-doccat.train.  The training file has over 40 
categories in the first column and user text in the second column, after 
the first whitespace character.  This user text is tokenized and used to 
predict the category that the user is trying to obtain.  For a given piece of 
tokenized user text, each possible category is given a percentage score.
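
To make the file layout concrete, here is a minimal Python sketch of how I 
understand one line of en-doccat.train to be split.  The category name and 
sentence are made up for illustration, and splitting the text on whitespace is 
my assumption about the tokenization, not something I have confirmed in 
OpenNLP.

```python
def parse_training_line(line):
    """Split one training line into (category, tokens).

    Assumed layout: the category is everything before the first
    whitespace character; the rest of the line is the user text.
    """
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected '<category> <text>', got: %r" % line)
    category, text = parts
    # Assumption: plain whitespace tokenization of the user text.
    return category, text.split()


# Hypothetical example line; "Billing" is not one of my real categories.
sample = "Billing I was charged twice for my last order"
category, tokens = parse_training_line(sample)
print(category)
print(tokens)
```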

The first question I have is:  On the already created “en-doccat.bin” model, 
once a user selects a particular category, is there a way to teach the model 
that this category is correct for this tokenized user data, thus increasing 
the prediction percentage score with each correct answer and, vice versa, 
decreasing it with each wrong answer?

As an example, say I select category A from a GUI because it’s correct.  Can I 
now take that tokenized user data and the category and tell the model that its 
prediction is correct, thus increasing its prediction score?

The second question is:  On the already created “en-doccat.bin” model, for 
missing categories, is there a way to add new categories, or new text to 
existing categories, without having to recreate the entire model from scratch?  
That is, without having to add the new data to the en-doccat.train file and 
recreate the en-doccat.bin file all over again?  In other words, can I 
incrementally grow the “en-doccat.bin” model file with new data?
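
For reference, the only approach I know of today is the one I would like to 
avoid: append each confirmed example to en-doccat.train and then rebuild the 
whole model.  A minimal Python sketch of the append step is below.  The 
function name and the sample categories are hypothetical, and the demo writes 
to a throwaway temporary file rather than my real training file.

```python
import os
import tempfile


def append_training_example(train_path, category, text):
    """Append one '<category> <text>' line to the training file."""
    if not category or any(ch.isspace() for ch in category):
        raise ValueError("category must be a single whitespace-free token")
    # Collapse newlines and repeated spaces so the example stays on one line.
    flat_text = " ".join(text.split())
    if not flat_text:
        raise ValueError("text must not be empty")
    line = category + " " + flat_text + "\n"
    with open(train_path, "a", encoding="utf-8") as handle:
        handle.write(line)
    return line


# Demo against a temporary file; the categories are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "en-doccat.train")
    append_training_example(demo_path, "Billing", "I was charged twice")
    append_training_example(demo_path, "Shipping", "My parcel\nnever arrived")
    with open(demo_path, encoding="utf-8") as handle:
        demo_contents = handle.read()

print(demo_contents)
```

After every such append, the model still has to be retrained from the complete 
file, which is exactly the step I am hoping can be skipped.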

The reason I ask the two questions, other than the obvious, is that without 
this, how does the machine learning actually work with Apache OpenNLP?  From my 
little experience, it seems like all this does is use the Max Ent algorithm 
to predict outcomes that are predetermined in the model file “en-doccat.bin”.

Thank you very much for taking your time to answer my questions.  I’m very 
interested in A.I. technology and machine learning and this will go a long way 
in helping me learn about this new and exciting field.

Regards,



Armando Perez
