Hi Armando,

First, I know that you have a problem to solve, but it would be easier to learn how to use OpenNLP if you started with a 2- or 3-class problem instead of a 40-class problem.
Anyway, congratulations on training your first model.

> On the already created “en-doccat.bin” model, once a user selects a
> particular category, is there a way to teach the model that for this tokenized
> user data that this category is correct?

I am a little confused by this statement. You have already trained the model; are you trying to add more data to the training set? I don’t believe the models can be updated in place. Add the new data to your training set and retrain. If you are testing your model with this data, then you cannot also use it to train the model, because then you have no idea how the model performs on unseen data.

Are you talking about boosting (or is it bagging)? You can boost (or maybe bag), but it is not currently supported by OpenNLP. That may not be a bad idea, but we need to think about whether it works with the concept of MaxEnt classifiers.

> is there a way to add new categories or text to existing categories without
> having to recreate the entire model from scratch?

Not that I know of… sorry.

> how does the machine learning actually work with Apache OpenNLP? From my
> little experience, it seems like all this does is using the Max Ent algorithm
> to predict outcomes that are predetermined in a model file “en-doccat.bin”?

Think of a classifier as a function. Once you specify the function, then yes, all values of the function are predetermined. en-doccat.bin holds all the weights that define the classifier; once you have the weights, the classifier is fully defined for ALL potential input. But that is not a bad thing. How else would a classifier work? Add data and the response is “sorry, can’t do it”. Remember that a model is not reality; we just hope to approximate it as well as possible. When things don’t match, we need to improve our model (usually with more data).
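To make the "classifier as a function" point concrete, here is a minimal sketch in plain Python. It does not use OpenNLP and is not how OpenNLP is implemented; it is a toy softmax (MaxEnt-style) model over word counts, and the categories, sample texts, and function names are all made up for illustration. It shows two things: the prediction depends only on the trained weights, and user feedback is incorporated by adding the confirmed example to the training set and training again from scratch.

```python
import math
from collections import defaultdict


def classify(weights, tokens):
    """Return P(category | tokens); the result depends only on the weights."""
    scores = {cat: sum(w.get(tok, 0.0) for tok in tokens)
              for cat, w in weights.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {cat: math.exp(s - top) for cat, s in scores.items()}
    total = sum(exps.values())
    return {cat: e / total for cat, e in exps.items()}


def train(samples, epochs=200, lr=0.1):
    """Fit softmax weights from scratch on (category, text) pairs."""
    categories = sorted({cat for cat, _ in samples})
    weights = {cat: defaultdict(float) for cat in categories}
    for _ in range(epochs):
        for gold, text in samples:
            tokens = text.lower().split()
            probs = classify(weights, tokens)
            for cat in categories:
                # Log-likelihood gradient: observed label minus predicted probability.
                step = lr * ((1.0 if cat == gold else 0.0) - probs[cat])
                for tok in tokens:
                    weights[cat][tok] += step
    return weights


# Hypothetical data: category first, then the text, as in en-doccat.train.
training = [
    ("billing", "refund my invoice"),
    ("billing", "charge on my card"),
    ("support", "app crashes on login"),
    ("support", "cannot reset my password"),
]

weights = train(training)
query = "card login problem".split()
before = classify(weights, query)

# The user confirms "support" for this text: append it to the training
# set and retrain, rather than patching the existing weights.
retrained = train(training + [("support", "card login problem")])
after = classify(retrained, query)

print("support score before:", before["support"])
print("support score after: ", after["support"])
```

Calling classify twice with the same weights and tokens always gives the same scores, which is the sense in which the outcomes are "predetermined". The score for the confirmed category only changes once a new set of weights has been trained.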
Hope it helps,
Daniel

> On Aug 8, 2017, at 1:22 PM, Armando Perez <perez_arma...@hotmail.com> wrote:
>
> OpenNLP Team,
>
> I’m new to this field and I have a couple of questions, both having to do
> with the Apache OpenNLP Document Categorizer Tool.
>
> I have already created the document categorizer model called “en-doccat.bin”
> from a flat file called en-doccat.train. This model has a selection of over
> 40 categories in the first column and user textual data in the second column
> after the first whitespace character. This user textual data will become
> the tokenized data that will be used to try and predict the category that the
> user is trying to obtain. Obviously, the tokenized user text data and the
> category will be given a percentage score for each possible categorical
> outcome.
>
> The first question I have is: On the already created “en-doccat.bin” model,
> once a user selects a particular category, is there a way to teach the model
> that for this tokenized user data that this category is correct? Thus
> increasing the prediction percentage score with each correct answer. And
> vice versa, decreasing with each wrong answer.
>
> So as an example: I select category A from a GUI because it’s correct. Now
> can I take that tokenized user data and the category and tell the model that
> its prediction is correct? Thus increasing its prediction score?
>
> The second question is: On the already created “en-doccat.bin” model, for
> missing categories, is there a way to add new categories or text to existing
> categories without having to recreate the entire model from scratch? Hence
> without having to add the new data to the en-doccat.train file and recreating
> the en-doccat.bin file all over again? In other words, incrementally increase
> the size of the “en-doccat.bin” model file with new data?
>
> The reason I ask the two questions, other than the obvious, is that without
> this, how does the machine learning actually work with Apache OpenNLP? From
> my little experience, it seems like all this does is using the Max Ent
> algorithm to predict outcomes that are predetermined in a model file
> “en-doccat.bin”?
>
> Thank you very much for taking your time to answer my questions. I’m very
> interested in A.I. technology and machine learning and this will go a long
> way in helping me learn about this new and exciting field.
>
> Regards,
>
> Armando Perez