OpenNLP Team, I’m new to this field and I have two questions, both about the Apache OpenNLP Document Categorizer tool.
I have already created a document categorizer model called “en-doccat.bin” from a flat file called en-doccat.train. Each line of the training file has one of over 40 categories in the first column, followed by the user's text in the second column, after the first whitespace character. This user text is tokenized and used to predict the category the user is trying to obtain. For a given piece of tokenized text, the model gives each possible category a percentage score.

The first question is: on the already created “en-doccat.bin” model, once a user selects a particular category, is there a way to teach the model that this category is correct for this tokenized user text, so that the prediction score increases with each correct answer and decreases with each wrong one? As an example, I select category A from a GUI because it is correct. Can I now take that tokenized user text and the category and tell the model that its prediction is correct, thus increasing its prediction score?

The second question is: on the already created “en-doccat.bin” model, is there a way to add missing categories, or new text for existing categories, without having to recreate the entire model from scratch? That is, without having to add the new data to the en-doccat.train file and recreate the en-doccat.bin file all over again? In other words, can the “en-doccat.bin” model file be grown incrementally with new data?

The reason I ask these two questions, other than the obvious, is that without this, how does the machine learning actually work with Apache OpenNLP? From my limited experience, it seems that all it does is use the MaxEnt algorithm to predict outcomes that are predetermined in the model file “en-doccat.bin”.

Thank you very much for taking the time to answer my questions. I’m very interested in A.I. technology and machine learning, and this will go a long way in helping me learn about this new and exciting field.

Regards, Armando Perez
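P.S. To make the file layout concrete, here is a minimal plain-Java sketch of the line format described above (category first, text after the first whitespace) and of the workaround I would like to avoid: appending each confirmed sample to the training data and then retraining. It does not call OpenNLP at all. The class name, method names, and the "Billing" and "Shipping" categories are made up for illustration and are not taken from my real data.

```java
import java.util.ArrayList;
import java.util.List;

public class DoccatTrainingLines {

    /** Splits one training line at the first run of whitespace into [category, text]. */
    public static String[] parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected '<category> <text>': " + line);
        }
        return parts;
    }

    /** Builds a training line from a category the user confirmed and the text they entered. */
    public static String format(String category, String text) {
        return category + " " + text.trim();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the contents of en-doccat.train.
        List<String> trainingLines = new ArrayList<>();
        trainingLines.add("Billing I was charged twice this month");
        trainingLines.add("Shipping my parcel has not arrived yet");

        // The user picks "Billing" in the GUI for this text, so the pair is
        // appended as a new sample. The model itself is not touched here;
        // it would still have to be retrained from the enlarged file.
        trainingLines.add(format("Billing", "why is there an extra charge on my card"));

        for (String line : trainingLines) {
            String[] sample = parse(line);
            System.out.println(sample[0] + " -> " + sample[1]);
        }
    }
}
```

As far as I can tell, this only grows the training file; whether the existing model can absorb such a sample without a full retrain is exactly what I am asking.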