So I'm just getting started with OpenNLP and trying to spin up the DocCat.
I would like to process a series of files in batches to train the
document categorizer.
I assume it is possible to loop through documents:
1) extract the text (will probably try Tika for this), and then
2) send the DocumentSample to the categorizer to add to the model?
I see how I can create a DocumentSample from a category (I will know
this as part of the batch args) and the extracted text. However, I
cannot figure out how to incrementally add that sample to a new (or
existing) model for additional "training".
Obviously, I would like to then save the model between batches so I can
either use it for categorization or incrementally add more
DocumentSample objects to it for further training at some later time.
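To make the question concrete, here is a rough sketch of the fallback I am
considering if it turns out the model cannot be updated in place. This is my
assumption, not something I have confirmed. Each batch appends its samples to
one plain-text training file, one document per line with the category first,
which is the line format I understand DocumentSampleStream reads. The model
would then be retrained from the whole file with DocumentCategorizerME.train
after each batch. The class and method names below (DoccatBatchWriter,
toTrainingLine, appendBatch) are hypothetical helpers of my own, not part of
OpenNLP, and the sketch covers only the file-accumulation step:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): accumulates training samples
// across batches in a "category<space>text" file, one document per line.
class DoccatBatchWriter {

    // Flatten the extracted text onto a single line and prefix the category.
    // Assumes the category itself contains no whitespace.
    static String toTrainingLine(String category, String text) {
        String flat = text.trim().replaceAll("\\s+", " ");
        return category + " " + flat;
    }

    // Append one batch of extracted texts, all belonging to the same category.
    // The file is created on first use and only ever appended to afterwards.
    static void appendBatch(Path trainingFile, String category, List<String> texts)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (String text : texts) {
            if (text == null || text.trim().isEmpty()) {
                continue; // skip documents with no extractable text
            }
            lines.add(toTrainingLine(category, text));
        }
        Files.write(trainingFile, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("doccat-train", ".txt");
        appendBatch(file, "sports", List.of("The match\nended in a draw."));
        appendBatch(file, "finance", List.of("Shares rose   sharply today."));
        // Print the accumulated training data; retraining would read this file.
        Files.readAllLines(file, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```

If that is the right direction, the part I am still missing is the retraining
and model-saving step that would follow each call to appendBatch.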
Does anyone have a Java snippet I could look at to help me get started?
Thank you!
-AJ