I'm just getting started with OpenNLP and trying to spin up the document categorizer (DocCat).

I would like to process a series of files in batches to train the document categorizer.

I assume it is possible to loop through documents:

1) extract the text (I will probably try Tika for this), and then
2) pass the resulting DocumentSample to the categorizer so it is added to the model?

I see how I can create a DocumentSample from a category (I will know this as part of the batch args) and the extracted text. However, I cannot figure out how to incrementally add that sample to a new (or existing) model for additional training.

Obviously, I would then like to save the model between batches so I can either use it for categorization or add more DocumentSamples to it for further training at some later time.
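
In case the model itself cannot be updated in place, the fallback I can picture is to persist the samples rather than the model, and retrain from the accumulated file after each batch. Below is a rough, JDK-only sketch of that idea, not something I have run against OpenNLP. SampleStore, toLine and appendBatch are names I made up for illustration; they are not OpenNLP classes. It assumes the doccat training format is one document per line, with the category as the first token followed by the text, which (if I read the docs correctly) is what DocumentSampleStream expects when wrapped around a PlainTextByLineStream.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (NOT part of OpenNLP): accumulates training samples
// in a plain-text file so they survive between batches.
class SampleStore {

    // Builds one training line: the category, a single space, then the text
    // with every run of whitespace (including newlines) collapsed to one space.
    // Assumption: the category must be a single token with no whitespace.
    static String toLine(String category, String text) {
        String cat = category.trim();
        if (cat.isEmpty() || cat.matches(".*\\s.*")) {
            throw new IllegalArgumentException("category must be a single token: " + category);
        }
        return cat + " " + text.trim().replaceAll("\\s+", " ");
    }

    // Appends one batch of extracted texts, all with the same category,
    // to the training file. The file is created if it does not exist yet.
    static void appendBatch(Path trainingFile, String category, List<String> texts)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (String text : texts) {
            lines.add(toLine(category, text));
        }
        Files.write(trainingFile, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Each batch run would call something like SampleStore.appendBatch(file, category, extractedTexts) and then retrain on the whole file, so the trade-off is full retraining time per batch in exchange for never losing earlier samples.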

Does anyone have a Java snippet I could look at to help me get started?

Thank you!

-AJ
