Hi Ben,

I have some experience with OpenNLP Doccat and can answer from my experience.
I am using NaiveBayes; I have become convinced that it works better (for me) than MaxEnt. My setup is that I do not have much training data.
http://www.ifp.illinois.edu/~iracohen/publications/precision-ecml04-ColorTR-final.pdf

I work with more than 2 classes, but only one class is assigned to a document.

Probabilities for all classes are available, but only the best category is printed. Hack the code to get all categories with probabilities :)

I always give the trainer texts that are pre-tokenized. Doccat tokenizes by whitespace only.

Adding word bigrams usually helps to improve prediction quality.

Where I saw a very good boost in quality is feature selection. So far I have only used chi2 and want to try Information Gain (check the paper mentioned above). I do it before training with a set of additional scripts.

Best regards,
Nikolai

On Tue, Oct 2, 2018 at 7:28 PM Benedict Holland <benedict.m.holl...@gmail.com> wrote:

> Hello all,
>
> I have a few questions about the document categorizer that reading the
> manual didn't solve.
>
> 1. How many individual categories can I include in the training data?
>
> 2. Assume I have C categories. If I assume a document will have multiple
> categories *c*, should I develop C separate models where labels are
> is_*c* and is_not_*c*? For example, assume I have a corpora of text from
> pet store advertisements. Model 1 would have tags: is_about_cats and
> is_not_about_cats. Model 2 would have tags: is_about_dogs and
> is_not_about_dogs. Model 3 would have tags: is_about_birds and
> is_not_about_birds. One could imagine an ad would be about cats, dogs,
> and not birds (for example).
>
> 3. When I use the model to estimate the category for a document, do I get
> a probability for each of the categories?
>
> 4. Should I stem the text tokens or will the ME function handle that for
> me?
>
> 5. How can I add to the ME function to test out if there are features
> that the ME model does not currently include that are probably important?
> This might get into model development. I am not sure. It is entirely
> possible that I missed that in the documentation.
>
> Thank you so much!
> ~Ben
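
P.S. To make the "bigrams plus chi2 feature selection before training" idea from the reply concrete, here is a minimal sketch of what such a pre-processing script could look like. It is an illustration only, not the actual scripts mentioned above. All function names (ngram_features, chi2_scores, select_features, to_doccat_line) are made up for this example, joining bigrams with "_" is an assumption chosen so that whitespace tokenization keeps each bigram as one token, and taking the maximum chi2 over categories is just one common way to get a single score per feature. The output format assumed is one document per line, category first, then whitespace-separated tokens.

```python
from collections import Counter


def ngram_features(tokens):
    """Return unigrams followed by word bigrams for a pre-tokenized text."""
    # Assumption: bigrams are joined with "_" so they survive whitespace tokenization.
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def chi2_scores(docs):
    """docs is a list of (category, tokens); score is the max chi2 over categories."""
    n = len(docs)
    cat_count = Counter(cat for cat, _ in docs)
    feat_count = Counter()  # number of documents containing the feature
    joint = Counter()       # number of documents of a category containing the feature
    for cat, tokens in docs:
        for f in set(ngram_features(tokens)):
            feat_count[f] += 1
            joint[(f, cat)] += 1
    scores = {}
    for f, nf in feat_count.items():
        best = 0.0
        for cat, nc in cat_count.items():
            a = joint[(f, cat)]  # in category, has feature
            b = nf - a           # other categories, has feature
            c = nc - a           # in category, lacks feature
            d = n - nf - c       # other categories, lacks feature
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:            # skip degenerate 2x2 tables
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[f] = best
    return scores


def select_features(docs, k):
    """Keep the k highest-scoring features (ties broken alphabetically)."""
    scores = chi2_scores(docs)
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return set(ranked[:k])


def to_doccat_line(category, tokens, keep):
    """Format one training line: category, then the kept features."""
    kept = [f for f in ngram_features(tokens) if f in keep]
    return category + " " + " ".join(kept)


if __name__ == "__main__":
    # Toy data, invented for illustration.
    docs = [
        ("cats", ["cat", "food", "sale"]),
        ("cats", ["cat", "toy", "sale"]),
        ("dogs", ["dog", "food", "sale"]),
        ("dogs", ["dog", "toy", "sale"]),
    ]
    keep = select_features(docs, 2)
    for cat, tokens in docs:
        print(to_doccat_line(cat, tokens, keep))
```

Words that occur equally often in every class (like "sale" in the toy data) get a score of zero and drop out first. The same filter would have to be applied to documents at prediction time, so that the model sees the same kind of features it was trained on.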