Hi Ben,

I have some experience with OpenNLP Doccat and can answer from that
experience.

I am using NaiveBayes; I have become convinced that it works better (for
me) than MaxEnt. In my setup I do not have much training data.
http://www.ifp.illinois.edu/~iracohen/publications/precision-ecml04-ColorTR-final.pdf

I work with more than 2 classes but only one class is assigned to a
document. Probabilities for all classes are available but only the best
category is printed. Hack the code to get all categories with probabilities
:)

I always pass pre-tokenized texts to training, because Doccat
tokenizes by whitespace only.
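
A rough sketch of such a pre-tokenization step, in Python. The regex and the
helper names are illustrative assumptions, not part of OpenNLP. It also
assumes the usual doccat training layout of one document per line, with the
category first and the text after it.

```python
import re

# Assumed token pattern: runs of word characters, or single punctuation marks.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def pretokenize(text):
    """Return the text with tokens separated by single spaces."""
    return " ".join(_TOKEN_RE.findall(text))


def to_training_line(category, text):
    """Build one training line: category, a space, then the tokenized text."""
    return f"{category} {pretokenize(text)}"


if __name__ == "__main__":
    print(to_training_line("is_about_cats", "Cat food, toys (and more)!"))
    # prints: is_about_cats Cat food , toys ( and more ) !
```

With punctuation split off this way, a whitespace-only tokenizer sees "food"
and "," as separate tokens instead of the single token "food,".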

Adding word bigrams usually helps to improve prediction quality.
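
One simple way to try this in preprocessing is to append the bigrams to the
token list as extra tokens. This is a minimal sketch: the function name and
the "_" separator are assumptions chosen for illustration, and it is not
how OpenNLP generates n-gram features internally.

```python
def add_bigrams(tokens, sep="_"):
    """Return the tokens followed by all adjacent word pairs joined by sep."""
    bigrams = [f"{a}{sep}{b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


if __name__ == "__main__":
    print(add_bigrams(["cheap", "cat", "food"]))
    # prints: ['cheap', 'cat', 'food', 'cheap_cat', 'cat_food']
```

Because the joined bigrams contain no whitespace, a whitespace tokenizer
keeps each one as a single token.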

Where I saw a very good boost in quality is feature selection. So far I have
only used chi2 and want to try Information Gain (check the paper mentioned
above). I do it before training with a set of additional scripts.
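
A minimal sketch of what a chi2 selection script could look like, in Python.
It is not the actual scripts mentioned above. The assumptions here: documents
are (category, tokens) pairs, each term is scored by its maximum chi2 over
all categories, and only the top k terms are kept.

```python
from collections import Counter, defaultdict


def chi2_scores(docs):
    """docs: list of (category, tokens). Returns {term: max chi2 over categories}."""
    n = len(docs)
    class_count = Counter(cat for cat, _ in docs)
    term_count = Counter()
    joint = defaultdict(Counter)  # joint[term][cat] = docs of cat containing term
    for cat, tokens in docs:
        for term in set(tokens):  # document frequency, not raw counts
            term_count[term] += 1
            joint[term][cat] += 1

    scores = {}
    for term, n_t in term_count.items():
        best = 0.0
        for cat, n_c in class_count.items():
            a = joint[term][cat]  # in category, has term
            b = n_t - a           # other categories, has term
            c = n_c - a           # in category, lacks term
            d = n - n_t - c       # other categories, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(docs, k):
    """Drop every token that is not among the k highest-scoring terms."""
    scores = chi2_scores(docs)
    keep = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return [(cat, [t for t in tokens if t in keep]) for cat, tokens in docs]
```

The filtered documents can then be written back out as training lines, so
the trainer only ever sees the selected terms.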

best regards,
Nikolai

On Tue, Oct 2, 2018 at 7:28 PM Benedict Holland <
benedict.m.holl...@gmail.com> wrote:

> Hello all,
>
> I have a few questions about the document categorizer that reading the
> manual didn't solve.
>
> 1. How many individual categories can I include in the training data?
>
> 2. Assume I have C categories. If I assume a document will have multiple
> categories *c*, should I develop C separate models where labels are is_*c*
> and is_not_*c*? For example, assume I have a corpus of text from pet store
> advertisements. Model 1 would have tags: is_about_cats and
> is_not_about_cats. Model 2 would have tags: is_about_dogs and
> is_not_about_dogs. Model 3 would have tags: is_about_birds and
> is_not_about_birds. One could imagine an ad would be about cats, dogs, and
> not birds (for example).
>
> 3. When I use the model to estimate the category for a document, do I get a
> probability for each of the categories?
>
> 4. Should I stem the text tokens or will the ME function handle that for
> me?
>
> 5. How can I add to the ME function to test out if there are features that
> the ME model does not currently include that are probably important? This
> might get into model development. I am not sure. It is entirely possible
> that I missed that in the documentation.
>
> Thank you so much!
> ~Ben
>
