Hi Ben,

It sounds like you want a multi-label classifier (one that can assign more than one outcome class to a document). There are different ways of attacking the problem. One is to train a separate binary classifier for each outcome class c (c vs. not-c). There may be some overall normalization issues across the classifiers, but yes, each score should be the model's probability of the document being in that group. Since multi-label classification is not my specialty, I'm going to toss it to the rest of the group for other suggestions.
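A minimal sketch of that one-vs-rest setup, in pure Python rather than OpenNLP (the tiny Naive Bayes model, the pet-store training texts, and the label flags are all invented for illustration): one binary model per class, each reporting its own probability, so a single ad can score high on several labels at once.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class BinaryNB:
    """Tiny bag-of-words Naive Bayes for one label: c vs. not-c."""

    def fit(self, texts, flags):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        for text, flag in zip(texts, flags):
            self.docs[flag] += 1
            self.counts[flag].update(tokenize(text))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def prob(self, text):
        """P(c | text) via Bayes' rule with add-one smoothing."""
        logp = {}
        for flag in (True, False):
            total = sum(self.counts[flag].values())
            lp = math.log(self.docs[flag] / sum(self.docs.values()))
            for tok in tokenize(text):
                lp += math.log((self.counts[flag][tok] + 1)
                               / (total + len(self.vocab)))
            logp[flag] = lp
        # normalize the two log-probabilities into P(c | text)
        m = max(logp.values())
        exp = {f: math.exp(lp - m) for f, lp in logp.items()}
        return exp[True] / (exp[True] + exp[False])

# One model per label; the probabilities are independent per label,
# so an ad can be "about cats" and "about dogs" at the same time.
train = ["cheap cat food and cat toys",
         "dog leashes and dog treats on sale",
         "bird seed and cages",
         "cat and dog grooming special"]
labels = {"cats":  [True, False, False, True],
          "dogs":  [False, True, False, True],
          "birds": [False, False, True, False]}
models = {c: BinaryNB().fit(train, flags) for c, flags in labels.items()}

ad = "discount cat toys this week"
scores = {c: m.prob(ad) for c, m in models.items()}
print({c: round(p, 2) for c, p in scores.items()})
```

This is where the "normalization issues" show up: the per-label probabilities do not sum to 1 across labels, because each binary model normalizes only against its own not-c alternative.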
Stemming may improve your results, or it may have no effect; test it and you'll see.

Your question five is really interesting, and NLP application researchers struggle with it. You are asking, "What am I missing? Can I use my knowledge of the problem to improve the classifier?" Sorry, but the answer is maybe. This is why you need "big data" to train really good models: your model needs to see many, many different scenarios to learn how to adapt to the problem. Feature engineering is really difficult and not something people do well. I hope you are starting to see why deep learning is beating other methodologies. It can weigh many non-linear combinations of features to find the best set for classification (limited only by the features you supply). Deep learning is, in a sense, modeling the features themselves.

Hope it helps,
Daniel

> On Oct 2, 2018, at 1:28 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
>
> Hello all,
>
> I have a few questions about the document categorizer that reading the
> manual didn't solve.
>
> 1. How many individual categories can I include in the training data?
>
> 2. Assume I have C categories. If I assume a document will have multiple
> categories *c*, should I develop C separate models where the labels are
> is_*c* and is_not_*c*? For example, assume I have a corpus of text from
> pet store advertisements. Model 1 would have tags is_about_cats and
> is_not_about_cats. Model 2 would have tags is_about_dogs and
> is_not_about_dogs. Model 3 would have tags is_about_birds and
> is_not_about_birds. One could imagine an ad that is about cats and dogs
> but not birds (for example).
>
> 3. When I use the model to estimate the category for a document, do I get
> a probability for each of the categories?
>
> 4. Should I stem the text tokens, or will the ME function handle that for
> me?
>
> 5. How can I add to the ME function to test whether there are features
> that the ME model does not currently include but that are probably
> important?
> This might get into model development. I am not sure. It is entirely
> possible that I missed that in the documentation.
>
> Thank you so much!
> ~Ben
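On question 4, the only real answer is the experiment Daniel suggests: run it both ways. For intuition, here is a toy illustration of what stemming does to the feature space. The suffix-stripper below is invented and deliberately crude (not the real Porter algorithm; use a proper Porter or Snowball stemmer in practice), and the token list is made up; the point is only that stemming collapses surface variants into one feature, shrinking the vocabulary the classifier has to learn.

```python
def crude_stem(token):
    """Crude suffix-stripper for illustration only; NOT a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # require a reasonably long remaining stem before stripping
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["dogs", "dog", "walking", "walked", "cats", "cat"]
stems = sorted({crude_stem(t) for t in tokens})
print(stems)  # ['cat', 'dog', 'walk'] - six tokens become three features
```

Whether that collapse helps depends on the data: with enough training text the model can often learn the unstemmed variants separately, and stemming can even hurt by merging words you wanted kept apart.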