Hi Ben,

   It sounds like you want a multi-label classifier (one that can assign more 
than one outcome class to a document).  There are different ways of attacking 
the problem.  One common approach is to train a separate binary classifier for 
each outcome class c (c vs. not-c).  The scores from the separate models won't 
be normalized against each other, but yes, each model gives you its probability 
of the document belonging to that group.  Since multi-label classification is 
not my specialty, I'm going to toss it to the rest of the group for other 
suggestions.
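
If you go the one-model-per-class route with OpenNLP's DocumentCategorizerME, 
each model is just a two-category categorizer, and categorize() returns a 
probability for every category in that model (which also covers your question 
3).  Here is a rough sketch, assuming training data in the usual 
DocumentSample format of one document per line with the category first (the 
file name cats-train.txt and the sample tokens are made up for illustration):

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class OneVsRestCats {

    public static void main(String[] args) throws Exception {
        // Training data for one binary model, e.g. lines such as
        // "is_about_cats <tokens...>" and "is_not_about_cats <tokens...>".
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("cats-train.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        DoccatModel model = DocumentCategorizerME.train("en", samples,
                TrainingParameters.defaultParams(), new DoccatFactory());

        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        String[] tokens = {"kitten", "toys", "on", "sale"};

        // One probability per category of this model, so a binary model
        // yields P(is_about_cats) and P(is_not_about_cats).
        double[] probs = categorizer.categorize(tokens);
        for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
            System.out.printf("%s = %.3f%n", categorizer.getCategory(i), probs[i]);
        }
    }
}

You would train one such model per class (cats, dogs, birds, ...), run each 
document through all of them, and treat a class as present when its is_about_* 
probability clears whatever threshold you pick.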

   Stemming may improve your results, or it may have no effect.  Test it both 
ways and you'll see.
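
If you do try it, as far as I know the categorizer won't stem for you, so you 
would stem the tokens yourself before training and before categorizing, e.g. 
with opennlp.tools.stemmer.PorterStemmer (just a sketch, with made-up tokens):

import opennlp.tools.stemmer.PorterStemmer;

public class StemTokens {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        String[] tokens = {"advertisements", "running", "cats"};

        // Replace each token with its stem before passing the array on.
        String[] stemmed = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            stemmed[i] = stemmer.stem(tokens[i]).toString();
        }
        System.out.println(String.join(" ", stemmed));  // e.g. "advertis run cat"
    }
}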

  Your question five is really interesting, and it is something NLP 
application researchers struggle with.  You are asking “what am I missing, or 
can I use my knowledge of the problem to improve the classifier?”  Sorry, but 
the answer is maybe.  This is why you need “big data” to train really good 
models: your model needs to see many, many different scenarios to learn how to 
adapt to the problem.  Feature engineering is really difficult and not 
something people do well.  I hope you are starting to see why deep learning is 
beating other methodologies.  It can weigh many non-linear combinations of the 
inputs to find the best set of features for classification (limited only by 
the features you supply).  In a sense, deep learning models the features 
themselves.
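
On the practical side, OpenNLP does let you add your own features to the 
categorizer by passing extra FeatureGenerator implementations to the 
DoccatFactory alongside the default bag-of-words features.  Here is a rough 
sketch (the price feature is only an illustration, and the DoccatFactory 
constructor varies a bit between releases, so check the Javadoc for your 
version):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.FeatureGenerator;

// Emits a feature when an ad mentions a price, on top of the usual
// bag-of-words features.
public class PriceFeatureGenerator implements FeatureGenerator {

    @Override
    public Collection<String> extractFeatures(String[] text,
                                              Map<String, Object> extraInformation) {
        Collection<String> features = new ArrayList<>();
        for (String token : text) {
            if (token.matches("\\$?\\d+(\\.\\d{2})?")) {
                features.add("has_price=true");
                break;
            }
        }
        return features;
    }
}

// Wiring it into training (sketch):
// DoccatFactory factory = new DoccatFactory(new FeatureGenerator[] {
//     new BagOfWordsFeatureGenerator(), new PriceFeatureGenerator() });
// DoccatModel model = DocumentCategorizerME.train("en", samples,
//     TrainingParameters.defaultParams(), factory);

Whether a hand-built feature like that actually helps is exactly the “maybe” 
above; you still have to test it.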

Hope it helps
Daniel


> On Oct 2, 2018, at 1:28 PM, Benedict Holland <benedict.m.holl...@gmail.com> 
> wrote:
> 
> Hello all,
> 
> I have a few questions about the document categorizer that reading the
> manual didn't solve.
> 
> 1. How many individual categories can I include in the training data?
> 
> 2. Assume I have C categories. If I assume a document will have multiple
> categories *c*, should I develop C separate models where labels are is_*c* and
> is_not_*c*? For example, assume I have a corpus of text from pet store
> advertisements. Model 1 would have tags: is_about_cats and
> is_not_about_cats. Model 2 would have tags: is_about_dogs and
> is_not_about_dogs. Model 3 would have tags: is_about_birds and
> is_not_about_birds. One could imagine an ad would be about cats, dogs, and
> not birds (for example).
> 
> 3. When I use the model to estimate the category for a document, do I get a
> probability for each of the categories?
> 
> 4. Should I stem the text tokens or will the ME function handle that for
> me?
> 
> 5. How can I add to the ME function to test out if there are features that
> the ME model does not currently include that are probably important? This
> might get into model development. I am not sure. It is entirely possible
> that I missed that in the documentation.
> 
> Thank you so much!
> ~Ben
