Hi Daniel,

Yes. I am honestly not sure multi-label classifiers make sense when
multiple binary classifiers are so easy. At the end of the day, these are
all likelihood estimates, and logit models model binary outcomes. It was
just strange that the documentation made it look like I could put a bunch
of tags on text in the same file, but unless OpenNLP is splitting those
out, I don't know how it manages them. I wasn't sure whether OpenNLP was
actually creating C binary classifiers based on the tags themselves.
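For what it's worth, the one-vs-rest scheme I have in mind looks roughly
like this toy Python sketch. To be clear, this is my own made-up scoring
(a tiny smoothed token-count model), not OpenNLP's actual internals; the
point is only that each category gets its own independent binary model:

```python
import math
from collections import Counter

def train_binary(docs, labels):
    """Train a toy binary scorer for one category c: P(is_c | tokens).
    docs: list of token lists; labels: list of bools (is_c)."""
    pos, neg = Counter(), Counter()
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for toks, y in zip(docs, labels):
        (pos if y else neg).update(toks)
    vocab_size = len(set(pos) | set(neg))

    def predict(tokens):
        # log prior plus add-one-smoothed log likelihood for each side
        lp = math.log((n_pos + 1) / (len(labels) + 2))
        ln = math.log((n_neg + 1) / (len(labels) + 2))
        tp, tn = sum(pos.values()), sum(neg.values())
        for t in tokens:
            lp += math.log((pos[t] + 1) / (tp + vocab_size))
            ln += math.log((neg[t] + 1) / (tn + vocab_size))
        # normalize the two log scores into P(is_c)
        m = max(lp, ln)
        ep, en = math.exp(lp - m), math.exp(ln - m)
        return ep / (ep + en)

    return predict

# One independent binary model per category: the per-category
# probabilities need not sum to 1, so a document can score high
# on cats AND dogs while scoring low on birds.
docs = [["cats", "food", "litter"], ["dogs", "leash"],
        ["cats", "dogs", "toys"]]
models = {
    c: train_binary(docs, [c in d for d in docs])
    for c in ("cats", "dogs", "birds")
}
scores = {c: m(["cats", "dogs"]) for c, m in models.items()}
```

A hypothetical pet-store ad mentioning cats and dogs would then clear a
threshold on the cats model and the dogs model independently, with no
requirement that the three scores compete with each other.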

Most of the OpenNLP components, especially the MaxEnt models, typically
accept tokens, POS tags, and other metadata. The document categorizer
doesn't seem to work that way: it only accepts tokens, and I can't quite
figure out why. POS tags seem like something important, unless this is
simply running a token frequency analysis or an LDA or something along
those lines, which is fine, but then I don't quite understand why a MaxEnt
model would be required. I would like to know what it is actually doing
with a set of tokens.
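My best guess is that the categorizer just turns the token list into
bag-of-words indicator features and hands those to the MaxEnt model.
Something like this sketch (the `bow=` feature naming is my assumption
for illustration, not OpenNLP's actual code):

```python
def bag_of_words_features(tokens):
    """Turn raw tokens into the kind of sparse binary features a
    MaxEnt model consumes: one indicator per distinct token, with
    no POS tags or other context involved."""
    return sorted({"bow=" + t.lower() for t in tokens})

feats = bag_of_words_features(["Cats", "eat", "cat", "food"])
```

If that is all it does, then the MaxEnt machinery is really just learning
a weight per token per category, which would explain why it never asks
for POS tags or other metadata.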

The deep learning stuff seems to rely on brute force and seeing what
sticks. I like the MaxEnt models because they don't seem nearly as
arbitrary or black-box as a DNN. I assume a DNN produces better outcomes
simply because it examines so many different possible independent
variables. Basically, it is a very elaborate model selection algorithm,
and once it produces outcomes, we can look at the independent variables
and wonder why they mattered.

Is it possible to modify or append data to the OpenNLP MaxEnt model
framework? I think I might have missed that.

Thanks,
~Ben

On Wed, Oct 3, 2018 at 10:19 AM Daniel Russ <dr...@apache.org> wrote:

> Hi Ben,
>
>    It sounds like you want a multi-label classifier (which can give you
> more than 1 outcome class).  There are different ways of attacking the
> problem.  You can have multiple binary classifiers for each outcome class c
> (vs not c).  There may be some overall normalization issues, but yes it
> should be the (model) probability of being in a group. Being that
> multi-label classification is not my speciality, I'm going to toss it to
> the rest of the group for other suggestions.
>
>    Stemming may improve your results, but it may have no effect.  Test it
> and you’ll see.
>
>   Your question five is really interesting, and NLP application researchers
> struggle with this.  You are asking “what am I missing?” or “can I use my
> knowledge of the problem to improve the classifier?”  Sorry, but the answer
> is maybe.  This is why you need “big data” to train really good models:
> your model needs to see many, many different scenarios to learn how to adapt
> to the problem. Feature engineering is really difficult and not something
> people do well.  I hope you are starting to see why deep learning is beating
> other methodologies: it can weigh many non-linear combinations of features
> to find the best set for classification (limited only by the features you
> supply). Deep learning is kind of like modeling the features.
>
> Hope it helps
> Daniel
>
>
> > On Oct 2, 2018, at 1:28 PM, Benedict Holland <benedict.m.holl...@gmail.com> wrote:
> >
> > Hello all,
> >
> > I have a few questions about the document categorizer that reading the
> > manual didn't solve.
> >
> > 1. How many individual categories can I include in the training data?
> >
> > 2. Assume I have C categories. If I assume a document will have multiple
> > categories *c*, should I develop C separate models where the labels are
> > is_*c* and is_not_*c*? For example, assume I have a corpus of text from
> > pet store advertisements. Model 1 would have tags is_about_cats and
> > is_not_about_cats. Model 2 would have tags is_about_dogs and
> > is_not_about_dogs. Model 3 would have tags is_about_birds and
> > is_not_about_birds. One could imagine an ad being about cats and dogs but
> > not birds, for example.
> >
> > 3. When I use the model to estimate the category for a document, do I
> > get a probability for each of the categories?
> >
> > 4. Should I stem the text tokens or will the ME function handle that for
> > me?
> >
> > 5. How can I add features to the ME model to test whether there are
> > features it does not currently include that are probably important? This
> > might get into model development; I am not sure. It is entirely possible
> > that I missed this in the documentation.
> >
> > Thank you so much!
> > ~Ben
>
>