Hi All,

I say this because the softmax or logit model produces a probability of an
event or events occurring. Assuming they didn't use any mixing
distribution or anything fancy, we make the IIA assumption with logits.
Where I find this most powerful is the binary case. While we can
technically use it to model several events, each of the events is
independent, and we really care about the ratios between them, which
cancel out the shared denominator. Really, what often matters is the
relative log odds. Again, it depends on what is going on with the MaxEnt
function and how they are using it. If OpenNLP assumes 2 categories per
input document set, the result is exactly as you state. If they start to
assume multiple categories, particularly for the same text, then how we
interpret the classification probabilities would change.
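To make the IIA point concrete, here is a tiny sketch in plain NumPy (not
OpenNLP code; the scores are made up) showing that under a softmax the odds
between any two categories depend only on those two scores, so dropping a
third alternative leaves them unchanged:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

# hypothetical scores for three categories A, B, C
scores = np.array([2.0, 1.0, 0.5])
p = softmax(scores)

# IIA: the ratio p(A)/p(B) depends only on the A and B scores --
# the shared denominator cancels, leaving exp(score_A - score_B)
ratio = p[0] / p[1]
assert np.isclose(ratio, np.exp(scores[0] - scores[1]))

# dropping category C rescales the probabilities but leaves the
# A/B odds exactly the same
p2 = softmax(scores[:2])
assert np.isclose(p2[0] / p2[1], ratio)
```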

So I guess, while this is a brute force method, if I have documents that
can belong to one or more of C categories, is it a good solution to
develop C binary models? I believe the only explicit assumption is that
the categories do not overlap, which they obviously don't. I can easily do
this, even for thousands of separate models. It isn't that difficult.

Also, I really don't know how a logit model would handle probability
overlap given the IIA assumption. For example, even in the simple case of
A, B, and A&B outcomes, the assumption is violated. At least, I think this
is true.
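As a sanity check on the C-binary-models idea, here is a toy one-vs-rest
sketch in plain NumPy (the data, weights, and names are all made up, and
this is not OpenNLP's API): one independent logistic model per category,
each answering is_c vs is_not_c, so a document can score high for several
categories at once:

```python
import numpy as np

# Toy multi-label data: 200 documents, 5 features, 3 categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=(3, 5))
Y = (X @ true_w.T + rng.normal(size=(200, 3)) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, steps=500, lr=0.1):
    # plain gradient descent on the binary logistic loss
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

# one binary model per category: is_c vs is_not_c
models = [fit_binary(X, Y[:, c]) for c in range(Y.shape[1])]

# each model gives an independent probability, so the scores for a
# single document need not sum to one
doc = X[0]
probs = [sigmoid(doc @ w) for w in models]
```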

I might be wrong about this. I come to NLP from a rather odd direction (I
do discrete choice modeling and CS).

Thanks,
~Ben



On Wed, Oct 3, 2018 at 2:02 PM Daniel Russ <dr...@apache.org> wrote:

> Hi Ben,
>
>    I disagree with your assessment that it is a logit model and therefore
> is binary.  MaxEnt is more of a case where you are modeling a
> Baseline-Category Logit for nominal responses.  (See Agresti, Intro. to
> Categorical Data Analysis, 2nd Ed., Chapter 6.1). If you have a binary
> problem, this is exactly the log-odds.  The equations for the
> Baseline-Category Logit model are exactly the same as in the GISModel for
> making predictions.
>
>    Nikolai makes an interesting comment that Naive Bayes works better for
> him.  That is interesting because discriminative classification methods
> TEND to work better than generative classification methods; for a nice
> discussion see (
> https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf),
> specifically the stoplight problem.  However, Nikolai’s data may
> have some property that works really well with NB. One thing to remember is
> that the proof of the pudding is in the eating.
>
> Daniel
>
>
> > On Oct 3, 2018, at 11:49 AM, Benedict Holland <
> benedict.m.holl...@gmail.com> wrote:
> >
> > Hi Daniel,
> >
> > Yes. I am honestly not sure if multi-level classifiers make sense when
> > multiple binary classifiers are so easy. At the end of the day, these are
> > all likelihood estimates and logit models model binary outcomes. It was
> > just strange that in the documentation it made it look like I could have
> a
> > bunch of tags on text in the same file but unless OpenNLP is splitting
> > those out, I don't know how OpenNLP is managing it. Like, I wasn't sure
> if
> > OpenNLP was actually creating C binary classifiers based on the tags
> > themselves.
> >
> > In most of the OpenNLP tooling, especially the MaxEnt models, OpenNLP
> > typically accepts tokens, POS tags, and other metadata. For the
> > document classifier, it doesn't seem to work that way. It just seems to
> > accept tokens and I can't quite figure out why. POS tags seem like
> > something important unless this is simply running a token frequency
> > analysis or an LDA or something along those lines, which is fine but I
> > don't quite understand why a MaxEnt would be required. Like, I would like
> > to know what this is actually doing with a set of tokens.
> >
> > The deep learning stuff seems to all rely on brute force and seeing what
> > sticks. I like the MaxEnt models because it doesn't seem nearly as
> > arbitrary or black boxy as a DNN. I assume that a DNN produces better
> > outcomes simply because it examines so many different possible
> independent
> > variables. Basically, it is a very elaborate model selection algorithm
> and
> > once it produces outcomes, we can look into the independent variables and
> > wonder why these mattered.
> >
> > Is it possible to modify or append data to the OpenNLP MaxEnt model
> > framework? I think I might have missed that.
> >
> > Thanks,
> > ~Ben
> >
> > On Wed, Oct 3, 2018 at 10:19 AM Daniel Russ <dr...@apache.org> wrote:
> >
> >> Hi Ben,
> >>
> >>   It sounds like you want a multi-label classifier (which can give you
> >> more than 1 outcome class).  There are different ways of attacking the
> >> problem.  You can have multiple binary classifiers for each outcome
> class c
> >> (vs not c).  There may be some overall normalization issues, but yes it
> >> should be the (model) probability of being in a group. Being that
> >> multi-label classification is not my speciality, I'm going to toss it to
> >> the rest of the group for other suggestions.
> >>
> >>   Stemming may improve your results, but it may have no effect.  Test it
> >> and you’ll see.
> >>
> >>  Your question five is really interesting and NLP application
> researchers
> >> struggle with this.  You are asking “what am I missing? Or can I use my
> >> knowledge of the problem to improve the classifier?”  Sorry, but the
> answer
> >> is maybe.    This is why you need “big data” to train really good
> models.
> >> Your model needs to see many many different scenarios to learn how to
> adapt
> >> to the problem. Feature engineering is really difficult and not what
> people
> >> do well.  I hope you are starting to see why deep learning is beating
> other
> >> methodologies.  It can weigh many non-linear combinations of features
> for
> >> the best set of features for classification (limited only by the
> features
> >> you supply). Deep learning is kind of like modeling the features.
> >>
> >> Hope it helps
> >> Daniel
> >>
> >>
> >>> On Oct 2, 2018, at 1:28 PM, Benedict Holland <
> >>> benedict.m.holl...@gmail.com> wrote:
> >>>
> >>> Hello all,
> >>>
> >>> I have a few questions about the document categorizer that reading the
> >>> manual didn't solve.
> >>>
> >>> 1. How many individual categories can I include in the training data?
> >>>
> >>> 2. Assume I have C categories. If I assume a document will have
> multiple
> >>> categories *c*, should I develop C separate models where labels are
> >>> is_*c* and is_not_*c*? For example, assume I have a corpus of text from
> >>> pet store
> >>> advertisements. Model 1 would have tags: is_about_cats and
> >>> is_not_about_cats. Model 2 would have tags: is_about_dogs and
> >>> is_not_about_dogs. Model 3 would have tags: is_about_birds and
> >>> is_not_about_birds. One could imagine an ad would be about cats, dogs,
> >>> and not birds
> >>> (for example).
> >>>
> >>> 3. When I use the model to estimate the category for a document, do I
> >> get a
> >>> probability for each of the categories?
> >>>
> >>> 4. Should I stem the text tokens or will the ME function handle that
> for
> >>> me?
> >>>
> >>> 5. How can I add to the ME function to test out if there are features
> >> that
> >>> the ME model does not currently include that are probably important?
> This
> >>> might get into model development. I am not sure. It is entirely
> possible
> >>> that I missed that in the documentation.
> >>>
> >>> Thank you so much!
> >>> ~Ben
> >>
> >>
>
>
