It is possible to have a secondary model that is intended just to pick out
the lint of the primary model.  Since it has a more limited domain, it is
less likely to get confused.
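A minimal sketch of that two-stage idea (all names and keyword rules here are hypothetical stand-ins for real trained models, not any particular library's API):

```python
# Hypothetical sketch: the primary model covers the whole problem, while a
# secondary model is consulted only on the narrow slice where the primary
# tends to get confused (here, posts that merely mention a famous author).
# The keyword rules stand in for real trained classifiers.

def primary_predict(doc):
    # Broad model: anything mentioning "learning" looks like ML.
    return "ml" if "learning" in doc else "other"

def in_secondary_domain(doc):
    # The limited domain the secondary model was trained on.
    return "dunning" in doc

def secondary_predict(doc):
    # Narrower model: requires real ML vocabulary, not just the name.
    return "ml" if "clustering" in doc or "learning" in doc else "other"

def combined(doc):
    doc = doc.lower()
    if in_secondary_domain(doc):
        return secondary_predict(doc)
    return primary_predict(doc)

print(combined("Ted Dunning on cooking tasty spaghetti"))  # -> other
```

The point of the split is that the secondary model only ever sees its narrow slice, so it has fewer ways to go wrong there.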

Taking a step like eliminating the email addresses has to be done carefully
to determine whether it really makes sense in the context of the intended
application.  For instance, with 20 newsgroups, if you wanted highest
accuracy and didn't mind retraining often, then keeping the email addresses
would be a great thing because it would let the model make use of the
continuity of participation in a group over short periods of time.  On the
other hand, if you are really trying to build a more general-purpose
classifier, perhaps because the training data is very different from the
production data, possibly even from a different source entirely, then
eliminating the email addresses might make sense.
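As a concrete sketch, stripping email addresses could be done with a simple regular-expression pass before tokenization (the pattern and function name are illustrative, not from any particular toolkit):

```python
import re

# Rough pattern for email addresses; real-world matching is messier, so
# treat this as an illustrative sketch rather than a robust tokenizer rule.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def strip_emails(text):
    """Replace email addresses with a space before tokenization."""
    return EMAIL_RE.sub(" ", text)

print(strip_emails("Posted by [email protected] about clustering"))
```

Run once over each document before feature extraction, so addresses never reach the model's vocabulary.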

In particular, you might use a data set like 20 newsgroups to build
preliminary classifiers that you apply to some other kind of data to find
potential training data from that other kind of data.  This kind of
bootstrapping can be very, very useful.
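The bootstrapping loop might look something like this sketch, where a toy keyword scorer stands in for a preliminary classifier trained on 20 newsgroups, and only high-confidence hits on the new corpus become candidate training examples (all names, weights, and thresholds here are made up):

```python
# Toy stand-in for a preliminary classifier trained on one corpus
# (e.g. 20 newsgroups): score a document by summing keyword weights.
def score(doc, keyword_weights):
    return sum(keyword_weights.get(w, 0.0) for w in doc.lower().split())

# Hypothetical weights a trained model might have produced.
keyword_weights = {"clustering": 1.0, "classifier": 1.0, "recipe": -1.0}

# Documents from a different source entirely.
unlabeled = [
    "a new clustering classifier benchmark",
    "a recipe for tasty spaghetti",
]

# Keep only documents the preliminary model is confident about; these
# become candidate training data for the real classifier.
CONFIDENCE_THRESHOLD = 1.5
candidates = [d for d in unlabeled
              if score(d, keyword_weights) > CONFIDENCE_THRESHOLD]
print(candidates)  # -> ['a new clustering classifier benchmark']
```

The harvested candidates would normally still get a human spot-check before being trusted as labels.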

On Mon, Sep 26, 2011 at 2:54 PM, Zach Richardson <[email protected]> wrote:

> Also please forgive any grammatical errors.
>
> On Mon, Sep 26, 2011 at 4:53 PM, Zach Richardson <[email protected]> wrote:
>
> > Ok that makes sense.
> >
> > In general classification methods with text are not super awesome with
> > edge cases.  The best way to prevent that is to just have a very large
> > training set, and pick your categories very carefully.
> >
> > Remember, you're just trying to produce something that is "mostly" right.
> > I would just accept that the name "Ted Dunning" might get learned as a
> > feature, and that the probability that it is relevant is worth getting it
> > wrong infrequently.
> >
> > If you are using something like an SVM, you can look at the support
> > vectors and feature weightings to see what the model is learning, and then
> > use that to filter more words from your training set.  For instance, it
> > might be worth removing names from the training set so that your model
> > doesn't learn them.
> >
> > I know the first time we played with the 20 Newsgroups data set it was
> > heavily weighting the email addresses of the people posting, which means
> > that the model wouldn't generalize well.  So we filtered out email
> > addresses.
> >
> > Not sure if this is helpful or not.  Just my 2 cents.
> >
> > Zach
> >
> >
> > On Mon, Sep 26, 2011 at 11:08 AM, Em <[email protected]> wrote:
> >
> >> Zach,
> >>
> >> thanks for your feedback!
> >>
> >> I want to categorize them into a general-purpose category (nothing
> >> individual).
> >> The goal is to get an overview of every document that has to do with
> >> the domain in some way and to throw away everything else.
> >>
> >> Regards,
> >> Em
> >>
> >> On 26.09.2011 17:11, Zach Richardson wrote:
> >> > Em,
> >> >
> >> > This really all depends on your goal.  Do you want them to be scored
> >> > as interesting to an individual, or do you want them categorized into
> >> > topics?
> >> >
> >> > How you set those problems up can be very different based on the end
> >> > goal.  What is yours?
> >> >
> >> > Thanks,
> >> >
> >> > Zach
> >> >
> >> >
> >> > On Mon, Sep 26, 2011 at 9:55 AM, Em <[email protected]> wrote:
> >> >
> >> >> No experiences?
> >> >>
> >> >> Regards,
> >> >> Em
> >> >>
> >> >> On 23.09.2011 12:48, Em wrote:
> >> >>> Hello list,
> >> >>>
> >> >>> let's say I want to classify documents and there are two possible
> >> >>> outcomes: Yes, the document belongs to the topic I focus on, or No,
> >> >>> it doesn't.
> >> >>>
> >> >>> The topic is for example: Machine Learning.
> >> >>>
> >> >>> Doc1: A sub-chapter of the book "Mahout in Action"
> >> >>> Doc2: A paper about clustering techniques
> >> >>> Doc3: A blog post by Ted Dunning, machine-learning expert, talking
> >> >>> about his opinion regarding the relationship between Google and Oracle
> >> >>> Doc4: Ted Dunning talking about how to cook tasty spaghetti (sorry
> >> >>> Ted, you are my guinea pig in this case)
> >> >>>
> >> >>> The point is: Doc3 is not really about Machine Learning; however, it
> >> >>> might be relevant for people who are interested in Machine Learning,
> >> >>> since the author is a machine-learning expert and his opinion might
> >> >>> reflect some thoughts regarding that domain.
> >> >>>
> >> >>> Doc4 is completely irrelevant. It has to do with Ted Dunning, but not
> >> >>> with Machine Learning, nor with software at all. The only exception
> >> >>> would be if Ted wrote a piece of Machine Learning software that
> >> >>> creates a recipe for cooking tasty spaghetti ;).
> >> >>>
> >> >>> If I change the topic to something like "Star Trek":
> >> >>>
> >> >>> Doc1: A review of a Star Trek movie
> >> >>> Doc2: A Star Trek computer game's description
> >> >>> Doc3: A review regarding a PlayStation 3 Star Trek game
> >> >>> Doc4: The announcement that the gaming studio behind the Star Trek
> >> >>> games is going to create a new Star Wars game
> >> >>> Doc5: A Star Wars book's description
> >> >>> Doc6: The gaming studio behind the Star Trek games is going to
> >> >>> create a Need for Speed clone
> >> >>>
> >> >>> Docs 1, 2 and 3 are relevant for Trekkies. Doc 4 might be as well,
> >> >>> because the studio is an authority for creating good Star Trek games
> >> >>> and they noted that their experience with Star Trek will help them
> >> >>> build a good Star Wars game. Some fans might be interested in this.
> >> >>>
> >> >>> However, Doc 5 is completely irrelevant, since it has nothing to do
> >> >>> with Star Trek.
> >> >>> Doc 6 is about an authority in the Star Trek merchandise industry,
> >> >>> but it parallels the Ted-cooks-spaghetti case from my first example -
> >> >>> Doc 6 is irrelevant.
> >> >>>
> >> >>> Doc3 of my "Machine Learning" example and Doc 4 of my "Star Trek"
> >> >>> one are borderline cases for being relevant. They might interest
> >> >>> people who focus on the two named domains, but they sail very close
> >> >>> to the wind.
> >> >>>
> >> >>> Does it generally make sense to take such examples into account for
> >> >>> training a model? Real humans may have a discussion about whether
> >> >>> those examples really belong to the domain they want to focus on.
> >> >>>
> >> >>> Thank you for your advice.
> >> >>>
> >> >>> Regards,
> >> >>> Em
> >> >>
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Zach Richardson
> > Ravel, Co-founder
> > Austin, TX
> > [email protected]
> > 512.825.6031
> >
> >
> >
>
>
>
