Also please forgive any grammatical errors.
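
As a rough illustration of the filtering described in the quoted message below (dropping email addresses and known names before training), a minimal Python sketch might look like this. It is not the code that was used on 20 Newsgroups. The regular expression, the function names strip_emails and strip_terms, and the sample text are assumptions made up for illustration, and the pattern is deliberately loose, so it will miss some address formats.

```python
import re

# Assumption: a loose pattern for things that look like email addresses.
# It is not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_emails(text, placeholder=" "):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def strip_terms(text, terms, placeholder=" "):
    """Remove whole-word, case-insensitive occurrences of the given terms.

    `terms` is a hypothetical hand-made list, e.g. author names that the
    model should not be allowed to learn as features.
    """
    if not terms:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


# Made-up sample document, cleaned before it reaches the vectorizer.
sample = "From: ted@example.com  Ted Dunning on clustering"
cleaned = strip_terms(strip_emails(sample), ["Ted Dunning"])
print(cleaned)
```

Running both filters before feature extraction keeps the address and the name out of the vocabulary altogether, so the model cannot put weight on them.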

On Mon, Sep 26, 2011 at 4:53 PM, Zach Richardson <[email protected]> wrote:

> Ok that makes sense.
>
> In general classification methods with text are not super awesome with edge
> cases.  The best way to prevent that is to just have a very large training
> set, and pick your categories very carefully.
>
> Remember, you're just trying to produce something that is "mostly" right.  I
> would just accept that the name "Ted Dunning" might get learned as a
> feature, and that the probability that it is relevant is worth getting it
> wrong infrequently.
>
> If you are using something like an SVM, you can look at the support vectors
> and feature weightings to see what the model is learning, and then use that
> to filter more words from your training set.  For instance, it might be worth
> removing names from the training set so that your model doesn't learn
> them.
>
> I know the first time we played with the 20 Newsgroups it was heavily
> weighting the email addresses of the people posting--which means that the
> model wouldn't generalize well.  So we filtered out email addresses.
>
> Not sure if this is helpful or not.  Just my 2 cents.
>
> Zach
>
>
> On Mon, Sep 26, 2011 at 11:08 AM, Em <[email protected]> wrote:
>
>> Zach,
>>
>> thanks for your feedback!
>>
>> I want to categorize them into a general-purpose category (nothing
>> individual).
>> The goal is to get an overview of every document that has to do with
>> the domain in some way and to throw away everything else.
>>
>> Regards,
>> Em
>>
>> Am 26.09.2011 17:11, schrieb Zach Richardson:
>> > Em,
>> >
>> > This really all depends on your goal.  Do you want them to be scored as
>> > interesting to an individual or do you want them categorized into topics?
>> >
>> > How you set those problems up can be very different based on the end goal.
>> > What is yours?
>> >
>> > Thanks,
>> >
>> > Zach
>> >
>> >
>> > On Mon, Sep 26, 2011 at 9:55 AM, Em <[email protected]> wrote:
>> >
>> >> No experiences?
>> >>
>> >> Regards,
>> >> Em
>> >>
>> >> Am 23.09.2011 12:48, schrieb Em:
>> >>> Hello list,
>> >>>
>> >>> let's say I want to classify documents and there are two possible outcomes:
>> >>> Yes, the document belongs to the topic I focus on, or No, it doesn't.
>> >>>
>> >>> The topic is for example: Machine Learning.
>> >>>
>> >>> Doc1: A sub-chapter of the book "Mahout in Action"
>> >>> Doc2: A paper about clustering-techniques
>> >>> Doc3: A Blog-Post of Ted Dunning, Machine-Learning-Expert, talking about
>> >>> his opinion regarding the relationship between Google and Oracle
>> >>> Doc4: Ted Dunning is talking about how to cook tasty spaghetti (Sorry
>> >>> Ted, you are my guinea pig in this case)
>> >>>
>> >>> The point is: Doc3 is not really about Machine Learning, however it
>> >>> might be relevant for people that are interested in Machine Learning,
>> >>> since the author is a Machine-Learning-Expert and his opinion might
>> >>> reflect some thoughts regarding that domain.
>> >>>
>> >>> Doc4 is completely irrelevant. It has to do with Ted Dunning, but with
>> >>> neither Machine Learning nor software at all. The only exception would be
>> >>> if Ted wrote a piece of Machine Learning software that is creating a
>> >>> recipe for cooking tasty spaghetti ;).
>> >>>
>> >>> If I change the topic to something like "Star Trek":
>> >>>
>> >>> Doc1: A review of a Star Trek movie
>> >>> Doc2: A Star Trek computer game's description
>> >>> Doc3: A review regarding a PlayStation 3 Star Trek game
>> >>> Doc4: The announcement that the gaming studio of the Star Trek games is
>> >>> going to create a new Star Wars game
>> >>> Doc5: A Star Wars book's description
>> >>> Doc6: The gaming studio of the Star Trek games is going to create a
>> >>> Need for Speed clone
>> >>>
>> >>> Doc 1, 2 and 3 are relevant for Trekkies. Doc 4 might be as well, because
>> >>> the studio is an authority for creating good Star Trek games and they
>> >>> noted that their experiences with Star Trek will help them build a
>> >>> good Star Wars game. Some fans might be interested in this.
>> >>>
>> >>> However, Doc 5 is completely irrelevant, since it has nothing to do with
>> >>> Star Trek.
>> >>> Doc 6 is about an authority in the Star Trek merchandise industry, but it
>> >>> corresponds to the Ted-cooks-spaghetti case from my first example:
>> >>> Doc 6 is irrelevant.
>> >>>
>> >>> Doc3 of my "Machine Learning" example and Doc 4 of my "Star Trek" one
>> >>> are borderline cases of being relevant. They might interest people that
>> >>> focus on the two named domains, but they sail very close to the wind.
>> >>>
>> >>> Does it generally make sense to take such examples into account for
>> >>> training a model? Real humans may disagree about whether those examples
>> >>> really belong to the domain they want to focus on.
>> >>>
>> >>> Thank you for your advice.
>> >>>
>> >>> Regards,
>> >>> Em
>
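
The advice in the quoted message about looking at feature weights to see what the model is learning can be illustrated with a small, self-contained Python sketch. To avoid depending on any library, it uses a bag-of-words perceptron as a stand-in for a linear SVM; both learn one weight per word, but this is a swapped-in technique, not an SVM. The four training documents, their labels, and the function names are hypothetical and invented for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; enough for a toy example."""
    return text.lower().split()


def train_perceptron(docs, labels, epochs=10):
    """Train a bag-of-words perceptron; labels are +1 (relevant) or -1."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            tokens = tokenize(text)
            score = bias + sum(weights[t] for t in tokens)
            if y * score <= 0:  # misclassified: nudge weights toward the label
                for t in tokens:
                    weights[t] += y
                bias += y
    return dict(weights), bias


def top_features(weights, k=5):
    """Return the k features with the largest positive weight."""
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical toy corpus: the author name appears in every relevant document.
docs = [
    "dunning clustering with mahout",
    "dunning svm training tips",
    "star wars book description",
    "racing game review",
]
labels = [1, 1, -1, -1]

weights, bias = train_perceptron(docs, labels)
print(top_features(weights, 3))
```

On this toy data the name ends up with the largest weight, which is the situation described in the quoted message: a document about cooking that mentions the same name would score as relevant. Inspecting the top-weighted features is one way to spot such terms and add them to the filter list before retraining.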


-- 
Zach Richardson
Ravel, Co-founder
Austin, TX
[email protected]
512.825.6031
