Also please forgive any grammatical errors.

On Mon, Sep 26, 2011 at 4:53 PM, Zach Richardson <[email protected]> wrote:
> Ok that makes sense.
>
> In general, classification methods with text are not super awesome with edge
> cases. The best way to prevent that is to just have a very large training
> set, and pick your categories very carefully.
>
> Remember, you're just trying to produce something that is "mostly" right. I
> would just accept that the name "Ted Dunning" might get learned as a
> feature, and that the probability that it is relevant is worth getting it
> wrong infrequently.
>
> If you are using something like an SVM, you can look at the support vectors
> and feature weightings to see what the model is learning, and then use that
> to filter more words from your training set. For instance, it might be worth
> removing names from the training set so that your model doesn't learn
> them.
>
> I know the first time we played with the 20 Newsgroups data set it was
> heavily weighting the email addresses of the people posting, which means
> that the model wouldn't generalize well. So we filtered out email addresses.
>
> Not sure if this is helpful or not. Just my 2 cents.
>
> Zach
>
>
> On Mon, Sep 26, 2011 at 11:08 AM, Em <[email protected]> wrote:
>
>> Zach,
>>
>> thanks for your feedback!
>>
>> I want to categorize them into a general-purpose category (nothing
>> individual).
>> The goal is to get an overview of every document that has to do with
>> the domain in some way and to throw away everything else.
>>
>> Regards,
>> Em
>>
>> On 26.09.2011 17:11, Zach Richardson wrote:
>> > Em,
>> >
>> > This really all depends on your goal. Do you want them to be scored as
>> > interesting to an individual or do you want them categorized into topics?
>> >
>> > How you set those problems up can be very different based on the end goal.
>> > What is yours?
>> >
>> > Thanks,
>> >
>> > Zach
>> >
>> >
>> > On Mon, Sep 26, 2011 at 9:55 AM, Em <[email protected]> wrote:
>> >
>> >> No experiences?
>> >>
>> >> Regards,
>> >> Em
>> >>
>> >> On 23.09.2011 12:48, Em wrote:
>> >>> Hello list,
>> >>>
>> >>> let's say I want to classify documents and there are two possible
>> >>> outcomes:
>> >>> Yes, the document belongs to the topic I focus on, or No, it doesn't.
>> >>>
>> >>> The topic is for example: Machine Learning.
>> >>>
>> >>> Doc1: A sub-chapter of the book "Mahout in Action"
>> >>> Doc2: A paper about clustering techniques
>> >>> Doc3: A blog post by Ted Dunning, machine learning expert, talking about
>> >>> his opinion regarding the relationship between Google and Oracle
>> >>> Doc4: Ted Dunning is talking about how to cook tasty spaghetti (Sorry
>> >>> Ted, you are my guinea pig in this case)
>> >>>
>> >>> The point is: Doc3 is not really about Machine Learning, however it
>> >>> might be relevant for people that are interested in Machine Learning,
>> >>> since the author is a machine learning expert and his opinion might
>> >>> reflect some thoughts regarding that domain.
>> >>>
>> >>> Doc4 is completely irrelevant. It has to do with Ted Dunning, but not
>> >>> with Machine Learning or software at all. The only exception would be
>> >>> if Ted wrote a piece of Machine Learning software that is creating a
>> >>> recipe for cooking tasty spaghetti ;).
>> >>>
>> >>> If I change the topic to something like "Star Trek":
>> >>>
>> >>> Doc1: A review of a Star Trek movie
>> >>> Doc2: A Star Trek computer game's description
>> >>> Doc3: A review regarding a PlayStation 3 Star Trek game
>> >>> Doc4: The announcement that the gaming studio of the Star Trek games is
>> >>> going to create a new Star Wars game
>> >>> Doc5: A Star Wars book's description
>> >>> Doc6: The gaming studio of the Star Trek games is going to create a
>> >>> Need for Speed clone
>> >>>
>> >>> Doc 1, 2 and 3 are relevant for Trekkies.
>> >>> Doc 4 might be as well, because
>> >>> the studio is an authority for creating good Star Trek games and they
>> >>> noted that their experience with Star Trek will help them build a
>> >>> good Star Wars game. Some fans might be interested in this.
>> >>>
>> >>> However, doc 5 is completely irrelevant, since it has nothing to do with
>> >>> Star Trek.
>> >>> Doc 6 is about an authority in the Star Trek merchandise industry, but it
>> >>> corresponds to the Ted-cooks-spaghetti case from my first example:
>> >>> Doc 6 is irrelevant.
>> >>>
>> >>> Doc3 of my "Machine Learning" example and Doc 4 of my "Star Trek" one
>> >>> are borderline cases of being relevant. They might interest people who
>> >>> focus on the two named domains, but they sail very close to the wind.
>> >>>
>> >>> Does it generally make sense to take such examples into account for
>> >>> training a model? Real humans may have a discussion about whether those
>> >>> examples really belong to the domain they want to focus on.
>> >>>
>> >>> Thank you for your advice.
>> >>>
>> >>> Regards,
>> >>> Em
>> >>
>> >
>> >
>
>
> --
> Zach Richardson
> Ravel, Co-founder
> Austin, TX
> [email protected]
> 512.825.6031
>
>

--
Zach Richardson
Ravel, Co-founder
Austin, TX
[email protected]
512.825.6031
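
Zach's suggestion in the thread (filtering email addresses and names out of the training set so the model cannot learn them as features) could look roughly like the following in Python. This is only a minimal sketch under stated assumptions: `strip_noise`, `EMAIL_RE`, and the sample documents are hypothetical names and data, not anything used in the thread, and the regular expression is a loose approximation of an email address rather than a complete matcher.

```python
import re

# Loose pattern for strings that look like email addresses.
# Assumption: good enough for cleaning training text, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_noise(text, names=()):
    """Remove email addresses and the given names from one training document."""
    text = EMAIL_RE.sub(" ", text)
    for name in names:
        # Whole-word, case-insensitive match so a short name does not
        # remove part of a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


# Hypothetical training documents, loosely modelled on the examples in the thread.
docs = [
    "From: alice@example.org Ted Dunning talks about clustering",
    "Ted Dunning explains how to cook spaghetti",
]
cleaned = [strip_noise(d, names=["Ted Dunning"]) for d in docs]
```

The cleaned documents would then go into whatever feature extraction and classifier are in use. Which names to remove is a judgment call; inspecting the learned feature weights, as suggested in the thread, is one way to find candidates.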
