Hello list,

let's say I want to classify documents and there are two possible outcomes: Yes, the document belongs to the topic I focus on, or No, it doesn't.
The topic is, for example, Machine Learning:

Doc1: A sub-chapter of the book "Mahout in Action"
Doc2: A paper about clustering techniques
Doc3: A blog post by Ted Dunning, a Machine Learning expert, giving his opinion on the relationship between Google and Oracle
Doc4: Ted Dunning talking about how to cook tasty spaghetti (sorry Ted, you are my guinea pig in this case)

The point is: Doc3 is not really about Machine Learning. However, it might be relevant for people who are interested in Machine Learning, since the author is a Machine Learning expert and his opinion might reflect some thoughts regarding that domain. Doc4 is completely irrelevant. It has to do with Ted Dunning, but not with Machine Learning or software at all. The only exception would be if Ted wrote a piece of Machine Learning software that creates recipes for cooking tasty spaghetti ;).

If I change the topic to something like "Star Trek":

Doc1: A review of a Star Trek movie
Doc2: A description of a Star Trek computer game
Doc3: A review of a PlayStation 3 Star Trek game
Doc4: The announcement that the studio behind the Star Trek games is going to create a new Star Wars game
Doc5: A description of a Star Wars book
Doc6: The studio behind the Star Trek games is going to create a Need for Speed clone

Doc1, Doc2 and Doc3 are relevant for Trekkies. Doc4 might be as well, because the studio is an authority on creating good Star Trek games, and they noted that their experience with Star Trek will help them build a good Star Wars game. Some fans might be interested in this. However, Doc5 is completely irrelevant, since it has nothing to do with Star Trek. Doc6 is about an authority in the Star Trek merchandise industry, but it corresponds to the Ted-cooks-spaghetti case from my first example, so Doc6 is irrelevant.

Doc3 of my "Machine Learning" example and Doc4 of my "Star Trek" example are borderline cases of relevance.
They might interest people who focus on the two named domains, but they sail very close to the wind.

Does it generally make sense to take such examples into account when training a model? Even real humans might debate whether those examples really belong to the domain they want to focus on.

Thank you for your advice.

Regards,
Em
