This may also be relevant: "Logistic Regression in Rare Events Data"
http://gking.harvard.edu/gking/files/abs/0s-abs.shtml

JP

On Tue, Oct 2, 2012 at 7:09 AM, Ted Dunning <[email protected]> wrote:

> Having lots of negative samples won't improve performance that much
> (it shouldn't hurt much either).
>
> The negative examples that you really want are the ones that are close
> to your positive examples.
>
> On Mon, Oct 1, 2012 at 10:54 AM, Salman Mahmood <[email protected]> wrote:
>
>> I am building a binary classifier. Let's assume the classifier decides
>> whether a particular news item is about Apache or not. I have 200
>> positive examples/news items about Apache.
>>
>> I am a bit confused about the negative examples, because there could be
>> a huge number of them. What strategy should I use when preparing the
>> negative data? With 200 positive examples, does it make sense to train
>> the classifier with 5000 negative examples drawn from all other sectors
>> of news (finance, health, sports, misc, travel, etc.), or should the
>> gap between the positive and negative counts not be in the thousands?
>> In that case I am afraid the classifier will not be properly trained.

--
Twitter: @jpatanooga
Principal Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
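
Ted's point about wanting negatives close to your positives is what is
usually called hard negative mining. A minimal sketch of one way to do
it, assuming the news items are plain-text strings held in Python lists,
and using scikit-learn rather than Mahout purely for brevity (the
function name and the k=5000 default are made up for illustration):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def pick_hard_negatives(positives, candidates, k=5000):
        """Return the k candidate negatives most similar to the positives."""
        vec = TfidfVectorizer()
        pos = vec.fit_transform(positives)        # (n_pos, n_terms)
        neg = vec.transform(candidates)           # (n_cand, n_terms)
        centroid = np.asarray(pos.mean(axis=0))   # positive-class centroid
        # Rank candidate negatives by similarity to the positive centroid
        # and keep the k hardest (most positive-looking) ones.
        sims = cosine_similarity(neg, centroid).ravel()
        return [candidates[i] for i in sims.argsort()[::-1][:k]]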
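On the 200-vs-5000 imbalance worry: the counts don't have to match if
you reweight the classes during training (the King paper linked above
makes a related prior-correction argument for rare events). A rough
sketch, again with scikit-learn, assuming positives and negatives are
lists of document strings (the variable names are hypothetical):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    vec = TfidfVectorizer()
    X = vec.fit_transform(positives + negatives)
    y = [1] * len(positives) + [0] * len(negatives)

    # 'balanced' reweights each class inversely to its frequency, so the
    # 200 positives are not swamped by the 5000 negatives.
    clf = LogisticRegression(class_weight='balanced', max_iter=1000)
    clf.fit(X, y)

With the weighting in place you can keep all 5000 negatives; the main
cost of more negatives is training time, not a biased decision boundary.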
