Shouldn't this be 'unclassified'? I think I have seen data in the unclassified buckets with both Bayes and SGD.
----- Original Message -----
| From: "Ted Dunning" <[email protected]>
| To: [email protected]
| Sent: Wednesday, September 19, 2012 2:54:25 PM
| Subject: Re: The default category of a binary classifier
|
| If a classifier is presented text with no words in common with the
| training data, it will give you back the most common category in the
| training data.
|
| That said, it is likely to be quite rare when a new document consists
| *entirely* of new words. Any overlap with trained vocabulary is likely
| to over-ride the basic frequencies of different categories.
|
| On Wed, Sep 19, 2012 at 1:32 AM, Salman Mahmood
| <[email protected]> wrote:
|
| > First, in mahout, is there a special way to create binary classifier?
| > for instance if I am creating classifier for 20 news group data, I
| > will just pass 20 as number of categories when creating the training
| > object:
| >
| > new AdaptiveLogisticRegression(20, FEATURES, new L1())
| >
| > Similarly when creating a binary classifier, I will pass 2 as the
| > number of categories and thats it?
| >
| > Having established that, what is the default category for a binary
| > classifier? Lets say I was building a classifier to recognize the
| > industry sector for a news item. I have binary models to classify
| > things into technology or not technology, banking or not banking,
| > health or not health etc. I trained the technology model with
| > technology related news as positive and all the other news as
| > negative (banking and health). Now if the technology model got a news
| > item to classify, from the media sector (which it was not trained
| > on), what is the expected behavior? Is it gonna say it's a technology
| > news or its not a technology news? any default behavior for
| > unseen/untrained news items?
| > Hope I made the question clear.
| > Thanks
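To make the fallback described in the quoted reply concrete, here is a minimal sketch in plain Java. It is a toy multinomial naive Bayes written only for illustration, not Mahout's implementation: the class PriorFallback, its methods, the labels and the training sentences are all hypothetical. It assumes that words never seen in training contribute no evidence, so a document made up entirely of new words is scored by the category frequencies alone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy classifier (hypothetical, not Mahout code) showing the fallback to the
// most common training category when a document shares no words with training.
class PriorFallback {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    void train(String label, String text) {
        docsPerLabel.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : text.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Start from the log prior: the share of training documents with this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            for (String word : text.toLowerCase().split("\\s+")) {
                if (!vocabulary.contains(word)) {
                    continue; // assumption: unseen words carry no evidence
                }
                int count = wordCounts.get(label).getOrDefault(word, 0);
                // Laplace-smoothed word likelihood for this label.
                score += Math.log((count + 1.0) / (tokensPerLabel.get(label) + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    // Made-up training set: one positive document and three negative ones.
    static PriorFallback example() {
        PriorFallback nb = new PriorFallback();
        nb.train("technology", "new smartphone chip released");
        nb.train("not-technology", "bank raises interest rates");
        nb.train("not-technology", "hospital opens cancer clinic");
        nb.train("not-technology", "bank reports quarterly profit");
        return nb;
    }

    public static void main(String[] args) {
        PriorFallback nb = example();
        // No overlap with the training vocabulary: only the priors decide.
        System.out.println(nb.classify("film studio premieres documentary"));
        // Overlap with the technology vocabulary outweighs the priors.
        System.out.println(nb.classify("smartphone chip"));
    }
}
```

With this training set the media-style item comes back as "not-technology", simply because three of the four training documents carry that label, while "smartphone chip" comes back as "technology". A toy model like this never returns an "unclassified" answer on its own; a bucket of that kind would have to come from something extra, such as a confidence threshold applied to the scores.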
