Thanks Ted and Lance for the suggestions!

On Sep 20, 2012, at 3:05 AM, Ted Dunning wrote:

> With SGD, you can train for an unclassified category, but the system will
> always produce scores for all trained categories. You might interpret
> these to decide when there is no decision, but the model itself has no
> concept directly of "unclassified".
>
> On Wed, Sep 19, 2012 at 4:55 PM, Lance Norskog <[email protected]> wrote:
>
>> Shouldn't this be 'unclassified'? I think I have seen data in the
>> unclassified buckets with both Bayes and SGD.
>>
>> ----- Original Message -----
>> | From: "Ted Dunning" <[email protected]>
>> | To: [email protected]
>> | Sent: Wednesday, September 19, 2012 2:54:25 PM
>> | Subject: Re: The default category of a binary classifier
>> |
>> | If a classifier is presented text with no words in common with the
>> | training data, it will give you back the most common category in the
>> | training data.
>> |
>> | That said, it is likely to be quite rare when a new document consists
>> | *entirely* of new words. Any overlap with trained vocabulary is likely
>> | to over-ride the basic frequencies of different categories.
>> |
>> | On Wed, Sep 19, 2012 at 1:32 AM, Salman Mahmood
>> | <[email protected]> wrote:
>> |
>> | > First, in mahout, is there a special way to create binary
>> | > classifier? for instance if I am creating classifier for 20 news
>> | > group data, I will just pass 20 as number of categories when
>> | > creating the training object:
>> | >
>> | > new AdaptiveLogisticRegression(20, FEATURES, new L1())
>> | >
>> | > Similarly when creating a binary classifier, I will pass 2 as the
>> | > number of categories and thats it?
>> | >
>> | > Having established that, what is the default category for a binary
>> | > classifier? Lets say I was building a classifier to recognize the
>> | > industry sector for a news item. I have binary models to classify
>> | > things into technology or not technology, banking or not banking,
>> | > health or not health etc.
>> | > I trained the technology model with technology related news as
>> | > positive and all the other news as negative (banking and health).
>> | > Now if the technology model got a news item to classify, from the
>> | > media sector (which it was not trained on), what is the expected
>> | > behavior? Is it gonna say it's a technology news or its not a
>> | > technology news? any default behavior for unseen/untrained news
>> | > items?
>> | > Hope I made the question clear.
>> | > Thanks
>> | >
>>
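
If I read Ted's point correctly, the "unclassified" decision has to be layered on top of the scores, since the model always returns one score per trained category. A minimal sketch of that idea in plain Java follows. It is not Mahout API: the class ScoreGate, the method decide, the UNCLASSIFIED marker, and the cutoff value are all names and numbers I made up for illustration, and the scores array stands in for whatever per-category scores the trained model returns.

```java
/** Hypothetical helper (not part of Mahout): turns per-category scores into a decision. */
final class ScoreGate {
    /** Marker returned when no category's score clears the cutoff. */
    static final int UNCLASSIFIED = -1;

    private ScoreGate() {}

    /**
     * Returns the index of the highest-scoring category, or UNCLASSIFIED
     * when even the best score is below minScore.
     *
     * @param scores   one score per trained category (assumed to come from the model)
     * @param minScore cutoff chosen by the caller; an assumption to tune on held-out data
     */
    static int decide(double[] scores, double minScore) {
        int best = UNCLASSIFIED;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                best = i;
            }
        }
        // The model always produces scores; "no decision" is our interpretation of them.
        return bestScore >= minScore ? best : UNCLASSIFIED;
    }
}
```

For the binary technology model, something like ScoreGate.decide(scores, 0.7) would report "technology" or "not technology" only when one side is clearly ahead, and UNCLASSIFIED otherwise. The 0.7 is arbitrary; a media-sector item could still score confidently on either side, so the cutoff would need checking against real out-of-domain examples.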
