One nice way to sanitize tokens is to use a dictionary that rewrites each
distinct token as t1, t2, t3, ...
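The dictionary rewrite described above can be sketched as follows (the class
and method names here are illustrative, not from Mahout):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a dictionary-based token rewrite: each distinct token is
// replaced by t1, t2, t3, ... in order of first appearance, so feature
// names can be shared without leaking the originals.
public class TokenSanitizer {
    private final Map<String, String> dictionary = new HashMap<>();

    public String sanitize(String token) {
        // size() is read before the new entry is added, so the first
        // unseen token maps to t1, the second to t2, and so on.
        return dictionary.computeIfAbsent(token, k -> "t" + (dictionary.size() + 1));
    }

    public static void main(String[] args) {
        TokenSanitizer s = new TokenSanitizer();
        System.out.println(s.sanitize("customerName")); // t1
        System.out.println(s.sanitize("balance"));      // t2
        System.out.println(s.sanitize("customerName")); // t1 again
    }
}
```

The same token always maps to the same placeholder, so the rewritten data
preserves the feature structure while hiding the names.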

If you can see fit to expose the data in any form, it would help us help
you.

On Tue, Jan 24, 2012 at 3:50 PM, Stuart Smith <[email protected]> wrote:

> Actually, I looked over my feature names, and I started with about 2K of
> them, not 500... and the names would need to be sanitized before I released
> them... so...
>
>
> I did add a setPercentCorrect() to the CrossFold class, and reset the
> member to zero before I did the last training run... it then flipped the
> other way. It said percentCorrect() was about 9%, instead of the 50% I
> actually got.
>
> I guess...
>   - be sure to validate on your own test set; even your own training data
> might say something useful.
>
>   - it might be nice to add a resetStats() method or something.
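A hypothetical resetStats() of the kind suggested above might look like this
(Mahout's CrossFoldLearner has no such method; this only illustrates the idea
of zeroing the running accuracy before a fresh training pass):

```java
// Sketch of a running accuracy statistic with the suggested reset.
// All names here are hypothetical, not Mahout API.
public class RunningStats {
    private double percentCorrect;
    private long recordCount;

    // Incremental mean of 1.0 (correct) / 0.0 (wrong) observations.
    public void record(boolean correct) {
        recordCount++;
        percentCorrect += ((correct ? 1.0 : 0.0) - percentCorrect) / recordCount;
    }

    public double percentCorrect() {
        return percentCorrect;
    }

    // The suggested reset: forget everything seen so far, so a new
    // training run starts its statistics from scratch.
    public void resetStats() {
        percentCorrect = 0.0;
        recordCount = 0;
    }
}
```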
>
> It's working pretty well now!
>
>
> Thanks for the help!
>
> Take care,
>   -stu
>
>
>
> ________________________________
>  From: Stuart Smith <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Monday, January 23, 2012 7:18 PM
> Subject: Re: SGD: mismatch in percentCorrect vs classify() on training
> data?
>
> Gotta run, but will do tomorrow.
>
> I actually took my feature count down from ~500 to 10, and started
> getting much better results :)
> Even with a 10% hold-out set (held out from any training whatsoever).
>
> So it's looking better, but that stat is still just odd... (even now)...
>
> Thanks!
>
>
> Take care,
>   -stu
>
>
>
> ________________________________
> From: Ted Dunning <[email protected]>
> To: [email protected]; Stuart Smith <[email protected]>
> Cc: Mahout List <[email protected]>
> Sent: Monday, January 23, 2012 5:52 PM
> Subject: Re: SGD: mismatch in percentCorrect vs classify() on training
> data?
>
> Hmm... I am surprised as well.
>
> As I remember, percentCorrect *is* a weighted moving average, so I would
> expect some discrepancy, but not this much.
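To see how a weighted moving average of "correct" can drift from exact
accuracy, here is a minimal sketch (the decay constant and class are
assumptions for illustration, not Mahout's actual implementation):

```java
// Sketch: an exponentially weighted moving average of correctness.
// Recent observations dominate, so the value can differ sharply from
// the exact fraction correct over all examples.
public class EwmaAccuracy {
    private double percentCorrect = 0.0;
    private final double alpha;  // weight of the newest observation (assumed)

    public EwmaAccuracy(double alpha) {
        this.alpha = alpha;
    }

    public void record(boolean correct) {
        percentCorrect += alpha * ((correct ? 1.0 : 0.0) - percentCorrect);
    }

    public double percentCorrect() {
        return percentCorrect;
    }

    public static void main(String[] args) {
        EwmaAccuracy ewma = new EwmaAccuracy(0.1);
        int right = 0, total = 0;
        // 50 wrong predictions followed by 50 right ones:
        for (int i = 0; i < 100; i++) {
            boolean correct = i >= 50;
            ewma.record(correct);
            if (correct) right++;
            total++;
        }
        double exact = (double) right / total;  // 0.5
        // The moving average is dominated by the recent correct streak,
        // so it lands close to 1.0 while the exact accuracy is 0.5.
        System.out.printf("exact=%.2f ewma=%.2f%n", exact, ewma.percentCorrect());
    }
}
```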
>
> Can you post your training/test data somewhere?  It would be good to test
> in synchrony.
>
> On Mon, Jan 23, 2012 at 3:37 PM, Stuart Smith <[email protected]> wrote:
>
> > Actually, to be clear, I looked through the CrossFoldLearner code, and
> > understand how it gets calculated... but I'm surprised that the
> > discrepancy is so large.
> >
> > Take care,
> >   -stu
> >
> >
> >
> > ________________________________
> >  From: Stuart Smith <[email protected]>
> > To: Mahout List <[email protected]>
> > Sent: Monday, January 23, 2012 2:54 PM
> > Subject: SGD: mismatch in percentCorrect vs classify() on training data?
> >
> > Hello,
> >
> >   I just started experimenting with the SGD/Logistic Regression
> > classifier. Right now I believe I have too little training data for the
> > number of dimensions (~1800 vectors, roughly evenly split between two
> > classes, ~500 dimensions).
> >
> > However, I'm just trying to understand how to measure the efficacy of the
> > classifier.
> >
> > I trained a classifier like so:
> >
> > - I have two categories, "good" and "bad"
> >
> >
> > - ran AdaptiveLogisticRegression() over the training data 10 times (in
> > the same order)
> >
> > - get percentCorrect and AUC of the best classifier
> >
> >
> > - Took .getBest().getPayload().getLearner() and trained that over all the
> > training data again
> >    (on the theory that ALR was only showing it a small slice of the data
> > it had; it seemed to help).
> >
> > - get percentCorrect() of the classifier.
> >
> > - run classify() on the good/bad vectors of the training set, counting
> > FP/TP in each case.
> >
> > What I'm having trouble with is understanding a discrepancy between the
> > results of the last two steps.
> >
> > M = number of training examples
> > .percentCorrect() returns ~90%
> > however, (TP_Good + TP_Bad) / M ~ 50%
> > Interestingly enough, (TP_Good + FP_Bad) / M ~ 90%
> >
> >
> > So I'm kind of confused about what .percentCorrect means... how is this
> > counted?
> >
> > Below is a code snippet where I do the final training & counting, just in
> > case I made some bonehead mistake:
> >
> >             /** training best on all data... **/
> >             System.out.println( "Training best on all data..");
> >             ARFFVectorIterable retrainGood =
> >                 new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
> >             Iterator<Vector> retrainGoodIter = retrainGood.iterator();
> >             while (retrainGoodIter.hasNext()) {
> >                 bestClassifier.train( goodLabel, retrainGoodIter.next() );
> >             }
> >
> >             ARFFVectorIterable retrainBad =
> >                 new ARFFVectorIterable(badArff, new MapBackedARFFModel());
> >             Iterator<Vector> retrainBadIter = retrainBad.iterator();
> >             while (retrainBadIter.hasNext()) {
> >                 bestClassifier.train( badLabel, retrainBadIter.next() );
> >             }
> >             System.out.println("Best learner percent correct on all data: "
> >                 + bestClassifier.percentCorrect());
> >
> >             ARFFVectorIterable fpVectors =
> >                 new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
> >             Iterator<Vector> fpIterator = fpVectors.iterator();
> >             int goodFpCount = 0;
> >             int goodTpCount = 0;
> >             int testCount = 0;
> >             while (fpIterator.hasNext()) {
> >                 Vector goodVector = fpIterator.next();
> >                 double probabilityGood =
> >                     1.0 - bestClassifier.classify(goodVector).get(badLabel);
> >                 testCount++;
> >                 if( probabilityGood > 0.0 ) {
> >                     if( probabilityGood <= 1.0 ) {
> >                         System.out.print( probabilityGood + "," );
> >                     }
> >                     goodTpCount++;
> >                 }
> >                 else {
> >                     goodFpCount++;
> >                 }
> >             }
> >             System.out.println();
> >             System.out.println( "FP count: " + goodFpCount );
> >             System.out.println( "TP of good files: " + goodTpCount );
> >
> >             ARFFVectorIterable tpVectors =
> >                 new ARFFVectorIterable(badArff, new MapBackedARFFModel());
> >             Iterator<Vector> tpIterator = tpVectors.iterator();
> >             int badTpCount = 0;
> >             int badFpCount = 0;
> >             while (tpIterator.hasNext()) {
> >                 Vector badVector = tpIterator.next();
> >                 double probabilityBad =
> >                     bestClassifier.classify(badVector).get(badLabel);
> >                 testCount++;
> >                 if( probabilityBad > 0.0 ) {
> >                     if( probabilityBad <= 1.0 ) {
> >                         System.out.print( probabilityBad + "," );
> >                     }
> >                     badTpCount++;
> >                 }
> >                 else {
> >                     badFpCount++;
> >                 }
> >             }
> >             System.out.println();
> >             System.out.println( "TP count: " + badTpCount );
> >             System.out.println( "FP on bad clusters: " + badFpCount);
> >             System.out.println( "Test count: " + testCount );
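A class probability is almost always greater than 0.0, so a test like
`probabilityGood > 0.0` will count nearly every example as a positive; the
usual decision rule is a cutoff such as 0.5. A minimal sketch of counting
against an explicit threshold (the 0.5 cutoff and all names here are
assumptions, not the original code):

```java
// Sketch: count true positives / false negatives for the "bad" class
// against an explicit decision threshold, rather than "probability > 0.0"
// (which is true for essentially any probability).
public class ThresholdCount {
    static final double THRESHOLD = 0.5;  // assumed cutoff

    // Predict "bad" only when the classifier is more than 50% sure.
    static boolean predictBad(double probabilityBad) {
        return probabilityBad > THRESHOLD;
    }

    public static void main(String[] args) {
        // Hypothetical classifier outputs for examples whose true label is "bad".
        double[] probabilityBad = {0.9, 0.6, 0.2};
        int tp = 0, fn = 0;
        for (double p : probabilityBad) {
            if (predictBad(p)) tp++; else fn++;
        }
        System.out.println("TP=" + tp + " FN=" + fn); // TP=2 FN=1
    }
}
```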
> >
> >
> > Any help is appreciated!
> >
> >
> > Take care,
> >   -stu
> >
>
