Playing with classifying some tweets with LR/SGD is yielding accuracy in the 60s for me as well.
I'm running from the command line with "mahout runlogistic" and a few hundred training samples. I'm continuing to play with the tuning params.

JP

On Fri, Dec 30, 2011 at 4:56 AM, Lance Norskog <[email protected]> wrote:
> examples/bin/classify-20newsgroups.sh:
>
> Naive Bayes, N-grams = 1:
> 6 minutes
> 79.9% correct
>
> Naive Bayes, N-grams = 2:
> 20 minutes
> 81.3% correct
>
> SGD with leaktype 6 (3 and 6 do the same)
> 12 minutes
> 62.3% peak, then drops to 61%
> SGD leaves a series of models after various numbers of iterations,
> showing its progression until it stops improving:
>
> /tmp/news-group-1000.model
> Correctly Classified Instances : 2859 37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-1200.model
> Correctly Classified Instances : 2859 37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-1400.model
> Correctly Classified Instances : 2859 37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-1500.model
> Correctly Classified Instances : 2859 37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-2000.model
> Correctly Classified Instances : 3351 44.4902%
> Avg. Log-likelihood: NaN 25%-ile: NaN 75%-ile: NaN
> /tmp/news-group-2500.model
> Correctly Classified Instances : 3940 52.3101%
> Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
> 75%-ile: -0.6282543277378467
> /tmp/news-group-3000.model
> Correctly Classified Instances : 3940 52.3101%
> Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
> 75%-ile: -0.6282543277378467
> /tmp/news-group-4000.model
> Correctly Classified Instances : 3809 50.5709%
> Avg. Log-likelihood: -3.927160619166431 25%-ile: -5.511528690450845
> 75%-ile: -0.7226749027343783
> /tmp/news-group-5000.model
> Correctly Classified Instances : 4386 58.2315%
> Avg. Log-likelihood: -3.153884339533505 25%-ile: -4.301429974183646
> 75%-ile: -0.24757357759053825
> /tmp/news-group-6000.model
> Correctly Classified Instances : 4507 59.838%
> Avg. Log-likelihood: -3.112089198948625 25%-ile: -4.141184371965078
> 75%-ile: -0.18253005926770405
> /tmp/news-group-7000.model
> Correctly Classified Instances : 4569 60.6612%
> Avg. Log-likelihood: -3.02017716448018 25%-ile: -3.921831347572432
> 75%-ile: -0.19148778067035277
> /tmp/news-group-8000.model
> Correctly Classified Instances : 4698 62.3739%
> Avg. Log-likelihood: -2.9454041622918785 25%-ile: -3.7975533569786766
> 75%-ile: -0.14104508309186575
> /tmp/news-group-10000.model
> Correctly Classified Instances : 4634 61.5242%
> Avg. Log-likelihood: -3.161176354750601 25%-ile: -4.281455155523565
> 75%-ile: -0.16246336765931288
>
> This script prints the above sequence:
> for f in /tmp/news-group-????.model /tmp/news-group-?????.model
> do
>   echo $f
>   mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input \
>     /tmp/mahout-work-lancenorskog/20news-bydate/20news-bydate-test/ \
>     --model $f 2>/dev/null | egrep "(Correctly|Log)"
> done
>
> On Wed, Dec 21, 2011 at 10:58 PM, Ted Dunning <[email protected]> wrote:
>> On Wed, Dec 21, 2011 at 10:46 PM, Sreejith S <[email protected]> wrote:
>>
>>> On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <[email protected]> wrote:
>>>
>>> > The Bayes in the examples doesn't work very well in the 20 newsgroups
>>> > example. Something is wrong in the data ETL, the tuning options, or
>>> > the Bayes implementation.
>>> >
>>> > On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <[email protected]>
>>> > wrote:
>>> > > 97% is not correct. This sounds like you ran it on the training data.
>>> >
>>>
>>> @Ted, yes, I ran it on the same training data.
>>>
>>
>> That isn't a valid test.
>>
>>>
>>> > >
>>> > > 63% also sounds low. I don't know what happened there.
>>> >
>>>
>>> Has anyone tested the same 20newsgroups with SGD and got better results?
>>>
>>
>> I remember getting mid 80's. I think that some accuracy testing is in
>> order, however, since I have seen hints that the auto-tuning is clamping
>> down too soon.
>>
>> Also, vowpal wabbit has had excellent results using one round of SGD and
>> additional rounds of L-BFGS. That might make a very powerful version of
>> SGD that doesn't need as much of the tuning as we currently have.
>
> --
> Lance Norskog
> [email protected]

--
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
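
P.S. In case it helps anyone comparing checkpoints: the listing in Lance's message peaks at the 8000 model and then drops, so it can be handy to pick the best model automatically instead of reading the numbers by eye. Below is a minimal Python sketch that does that, assuming the output looks like what his loop prints (a model path line followed by a "Correctly Classified Instances" line). The function name best_checkpoint and the SAMPLE text are my own for illustration and are not part of Mahout; the sample values are copied from the thread.

```python
import re

# A few lines of the output quoted in the thread (values copied as-is).
SAMPLE = """\
/tmp/news-group-7000.model
Correctly Classified Instances : 4569 60.6612%
/tmp/news-group-8000.model
Correctly Classified Instances : 4698 62.3739%
/tmp/news-group-10000.model
Correctly Classified Instances : 4634 61.5242%
"""

# Matches the accuracy line; whitespace is kept flexible because the
# exact column spacing of the real tool output is an assumption here.
ACCURACY_RE = re.compile(
    r"Correctly Classified Instances\s*:\s*\d+\s+([\d.]+)%"
)


def best_checkpoint(text):
    """Return (model_path, accuracy_percent) for the most accurate model,
    or None if no accuracy lines were found."""
    results = []
    current_model = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(".model"):
            # The loop echoes the model path before each evaluation.
            current_model = line
            continue
        match = ACCURACY_RE.search(line)
        if match and current_model is not None:
            results.append((current_model, float(match.group(1))))
    if not results:
        return None
    return max(results, key=lambda item: item[1])


if __name__ == "__main__":
    print(best_checkpoint(SAMPLE))
```

To use it on real output, redirect the loop's output to a file and pass the file contents to best_checkpoint. Note this only ranks checkpoints by accuracy on whatever data the loop evaluated; per Ted's point, that should be the held-out test set, not the training data.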
