examples/bin/classify-20newsgroups.sh:
Naive Bayes, N-grams = 1:
6 minutes
79.9% correct
Naive Bayes, N-grams = 2:
20 minutes
81.3% correct
SGD with leaktype 6 (leaktypes 3 and 6 behave the same):
12 minutes
62.4% peak (8000 model), then drops to 61.5% (10000 model)
SGD leaves a series of models after various numbers of iterations,
showing its progression until it stops improving:
/tmp/news-group-1000.model
Correctly Classified Instances : 2859 37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-1200.model
Correctly Classified Instances : 2859 37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-1400.model
Correctly Classified Instances : 2859 37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-1500.model
Correctly Classified Instances : 2859 37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-2000.model
Correctly Classified Instances : 3351 44.4902%
Avg. Log-likelihood: NaN 25%-ile: NaN 75%-ile: NaN
/tmp/news-group-2500.model
Correctly Classified Instances : 3940 52.3101%
Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
75%-ile: -0.6282543277378467
/tmp/news-group-3000.model
Correctly Classified Instances : 3940 52.3101%
Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
75%-ile: -0.6282543277378467
/tmp/news-group-4000.model
Correctly Classified Instances : 3809 50.5709%
Avg. Log-likelihood: -3.927160619166431 25%-ile: -5.511528690450845
75%-ile: -0.7226749027343783
/tmp/news-group-5000.model
Correctly Classified Instances : 4386 58.2315%
Avg. Log-likelihood: -3.153884339533505 25%-ile: -4.301429974183646
75%-ile: -0.24757357759053825
/tmp/news-group-6000.model
Correctly Classified Instances : 4507 59.838%
Avg. Log-likelihood: -3.112089198948625 25%-ile: -4.141184371965078
75%-ile: -0.18253005926770405
/tmp/news-group-7000.model
Correctly Classified Instances : 4569 60.6612%
Avg. Log-likelihood: -3.02017716448018 25%-ile: -3.921831347572432
75%-ile: -0.19148778067035277
/tmp/news-group-8000.model
Correctly Classified Instances : 4698 62.3739%
Avg. Log-likelihood: -2.9454041622918785 25%-ile: -3.7975533569786766
75%-ile: -0.14104508309186575
/tmp/news-group-10000.model
Correctly Classified Instances : 4634 61.5242%
Avg. Log-likelihood: -3.161176354750601 25%-ile: -4.281455155523565
75%-ile: -0.16246336765931288
This script prints the above sequence:
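# Visit the 4-digit checkpoints first, then the 5-digit ones, so they print in order.
for f in /tmp/news-group-????.model /tmp/news-group-?????.model
do
  echo "$f"
  # Score this checkpoint on the held-out test set; keep only the summary lines.
  mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input \
    /tmp/mahout-work-lancenorskog/20news-bydate/20news-bydate-test/ \
    --model "$f" 2>/dev/null | grep -E "(Correctly|Log)"
done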
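To pick out the best checkpoint without reading the list by eye, the
loop's output could be piped through a small filter. This is only a
sketch: best_model is a made-up helper name, not part of Mahout, and it
assumes each model path is echoed on its own line before its
"Correctly Classified Instances" line, with the percentage as the last
field on that line.

```shell
# Hypothetical helper (not part of Mahout): read the loop's output on stdin
# and print the model file with the highest accuracy percentage.
best_model() {
  awk '
    /\.model$/ { model = $0 }              # remember the most recent model path
    /Correctly Classified Instances/ {
      pct = $NF                            # last field, e.g. "62.3739%"
      sub(/%$/, "", pct)                   # strip the trailing percent sign
      if (pct + 0 > best + 0) {            # numeric comparison; best starts at 0
        best = pct
        name = model
      }
    }
    END { if (name != "") printf "%s %s%%\n", name, best }
  '
}
```

With the function defined in the current shell, append `| best_model`
after the loop's closing `done`. On the numbers listed above it should
report the 8000 model.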
On Wed, Dec 21, 2011 at 10:58 PM, Ted Dunning <[email protected]> wrote:
> On Wed, Dec 21, 2011 at 10:46 PM, Sreejith S <[email protected]> wrote:
>
>> On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <[email protected]> wrote:
>>
>> > The Bayes in the examples doesn't work very well in the 20 newsgroups
>> > example. Something is wrong in the data ETL, the tuning options, or
>> > the Bayes implementation.
>> >
>> > On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <[email protected]>
>> > wrote:
>> > > 97% is not correct. This sounds like you ran it on the training data.
>> >
>>
>> @Ted, yes, I ran it on the same training data.
>>
>
> That isn't a valid test.
>
>
>>
>> > >
>> > > 63% also sounds low. I don't know what happened there.
>> >
>>
>> Has anyone tested the same 20 newsgroups data with SGD and gotten better results?
>>
>
> I remember getting mid 80's. I think that some accuracy testing is in
> order, however, since I have seen hints that the auto-tuning is clamping
> down too soon.
>
> Also, vowpal wabbit has had excellent results using one round of SGD and
> additional rounds of L-BFGS. That might make a very powerful version of
> SGD that doesn't need as much of the tuning as we currently have.
--
Lance Norskog
[email protected]