Re: Mahout SGD / Bayes prediction results over 20newsgroups

Josh Patterson Fri, 30 Dec 2011 10:00:45 -0800

Playing with classifying some tweets with LR/SGD is yielding accuracy
in the 60% range for me as well.


I'm running it from the command line with "mahout runlogistic", on a
few hundred training samples.

I'm continuing to play with the tuning params.
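
When comparing checkpoints like the ones in Lance's listing below, it
can help to reduce the per-model output to one line per model and sort
by accuracy. This is only a rough sketch: summarize_models is a made-up
helper name (not part of Mahout), and it assumes stdin looks exactly
like the output of the quoted loop, i.e. a model path on its own line
followed by a "Correctly Classified Instances" line that ends in the
percentage.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): reads the loop output on stdin
# and prints "<percent> <model path>" per model, best accuracy first.
summarize_models() {
    awk '
        # A line ending in ".model" is the path echoed before each run.
        /\.model[ \t]*$/ { model = $1; next }
        # The accuracy line ends with a field such as "62.3739%".
        /Correctly Classified Instances/ {
            pct = $NF
            sub(/%$/, "", pct)    # drop the trailing percent sign
            print pct, model
        }
    ' | sort -k1,1nr              # numeric sort on the percentage, descending
}

# Assumed usage: pipe the quoted for-loop into the helper, e.g.
#   for f in /tmp/news-group-*.model; do ...; done | summarize_models | head -n 1
```

The Log-likelihood lines are ignored here; ranking on accuracy alone is
just the simplest choice for spotting where the peak is.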

JP

On Fri, Dec 30, 2011 at 4:56 AM, Lance Norskog <[email protected]> wrote:
> examples/bin/classify-20newsgroups.sh:
>
> Naive Bayes, N-grams = 1:
> 6 minutes
> 79.9% correct
>
> Naive Bayes, N-grams = 2:
> 20 minutes
> 81.3% correct
>
> SGD with leaktype 6 (3 and 6 do the same)
> 12 minutes
> 62.3% peak, then drops to 61%
> SGD leaves a series of models after various numbers of iterations,
> showing its progression until it stops improving:
>
> /tmp/news-group-1000.model
> Correctly Classified Instances          :       2859        37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-1200.model
> Correctly Classified Instances          :       2859        37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-1400.model
> Correctly Classified Instances          :       2859        37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-1500.model
> Correctly Classified Instances          :       2859        37.958%
> Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
> 75%-ile: -2.5782489881172115
> /tmp/news-group-2000.model
> Correctly Classified Instances          :       3351       44.4902%
> Avg. Log-likelihood: NaN 25%-ile: NaN 75%-ile: NaN
> /tmp/news-group-2500.model
> Correctly Classified Instances          :       3940       52.3101%
> Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
> 75%-ile: -0.6282543277378467
> /tmp/news-group-3000.model
> Correctly Classified Instances          :       3940       52.3101%
> Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
> 75%-ile: -0.6282543277378467
> /tmp/news-group-4000.model
> Correctly Classified Instances          :       3809       50.5709%
> Avg. Log-likelihood: -3.927160619166431 25%-ile: -5.511528690450845
> 75%-ile: -0.7226749027343783
> /tmp/news-group-5000.model
> Correctly Classified Instances          :       4386       58.2315%
> Avg. Log-likelihood: -3.153884339533505 25%-ile: -4.301429974183646
> 75%-ile: -0.24757357759053825
> /tmp/news-group-6000.model
> Correctly Classified Instances          :       4507        59.838%
> Avg. Log-likelihood: -3.112089198948625 25%-ile: -4.141184371965078
> 75%-ile: -0.18253005926770405
> /tmp/news-group-7000.model
> Correctly Classified Instances          :       4569       60.6612%
> Avg. Log-likelihood: -3.02017716448018 25%-ile: -3.921831347572432
> 75%-ile: -0.19148778067035277
> /tmp/news-group-8000.model
> Correctly Classified Instances          :       4698       62.3739%
> Avg. Log-likelihood: -2.9454041622918785 25%-ile: -3.7975533569786766
> 75%-ile: -0.14104508309186575
> /tmp/news-group-10000.model
> Correctly Classified Instances          :       4634       61.5242%
> Avg. Log-likelihood: -3.161176354750601 25%-ile: -4.281455155523565
> 75%-ile: -0.16246336765931288
>
> This script prints the above sequence:
> for f in /tmp/news-group-????.model /tmp/news-group-?????.model
> do
>     echo "$f"
>     mahout org.apache.mahout.classifier.sgd.TestNewsGroups \
>         --input /tmp/mahout-work-lancenorskog/20news-bydate/20news-bydate-test/ \
>         --model "$f" 2>/dev/null | egrep "(Correctly|Log)"
> done
>
> On Wed, Dec 21, 2011 at 10:58 PM, Ted Dunning <[email protected]> wrote:
>> On Wed, Dec 21, 2011 at 10:46 PM, Sreejith S <[email protected]> wrote:
>>
>>> On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <[email protected]> wrote:
>>>
>>> > The Bayes classifier in the examples doesn't work very well on the
>>> > 20 newsgroups example. Something is wrong in the data ETL, the
>>> > tuning options, or the Bayes implementation.
>>> >
>>> > On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <[email protected]>
>>> > wrote:
>>> > > 97% is not correct.  This sounds like you ran it on the training data.
>>> >
>>>
>>> @Ted, yes, I ran it on the same training data.
>>>
>>
>> That isn't a valid test.
>>
>>
>>>
>>> > >
>>> > > 63% also sounds low.  I don't know what happened there.
>>> >
>>>
>>> Has anyone tested the same 20newsgroups data with SGD and gotten
>>> better results?
>>>
>>
>> I remember getting mid 80's.  I think that some accuracy testing is in
>> order, however, since I have seen hints that the auto-tuning is clamping
>> down too soon.
>>
>> Also, vowpal wabbit has had excellent results using one round of SGD and
>> additional rounds of L-BFGS.  That might make a very powerful version of
>> SGD that doesn't need as much of the tuning as we currently have.
>
>
>
> --
> Lance Norskog
> [email protected]



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
