Hi Ted,

Thanks for the reply. I will wait for the JIRA and hope it helps me get rid of any encoding issue.

Thanks,
Rajesh

On Oct 31, 2012 5:24 AM, "Ted Dunning" <[email protected]> wrote:

> OK. I am back up for air.
>
> Rajesh,
>
> As I am sure you know, most folks here contribute on their own time. I
> have been busy with my day job and unable to help with this until just now.
>
> I just wrote a test case that looks at the Iris data set. The results are
> categorically different from yours.
>
> That substantiates my original feeling that your encoding of the data is
> problematic. I will file a JIRA and attach a test case that you can look
> at. Then we can see what the differences are.
>
> On Tue, Oct 23, 2012 at 1:28 AM, Rajesh Nikam <[email protected]> wrote:
>
> > Hi,
> >
> > Is there development happening on fixing the issue with SGD that generates
> > models which are only as good as random prediction?
> >
> > I am not sure why such an issue has not been noticed and raised by others.
> > Maybe this specific algorithm is not used in practical applications.
> >
> > Thanks,
> > Rajesh
> >
> > >> On Tue, Oct 16, 2012 at 10:23 PM, Ted Dunning <[email protected]> wrote:
> > >>
> > >>> Rajesh,
> > >>>
> > >>> In the testing that I did, I ran 100, 1000 and 10,000 passes through the
> > >>> data. All produced identical results. Thus it isn't an issue of SGD
> > >>> converging.
> > >>>
> > >>> I also did a parameter scan of lambda and saw no effect.
> > >>>
> > >>> I also did the standard thing in R with glm and got the expected (correct)
> > >>> results.
> > >>>
> > >>> I haven't looked yet in detail, but I really suspect that the reading of
> > >>> the data is horked. This is exactly how that behaves.
> > >>>
> > >>> On Tue, Oct 16, 2012 at 4:49 AM, Rajesh Nikam <[email protected]> wrote:
> > >>>
> > >>> > Hi Ted,
> > >>> >
> > >>> > I was thinking this might be due to having only 100 instances for
> > >>> > training.
> > >>> >
> > >>> > So I have created a test set with two classes having ~49K instances, and
> > >>> > included all features as predictors.
> > >>> > PFA sgd.grps.zip with the test file.
> > >>> >
> > >>> > mahout trainlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
> > >>> > --output /usr/local/mahout/trainme/sgd-grps.model --target class
> > >>> > --categories 2 --features 128 --types n --predictors
> > >>> > a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16
> > >>> > a17 a18 a19 a20 a21 a22 a23 a24 a25 a26 a27 a28 a29 a30 a31 a32
> > >>> > a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45 a46 a47 a48
> > >>> > a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59 a60 a61 a62 a63 a64
> > >>> > a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79 a80
> > >>> > a81 a82 a83 a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96
> > >>> > a97 a98 a99 a100 a101 a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112
> > >>> > a113 a114 a115 a116 a117 a118 a119 a120 a121 a122 a123 a124 a125 a126 a127
> > >>> >
> > >>> > mahout runlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
> > >>> > --model /usr/local/mahout/trainme/sgd-grps.model --auc --confusion
> > >>> >
> > >>> > Still the results are similar; it classifies everything as class_1.
> > >>> >
> > >>> > AUC = 0.50
> > >>> > confusion: [[26563.0, 23006.0], [0.0, 0.0]]
> > >>> > entropy: [[-0.0, -0.0], [-46.1, -21.4]]
> > >>> >
> > >>> > I am not sure why this is failing all the time.
> > >>> >
> > >>> > Looking forward to your reply.
> > >>> >
> > >>> > Thanks
> > >>> > Rajesh
> > >>> >
> > >>> > On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning <[email protected]> wrote:
> > >>> >
> > >>> > > I would love to help and will before long. Just can't do it in the
> > >>> > > first part of this week.
> > >>> > >
> > >>> > > On Mon, Oct 15, 2012 at 6:28 AM, Rajesh Nikam <[email protected]> wrote:
> > >>> > >
> > >>> > > > Hello,
> > >>> > > >
> > >>> > > > I have asked the question below, about an issue with using SGD, on
> > >>> > > > the Mahout forum.
> > >>> > > >
> > >>> > > > A similar issue with SGD is reported at
> > >>> > > >
> > >>> > > > http://stackoverflow.com/questions/11221436/using-sgd-classifier-in-mahout
> > >>> > > >
> > >>> > > > Even the link below has similar output:
> > >>> > > >
> > >>> > > > AUC = 0.57
> > >>> > > > confusion: [[27.0, 13.0], [0.0, 0.0]]
> > >>> > > > entropy: [[-0.4, -0.3], [-1.2, -0.7]]
> > >>> > > >
> > >>> > > > http://sujitpal.blogspot.in/2012/09/learning-mahout-classification.html
> > >>> > > >
> > >>> > > > I am still wondering how this model then works and is used by many.
> > >>> > > > I am not able to get any pointers on how to use SGD so that it
> > >>> > > > generates an effective model.
> > >>> > > >
> > >>> > > > Could someone point out what is missing in the input file or the
> > >>> > > > provided parameters?
> > >>> > > >
> > >>> > > > I appreciate your help.
> > >>> > > >
> > >>> > > > Below is a description of the steps that I followed.
> > >>> > > >
> > >>> > > > PFA the input files used for the experiment.
> > >>> > > >
> > >>> > > > I am using the Iris Plants Database from Michael Marshall. PFA
> > >>> > > > iris.arff.
> > >>> > > > Converted this to a csv file just by updating the header:
> > >>> > > > iris-3-classes.csv
> > >>> > > >
> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
> > >>> > > > /usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output
> > >>> > > > /usr/local/mahout/trunk/iris-3-classes.model --target class
> > >>> > > > --categories 3 --predictors sepallength sepalwidth petallength
> > >>> > > > petalwidth --types n
> > >>> > > >
> > >>> > > > >> it gave the following error.
> > >>> > > > Exception in thread "main" java.lang.IllegalArgumentException: Can
> > >>> > > > only call classifyScalar with two categories
> > >>> > > >
> > >>> > > > Now created a csv with only 2 classes. PFA iris-2-classes.csv
> > >>> > > >
> > >>> > > > >> trained iris-2-classes.csv with sgd
> > >>> > > >
> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.model --target class
> > >>> > > > --categories 2 --predictors sepallength sepalwidth petallength
> > >>> > > > petalwidth --types n
> > >>> > > >
> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
> > >>> > > >
> > >>> > > > AUC = 0.14
> > >>> > > > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > >>> > > > entropy: [[-0.6, -0.3], [-0.8, -0.4]]
> > >>> > > >
> > >>> > > > >> AUC seems too poor. Now changed --predictors
> > >>> > > >
> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.model --target class
> > >>> > > > --categories 2 --predictors sepalwidth petallength --types n
> > >>> > > >
> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
> > >>> > > > --scores
> > >>> > > >
> > >>> > > > AUC = 0.80
> > >>> > > > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > >>> > > > entropy: [[-0.7, -0.3], [-0.7, -0.4]]
> > >>> > > >
> > >>> > > > This model classifies everything as category 1, which is of no use.
> > >>> > > >
> > >>> > > > Thanks
> > >>> > > > Rajesh
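
The diagnosis offered in this thread is that the reading or encoding of the data is the likely problem, not SGD convergence. A quick check of the input file before running trainlogistic can be sketched as follows. This is a minimal sketch in Python, assuming a comma-separated file with a single header row. The helper name `check_csv`, the sample rows and the class labels in them are hypothetical illustrations, not part of Mahout, and the check does not reproduce Mahout's own CSV parsing.

```python
import csv
import io
from collections import Counter


def check_csv(text, target, predictors):
    """Return (problems, class_counts) for CSV text with a header row.

    Hypothetical helper, not part of Mahout: it only inspects the text the
    way a generic CSV reader would.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = []

    # Every name given as the target or as a predictor should be a column.
    for name in [target] + list(predictors):
        if name not in header:
            problems.append(f"missing column: {name}")
    if problems:
        return problems, Counter()

    counts = Counter()
    for row_number, row in enumerate(reader, start=2):  # line 1 is the header
        counts[row[target]] += 1
        # Predictors declared as numeric should parse as floats.
        for name in predictors:
            try:
                float(row[name])
            except (TypeError, ValueError):
                problems.append(
                    f"line {row_number}: {name}={row[name]!r} is not numeric"
                )
    return problems, counts


# Made-up sample rows; the column names follow the ones used in the thread,
# the class labels are an assumption about what the file contains.
sample = (
    "sepallength,sepalwidth,petallength,petalwidth,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
problems, counts = check_csv(
    sample, "class", ["sepallength", "sepalwidth", "petallength", "petalwidth"]
)
print(problems)
print(dict(counts))
```

An empty problem list and exactly two class values with plausible counts would suggest the file itself is well formed, which narrows the search to how the tool encodes it.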
