Thanks, Ted, for providing the test case; it helped me look into the details of the problem I am facing.
I figured out how to run the test case using Maven:

mvn test -Dtest="org.apache.mahout.classifier.sgd.OnlineLogisticRegressionTest"

However, I could not see the printf output on the console, so I saved the output to a file. I will now look at the results and report back if there is any issue.

Thanks,
Rajesh

On Thu, Nov 1, 2012 at 1:05 PM, Rajesh Nikam <[email protected]> wrote:

> Hi Mat,
>
> Thanks for pointing out the JIRA link for this particular case.
>
> Could you help with one more thing?
>
> I have not used Maven for building and running Java classes. I am looking at
> http://maven.apache.org/guides/getting-started/index.html
>
> Could you please point out how to build and run a specific class such as
> OnlineLogisticRegressionTest.java from Mahout?
>
> Thanks
> Rajesh
>
> On Wed, Oct 31, 2012 at 8:15 PM, Mat Kelcey <[email protected]> wrote:
>
>> Rajesh, Ted has added the test case code already
>> https://issues.apache.org/jira/browse/MAHOUT-1107
>>
>> On 31 October 2012 05:14, Rajesh Nikam <[email protected]> wrote:
>>
>> > Hi Ted,
>> >
>> > Please update once the JIRA and test case are uploaded.
>> >
>> > Looking forward to your reply.
>> >
>> > Thanks
>> > Rajesh
>> >
>> > On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam <[email protected]> wrote:
>> >
>> > > Hi Ted,
>> > >
>> > > Thanks for the reply. I will wait for the JIRA and hope to get rid of any encoding issue.
>> > >
>> > > Thanks,
>> > > Rajesh
>> > > On Oct 31, 2012 5:24 AM, "Ted Dunning" <[email protected]> wrote:
>> > >
>> > >> OK. I am back up for air.
>> > >>
>> > >> Rajesh,
>> > >>
>> > >> As I am sure you know, most folks here contribute on their own time. I have been busy with my day job and unable to help with this until just now.
>> > >>
>> > >> I just wrote a test case that looks at the Iris data set. The results are categorically different from yours.
>> > >>
>> > >> That substantiates my original feeling that your encoding of the data is problematic.
>> > >> I will file a JIRA and attach a test case that you can look at. Then we can see what the differences are.
>> > >>
>> > >> On Tue, Oct 23, 2012 at 1:28 AM, Rajesh Nikam <[email protected]> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > Is there development happening on fixing the issue with SGD that generates models which are only as good as random prediction?
>> > >> >
>> > >> > I am not sure why such an issue has not been noticed and raised by others.
>> > >> > Maybe this specific algorithm is not used in practical applications.
>> > >> >
>> > >> > Thanks,
>> > >> > Rajesh
>> > >> >
>> > >> > >> On Tue, Oct 16, 2012 at 10:23 PM, Ted Dunning <[email protected]> wrote:
>> > >> > >>
>> > >> > >>> Rajesh,
>> > >> > >>>
>> > >> > >>> In the testing that I did, I ran 100, 1000 and 10,000 passes through the data. All produced identical results. Thus it isn't an issue of SGD converging.
>> > >> > >>>
>> > >> > >>> I also did a parameter scan of lambda and saw no effect.
>> > >> > >>>
>> > >> > >>> I also did the standard thing in R with glm and got the expected (correct) results.
>> > >> > >>>
>> > >> > >>> I haven't looked yet in detail, but I really suspect that the reading of the data is horked. This is exactly how that behaves.
>> > >> > >>>
>> > >> > >>> On Tue, Oct 16, 2012 at 4:49 AM, Rajesh Nikam <[email protected]> wrote:
>> > >> > >>>
>> > >> > >>> > Hi Ted,
>> > >> > >>> >
>> > >> > >>> > I was thinking this might be due to having only 100 instances for training.
>> > >> > >>> >
>> > >> > >>> > So I have created a test set with two classes having ~49K instances, and included all features as predictors.
>> > >> > >>> > PFA sgd.grps.zip with test file.
>> > >> > >>> >
>> > >> > >>> > mahout trainlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
>> > >> > >>> > --output /usr/local/mahout/trainme/sgd-grps.model --target class
>> > >> > >>> > --categories 2 --features 128 --types n --predictors
>> > >> > >>> > a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16
>> > >> > >>> > a17 a18 a19 a20 a21 a22 a23 a24 a25 a26 a27 a28 a29 a30 a31 a32
>> > >> > >>> > a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45 a46 a47 a48
>> > >> > >>> > a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59 a60 a61 a62 a63 a64
>> > >> > >>> > a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79 a80
>> > >> > >>> > a81 a82 a83 a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96
>> > >> > >>> > a97 a98 a99 a100 a101 a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112
>> > >> > >>> > a113 a114 a115 a116 a117 a118 a119 a120 a121 a122 a123 a124 a125 a126 a127
>> > >> > >>> >
>> > >> > >>> > mahout runlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
>> > >> > >>> > --model /usr/local/mahout/trainme/sgd-grps.model --auc --confusion
>> > >> > >>> >
>> > >> > >>> > Still the results are similar: it classifies everything as class_1.
>> > >> > >>> >
>> > >> > >>> > AUC = 0.50
>> > >> > >>> > confusion: [[*26563.0, 23006.0*], [0.0, 0.0]]
>> > >> > >>> > entropy: [[-0.0, -0.0], [-46.1, -21.4]]
>> > >> > >>> >
>> > >> > >>> > I am not sure why this is failing all the time.
>> > >> > >>> >
>> > >> > >>> > Looking forward to your reply.
>> > >> > >>> >
>> > >> > >>> > Thanks
>> > >> > >>> > Rajesh
>> > >> > >>> >
>> > >> > >>> > On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning <[email protected]> wrote:
>> > >> > >>> >
>> > >> > >>> > > I would love to help and will before long. Just can't do it in the first part of this week.
>> > >> > >>> > >
>> > >> > >>> > > On Mon, Oct 15, 2012 at 6:28 AM, Rajesh Nikam <[email protected]> wrote:
>> > >> > >>> > >
>> > >> > >>> > > > Hello,
>> > >> > >>> > > >
>> > >> > >>> > > > I asked the question below, about an issue with using SGD, on the Mahout forum.
>> > >> > >>> > > >
>> > >> > >>> > > > A similar issue with SGD is reported at
>> > >> > >>> > > > http://stackoverflow.com/questions/11221436/using-sgd-classifier-in-mahout
>> > >> > >>> > > >
>> > >> > >>> > > > The link below also shows similar output:
>> > >> > >>> > > >
>> > >> > >>> > > > AUC = 0.57
>> > >> > >>> > > > *confusion: [[27.0, 13.0], [0.0, 0.0]]*
>> > >> > >>> > > > entropy: [[-0.4, -0.3], [-1.2, -0.7]]
>> > >> > >>> > > >
>> > >> > >>> > > > http://sujitpal.blogspot.in/2012/09/learning-mahout-classification.html
>> > >> > >>> > > >
>> > >> > >>> > > > I am still wondering how this model works and is used by many.
>> > >> > >>> > > > I have not found any pointers on how to use SGD so that it generates an effective model.
>> > >> > >>> > > >
>> > >> > >>> > > > Could someone point out what is missing in the input file or the provided parameters?
>> > >> > >>> > > >
>> > >> > >>> > > > I appreciate your help.
>> > >> > >>> > > >
>> > >> > >>> > > > Below is a description of the steps that I followed.
>> > >> > >>> > > >
>> > >> > >>> > > > PFA the input files used for the experiment.
>> > >> > >>> > > >
>> > >> > >>> > > > I am using the Iris Plants Database from Michael Marshall. PFA iris.arff.
>> > >> > >>> > > > I converted this to a CSV file just by updating the header: iris-3-classes.csv
>> > >> > >>> > > >
>> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic
>> > >> > >>> > > > --input /usr/local/mahout/trunk/iris-3-classes.csv --features 4
>> > >> > >>> > > > --output /usr/local/mahout/trunk/iris-3-classes.model --target class
>> > >> > >>> > > > --categories 3 --predictors sepallength sepalwidth petallength petalwidth --types n
>> > >> > >>> > > >
>> > >> > >>> > > > >> It gave the following error:
>> > >> > >>> > > > Exception in thread "main" java.lang.IllegalArgumentException: Can only call classifyScalar with two categories
>> > >> > >>> > > >
>> > >> > >>> > > > I then created a CSV with only 2 classes.
>> > >> > >>> > > > PFA iris-2-classes.csv
>> > >> > >>> > > >
>> > >> > >>> > > > >> Trained iris-2-classes.csv with SGD:
>> > >> > >>> > > >
>> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic
>> > >> > >>> > > > --input /usr/local/mahout/trunk/iris-2-classes.csv --features 4
>> > >> > >>> > > > --output /usr/local/mahout/trunk/iris-2-classes.model --target class
>> > >> > >>> > > > --categories 2 --predictors sepallength sepalwidth petallength petalwidth --types n
>> > >> > >>> > > >
>> > >> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
>> > >> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
>> > >> > >>> > > >
>> > >> > >>> > > > AUC = 0.14
>> > >> > >>> > > > confusion: [[50.0, 50.0], [0.0, 0.0]]
>> > >> > >>> > > > entropy: [[-0.6, -0.3], [-0.8, -0.4]]
>> > >> > >>> > > >
>> > >> > >>> > > > >> The AUC seems too poor. I then changed --predictors:
>> > >> > >>> > > >
>> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic
>> > >> > >>> > > > --input /usr/local/mahout/trunk/iris-2-classes.csv --features 4
>> > >> > >>> > > > --output /usr/local/mahout/trunk/iris-2-classes.model --target class
>> > >> > >>> > > > --categories 2 --predictors sepalwidth petallength --types n
>> > >> > >>> > > >
>> > >> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
>> > >> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion --scores
>> > >> > >>> > > >
>> > >> > >>> > > > AUC = 0.80
>> > >> > >>> > > > *confusion: [[50.0, 50.0], [0.0, 0.0]]*
>> > >> > >>> > > > entropy: [[-0.7, -0.3], [-0.7, -0.4]]
>> > >> > >>> > > >
>> > >> > >>> > > > This model classifies everything as category 1, which is of no use.
>> > >> > >>> > > >
>> > >> > >>> > > > Thanks
>> > >> > >>> > > > Rajesh
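
A note on reading the confusion matrices quoted in this thread: each of them has an all-zero second row. The thread does not say whether rows index the actual or the predicted class, so the sketch below only computes row and column totals and leaves the interpretation open. It is plain Python, not Mahout code, and the function name `summarize` is made up for illustration. If rows are predicted classes, the model never predicts the second class; if rows are actual classes, every label was read as the first class, which would fit the suspicion that the data is being read incorrectly.

```python
def summarize(confusion):
    """Return row totals, column totals, and the indices of empty rows/columns."""
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    empty_rows = [i for i, total in enumerate(row_totals) if total == 0]
    empty_cols = [j for j, total in enumerate(col_totals) if total == 0]
    return row_totals, col_totals, empty_rows, empty_cols


# Matrices copied from the runlogistic output quoted in the thread.
matrices = {
    "sgd-grps": [[26563.0, 23006.0], [0.0, 0.0]],
    "iris-2-classes": [[50.0, 50.0], [0.0, 0.0]],
}

for name, matrix in matrices.items():
    rows, cols, empty_rows, empty_cols = summarize(matrix)
    print(name, "row totals:", rows, "column totals:", cols,
          "empty rows:", empty_rows, "empty columns:", empty_cols)
```

An empty row with two populated columns is the pattern to look for; checking the class column of the input CSV against these totals is one way to tell the two readings apart.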
