Hello, I'm using Mahout's logistic regression (version 0.9), but when I evaluate the trained model on the same data set it was trained on, I do not get a high AUC. I would expect it to be very high, since it is the same data set.
My data set is a CSV file with about 7 million lines and 18 attributes, some numerical and some categorical. This is how I create the logistic regression model (I ignore some of the attributes):

    $ mahout trainlogistic --input train.csv \
        --output ./model \
        --categories 2 \
        --predictors attribute1 ... attribute15 \
        --types w w w n n w w w w w w w n n n \
        --target is_delayed \
        --rate 100 \
        --passes 2 \
        --features 500000

And then I check the AUC value using the model on the same data set:

    $ mahout runlogistic --input train.csv --model ./model --auc --confusion
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
    MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.9-cdh5.3.0-job.jar
    AUC = 0.48
    confusion: [[1703477.0, 761921.0], [3034369.0, 1137161.0]]
    entropy: [[NaN, NaN], [-16.5, -17.4]]
    15/01/18 06:50:50 INFO driver.MahoutDriver: Program took 98213 ms (Minutes: 1.6368833333333332)

I'm really confused about why I only get AUC = 0.48 instead of 1.00 or something very close to it, since it is the same data set. Am I missing something? What should I check first?

I also tried with only a few attributes, but the AUC was still very low, around 0.47. That means the model is almost guessing randomly, or even slightly worse, right?

-- Emre