Wayne Zhang created SPARK-18715:
-----------------------------------

             Summary: Correct AIC calculation in Binomial GLM
                 Key: SPARK-18715
                 URL: https://issues.apache.org/jira/browse/SPARK-18715
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.0.2
            Reporter: Wayne Zhang
            Priority: Critical
             Fix For: 2.2.0

The AIC calculation in Binomial GLM seems to be wrong when there are weights. The weight adjustment should be applied only to the part of the Binomial density that involves the parameters, not to the normalizing constant. The current implementation is:

{code}
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
}.sum()
{code}

I suggest changing this to:

{code}
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  val wt = math.round(weight).toInt
  if (wt == 0) {
    0.0
  } else {
    dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
}.sum()
{code}

----

The following example illustrates the problem.

{code}
// In spark-shell; elsewhere, also import spark.implicits._ for toDF()
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.functions.col

val dataset = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(0.5, Vectors.dense(12, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(0.5, Vectors.dense(16, 1.0))
).toDF().withColumn("weight", col("label") + 1.0)

val glr = new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setWeightCol("weight")
  .setRegParam(0)
val model = glr.fit(dataset)
model.summary.aic
{code}

Spark reports an AIC of 14.189026847171382. To check whether this is correct, I ran the same analysis in R and got AIC = 11.66092 and -2 * LogLik = 5.660918.

{code}
da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
0,18,1,1
0.5,12,0,1.5
1,15,0,2
0,13,2,1
0,15,1,1
0.5,16,1,1.5

da <- as.data.frame(da)
f <- glm(y ~ x1 + x2, data = da, family = binomial(), weight = w)
AIC(f)
-2 * logLik(f)
{code}

To check that the proposed change is correct, the following computes -2 * LogLik manually. It gives 5.6609177228379055, which matches the R result.

{code}
import breeze.stats.{distributions => dist}
import org.apache.spark.sql.Row

val predictions = model.transform(dataset)
-2.0 * predictions.select("label", "prediction", "weight").rdd.map {
  case Row(y: Double, mu: Double, weight: Double) =>
    // Round the weight to an integer number of trials
    val wt = math.round(weight).toInt
    if (wt == 0) {
      0.0
    } else {
      dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
    }
}.sum()
{code}
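
As a rough, self-contained illustration of how the two formulas differ for a single observation (this is not the Spark code itself), the sketch below evaluates one log-likelihood term both ways without Breeze. The object name BinomialAicSketch, its helper functions, and the inputs y = 0.5, mu = 0.5, weight = 2.0 are hypothetical and chosen only for illustration. It assumes 0 < mu < 1.

```scala
object BinomialAicSketch {
  // log of the binomial coefficient C(n, k), as a sum of log ratios
  private def logChoose(n: Int, k: Int): Double =
    (1 to k).map(i => math.log((n - k + i).toDouble / i)).sum

  // log P(X = k) for X ~ Binomial(n, mu); assumes 0 < mu < 1
  def logPmf(n: Int, mu: Double, k: Int): Double =
    logChoose(n, k) + k * math.log(mu) + (n - k) * math.log(1.0 - mu)

  // Current formula: weight times the Bernoulli log-probability of round(y)
  def currentTerm(y: Double, mu: Double, weight: Double): Double =
    weight * logPmf(1, mu, math.round(y).toInt)

  // Proposed formula: Binomial(round(weight), mu) evaluated at round(y * weight)
  def proposedTerm(y: Double, mu: Double, weight: Double): Double = {
    val wt = math.round(weight).toInt
    if (wt == 0) 0.0 else logPmf(wt, mu, math.round(y * weight).toInt)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical observation: 1 success out of 2 trials, fitted mean 0.5
    val (y, mu, weight) = (0.5, 0.5, 2.0)
    println(s"current:  ${currentTerm(y, mu, weight)}")
    println(s"proposed: ${proposedTerm(y, mu, weight)}")
  }
}
```

With these inputs the current formula gives 2 * log(0.5) and the proposed one gives log(0.5). Because mu = 0.5 makes the parameter-dependent parts equal, the gap is exactly log(2), the log of the binomial coefficient C(2, 1) that the weighted Bernoulli form leaves out.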