Wayne Zhang created SPARK-18715:
-----------------------------------

             Summary: Correct AIC calculation in Binomial GLM
                 Key: SPARK-18715
                 URL: https://issues.apache.org/jira/browse/SPARK-18715
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.0.2
            Reporter: Wayne Zhang
            Priority: Critical
             Fix For: 2.2.0


The AIC calculation in the Binomial GLM appears to be wrong when weights are present. 
The weight adjustment should be applied only to the part of the Binomial 
log-density that involves the parameters, not to the normalizing constant. 

The current implementation is:
{code}
      -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
      }.sum()
{code} 

Suggest changing this to 
{code}
      -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
        val wt = math.round(weight).toInt
        if (wt == 0) {
          0.0
        } else {
          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
        }
      }.sum()
{code} 
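The only difference between the proposed term and the correctly weighted log-likelihood kernel is the log of the binomial normalizing constant, log C(wt, round(y * wt)), which the current code omits (the current code also rounds a fractional label to 0/1). A minimal self-contained sketch of that identity in plain Scala, without Breeze; the names {{logChoose}}, {{proposed}} and {{currentKernel}} are mine, for illustration only:

```scala
object AicSketch {
  // log C(n, k) as a running sum of logs; adequate for small integer weights
  def logChoose(n: Int, k: Int): Double =
    (1 to k).map(i => math.log(n - k + i) - math.log(i)).sum

  // proposed term: log P(k; n = wt, mu) with k = round(y * wt)
  def proposed(y: Double, mu: Double, wt: Int): Double = {
    val k = math.round(y * wt).toInt
    logChoose(wt, k) + k * math.log(mu) + (wt - k) * math.log(1.0 - mu)
  }

  // the parameter-dependent kernel that the weight should scale:
  // wt * (y * log(mu) + (1 - y) * log(1 - mu))
  def currentKernel(y: Double, mu: Double, wt: Int): Double =
    wt * (y * math.log(mu) + (1.0 - y) * math.log(1.0 - mu))

  def main(args: Array[String]): Unit = {
    val (y, mu, wt) = (0.5, 0.4, 2)
    // the two differ exactly by the normalizing constant log C(2, 1) = log 2
    println(proposed(y, mu, wt) - currentKernel(y, mu, wt))
    println(math.log(2.0))
  }
}
```

This shows the proposed formula keeps the weighted kernel intact and only adds the correct constant, which is why it reproduces R's log-likelihood below.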

----
The following example illustrates the problem.
{code}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.functions.col
import spark.implicits._

val dataset = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(0.5, Vectors.dense(12, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(0.5, Vectors.dense(16, 1.0))
).toDF().withColumn("weight", col("label") + 1.0)
val glr = new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setWeightCol("weight")
  .setRegParam(0)
val model = glr.fit(dataset)
model.summary.aic
{code}

This calculation gives AIC = 14.189026847171382. To verify whether this is 
correct, I ran the same analysis in R and got AIC = 11.66092 and -2 * logLik = 
5.660918. 
{code}
da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
0,18,1,1
0.5,12,0,1.5
1,15,0,2
0,13,2,1
0,15,1,1
0.5,16,1,1.5
da <- as.data.frame(da)
f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
AIC(f)
-2 * logLik(f)
{code}

Now I check whether the proposed change gives the right answer. The following 
computes -2 * logLik manually and gets 5.6609177228379055, the same value as R.
{code}
import breeze.stats.{distributions => dist}
import org.apache.spark.sql.Row

val predictions = model.transform(dataset)
-2.0 * predictions.select("label", "prediction", "weight").rdd.map {
  case Row(y: Double, mu: Double, weight: Double) =>
    val wt = math.round(weight).toInt
    if (wt == 0) {
      0.0
    } else {
      dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
    }
}.sum()
{code}






