[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16149 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91982251

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
```

--- End diff --

Fair enough. It's a minor point and not worth delaying for. Thanks for your feedback!
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91958478

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
```

--- End diff --

Yes, but isn't handling y=0 better than introducing a tiny error into the computation? There's no need to avoid computing y ln y at y=0, because its value is well defined for this purpose. The alternative actually seems less clear to me; I've only ever seen y ln y computed by special-casing y=0, not by restricting its domain.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91956672

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
```

--- End diff --

The utility method is not incorrect, I just think the above is more clear. That is how I've seen it done in e.g. statsmodels, and I think adding a static method for this simple logic is a bit superfluous. Isn't the entire point of the static method to avoid the y=0 case, which would calculate log(0), which is undefined? Anyway, if we alternately elect to keep the `ylogy` method, I think a short doc would be helpful.
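The point being debated here is that y log(y/mu) has a well-defined limit of 0 as y approaches 0, even though log(0) itself is undefined. A quick stdlib-only Python sketch (purely illustrative, mirroring the Scala helper in the diff above) shows the special-cased version keeping the binomial unit deviance finite at the boundary:

```python
import math

def ylogy(y, mu):
    # y * log(y / mu), with the y == 0 case defined as its limit, 0.
    return 0.0 if y == 0 else y * math.log(y / mu)

def binomial_deviance(y, mu, weight):
    # Unit deviance of the binomial family, written with ylogy as in the diff.
    return 2.0 * weight * (ylogy(y, mu) + ylogy(1.0 - y, 1.0 - mu))

print(binomial_deviance(0.0, 0.25, 1.0))  # finite: log(0) is never evaluated
print(binomial_deviance(0.5, 0.25, 2.0))
```

Without the special case, y=0 (or y=1, via the 1-y term) would raise a domain error rather than contribute 0 to the deviance.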
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91901487

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---

```diff
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
```

--- End diff --

I still think that the case where the weight rounds to zero (e.g. 0.3) and the case where the weight rounds to something non-zero are different cases, and both should be tested. In the first case we are testing that we disregard that sample entirely; in the second case we are testing that we round to an integer and then proceed with the normal calculation. One thing I can think of: if R were doing its rounding differently (e.g. always rounding down) and we were rounding to the nearest integer, then there would be an inconsistency and we wouldn't know it.
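The two cases under discussion (a weight that rounds to zero versus one that rounds to a positive integer) can be made concrete. One subtlety: Scala's `math.round` rounds halves up, while Python 3's `round()` rounds halves to even, so this stdlib-only sketch emulates half-up rounding explicitly; it illustrates the two cases only, not Spark's actual code path:

```python
import math

def round_half_up(x):
    # Emulates Scala's math.round for non-negative doubles: halves round up,
    # unlike Python 3's round(), which rounds halves to the nearest even int.
    return int(math.floor(x + 0.5))

for weight in (0.3, 0.5, 2.6):
    wt = round_half_up(weight)
    if wt == 0:
        print(f"weight={weight}: rounds to 0, the sample is disregarded")
    else:
        print(f"weight={weight}: rounds to {wt}, Binomial({wt}, mu) is used")
```

If R rounded differently (say, always down, so 0.5 became 0 instead of 1), only a test with a weight in the ambiguous range would catch the inconsistency.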
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91898659

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
```

--- End diff --

I would prefer to remove the added method `ylogy`, which is only used in one place, and replace it with the following logic:

```scala
override def deviance(y: Double, mu: Double, weight: Double): Double = {
  val clamped = y.min(1 - epsilon).max(epsilon)
  2.0 * weight * (clamped * math.log(clamped / mu) +
    (1 - clamped) * math.log((1 - clamped) / (1 - mu)))
}
```

I think it's better to avoid adding a static method that is only used in one other method, and additionally I think it's clearer what the intent is here. I'm open to other opinions though.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91643659

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -468,11 +469,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

     private def ylogy(y: Double, mu: Double): Double = {
-      if (y == 0) {
-        0.0
-      } else {
-        y * math.log(y / mu)
-      }
+      if (y == 0) 0.0 else y * math.log(y / mu)
     }
```

--- End diff --

I actually prefer not putting the whole thing on one line. This is more readable.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91643515

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---

```diff
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
```

--- End diff --

If you mean input weight = 0, then it is better tested separately, along with all the other families, from the current PR. I think the current test is good for the current PR.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91563322

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---

```diff
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
```

--- End diff --

Isn't `round(0.3) = 0`? Or do you mean testing the raw (input) weight = 0? How is that handled? I would assume that just leaves out that observation, since that's how R does it.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91453574

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -215,6 +215,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
    * Sets the value of param [[weightCol]].
    * If this is not set or empty, we treat all instance weights as 1.0.
    * Default is not set, so all instances have weight one.
+   * In the Binomial model, weights correspond to number of trials.
```

--- End diff --

We should note that the weights should therefore be integers, and that they'll be rounded if they are not. Also, say "Binomial family" instead of "Binomial model."
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91453386

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---

```diff
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
```

--- End diff --

Hm, so I know this is a pain, but we have special handling implemented for the weight = 0 case, and we never test it. I think we should add a test.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91453428

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -468,11 +469,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

     private def ylogy(y: Double, mu: Double): Double = {
-      if (y == 0) {
-        0.0
-      } else {
-        y * math.log(y / mu)
-      }
+      if (y == 0) 0.0 else y * math.log(y / mu)
     }
```

--- End diff --

Well, the entire thing can go on one line, but only change it if you make another commit, since it's trivial.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91440862

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -479,7 +485,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       numInstances: Double,
       weightSum: Double): Double = {
     -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-      weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+      val wt = math.round(weight).toInt
+      if (wt == 0) {
+        0.0
+      } else {
+        dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
+      }
```

--- End diff --

@sethah Code updated as suggested.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91432167

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -479,7 +485,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       numInstances: Double,
       weightSum: Double): Double = {
     -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-      weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+      val wt = math.round(weight).toInt
+      if (wt == 0) {
+        0.0
+      } else {
+        dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
+      }
```

--- End diff --

Maybe a comment here: `// weights for Binomial distribution correspond to number of trials`
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91431539

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -467,10 +467,16 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
```

--- End diff --

nit: `private def ylogy(y: Double, mu: Double): Double = if (y == 0) 0.0 else y * math.log(y / mu)`
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91431452

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---

```diff
@@ -708,73 +708,72 @@ class GeneralizedLinearRegressionSuite
       R code:

       A <- matrix(c(0, 1, 2, 3, 5, 2, 1, 3), 4, 2)
-      b <- c(1, 0, 1, 0)
+      b <- c(1, 0.5, 1, 0)
       w <- c(1, 2, 3, 4)
       df <- as.data.frame(cbind(A, b))
      */
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
-      Instance(0.0, 2.0, Vectors.dense(1.0, 2.0)),
+      Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
```

--- End diff --

Let's add a fractional weight as well.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91246870

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       numInstances: Double,
       weightSum: Double): Double = {
     -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-      weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+      val wt = math.round(weight).toInt
+      if (wt == 0) {
+        0.0
+      } else {
+        dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
+      }
```

--- End diff --

@srowen The current code uses [breeze](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/distributions/Binomial.scala) to compute the Binomial density, which does not allow 0 trials: `require(n > 0, "n must be positive!")`. That is why I have to handle the case of `wt = 0`.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91168760

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       numInstances: Double,
       weightSum: Double): Double = {
     -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-      weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+      val wt = math.round(weight).toInt
+      if (wt == 0) {
+        0.0
+      } else {
+        dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
+      }
```

--- End diff --

SGTM assuming tests pass. I'm still curious if we need to special-case wt=0, but it's fine.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r91021502

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       numInstances: Double,
       weightSum: Double): Double = {
     -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-      weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+      val wt = math.round(weight).toInt
+      if (wt == 0) {
+        0.0
+      } else {
+        dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
+      }
```

--- End diff --

So I think the real issue here is that we don't currently allow users to specify a binomial GLM using success/outcome pairs. One way to mash that kind of grouped data into the format Spark requires is the process described above by @actuaryzhang, but then we need to adjust the log-likelihood computation, as was also noted. So @srowen is correct in saying that this is inaccurate for non-integer weights.

I checked with R's glmnet, and it seems that they obey the semantics of data weights for a binomial GLM corresponding to the number of successes. They log a warning when you input data weights of non-integer values, then proceed with the method proposed in this patch. So this actually _does_ match R's behavior, and I am in favor of the change. But we need to log appropriate warnings and write good unit tests. What are others' thoughts?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16149#discussion_r90968163

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

```diff
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       numInstances: Double,
       weightSum: Double): Double = {
     -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-      weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+      val wt = math.round(weight).toInt
+      if (wt == 0) {
+        0.0
+      } else {
+        dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
+      }
```

--- End diff --

I think I understand the problem: a weighted instance is treated like _n_ copies of one instance, whose likelihood is treated like a Bernoulli trial (this could have been much simpler in the existing code, right -- log(mu) or log(1-mu) depending on y?). This loses some info because y is rounded, so a weight=100 instance with y=0.5 is treated like 100 instances of y=1, when it kind of should be treated like 50 instances each of y=0 and y=1. This matters not just for non-integer weights, right? You would also get a different answer in the case above.

I am out of my depth here, but is this the best generalization? Notably, you have to round the weight, making this inaccurate for non-integer weights. You have to handle wt=0 separately because dist.Binomial will reject it? Binomial with n=0 ought to be well defined. The problem is this says the log probability of an instance with weight<0.5 is always 0, but that's not 'true' -- I guess it depends on how one defines these weights. Does this match what R does or something? I'm trying to figure out if this is the right thing to do.
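The difference between the two formulations can be checked numerically. The sketch below, in stdlib-only Python rather than Spark's Scala, compares the old per-row term, weight * log Binomial(1, mu).pmf(round(y)), against the proposed term, log Binomial(round(weight), mu).pmf(round(y * weight)), using a hand-rolled binomial log-pmf; the mu value is a made-up placeholder, not the output of an actual fit:

```python
import math

def log_binom_pmf(k, n, p):
    # log C(n, k) + k*log(p) + (n - k)*log(1 - p), with the binomial
    # coefficient computed via lgamma to stay in log space.
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)

y, weight, mu = 0.25, 4.0, 0.4  # mu is an invented fitted mean

# Old term: `weight` copies of a Bernoulli trial with the *rounded* label.
old_term = weight * log_binom_pmf(round(y), 1, mu)

# New term: one Binomial observation with `weight` trials and
# y * weight successes.
new_term = log_binom_pmf(round(y * weight), round(weight), mu)

print(old_term, new_term)  # the two terms differ for a fractional label
```

Here y=0.25 with weight 4 means one success in four trials; the old term rounds the label to 0 and scores four failures, while the new term scores the grouped observation directly, which is where the AIC discrepancy with R comes from.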
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16149

[SPARK-18715][ML] Fix AIC calculations in Binomial GLM

The AIC calculation in the Binomial GLM seems to be off when the response or weight is non-integer: the result is different from that in R. This issue arises when one models rates, i.e., the number of successes normalized over the number of trials, and uses the number of trials as the weight. In this case, the effective likelihood is weight * label ~ binomial(weight, mu), where weight is the number of trials, weight * label is the number of successes, and mu is the success rate.

@srowen @sethah @yanboliang @HyukjinKwon @zhengruifeng

## What changes were proposed in this pull request?

I suggest changing the current AIC calculation for the Binomial family from

```scala
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
}.sum()
```

to the following, which generalizes to the case of real-valued response and weights:

```scala
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  val wt = math.round(weight).toInt
  if (wt == 0) {
    0.0
  } else {
    dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
}.sum()
```

## How was this patch tested?

I will write the unit test once the community wants to include the proposed change. For now, the following modifies existing tests of the weighted Binomial GLM to illustrate the issue. The second label is changed from 0 to 0.5.
```scala
val datasetWithWeight = Seq(
  (1.0, 1.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 2.0),
  (1.0, 3.0, 2.0, 1.0),
  (0.0, 4.0, 3.0, 3.0)
).toDF("y", "w", "x1", "x2")

val formula = new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(datasetWithWeight)
  .transform(datasetWithWeight)
  .select("features", "label", "w")

val glr = new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setWeightCol("w")
  .setFitIntercept(false)
  .setRegParam(0)

val model = glr.fit(output)
model.summary.aic
```

The AIC from Spark is 17.3227, and the AIC from R is 15.66454.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark aic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16149.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16149

commit 7fdab860f740de558fa1281255b5e7dc35480d7d
Author: actuaryzhang
Date: 2016-12-05T18:18:07Z

    Fix AIC calculations in Binomial GLM
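The summary AIC is -2 * logLik + 2 * k, where k counts the fitted parameters. As a stdlib-only Python sketch (not Spark code) of how the proposed per-row terms aggregate into an AIC: the labels and weights below follow the example data, but the mu values are invented placeholders rather than actual fitted means, so the result is not expected to match the 17.3227 or 15.66454 figures above:

```python
import math

def log_binom_pmf(k, n, p):
    # Binomial log-pmf via lgamma, standing in for breeze's logProbabilityOf.
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)

# (label, weight, mu) rows; mu values are made up for illustration.
rows = [(1.0, 1.0, 0.7), (0.5, 2.0, 0.5), (1.0, 3.0, 0.8), (0.0, 4.0, 0.2)]

log_lik = 0.0
for y, weight, mu in rows:
    wt = int(math.floor(weight + 0.5))   # Scala-style half-up rounding
    if wt == 0:
        continue                          # breeze's Binomial requires n > 0
    successes = int(math.floor(y * weight + 0.5))
    log_lik += log_binom_pmf(successes, wt, mu)

num_params = 2  # two coefficients, no intercept, as in the snippet above
aic = -2.0 * log_lik + 2.0 * num_params
print(aic)
```

The row with label 0.5 and weight 2 is scored as one success in two trials, which is exactly the rate-data interpretation the PR description gives.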