[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16149





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91982251
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
--- End diff --

Fair enough. It's a minor point and not worth delaying for. Thanks for your feedback!





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-12 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91958478
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
--- End diff --

Yes, but isn't handling y = 0 directly better than introducing a tiny error into the computation? There's no need to avoid evaluating y ln y at y = 0, because its value is well defined for this purpose. If anything, the clamping version seems less clear to me; I've only ever seen y ln y computed by special-casing y = 0, not by restricting its domain.
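
A standalone sketch of the point about y ln y being well defined at y = 0 (not part of the patch; the `ylogy` helper is copied from the diff above):

```scala
// Sketch: the y == 0 branch of ylogy simply encodes the limit
// y * log(y / mu) -> 0 as y -> 0 for any fixed mu > 0.
object YLogYAtZero {
  private def ylogy(y: Double, mu: Double): Double =
    if (y == 0) 0.0 else y * math.log(y / mu)

  def main(args: Array[String]): Unit = {
    val mu = 0.25
    Seq(1e-2, 1e-4, 1e-8, 0.0).foreach { y =>
      println(f"y = $y%12.1e  ->  ylogy(y, mu) = ${ylogy(y, mu)}%+.6e")
    }
  }
}
```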





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91956672
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
--- End diff --

The utility method is not incorrect; I just think the version above is clearer. That is how I've seen it done in e.g. statsmodels, and adding a static method for such simple logic feels a bit superfluous. Isn't the entire point of the static method to avoid the y = 0 case, which would otherwise evaluate log(0), which is undefined?

Anyway, if we do elect to keep the `ylogy` method, I think a short doc comment would be helpful.
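
For illustration, a short doc comment along these lines might look like the following (a sketch of possible wording, not the text that was committed):

```scala
/**
 * Computes y * log(y / mu), returning 0.0 when y == 0, since
 * y * log(y / mu) tends to 0 as y approaches 0. Used in the
 * Binomial deviance computation.
 */
private def ylogy(y: Double, mu: Double): Double = {
  if (y == 0) 0.0 else y * math.log(y / mu)
}
```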





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91901487
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
--- End diff --

I still think that the case where the weight rounds to zero (e.g. 0.3) and the case where it rounds to something non-zero are different, and both should be tested. In the first case we are testing that the sample is disregarded entirely; in the second, that we round to an integer and then proceed with the normal calculation. One risk I can think of: if R were rounding differently (e.g. always rounding down) while we round to the nearest integer, there would be an inconsistency and we wouldn't know it.
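
A sketch of what covering both cases might look like in the suite (the instances are illustrative, and reference values would have to come from an R fit using the same rounding; this is not test data from the PR):

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vectors

// Hypothetical test data exercising both rounding paths in the binomial AIC:
// one weight that rounds to zero and one that rounds to a non-zero integer.
val datasetWithRoundedWeights = Seq(
  Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)), // round(0.3) = 0: instance dropped from the AIC sum
  Instance(0.5, 2.4, Vectors.dense(1.0, 2.0)), // round(2.4) = 2: rounded, then the normal calculation
  Instance(0.0, 4.0, Vectors.dense(3.0, 3.0))  // integer weight: behavior unchanged
)
```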





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91898659
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -467,10 +469,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+
     override def deviance(y: Double, mu: Double, weight: Double): Double = {
--- End diff --

I would prefer to remove the added method `ylogy`, which is only used in one place, and replace it with the following logic:

```scala
override def deviance(y: Double, mu: Double, weight: Double): Double = {
  val clamped = y.min(1 - epsilon).max(epsilon)
  2.0 * weight * (clamped * math.log(clamped / mu) +
    (1 - clamped) * math.log((1 - clamped) / (1 - mu)))
}
```

I think it's better to avoid adding a static method that is used in only one other method, and I also think the intent is clearer this way. I'm open to other opinions though.
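
For a concrete sense of the trade-off being debated (special-casing y = 0 versus clamping), a minimal standalone comparison; the epsilon value and the two-term deviance form are assumptions for illustration, mirroring the snippet above rather than the committed code:

```scala
object BinomialDevianceComparison {
  // Assumed clamping constant, for illustration only.
  private val epsilon = 1e-16

  private def ylogy(y: Double, mu: Double): Double =
    if (y == 0) 0.0 else y * math.log(y / mu)

  // Special-cased form: exact at y = 0 (and y = 1).
  def devianceSpecialCased(y: Double, mu: Double, weight: Double): Double =
    2.0 * weight * (ylogy(y, mu) + ylogy(1.0 - y, 1.0 - mu))

  // Clamped form: avoids log(0) by nudging y into (epsilon, 1 - epsilon).
  def devianceClamped(y: Double, mu: Double, weight: Double): Double = {
    val clamped = y.min(1 - epsilon).max(epsilon)
    2.0 * weight * (clamped * math.log(clamped / mu) +
      (1 - clamped) * math.log((1 - clamped) / (1 - mu)))
  }

  def main(args: Array[String]): Unit = {
    val (y, mu, w) = (0.0, 0.3, 2.0)
    println(devianceSpecialCased(y, mu, w)) // exactly -2 * w * log(1 - mu)
    println(devianceClamped(y, mu, w))      // off by roughly 2 * w * epsilon * log(epsilon / mu)
  }
}
```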





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91643659
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -468,11 +469,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
     override def variance(mu: Double): Double = mu * (1.0 - mu)

     private def ylogy(y: Double, mu: Double): Double = {
-      if (y == 0) {
-        0.0
-      } else {
-        y * math.log(y / mu)
-      }
+      if (y == 0) 0.0 else y * math.log(y / mu)
--- End diff --

I actually prefer not putting the whole thing on one line; the multi-line version is more readable.





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91643515
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
--- End diff --

If you mean an input weight of 0, then it is better to test that separately from the current PR, along with all the other families. I think the current test is good for this PR.





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91563322
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
--- End diff --

Isn't `round(0.3) = 0`?
Or do you mean testing the raw (input) weight = 0? How is that handled? I would assume it just leaves out that observation, since that's how R does it.





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91453574
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -215,6 +215,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
    * Sets the value of param [[weightCol]].
    * If this is not set or empty, we treat all instance weights as 1.0.
    * Default is not set, so all instances have weight one.
+   * In the Binomial model, weights correspond to number of trials.
--- End diff --

We should note that the weights should therefore be integers and that 
they'll be rounded if they are not. Also say "Binomial family" instead of 
"Binomial model."





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91453386
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -715,7 +715,7 @@ class GeneralizedLinearRegressionSuite
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
       Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
-      Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
+      Instance(1.0, 0.3, Vectors.dense(2.0, 1.0)),
--- End diff --

Hm, so I know this is a pain, but we have special handling implemented for 
the weight = 0 case, but we never test it. I think we should add a test.





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91453428
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -468,11 +469,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
     override def variance(mu: Double): Double = mu * (1.0 - mu)

     private def ylogy(y: Double, mu: Double): Double = {
-      if (y == 0) {
-        0.0
-      } else {
-        y * math.log(y / mu)
-      }
+      if (y == 0) 0.0 else y * math.log(y / mu)
--- End diff --

Well the entire thing can go on one line, but only change it if you make 
another commit since it's trivial.





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91440862
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -479,7 +485,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         numInstances: Double,
         weightSum: Double): Double = {
       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+        val wt = math.round(weight).toInt
+        if (wt == 0) {
+          0.0
+        } else {
+          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
--- End diff --

@sethah Code updated as suggested





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91432167
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -479,7 +485,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         numInstances: Double,
         weightSum: Double): Double = {
       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+        val wt = math.round(weight).toInt
+        if (wt == 0) {
+          0.0
+        } else {
+          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
--- End diff --

Maybe a comment here: `// weights for Binomial distribution correspond to 
number of trials`





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91431539
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -467,10 +467,16 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

     override def variance(mu: Double): Double = mu * (1.0 - mu)

+    private def ylogy(y: Double, mu: Double): Double = {
--- End diff --

nit: `private def ylogy(y: Double, mu: Double): Double = if (y == 0) 0.0 else y * math.log(y / mu)`





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91431452
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -708,73 +708,72 @@ class GeneralizedLinearRegressionSuite
        R code:

        A <- matrix(c(0, 1, 2, 3, 5, 2, 1, 3), 4, 2)
-       b <- c(1, 0, 1, 0)
+       b <- c(1, 0.5, 1, 0)
        w <- c(1, 2, 3, 4)
        df <- as.data.frame(cbind(A, b))
      */
     val datasetWithWeight = Seq(
       Instance(1.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
-      Instance(0.0, 2.0, Vectors.dense(1.0, 2.0)),
+      Instance(0.5, 2.0, Vectors.dense(1.0, 2.0)),
--- End diff --

Let's add a fractional weight as well.





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91246870
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         numInstances: Double,
         weightSum: Double): Double = {
       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+        val wt = math.round(weight).toInt
+        if (wt == 0) {
+          0.0
+        } else {
+          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
--- End diff --


@srowen The current code uses [breeze](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/distributions/Binomial.scala) for computing the Binomial density, which does not allow 0 trials: `require(n > 0, "n must be positive!")`. That is why I have to handle the `wt = 0` case.
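
A minimal sketch of that constraint and the resulting guard (the call pattern mirrors the diff; the exception follows from the `require` quoted above):

```scala
import breeze.stats.{distributions => dist}

// breeze's Binomial enforces require(n > 0, "n must be positive!"), so a rounded
// weight of 0 must be short-circuited before the distribution is constructed.
val mu = 0.4
val wt = 0
val contribution =
  if (wt == 0) 0.0 // contributes nothing to the AIC sum
  else dist.Binomial(wt, mu).logProbabilityOf(0) // dist.Binomial(0, mu) itself would throw
```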





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91168760
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         numInstances: Double,
         weightSum: Double): Double = {
       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+        val wt = math.round(weight).toInt
+        if (wt == 0) {
+          0.0
+        } else {
+          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
--- End diff --

SGTM, assuming tests pass. I'm still curious whether we need to special-case wt = 0, but it's fine.





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r91021502
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         numInstances: Double,
         weightSum: Double): Double = {
       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+        val wt = math.round(weight).toInt
+        if (wt == 0) {
+          0.0
+        } else {
+          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
--- End diff --

So I think the real issue here is that we don't currently allow users to 
specify a binomial GLM using success/outcome pairs. One way to mash that kind 
of grouped data into the format Spark requires is using the process described 
above by @actuaryzhang, but then we need to adjust the log-likelihood 
computation as was also noted. 

So @srowen is correct in saying that this is inaccurate for non-integer 
weights. I checked with R's glmnet, and it seems that they obey the semantics 
of data weights for a binomial GLM corresponding to the number of successes. So 
they log a warning when you input data weights of non-integer values, then 
proceed with the method proposed in this patch. 

So, this actually _does_ match R's behavior and I am in favor of the 
change. But we need to log appropriate warnings and write good unit tests. What 
are others' thoughts?
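
A sketch of the kind of non-integer-weight warning being suggested (hypothetical helper and wording, not from the PR):

```scala
// Hypothetical check (illustrative only): flag non-integer weights before they
// are rounded in the binomial AIC computation. Inside GeneralizedLinearRegression
// this would use logWarning from Spark's internal Logging trait instead of println.
def warnIfFractionalBinomialWeight(weight: Double): Unit = {
  if (weight != math.rint(weight)) {
    Console.err.println(s"Warning: binomial family weights are interpreted as numbers of " +
      s"trials; non-integer weight $weight will be rounded to the nearest integer.")
  }
}
```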





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16149#discussion_r90968163
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         numInstances: Double,
         weightSum: Double): Double = {
       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
-        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
+        val wt = math.round(weight).toInt
+        if (wt == 0) {
+          0.0
+        } else {
+          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
--- End diff --

I think I understand the problem: a weighted instance is treated like _n_ copies of one instance, each of whose likelihood is that of a Bernoulli trial (this could have been much simpler in the existing code, right -- just log(mu) or log(1-mu) depending on y?).

This loses some info because y is rounded, so a weight=100 instance with y=0.5 is treated like 100 instances of y=1, when it arguably should be treated like 50 instances each of y=0 and y=1.

This matters not just for non-integer weights, right? You would also get a different answer in the case above.

I am out of my depth here, but is this the best generalization? Notably, you have to round the weight, making this inaccurate for non-integer weights.

You have to handle wt=0 separately because dist.Binomial will reject it? 
Binomial with n=0 ought to be well defined. The problem is this says the log 
probability of an instance with weight<0.5 is always 0, but that's not 'true' 
-- I guess it depends on how one defines these weights.

Does this match what R does or something? I'm trying to figure out if this 
is the right thing to do.
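
To make the weight = 100, y = 0.5 example concrete, a standalone sketch of the per-instance log-probability term under the old and the proposed computation (the breeze calls mirror the diff; the numbers are purely illustrative):

```scala
import breeze.stats.{distributions => dist}

object AicTermExample {
  def main(args: Array[String]): Unit = {
    val (y, weight, mu) = (0.5, 100.0, 0.5)

    // Old code: round y to 1 and treat the instance as 100 identical Bernoulli trials.
    val oldTerm = weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)

    // Proposed code: treat the instance as round(y * weight) = 50 successes in 100 trials.
    val wt = math.round(weight).toInt
    val newTerm = dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)

    println(s"old: 100 copies of y = 1       -> $oldTerm") // 100 * log(0.5) ~= -69.3
    println(s"new: 50 successes / 100 trials -> $newTerm") // log(C(100, 50) * 0.5^100) ~= -2.5
  }
}
```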





[GitHub] spark pull request #16149: [SPARK-18715][ML]Fix AIC calculations in Binomial...

2016-12-05 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16149

[SPARK-18715][ML]Fix AIC calculations in Binomial GLM

The AIC calculation in the Binomial GLM seems to be off when the response or weight is non-integer: the result differs from that in R. This issue arises when one models rates, i.e., the number of successes normalized by the number of trials, and uses the number of trials as the weight. In this case, the effective likelihood is weight * label ~ Binomial(weight, mu), where weight is the number of trials, weight * label is the number of successes, and mu is the success rate.

@srowen @sethah @yanboliang @HyukjinKwon @zhengruifeng 

## What changes were proposed in this pull request?
I suggest changing the current AIC calculation for the Binomial family from
```
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
}.sum()
```
to the following, which generalizes to the case of real-valued response and weights.
```
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  val wt = math.round(weight).toInt
  if (wt == 0) {
    0.0
  } else {
    dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
}.sum()
```
## How was this patch tested?
I will write the unit tests once the community agrees to include the proposed change. For now, the following modifies an existing weighted Binomial GLM test to illustrate the issue; the second label is changed from 0 to 0.5.

```
val datasetWithWeight = Seq(
  (1.0, 1.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 2.0),
  (1.0, 3.0, 2.0, 1.0),
  (0.0, 4.0, 3.0, 3.0)
).toDF("y", "w", "x1", "x2")

val formula = new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label")
val output = formula.fit(datasetWithWeight).transform(datasetWithWeight)
  .select("features", "label", "w")

val glr = new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setWeightCol("w")
  .setFitIntercept(false)
  .setRegParam(0)

val model = glr.fit(output)
model.summary.aic
```
The AIC from Spark is 17.3227, and the AIC from R is 15.66454. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark aic

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16149.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16149


commit 7fdab860f740de558fa1281255b5e7dc35480d7d
Author: actuaryzhang 
Date:   2016-12-05T18:18:07Z

Fix AIC calculations in Binomial GLM



