[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16344


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97227087
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1052,6 +1217,120 @@ class GeneralizedLinearRegressionSuite
 assert(summary.solver === "irls")
   }
 
+  test("glm summary: tweedie family with weight") {
+/*
+  R code:
+
+  library(statmod)
+  df <- as.data.frame(matrix(c(
+1.0, 1.0, 0.0, 5.0,
+0.5, 2.0, 1.0, 2.0,
+1.0, 3.0, 2.0, 1.0,
+0.0, 4.0, 3.0, 3.0), 4, 4, byrow = TRUE))
+
+  f <- glm(V1 ~ -1 + V3 + V4, data = df, weights = V2,
--- End diff --

Change ```f``` to ```model``` and add ```summary(model)``` on the next line. 
That makes it easier for users to reproduce the result in R.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97219484
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -57,30 +57,72 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the error distribution 
to be used in the " +
   s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames.toArray))
+ParamValidators.inArray[String](supportedFamilyNames))
 
   /** @group getParam */
   @Since("2.0.0")
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
+   * the relationship between the variance and mean of the distribution.
+   * Used only for the Tweedie family.
--- End diff --

Nit: would ```Only applicable for "tweedie" family.``` be better?





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97217994
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -106,11 +148,24 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   schema: StructType,
   fitting: Boolean,
   featuresDataType: DataType): StructType = {
-if (isDefined(link)) {
-  require(supportedFamilyAndLinkPairs.contains(
-Family.fromName($(family)) -> Link.fromName($(link))), 
"Generalized Linear Regression " +
-s"with ${$(family)} family does not support ${$(link)} link 
function.")
+if ($(family) == "tweedie") {
--- End diff --

Use ```$(family).toLowerCase == "tweedie"``` (see #16516); change it here and 
in other places.
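
As a minimal sketch of this suggestion, the comparison can be made case-insensitive by lowercasing the param value before comparing. The helper `isTweedie` and the object `FamilyNameCheck` are hypothetical names used only for illustration, and passing `Locale.ROOT` is an extra assumption beyond the plain `toLowerCase` proposed above.

```scala
import java.util.Locale

object FamilyNameCheck {
  // Hypothetical helper, not part of the PR: compares the family name
  // case-insensitively. Locale.ROOT (an assumption added here) keeps the
  // lowercasing independent of the JVM's default locale.
  def isTweedie(family: String): Boolean =
    family.toLowerCase(Locale.ROOT) == "tweedie"
}
```

With this helper, `FamilyNameCheck.isTweedie("Tweedie")` and `FamilyNameCheck.isTweedie("tweedie")` are both true, while any other family name yields false.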





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97219567
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -57,30 +57,72 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the error distribution 
to be used in the " +
   s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames.toArray))
+ParamValidators.inArray[String](supportedFamilyNames))
 
   /** @group getParam */
   @Since("2.0.0")
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
+   * the relationship between the variance and mean of the distribution.
+   * Used only for the Tweedie family.
+   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
+   * Tweedie Distribution (Wikipedia)</a>)
+   * Supported values: 0 and [1, Inf).
+   * Note that variance power 0, 1, or 2 corresponds to the Gaussian, 
Poisson or Gamma
+   * family, respectively.
+   *
+   * @group param
+   */
+  @Since("2.2.0")
+  final val variancePower: DoubleParam = new DoubleParam(this, 
"variancePower",
+"The power in the variance function of the Tweedie distribution which 
characterizes " +
+"the relationship between the variance and mean of the distribution. " 
+
+"Used only for the Tweedie family. Supported values: 0 and [1, Inf).",
+(x: Double) => x >= 1.0 || x == 0.0)
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getVariancePower: Double = $(variancePower)
+
+  /**
* Param for the name of link function which provides the relationship
* between the linear predictor and the mean of the distribution 
function.
* Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * This is used only when family is not "tweedie". The link function for 
the "tweedie" family
+   * must be specified through [[linkPower]].
*
* @group param
*/
   @Since("2.0.0")
   final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
 "which provides the relationship between the linear predictor and the 
mean of the " +
 s"distribution function. Supported options: 
${supportedLinkNames.mkString(", ")}",
-ParamValidators.inArray[String](supportedLinkNames.toArray))
+ParamValidators.inArray[String](supportedLinkNames))
 
   /** @group getParam */
   @Since("2.0.0")
   def getLink: String = $(link)
 
   /**
+   * Param for the index in the power link function. This is used to 
specify the link function
+   * in the Tweedie family.
--- End diff --

```This is used to specify the link function in the Tweedie family.``` -> 
```Only applicable for "tweedie" family.``` I think we should highlight that it 
ONLY takes effect when family == "tweedie".





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97219089
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -578,6 +580,169 @@ class GeneralizedLinearRegressionSuite
 }
   }
 
+  test("generalized linear regression: tweedie family against glm") {
+/*
+  R code:
+  library(statmod)
+  df <- as.data.frame(matrix(c(
+1.0, 1.0, 0.0, 5.0,
+0.5, 1.0, 1.0, 2.0,
+1.0, 1.0, 2.0, 1.0,
+2.0, 1.0, 3.0, 3.0), 4, 4, byrow = TRUE))
+
+  f1 <- V1 ~ -1 + V3 + V4
+  f2 <- V1 ~ V3 + V4
+
+  for (f in c(f1, f2)) {
+for (lp in c(0, 1, -1))
+  for (vp in c(1.6, 2.5)) {
+model <- glm(f, df, family = tweedie(var.power = vp, 
link.power = lp))
+print(as.vector(coef(model)))
+  }
+  }
+  [1] 0.1496480 -0.0122283
+  [1] 0.1373567 -0.0120673
+  [1] 0.3919109 0.1846094
+  [1] 0.3684426 0.1810662
+  [1] 0.1759887 0.2195818
+  [1] 0.1108561 0.2059430
+  [1] -1.3163732  0.4378139  0.2464114
+  [1] -1.4396020  0.4817364  0.2680088
+  [1] -0.7090230  0.6256309  0.3294324
+  [1] -0.9524928  0.7304267  0.3792687
+  [1] 2.1188978 -0.3360519 -0.2067023
+  [1] 2.1659028 -0.3499170 -0.2128286
+*/
+val datasetTweedie = Seq(
+  Instance(1.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(0.5, 1.0, Vectors.dense(1.0, 2.0)),
+  Instance(1.0, 1.0, Vectors.dense(2.0, 1.0)),
+  Instance(2.0, 1.0, Vectors.dense(3.0, 3.0))
+).toDF()
+
+val expected = Seq(
+  Vectors.dense(0, 0.149648, -0.0122283),
+  Vectors.dense(0, 0.1373567, -0.0120673),
+  Vectors.dense(0, 0.3919109, 0.1846094),
+  Vectors.dense(0, 0.3684426, 0.1810662),
+  Vectors.dense(0, 0.1759887, 0.2195818),
+  Vectors.dense(0, 0.1108561, 0.205943),
+  Vectors.dense(-1.3163732, 0.4378139, 0.2464114),
+  Vectors.dense(-1.439602, 0.4817364, 0.2680088),
+  Vectors.dense(-0.709023, 0.6256309, 0.3294324),
+  Vectors.dense(-0.9524928, 0.7304267, 0.3792687),
+  Vectors.dense(2.1188978, -0.3360519, -0.2067023),
+  Vectors.dense(2.1659028, -0.349917, -0.2128286))
+
+import GeneralizedLinearRegression._
+
+var idx = 0
+for (fitIntercept <- Seq(false, true); linkPower <- Seq(0.0, 1.0, 
-1.0)) {
+  for (variancePower <- Seq(1.6, 2.5)) {
--- End diff --

Nit:
```
for (fitIntercept <- Seq(false, true);
     linkPower <- Seq(0.0, 1.0, -1.0);
     variancePower <- Seq(1.6, 2.5)) {
```
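
To check that the flattened comprehension still visits the combinations in the order of the `expected` sequence, here is a small standalone sketch. It uses `yield` to collect the tuples rather than the test's `idx` counter, and the object name `LoopOrder` is only illustrative.

```scala
object LoopOrder {
  // One for-comprehension with three generators. The rightmost generator
  // (variancePower) varies fastest, which matches the nesting of the R loops:
  // formula (intercept or not), then link power, then variance power.
  val combos: Seq[(Boolean, Double, Double)] =
    for {
      fitIntercept <- Seq(false, true)
      linkPower <- Seq(0.0, 1.0, -1.0)
      variancePower <- Seq(1.6, 2.5)
    } yield (fitIntercept, linkPower, variancePower)
}
```

The sequence has 2 * 3 * 2 = 12 entries, one per row of the `expected` coefficients, starting with `(false, 0.0, 1.6)` and ending with `(true, -1.0, 2.5)`.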





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97226910
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -620,25 +779,67 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Link {
 
 /**
- * Gets the [[Link]] object from its name.
+ * Gets the [[Link]] object based on link or linkPower.
+ * 1) if family is "tweedie", retrieve object using linkPower
+ * 2) otherwise, retrieve object based on link name
  *
- * @param name link name: "identity", "logit", "log",
- * "inverse", "probit", "cloglog" or "sqrt".
+ * @param params the parameter map containing link and link power
--- End diff --

```link and link power``` -> ```family, link and linkPower```





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97226879
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -620,25 +779,67 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Link {
 
 /**
- * Gets the [[Link]] object from its name.
+ * Gets the [[Link]] object based on link or linkPower.
--- End diff --

```
Gets the Link object based on param family, link and linkPower.
If param family was set with "tweedie", return or construct link function 
object according to linkPower; otherwise, return link function object according 
to link.
```





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97225945
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -308,7 +380,10 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   @Since("2.0.0")
   override def load(path: String): GeneralizedLinearRegression = 
super.load(path)
 
-  /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
+  /**
+   * Set of family and link pairs that GeneralizedLinearRegression 
supports.
--- End diff --

Set of family (except for tweedie) and link pairs that 
GeneralizedLinearRegression supports.
The link function of tweedie family is specified through param "linkPower".





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97219666
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -106,11 +148,24 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   schema: StructType,
   fitting: Boolean,
   featuresDataType: DataType): StructType = {
-if (isDefined(link)) {
-  require(supportedFamilyAndLinkPairs.contains(
-Family.fromName($(family)) -> Link.fromName($(link))), 
"Generalized Linear Regression " +
-s"with ${$(family)} family does not support ${$(link)} link 
function.")
+if ($(family) == "tweedie") {
+  if (isSet(link)) {
+logWarning("When family is tweedie, use param linkPower to specify 
link function. " +
+  "Setting param link will take no effect.")
+  }
+} else {
+  if (isSet(linkPower)) {
--- End diff --

We should add a similar check for ```variancePower``` here: when 
```family != "tweedie"```, neither ```variancePower``` nor ```linkPower``` 
should be set.
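
A rough sketch of the symmetric check, written without Spark so it runs on its own. `GlmSettings` is a hypothetical stand-in for the param state, and the two non-tweedie messages are illustrative wording rather than text from the PR; only the tweedie message is copied from the diff.

```scala
object ParamWarnings {
  // Hypothetical stand-in for the relevant param state; not Spark's Params API.
  final case class GlmSettings(
      family: String,
      linkSet: Boolean,
      linkPowerSet: Boolean,
      variancePowerSet: Boolean)

  // Returns the warnings that the schema validation could log.
  def warnings(s: GlmSettings): Seq[String] = {
    if (s.family.toLowerCase == "tweedie") {
      // Message taken from the diff under review.
      if (s.linkSet) {
        Seq("When family is tweedie, use param linkPower to specify link function. " +
          "Setting param link will take no effect.")
      } else {
        Seq.empty
      }
    } else {
      // Illustrative wording (assumption): both Tweedie-only params are checked.
      val forLinkPower =
        if (s.linkPowerSet) Seq("linkPower is ignored unless family is tweedie.")
        else Seq.empty
      val forVariancePower =
        if (s.variancePowerSet) Seq("variancePower is ignored unless family is tweedie.")
        else Seq.empty
      forLinkPower ++ forVariancePower
    }
  }
}
```

Returning the messages instead of logging them keeps the sketch easy to test; the real code would presumably call `logWarning` for each one.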





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97218693
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -369,6 +446,23 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 }
   }
 
+  private[regression] object FamilyAndLink {
+
+/**
+ * Constructs the FamilyAndLink object from a parameter map
+ */
+def apply(params: GeneralizedLinearRegressionBase): FamilyAndLink = {
+  val familyObj = Family.fromParams(params)
+  val linkObj = if ((params.getFamily != "tweedie" && 
params.isDefined(params.link)) ||
--- End diff --

```isSet``` is more accurate than ```isDefined``` here, since neither of them 
has a default value. I know the original code used ```isDefined```, but it 
would be better to correct it.
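
For readers unfamiliar with the distinction: in Spark's `Params`, `isSet` is true only when the user supplied a value, while `isDefined` is also true when a default exists. The toy model below illustrates that difference without depending on Spark; `ToyParams` and its fields are made-up names, not Spark's implementation.

```scala
object SetVsDefined {
  // Toy model of a param holder: user-supplied values and defaults are kept
  // in separate maps. Illustrative only.
  final case class ToyParams(
      userSet: Map[String, Any] = Map.empty,
      defaults: Map[String, Any] = Map.empty) {
    // True only when the user explicitly supplied a value.
    def isSet(name: String): Boolean = userSet.contains(name)
    // True when a value is available from either the user or a default.
    def isDefined(name: String): Boolean = isSet(name) || defaults.contains(name)
  }
}
```

When a param has no default, as with `link` and `linkPower` here, the two calls return the same result, so switching to `isSet` changes no behavior and states the intent more precisely.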





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97226609
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -409,27 +503,108 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
--- End diff --

Would the following documentation be better?
```
Gets the Family object based on param family and variancePower.
If param family was set with "gaussian", "binomial", "poisson" or "gamma", 
return the corresponding object directly; otherwise, construct a Tweedie object 
according to variancePower.
```





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97218286
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -409,27 +503,108 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param params the parameter map containing family name and variance 
power
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromParams(params: GeneralizedLinearRegressionBase): Family = {
+  params.getFamily match {
+case "gaussian" => Gaussian
--- End diff --

Revert to ```Gaussian.name``` here and below, which is less error-prone. 
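
The point can be sketched with a simplified stand-in for the family objects: matching on `Gaussian.name` ties each case to the single place where the name is declared, so a typo in a string literal cannot slip in unnoticed. The `Option` return type and the omission of the Tweedie case are simplifications for this sketch, not the PR's design.

```scala
object FamilyNames {
  // Simplified stand-in for the family hierarchy; `val name` makes
  // `Gaussian.name` a stable identifier that can be used as a match pattern.
  sealed abstract class Family(val name: String)
  case object Gaussian extends Family("gaussian")
  case object Binomial extends Family("binomial")
  case object Poisson extends Family("poisson")
  case object Gamma extends Family("gamma")

  // Each case compares the input against the name declared on the object.
  def fromName(name: String): Option[Family] = name match {
    case Gaussian.name => Some(Gaussian)
    case Binomial.name => Some(Binomial)
    case Poisson.name => Some(Poisson)
    case Gamma.name => Some(Gamma)
    case _ => None
  }
}
```

If a family name ever changes, only the object declaration needs to be updated; the match keeps working unchanged.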





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r97227237
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1052,6 +1217,120 @@ class GeneralizedLinearRegressionSuite
 assert(summary.solver === "irls")
   }
 
+  test("glm summary: tweedie family with weight") {
+/*
+  R code:
+
+  library(statmod)
+  df <- as.data.frame(matrix(c(
+1.0, 1.0, 0.0, 5.0,
+0.5, 2.0, 1.0, 2.0,
+1.0, 3.0, 2.0, 1.0,
+0.0, 4.0, 3.0, 3.0), 4, 4, byrow = TRUE))
+
+  f <- glm(V1 ~ -1 + V3 + V4, data = df, weights = V2,
+  family = tweedie(var.power = 1.6, link.power = 0))
+
+  Deviance Residuals:
+        1        2        3        4
+   0.6210  -0.0515   1.6935  -3.2539
+
+  Coefficients:
+     Estimate Std. Error t value Pr(>|t|)
+  V3  -0.4087     0.5205  -0.785    0.515
+  V4  -0.1212     0.4082  -0.297    0.794
+
+  (Dispersion parameter for Tweedie family taken to be 3.830036)
+
+  Null deviance: 20.702  on 4  degrees of freedom
+  Residual deviance: 13.844  on 2  degrees of freedom
+  AIC: NA
+
+  Number of Fisher Scoring iterations: 11
+
+  residuals(model, type="pearson")
--- End diff --

Typos here? I guess you pasted unrelated results for the following 
```residuals``` calls; they should be consistent with L1279-L1281.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r96061883
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,9 +316,9 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
-val linkObj = if (isDefined(link)) {
-  Link.fromName($(link))
+val familyObj = Family.fromParams(this)
+val linkObj = if (isDefined(link) || isDefined(linkPower)) {
+  Link.fromParams(this)
 } else {
   familyObj.defaultLink
 }
--- End diff --

Makes sense. I created a companion object `FamilyAndLink` with a factory 
method to construct desired `FamilyAndLink` objects from the input param map. 






[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r96061873
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Link {
 
 /**
- * Gets the [[Link]] object from its name.
+ * Gets the [[Link]] object based on link and linkPower.
+ * 1) if family is "tweedie", retrieve object using linkPower
+ * 2) otherwise, retrieve object based on link name
  *
- * @param name link name: "identity", "logit", "log",
- * "inverse", "probit", "cloglog" or "sqrt".
+ * @param params the parameter map containing link and link power
  */
-def fromName(name: String): Link = {
-  name match {
-case Identity.name => Identity
-case Logit.name => Logit
-case Log.name => Log
-case Inverse.name => Inverse
-case Probit.name => Probit
-case CLogLog.name => CLogLog
-case Sqrt.name => Sqrt
+def fromParams(params: GeneralizedLinearRegressionBase): Link = {
+  if (params.getFamily == "tweedie") {
+params.getLinkPower match {
+  case 0.0 => Log
+  case 1.0 => Identity
+  case -1.0 => Inverse
+  case 0.5 => Sqrt
+  case others => new PowerLink(others)
+}
--- End diff --

The last version of the code only allowed setting `linkPower` for tweedie, 
or `link` for non-tweedie. Now that we only give warnings, I need to add 
logic that checks for a correct specification. In the `FamilyAndLink` 
companion object, I use the following logic: when family is not tweedie and 
link is set, or when family is tweedie and linkPower is set, I use 
`Link.fromParams` to construct the Link object. Otherwise, I use the default 
link from the corresponding family.

So, if the user does not specify `linkPower`, the link object will be the 
default `new Power(1 - variancePower)` set in the `Tweedie` class. The test for 
`tweedie family against glm (default power link)` covers this. The reason for 
not setting a default linkPower in the param map is to mimic the existing 
behavior: tweedie with variancePower = 0 is the Gaussian with the identity 
link; tweedie with variancePower = 1 is the Poisson with the log link; etc.

```
val linkObj = if ((params.getFamily != "tweedie" && params.isDefined(params.link)) ||
    (params.getFamily == "tweedie" && params.isDefined(params.linkPower))) {
  Link.fromParams(params)
} else {
  familyObj.defaultLink
}
```
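
The Tweedie branch of this logic can be sketched in isolation as follows. The types are simplified stand-ins rather than the PR's classes. Routing the default `1 - variancePower` through the same named-link mapping as an explicit `linkPower` is an assumption made for readability, since the comment above says the `Tweedie` class builds a `Power` link directly.

```scala
object LinkResolution {
  // Simplified stand-ins for the link objects in the PR.
  sealed trait Link
  case object Identity extends Link
  case object Log extends Link
  case object Inverse extends Link
  case object Sqrt extends Link
  final case class Power(p: Double) extends Link

  // Mirrors the linkPower match in the diff: well-known powers map to named links.
  def fromPower(p: Double): Link = p match {
    case 0.0 => Log
    case 1.0 => Identity
    case -1.0 => Inverse
    case 0.5 => Sqrt
    case other => Power(other)
  }

  // Tweedie only: use linkPower when the user set it (Some), otherwise fall
  // back to the default power 1 - variancePower.
  def tweedieLink(variancePower: Double, linkPower: Option[Double]): Link =
    fromPower(linkPower.getOrElse(1.0 - variancePower))
}
```

Under these assumptions, an unset `linkPower` gives the identity link for variancePower = 0 and the log link for variancePower = 1, as described above, and variancePower = 2 gives the inverse link, which is the canonical link for the Gamma family.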





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r95994801
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,9 +316,9 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
-val linkObj = if (isDefined(link)) {
-  Link.fromName($(link))
+val familyObj = Family.fromParams(this)
+val linkObj = if (isDefined(link) || isDefined(linkPower)) {
+  Link.fromParams(this)
 } else {
   familyObj.defaultLink
 }
--- End diff --

This code snippet is used in multiple places; could we wrap it up as a 
function?
```
def getFamilyAndLinkObj(params: GeneralizedLinearRegressionBase): (Family, Link) = {
..
}
```





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r95995765
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Link {
 
 /**
- * Gets the [[Link]] object from its name.
+ * Gets the [[Link]] object based on link and linkPower.
+ * 1) if family is "tweedie", retrieve object using linkPower
+ * 2) otherwise, retrieve object based on link name
  *
- * @param name link name: "identity", "logit", "log",
- * "inverse", "probit", "cloglog" or "sqrt".
+ * @param params the parameter map containing link and link power
  */
-def fromName(name: String): Link = {
-  name match {
-case Identity.name => Identity
-case Logit.name => Logit
-case Log.name => Log
-case Inverse.name => Inverse
-case Probit.name => Probit
-case CLogLog.name => CLogLog
-case Sqrt.name => Sqrt
+def fromParams(params: GeneralizedLinearRegressionBase): Link = {
+  if (params.getFamily == "tweedie") {
+params.getLinkPower match {
+  case 0.0 => Log
+  case 1.0 => Identity
+  case -1.0 => Inverse
+  case 0.5 => Sqrt
+  case others => new PowerLink(others)
+}
--- End diff --

There is no default value for ```linkPower```; what happens if users don't 
set it?





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r95989723
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -57,30 +57,72 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the error distribution to be used in the " +
   s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames.toArray))
+ParamValidators.inArray[String](supportedFamilyNames))
 
   /** @group getParam */
   @Since("2.0.0")
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie distribution which provides
+   * the relationship between the variance and mean of the distribution.
+   * Used only for the Tweedie family.
+   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
+   * Tweedie Distribution (Wikipedia)</a>)
+   * Supported values: 0 and [1, Inf).
+   * Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma
+   * family, respectively.
+   *
+   * @group param
+   */
+  @Since("2.2.0")
+  final val variancePower: Param[Double] = new Param(this, "variancePower",
+"The power in the variance function of the Tweedie distribution which characterizes " +
+"the relationship between the variance and mean of the distribution. " +
+"Used only for the Tweedie family. Supported values: 0 and [1, Inf).",
+(x: Double) => x >= 1.0 || x == 0.0)
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getVariancePower: Double = $(variancePower)
+
+  /**
* Param for the name of link function which provides the relationship
* between the linear predictor and the mean of the distribution function.
* Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * This is used only when family is not "tweedie". The link function for the "tweedie" family
+   * must be specified through [[linkPower]].
*
* @group param
*/
   @Since("2.0.0")
   final val link: Param[String] = new Param(this, "link", "The name of link function " +
 "which provides the relationship between the linear predictor and the mean of the " +
 s"distribution function. Supported options: ${supportedLinkNames.mkString(", ")}",
-ParamValidators.inArray[String](supportedLinkNames.toArray))
+ParamValidators.inArray[String](supportedLinkNames))
 
   /** @group getParam */
   @Since("2.0.0")
   def getLink: String = $(link)
 
   /**
+   * Param for the index in the power link function. This is used to specify the link function
+   * in the Tweedie family.
+   * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt
+   * link, respectively.
+   *
+   * @group param
+   */
+  @Since("2.2.0")
+  final val linkPower: Param[Double] = new Param(this, "linkPower",
+"The index in the power link function. This is used to specify the link function in the " +
+"Tweedie family.", (x: Double) => true)
--- End diff --

Remove ```(x: Double) => true``` since there is no validation check for 
this param.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r95991458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -106,11 +148,20 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   schema: StructType,
   fitting: Boolean,
   featuresDataType: DataType): StructType = {
-if (isDefined(link)) {
-  require(supportedFamilyAndLinkPairs.contains(
-Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
-s"with ${$(family)} family does not support ${$(link)} link function.")
+if ($(family) == "tweedie") {
+  require(!isDefined(link), "The link function for the tweedie family must be " +
+"specified using linkPower, not link.")
--- End diff --

I don't think we should throw an error if users set ```link``` when the family
is ```tweedie```; logging a warning should be enough, for example:
```
if (isSet(link)) {
  logWarning("When family is tweedie, use param linkPower to specify link function. " +
    "Setting param link will take no effect.")
}
```





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r95992789
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Link {
 
 /**
- * Gets the [[Link]] object from its name.
+ * Gets the [[Link]] object based on link and linkPower.
--- End diff --

and -> or





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r95991944
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -106,11 +148,20 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   schema: StructType,
   fitting: Boolean,
   featuresDataType: DataType): StructType = {
-if (isDefined(link)) {
-  require(supportedFamilyAndLinkPairs.contains(
-Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
-s"with ${$(family)} family does not support ${$(link)} link function.")
+if ($(family) == "tweedie") {
+  require(!isDefined(link), "The link function for the tweedie family must be " +
+"specified using linkPower, not link.")
+} else {
+  require(!isDefined(linkPower), s"The link function for the ${$(family)} family " +
+  "must be specified using link, not linkPower.")
--- End diff --

Ditto: log a warning here instead of failing.
```
if (isSet(linkPower)) {
  logWarning("When family is not tweedie, use param link to specify link function. " +
    "Setting param linkPower will take no effect.")
}
```
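
The two warning checks suggested in this thread can be combined into one function. The sketch below is standalone and makes several assumptions: ```LinkParamCheck```, ```Params``` and ```conflictWarnings``` are hypothetical names, ```Option``` stands in for Spark's ```isSet```, and warnings are returned as strings rather than passed to ```logWarning```.

```scala
object LinkParamCheck {
  // Hypothetical stand-in for the params under discussion; None means "not set".
  final case class Params(family: String, link: Option[String], linkPower: Option[Double])

  // Returns warnings instead of throwing, so a conflicting setting is reported but ignored.
  def conflictWarnings(p: Params): Seq[String] = {
    if (p.family == "tweedie" && p.link.isDefined) {
      Seq("When family is tweedie, use param linkPower to specify the link function. " +
        "Setting param link will have no effect.")
    } else if (p.family != "tweedie" && p.linkPower.isDefined) {
      Seq("When family is not tweedie, use param link to specify the link function. " +
        "Setting param linkPower will have no effect.")
    } else {
      Seq.empty
    }
  }
}
```

Returning the messages keeps the check side-effect free, which makes the behavior easy to assert on in a unit test.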





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r95993832
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   private[regression] object Link {
 
 /**
- * Gets the [[Link]] object from its name.
+ * Gets the [[Link]] object based on link and linkPower.
+ * 1) if family is "tweedie", retrieve object using linkPower
+ * 2) otherwise, retrieve object based on link name
  *
- * @param name link name: "identity", "logit", "log",
- * "inverse", "probit", "cloglog" or "sqrt".
+ * @param params the parameter map containing link and link power
  */
-def fromName(name: String): Link = {
-  name match {
-case Identity.name => Identity
-case Logit.name => Logit
-case Log.name => Log
-case Inverse.name => Inverse
-case Probit.name => Probit
-case CLogLog.name => CLogLog
-case Sqrt.name => Sqrt
+def fromParams(params: GeneralizedLinearRegressionBase): Link = {
+  if (params.getFamily == "tweedie") {
+params.getLinkPower match {
+  case 0.0 => Log
+  case 1.0 => Identity
+  case -1.0 => Inverse
+  case 0.5 => Sqrt
+  case others => new PowerLink(others)
+}
+  } else {
+params.getLink match {
+  case Identity.name => Identity
+  case Logit.name => Logit
+  case Log.name => Log
+  case Inverse.name => Inverse
+  case Probit.name => Probit
+  case CLogLog.name => CLogLog
+  case Sqrt.name => Sqrt
+}
+  }
+}
+  }
+
+  /** Power link function class */
+  private[regression] class PowerLink(val linkPower: Double)
--- End diff --

```PowerLink``` -> ```Power```: the other link objects do not end with ```Link```.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-09 Thread actuaryzhang
GitHub user actuaryzhang reopened a pull request:

https://github.com/apache/spark/pull/16344

[SPARK-18929][ML] Add Tweedie distribution in GLM

## What changes were proposed in this pull request?
I propose to add the full Tweedie family to the
GeneralizedLinearRegression model. The Tweedie family is characterized by a
power variance function. The currently supported Gaussian, Poisson and Gamma
families are special cases of the Tweedie family
(https://en.wikipedia.org/wiki/Tweedie_distribution).

@yanboliang @srowen @sethah 

I propose to add support for the other distributions:
- compound Poisson: 1 < varPower < 2. This one is widely used to model 
zero-inflated continuous distributions, e.g., in insurance, finance, ecology, 
meteorology, advertising etc.
- positive stable: varPower > 2 and varPower != 3. Used to model extreme 
values.
- inverse Gaussian: varPower = 3.
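
For reference, the power variance function behind these cases can be written as follows. The symbols are notation introduced here, not taken from the PR: `p` stands for `varPower`, `\mu` for the mean and `\phi` for the dispersion.

```latex
\operatorname{Var}(Y) = \phi\,\mu^{p}, \qquad
\begin{cases}
p = 0 & \text{Gaussian} \\
p = 1 & \text{Poisson} \\
1 < p < 2 & \text{compound Poisson} \\
p = 2 & \text{Gamma} \\
p = 3 & \text{inverse Gaussian} \\
p > 2,\ p \neq 3 & \text{positive stable}
\end{cases}
```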

The Tweedie family is supported in most statistical packages such as R 
(statmod), SAS, h2o etc.

Changes made:
- Allow `tweedie` in family. Only `identity` and `log` links are allowed 
for now. 
- Add `varPower` to `GeneralizedLinearRegressionBase`, which takes values 
in (1, 2) and (2, infty). Also set default value to 1.5 and add getter method.
- Add `Tweedie` class
- Add tests for tweedie GLM

Note:
- In computing the deviance, use `math.max(y, 0.1)` to avoid taking the inverse
of 0. This is the same as in R: `tweedie()$dev.res`
- `aic` is not supported in this PR because evaluating the [Tweedie
density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in
these cases is non-trivial. I will implement the density approximation method
in a future PR.  R returns `null` (see `tweedie()$aic`).
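
The clamping in the first note can be illustrated with a sketch of the Tweedie unit deviance. This is an illustration under stated assumptions rather than the PR's code: the names `TweedieDeviance` and `powerDiff` are hypothetical, the formula is the standard power-family unit deviance, the response is assumed non-negative, and the clamp is applied only inside the term that would otherwise invert 0.

```scala
object TweedieDeviance {
  // Lower bound substituted for small y, taken from the note above.
  val delta = 0.1

  // Hypothetical helper: (y^p - mu^p) / p, with the log limit at p = 0.
  private def powerDiff(y: Double, mu: Double, p: Double): Double =
    if (p == 0.0) math.log(y / mu) else (math.pow(y, p) - math.pow(mu, p)) / p

  // Weighted unit deviance for a given variance power.
  def deviance(y: Double, mu: Double, weight: Double, variancePower: Double): Double = {
    require(y >= 0.0 && mu > 0.0, "this sketch assumes y >= 0 and mu > 0")
    // Only the first term uses the clamped response, since it raises y to a negative power.
    val y1 = math.max(y, delta)
    2.0 * weight *
      (y * powerDiff(y1, mu, 1.0 - variancePower) - powerDiff(y, mu, 2.0 - variancePower))
  }
}
```

With `variancePower = 1.5`, a zero response no longer produces an infinite intermediate value, and the deviance is zero when `y` equals `mu`.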


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark tweedie

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16344


commit 952887e485fb0d5fa669b3b4c9289b8069ee7769
Author: actuaryzhang 
Date:   2016-12-16T00:50:51Z

Add Tweedie family to GLM

commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3
Author: actuaryzhang 
Date:   2016-12-19T22:50:02Z

Fix calculation in dev resid; Add test for different var power

commit 7fe39106332663d3671b94a8ffac48ca61c48470
Author: actuaryzhang 
Date:   2016-12-19T23:14:37Z

Merge test into GLR

commit bfcc4fb08d54156efc66b90d14c62ea7ff172afa
Author: actuaryzhang 
Date:   2016-12-20T22:59:05Z

Use Tweedie class instead of global object Tweedie; change variancePower to 
varPower

commit a8feea7d8095170c1b5f18b7887f16a6d763e42c
Author: actuaryzhang 
Date:   2016-12-21T23:42:40Z

Allow Family to use GLRBase object directly

commit 233e2d338be8d36a74eaf578bfea804ae3617d4e
Author: actuaryzhang 
Date:   2016-12-22T01:56:34Z

Add TweedieFamily and implement specific distn within Tweedie

commit 17c55816c914bc96a8b5141356e3c117f343f303
Author: actuaryzhang 
Date:   2016-12-22T04:39:54Z

Clean up doc

commit 0b41825e99020976a34d8fe9c983f26de6c8c40f
Author: actuaryzhang 
Date:   2016-12-22T17:52:01Z

Move defaultLink and name to subclass of TweedieFamily

commit 6e8e60771afb4abe43e47c7fe186bad1541a8fac
Author: actuaryzhang 
Date:   2016-12-22T18:10:51Z

Change style for AIC

commit 8d7d34e258f9c7c03c80754d837ce847fcb0526e
Author: actuaryzhang 
Date:   2016-12-23T19:10:20Z

Rename Family methods and restore methods for tweedie subclasses

commit 6da7e3068e2c45a0faf7ff35c10b2750784d765e
Author: actuaryzhang 
Date:   2016-12-23T19:12:25Z

Update test

commit 9a71e89f629260c775922901a04c989f36ea4946
Author: actuaryzhang 
Date:   2016-12-27T17:16:40Z

Clean up doc

commit f461c09e65360f695ad3092b41bc26e0c61bbd95
Author: actuaryzhang 
Date:   2016-12-27T22:18:39Z

Put delta in Tweedie companion object

commit a839c4631dd17c4f3d0a0cc99e1b0af81419dda4
Author: actuaryzhang 
Date:   2016-12-27T22:23:57Z

Clean up doc

commit fab265278109eede4cce7ee506e8b29d481c4549
Author: actuaryzhang 
Date:   2017-01-05T19:32:06Z

Allow more link functions in tweedie





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-06 Thread actuaryzhang
Github user actuaryzhang closed the pull request at:

https://github.com/apache/spark/pull/16344





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94849556
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or "gamma".
+ * @param params the parameter map containing family name and variance power
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromParams(params: GeneralizedLinearRegressionBase): Family = {
+  params.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  params.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new Tweedie(default)
--- End diff --

Done.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94849540
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -365,7 +401,6 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   /**
* A description of the error distribution to be used in the model.
*
-   * @param name the name of the family.
--- End diff --

Sorry, I have added this back.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94849501
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -158,6 +183,16 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   setDefault(family -> Gaussian.name)
 
   /**
+* Sets the value of param [[variancePower]].
+* Used only when family is "tweedie".
+*
+* @group setParam
+*/
+  @Since("2.2.0")
+  def setVariancePower(value: Double): this.type = set(variancePower, value)
+  setDefault(variancePower -> 1.5)
--- End diff --

Done. Changed the default variancePower to 0.0, which uses Gaussian (with
the default identity link).





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94773809
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or "gamma".
+ * @param params the parameter map containing family name and variance power
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromParams(params: GeneralizedLinearRegressionBase): Family = {
+  params.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  params.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new Tweedie(default)
+  }
+  }
+}
+  }
+
+  /**
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class Tweedie(private val variancePower: Double)
+extends Family("tweedie") {
+
+/*
+  The canonical link is 1 - variancePower, which becomes Identity for Gaussian,
+  Log for Poisson, and Inverse for Gamma. Except for these special cases,
+  the canonical link is rarely used. For example, the canonical link is 1/Sqrt
+  when variancePower = 1.5. We set Log as the default link, which may be overridden
+  in subclasses.
+*/
+override val defaultLink: Link = Log
--- End diff --

See my comments above on the default link.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94762690
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -365,7 +401,6 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   /**
* A description of the error distribution to be used in the model.
*
-   * @param name the name of the family.
--- End diff --

Why remove this line?





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94772856
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or "gamma".
+ * @param params the parameter map containing family name and variance power
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromParams(params: GeneralizedLinearRegressionBase): Family = {
+  params.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  params.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new Tweedie(default)
--- End diff --

```default``` -> ```others``` would be clearer?





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94771634
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -158,6 +183,16 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   setDefault(family -> Gaussian.name)
 
   /**
+* Sets the value of param [[variancePower]].
+* Used only when family is "tweedie".
+*
+* @group setParam
+*/
+  @Since("2.2.0")
+  def setVariancePower(value: Double): this.type = set(variancePower, value)
+  setDefault(variancePower -> 1.5)
--- End diff --

Why set the default value to 1.5? AFAIK, R sets the default
```variancePower``` to 0, which means the gaussian family, with identity as the
default link function.
```
glm(formula = "b ~ .", family = tweedie, data = df, weights = w)
```
produces the same model as
```
glm(formula = "b ~ .", family = gaussian, data = df, weights = w)
```
[h2o.glm](https://rdrr.io/cran/h2o/man/h2o.glm.html) uses the same default
values as R. Should we stay consistent with them?
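
The effect of a default of 0 can be sketched with a standalone version of the family lookup. This mirrors the shape of ```Family.fromParams``` in the diff, but it is a simplified illustration: ```FamilyResolution``` and ```resolve``` are hypothetical names, the families are bare placeholders, and the default argument of 0.0 is the value proposed in this comment.

```scala
object FamilyResolution {
  sealed trait Family
  case object Gaussian extends Family
  case object Binomial extends Family
  case object Poisson extends Family
  case object Gamma extends Family
  final case class Tweedie(variancePower: Double) extends Family

  // With variancePower defaulting to 0.0, family "tweedie" resolves to Gaussian
  // unless the caller sets a different variance power.
  def resolve(family: String, variancePower: Double = 0.0): Family = family match {
    case "gaussian" => Gaussian
    case "binomial" => Binomial
    case "poisson" => Poisson
    case "gamma" => Gamma
    case "tweedie" => variancePower match {
      case 0.0 => Gaussian
      case 1.0 => Poisson
      case 2.0 => Gamma
      case others => Tweedie(others)
    }
    case other => throw new IllegalArgumentException(s"Unsupported family: $other")
  }
}
```

Here ```resolve("tweedie")``` and ```resolve("gaussian")``` return the same object, which matches the R behavior described above.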





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94773283
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -128,13 +152,14 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
  * Generalized linear model (Wikipedia))
  * specified by giving a symbolic description of the linear
  * predictor (link function) and a description of the error distribution (family).
- * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family.
  * Valid link functions for each family is listed below. The first link function of each family
  * is the default one.
  *  - "gaussian" : "identity", "log", "inverse"
  *  - "binomial" : "logit", "probit", "cloglog"
  *  - "poisson"  : "log", "identity", "sqrt"
  *  - "gamma": "inverse", "identity", "log"
+ *  - "tweedie"  : "log", "identity"
--- End diff --

The default link for the tweedie family is identity in R and H2O; I think we
should stay consistent with them. See my comments at L193.
BTW, we can expose a param ```linkPower``` (with ```1.0 - variancePower``` as
the default value) to support link functions other than ```log``` and
```identity```. The link function corresponding to the "tweedie" family should be:
```
def link(mu: Double): Double = math.pow(mu, linkPower)
```
I think we should generate the ```Link``` object from the value of
```linkPower``` when ```family``` is set to "tweedie". We can follow the
approach used for ```family``` and generate the ```link``` object by defining a
function like:
```
private[regression] object Link {
  def fromParams(params: GeneralizedLinearRegressionBase): Link = {
    ..
  }
}
```
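
A standalone sketch of such a power link is shown below. It is an illustration only: ```PowerLinkSketch```, ```fromLinkPower``` and the reduced ```Link``` trait are hypothetical and much smaller than Spark's ```Link``` class, and the log link is used for ```linkPower = 0``` because ```math.pow(mu, 0.0)``` would be constant.

```scala
object PowerLinkSketch {
  trait Link {
    def link(mu: Double): Double    // eta = g(mu)
    def unlink(eta: Double): Double // mu = g^{-1}(eta)
  }

  case object Log extends Link {
    def link(mu: Double): Double = math.log(mu)
    def unlink(eta: Double): Double = math.exp(eta)
  }

  // General power link; linkPower = 0 is excluded and handled by Log instead.
  final case class Power(linkPower: Double) extends Link {
    require(linkPower != 0.0, "use Log for linkPower = 0")
    def link(mu: Double): Double = math.pow(mu, linkPower)
    def unlink(eta: Double): Double = math.pow(eta, 1.0 / linkPower)
  }

  // Chooses the link object from the link power alone.
  def fromLinkPower(linkPower: Double): Link =
    if (linkPower == 0.0) Log else Power(linkPower)
}
```

In this sketch the identity, inverse and sqrt links are just ```Power(1.0)```, ```Power(-1.0)``` and ```Power(0.5)```; a real implementation may prefer to return the existing dedicated objects for those values.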





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r94764683
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or "gamma".
+ * @param params the parameter map containing family name and variance power
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromParams(params: GeneralizedLinearRegressionBase): Family = {
+  params.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  params.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new Tweedie(default)
+  }
+  }
+}
+  }
+
+  /**
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class Tweedie(private val variancePower: Double)
+extends Family("tweedie") {
+
+/*
+  The canonical link is 1 - variancePower, which becomes Identity for Gaussian,
+  Log for Poisson, and Inverse for Gamma. Except for these special cases,
+  the canonical link is rarely used. For example, the canonical link is 1/Sqrt
+  when variancePower = 1.5. We set Log as the default link, which may be overridden
+  in subclasses.
+*/
+override val defaultLink: Link = Log
+
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower >= 1.0 && variancePower < 2.0) {
+require(y >= 0.0, s"The response variable of $name($variancePower) family " +
+  s"should be non-negative, but got $y")
+  } else if (variancePower >= 2.0) {
+require(y > 0.0, s"The response variable of $name($variancePower) family " +
+  s"should be non-negative, but got $y")
--- End diff --

```y > 0.0``` means ```positive``` rather than ```non-negative```.
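
A minimal sketch of the wording fix, assuming a standalone helper (the names `ResponseCheck` and `checkResponse` are hypothetical; the PR performs these checks inside `initialize`):

```scala
// Hypothetical helper that mirrors the two checks quoted in the diff above.
// Only the second message changes: "non-negative" becomes "positive".
object ResponseCheck {
  def checkResponse(name: String, variancePower: Double, y: Double): Unit = {
    if (variancePower >= 1.0 && variancePower < 2.0) {
      // y >= 0.0 admits zero, so "non-negative" is the accurate word here.
      require(y >= 0.0, s"The response variable of $name($variancePower) family " +
        s"should be non-negative, but got $y")
    } else if (variancePower >= 2.0) {
      // y > 0.0 excludes zero, so the message should say "positive".
      require(y > 0.0, s"The response variable of $name($variancePower) family " +
        s"should be positive, but got $y")
    }
  }
}
```

With this wording, a zero response is accepted for a variance power in [1, 2) and rejected with a "positive" message for a variance power of 2 or more.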





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93971228
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,21 +435,102 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param params a GenerealizedLinearRegressionBase object
--- End diff --

Typo in `GenerealizedLinearRegressionBase`. This kind of doc doesn't add 
anything, though, because the type is already documented. It should say 
something non-trivial.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93971681
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
+"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog,
+"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt,
+"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log,
+"tweedie" -> Identity, "tweedie" -> Log
   )
 
   /** Set of family names that GeneralizedLinearRegression supports. */
-  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1.name)
+  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1)
 
   /** Set of link names that GeneralizedLinearRegression supports. */
   private[regression] lazy val supportedLinkNames = 
supportedFamilyAndLinkPairs.map(_._2.name)
 
   private[regression] val epsilon: Double = 1E-16
 
+  /** Constant used in initialization and deviance to avoid numerical 
issues. */
+  private[regression] val delta: Double = 0.1
--- End diff --

Why not a companion object for TweedieFamily, though? That seems easier 
and more correct.
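
A rough sketch of what the companion-object suggestion could look like. `Family` is reduced to a stub here so that the example is self-contained; this is an assumption for illustration, not the layout the PR settled on:

```scala
// Stub of the Family base class; the real one in Spark has more members.
abstract class Family(val name: String) {
  def initialize(y: Double, weight: Double): Double
}

class TweedieFamily(val variancePower: Double) extends Family("tweedie") {
  // A class can read private members of its own companion object.
  override def initialize(y: Double, weight: Double): Double =
    if (y == 0) TweedieFamily.delta else y
}

object TweedieFamily {
  /** Constant used in initialization and deviance to avoid numerical issues. */
  private val delta: Double = 0.1
}
```

Scoping `delta` to the companion keeps it next to the only class that uses it instead of exposing it across the whole `GeneralizedLinearRegression` object.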





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93776808
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GenerealizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
+extends Family("tweedie") {
+
+/*
+  The canonical link is 1 - variancePower. Except for the special 
cases of Gaussian,
+  Poisson and Gamma, the canonical link is rarely used. Set Log as the 
default link.
+*/
+override val defaultLink: Link = Log
 
-val defaultLink: Link = Identity
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower >= 1.0 && variancePower < 2.0) {
+require(y >= 0.0, s"The response variable of the specified 
distribution " +
+  s"should be non-negative, but got $y")
+  } else if (variancePower >= 2.0) {
+require(y > 0.0, s"The response variable of the specified 
distribution " +
--- End diff --

Ditto.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93778762
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
--- End diff --

Yeah, you are right. Avoiding the error-prone strings would need an extra 
global object, which may be a little expensive. I'm OK with using strings.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93773154
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -64,6 +64,28 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
+   * the relationship between the variance and mean of the distribution.
+   * Used only for the Tweedie family.
+   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
+   * Tweedie Distribution (Wikipedia)</a>)
+   * Supported value: 0 and [1, Inf). Note that when the value of the 
variance power is
+   * 0, 1, or 2, the Gaussian, Poisson or Gamma family is used, 
respectively.
+   *
+   * @group param
+   */
+  @Since("2.2.0")
+  final val variancePower: Param[Double] = new Param(this, "variancePower",
+"The power in the variance function of the Tweedie distribution which 
characterizes " +
+"the relationship between the variance and mean of the distribution. " 
+
+"Used for the Tweedie family. Supported value: 0 and [1, Inf).",
--- End diff --

Would ```Used only for``` be clearer?





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93774215
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GenerealizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
--- End diff --

Rename to ```fromParams```: we extract ```family``` and ```variancePower``` 
from the ```Params```, which is the superclass of both the GLR estimator and 
the model. In fact, we use this function for both the estimator (L279) and the 
model (L974).
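
A small sketch of the dispatch under the suggested name, assuming simplified stand-ins for the Spark types (`GlrParams` and `FakeParams` are hypothetical; in the PR the argument type is `GeneralizedLinearRegressionBase`):

```scala
// Hypothetical stand-in for the shared params trait of estimator and model.
trait GlrParams {
  def getFamily: String
  def getVariancePower: Double
}

sealed abstract class Family(val name: String)
case object Gaussian extends Family("gaussian")
case object Binomial extends Family("binomial")
case object Poisson extends Family("poisson")
case object Gamma extends Family("gamma")
final class Tweedie(val variancePower: Double) extends Family("tweedie")

object Family {
  // 1) Resolve by family name; 2) for "tweedie", resolve by variancePower.
  // An unknown family name raises a MatchError in this sketch.
  def fromParams(params: GlrParams): Family = params.getFamily match {
    case "gaussian" => Gaussian
    case "binomial" => Binomial
    case "poisson" => Poisson
    case "gamma" => Gamma
    case "tweedie" =>
      params.getVariancePower match {
        case 0.0 => Gaussian
        case 1.0 => Poisson
        case 2.0 => Gamma
        case p => new Tweedie(p)
      }
  }
}

// Anything that carries the params can be passed, estimator or model alike.
final case class FakeParams(getFamily: String, getVariancePower: Double) extends GlrParams
```

Because the function depends only on the params trait, the name `fromParams` describes it more accurately than `fromModel`.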





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93779154
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -578,6 +578,100 @@ class GeneralizedLinearRegressionSuite
 }
   }
 
+  test("generalized linear regression: tweedie family against glm") {
+/*
+R code:
--- End diff --

Add ```library(statmod)```, which helps users reproduce this test case.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93778869
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
+"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog,
+"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt,
+"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log,
+"tweedie" -> Identity, "tweedie" -> Log
   )
 
   /** Set of family names that GeneralizedLinearRegression supports. */
-  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1.name)
+  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1)
 
   /** Set of link names that GeneralizedLinearRegression supports. */
   private[regression] lazy val supportedLinkNames = 
supportedFamilyAndLinkPairs.map(_._2.name)
 
   private[regression] val epsilon: Double = 1E-16
 
+  /** Constant used in initialization and deviance to avoid numerical 
issues. */
+  private[regression] val delta: Double = 0.1
--- End diff --

I'm OK with putting it here.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93775431
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GenerealizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
--- End diff --

```TweedieFamily``` -> ```Tweedie```; we don't add a suffix to the other 
family classes/objects.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93777486
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GenerealizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
+extends Family("tweedie") {
+
+/*
+  The canonical link is 1 - variancePower. Except for the special 
cases of Gaussian,
+  Poisson and Gamma, the canonical link is rarely used. Set Log as the 
default link.
+*/
+override val defaultLink: Link = Log
 
-val defaultLink: Link = Identity
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower >= 1.0 && variancePower < 2.0) {
+require(y >= 0.0, s"The response variable of the specified 
distribution " +
+  s"should be non-negative, but got $y")
+  } else if (variancePower >= 2.0) {
+require(y > 0.0, s"The response variable of the specified 
distribution " +
+  s"should be non-negative, but got $y")
+  }
+  if (y == 0) delta else y
+}
 
-override def initialize(y: Double, weight: Double): Double = y
+override def variance(mu: Double): Double = math.pow(mu, variancePower)
 
-override def variance(mu: Double): Double = 1.0
+private def yp(y: Double, mu: Double, p: Double): Double = {
+  if (p == 0) {
+math.log(y / mu)
+  } else {
+(math.pow(y, p) - math.pow(mu, p)) / p
+  }
+}
 
 override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
-  weight * (y - mu) * (y - mu)
+  // Force y >= delta for Poisson or compound Poisson
+  val y1 = if (variancePower >= 1.0 && variancePower < 2.0) {
+math.max(y, delta)
+  } else {
+y
+  }
+  2.0 * weight *
+(y * yp(y1, mu, 1.0 - variancePower) - yp(y, mu, 2.0 - 
variancePower))
 }
 
 override def aic(
 predictions: RDD[(Double, Double, Double)],
 deviance: Double,
 numInstances: Double,
 weightSum: Double): Double = {
+  /*
+   This depends on the density of the Tweedie distribution.
+   Only implemented for Gaussian, Poisson and Gamma at this point.
+  */
+  throw new UnsupportedOperationException("No AIC available for the 
tweedie family")
+}
+
+override def project(mu: Double): Double = {
+  if (mu < epsilon) {
+epsilon
+  } else if (mu.isInfinity) {
+Double.MaxValue
+  } else {
+mu
+  }
+}
+  }
+
+  /**
+   * Gaussian exponential family distribution.
+   * The default link for the Gaussian family is the identity link.
+   */
+  private[regression] object Gaussian extends TweedieFamily(0.0) {
--- End 

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93776723
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GenerealizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
+extends Family("tweedie") {
+
+/*
+  The canonical link is 1 - variancePower. Except for the special 
cases of Gaussian,
+  Poisson and Gamma, the canonical link is rarely used. Set Log as the 
default link.
+*/
+override val defaultLink: Link = Log
 
-val defaultLink: Link = Identity
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower >= 1.0 && variancePower < 2.0) {
+require(y >= 0.0, s"The response variable of the specified 
distribution " +
--- End diff --

```The response variable of $name family ```





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93773858
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -128,13 +151,14 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
  * Generalized linear model (Wikipedia))
  * specified by giving a symbolic description of the linear
  * predictor (link function) and a description of the error distribution 
(family).
- * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as 
family.
  * Valid link functions for each family is listed below. The first link 
function of each family
  * is the default one.
  *  - "gaussian" : "identity", "log", "inverse"
  *  - "binomial" : "logit", "probit", "cloglog"
  *  - "poisson"  : "log", "identity", "sqrt"
  *  - "gamma": "inverse", "identity", "log"
+ *  - "tweedie"  : "identity", "log"
--- End diff --

```- "tweedie"  : "log", "identity"```, see L155: ```the first link 
function of each family is the default one```.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93775763
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GenerealizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
+extends Family("tweedie") {
+
+/*
+  The canonical link is 1 - variancePower. Except for the special 
cases of Gaussian,
--- End diff --

```The canonical link is 1 - variancePower```: could you reword this to 
make it clearer?
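
For reference, a short derivation of what the code comment appears to mean, in standard GLM notation. The symbols are introduced here and are not in the PR: $\theta$ is the canonical parameter, $\mu$ the mean, and $p$ stands for `variancePower`.

```latex
% Variance function of the Tweedie family
V(\mu) = \mu^{p}

% The canonical parameter satisfies d\theta / d\mu = 1 / V(\mu), hence
\theta(\mu) = \int \frac{d\mu}{\mu^{p}} =
\begin{cases}
  \dfrac{\mu^{1-p}}{1-p}, & p \neq 1, \\[6pt]
  \log \mu,               & p = 1.
\end{cases}

% Special cases, up to constant factors:
%   p = 0   (Gaussian): \theta = \mu            (identity link)
%   p = 1   (Poisson):  \theta = \log\mu        (log link)
%   p = 2   (Gamma):    \theta = -1/\mu         (inverse link)
%   p = 1.5:            \theta = -2/\sqrt{\mu}  (1/sqrt link)
```

So "1 - variancePower" is the exponent of a power link, not the link itself, and saying so in the comment would remove the ambiguity.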





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93672741
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
+"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog,
+"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt,
+"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log,
+"tweedie" -> Identity, "tweedie" -> Log
   )
 
   /** Set of family names that GeneralizedLinearRegression supports. */
-  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1.name)
+  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1)
 
   /** Set of link names that GeneralizedLinearRegression supports. */
   private[regression] lazy val supportedLinkNames = 
supportedFamilyAndLinkPairs.map(_._2.name)
 
   private[regression] val epsilon: Double = 1E-16
 
+  /** Constant used in initialization and deviance to avoid numerical 
issues. */
+  private[regression] val delta: Double = 0.1
--- End diff --

They are already in the `GeneralizedLinearRegression` object, aren't they? 
Or do you mean creating a new object, say `Constant`, that stores these two 
constants, and using them like `Constant.delta`? 

Since `delta` is only used in the `TweedieFamily` class, I can also move it 
there. Let me know which is best. Thanks.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93612915
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,49 +436,132 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GeneralizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
+extends Family{
+
+val name: String = variancePower match {
+  case 0.0 => "gaussian"
+  case 1.0 => "poisson"
+  case 2.0 => "gamma"
+  case default => "tweedie"
+}
+/*
+  The canonical link is 1 - variancePower. Except for the special 
cases of Gaussian,
+  Poisson and Gamma, the canonical link is rarely used. Set Log as the 
default link.
+*/
+val defaultLink: Link = variancePower match {
+  case 0.0 => Identity
+  case 1.0 => Log
+  case 2.0 => Inverse
+  case _ => Log
+}
 
-val defaultLink: Link = Identity
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower >= 1.0 && variancePower < 2.0) {
+require(y >= 0.0, s"The response variable of the specified $name 
distribution " +
+  s"should be non-negative, but got $y")
+  } else if (variancePower >= 2.0) {
+require(y > 0.0, s"The response variable of the specified $name 
distribution " +
+  s"should be positive, but got $y")
+  }
+  if (y == 0) delta else y
+}
 
-override def initialize(y: Double, weight: Double): Double = y
+override def variance(mu: Double): Double = {
--- End diff --

Instead of case statements like this, why not just override in subclasses?
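
A minimal sketch of what overriding in subclasses could look like, assuming a
much simplified `Family` that only models `name` and `variance`. The object
`FamilySketch` and its members are hypothetical stand-ins, not the classes in
the PR.

```scala
object FamilySketch {
  /** Simplified stand-in for the PR's Family class. */
  abstract class Family(val name: String) {
    def variance(mu: Double): Double
  }

  // Each special case carries its own closed-form override.
  object Gaussian extends Family("gaussian") {
    override def variance(mu: Double): Double = 1.0
  }
  object Poisson extends Family("poisson") {
    override def variance(mu: Double): Double = mu
  }
  object Gamma extends Family("gamma") {
    override def variance(mu: Double): Double = mu * mu
  }

  // The general case no longer needs to match on 0.0, 1.0 and 2.0.
  class Tweedie(val variancePower: Double) extends Family("tweedie") {
    override def variance(mu: Double): Double = math.pow(mu, variancePower)
  }
}
```

The trade-off is a few more small classes in exchange for removing the repeated
`variancePower match` blocks from every method.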





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93612965
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,49 +436,132 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GeneralizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
+extends Family{
+
+val name: String = variancePower match {
+  case 0.0 => "gaussian"
+  case 1.0 => "poisson"
+  case 2.0 => "gamma"
+  case default => "tweedie"
+}
+/*
+  The canonical link is 1 - variancePower. Except for the special 
cases of Gaussian,
+  Poisson and Gamma, the canonical link is rarely used. Set Log as the 
default link.
+*/
+val defaultLink: Link = variancePower match {
+  case 0.0 => Identity
+  case 1.0 => Log
+  case 2.0 => Inverse
+  case _ => Log
+}
 
-val defaultLink: Link = Identity
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower >= 1.0 && variancePower < 2.0) {
+require(y >= 0.0, s"The response variable of the specified $name 
distribution " +
+  s"should be non-negative, but got $y")
+  } else if (variancePower >= 2.0) {
+require(y > 0.0, s"The response variable of the specified $name 
distribution " +
+  s"should be positive, but got $y")
+  }
+  if (y == 0) delta else y
+}
 
-override def initialize(y: Double, weight: Double): Double = y
+override def variance(mu: Double): Double = {
+  variancePower match {
+case 0.0 => 1.0
+case 1.0 => mu
+case 2.0 => mu * mu
+case default => math.pow(mu, default)
+  }
+}
 
-override def variance(mu: Double): Double = 1.0
+private def yp(y: Double, mu: Double, p: Double): Double = {
+  if (p == 0) {
+math.log(y / mu)
+  } else {
+(math.pow(y, p) - math.pow(mu, p)) / p
+  }
+}
 
 override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
-  weight * (y - mu) * (y - mu)
+  // Force y >= delta for Poisson or compound Poisson
+  val y1 = if (variancePower >= 1.0 && variancePower < 2.0) 
math.max(y, delta) else y
+  2.0 * weight *
+(y * yp(y1, mu, 1.0 - variancePower) - yp(y, mu, 2.0 - 
variancePower))
 }
 
-override def aic(
-predictions: RDD[(Double, Double, Double)],
-deviance: Double,
-numInstances: Double,
-weightSum: Double): Double = {
-  val wt = predictions.map(x => math.log(x._3)).sum()
-  numInstances * (math.log(deviance / numInstances * 2.0 * math.Pi) + 
1.0) + 2.0 - wt
+override def aic(predictions: RDD[(Double, Double, Double)],
--- End diff --

Likewise there's not a lot of value in pushing 4 separate 

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93612097
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
+"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog,
+"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt,
+"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log,
+"tweedie" -> Identity, "tweedie" -> Log
   )
 
   /** Set of family names that GeneralizedLinearRegression supports. */
-  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1.name)
+  private[regression] lazy val supportedFamilyNames = 
supportedFamilyAndLinkPairs.map(_._1)
 
   /** Set of link names that GeneralizedLinearRegression supports. */
   private[regression] lazy val supportedLinkNames = 
supportedFamilyAndLinkPairs.map(_._2.name)
 
   private[regression] val epsilon: Double = 1E-16
 
+  /** Constant used in initialization and deviance to avoid numerical 
issues. */
+  private[regression] val delta: Double = 0.1
--- End diff --

This should still be in an `object` IMHO; it's a constant, right? `epsilon` 
really should be too. It's not a big deal, but it's not quite right.
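
One way this could look, as a sketch only: the constants live in a dedicated
`object` and are imported where they are needed. `GlmConstantsSketch` and
`TweedieInitSketch` are hypothetical names; the values are copied from the diff
above.

```scala
object GlmConstantsSketch {
  // Values copied from the diff; the object itself is hypothetical.
  val epsilon: Double = 1e-16
  val delta: Double = 0.1
}

class TweedieInitSketch(val variancePower: Double) {
  import GlmConstantsSketch.delta

  /** Replace a zero response with delta, as the initialize method in the diff does. */
  def initialize(y: Double): Double = if (y == 0.0) delta else y
}
```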





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93613064
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,49 +436,132 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
   }
 
   private[regression] object Family {
 
 /**
- * Gets the [[Family]] object from its name.
+ * Gets the [[Family]] object based on family and variancePower.
+ * 1) retrieve object based on family name
+ * 2) if family name is tweedie, retrieve object based on variancePower
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param model a GeneralizedLinearRegressionBase object
  */
-def fromName(name: String): Family = {
-  name match {
-case Gaussian.name => Gaussian
-case Binomial.name => Binomial
-case Poisson.name => Poisson
-case Gamma.name => Gamma
+def fromModel(model: GeneralizedLinearRegressionBase): Family = {
+  model.getFamily match {
+case "gaussian" => Gaussian
+case "binomial" => Binomial
+case "poisson" => Poisson
+case "gamma" => Gamma
+case "tweedie" =>
+  model.getVariancePower match {
+case 0.0 => Gaussian
+case 1.0 => Poisson
+case 2.0 => Gamma
+case default => new TweedieFamily(default)
+  }
   }
 }
   }
 
   /**
-   * Gaussian exponential family distribution.
-   * The default link for the Gaussian family is the identity link.
-   */
-  private[regression] object Gaussian extends Family("gaussian") {
+* Tweedie exponential family distribution.
+* This includes the special cases of Gaussian, Poisson and Gamma.
+*/
+  private[regression] class TweedieFamily(private val variancePower: 
Double)
+extends Family{
+
+val name: String = variancePower match {
--- End diff --

Why remove the name and switch like this? You can instead adjust this so 
that subclasses override a `name` method.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93565567
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
--- End diff --

@yanboliang Could you help me understand the issue caused by using strings? 
If I use objects, then I have to create a Tweedie object that is not used 
anywhere else. I also have to write two methods in `Family`: one that returns 
the global Tweedie object (where the variancePower is preset) and one that 
returns a TweedieFamily object created using the user-specified variancePower. 
I hope we are fine using strings since there are only a few values. 





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93565335
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -64,6 +64,27 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
--- End diff --

Changed tweedie, but the other docs have been using Param.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93488131
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
+val familyObj = if ($(family) == "tweedie") {
+  new Tweedie($(varPower))
--- End diff --

Yes of course. Given how the code is structured, one straightforward 
solution is to generalize `Family` so that its operations take a reference to 
the model, so that implementation may access its parameters.

Another is to make the code instantiate `Family` subclasses instead of 
using single `object`s and give the instance a reference to the model.
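
A rough sketch of the second option, assuming the factory is handed something
that exposes the model's parameters. `GlmParams` is a hypothetical stand-in for
`GeneralizedLinearRegressionBase`, and `fromParams` is an illustrative name; the
binomial family is left out for brevity.

```scala
object FamilyFactorySketch {
  /** Hypothetical view of the parameters the model exposes. */
  trait GlmParams {
    def getFamily: String
    def getVariancePower: Double
  }

  sealed abstract class Family(val name: String)
  case object Gaussian extends Family("gaussian")
  case object Poisson extends Family("poisson")
  case object Gamma extends Family("gamma")
  final case class Tweedie(variancePower: Double) extends Family("tweedie")

  /** Resolve a Family from the model's parameters rather than from a name alone. */
  def fromParams(params: GlmParams): Family = params.getFamily match {
    case "gaussian" => Gaussian
    case "poisson" => Poisson
    case "gamma" => Gamma
    // Only the Tweedie case needs the extra parameter, so it is instantiated per call.
    case "tweedie" => Tweedie(params.getVariancePower)
    case other => throw new IllegalArgumentException(s"Unsupported family: $other")
  }
}
```

This keeps the singleton objects for the fixed families and avoids any mutable
`variancePower` on a shared global object.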





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93486159
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
--- End diff --

@yanboliang  The member object (in `GeneralizedLinearRegressionBase`) won't 
be accessible in `Family`, right? The method `Family.fromName($(family))` uses 
global objects like `Poisson`, `Gamma` etc. To use  `Family.fromName`, I need 
to create a `Tweedie` global object. Then we are back to the issue that @srowen 
pointed out of setting `variancePower` of the global object. Please advise. 
Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93482978
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
+val familyObj = if ($(family) == "tweedie") {
+  new Tweedie($(varPower))
--- End diff --

`Family` is not a subclass of `GeneralizedLinearRegression`. Could you 
elaborate on how to make it get the `variancePower` value? 





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93420832
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,14 +436,19 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
+/** Constant added to y = 0 for initialization or deviance to avoid 
numerical issues. */
+val delta: Double = 0.1
   }
 
   private[regression] object Family {
 
 /**
  * Gets the [[Family]] object from its name.
+ * This does not work for the tweedie family as it depends on the 
variance power
+ * that is set by the user.
  *
- * @param name family name: "gaussian", "binomial", "poisson" or 
"gamma".
+ * @param name family name: "gaussian", "binomial", "poisson" and 
"gamma".
--- End diff --

Nit: revert this, because it's really an "or".





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93420812
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
+val familyObj = if ($(family) == "tweedie") {
+  new Tweedie($(varPower))
--- End diff --

Hm, why does this parameter need to be in the `Family` object at all? Can't 
the implementation of Tweedie just go get the parameter's value? It's odd to 
have a Family representing all but one family, because Tweedie is one of them.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93420417
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -64,6 +64,27 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
+   * the relationship between the variance and mean of the distribution.
+   * Used only for the tweedie family.
+   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
+   * Tweedie Distribution (Wikipedia)</a>)
+   * Supported value: (1, 2) and (2, Inf).
+   *
+   * @group param
+   */
+  @Since("2.2.0")
+  final val varPower: Param[Double] = new Param(this, "varPower",
+"The power in the variance function of the Tweedie distribution which 
characterizes " +
+"the relationship between the variance and mean of the distribution. " 
+
+"Used only for the tweedie family. Supported value: (1, 2) and (2, 
Inf).",
+(x: Double) => if (x > 1.0 && x != 2.0) true else false)
--- End diff --

You can just write `=> x > 1.0 && x != 2.0`. `if (x) true else false` is 
redundant.
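
A small sketch showing that the two forms of the validator agree. The object
and value names are hypothetical; only the predicate bodies come from the diff
and the comment above.

```scala
object VarPowerValidatorSketch {
  // Form used in the diff: the if/else only re-states the Boolean it already has.
  val verbose: Double => Boolean = x => if (x > 1.0 && x != 2.0) true else false

  // Suggested form: return the Boolean expression directly.
  val concise: Double => Boolean = x => x > 1.0 && x != 2.0
}
```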





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93420653
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -397,14 +436,19 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 /** Trim the fitted value so that it will be in valid range. */
 def project(mu: Double): Double = mu
+
+/** Constant added to y = 0 for initialization or deviance to avoid 
numerical issues. */
+val delta: Double = 0.1
--- End diff --

This should be defined in an `object`; it's a static constant. The comment 
isn't quite accurate; it's not added, but it's a minimum.
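
To make the distinction concrete, a sketch contrasting a floor with an
addition. `DeltaFloorSketch`, `floorAtDelta` and `addDelta` are hypothetical
names; only the `math.max(y, 0.1)` pattern is taken from the diff.

```scala
object DeltaFloorSketch {
  val delta: Double = 0.1

  /** What the code does: values below delta are raised to delta, others are unchanged. */
  def floorAtDelta(y: Double): Double = math.max(y, delta)

  /** What the doc comment describes: every value is shifted by delta. */
  def addDelta(y: Double): Double = y + delta
}
```

The two agree at y = 0 but differ everywhere else, which is why describing
`delta` as a minimum is the more accurate wording.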





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93455400
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
+
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower > 1.0 && variancePower < 2.0) {
+require(y >= 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+math.max(y, 0.1)
+  } else {
+require(y > 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be positive, but got $y")
+y
+  }
+}
+
+override def variance(mu: Double): Double = math.pow(mu, variancePower)
+
+private def yp(y: Double, mu: Double, p: Double): Double = {
+  (math.pow(y, p) - math.pow(mu, p)) / p
+}
+
+// Force y >= 0.1 for deviance to work for (1 - variancePower). see 
tweedie()$dev.resid
+override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
+  2.0 * weight *
+(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 
- variancePower))
+}
+
+// This depends on the density of the tweedie distribution. Not yet 
implemented.
+override def aic(
+predictions: RDD[(Double, Double, Double)],
+deviance: Double,
+numInstances: Double,
+weightSum: Double): Double = {
+  0.0
+}
+
+override def project(mu: Double): Double = {
+  if (mu < epsilon) {
+epsilon
+  } else if (mu.isInfinity) {
+Double.MaxValue
--- End diff --

I see, it's done that way in other implementations. OK. I'm not sure if 
it's going to do much.

I think there's a problem in the Gaussian project method because it uses 
Double.MinValue, apparently to mean "the smallest double", when it's the 
"smallest possible double". I'll investigate and file a bug if needed.
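
As background for the concern above, and not as a statement about whether the
Gaussian code actually has a bug: the Scala and Java constants with similar
names mean different things. The sketch uses only standard-library values;
`DoubleBoundsSketch` is a hypothetical wrapper.

```scala
object DoubleBoundsSketch {
  // Scala: the most negative finite double, equal to -Double.MaxValue.
  val mostNegative: Double = Double.MinValue

  // Scala: the smallest positive double.
  val smallestPositive: Double = Double.MinPositiveValue

  // Java: MIN_VALUE is the smallest positive double, not the most negative one.
  val javaMinValue: Double = java.lang.Double.MIN_VALUE
}
```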





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93420479
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -64,6 +64,27 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
--- End diff --

Nits: Param -> parameter, tweedie -> Tweedie (two lines below).





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93449402
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -64,6 +64,27 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
+   * the relationship between the variance and mean of the distribution.
+   * Used only for the tweedie family.
+   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
+   * Tweedie Distribution (Wikipedia)</a>)
+   * Supported value: (1, 2) and (2, Inf).
--- End diff --

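Question: Why don't we allow ```0, 1 and 2```? They correspond to the 
```Gaussian, Poisson and Gamma``` families, respectively. I think we should 
support fitting a Poisson GLM via the ```tweedie``` family entry point, since R can do it:
```
library(statmod)  # provides the tweedie() family
y <- rgamma(20,shape=5)
x <- 1:20
# Tweedie with var.power = 1 (Poisson variance) and link.power = 1 (identity link)
glm(y~x,family=tweedie(var.power=1,link.power=1))
# The equivalent Poisson GLM with identity link
glm(y~x,family=poisson(link=identity))
```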
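
To make the correspondence concrete, here is a small sketch of the power variance function V(mu) = mu^p. The names `PowerVariance` and `variance` are made up for illustration and are not part of the PR; the only point is that p = 0, 1 and 2 give the Gaussian, Poisson and Gamma variance functions.

```scala
// Illustrative only: PowerVariance is a hypothetical name, not Spark code.
object PowerVariance {
  /** Tweedie power variance function V(mu) = mu^p. */
  def variance(mu: Double, p: Double): Double = math.pow(mu, p)

  def main(args: Array[String]): Unit = {
    val mu = 4.0
    println(variance(mu, 0.0)) // Gaussian: constant variance
    println(variance(mu, 1.0)) // Poisson: variance equals the mean
    println(variance(mu, 2.0)) // Gamma: variance equals the squared mean
  }
}
```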





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93453273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   /** Set of family and link pairs that GeneralizedLinearRegression 
supports. */
   private[regression] lazy val supportedFamilyAndLinkPairs = Set(
-Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
-Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
-Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
-Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
+"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
--- End diff --

Strings are error-prone. I think we can construct a member object for 
```Tweedie``` whose ```variancePower``` is the default value (1.5). 





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93447628
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -64,6 +64,27 @@ private[regression] trait 
GeneralizedLinearRegressionBase extends PredictorParam
   def getFamily: String = $(family)
 
   /**
+   * Param for the power in the variance function of the Tweedie 
distribution which provides
+   * the relationship between the variance and mean of the distribution.
+   * Used only for the tweedie family.
+   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
+   * Tweedie Distribution (Wikipedia)</a>)
+   * Supported value: (1, 2) and (2, Inf).
+   *
+   * @group param
+   */
+  @Since("2.2.0")
+  final val varPower: Param[Double] = new Param(this, "varPower",
--- End diff --

I vote to revert this back to ```variancePower``` to follow MLlib's 
convention.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93290854
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -242,7 +275,7 @@ class GeneralizedLinearRegression @Since("2.0.0") 
(@Since("2.0.0") override val
   def setLinkPredictionCol(value: String): this.type = 
set(linkPredictionCol, value)
 
   override protected def train(dataset: Dataset[_]): 
GeneralizedLinearRegressionModel = {
-val familyObj = Family.fromName($(family))
+val familyObj = Family.fromName($(family), $(variancePower))
--- End diff --

I don't think we can do this either. variancePower is specific to one 
family, not a property of all of them.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93291042
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
--- End diff --

I think the Tweedie implementation needs to be able to access parameters of 
the GLM, to read off variancePower.

As it is, this is a global variable, and two jobs would overwrite each 
other's values. 
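
One possible shape for this, as a rough sketch only: make `Tweedie` a class that takes `variancePower` in its constructor, so each fit gets its own immutable instance. The simplified `Family` hierarchy and the by-name `variancePower` argument below are assumptions for illustration, not the PR's code.

```scala
// Hypothetical sketch, not the PR's code: a simplified Family hierarchy in which
// Tweedie is a class, so every fit gets its own immutable variancePower.
abstract class Family(val name: String) {
  def variance(mu: Double): Double
}

object Poisson extends Family("poisson") {
  override def variance(mu: Double): Double = mu
}

class Tweedie(val variancePower: Double) extends Family("tweedie") {
  override def variance(mu: Double): Double = math.pow(mu, variancePower)
}

object Family {
  // variancePower is passed by name, so it is only evaluated for "tweedie".
  def fromName(name: String, variancePower: => Double): Family = name match {
    case "poisson" => Poisson
    case "tweedie" => new Tweedie(variancePower)
    case other => throw new IllegalArgumentException(s"Unknown family: $other")
  }
}
```

With this layout, two concurrent fits with different powers hold separate `Tweedie` instances and cannot overwrite each other's state.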





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93290858
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
+
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower > 1.0 && variancePower < 2.0) {
+require(y >= 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+math.max(y, 0.1)
--- End diff --

I have not seen a formal justification for the choice of 0.1 in R. This 
seminal 
[paper](http://users.du.se/~lrn/StatMod10/HomeExercise2/Nelder_Pregibon.pdf) 
suggests 1/6 (about 0.17) as the best constant. I would prefer to stay 
consistent with R so that we can compare results. Factoring it out into a 
named constant is a good idea. 





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93289668
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
--- End diff --

Could you suggest a better way to set `variancePower`? I want to stay 
consistent with the existing code, which uses `Family` objects, but I also need 
to pass the input `variancePower` to the `Tweedie` object, which uses it 
to compute the variance function. Any suggestion would be highly appreciated. 
@srowen @yanboliang   





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93215641
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
+
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower > 1.0 && variancePower < 2.0) {
+require(y >= 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+math.max(y, 0.1)
--- End diff --

If we're going to use this magic 0.1 constant in many places, factor out a 
constant? 0.1 seems quite large as an 'epsilon' but I guess that's what R's 
implementation uses for whatever reason?
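
Something along these lines, as a minimal sketch; `TweedieConstants`, `delta` and `floorResponse` are placeholder names I am making up here, not identifiers from the PR.

```scala
// Hypothetical sketch: `delta` is an illustrative name for the 0.1 floor.
object TweedieConstants {
  /** Lower bound applied to y where a negative power of y would blow up at 0. */
  val delta: Double = 0.1

  /** Replaces every use of the literal math.max(y, 0.1). */
  def floorResponse(y: Double): Double = math.max(y, delta)
}
```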





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93215688
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
+
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower > 1.0 && variancePower < 2.0) {
+require(y >= 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+math.max(y, 0.1)
+  } else {
+require(y > 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+y
+  }
+}
+
+override def variance(mu: Double): Double = math.pow(mu, variancePower)
+
+private def yp(y: Double, mu: Double, p: Double): Double = {
+  (math.pow(y, p) - math.pow(mu, p)) / p
+}
+
+// Force y >= 0.1 for deviance to work for (1 - variancePower). see 
tweedie()$dev.resid
+override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
+  2.0 * weight *
+(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 
- variancePower))
+}
+
+// This depends on the density of the tweedie distribution. Not yet 
implemented.
+override def aic(
+predictions: RDD[(Double, Double, Double)],
+deviance: Double,
+numInstances: Double,
+weightSum: Double): Double = {
+  0.0
--- End diff --

Throw an UnsupportedOperationException?
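
Roughly like the sketch below. It is a standalone stand-in for the override: the `TweedieAic` wrapper and the error message are assumptions, and the `predictions` RDD argument is left out so the snippet runs without Spark.

```scala
// Hypothetical sketch: a standalone stand-in for the aic override (RDD argument omitted).
object TweedieAic {
  def aic(deviance: Double, numInstances: Double, weightSum: Double): Double = {
    // Fail loudly instead of returning a misleading 0.0.
    throw new UnsupportedOperationException(
      "No AIC available for the tweedie family")
  }
}
```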





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93216003
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
--- End diff --

This is a global shared variable -- we really can't do this.





[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16344#discussion_r93215941
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
   }
 
   /**
+* Tweedie exponential family distribution.
+* The default link for the Tweedie family is the log link.
+*/
+  private[regression] object Tweedie extends Family("tweedie") {
+
+val defaultLink: Link = Log
+
+var variancePower: Double = 1.5
+
+override def initialize(y: Double, weight: Double): Double = {
+  if (variancePower > 1.0 && variancePower < 2.0) {
+require(y >= 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+math.max(y, 0.1)
+  } else {
+require(y > 0.0, "The response variable of the specified Tweedie 
distribution " +
+  s"should be non-negative, but got $y")
+y
+  }
+}
+
+override def variance(mu: Double): Double = math.pow(mu, variancePower)
+
+private def yp(y: Double, mu: Double, p: Double): Double = {
+  (math.pow(y, p) - math.pow(mu, p)) / p
+}
+
+// Force y >= 0.1 for deviance to work for (1 - variancePower). see 
tweedie()$dev.resid
+override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
+  2.0 * weight *
+(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 
- variancePower))
+}
+
+// This depends on the density of the tweedie distribution. Not yet 
implemented.
+override def aic(
+predictions: RDD[(Double, Double, Double)],
+deviance: Double,
+numInstances: Double,
+weightSum: Double): Double = {
+  0.0
+}
+
+override def project(mu: Double): Double = {
+  if (mu < epsilon) {
+epsilon
+  } else if (mu.isInfinity) {
+Double.MaxValue
--- End diff --

Out of curiosity, is it meaningful to "cap" this at Double.MaxValue? By the 
time you get there a lot of stuff is going to be infinite or not meaningful.
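
A quick sketch of what I mean, using only `math.pow`; `CapDemo` and its method names are made up for illustration.

```scala
// Hypothetical sketch: what the Tweedie formulas do once mu is capped at Double.MaxValue.
object CapDemo {
  val capped: Double = Double.MaxValue

  /** Variance function mu^p evaluated at the cap. */
  def varianceAtCap(p: Double): Double = math.pow(capped, p)

  /** The mu^(1 - p) term used in the deviance, evaluated at the cap. */
  def devianceTermAtCap(p: Double): Double = math.pow(capped, 1.0 - p)

  def main(args: Array[String]): Unit = {
    println(varianceAtCap(1.5).isInfinity) // the variance overflows for any p > 1
    println(devianceTermAtCap(3.0))        // this term underflows to zero
  }
}
```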


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-19 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16344

[SPARK-18929][ML] Add Tweedie distribution in GLM

## What changes were proposed in this pull request?
I propose to add the full Tweedie family to the 
GeneralizedLinearRegression model. The Tweedie family is characterized by a 
power variance function. Currently supported distributions such as the Gaussian, 
Poisson and Gamma families are special cases of the Tweedie family 
(https://en.wikipedia.org/wiki/Tweedie_distribution).

@yanboliang @srowen @sethah 

I propose to add support for the other distributions:
- compound Poisson: 1 < variancePower < 2. This one is widely used to model 
zero-inflated continuous distributions, e.g., in insurance, finance, ecology, 
meteorology, advertising etc.
- positive stable: variancePower > 2 and variancePower != 3. Used to model 
extreme values.
- inverse Gaussian: variancePower = 3.

The Tweedie family is supported in most statistical packages such as R 
(statmod), SAS, h2o etc.

Changes made:
- Allow `tweedie` in family. Only `identity` and `log` links are allowed 
for now. 
- Add `variancePower` to `GeneralizedLinearRegressionBase`, which takes 
values in (1, 2) and [3, infty). Also set default value to 1.5 and add getter 
method.
- `Family.fromName` has a second argument `variancePower`
- Add `Tweedie` object
- Add tests for tweedie GLM

Note:
- When computing the deviance, use `math.max(y, 0.1)` to avoid taking the inverse of 
0. This is the same as in R: `tweedie()$dev.res`
- `aic` is not supported in this PR because evaluating the [Tweedie 
density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in 
these cases is non-trivial. I will implement the density approximation method 
in a future PR. R returns `null` (see `tweedie()$aic`).
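
For reference, the variance function and the deviance contribution of one observation, as computed in the `Tweedie` object, can be written as below. The symbols are shorthand introduced here: $p$ is `variancePower`, $w$ the weight, and $\tilde{y} = \max(y, 0.1)$ the floored response.

```latex
V(\mu) = \mu^{p}, \qquad
d(y, \mu) = 2w \left[ \, y \, \frac{\tilde{y}^{\,1-p} - \mu^{1-p}}{1-p}
  \; - \; \frac{y^{2-p} - \mu^{2-p}}{2-p} \right],
\qquad \tilde{y} = \max(y, 0.1)
```

For $1 < p < 2$ the exponent $1 - p$ is negative, so $y^{1-p}$ is infinite at $y = 0$ and the product $y \cdot y^{1-p}$ would evaluate to NaN; the floor on $\tilde{y}$ keeps that term finite.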


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark tweedie

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16344


commit 952887e485fb0d5fa669b3b4c9289b8069ee7769
Author: actuaryzhang 
Date:   2016-12-16T00:50:51Z

Add Tweedie family to GLM

commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3
Author: actuaryzhang 
Date:   2016-12-19T22:50:02Z

Fix calculation in dev resid; Add test for different var power

commit 7fe39106332663d3671b94a8ffac48ca61c48470
Author: actuaryzhang 
Date:   2016-12-19T23:14:37Z

Merge test into GLR



