[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16344 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97227087

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -1052,6 +1217,120 @@ class GeneralizedLinearRegressionSuite
         assert(summary.solver === "irls")
       }

    +  test("glm summary: tweedie family with weight") {
    +    /*
    +      R code:
    +
    +      library(statmod)
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 0.0, 5.0,
    +        0.5, 2.0, 1.0, 2.0,
    +        1.0, 3.0, 2.0, 1.0,
    +        0.0, 4.0, 3.0, 3.0), 4, 4, byrow = TRUE))
    +
    +      f <- glm(V1 ~ -1 + V3 + V4, data = df, weights = V2,
    --- End diff --

    Change ```f``` to ```model```, and add ```summary(model)``` on the next line. That makes it easier for users to reproduce the result in R.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97219484

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -57,30 +57,72 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
       final val family: Param[String] = new Param(this, "family",
         "The name of family which is a description of the error distribution to be used in the " +
           s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
    -    ParamValidators.inArray[String](supportedFamilyNames.toArray))
    +    ParamValidators.inArray[String](supportedFamilyNames))

       /** @group getParam */
       @Since("2.0.0")
       def getFamily: String = $(family)

       /**
    +   * Param for the power in the variance function of the Tweedie distribution which provides
    +   * the relationship between the variance and mean of the distribution.
    +   * Used only for the Tweedie family.
    --- End diff --

    Nit: would ```Only applicable for "tweedie" family.``` be better?
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97217994

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -106,11 +148,24 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
           schema: StructType,
           fitting: Boolean,
           featuresDataType: DataType): StructType = {
    -    if (isDefined(link)) {
    -      require(supportedFamilyAndLinkPairs.contains(
    -        Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
    -        s"with ${$(family)} family does not support ${$(link)} link function.")
    +    if ($(family) == "tweedie") {
    --- End diff --

    Use ```$(family).toLowerCase == "tweedie"``` (see #16516); change it here and in the other places.
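A minimal sketch of the case-insensitive comparison requested above. `FamilyNameCheck` and `isTweedie` are hypothetical names, not part of Spark, and passing `Locale.ROOT` is an extra precaution assumed here rather than something the review asks for.

```scala
import java.util.Locale

object FamilyNameCheck {
  // Hypothetical helper: compares the family name without regard to case,
  // so "Tweedie", "TWEEDIE" and "tweedie" are all accepted.
  // Locale.ROOT keeps the lowercasing independent of the JVM default locale.
  def isTweedie(family: String): Boolean =
    family.toLowerCase(Locale.ROOT) == "tweedie"
}
```

With a helper like this, every branch that currently tests `$(family) == "tweedie"` would behave the same for any capitalization of the name.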
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97219567

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -57,30 +57,72 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
       final val family: Param[String] = new Param(this, "family",
         "The name of family which is a description of the error distribution to be used in the " +
           s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
    -    ParamValidators.inArray[String](supportedFamilyNames.toArray))
    +    ParamValidators.inArray[String](supportedFamilyNames))

       /** @group getParam */
       @Since("2.0.0")
       def getFamily: String = $(family)

       /**
    +   * Param for the power in the variance function of the Tweedie distribution which provides
    +   * the relationship between the variance and mean of the distribution.
    +   * Used only for the Tweedie family.
    +   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
    +   * Tweedie Distribution (Wikipedia)</a>)
    +   * Supported values: 0 and [1, Inf).
    +   * Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma
    +   * family, respectively.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val variancePower: DoubleParam = new DoubleParam(this, "variancePower",
    +    "The power in the variance function of the Tweedie distribution which characterizes " +
    +    "the relationship between the variance and mean of the distribution. " +
    +    "Used only for the Tweedie family. Supported values: 0 and [1, Inf).",
    +    (x: Double) => x >= 1.0 || x == 0.0)
    +
    +  /** @group getParam */
    +  @Since("2.2.0")
    +  def getVariancePower: Double = $(variancePower)
    +
    +  /**
        * Param for the name of link function which provides the relationship
        * between the linear predictor and the mean of the distribution function.
        * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
    +   * This is used only when family is not "tweedie". The link function for the "tweedie" family
    +   * must be specified through [[linkPower]].
        *
        * @group param
        */
       @Since("2.0.0")
       final val link: Param[String] = new Param(this, "link", "The name of link function " +
         "which provides the relationship between the linear predictor and the mean of the " +
         s"distribution function. Supported options: ${supportedLinkNames.mkString(", ")}",
    -    ParamValidators.inArray[String](supportedLinkNames.toArray))
    +    ParamValidators.inArray[String](supportedLinkNames))

       /** @group getParam */
       @Since("2.0.0")
       def getLink: String = $(link)

       /**
    +   * Param for the index in the power link function. This is used to specify the link function
    +   * in the Tweedie family.
    --- End diff --

    ```This is used to specify the link function in the Tweedie family.``` -> ```Only applicable for "tweedie" family.``` I think we should highlight that it ONLY takes effect when family == "tweedie".
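The "Supported values: 0 and [1, Inf)" rule for `variancePower` in the diff above can be illustrated in isolation. This is a plain-Scala sketch of the same predicate; the `VariancePowerCheck` object is a hypothetical wrapper and does not use Spark's `DoubleParam`.

```scala
object VariancePowerCheck {
  // Same predicate as the validator in the diff:
  // accepts exactly 0, or any value in [1, Inf).
  val isValid: Double => Boolean = x => x >= 1.0 || x == 0.0
}
```

Values strictly between 0 and 1, and all negative values, are rejected, which matches the documented range.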
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r97219089 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -578,6 +580,169 @@ class GeneralizedLinearRegressionSuite } } + test("generalized linear regression: tweedie family against glm") { +/* + R code: + library(statmod) + df <- as.data.frame(matrix(c( +1.0, 1.0, 0.0, 5.0, +0.5, 1.0, 1.0, 2.0, +1.0, 1.0, 2.0, 1.0, +2.0, 1.0, 3.0, 3.0), 4, 4, byrow = TRUE)) + + f1 <- V1 ~ -1 + V3 + V4 + f2 <- V1 ~ V3 + V4 + + for (f in c(f1, f2)) { +for (lp in c(0, 1, -1)) + for (vp in c(1.6, 2.5)) { +model <- glm(f, df, family = tweedie(var.power = vp, link.power = lp)) +print(as.vector(coef(model))) + } + } + [1] 0.1496480 -0.0122283 + [1] 0.1373567 -0.0120673 + [1] 0.3919109 0.1846094 + [1] 0.3684426 0.1810662 + [1] 0.1759887 0.2195818 + [1] 0.1108561 0.2059430 + [1] -1.3163732 0.4378139 0.2464114 + [1] -1.4396020 0.4817364 0.2680088 + [1] -0.7090230 0.6256309 0.3294324 + [1] -0.9524928 0.7304267 0.3792687 + [1] 2.1188978 -0.3360519 -0.2067023 + [1] 2.1659028 -0.3499170 -0.2128286 +*/ +val datasetTweedie = Seq( + Instance(1.0, 1.0, Vectors.dense(0.0, 5.0)), + Instance(0.5, 1.0, Vectors.dense(1.0, 2.0)), + Instance(1.0, 1.0, Vectors.dense(2.0, 1.0)), + Instance(2.0, 1.0, Vectors.dense(3.0, 3.0)) +).toDF() + +val expected = Seq( + Vectors.dense(0, 0.149648, -0.0122283), + Vectors.dense(0, 0.1373567, -0.0120673), + Vectors.dense(0, 0.3919109, 0.1846094), + Vectors.dense(0, 0.3684426, 0.1810662), + Vectors.dense(0, 0.1759887, 0.2195818), + Vectors.dense(0, 0.1108561, 0.205943), + Vectors.dense(-1.3163732, 0.4378139, 0.2464114), + Vectors.dense(-1.439602, 0.4817364, 0.2680088), + Vectors.dense(-0.709023, 0.6256309, 0.3294324), + Vectors.dense(-0.9524928, 0.7304267, 0.3792687), + Vectors.dense(2.1188978, -0.3360519, -0.2067023), + Vectors.dense(2.1659028, -0.349917, -0.2128286)) + +import 
GeneralizedLinearRegression._ + +var idx = 0 +for (fitIntercept <- Seq(false, true); linkPower <- Seq(0.0, 1.0, -1.0)) { + for (variancePower <- Seq(1.6, 2.5)) { --- End diff --

    Nit: merge the nested loop into a single ```for```:
    ```
    for (fitIntercept <- Seq(false, true); linkPower <- Seq(0.0, 1.0, -1.0);
      variancePower <- Seq(1.6, 2.5)) {
    ```
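As a quick check that the suggested single `for` visits the same combinations in the same order as the nested version, here is a small sketch. The `LoopOrder` object and its members are illustrative names only; the generator sequences are copied from the test in the diff.

```scala
import scala.collection.mutable.ArrayBuffer

object LoopOrder {
  type Case = (Boolean, Double, Double)

  // Nested form, as in the original test code.
  val nested: Seq[Case] = {
    val buf = ArrayBuffer.empty[Case]
    for (fitIntercept <- Seq(false, true); linkPower <- Seq(0.0, 1.0, -1.0)) {
      for (variancePower <- Seq(1.6, 2.5)) {
        buf += ((fitIntercept, linkPower, variancePower))
      }
    }
    buf.toList
  }

  // Flattened form, as suggested in the review.
  val flat: Seq[Case] = for {
    fitIntercept <- Seq(false, true)
    linkPower <- Seq(0.0, 1.0, -1.0)
    variancePower <- Seq(1.6, 2.5)
  } yield (fitIntercept, linkPower, variancePower)
}
```

Both forms enumerate 2 x 3 x 2 = 12 cases, one per expected coefficient vector in the test.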
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97226910

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -620,25 +779,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       private[regression] object Link {

         /**
    -     * Gets the [[Link]] object from its name.
    +     * Gets the [[Link]] object based on link or linkPower.
    +     * 1) if family is "tweedie", retrieve object using linkPower
    +     * 2) otherwise, retrieve object based on link name
          *
    -     * @param name link name: "identity", "logit", "log",
    -     *             "inverse", "probit", "cloglog" or "sqrt".
    +     * @param params the parameter map containing link and link power
    --- End diff --

    ```link and link power``` -> ```family, link and linkPower```
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97226879

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -620,25 +779,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       private[regression] object Link {

         /**
    -     * Gets the [[Link]] object from its name.
    +     * Gets the [[Link]] object based on link or linkPower.
    --- End diff --

    ```
    Gets the Link object based on param family, link and linkPower.
    If param family is set to "tweedie", return or construct the link function object
    according to linkPower; otherwise, return the link function object according to link.
    ```
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97225945

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -308,7 +380,10 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       @Since("2.0.0")
       override def load(path: String): GeneralizedLinearRegression = super.load(path)

    -  /** Set of family and link pairs that GeneralizedLinearRegression supports. */
    +  /**
    +   * Set of family and link pairs that GeneralizedLinearRegression supports.
    --- End diff --

    Suggested wording: Set of family (except for tweedie) and link pairs that GeneralizedLinearRegression supports. The link function of the tweedie family is specified through param "linkPower".
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97219666

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -106,11 +148,24 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
           schema: StructType,
           fitting: Boolean,
           featuresDataType: DataType): StructType = {
    -    if (isDefined(link)) {
    -      require(supportedFamilyAndLinkPairs.contains(
    -        Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
    -        s"with ${$(family)} family does not support ${$(link)} link function.")
    +    if ($(family) == "tweedie") {
    +      if (isSet(link)) {
    +        logWarning("When family is tweedie, use param linkPower to specify link function. " +
    +          "Setting param link will take no effect.")
    +      }
    +    } else {
    +      if (isSet(linkPower)) {
    --- End diff --

    We should add a similar check for ```variancePower``` here: when ```family != "tweedie"```, neither ```variancePower``` nor ```linkPower``` should be set.
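The warn-instead-of-fail checks discussed above could look roughly like the following. This is a plain-Scala sketch, not the PR's code: `ParamWarningsSketch` and its boolean inputs are hypothetical stand-ins for `isSet(...)` calls, and the two non-tweedie messages are invented wording for illustration. Only the tweedie message is taken from the diff.

```scala
import java.util.Locale

object ParamWarningsSketch {
  // Returns the warnings that would be logged for a given configuration.
  // Nothing is thrown: a misplaced param is simply reported as ignored.
  def warnings(
      family: String,
      linkSet: Boolean,
      linkPowerSet: Boolean,
      variancePowerSet: Boolean): Seq[String] = {
    if (family.toLowerCase(Locale.ROOT) == "tweedie") {
      if (linkSet) {
        Seq("When family is tweedie, use param linkPower to specify link function. " +
          "Setting param link will take no effect.")
      } else {
        Seq.empty
      }
    } else {
      // Assumed wording for the two checks requested in the review.
      val linkPowerWarning =
        if (linkPowerSet) Some("linkPower is ignored unless family is tweedie.") else None
      val variancePowerWarning =
        if (variancePowerSet) Some("variancePower is ignored unless family is tweedie.") else None
      Seq(linkPowerWarning, variancePowerWarning).flatten
    }
  }
}
```

Returning the messages as a `Seq[String]` keeps the sketch testable without a logger; in the real class each entry would presumably be passed to `logWarning`.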
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97218693

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -369,6 +446,23 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         }
       }

    +  private[regression] object FamilyAndLink {
    +
    +    /**
    +     * Constructs the FamilyAndLink object from a parameter map
    +     */
    +    def apply(params: GeneralizedLinearRegressionBase): FamilyAndLink = {
    +      val familyObj = Family.fromParams(params)
    +      val linkObj = if ((params.getFamily != "tweedie" && params.isDefined(params.link)) ||
    --- End diff --

    ```isSet``` is more accurate than ```isDefined``` here, since neither of these params has a default value. I know the original code used ```isDefined```, but it would be better to correct it.
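The distinction drawn above is that `isSet` is true only for explicitly set params, while `isDefined` is also true when a default exists. A toy model of that behaviour, with all names hypothetical and no dependency on Spark's `Params` trait:

```scala
object ParamStateSketch {
  // Toy stand-in for a param holder: values set by the user and defaults
  // are kept in separate maps, keyed by param name.
  final case class Params(explicit: Map[String, Any], defaults: Map[String, Any]) {
    // True only when the user supplied a value.
    def isSet(name: String): Boolean = explicit.contains(name)
    // True when the user supplied a value or a default exists.
    def isDefined(name: String): Boolean = isSet(name) || defaults.contains(name)
  }
}
```

For a param without a default, the two checks agree, which is why the reviewer treats the switch to `isSet` as a matter of accuracy rather than a behaviour change.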
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97226609

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -409,27 +503,108 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       private[regression] object Family {

         /**
    -     * Gets the [[Family]] object from its name.
    +     * Gets the [[Family]] object based on family and variancePower.
    +     * 1) retrieve object based on family name
    +     * 2) if family name is tweedie, retrieve object based on variancePower
    --- End diff --

    Would the following documentation be better?
    ```
    Gets the Family object based on param family and variancePower.
    If param family is set to "gaussian", "binomial", "poisson" or "gamma",
    return the corresponding object directly; otherwise, construct a Tweedie object
    according to variancePower.
    ```
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97218286

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -409,27 +503,108 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       private[regression] object Family {

         /**
    -     * Gets the [[Family]] object from its name.
    +     * Gets the [[Family]] object based on family and variancePower.
    +     * 1) retrieve object based on family name
    +     * 2) if family name is tweedie, retrieve object based on variancePower
          *
    -     * @param name family name: "gaussian", "binomial", "poisson" or "gamma".
    +     * @param params the parameter map containing family name and variance power
          */
    -    def fromName(name: String): Family = {
    -      name match {
    -        case Gaussian.name => Gaussian
    -        case Binomial.name => Binomial
    -        case Poisson.name => Poisson
    -        case Gamma.name => Gamma
    +    def fromParams(params: GeneralizedLinearRegressionBase): Family = {
    +      params.getFamily match {
    +        case "gaussian" => Gaussian
    --- End diff --

    Revert to ```Gaussian.name``` here and below, which is less error-prone.
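To show why matching on `Gaussian.name` is less error-prone than repeating the string literal, here is a reduced sketch with two families. The types below are simplified stand-ins defined for this example, not Spark's private `Family` classes.

```scala
object FamilyMatchSketch {
  // Each family carries its own name, so the name is written exactly once.
  sealed abstract class Family(val name: String)
  case object Gaussian extends Family("gaussian")
  case object Poisson extends Family("poisson")

  // `Gaussian.name` is a stable identifier, so it can be used as a pattern;
  // a typo in the identifier is a compile error, unlike a typo in a literal.
  def fromName(name: String): Option[Family] = name match {
    case Gaussian.name => Some(Gaussian)
    case Poisson.name => Some(Poisson)
    case _ => None
  }
}
```

Returning `Option` is a choice made for this sketch so that unknown names can be tested without exceptions.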
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r97227237

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -1052,6 +1217,120 @@ class GeneralizedLinearRegressionSuite
         assert(summary.solver === "irls")
       }

    +  test("glm summary: tweedie family with weight") {
    +    /*
    +      R code:
    +
    +      library(statmod)
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 0.0, 5.0,
    +        0.5, 2.0, 1.0, 2.0,
    +        1.0, 3.0, 2.0, 1.0,
    +        0.0, 4.0, 3.0, 3.0), 4, 4, byrow = TRUE))
    +
    +      f <- glm(V1 ~ -1 + V3 + V4, data = df, weights = V2,
    +          family = tweedie(var.power = 1.6, link.power = 0))
    +
    +      Deviance Residuals:
    +            1        2        3        4
    +       0.6210  -0.0515   1.6935  -3.2539
    +
    +      Coefficients:
    +         Estimate Std. Error t value Pr(>|t|)
    +      V3  -0.4087     0.5205  -0.785    0.515
    +      V4  -0.1212     0.4082  -0.297    0.794
    +
    +      (Dispersion parameter for Tweedie family taken to be 3.830036)
    +
    +          Null deviance: 20.702  on 4  degrees of freedom
    +      Residual deviance: 13.844  on 2  degrees of freedom
    +      AIC: NA
    +
    +      Number of Fisher Scoring iterations: 11
    +
    +      residuals(model, type="pearson")
    --- End diff --

    Typo here? I suspect you pasted unrelated results for the ```residuals``` calls that follow; they should be consistent with L1279-L1281.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r96061883

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -242,9 +316,9 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
       def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value)

       override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = {
    -    val familyObj = Family.fromName($(family))
    -    val linkObj = if (isDefined(link)) {
    -      Link.fromName($(link))
    +    val familyObj = Family.fromParams(this)
    +    val linkObj = if (isDefined(link) || isDefined(linkPower)) {
    +      Link.fromParams(this)
         } else {
           familyObj.defaultLink
         }
    --- End diff --

    Makes sense. I created a companion object `FamilyAndLink` with a factory method to construct desired `FamilyAndLink` objects from the input param map.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r96061873 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine private[regression] object Link { /** - * Gets the [[Link]] object from its name. + * Gets the [[Link]] object based on link and linkPower. + * 1) if family is "tweedie", retrieve object using linkPower + * 2) otherwise, retrieve object based on link name * - * @param name link name: "identity", "logit", "log", - * "inverse", "probit", "cloglog" or "sqrt". + * @param params the parameter map containing link and link power */ -def fromName(name: String): Link = { - name match { -case Identity.name => Identity -case Logit.name => Logit -case Log.name => Log -case Inverse.name => Inverse -case Probit.name => Probit -case CLogLog.name => CLogLog -case Sqrt.name => Sqrt +def fromParams(params: GeneralizedLinearRegressionBase): Link = { + if (params.getFamily == "tweedie") { +params.getLinkPower match { + case 0.0 => Log + case 1.0 => Identity + case -1.0 => Inverse + case 0.5 => Sqrt + case others => new PowerLink(others) +} --- End diff -- The last version of the code only allows setting `linkPower` for tweedie, or `link` for non-tweedie. But now since we only give warnings, I need to add the logic for checking the correct specification. In the `FamilyAndLink` companion object, I use the following logic. When family is not tweedie and link is set, or when family is tweedie and linkPower is set, I will use `Link.fromParams` to construct the Link object. Otherwise, use the default link from the corresponding family. So, if the user does not specify `linkPower`, the link object will be the default `new Power(1 - variancePower)` set in the `Tweedie` class. The test for `tweedie family against glm (default power link)` covers this. 
The reason for not setting a default linkPower in the param map is to mimic the existing behavior: for tweedie with variancePower = 0, this will be the Gaussian with the identity link; for tweedie with variancePower = 1, this will be the Poisson with the log link; etc.
```
val linkObj = if ((params.getFamily != "tweedie" && params.isDefined(params.link)) ||
    (params.getFamily == "tweedie" && params.isDefined(params.linkPower))) {
  Link.fromParams(params)
} else {
  familyObj.defaultLink
}
```
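The resolution rule described in this message can be sketched without Spark as follows. Everything here is a simplified, hypothetical model: `Settings` uses `Option` to stand for "param not set", links are reduced to a name or a power, and the non-tweedie defaults table is the usual canonical link per family, assumed for illustration.

```scala
object LinkResolutionSketch {
  // None means the user did not set the param.
  final case class Settings(
      family: String,
      link: Option[String] = None,
      linkPower: Option[Double] = None,
      variancePower: Double = 0.0)

  sealed trait Link
  final case class NamedLink(name: String) extends Link
  final case class PowerLink(power: Double) extends Link

  // Assumed default link per non-tweedie family (canonical links).
  private val defaultLinkName: Map[String, String] = Map(
    "gaussian" -> "identity",
    "binomial" -> "logit",
    "poisson" -> "log",
    "gamma" -> "inverse")

  def resolve(s: Settings): Link =
    if (s.family == "tweedie") {
      // `link` is ignored; fall back to power 1 - variancePower when linkPower is unset.
      PowerLink(s.linkPower.getOrElse(1.0 - s.variancePower))
    } else {
      // `linkPower` is ignored; fall back to the family default when link is unset.
      NamedLink(s.link.getOrElse(defaultLinkName(s.family)))
    }
}
```

Under this rule a tweedie model with variancePower = 0 and no linkPower gets power 1 (identity), and one with variancePower = 1 gets power 0 (log), which matches the Gaussian and Poisson behaviour mentioned above.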
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r95994801

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -242,9 +316,9 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
       def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value)

       override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = {
    -    val familyObj = Family.fromName($(family))
    -    val linkObj = if (isDefined(link)) {
    -      Link.fromName($(link))
    +    val familyObj = Family.fromParams(this)
    +    val linkObj = if (isDefined(link) || isDefined(linkPower)) {
    +      Link.fromParams(this)
         } else {
           familyObj.defaultLink
         }
    --- End diff --

    This code snippet is used in multiple places; could we wrap it up as a function?
    ```
    def getFamilyAndLinkObj(params: GeneralizedLinearRegressionBase): (Family, Link) = {
      ..
    }
    ```
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r95995765

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       private[regression] object Link {

         /**
    -     * Gets the [[Link]] object from its name.
    +     * Gets the [[Link]] object based on link and linkPower.
    +     * 1) if family is "tweedie", retrieve object using linkPower
    +     * 2) otherwise, retrieve object based on link name
          *
    -     * @param name link name: "identity", "logit", "log",
    -     *             "inverse", "probit", "cloglog" or "sqrt".
    +     * @param params the parameter map containing link and link power
          */
    -    def fromName(name: String): Link = {
    -      name match {
    -        case Identity.name => Identity
    -        case Logit.name => Logit
    -        case Log.name => Log
    -        case Inverse.name => Inverse
    -        case Probit.name => Probit
    -        case CLogLog.name => CLogLog
    -        case Sqrt.name => Sqrt
    +    def fromParams(params: GeneralizedLinearRegressionBase): Link = {
    +      if (params.getFamily == "tweedie") {
    +        params.getLinkPower match {
    +          case 0.0 => Log
    +          case 1.0 => Identity
    +          case -1.0 => Inverse
    +          case 0.5 => Sqrt
    +          case others => new PowerLink(others)
    +        }
    --- End diff --

    There is no default value for ```linkPower```; what happens if users don't set it?
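For readers unfamiliar with power links, the mapping in the diff above (0 to log, 1 to identity, -1 to inverse, 0.5 to sqrt) follows from reading the link as g(mu) = mu^p, with p = 0 taken as log(mu). A small numeric sketch, assuming that definition; `PowerLinkSketch` and its methods are hypothetical names and do not reproduce Spark's `Link` classes.

```scala
object PowerLinkSketch {
  // Evaluates the power link g(mu) = mu^power, with power 0 read as log(mu).
  def link(power: Double)(mu: Double): Double =
    if (power == 0.0) math.log(mu) else math.pow(mu, power)

  // Names the special cases called out in the diff; any other power is generic.
  def describe(power: Double): String = power match {
    case 0.0 => "log"
    case 1.0 => "identity"
    case -1.0 => "inverse"
    case 0.5 => "sqrt"
    case other => s"power($other)"
  }
}
```

For example, `link(-1.0)(4.0)` evaluates the inverse link at mu = 4, and `link(0.5)(4.0)` evaluates the square-root link at the same point.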
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r95989723

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -57,30 +57,72 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
       final val family: Param[String] = new Param(this, "family",
         "The name of family which is a description of the error distribution to be used in the " +
           s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
    -    ParamValidators.inArray[String](supportedFamilyNames.toArray))
    +    ParamValidators.inArray[String](supportedFamilyNames))

       /** @group getParam */
       @Since("2.0.0")
       def getFamily: String = $(family)

       /**
    +   * Param for the power in the variance function of the Tweedie distribution which provides
    +   * the relationship between the variance and mean of the distribution.
    +   * Used only for the Tweedie family.
    +   * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
    +   * Tweedie Distribution (Wikipedia)</a>)
    +   * Supported values: 0 and [1, Inf).
    +   * Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma
    +   * family, respectively.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val variancePower: Param[Double] = new Param(this, "variancePower",
    +    "The power in the variance function of the Tweedie distribution which characterizes " +
    +    "the relationship between the variance and mean of the distribution. " +
    +    "Used only for the Tweedie family. Supported values: 0 and [1, Inf).",
    +    (x: Double) => x >= 1.0 || x == 0.0)
    +
    +  /** @group getParam */
    +  @Since("2.2.0")
    +  def getVariancePower: Double = $(variancePower)
    +
    +  /**
        * Param for the name of link function which provides the relationship
        * between the linear predictor and the mean of the distribution function.
        * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
    +   * This is used only when family is not "tweedie". The link function for the "tweedie" family
    +   * must be specified through [[linkPower]].
        *
        * @group param
        */
       @Since("2.0.0")
       final val link: Param[String] = new Param(this, "link", "The name of link function " +
         "which provides the relationship between the linear predictor and the mean of the " +
         s"distribution function. Supported options: ${supportedLinkNames.mkString(", ")}",
    -    ParamValidators.inArray[String](supportedLinkNames.toArray))
    +    ParamValidators.inArray[String](supportedLinkNames))

       /** @group getParam */
       @Since("2.0.0")
       def getLink: String = $(link)

       /**
    +   * Param for the index in the power link function. This is used to specify the link function
    +   * in the Tweedie family.
    +   * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt
    +   * link, respectively.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val linkPower: Param[Double] = new Param(this, "linkPower",
    +    "The index in the power link function. This is used to specify the link function in the " +
    +    "Tweedie family.", (x: Double) => true)
    --- End diff --

    Remove ```(x: Double) => true``` since there is no validation check for this param.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r95991458 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -106,11 +148,20 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam schema: StructType, fitting: Boolean, featuresDataType: DataType): StructType = { -if (isDefined(link)) { - require(supportedFamilyAndLinkPairs.contains( -Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " + -s"with ${$(family)} family does not support ${$(link)} link function.") +if ($(family) == "tweedie") { + require(!isDefined(link), "The link function for the tweedie family must be " + +"specified using linkPower, not link.") --- End diff -- I don't think we should throw error if users set ```link``` when family is set as ```tweedie```, and a warning log should be okay, like ``` if (isSet(link)) { logWarning("When family is tweedie, use param linkPower to specify link function. " + "Setting param link will take no effect.") } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r95992789 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine private[regression] object Link { /** - * Gets the [[Link]] object from its name. + * Gets the [[Link]] object based on link and linkPower. --- End diff -- and -> or --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r95991944 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -106,11 +148,20 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam schema: StructType, fitting: Boolean, featuresDataType: DataType): StructType = { -if (isDefined(link)) { - require(supportedFamilyAndLinkPairs.contains( -Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " + -s"with ${$(family)} family does not support ${$(link)} link function.") +if ($(family) == "tweedie") { + require(!isDefined(link), "The link function for the tweedie family must be " + +"specified using linkPower, not link.") +} else { + require(!isDefined(linkPower), s"The link function for the ${$(family)} family " + + "must be specified using link, not linkPower.") --- End diff -- Ditto, ``` if (isSet(linkPower)) { logWarning("When family is not tweedie, use param link to specify link function. " + "Setting param linkPower will take no effect.") } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
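Editor's sketch: the warn-instead-of-fail behaviour suggested for both `link` and `linkPower` can be tried in isolation. The snippet below is a minimal, Spark-free illustration. `GlmParams`, `LinkParamCheck` and `ignoredParamWarnings` are hypothetical names that are not part of the PR, and the returned strings stand in for what `logWarning` would print.

```scala
// Hypothetical stand-in for the estimator's params; not Spark's actual classes.
final case class GlmParams(
    family: String,
    link: Option[String] = None,       // Some(...) when the user explicitly set `link`
    linkPower: Option[Double] = None)  // Some(...) when the user explicitly set `linkPower`

object LinkParamCheck {
  /** Returns one warning per param that is set but will be ignored; never throws. */
  def ignoredParamWarnings(p: GlmParams): Seq[String] = {
    if (p.family == "tweedie" && p.link.isDefined) {
      Seq("When family is tweedie, use param linkPower to specify link function. " +
        "Setting param link will take no effect.")
    } else if (p.family != "tweedie" && p.linkPower.isDefined) {
      Seq("When family is not tweedie, use param link to specify link function. " +
        "Setting param linkPower will take no effect.")
    } else {
      Seq.empty
    }
  }
}
```

Under this design a misplaced param is reported but does not abort fitting, which is the difference from the `require(...)` calls in the diff.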
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r95993832 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -613,25 +758,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine private[regression] object Link { /** - * Gets the [[Link]] object from its name. + * Gets the [[Link]] object based on link and linkPower. + * 1) if family is "tweedie", retrieve object using linkPower + * 2) otherwise, retrieve object based on link name * - * @param name link name: "identity", "logit", "log", - * "inverse", "probit", "cloglog" or "sqrt". + * @param params the parameter map containing link and link power */ -def fromName(name: String): Link = { - name match { -case Identity.name => Identity -case Logit.name => Logit -case Log.name => Log -case Inverse.name => Inverse -case Probit.name => Probit -case CLogLog.name => CLogLog -case Sqrt.name => Sqrt +def fromParams(params: GeneralizedLinearRegressionBase): Link = { + if (params.getFamily == "tweedie") { +params.getLinkPower match { + case 0.0 => Log + case 1.0 => Identity + case -1.0 => Inverse + case 0.5 => Sqrt + case others => new PowerLink(others) +} + } else { +params.getLink match { + case Identity.name => Identity + case Logit.name => Logit + case Log.name => Log + case Inverse.name => Inverse + case Probit.name => Probit + case CLogLog.name => CLogLog + case Sqrt.name => Sqrt +} + } +} + } + + /** Power link function class */ + private[regression] class PowerLink(val linkPower: Double) --- End diff -- Rename ```PowerLink``` to ```Power```; the other link objects do not end with ```Link```.
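Editor's sketch: the `linkPower` dispatch quoted in this diff, written as a self-contained snippet that uses the `Power` name proposed in the review. The `Link` trait here is a simplified stand-in with only `link` and `unlink`; it is an assumption for illustration and not Spark's actual class hierarchy.

```scala
object PowerLinkSketch {
  // Simplified stand-in for the Link hierarchy (assumption, not Spark's API).
  sealed trait Link {
    def link(mu: Double): Double     // mean -> linear predictor
    def unlink(eta: Double): Double  // linear predictor -> mean
  }
  case object Log extends Link {
    def link(mu: Double): Double = math.log(mu)
    def unlink(eta: Double): Double = math.exp(eta)
  }
  case object Identity extends Link {
    def link(mu: Double): Double = mu
    def unlink(eta: Double): Double = eta
  }
  case object Inverse extends Link {
    def link(mu: Double): Double = 1.0 / mu
    def unlink(eta: Double): Double = 1.0 / eta
  }
  case object Sqrt extends Link {
    def link(mu: Double): Double = math.sqrt(mu)
    def unlink(eta: Double): Double = eta * eta
  }
  // General power link; linkPower must be non-zero here (0 is handled by Log).
  final case class Power(linkPower: Double) extends Link {
    def link(mu: Double): Double = math.pow(mu, linkPower)
    def unlink(eta: Double): Double = math.pow(eta, 1.0 / linkPower)
  }

  /** Maps the special link powers to the named links, everything else to Power. */
  def fromLinkPower(linkPower: Double): Link = linkPower match {
    case 0.0 => Log
    case 1.0 => Identity
    case -1.0 => Inverse
    case 0.5 => Sqrt
    case others => Power(others)
  }
}
```

Routing the four special powers to the existing objects keeps results identical to the non-tweedie families that already use those links.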
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
GitHub user actuaryzhang reopened a pull request: https://github.com/apache/spark/pull/16344

[SPARK-18929][ML] Add Tweedie distribution in GLM

## What changes were proposed in this pull request?

I propose to add the full Tweedie family to the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. The currently supported Gaussian, Poisson and Gamma families are special cases of the Tweedie family (https://en.wikipedia.org/wiki/Tweedie_distribution).

@yanboliang @srowen @sethah

I propose to add support for the other distributions:
- compound Poisson: 1 < varPower < 2. This one is widely used to model zero-inflated continuous distributions, e.g., in insurance, finance, ecology, meteorology, advertising etc.
- positive stable: varPower > 2 and varPower != 3. Used to model extreme values.
- inverse Gaussian: varPower = 3.

The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc.

Changes made:
- Allow `tweedie` in family. Only `identity` and `log` links are allowed for now.
- Add `varPower` to `GeneralizedLinearRegressionBase`, which takes values in (1, 2) and (2, infty). Also set default value to 1.5 and add getter method.
- Add `Tweedie` class
- Add tests for tweedie GLM

Note:
- In computing deviance, use `math.max(y, 0.1)` to avoid taking the inverse of 0. This is the same as in R: `tweedie()$dev.res`
- `aic` is not supported in this PR because the evaluation of the [Tweedie density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in these cases is non-trivial. I will implement the density approximation method in a future PR. R returns `null` (see `tweedie()$aic`).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark tweedie

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16344.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16344

commit 952887e485fb0d5fa669b3b4c9289b8069ee7769
Author: actuaryzhang
Date: 2016-12-16T00:50:51Z

    Add Tweedie family to GLM

commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3
Author: actuaryzhang
Date: 2016-12-19T22:50:02Z

    Fix calculation in dev resid; Add test for different var power

commit 7fe39106332663d3671b94a8ffac48ca61c48470
Author: actuaryzhang
Date: 2016-12-19T23:14:37Z

    Merge test into GLR

commit bfcc4fb08d54156efc66b90d14c62ea7ff172afa
Author: actuaryzhang
Date: 2016-12-20T22:59:05Z

    Use Tweedie class instead of global object Tweedie; change variancePower to varPower

commit a8feea7d8095170c1b5f18b7887f16a6d763e42c
Author: actuaryzhang
Date: 2016-12-21T23:42:40Z

    Allow Family to use GLRBase object directly

commit 233e2d338be8d36a74eaf578bfea804ae3617d4e
Author: actuaryzhang
Date: 2016-12-22T01:56:34Z

    Add TweedieFamily and implement specific distn within Tweedie

commit 17c55816c914bc96a8b5141356e3c117f343f303
Author: actuaryzhang
Date: 2016-12-22T04:39:54Z

    Clean up doc

commit 0b41825e99020976a34d8fe9c983f26de6c8c40f
Author: actuaryzhang
Date: 2016-12-22T17:52:01Z

    Move defaultLink and name to subclass of TweedieFamily

commit 6e8e60771afb4abe43e47c7fe186bad1541a8fac
Author: actuaryzhang
Date: 2016-12-22T18:10:51Z

    Change style for AIC

commit 8d7d34e258f9c7c03c80754d837ce847fcb0526e
Author: actuaryzhang
Date: 2016-12-23T19:10:20Z

    Rename Family methods and restore methods for tweedie subclasses

commit 6da7e3068e2c45a0faf7ff35c10b2750784d765e
Author: actuaryzhang
Date: 2016-12-23T19:12:25Z

    Update test

commit 9a71e89f629260c775922901a04c989f36ea4946
Author: actuaryzhang
Date: 2016-12-27T17:16:40Z

    Clean up doc

commit f461c09e65360f695ad3092b41bc26e0c61bbd95
Author: actuaryzhang
Date: 2016-12-27T22:18:39Z

    Put delta in Tweedie companion object

commit a839c4631dd17c4f3d0a0cc99e1b0af81419dda4
Author: actuaryzhang
Date: 2016-12-27T22:23:57Z

    Clean up doc

commit fab265278109eede4cce7ee506e8b29d481c4549
Author: actuaryzhang
Date: 2017-01-05T19:32:06Z

    Allow more link functions in tweedie
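Editor's sketch: the PR description above relies on the power variance function and on a deviance that guards against `y = 0`. The snippet below is one way to write both, assuming the standard Tweedie unit deviance. The names `TweedieDevianceSketch` and `yp` are hypothetical, and applying the `math.max(y, delta)` guard only for `1 <= variancePower < 2` is an assumption made here, not something the description states.

```scala
object TweedieDevianceSketch {
  // Guard value mentioned in the PR description (math.max(y, 0.1)).
  val delta: Double = 0.1

  /** Power variance function: Var(Y) is proportional to mu^variancePower. */
  def variance(mu: Double, variancePower: Double): Double = math.pow(mu, variancePower)

  // (y^p - mu^p) / p, with the p -> 0 limit log(y / mu).
  private def yp(y: Double, mu: Double, p: Double): Double =
    if (p == 0.0) math.log(y / mu) else (math.pow(y, p) - math.pow(mu, p)) / p

  /** Weighted unit deviance for a single observation. */
  def deviance(y: Double, mu: Double, weight: Double, variancePower: Double): Double = {
    // Assumption: only the families that allow y = 0 need the guard.
    val y1 =
      if (variancePower >= 1.0 && variancePower < 2.0) math.max(y, delta) else y
    2.0 * weight *
      (y * yp(y1, mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
  }
}
```

With `variancePower` set to 0, 1 or 2 this expression reduces to the usual Gaussian, Poisson and Gamma deviances, which is a convenient sanity check on the formula.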
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/16344 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849556 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param params the parameter map containing family name and variance power */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromParams(params: GeneralizedLinearRegressionBase): Family = { + params.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + params.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new Tweedie(default) --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849540 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -365,7 +401,6 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** * A description of the error distribution to be used in the model. * - * @param name the name of the family. --- End diff -- Sorry. added this back. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849501 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -158,6 +183,16 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val setDefault(family -> Gaussian.name) /** +* Sets the value of param [[variancePower]]. +* Used only when family is "tweedie". +* +* @group setParam +*/ + @Since("2.2.0") + def setVariancePower(value: Double): this.type = set(variancePower, value) + setDefault(variancePower -> 1.5) --- End diff -- Done. Changed the default variancePower to 0.0, which uses Gaussian (with the default identity link).
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94773809 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param params the parameter map containing family name and variance power */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromParams(params: GeneralizedLinearRegressionBase): Family = { + params.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + params.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new Tweedie(default) + } + } +} + } + + /** +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class Tweedie(private val variancePower: Double) +extends Family("tweedie") { + +/* + The canonical link is 1 - variancePower, which becomes Identity for Gaussian, + Log for Poisson, and Inverse for Gamma. Except for these special cases, + the canonical link is rarely used. For example, the canonical link is 1/Sqrt + when variancePower = 1.5. We set Log as the default link, which may be overridden + in subclasses. 
+*/ +override val defaultLink: Link = Log --- End diff -- See my above comments for default link. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94762690 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -365,7 +401,6 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** * A description of the error distribution to be used in the model. * - * @param name the name of the family. --- End diff -- Why remove this line? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94772856 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param params the parameter map containing family name and variance power */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromParams(params: GeneralizedLinearRegressionBase): Family = { + params.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + params.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new Tweedie(default) --- End diff -- ```default``` -> ```others``` should be more clear? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94771634 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -158,6 +183,16 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val setDefault(family -> Gaussian.name) /** +* Sets the value of param [[variancePower]]. +* Used only when family is "tweedie". +* +* @group setParam +*/ + @Since("2.2.0") + def setVariancePower(value: Double): this.type = set(variancePower, value) + setDefault(variancePower -> 1.5) --- End diff -- Why set the default value to 1.5? AFAIK, R sets the default ```variancePower``` to 0, which means the Gaussian family, with identity as the default link function. ``` glm(formula = "b ~ .", family = tweedie, data = df, weights = w) ``` produces the same model as ``` glm(formula = "b ~ .", family = gaussian, data = df, weights = w) ``` [h2o.glm](https://rdrr.io/cran/h2o/man/h2o.glm.html) uses the same default values as R. Should we stay consistent with them?
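Editor's sketch: the effect of the default discussed here can be seen in a stripped-down version of the family dispatch quoted elsewhere in this thread. With a default `variancePower` of 0.0, a `tweedie` fit with no further settings resolves to Gaussian. The `Family` trait and the two-argument `fromParams` below are simplified stand-ins (assumptions), not Spark's actual signatures.

```scala
object FamilySketch {
  // Simplified stand-in for the Family hierarchy (assumption, not Spark's API).
  sealed trait Family { def name: String }
  case object Gaussian extends Family { val name = "gaussian" }
  case object Binomial extends Family { val name = "binomial" }
  case object Poisson extends Family { val name = "poisson" }
  case object Gamma extends Family { val name = "gamma" }
  final case class Tweedie(variancePower: Double) extends Family { val name = "tweedie" }

  // Default suggested in the review, matching R's behaviour as described there.
  val defaultVariancePower: Double = 0.0

  /** Resolves the family name, then the variance power when the name is "tweedie". */
  def fromParams(family: String, variancePower: Double = defaultVariancePower): Family =
    family match {
      case "gaussian" => Gaussian
      case "binomial" => Binomial
      case "poisson" => Poisson
      case "gamma" => Gamma
      case "tweedie" =>
        variancePower match {
          case 0.0 => Gaussian
          case 1.0 => Poisson
          case 2.0 => Gamma
          case others => Tweedie(others)
        }
      case unknown =>
        throw new IllegalArgumentException(s"Unsupported family: $unknown")
    }
}
```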
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94773283 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -128,13 +152,14 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam * Generalized linear model (Wikipedia)) * specified by giving a symbolic description of the linear * predictor (link function) and a description of the error distribution (family). - * It supports "gaussian", "binomial", "poisson" and "gamma" as family. + * It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. * Valid link functions for each family is listed below. The first link function of each family * is the default one. * - "gaussian" : "identity", "log", "inverse" * - "binomial" : "logit", "probit", "cloglog" * - "poisson" : "log", "identity", "sqrt" * - "gamma": "inverse", "identity", "log" + * - "tweedie" : "log", "identity" --- End diff -- The default link for the tweedie family is identity in R and H2O; I think we should stay consistent with them. See my comments at L193. BTW, we can expose a param ```linkPower``` (with ```1.0 - variancePower``` as the default value) to support link functions other than ```log``` and ```identity```. The link function for the "tweedie" family should be: ``` def link(mu: Double): Double = math.pow(mu, linkPower) ``` I think we should generate the ```Link``` object from the value of ```linkPower``` when ```family``` is set to "tweedie". We can follow the approach used for ```family``` and generate the ```link``` object by defining a function like: ``` private[regression] object Link { def fromParams(params: GeneralizedLinearRegressionBase): Link = { .. } } ```
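Editor's sketch: a worked example of the `1.0 - variancePower` default proposed in this comment. The helper names are hypothetical, and treating `linkPower == 0.0` as the log link follows the Log/Identity/Inverse/Sqrt correspondence quoted elsewhere in this thread.

```scala
object DefaultLinkPowerSketch {
  /** Default link power suggested in the review: 1.0 - variancePower. */
  def defaultLinkPower(variancePower: Double): Double = 1.0 - variancePower

  /** Power link; a link power of 0 is read as the log link. */
  def link(mu: Double, linkPower: Double): Double =
    if (linkPower == 0.0) math.log(mu) else math.pow(mu, linkPower)
}
```

For example, `variancePower` values of 0, 1 and 2 give default link powers of 1, 0 and -1 (identity, log and inverse), while 1.5 gives -0.5, the "1/Sqrt" link mentioned in the code comment under review.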
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94764683 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,32 +432,121 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param params the parameter map containing family name and variance power */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromParams(params: GeneralizedLinearRegressionBase): Family = { + params.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + params.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new Tweedie(default) + } + } +} + } + + /** +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class Tweedie(private val variancePower: Double) +extends Family("tweedie") { + +/* + The canonical link is 1 - variancePower, which becomes Identity for Gaussian, + Log for Poisson, and Inverse for Gamma. Except for these special cases, + the canonical link is rarely used. For example, the canonical link is 1/Sqrt + when variancePower = 1.5. We set Log as the default link, which may be overridden + in subclasses. 
+*/ +override val defaultLink: Link = Log + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower >= 1.0 && variancePower < 2.0) { +require(y >= 0.0, s"The response variable of $name($variancePower) family " + + s"should be non-negative, but got $y") + } else if (variancePower >= 2.0) { +require(y > 0.0, s"The response variable of $name($variancePower) family " + + s"should be non-negative, but got $y") --- End diff -- ```y > 0.0``` means ```positive``` rather than ```non-negative```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
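Editor's sketch: the response check from this diff with the wording the reviewer asks for, so that the `y > 0.0` branch says "positive". It is a standalone illustration; `TweedieResponseCheck` and `validate` are hypothetical names, and the real check lives inside `initialize` in the PR.

```scala
object TweedieResponseCheck {
  /** Throws IllegalArgumentException when y is outside the family's support. */
  def validate(y: Double, variancePower: Double): Unit = {
    if (variancePower >= 1.0 && variancePower < 2.0) {
      // Poisson and compound Poisson allow exact zeros.
      require(y >= 0.0, s"The response variable of tweedie($variancePower) family " +
        s"should be non-negative, but got $y")
    } else if (variancePower >= 2.0) {
      // Gamma, inverse Gaussian and positive stable need strictly positive y.
      require(y > 0.0, s"The response variable of tweedie($variancePower) family " +
        s"should be positive, but got $y")
    }
    // variancePower == 0.0 (Gaussian) places no restriction on y.
  }
}
```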
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93971228 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,21 +435,102 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param params a GenerealizedLinearRegressionBase object --- End diff -- typo in GenerealizedLinearRegressionBase; this type of doc doesn't do anything though because the type is already documented. It should say something non-trivial --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93971681 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, +"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog, +"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt, +"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log, +"tweedie" -> Identity, "tweedie" -> Log ) /** Set of family names that GeneralizedLinearRegression supports. */ - private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1.name) + private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1) /** Set of link names that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) private[regression] val epsilon: Double = 1E-16 + /** Constant used in initialization and deviance to avoid numerical issues. */ + private[regression] val delta: Double = 0.1 --- End diff -- Why not a companion object for ```TweedieFamily```, though? That seems easy and more correct.
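For illustration, the companion-object suggestion could look roughly like the sketch below. It is a simplified stand-in, not the PR's code: the wrapper `CompanionSketch` is hypothetical, the `Family` superclass and the `private[regression]` modifiers are left out, and only `initialize` is shown.

```scala
object CompanionSketch {
  // Hypothetical, simplified stand-in for the PR's TweedieFamily class.
  class TweedieFamily(val variancePower: Double) {
    import TweedieFamily.delta

    // Replace an exact zero response with delta so that later
    // log and power computations stay finite.
    def initialize(y: Double, weight: Double): Double =
      if (y == 0.0) delta else y
  }

  // Companion object: the constant lives next to the only class that uses it.
  object TweedieFamily {
    /** Constant used in initialization and deviance to avoid numerical issues. */
    val delta: Double = 0.1
  }
}
```

Keeping `delta` in the companion scopes it to the Tweedie code instead of the whole `GeneralizedLinearRegression` object, which seems to be the point of the comment.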
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93776808 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) +extends Family("tweedie") { + +/* + The canonical link is 1 - variancePower. Except for the special cases of Gaussian, + Poisson and Gamma, the canonical link is rarely used. Set Log as the default link. 
+*/ +override val defaultLink: Link = Log -val defaultLink: Link = Identity +override def initialize(y: Double, weight: Double): Double = { + if (variancePower >= 1.0 && variancePower < 2.0) { +require(y >= 0.0, s"The response variable of the specified distribution " + + s"should be non-negative, but got $y") + } else if (variancePower >= 2.0) { +require(y > 0.0, s"The response variable of the specified distribution " + --- End diff -- Ditto.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93778762 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, --- End diff -- Yeah, you are right. Avoiding the error-prone strings would need an extra global object, which may be a little expensive. I'm OK with using strings.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93773154 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,28 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides + * the relationship between the variance and mean of the distribution. + * Used only for the Tweedie family. + * (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution"> + * Tweedie Distribution (Wikipedia)</a>) + * Supported value: 0 and [1, Inf). Note that when the value of the variance power is + * 0, 1, or 2, the Gaussian, Poisson or Gamma family is used, respectively. + * + * @group param + */ + @Since("2.2.0") + final val variancePower: Param[Double] = new Param(this, "variancePower", +"The power in the variance function of the Tweedie distribution which characterizes " + +"the relationship between the variance and mean of the distribution. " + +"Used for the Tweedie family. Supported value: 0 and [1, Inf).", --- End diff -- ```Used only for``` would be clearer?
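As a small aside on the documented range "0 and [1, Inf)": the check can be written as a plain predicate. This is only a sketch; `VariancePowerSketch` and `isSupported` are hypothetical names, and the snippet does not use or reproduce Spark's `ParamValidators`. Excluding positive infinity is my reading of the half-open interval, not something the diff states.

```scala
object VariancePowerSketch {
  // Accepts exactly 0 (Gaussian) or any finite value >= 1.
  // Rejects negatives, the open interval (0, 1), NaN and infinity.
  def isSupported(p: Double): Boolean =
    p == 0.0 || (p >= 1.0 && !p.isInfinity)
}
```

For example, `isSupported(1.5)` holds (compound Poisson range), while `isSupported(0.5)` does not.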
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93774215 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { --- End diff -- Rename to ```fromParams```: we extract ```family``` and ```variancePower``` from the ```Params```, which is the superclass of both the GLR estimator and the model. We actually use this function for both the estimator (L279) and the model (L974).
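A rough sketch of what the renamed ```fromParams``` might look like, assuming a minimal stand-in for the shared params trait. `FromParamsSketch`, `FamilyParams` and the trimmed-down `Family` hierarchy are hypothetical; only the match logic follows the diff, and the final fallback case is an addition of mine.

```scala
object FromParamsSketch {
  // Hypothetical stand-in for GeneralizedLinearRegressionBase:
  // just the two getters the lookup needs.
  trait FamilyParams {
    def getFamily: String
    def getVariancePower: Double
  }

  sealed abstract class Family(val name: String)
  case object Gaussian extends Family("gaussian")
  case object Binomial extends Family("binomial")
  case object Poisson extends Family("poisson")
  case object Gamma extends Family("gamma")
  final class Tweedie(val variancePower: Double) extends Family("tweedie")

  // Resolve by family name first; for "tweedie", resolve by variancePower,
  // mapping 0, 1 and 2 to the existing special-case families.
  def fromParams(params: FamilyParams): Family = params.getFamily match {
    case "gaussian" => Gaussian
    case "binomial" => Binomial
    case "poisson" => Poisson
    case "gamma" => Gamma
    case "tweedie" =>
      params.getVariancePower match {
        case 0.0 => Gaussian
        case 1.0 => Poisson
        case 2.0 => Gamma
        case p => new Tweedie(p)
      }
    // Not in the diff: fail loudly instead of throwing a bare MatchError.
    case other => throw new IllegalArgumentException(s"Unsupported family: $other")
  }
}
```

Because the function only needs the two getters, the same call works for an estimator and for a fitted model, which is the motivation given for the rename.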
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93779154 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -578,6 +578,100 @@ class GeneralizedLinearRegressionSuite } } + test("generalized linear regression: tweedie family against glm") { +/* +R code: --- End diff -- Add ```library(statmod)```, which helps users reproduce this test case.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93778869 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, +"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog, +"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt, +"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log, +"tweedie" -> Identity, "tweedie" -> Log ) /** Set of family names that GeneralizedLinearRegression supports. */ - private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1.name) + private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1) /** Set of link names that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) private[regression] val epsilon: Double = 1E-16 + /** Constant used in initialization and deviance to avoid numerical issues. */ + private[regression] val delta: Double = 0.1 --- End diff -- I'm OK with putting it here.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93775431 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) --- End diff -- ```TweedieFamily``` -> ```Tweedie```; we don't add a suffix for the other family classes/objects.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93777486 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) +extends Family("tweedie") { + +/* + The canonical link is 1 - variancePower. Except for the special cases of Gaussian, + Poisson and Gamma, the canonical link is rarely used. Set Log as the default link. 
+*/ +override val defaultLink: Link = Log -val defaultLink: Link = Identity +override def initialize(y: Double, weight: Double): Double = { + if (variancePower >= 1.0 && variancePower < 2.0) { +require(y >= 0.0, s"The response variable of the specified distribution " + + s"should be non-negative, but got $y") + } else if (variancePower >= 2.0) { +require(y > 0.0, s"The response variable of the specified distribution " + + s"should be non-negative, but got $y") + } + if (y == 0) delta else y +} -override def initialize(y: Double, weight: Double): Double = y +override def variance(mu: Double): Double = math.pow(mu, variancePower) -override def variance(mu: Double): Double = 1.0 +private def yp(y: Double, mu: Double, p: Double): Double = { + if (p == 0) { +math.log(y / mu) + } else { +(math.pow(y, p) - math.pow(mu, p)) / p + } +} override def deviance(y: Double, mu: Double, weight: Double): Double = { - weight * (y - mu) * (y - mu) + // Force y >= delta for Poisson or compound Poisson + val y1 = if (variancePower >= 1.0 && variancePower < 2.0) { +math.max(y, delta) + } else { +y + } + 2.0 * weight * +(y * yp(y1, mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower)) } override def aic( predictions: RDD[(Double, Double, Double)], deviance: Double, numInstances: Double, weightSum: Double): Double = { + /* + This depends on the density of the Tweedie distribution. + Only implemented for Gaussian, Poisson and Gamma at this point. + */ + throw new UnsupportedOperationException("No AIC available for the tweedie family") +} + +override def project(mu: Double): Double = { + if (mu < epsilon) { +epsilon + } else if (mu.isInfinity) { +Double.MaxValue + } else { +mu + } +} + } + + /** + * Gaussian exponential family distribution. + * The default link for the Gaussian family is the identity link. + */ + private[regression] object Gaussian extends TweedieFamily(0.0) { --- End
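To make the deviance in the diff above easier to check, here is a standalone sketch of the same formula. `TweedieDevianceSketch` is a hypothetical wrapper, and passing `variancePower` as an argument (rather than reading a class field) is an assumption made to keep the snippet self-contained; `yp`, `delta` and the arithmetic follow the diff.

```scala
object TweedieDevianceSketch {
  // Constant from the diff, used to keep log(y / mu) finite when y == 0.
  val delta: Double = 0.1

  // (y^p - mu^p) / p, with the p -> 0 limit log(y / mu).
  private def yp(y: Double, mu: Double, p: Double): Double =
    if (p == 0) math.log(y / mu)
    else (math.pow(y, p) - math.pow(mu, p)) / p

  def deviance(y: Double, mu: Double, weight: Double, variancePower: Double): Double = {
    // Force y >= delta for Poisson or compound Poisson.
    val y1 =
      if (variancePower >= 1.0 && variancePower < 2.0) math.max(y, delta) else y
    2.0 * weight *
      (y * yp(y1, mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
  }
}
```

Expanding the expression by hand, variancePower = 0 gives weight * (y - mu)^2 (Gaussian), 1 gives 2 * weight * (y * log(y / mu) - (y - mu)) (Poisson), and 2 gives 2 * weight * ((y - mu) / mu - log(y / mu)) (Gamma), which is consistent with treating those three families as special cases.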
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93776723 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) +extends Family("tweedie") { + +/* + The canonical link is 1 - variancePower. Except for the special cases of Gaussian, + Poisson and Gamma, the canonical link is rarely used. Set Log as the default link. 
+*/ +override val defaultLink: Link = Log -val defaultLink: Link = Identity +override def initialize(y: Double, weight: Double): Double = { + if (variancePower >= 1.0 && variancePower < 2.0) { +require(y >= 0.0, s"The response variable of the specified distribution " + --- End diff -- ```The response variable of $name family ```
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93773858 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -128,13 +151,14 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam * Generalized linear model (Wikipedia)) * specified by giving a symbolic description of the linear * predictor (link function) and a description of the error distribution (family). - * It supports "gaussian", "binomial", "poisson" and "gamma" as family. + * It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. * Valid link functions for each family is listed below. The first link function of each family * is the default one. * - "gaussian" : "identity", "log", "inverse" * - "binomial" : "logit", "probit", "cloglog" * - "poisson" : "log", "identity", "sqrt" * - "gamma": "inverse", "identity", "log" + * - "tweedie" : "identity", "log" --- End diff -- Use ```- "tweedie" : "log", "identity"``` instead; see L155: ```the first link function of each family is the default one```.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93775763 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,46 +434,118 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) +extends Family("tweedie") { + +/* + The canonical link is 1 - variancePower. Except for the special cases of Gaussian, --- End diff -- ```The canonical link is 1 - variancePower```: could you clarify this?
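One way to read the statement under discussion, sketched with the standard exponential-dispersion relation between the canonical parameter and the variance function. Here p stands for variancePower; the symbols θ and V are my notation, not the PR's.

```latex
% Variance function of the Tweedie family with power p:
V(\mu) = \mu^{p}

% The canonical parameter satisfies d\theta / d\mu = 1 / V(\mu), so
\theta(\mu) = \int \frac{d\mu}{V(\mu)} =
\begin{cases}
  \dfrac{\mu^{\,1-p}}{1-p}, & p \neq 1, \\[6pt]
  \log \mu,                 & p = 1.
\end{cases}

% Special cases: p = 0 gives \theta = \mu (identity link),
% p = 1 gives \theta = \log\mu (log link),
% p = 2 gives \theta = -1/\mu (inverse link, up to sign).
```

So "1 - variancePower" is presumably the exponent of a power link, that is, the canonical link is proportional to μ raised to 1 - variancePower, with the log link as the limiting case at variancePower = 1.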
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93672741 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, +"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog, +"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt, +"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log, +"tweedie" -> Identity, "tweedie" -> Log ) /** Set of family names that GeneralizedLinearRegression supports. */ - private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1.name) + private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1) /** Set of link names that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) private[regression] val epsilon: Double = 1E-16 + /** Constant used in initialization and deviance to avoid numerical issues. */ + private[regression] val delta: Double = 0.1 --- End diff -- They are already in the `GeneralizedLinearRegression` object, aren't they? Or do you mean creating a new object say `Constant` that stores these two constants, and using them like `Constant.delta`? Since `delta` is only used in the `TweedieFamily` class, I can also move it there. Let me know what is best. Thanks. 
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93612915 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,49 +436,132 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) +extends Family{ + +val name: String = variancePower match { + case 0.0 => "gaussian" + case 1.0 => "poisson" + case 2.0 => "gamma" + case default => "tweedie" +} +/* + The canonical link is 1 - variancePower. 
Except for the special cases of Gaussian, + Poisson and Gamma, the canonical link is rarely used. Set Log as the default link. +*/ +val defaultLink: Link = variancePower match { + case 0.0 => Identity + case 1.0 => Log + case 2.0 => Inverse + case _ => Log +} -val defaultLink: Link = Identity +override def initialize(y: Double, weight: Double): Double = { + if (variancePower >= 1.0 && variancePower < 2.0) { +require(y >= 0.0, s"The response variable of the specified $name distribution " + + s"should be non-negative, but got $y") + } else if (variancePower >= 2.0) { +require(y > 0.0, s"The response variable of the specified $name distribution " + + s"should be non-negative, but got $y") + } + if (y == 0) delta else y +} -override def initialize(y: Double, weight: Double): Double = y +override def variance(mu: Double): Double = { --- End diff -- Instead of case statements like this, why not just override in subclasses?
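The "override in subclasses" idea could be sketched as follows. This is a simplified, hypothetical hierarchy (`SubclassOverrideSketch` is not from the PR, and the real classes carry more members such as `defaultLink` and `deviance`); it only shows `variance` moving out of a `match` and into overrides.

```scala
object SubclassOverrideSketch {
  // General case: variance is mu raised to variancePower.
  class Tweedie(val variancePower: Double) {
    def variance(mu: Double): Double = math.pow(mu, variancePower)
  }

  // Special cases override the method with their closed forms
  // instead of being handled by a match on variancePower.
  object Gaussian extends Tweedie(0.0) {
    override def variance(mu: Double): Double = 1.0
  }

  object Poisson extends Tweedie(1.0) {
    override def variance(mu: Double): Double = mu
  }

  object Gamma extends Tweedie(2.0) {
    override def variance(mu: Double): Double = mu * mu
  }
}
```

With this layout the base class stays free of special-case branches, and each family keeps its own formula next to its name.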
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93612965 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,49 +436,132 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) +extends Family{ + +val name: String = variancePower match { + case 0.0 => "gaussian" + case 1.0 => "poisson" + case 2.0 => "gamma" + case default => "tweedie" +} +/* + The canonical link is 1 - variancePower. 
Except for the special cases of Gaussian, + Poisson and Gamma, the canonical link is rarely used. Set Log as the default link. +*/ +val defaultLink: Link = variancePower match { + case 0.0 => Identity + case 1.0 => Log + case 2.0 => Inverse + case _ => Log +} -val defaultLink: Link = Identity +override def initialize(y: Double, weight: Double): Double = { + if (variancePower >= 1.0 && variancePower < 2.0) { +require(y >= 0.0, s"The response variable of the specified $name distribution " + + s"should be non-negative, but got $y") + } else if (variancePower >= 2.0) { +require(y > 0.0, s"The response variable of the specified $name distribution " + + s"should be non-negative, but got $y") + } + if (y == 0) delta else y +} -override def initialize(y: Double, weight: Double): Double = y +override def variance(mu: Double): Double = { + variancePower match { +case 0.0 => 1.0 +case 1.0 => mu +case 2.0 => mu * mu +case default => math.pow(mu, default) + } +} -override def variance(mu: Double): Double = 1.0 +private def yp(y: Double, mu: Double, p: Double): Double = { + if (p == 0) { +math.log(y / mu) + } else { +(math.pow(y, p) - math.pow(mu, p)) / p + } +} override def deviance(y: Double, mu: Double, weight: Double): Double = { - weight * (y - mu) * (y - mu) + // Force y >= delta for Poisson or compound Poisson + val y1 = if (variancePower >= 1.0 && variancePower < 2.0) math.max(y, delta) else y + 2.0 * weight * +(y * yp(y1, mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower)) } -override def aic( -predictions: RDD[(Double, Double, Double)], -deviance: Double, -numInstances: Double, -weightSum: Double): Double = { - val wt = predictions.map(x => math.log(x._3)).sum() - numInstances * (math.log(deviance / numInstances * 2.0 * math.Pi) + 1.0) + 2.0 - wt +override def aic(predictions: RDD[(Double, Double, Double)], --- End diff -- Likewise there's not a lot of value in pushing 4 separate
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93612097 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,20 +337,24 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, +"binomial" -> Logit, "binomial" -> Probit, "binomial" -> CLogLog, +"poisson" -> Log, "poisson" -> Identity, "poisson" -> Sqrt, +"gamma" -> Inverse, "gamma" -> Identity, "gamma" -> Log, +"tweedie" -> Identity, "tweedie" -> Log ) /** Set of family names that GeneralizedLinearRegression supports. */ - private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1.name) + private[regression] lazy val supportedFamilyNames = supportedFamilyAndLinkPairs.map(_._1) /** Set of link names that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) private[regression] val epsilon: Double = 1E-16 + /** Constant used in initialization and deviance to avoid numerical issues. */ + private[regression] val delta: Double = 0.1 --- End diff -- This should still be in an `object` IMHO; it's a constant, right? `epsilon` really should be too. It's not a big deal, but it's not quite right.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93613064 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,49 +436,132 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + } private[regression] object Family { /** - * Gets the [[Family]] object from its name. + * Gets the [[Family]] object based on family and variancePower. + * 1) retrieve object based on family name + * 2) if family name is tweedie, retrieve object based on variancePower * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param model a GenerealizedLinearRegressionBase object */ -def fromName(name: String): Family = { - name match { -case Gaussian.name => Gaussian -case Binomial.name => Binomial -case Poisson.name => Poisson -case Gamma.name => Gamma +def fromModel(model: GeneralizedLinearRegressionBase): Family = { + model.getFamily match { +case "gaussian" => Gaussian +case "binomial" => Binomial +case "poisson" => Poisson +case "gamma" => Gamma +case "tweedie" => + model.getVariancePower match { +case 0.0 => Gaussian +case 1.0 => Poisson +case 2.0 => Gamma +case default => new TweedieFamily(default) + } } } } /** - * Gaussian exponential family distribution. - * The default link for the Gaussian family is the identity link. - */ - private[regression] object Gaussian extends Family("gaussian") { +* Tweedie exponential family distribution. +* This includes the special cases of Gaussian, Poisson and Gamma. +*/ + private[regression] class TweedieFamily(private val variancePower: Double) +extends Family{ + +val name: String = variancePower match { --- End diff -- Why remove the name and switch like this? you can instead adjust this so that subclasses override a `name` method. 
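The suggestion in this comment, keeping `name` on the base class and letting each family supply it, could look roughly like the sketch below. This is only an illustration under assumptions: `FamilySketch`, `GaussianSketch`, and `TweedieSketch` are hypothetical names, not the classes in the PR, and the real `Family` has more members.

```scala
// Hypothetical sketch: `name` is an abstract member that each family overrides,
// rather than being removed from the base class and computed by a match.
abstract class FamilySketch {
  def name: String
}

// A fixed family can stay a singleton with a constant name.
object GaussianSketch extends FamilySketch {
  override val name: String = "gaussian"
}

// A parameterized family is a class; its name does not depend on the power.
class TweedieSketch(val variancePower: Double) extends FamilySketch {
  override val name: String = "tweedie"
}
```

With this shape, code that only needs the family name can keep calling `family.name` without knowing whether the family is a singleton or an instance.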
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93565567 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, --- End diff -- @yanboliang Could you help me understand the issue caused by using strings? If I use objects, then I have to create a Tweedie object that is not used anywhere else. I also have to write two methods in `Family`: one that returns the global Tweedie object (where the variancePower is preset) and one that returns a TweedieFamily object created using the user-specified variancePower. I hope we are fine using strings since there are only a few values.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93565335 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides --- End diff -- Changed tweedie, but other docs have been using Param.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93488131 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { -val familyObj = Family.fromName($(family)) +val familyObj = if ($(family) == "tweedie") { + new Tweedie($(varPower)) --- End diff -- Yes of course. Given how the code is structured, one straightforward solution is to generalize `Family` so that its operations take a reference to the model, so that implementations may access its parameters. Another is to make the code instantiate `Family` subclasses instead of using single `object`s and give the instance a reference to the model.
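The second option described in this comment (instantiate the family per model and hand it a reference to that model) might be sketched as follows. Everything here is an assumption for illustration: `HasVariancePowerSketch` stands in for whatever parameter accessor the model exposes, and `TweedieFamilySketch` is not the class from the PR.

```scala
// Hypothetical stand-in for the parameters the model exposes.
trait HasVariancePowerSketch {
  def getVariancePower: Double
}

// One family instance per model: the instance reads the parameter from the
// model it was given instead of from shared mutable state.
class TweedieFamilySketch(model: HasVariancePowerSketch) {
  // Variance function V(mu) = mu^p, with p taken from the owning model.
  def variance(mu: Double): Double = math.pow(mu, model.getVariancePower)
}

object FamilyInstanceDemo {
  def main(args: Array[String]): Unit = {
    val modelA = new HasVariancePowerSketch { def getVariancePower: Double = 1.5 }
    val modelB = new HasVariancePowerSketch { def getVariancePower: Double = 3.0 }
    // Each family instance sees only its own model's parameter.
    println(new TweedieFamilySketch(modelA).variance(4.0))
    println(new TweedieFamilySketch(modelB).variance(2.0))
  }
}
```

The first option in the comment (passing the model into each `Family` operation) would instead add a model argument to methods such as `variance`; the trade-off is a wider method signature versus allocating one small object per fit.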
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93486159 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, --- End diff -- @yanboliang The member object (in `GeneralizedLinearRegressionBase`) won't be accessible in `Family`, right? The method `Family.fromName($(family))` uses global objects like `Poisson`, `Gamma` etc. To use `Family.fromName`, I need to create a `Tweedie` global object. Then we are back to the issue that @srowen pointed out of setting `variancePower` of the global object. Please advise. Thanks.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93482978 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { -val familyObj = Family.fromName($(family)) +val familyObj = if ($(family) == "tweedie") { + new Tweedie($(varPower)) --- End diff -- `Family` is not a subclass of `GeneralizedLinearRegression`. Could you elaborate on how to make it get the `variancePower` value?
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93420832 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,14 +436,19 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + +/** Constant added to y = 0 for initialization or deviance to avoid numerical issues. */ +val delta: Double = 0.1 } private[regression] object Family { /** * Gets the [[Family]] object from its name. + * This does not work for the tweedie family as it depends on the variance power + * that is set by the user. * - * @param name family name: "gaussian", "binomial", "poisson" or "gamma". + * @param name family name: "gaussian", "binomial", "poisson" and "gamma". --- End diff -- Nit: revert this, because it's really an "or".
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93420812 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { -val familyObj = Family.fromName($(family)) +val familyObj = if ($(family) == "tweedie") { + new Tweedie($(varPower)) --- End diff -- Hm, why does this parameter need to be in the `Family` object at all? Can't the implementation of Tweedie just go get the parameter's value? It's odd to have a Family representing all but one family, because Tweedie is one of them.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93420417 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides + * the relationship between the variance and mean of the distribution. + * Used only for the tweedie family. + * (see https://en.wikipedia.org/wiki/Tweedie_distribution + * Tweedie Distribution (Wikipedia)) + * Supported value: (1, 2) and (2, Inf). + * + * @group param + */ + @Since("2.2.0") + final val varPower: Param[Double] = new Param(this, "varPower", +"The power in the variance function of the Tweedie distribution which characterizes " + +"the relationship between the variance and mean of the distribution. " + +"Used only for the tweedie family. Supported value: (1, 2) and (2, Inf).", +(x: Double) => if (x > 1.0 && x != 2.0) true else false) --- End diff -- You can just write `=> x > 1.0 && x != 2.0`. `if (x) true else false` is redundant.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93420653 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,14 +436,19 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Trim the fitted value so that it will be in valid range. */ def project(mu: Double): Double = mu + +/** Constant added to y = 0 for initialization or deviance to avoid numerical issues. */ +val delta: Double = 0.1 --- End diff -- This should be defined in an `object`; it's a static constant. The comment isn't quite accurate; it's not added, but it's a minimum.
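To illustrate the two points in this comment (the constant lives in an `object`, and it acts as a floor rather than an offset), here is a small sketch. The names `GlmConstantsSketch` and `InitializeSketch` are hypothetical and not taken from the PR; only the value 0.1 comes from the diff.

```scala
// Hypothetical holder for shared numeric constants.
object GlmConstantsSketch {
  // Lower bound applied to y where a zero response would cause numerical issues.
  val delta: Double = 0.1
}

object InitializeSketch {
  import GlmConstantsSketch.delta

  // Floor semantics: values at or above delta pass through unchanged.
  def floorAtDelta(y: Double): Double = math.max(y, delta)

  // "Added" semantics, shown only for contrast: every value is shifted.
  def addDelta(y: Double): Double = y + delta
}
```

Under the floor semantics, only responses below `delta` are changed, which is why a doc comment saying the constant is "added" would be misleading.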
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93455400 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) + } else { +require(y > 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +y + } +} + +override def variance(mu: Double): Double = math.pow(mu, variancePower) + +private def yp(y: Double, mu: Double, p: Double): Double = { + (math.pow(y, p) - math.pow(mu, p)) / p +} + +// Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid +override def deviance(y: Double, mu: Double, weight: Double): Double = { + 2.0 * weight * +(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower)) +} + +// This depends on the density of the tweedie distribution. Not yet implemented. +override def aic( +predictions: RDD[(Double, Double, Double)], +deviance: Double, +numInstances: Double, +weightSum: Double): Double = { + 0.0 +} + +override def project(mu: Double): Double = { + if (mu < epsilon) { +epsilon + } else if (mu.isInfinity) { +Double.MaxValue --- End diff -- I see, it's done that way in other implementations. OK. I'm not sure if it's going to do much. 
I think there's a problem in the Gaussian project method because it uses Double.MinValue to appear to mean "the smallest double" when it's the "smallest possible double". I'll investigate and file a bug if needed.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93420479 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides --- End diff -- Nits: Param -> parameter, tweedie -> Tweedie (two lines below).
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93449402 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides + * the relationship between the variance and mean of the distribution. + * Used only for the tweedie family. + * (see https://en.wikipedia.org/wiki/Tweedie_distribution + * Tweedie Distribution (Wikipedia)) + * Supported value: (1, 2) and (2, Inf). --- End diff -- Question: Why don't we allow ```0, 1 and 2```? They correspond respectively to the ```Gaussian, Poisson and Gamma``` families. I think we should support fitting a Poisson GLM via the ```tweedie``` family entry point, as R can:
```
y <- rgamma(20,shape=5)
x <- 1:20
glm(y~x,family=tweedie(var.power=1,link.power=1))
glm(y~x,family=poisson(link=identity))
```
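The correspondence this question relies on can be seen directly in the variance function V(mu) = mu^p. The sketch below mirrors the `variance` and `name` matches quoted elsewhere in this thread; `TweedieVarianceSketch` and its method names are hypothetical.

```scala
object TweedieVarianceSketch {
  // Variance function V(mu) = mu^p for variance power p.
  def variance(mu: Double, p: Double): Double = math.pow(mu, p)

  // Family that a given variance power corresponds to.
  def familyFor(p: Double): String = p match {
    case 0.0 => "gaussian" // V(mu) = 1
    case 1.0 => "poisson"  // V(mu) = mu
    case 2.0 => "gamma"    // V(mu) = mu^2
    case _   => "tweedie"
  }
}
```

So accepting 0, 1 and 2 as variance powers does not add new models; it only gives a second way to reach the existing Gaussian, Poisson and Gamma variance functions.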
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93453273 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,14 +341,15 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine /** Set of family and link pairs that GeneralizedLinearRegression supports. */ private[regression] lazy val supportedFamilyAndLinkPairs = Set( -Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse, -Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, -Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, -Gamma -> Inverse, Gamma -> Identity, Gamma -> Log +"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, --- End diff -- Strings are error-prone. I think we can construct a member object for ```Tweedie``` whose ```variancePower``` is the default value (1.5).
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93447628 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides + * the relationship between the variance and mean of the distribution. + * Used only for the tweedie family. + * (see https://en.wikipedia.org/wiki/Tweedie_distribution + * Tweedie Distribution (Wikipedia)) + * Supported value: (1, 2) and (2, Inf). + * + * @group param + */ + @Since("2.2.0") + final val varPower: Param[Double] = new Param(this, "varPower", --- End diff -- I vote to revert this back to ```variancePower``` to follow MLlib's convention.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93290854 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -242,7 +275,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { -val familyObj = Family.fromName($(family)) +val familyObj = Family.fromName($(family), $(variancePower)) --- End diff -- I don't think we can do this either. variancePower is specific to one family, not a property of all of them.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93291042 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 --- End diff -- I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off variancePower. As it is, this is a global variable and two jobs would overwrite each other's values.
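The hazard described in this comment does not need real concurrency to show up. The sketch below reproduces it sequentially with a hypothetical singleton, `SharedTweedieSketch`, that has the same shape as the `object Tweedie` under review (a `var` on an `object`); it is an illustration, not the PR's code.

```scala
// Hypothetical singleton with mutable state, mirroring the shape under review.
object SharedTweedieSketch {
  var variancePower: Double = 1.5
  def variance(mu: Double): Double = math.pow(mu, variancePower)
}

object SharedStateDemo {
  def main(args: Array[String]): Unit = {
    // "Job A" configures the singleton for p = 1.0.
    SharedTweedieSketch.variancePower = 1.0
    // "Job B" then configures the same singleton for p = 2.0.
    SharedTweedieSketch.variancePower = 2.0
    // Job A now computes with Job B's setting: mu^2 instead of mu.
    println(SharedTweedieSketch.variance(3.0)) // prints 9.0, not 3.0
  }
}
```

Because a Scala `object` is one instance per JVM, any two fits running in the same driver share this field, which is why the thread moves toward per-model family instances.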
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93290858 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) --- End diff -- I have not seen a formal justification for the choice of 0.1 in R. This seminal [paper](http://users.du.se/~lrn/StatMod10/HomeExercise2/Nelder_Pregibon.pdf) suggests 1/6 (about 0.17) to be the best constant. I would prefer to be consistent with R so that we can make comparisons. Using a constant is a good idea.
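To make the role of this constant concrete, here is a sketch of the unit deviance based on the `yp` and `deviance` helpers quoted elsewhere in this thread. `TweedieDevianceSketch` is a hypothetical name, the variance power is passed as an argument purely for illustration, and `delta = 0.1` is the value discussed here rather than a recommendation.

```scala
object TweedieDevianceSketch {
  // Floor applied to y when 1 <= p < 2; 0.1 is the value used in this thread.
  val delta: Double = 0.1

  // (y^p - mu^p) / p, with the p -> 0 limit log(y / mu).
  private def yp(y: Double, mu: Double, p: Double): Double =
    if (p == 0.0) math.log(y / mu) else (math.pow(y, p) - math.pow(mu, p)) / p

  def deviance(y: Double, mu: Double, weight: Double, variancePower: Double): Double = {
    // Without the floor, y = 0 yields 0 * Infinity = NaN in the first term.
    val y1 = if (variancePower >= 1.0 && variancePower < 2.0) math.max(y, delta) else y
    2.0 * weight *
      (y * yp(y1, mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
  }
}
```

In this formula the floored value `y1` is always multiplied by `y`, so for `y = 0` the first term is zero for any positive floor. In this sketch, at least, swapping 0.1 for 1/6 would change the result only for responses strictly between zero and the floor.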
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r93289668

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       }
       /**
    +   * Tweedie exponential family distribution.
    +   * The default link for the Tweedie family is the log link.
    +   */
    +  private[regression] object Tweedie extends Family("tweedie") {
    +
    +    val defaultLink: Link = Log
    +
    +    var variancePower: Double = 1.5
    --- End diff --

Would you please suggest a better way to set the variancePower? I want to stay consistent with the existing code, which uses `Family` objects, but I also need to pass the input `variancePower` to the `Tweedie` object, which uses it to compute the variance function. Any suggestion will be highly appreciated. @srowen @yanboliang
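One possible shape for the question above, offered only as a sketch: make `variancePower` an immutable constructor argument so that every fit builds its own `Tweedie` instance instead of mutating a shared object. The `Family` base class here is a simplified, hypothetical stand-in for Spark's private one, `FamilySketch` is an illustrative wrapper, and the two-argument `fromName` only mirrors the signature mentioned in the PR description. This is not necessarily the design that was merged.

```scala
object FamilySketch {
  // Simplified, hypothetical stand-in for Spark's private Family hierarchy.
  abstract class Family(val name: String) {
    def variance(mu: Double): Double
  }

  // variancePower is fixed at construction time, so concurrent fits
  // cannot overwrite each other's value.
  final class Tweedie(val variancePower: Double) extends Family("tweedie") {
    override def variance(mu: Double): Double = math.pow(mu, variancePower)
  }

  object Family {
    // Hypothetical factory; only the tweedie case is sketched here.
    def fromName(name: String, variancePower: Double): Family =
      name.toLowerCase match {
        case "tweedie" => new Tweedie(variancePower)
        case other => throw new IllegalArgumentException(s"Unsupported family: $other")
      }
  }
}
```

With this layout, two models fitted with different powers hold two independent `Tweedie` values, and the estimator can read the power from its own params when it calls `fromName`.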
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r93215641

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       }
       /**
    +   * Tweedie exponential family distribution.
    +   * The default link for the Tweedie family is the log link.
    +   */
    +  private[regression] object Tweedie extends Family("tweedie") {
    +
    +    val defaultLink: Link = Log
    +
    +    var variancePower: Double = 1.5
    +
    +    override def initialize(y: Double, weight: Double): Double = {
    +      if (variancePower > 1.0 && variancePower < 2.0) {
    +        require(y >= 0.0, "The response variable of the specified Tweedie distribution " +
    +          s"should be non-negative, but got $y")
    +        math.max(y, 0.1)
    --- End diff --

If we're going to use this magic 0.1 constant in many places, factor out a constant? 0.1 seems quite large as an 'epsilon' but I guess that's what R's implementation uses for whatever reason?
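A minimal sketch of the "factor out a constant" suggestion, assuming a hypothetical name `delta` and a standalone helper object `TweedieDelta` (neither appears in the diff). The branching follows the quoted `initialize`; the error messages are shortened, and the second one says "positive" because that branch checks `y > 0.0`.

```scala
object TweedieDelta {
  // Hypothetical name for the 0.1 floor that the thread attributes to R's tweedie().
  val delta: Double = 0.1

  // Starting value for mu, taking variancePower as an argument rather than global state.
  def initialize(y: Double, variancePower: Double): Double = {
    if (variancePower > 1.0 && variancePower < 2.0) {
      // Compound Poisson case: zeros are allowed, so floor the start value at delta.
      require(y >= 0.0, s"The response should be non-negative, but got $y")
      math.max(y, delta)
    } else {
      // Other powers require a strictly positive response.
      require(y > 0.0, s"The response should be positive, but got $y")
      y
    }
  }
}
```

The same `delta` could then be reused wherever the deviance applies `math.max(y, 0.1)`, so the value is changed in one place if the choice is revisited.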
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r93215688

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       }
       /**
    +   * Tweedie exponential family distribution.
    +   * The default link for the Tweedie family is the log link.
    +   */
    +  private[regression] object Tweedie extends Family("tweedie") {
    +
    +    val defaultLink: Link = Log
    +
    +    var variancePower: Double = 1.5
    +
    +    override def initialize(y: Double, weight: Double): Double = {
    +      if (variancePower > 1.0 && variancePower < 2.0) {
    +        require(y >= 0.0, "The response variable of the specified Tweedie distribution " +
    +          s"should be non-negative, but got $y")
    +        math.max(y, 0.1)
    +      } else {
    +        require(y > 0.0, "The response variable of the specified Tweedie distribution " +
    +          s"should be non-negative, but got $y")
    +        y
    +      }
    +    }
    +
    +    override def variance(mu: Double): Double = math.pow(mu, variancePower)
    +
    +    private def yp(y: Double, mu: Double, p: Double): Double = {
    +      (math.pow(y, p) - math.pow(mu, p)) / p
    +    }
    +
    +    // Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid
    +    override def deviance(y: Double, mu: Double, weight: Double): Double = {
    +      2.0 * weight *
    +        (y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
    +    }
    +
    +    // This depends on the density of the tweedie distribution. Not yet implemented.
    +    override def aic(
    +        predictions: RDD[(Double, Double, Double)],
    +        deviance: Double,
    +        numInstances: Double,
    +        weightSum: Double): Double = {
    +      0.0
    --- End diff --

Throw an UnsupportedOperationException?
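As a rough illustration of that suggestion, the placeholder `0.0` could be replaced by an explicit failure so that callers never mistake it for a real AIC. The signature below is simplified (the real method also takes an RDD of predictions), and the wrapper object name and message text are made up for this sketch.

```scala
object TweedieAicSketch {
  // Simplified signature: the quoted method also receives an RDD of predictions.
  def aic(deviance: Double, numInstances: Double, weightSum: Double): Double = {
    // Fail loudly instead of returning a value that looks like a valid AIC.
    throw new UnsupportedOperationException("No AIC available for the tweedie family")
  }

  def main(args: Array[String]): Unit = {
    // Callers that want to tolerate the missing statistic can wrap the call.
    val result = scala.util.Try(aic(1.0, 4.0, 4.0))
    println(result.isFailure)
  }
}
```

The trade-off is that any summary code requesting AIC unconditionally would then need to handle the exception.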
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r93216003

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       }
       /**
    +   * Tweedie exponential family distribution.
    +   * The default link for the Tweedie family is the log link.
    +   */
    +  private[regression] object Tweedie extends Family("tweedie") {
    +
    +    val defaultLink: Link = Log
    +
    +    var variancePower: Double = 1.5
    --- End diff --

This is a global shared variable -- we really can't do this.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16344#discussion_r93215941

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
       }
       /**
    +   * Tweedie exponential family distribution.
    +   * The default link for the Tweedie family is the log link.
    +   */
    +  private[regression] object Tweedie extends Family("tweedie") {
    +
    +    val defaultLink: Link = Log
    +
    +    var variancePower: Double = 1.5
    +
    +    override def initialize(y: Double, weight: Double): Double = {
    +      if (variancePower > 1.0 && variancePower < 2.0) {
    +        require(y >= 0.0, "The response variable of the specified Tweedie distribution " +
    +          s"should be non-negative, but got $y")
    +        math.max(y, 0.1)
    +      } else {
    +        require(y > 0.0, "The response variable of the specified Tweedie distribution " +
    +          s"should be non-negative, but got $y")
    +        y
    +      }
    +    }
    +
    +    override def variance(mu: Double): Double = math.pow(mu, variancePower)
    +
    +    private def yp(y: Double, mu: Double, p: Double): Double = {
    +      (math.pow(y, p) - math.pow(mu, p)) / p
    +    }
    +
    +    // Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid
    +    override def deviance(y: Double, mu: Double, weight: Double): Double = {
    +      2.0 * weight *
    +        (y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
    +    }
    +
    +    // This depends on the density of the tweedie distribution. Not yet implemented.
    +    override def aic(
    +        predictions: RDD[(Double, Double, Double)],
    +        deviance: Double,
    +        numInstances: Double,
    +        weightSum: Double): Double = {
    +      0.0
    +    }
    +
    +    override def project(mu: Double): Double = {
    +      if (mu < epsilon) {
    +        epsilon
    +      } else if (mu.isInfinity) {
    +        Double.MaxValue
    --- End diff --

Out of curiosity, is it meaningful to "cap" at Double.MaxValue? By the time you get there a lot of stuff is going to be infinite or not meaningful.
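To make the question concrete, here is a small standalone sketch of what the cap does and does not buy. The value of `epsilon` and the final `else mu` branch are assumptions, since the quoted diff shows neither; `ProjectSketch` is an illustrative name.

```scala
object ProjectSketch {
  // Assumed lower bound; the diff refers to `epsilon` without showing its value.
  val epsilon: Double = 1e-16

  // Same branching as the quoted `project`; the last branch is assumed.
  def project(mu: Double): Double = {
    if (mu < epsilon) epsilon
    else if (mu.isInfinity) Double.MaxValue
    else mu
  }

  def main(args: Array[String]): Unit = {
    val capped = project(Double.PositiveInfinity)
    // Under a log link the linear predictor stays finite after capping.
    println(math.log(capped))
    // The power variance function overflows back to Infinity for powers above 1.
    println(math.pow(capped, 1.5))
  }
}
```

So the cap keeps `log(mu)` finite, but quantities such as `mu^variancePower` overflow again straight away, which is consistent with the reviewer's doubt about how much the cap helps.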
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/16344

    [SPARK-18929][ML] Add Tweedie distribution in GLM

## What changes were proposed in this pull request?

I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. The currently supported Gaussian, Poisson and Gamma families are special cases of the Tweedie family (https://en.wikipedia.org/wiki/Tweedie_distribution).

@yanboliang @srowen @sethah

I propose to add support for the other distributions:
- compound Poisson: 1 < variancePower < 2. This one is widely used to model zero-inflated continuous distributions, e.g., in insurance, finance, ecology, meteorology, advertising etc.
- positive stable: variancePower > 2 and variancePower != 3. Used to model extreme values.
- inverse Gaussian: variancePower = 3.

The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc.

Changes made:
- Allow `tweedie` in family. Only `identity` and `log` links are allowed for now.
- Add `variancePower` to `GeneralizedLinearRegressionBase`, which takes values in (1, 2) and [3, infty). Also set default value to 1.5 and add getter method.
- `Family.fromName` has a second argument `variancePower`
- Add `Tweedie` object
- Add tests for tweedie GLM

Note:
- In computing deviance, use `math.max(y, 0.1)` to avoid taking inverse of 0. This is the same as in R: `tweedie()$dev.res`
- `aic` is not supported in this PR because the evaluation of the [Tweedie density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in these cases is non-trivial. I will implement the density approximation method in a future PR. R returns `null` (see `tweedie()$aic`).
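To make the power variance function and the deviance note concrete, here is a minimal standalone sketch. The object `TweedieMath` and its method signatures are illustrative rather than Spark's API; the formulas are copied from the diff under review, with `variancePower` passed explicitly as `p`.

```scala
object TweedieMath {
  // Power variance function: Var(Y) is proportional to mu^p.
  def variance(mu: Double, p: Double): Double = math.pow(mu, p)

  // Helper from the diff: (y^q - mu^q) / q.
  private def yp(y: Double, mu: Double, q: Double): Double =
    (math.pow(y, q) - math.pow(mu, q)) / q

  // Weighted unit deviance as written in the diff. y is floored at 0.1 only
  // inside the (1 - p) term, where y = 0 would be raised to a negative power.
  def deviance(y: Double, mu: Double, weight: Double, p: Double): Double =
    2.0 * weight * (y * yp(math.max(y, 0.1), mu, 1.0 - p) - yp(y, mu, 2.0 - p))

  def main(args: Array[String]): Unit = {
    // p = 3 (inverse Gaussian): should match (y - mu)^2 / (mu^2 * y).
    println(deviance(2.0, 1.0, 1.0, 3.0))
    // Compound Poisson with y = 0: should match 2 * mu^(2 - p) / (2 - p).
    println(deviance(0.0, 4.0, 1.0, 1.5))
  }
}
```

For `y = 0` the floored term is multiplied by zero, so the floor only prevents a division by zero and does not change the result.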
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark tweedie

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16344.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16344

commit 952887e485fb0d5fa669b3b4c9289b8069ee7769
Author: actuaryzhang
Date: 2016-12-16T00:50:51Z

    Add Tweedie family to GLM

commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3
Author: actuaryzhang
Date: 2016-12-19T22:50:02Z

    Fix calculation in dev resid; Add test for different var power

commit 7fe39106332663d3671b94a8ffac48ca61c48470
Author: actuaryzhang
Date: 2016-12-19T23:14:37Z

    Merge test into GLR