[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 Could somebody help review this PR? I think this will make gathering the estimation results in Scala much easier. This will also be helpful in constructing the tests. For example, the GLM

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-09 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah @imatiach-msft Please review the new commit. Main changes: - Fix issue in null deviance calculation in the presence of offset. Except for special cases (Gaussian with
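
As a rough illustration of why an offset changes the null deviance (a sketch of the general idea, not the code in this PR): with an offset, the intercept-only model no longer predicts one common mean, so the null fit has to account for each row's offset. For a Poisson model with log link, the intercept-only fit happens to have a closed form. The function name and the toy data below are assumptions made for the example, and unit weights are assumed.

```python
import math

def poisson_null_deviance(y, offset):
    """Null deviance of an intercept-only Poisson/log model with offsets.

    Hypothetical helper for illustration; assumes unit weights.
    """
    # Score equation for the intercept: sum(y) = sum(exp(b0 + offset)).
    b0 = math.log(sum(y) / sum(math.exp(o) for o in offset))
    mu = [math.exp(b0 + o) for o in offset]
    dev = 0.0
    for yi, mi in zip(y, mu):
        # y * log(y / mu) is taken as 0 when y == 0.
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        dev += 2.0 * (term - (yi - mi))
    return dev, mu

# Toy data: the offset is log(exposure), as in a rate model.
exposure = [1.0, 2.0, 3.0]
offset = [math.log(e) for e in exposure]
dev, mu = poisson_null_deviance([2.0, 3.0, 6.0], offset)
```

The fitted null means differ by row (they scale with exposure), which is why a null deviance computed from a single overall mean would be wrong once an offset is present.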

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Yes, that is lots of work. However, the only critical change (since the last commit) is on the calculation of the null deviance. The other changes are mainly because of updating

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-10 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100581520 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1218,16 +1266,35 @@ class

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Thanks much for your review. Regarding prediction, both R and my implementation here allow prediction with offsets. If the users want to get the predicted rates (instead of

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah The predict method can work with new data in R. See below. Shall we focus on the current implementation, instead of discussing the details of the R behavior? :) Let me know if

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100974891 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -798,77 +798,160 @@ class

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100974912 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -798,77 +798,160 @@ class

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100975164 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -168,6 +179,7 @@ private[regression] trait

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100975556 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -406,6 +435,14 @@ object

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100975590 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -944,15 +981,27 @@ class

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100976709 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1139,54 +1189,52 @@ class

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100976816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala --- @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Thanks much for your review. I've made a new commit that addressed all your comments. Please see my inline comments. Let me know if there are any other suggestions. Thanks.

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101158362 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -915,6 +917,22 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101158825 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1170,32 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159105 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -915,6 +917,22 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159146 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1170,32 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159255 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -1104,6 +1103,83 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101160069 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -915,6 +917,22 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101160217 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1170,32 @@ class

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @felixcheung @imatiach-msft Thanks much for the review. Made most changes suggested. Please see my inline replies. --- If your project is set up for it, you can reply to this email and have

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-02-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Thanks for the discussions. Will work on this in two weeks.

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101822971 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -1104,6 +1103,83 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101822942 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -1104,6 +1103,83 @@ class

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @imatiach-msft @felixcheung I cleaned up the tests as suggested, and also updated the R GLM wrapper to use the result from this PR. Please let me know if there is any other suggestions

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @imatiach-msft I'm not sure the R^2s are used much in the GLM context. The deviance, loglikelihood and AIC/BICs are most often used for ANOVA and model comparison. The GLM [book](

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @felixcheung Could you take another look at this PR? Thanks.

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Is there anything else you would recommend for this PR? Thanks.

[GitHub] spark pull request #17005: [SPARK-14659][ML] RFormula supports setting base ...

2017-02-20 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17005 [SPARK-14659][ML] RFormula supports setting base level both by frequency and alphabetically ## What changes were proposed in this pull request? Current RFormula drops the least frequent

[GitHub] spark issue #17005: [SPARK-14659][ML] RFormula supports setting base level b...

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17005 @srowen @jkbradley @felixcheung @sethah @yanboliang One question is: is it better to move the `HasStringOrderType` trait to the shared params? This is only used by StringIndexer and

[GitHub] spark issue #17005: [SPARK-14659][ML] RFormula supports setting base level b...

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17005 Cannot figure out what exactly is causing the test to fail. The error message is not informative. Any help please?

[GitHub] spark issue #17005: [SPARK-14659][ML] RFormula supports setting base level b...

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17005 @HyukjinKwon Thanks. I'll try retesting this.

[GitHub] spark pull request #17017: [SPARK-19682][SparkR] Issue warning (or error) wh...

2017-02-21 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17017 [SPARK-19682][SparkR] Issue warning (or error) when subset method "[[" takes vector index ## What changes were proposed in this pull request? The `[[` method is supposed to tak

[GitHub] spark issue #17017: [SPARK-19682][SparkR] Issue warning (or error) when subs...

2017-02-21 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17017 @felixcheung Simple example to illustrate this ``` df <- suppressWarnings(createDataFrame(iris)) df[[1:2]] ``` Instead of issuing warning and taking the first elem

[GitHub] spark pull request #17017: [SPARK-19682][SparkR] Issue warning (or error) wh...

2017-02-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17017#discussion_r102648326 --- Diff: R/pkg/R/DataFrame.R --- @@ -1776,6 +1780,10 @@ setMethod("[[", signature(x = "SparkDataFrame", i = "numericOrc

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103003515 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -34,6 +35,7 @@ import org.apache.spark.rdd.RDD

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103003591 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala --- @@ -99,37 +95,23 @@ private[r] object

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103004564 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1173,33 @@ class

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93565335 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -64,6 +64,27 @@ private[regression] trait

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93565567 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,14 +341,15 @@ object

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-21 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang Thanks much for the feedback. I now have a better understanding of the code and the issue. I have made new commits reflecting your suggestions. The major changes are

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen Thanks for the comments. Makes lots of sense to move the switch to subclass. I did not know one could override a `val`. In the new commit, I have moved the `defaultLink` and

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93672741 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -303,20 +337,24 @@ object

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-23 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks much for the detailed comments. I have addressed all of them in the new commits. Please take another look. @srowen

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-26 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang Any additional issues regarding this PR?

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2016-12-27 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen Made a new commit according to your suggestion. Everything looking good now? @yanboliang

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Did you get a chance to take another look at this? Thanks.

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks for the detailed review. I have made all changes you suggested except for the part on the new power link function. Yes, the canonical link in the Tweedie in general is `1.0

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849501 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -158,6 +183,16 @@ class

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849540 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -365,7 +401,6 @@ object

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r94849556 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -397,32 +432,121 @@ object

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks for the feedback. However, I'm not sure why we need to be consistent with R on this one. The usage of 'tweedie' glm almost always uses `link.power = 0, 1
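
For readers unfamiliar with the power link mentioned here, a minimal sketch follows. It assumes the common convention that a link power of 0 denotes the log link and any other power p means eta = mu^p, so a power of 1 is the identity link. The function names are hypothetical and are not taken from the PR.

```python
import math

def power_link(mu, link_power):
    """Map a mean mu to the linear predictor eta under a power link."""
    # By convention, link power 0 stands for the log link.
    if link_power == 0.0:
        return math.log(mu)
    return mu ** link_power

def power_unlink(eta, link_power):
    """Inverse of power_link: map eta back to the mean mu."""
    if link_power == 0.0:
        return math.exp(eta)
    return eta ** (1.0 / link_power)

eta_log = power_link(math.e, 0.0)   # log link
eta_id = power_link(2.0, 1.0)       # identity link
mu_back = power_unlink(power_link(4.0, 0.5), 0.5)  # round trip, sqrt link
```

Treating the link as a single numeric parameter lets one function cover the log, identity, and other power links without a separate class per link.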

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-06 Thread actuaryzhang
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/16344

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @srowen @yanboliang I'm closing this PR since it does not seem to be very clean to integrate into the current GLM setup. I appreciate all the comments and discussions.

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-09 Thread actuaryzhang
GitHub user actuaryzhang reopened a pull request: https://github.com/apache/spark/pull/16344 [SPARK-18929][ML] Add Tweedie distribution in GLM ## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-09 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, test this please

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Sorry about closing this prematurely. I'm giving it another shot and I think I have an elegant solution to include `linkPower`. The new commit adds the following: 1. It implement

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-11 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks. Look forward to your feedback.

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r96061873 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -613,25 +758,67 @@ object

[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r96061883 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -242,9 +316,9 @@ class

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 @yanboliang Thanks for the review and comments. I have made a new commit that addressed all your comments. The main change is the new companion object `FamilyAndLink` and factory methods to

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, test this please.

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, retest this please

[GitHub] spark issue #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...

2017-03-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17084 Thanks for the PR. I think this is helpful. Will take a look next week. Quite swamped recently.

[GitHub] spark pull request #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR...

2017-04-06 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17553 [SPARK-20026][Doc] Add Tweedie example for SparkR in programming guide ## What changes were proposed in this pull request? Add Tweedie example for SparkR in programming guide. The doc

[GitHub] spark issue #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR in pro...

2017-04-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17553 @felixcheung

[GitHub] spark pull request #17103: [Minor][Doc] Update GLM doc to include tweedie di...

2017-02-28 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17103 [Minor][Doc] Update GLM doc to include tweedie distribution Update GLM documentation to include the Tweedie distribution. #16344 @jkbradley @yanboliang

[GitHub] spark pull request #17105: [SPARK-19773][SparkR] SparkDataFrame should not a...

2017-02-28 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17105 [SPARK-19773][SparkR] SparkDataFrame should not allow duplicate names ## What changes were proposed in this pull request? SparkDataFrame in SparkR seems to accept duplicate names at

[GitHub] spark issue #17105: [SPARK-19773][SparkR] SparkDataFrame should not allow du...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17105 @felixcheung

[GitHub] spark issue #17105: [SPARK-19773][SparkR] SparkDataFrame should not allow du...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17105 @felixcheung Ahh, it seems that we have some conflicting design issues. 1. From the test in collect() and crossJoin, it seems to allow dup names in SparkDataFrame by design

[GitHub] spark pull request #17105: [SPARK-19773][SparkR] SparkDataFrame should not a...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/17105

[GitHub] spark issue #17105: [SPARK-19773][SparkR] SparkDataFrame should not allow du...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17105 @felixcheung Thanks for the clarification. I will close this then.

[GitHub] spark pull request #17115: [Doc][Minor] Update R doc

2017-03-01 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17115 [Doc][Minor] Update R doc Update R doc: 1. columns, names and colnames return a vector of strings, not a **list** as in the current doc. 2. `colnames<-` does allow the subset assignm

[GitHub] spark issue #17115: [Doc][Minor] Update R doc

2017-03-01 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17115 @felixcheung I see lots of the SparkDataFrame methods use the following in examples: ``` path <- "path/to/file.json" df <- read.json(path) ``` I'

[GitHub] spark issue #17115: [Doc][Minor][SparkR] Update SparkR doc for names, column...

2017-03-01 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17115 @HyukjinKwon Thanks. Updated title. @felixcheung Updated doc and added tests. Thanks!

[GitHub] spark issue #17115: [Doc][Minor][SparkR] Update SparkR doc for names, column...

2017-03-01 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17115 @srowen @felixcheung Thanks for the clarification. I will open another PR to add real data examples for the SparkDataFrame methods. I have seen lots of R packages document the

[GitHub] spark pull request #17159: [SPARK-19818][SparkR] union should check for name...

2017-03-03 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17159 [SPARK-19818][SparkR] union should check for name consistency of input data frames ## What changes were proposed in this pull request? Added checks for name consistency of input data

[GitHub] spark issue #17159: [SPARK-19818][SparkR] union should check for name consis...

2017-03-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17159 The current implementation accepts data frames with different schemas. See issues below: ``` df <- createDataFrame(data.frame(name = c("Michael", "Andy", "

[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

2017-03-04 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17161 [SPARK-19819][SparkR] Use concrete data in SparkR DataFrame examples ## What changes were proposed in this pull request? Many examples in SparkDataFrame methods use: ``` path

[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

2017-03-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17161 I think most examples in R packages are (supposed to be) runnable. Coming from a user perspective, I find it useful if I can run the examples directly and see what the function does in action

[GitHub] spark issue #17159: [SPARK-19818][SparkR] union should check for name consis...

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17159 @felixcheung OK, did not know it was by design. It does seem that the `union` behavior is similar to R's SQL (in `sqldf`), but as you pointed out, the overload method `rbind` is diff

[GitHub] spark issue #17159: [SPARK-19818][SparkR] rbind should check for name consis...

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17159 Makes sense. Made changes to rbind and added tests. Please take a look. Thanks.

[GitHub] spark pull request #17159: [SPARK-19818][SparkR] rbind should check for name...

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17159#discussion_r104335939 --- Diff: R/pkg/R/DataFrame.R --- @@ -2685,7 +2686,8 @@ setMethod("unionAll", #' Union two or more SparkDataFrames #

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Sorry for taking so long for this update. I think your first suggestion makes the most sense, i.e., we do not expose the internal `tweedie`. When `statmod` is loaded

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Yes, the SparkR `tweedie` is not exported. See below. ``` model1 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species, + fam

[GitHub] spark issue #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegr...

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17146 Will take a look tonight.

[GitHub] spark issue #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegr...

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17146 This looks good to me. Thanks

[GitHub] spark pull request #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLin...

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17146#discussion_r104853968 --- Diff: python/pyspark/ml/tests.py --- @@ -1223,6 +1223,26 @@ def test_apply_binary_term_freqs(self

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Could you take a look at this new fix when you get a chance? Thanks.

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung If we go with # 3, do we still want compatibility with statmod::tweedie? It's confusing to have two different ways of specifying the same model.

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung OK, new implementation of # 3. Now works in two ways: 1. `family = "tweedie"` + `variancePower` + `linkPower` 2. When `statmod` is available, `tweedie()`

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 One other change I could make is to change `variancePower` and `linkPower` to `var.power` and `link.power` to be consistent with `statmod`. But I would like to get your feedback on this new

[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...

2017-03-12 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16729#discussion_r105576051 --- Diff: R/pkg/R/mllib_regression.R --- @@ -100,6 +120,12 @@ setMethod("spark.glm", signature(data = "SparkDataFrame"

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Thanks for the feedback. Made a new commit that 1. changes `variancePower` and `linkPower` to `var.power` and `link.power`. 2. uses `link = NULL` for tweedie family
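For readers unfamiliar with the two parameters being renamed here, the following is a small sketch of what they mean mathematically. It is plain Python, not SparkR or `statmod` code, and the function names are hypothetical. It uses the standard definitions: the Tweedie variance function is V(mu) = mu^p, and a power link is g(mu) = mu^q, with q = 0 read as the log link.

```python
import math


def tweedie_variance(mu: float, var_power: float) -> float:
    """Tweedie variance function V(mu) = mu ** var_power, for mu > 0."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return mu ** var_power


def power_link(mu: float, link_power: float) -> float:
    """Power link g(mu) = mu ** link_power; link_power == 0 means log(mu)."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    if link_power == 0:
        return math.log(mu)
    return mu ** link_power
```

With `var_power = 0` the variance is constant (Gaussian-like), with `var_power = 1` it equals the mean (Poisson-like), and with `var_power = 2` it grows with the square of the mean (gamma-like), which is why a single family argument plus two powers can cover several familiar models.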

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 Sorry that I forgot to address that comment. Fixed now.

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Could you merge this please? Thanks!

[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...

2016-11-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/15683 @srowen @thunterdb I just updated the unit test for poisson GLM (only for the log link). The simulated data are now forced to take values of zero. Existing data generation is not
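As background on why zero responses deserve explicit test coverage in a Poisson GLM, here is a hedged sketch of the standard Poisson unit deviance. It is not taken from the PR or from Spark's source, and the function name is hypothetical. The point it illustrates is that the term y * log(y / mu) must be treated as 0 when y = 0, otherwise `log(0)` is evaluated.

```python
import math


def poisson_unit_deviance(y: float, mu: float) -> float:
    """Poisson unit deviance 2 * (y * log(y / mu) - (y - mu)).

    The y * log(y / mu) term is taken to be 0 when y == 0, following
    the usual convention that 0 * log(0) = 0.
    """
    if y < 0:
        raise ValueError("y must be non-negative")
    if mu <= 0:
        raise ValueError("mu must be positive")
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (log_term - (y - mu))
```

For a zero observation the deviance reduces to `2 * mu`, so `poisson_unit_deviance(0.0, 1.5)` evaluates to `3.0`, and a perfectly fitted observation such as `poisson_unit_deviance(2.0, 2.0)` gives `0.0`.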

[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...

2016-11-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/15683 @sethah Thanks for the review and comments. I now created a separate unit test. It also passed the style test. I accidentally merged master into a branch... and don't know h

[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...

2016-11-11 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/15683 @sethah Thanks for your review and suggestion. I have made a new commit reflecting your comments. @srowen Thanks for all the suggestions. When do you think this change could be
