[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-04-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @yanboliang @sethah Any suggestion on moving this PR forward? Appreciate your comments and reviews. --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #17590: [SPARK-20278][R] Disable 'multiple_dots_linter' lint rul...

2017-04-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17590 The change is fine to me. With this, we can define function/argument names using multiple styles such as `as.json.array`, `as_json_array`, `asJsonArray`. Is there a preferred style among

[GitHub] spark pull request #17571: [SPARK-20258][Doc][SparkR] Fix SparkR logistic re...

2017-04-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17571#discussion_r110457676 --- Diff: examples/src/main/r/ml/glm.R --- @@ -44,8 +44,9 @@ gaussianGLM2 <- glm(label ~ features, gaussianDF, family = "gaussian")

[GitHub] spark pull request #17571: [SPARK-20258][Doc][SparkR] Fix SparkR logistic re...

2017-04-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17571#discussion_r110457466 --- Diff: examples/src/main/r/ml/glm.R --- @@ -44,8 +44,9 @@ gaussianGLM2 <- glm(label ~ features, gaussianDF, family = "gaussian")

[GitHub] spark issue #17571: [SPARK-20258][Doc][SparkR] Fix SparkR logistic regressio...

2017-04-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17571 @felixcheung Just noticed that the current example for logistic regression in the programming guide did not seem to be a good one. It did not converge using IRWLS, and Quasi-Newton

[GitHub] spark pull request #17571: [SPARK-20258][Doc][SparkR] Fix SparkR logistic re...

2017-04-07 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17571 [SPARK-20258][Doc][SparkR] Fix SparkR logistic regression example in programming guide (did not converge) ## What changes were proposed in this pull request? SparkR logistic

[GitHub] spark issue #17553: [SPARK-20026][Doc][SparkR] Add Tweedie example for Spark...

2017-04-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17553 Issues fixed. Thanks.

[GitHub] spark issue #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR in pro...

2017-04-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17553 @felixcheung

[GitHub] spark pull request #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR...

2017-04-06 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17553 [SPARK-20026][Doc] Add Tweedie example for SparkR in programming guide ## What changes were proposed in this pull request? Add Tweedie example for SparkR in programming guide. The doc

[GitHub] spark issue #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...

2017-03-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17084 Thanks for the PR. I think this is helpful. Will take a look next week. Quite swamped recently.

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Could you merge this please? Thanks!

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 Sorry that I forgot to address that comment. Fixed now.

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-12 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Thanks for the feedback. Made a new commit that 1. changes `variancePower` and `linkPower` to `var.power` and `link.power`. 2. uses `link = NULL` for tweedie family

[GitHub] spark pull request #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for Spa...

2017-03-12 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16729#discussion_r105576051 --- Diff: R/pkg/R/mllib_regression.R --- @@ -100,6 +120,12 @@ setMethod("spark.glm", signature(data = "SparkDataFrame"

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 One other change I could make is to change `variancePower` and `linkPower` to `var.power` and `link.power` to be consistent with `statmod`. But I would like to get your feedback on this new

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung OK, new implementation of # 3. Now works in two ways: 1. `family = "tweedie"` + `variancePower` + `linkPower` 2. When `statmod` is available, `tweedie()`
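
As a hedged illustration of what the two Tweedie parameters named in this comment control, here is a minimal sketch in plain Python (not the SparkR or Spark ML API). It assumes the usual Tweedie GLM conventions: the variance function is `mu ** variancePower`, and the power link is `mu ** linkPower`, with a link power of 0 read as the log link. The function names are hypothetical and chosen only for this example.

```python
import math


def tweedie_variance(mu, variance_power):
    """Tweedie variance function V(mu) = mu ** variance_power."""
    return mu ** variance_power


def power_link(mu, link_power):
    """Power link eta = mu ** link_power; link_power == 0 means log(mu)."""
    if link_power == 0:
        return math.log(mu)
    return mu ** link_power


def power_link_inverse(eta, link_power):
    """Inverse of power_link: recover mu from the linear predictor eta."""
    if link_power == 0:
        return math.exp(eta)
    return eta ** (1.0 / link_power)


# A compound Poisson-gamma style setting (1 < variance power < 2) with a log link.
mu = 3.0
variance = tweedie_variance(mu, 1.5)
eta = power_link(mu, 0)
recovered = power_link_inverse(eta, 0)
```

Under these conventions, link powers of 1, -1 and 0.5 correspond to the identity, inverse and square-root links.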

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung If we go with # 3, do we still want compatibility with statmod::tweedie? It's confusing to have two different ways of specifying the same model.

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Could you take a look at this new fix when you get a chance? Thanks.

[GitHub] spark pull request #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLin...

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17146#discussion_r104853968 --- Diff: python/pyspark/ml/tests.py --- @@ -1223,6 +1223,26 @@ def test_apply_binary_term_freqs(self

[GitHub] spark issue #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegr...

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17146 This looks good to me. Thanks

[GitHub] spark issue #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegr...

2017-03-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17146 Will take a look tonight.

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Yes, the SparkR `tweedie` is not exported. See below. ``` model1 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species, + fam

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Sorry for taking so long for this update. I think your first suggestion makes most sense, i.e., we do not expose the internal `tweedie`. When `statmod` is loaded

[GitHub] spark pull request #17159: [SPARK-19818][SparkR] rbind should check for name...

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17159#discussion_r104335939 --- Diff: R/pkg/R/DataFrame.R --- @@ -2685,7 +2686,8 @@ setMethod("unionAll", #' Union two or more SparkDataFrames #' -#'

[GitHub] spark issue #17159: [SPARK-19818][SparkR] rbind should check for name consis...

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17159 Makes sense. Made changes to rbind and added tests. Please take a look. Thanks.

[GitHub] spark issue #17159: [SPARK-19818][SparkR] union should check for name consis...

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17159 @felixcheung OK, did not know it was by design. It does seem that the `union` behavior is similar to R's SQL (in `sqldf`), but as you pointed out, the overload method `rbind` is different

[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...

2017-03-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17161 I think most examples in R packages are (supposed to be) runnable. Coming from a user perspective, I find it useful if I can run the examples directly and see what the function does in action

[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...

2017-03-04 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17161 [SPARK-19819][SparkR] Use concrete data in SparkR DataFrame examples ## What changes were proposed in this pull request? Many examples in SparkDataFrame methods use: ``` path

[GitHub] spark issue #17159: [SPARK-19818][SparkR] union should check for name consis...

2017-03-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17159 The current implementation accepts data frames with different schemas. See issues below: ``` df <- createDataFrame(data.frame(name = c("Michael", "Andy", "

[GitHub] spark pull request #17159: [SPARK-19818][SparkR] union should check for name...

2017-03-03 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17159 [SPARK-19818][SparkR] union should check for name consistency of input data frames ## What changes were proposed in this pull request? Added checks for name consistency of input data

[GitHub] spark issue #17115: [Doc][Minor][SparkR] Update SparkR doc for names, column...

2017-03-01 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17115 @srowen @felixcheung Thanks for the clarification. I will open another PR to add real data examples for the SparkDataFrame methods. I have seen lots of R package document

[GitHub] spark issue #17115: [Doc][Minor][SparkR] Update SparkR doc for names, column...

2017-03-01 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17115 @HyukjinKwon Thanks. Updated title. @felixcheung Updated doc and added tests. Thanks!

[GitHub] spark issue #17115: [Doc][Minor] Update R doc

2017-03-01 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17115 @felixcheung I see lots of the SparkDataFrame methods use the following in examples: ``` path <- "path/to/file.json" df <- read.json(path) ``` I'

[GitHub] spark pull request #17115: [Doc][Minor] Update R doc

2017-03-01 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17115 [Doc][Minor] Update R doc Update R doc: 1. columns, names and colnames return a vector of strings, not **list** as in current doc. 2. `colnames<-` does allow the subset assignm

[GitHub] spark issue #17105: [SPARK-19773][SparkR] SparkDataFrame should not allow du...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17105 @felixcheung Thanks for the clarification. I will close this then.

[GitHub] spark pull request #17105: [SPARK-19773][SparkR] SparkDataFrame should not a...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/17105

[GitHub] spark issue #17105: [SPARK-19773][SparkR] SparkDataFrame should not allow du...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17105 @felixcheung Ahh, it seems that we have some conflicting design issues. 1. From the test in collect() and crossJoin, it seems to allow dup names in SparkDataFrame by design

[GitHub] spark issue #17105: [SPARK-19773][SparkR] SparkDataFrame should not allow du...

2017-02-28 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17105 @felixcheung

[GitHub] spark pull request #17105: [SPARK-19773][SparkR] SparkDataFrame should not a...

2017-02-28 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17105 [SPARK-19773][SparkR] SparkDataFrame should not allow duplicate names ## What changes were proposed in this pull request? SparkDataFrame in SparkR seems to accept duplicate names

[GitHub] spark pull request #17103: [Minor][Doc] Update GLM doc to include tweedie di...

2017-02-28 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17103 [Minor][Doc] Update GLM doc to include tweedie distribution Update GLM documentation to include the Tweedie distribution. #16344 @jkbradley @yanboliang You can merge this pull

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103004564 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1173,33 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103003591 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala --- @@ -99,37 +95,23 @@ private[r] object

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103003515 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -34,6 +35,7 @@ import org.apache.spark.rdd.RDD

[GitHub] spark pull request #17017: [SPARK-19682][SparkR] Issue warning (or error) wh...

2017-02-22 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17017#discussion_r102648326 --- Diff: R/pkg/R/DataFrame.R --- @@ -1776,6 +1780,10 @@ setMethod("[[", signature(x = "SparkDataFrame", i = "numericOrc

[GitHub] spark issue #17017: [SPARK-19682][SparkR] Issue warning (or error) when subs...

2017-02-21 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17017 @felixcheung Simple example to illustrate this ``` df <- suppressWarnings(createDataFrame(iris)) df[[1:2]] ``` Instead of issuing warning and taking the first elem

[GitHub] spark pull request #17017: [SPARK-19682][SparkR] Issue warning (or error) wh...

2017-02-21 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17017 [SPARK-19682][SparkR] Issue warning (or error) when subset method "[[" takes vector index ## What changes were proposed in this pull request? The `[[` method is supposed to tak

[GitHub] spark issue #17005: [SPARK-14659][ML] RFormula supports setting base level b...

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17005 @HyukjinKwon Thanks. I'll try retesting this.

[GitHub] spark issue #17005: [SPARK-14659][ML] RFormula supports setting base level b...

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17005 Cannot figure out what's exactly causing the test to fail. Error message is not informative. Any help please?

[GitHub] spark issue #17005: [SPARK-14659][ML] RFormula supports setting base level b...

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17005 @srowen @jkbradley @felixcheung @sethah @yanboliang One question is: is it better to move the `HasStringOrderType` trait to the shared params? This is only used by StringIndexer

[GitHub] spark pull request #17005: [SPARK-14659][ML] RFormula supports setting base ...

2017-02-20 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/17005 [SPARK-14659][ML] RFormula supports setting base level both by frequency and alphabetically ## What changes were proposed in this pull request? Current RFormula drops the least frequent

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Is there anything else you would recommend for this PR? Thanks.

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @felixcheung Could you take another look at this PR? Thanks.

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @imatiach-msft I'm not sure the R^2s are used much in the GLM context. The deviance, loglikelihood and AIC/BICs are most often used for ANOVA and model comparison. The GLM [book](https

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @imatiach-msft @felixcheung I cleaned up the tests as suggested, and also updated the R GLM wrapper to use the result from this PR. Please let me know if there are any other suggestions

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101822942 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -1104,6 +1103,83 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101822971 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -1104,6 +1103,83 @@ class

[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-02-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16729 @felixcheung Thanks for the discussions. Will work on this in two weeks.

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 @felixcheung @imatiach-msft Thanks much for the review. Made most changes suggested. Please see my inline replies.

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101160217 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1170,32 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101160069 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -915,6 +917,22 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159255 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -1104,6 +1103,83 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159105 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -915,6 +917,22 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159146 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1170,32 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101158825 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1152,4 +1170,32 @@ class

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101158362 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -915,6 +917,22 @@ class

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Thanks much for your review. I've made a new commit that addressed all your comments. Please see my inline comments. Let me know if there are any other suggestions. Thanks

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100976816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala --- @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100976709 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1139,54 +1189,52 @@ class

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100975590 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -944,15 +981,27 @@ class

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100975556 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -406,6 +435,14 @@ object

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100975164 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -168,6 +179,7 @@ private[regression] trait

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100974912 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -798,77 +798,160 @@ class

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100974891 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -798,77 +798,160 @@ class

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah The predict method can work with new data in R. See below. Shall we focus on the current implementation, instead of discussing the details of the R behavior? :) Let me know

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Thanks much for your review. Regarding prediction, both R and my implementation here allow prediction with offsets. If the users want to get the predicted rates (instead

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-10 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r100581520 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -1218,16 +1266,35 @@ class

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah Yes, that is lots of work. However, the only critical change (since the last commit) is on the calculation of the null deviance. The other changes are mainly because of updating

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-09 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16699 @sethah @imatiach-msft Please review the new commit. Main changes: - Fix issue in null deviance calculation in the presence of offset. Except for special cases (Gaussian
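
The snippet is cut off, so the details of the fix are not visible here. As a hedged illustration of why an offset matters for the null deviance, the sketch below uses plain Python (not the Spark API) and the Poisson family with log link, where the intercept-only fit happens to have a closed form. With an offset, the null model is `log(mu_i) = b0 + offset_i`, so the fitted means are no longer the plain mean of the response. The function names and the data are made up for this example.

```python
import math


def poisson_deviance(y, mu):
    """Poisson deviance 2 * sum(y*log(y/mu) - (y - mu)), taking 0*log(0) as 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


def null_means_with_offset(y, offset):
    """Fitted means of the intercept-only model log(mu_i) = b0 + offset_i.

    The score equation sum(y) = sum(exp(b0 + offset_i)) gives b0 in closed form.
    """
    b0 = math.log(sum(y) / sum(math.exp(o) for o in offset))
    return [math.exp(b0 + o) for o in offset]


# Illustrative counts and exposures; the offset is log(exposure).
y = [2.0, 3.0, 6.0, 7.0]
exposure = [1.0, 2.0, 3.0, 4.0]
offset = [math.log(e) for e in exposure]

# Ignoring the offset: every observation gets the overall mean of y.
naive_mu = [sum(y) / len(y)] * len(y)
# Respecting the offset: means scale with exposure.
offset_mu = null_means_with_offset(y, offset)

naive_null_deviance = poisson_deviance(y, naive_mu)
offset_null_deviance = poisson_deviance(y, offset_mu)
```

For this data the two null deviances differ noticeably, which is the kind of discrepancy a summary would report if the offset were ignored in the null model. For families or links without a closed form, the intercept-only fit would need an iterative solve instead.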

[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-02-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16630 Could somebody help review this PR? I think this will make gathering the estimation results in Scala much easier. This will also be helpful in constructing the tests. For example, the GLM

[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @jkbradley Thanks much for the review and suggestion. I updated the error message. Please let me know if there's anything else needed for this PR. Thanks.

[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 Can one of the admins merge this PR since we have two approvals now? Thanks. @srowen @jkbradley @felixcheung @yanboliang

[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Thanks for the comments. OK, added more tests to cover all families. It's not possible to test all family and link combinations if that's what you mean: the tweedie family

[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-06 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah @imatiach-msft Could you take another look and let me know if there are any additional changes needed on this PR? Thanks!

[GitHub] spark issue #16794: [SPARK-19452][SparkR] Fix bug in the name assignment met...

2017-02-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16794 @felixcheung I'll be happy to look into this and fix it. Do you want me to open a new JIRA or just create a PR against SPARK-19460?

[GitHub] spark issue #16794: [SPARK-19452][SparkR] Fix bug in the name assignment met...

2017-02-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16794 @felixcheung I am in favor of the first proposal to support dot in names. I'm curious why this was not supported yet since we can create DataFrame in Spark with dot in names? ``` val

[GitHub] spark pull request #16794: [SPARK-19452][SparkR] Fix bug in the name assignm...

2017-02-04 Thread actuaryzhang
GitHub user actuaryzhang reopened a pull request: https://github.com/apache/spark/pull/16794 [SPARK-19452][SparkR] Fix bug in the name assignment method ## What changes were proposed in this pull request? The names method fails to check for validity of the assignment values

[GitHub] spark pull request #16794: [SPARK-19452][SparkR] Fix bug in the name assignm...

2017-02-04 Thread actuaryzhang
Github user actuaryzhang closed the pull request at: https://github.com/apache/spark/pull/16794

[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16740#discussion_r99460883 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala --- @@ -89,7 +89,7 @@ private[ml] class

[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Thanks for the review. Made changes you suggested (except for the nit part). I added more tests although I don't think they are really necessary. The analytical approach is taking

[GitHub] spark issue #16794: [SPARK-19452][SparkR] Fix bug in the name assignment met...

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16794 @felixcheung @wangmiao1981 I spent quite some time on this b/c I could not replicate the results and all tests on my end worked. Then I updated my local spark with a pull request

[GitHub] spark issue #16799: [SparkR] fix error in vignettes

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16799 Errors can be seen in #16794.

[GitHub] spark pull request #16799: [SparkR] fix error in vignettes

2017-02-03 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16799 [SparkR] fix error in vignettes ## What changes were proposed in this pull request? Current version has error in vignettes: ``` model <- spark.bisectingKmeans(df, Sepal_Len

[GitHub] spark issue #16794: [SPARK-19452][SparkR] Fix bug in the name assignment met...

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16794 jenkins, retest this please

[GitHub] spark issue #16794: [SPARK-19452][SparkR] Fix bug in the name assignment met...

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16794 Not sure why the unit test on kmeans summary failed since nothing was changed there. Also, all unit tests passed on my computer.

[GitHub] spark issue #16794: [SPARK-19452][SparkR] Fix bug in the name assignment met...

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16794 An example illustrating the issue: ``` df <- suppressWarnings(createDataFrame(iris)) # this is error colnames(df) <- NULL # this should report error na

[GitHub] spark pull request #16794: [SPARK-19452][SparkR] Fix bug in the name assignm...

2017-02-03 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request: https://github.com/apache/spark/pull/16794 [SPARK-19452][SparkR] Fix bug in the name assignment method ## What changes were proposed in this pull request? The names method fails to check for validity of the assignment values

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-03 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16699#discussion_r99406022 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -743,6 +743,84 @@ class

[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...

2017-02-02 Thread actuaryzhang
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @seth @imatiach-msft Let me know if there are any other changes needed. Thanks much for your review!

[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...

2017-02-02 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16740#discussion_r99277921 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -743,6 +744,48 @@ class

[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...

2017-02-02 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16740#discussion_r99276315 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -335,6 +335,11 @@ class
