[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15910





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r90215100
  
--- Diff: R/pkg/R/mllib.R ---
@@ -769,11 +768,11 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' # multinomial logistic regression
 #'
 #' label <- c(0.0, 1.0, 2.0, 0.0, 0.0)
-#' feature1 <- c(4.845940, 5.64480, 7.430381, 6.464263, 5.555667)
-#' feature2 <- c(2.941319, 2.614812, 2.162451, 3.339474, 2.970987)
-#' feature3 <- c(1.322733, 1.348044, 3.861237, 9.686976, 3.447130)
-#' feature4 <- c(1.3246388, 0.5510444, 0.9225810, 1.2147881, 1.6020842)
-#' data <- as.data.frame(cbind(label, feature1, feature2, feature3, feature4))
+#' features1 <- c(4.845940, 5.64480, 7.430381, 6.464263, 5.555667)
+#' features2 <- c(2.941319, 2.614812, 2.162451, 3.339474, 2.970987)
+#' features3 <- c(1.322733, 1.348044, 3.861237, 9.686976, 3.447130)
+#' features4 <- c(1.3246388, 0.5510444, 0.9225810, 1.2147881, 1.6020842)
+#' data <- as.data.frame(cbind(label, features1, features2, features3, features4))
--- End diff --

Nit: Actually you should not change it; usually it is the whole (assembled) feature column that is called ```features```.





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-28 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r89852751
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LogisticRegressionWrapper.scala ---
@@ -84,14 +93,12 @@ private[r] object LogisticRegressionWrapper
 
 val rFormula = new RFormula()
   .setFormula(formula)
-RWrapperUtils.checkDataColumns(rFormula, data)
+  .setForceIndexLabel(true)
+checkDataColumns(rFormula, data)
 val rFormulaModel = rFormula.fit(data)
 
-// get feature names from output schema
-val schema = rFormulaModel.transform(data).schema
-val featureAttrs = AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol))
-  .attributes.get
-val features = featureAttrs.map(_.name.get)
+// get labels and feature names from output schema
+val (features, labels) = getFeaturesAndLabels(rFormulaModel, data)
 
 // assemble and fit the pipeline
 val logisticRegression = new LogisticRegression()
--- End diff --

OK. Fix it in this PR?





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-28 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r89852551
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -674,6 +674,16 @@ test_that("spark.logit", {
   expect_error(summary(blr_model2))
   unlink(modelPath)
 
+  # test prediction label as text
+  training <- suppressWarnings(createDataFrame(iris))
+  binomial_training <- training[training$Species %in% c("versicolor", "virginica"), ]
+  binomial_model <- spark.logit(binomial_training, Species ~ Sepal_Length + Sepal_Width)
+  prediction <- predict(binomial_model, binomial_training)
+  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
+  expected <- c("virginica", "virginica", "virginica", "versicolor", "virginica",
+                "versicolor", "virginica", "versicolor", "virginica", "versicolor")
+  expect_equal(as.list(take(select(prediction, "prediction"), 10))[[1]], expected)
--- End diff --

I will try to create follow-up jira for this.





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r88777658
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -674,6 +674,16 @@ test_that("spark.logit", {
   expect_error(summary(blr_model2))
   unlink(modelPath)
 
+  # test prediction label as text
+  training <- suppressWarnings(createDataFrame(iris))
+  binomial_training <- training[training$Species %in% c("versicolor", "virginica"), ]
+  binomial_model <- spark.logit(binomial_training, Species ~ Sepal_Length + Sepal_Width)
+  prediction <- predict(binomial_model, binomial_training)
+  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
+  expected <- c("virginica", "virginica", "virginica", "versicolor", "virginica",
+                "versicolor", "virginica", "versicolor", "virginica", "versicolor")
+  expect_equal(as.list(take(select(prediction, "prediction"), 10))[[1]], expected)
--- End diff --

Theoretically, the order is not guaranteed. However, we have done similar things since the first test case in mllib.R and never had a problem until now. I'd like to make the tests here and elsewhere enforce an order, but doing that as separate work would probably be better, since it involves lots of other tests.
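
A minimal sketch of how such a check could be made order-independent (assuming the feature columns survive into the ```prediction``` SparkDataFrame and that SparkR's ```arrange```/```take``` are acceptable in the test; a dedicated row-id column would be an even more robust sort key):

```r
# Sketch only: sort on a deterministic key before taking rows, so the expected
# vector no longer depends on partition/row order. Sepal_Length/Sepal_Width are
# illustrative keys; ties would still need a truly unique id column.
ordered <- arrange(select(prediction, "Sepal_Length", "Sepal_Width", "prediction"),
                   "Sepal_Length", "Sepal_Width")
first10 <- take(ordered, 10)
# `expected` would then be written down in this deterministic order.
expect_equal(first10$prediction, expected)
```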





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r88777125
  
--- Diff: R/pkg/R/mllib.R ---
@@ -736,7 +736,7 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' \dontrun{
 #' sparkR.session()
 #' # binary logistic regression
-#' label <- c(1.0, 1.0, 1.0, 0.0, 0.0)
+#' label <- c(0.0, 0.0, 0.0, 1.0, 1.0)
 #' feature <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
--- End diff --

Usually we name it ```features```.
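
i.e., a sketch of the suggested rename (the first line reuses the values from the diff above; the ```cbind``` line is only an assumed continuation of the example):

```r
#' features <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
#' data <- as.data.frame(cbind(label, features))
```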





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r88777030
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LogisticRegressionWrapper.scala ---
@@ -84,14 +93,12 @@ private[r] object LogisticRegressionWrapper
 
 val rFormula = new RFormula()
   .setFormula(formula)
-RWrapperUtils.checkDataColumns(rFormula, data)
+  .setForceIndexLabel(true)
+checkDataColumns(rFormula, data)
 val rFormulaModel = rFormula.fit(data)
 
-// get feature names from output schema
-val schema = rFormulaModel.transform(data).schema
-val featureAttrs = AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol))
-  .attributes.get
-val features = featureAttrs.map(_.name.get)
+// get labels and feature names from output schema
+val (features, labels) = getFeaturesAndLabels(rFormulaModel, data)
 
 // assemble and fit the pipeline
 val logisticRegression = new LogisticRegression()
--- End diff --

Off-topic, but I think it's a bug: we should not allow users to pass ```fitIntercept``` to control whether to fit an intercept; that should be handled by the formula. For example, if users specify the formula ```y ~ a + b + c - 1```, then the model should be fitted without an intercept. Could you please fix this bug as well? Thanks.
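
For reference, a minimal base-R sketch of the formula behaviour described above (plain ```model.matrix```, not the SparkR/RFormula code path; the data frame ```d``` is made up for illustration):

```r
# Made-up data, purely to show how the "- 1" term in an R formula removes the
# intercept column from the design matrix -- the behaviour the wrapper should honor.
d <- data.frame(y = c(0, 1, 1),
                a = c(1.2, -0.5, 0.3),
                b = c(2.0, 0.9, 1.1),
                c = c(0.4, 0.7, -0.2))

colnames(model.matrix(y ~ a + b + c, d))      # "(Intercept)" "a" "b" "c"
colnames(model.matrix(y ~ a + b + c - 1, d))  # "a" "b" "c" -- no intercept
```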





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r88764190
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -674,6 +674,16 @@ test_that("spark.logit", {
   expect_error(summary(blr_model2))
   unlink(modelPath)
 
+  # test prediction label as text
+  training <- suppressWarnings(createDataFrame(iris))
+  binomial_training <- training[training$Species %in% c("versicolor", "virginica"), ]
+  binomial_model <- spark.logit(binomial_training, Species ~ Sepal_Length + Sepal_Width)
+  prediction <- predict(binomial_model, binomial_training)
+  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character")
+  expected <- c("virginica", "virginica", "virginica", "versicolor", "virginica",
+                "versicolor", "virginica", "versicolor", "virginica", "versicolor")
+  expect_equal(as.list(take(select(prediction, "prediction"), 10))[[1]], expected)
--- End diff --

How reliable is this test? The order of rows is not guaranteed unless it is enforced by a sort or something, right?





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r88764104
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -637,31 +637,31 @@ test_that("spark.isotonicRegression", {
 
 test_that("spark.logit", {
   # test binary logistic regression
-  label <- c(1.0, 1.0, 1.0, 0.0, 0.0)
+  label <- c(0.0, 0.0, 0.0, 1.0, 1.0)
   feature <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
   binary_data <- as.data.frame(cbind(label, feature))
   binary_df <- createDataFrame(binary_data)
 
   blr_model <- spark.logit(binary_df, label ~ feature, thresholds = 1.0)
   blr_predict <- collect(select(predict(blr_model, binary_df), "prediction"))
-  expect_equal(blr_predict$prediction, c(0, 0, 0, 0, 0))
+  expect_equal(blr_predict$prediction, c("0.0", "0.0", "0.0", "0.0", "0.0"))
   blr_model1 <- spark.logit(binary_df, label ~ feature, thresholds = 0.0)
   blr_predict1 <- collect(select(predict(blr_model1, binary_df), "prediction"))
-  expect_equal(blr_predict1$prediction, c(1, 1, 1, 1, 1))
+  expect_equal(blr_predict1$prediction, c("1.0", "1.0", "1.0", "1.0", "1.0"))
 
   # test summary of binary logistic regression
   blr_summary <- summary(blr_model)
   blr_fmeasure <- collect(select(blr_summary$fMeasureByThreshold, "threshold", "F-Measure"))
-  expect_equal(blr_fmeasure$threshold, c(0.8221347, 0.7884005, 0.6674709, 0.3785437, 0.3434487),
-   tolerance = 1e-4)
-  expect_equal(blr_fmeasure$"F-Measure", c(0.500, 0.800, 0.667, 0.8571429, 0.750),
-   tolerance = 1e-4)
+  expect_equal(blr_fmeasure$threshold, c(0.6565513, 0.6214563, 0.3325291, 0.2115995, 0.1778653),
+  tolerance = 1e-4)
--- End diff --

Nit: it would be great to align the ```tolerance``` parameter with the surrounding indentation.





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-16 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15910#discussion_r88363990
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -19,7 +19,7 @@ package org.apache.spark
 
 import java.io._
 import java.lang.reflect.Constructor
-import java.net.{MalformedURLException, URI}
--- End diff --

Removed the unused import in this PR, because a one-line change like this is not encouraged as a separate PR.





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-16 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/15910

[SPARK-18476][SPARKR][ML]:SparkR Logistic Regression should support outputting the original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the indexed label.

In this PR, original label output is supported, test cases are modified and added, and the documentation is updated accordingly.
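
For context, the intended user-visible behaviour, mirroring the new test case in test_mllib.R (a sketch; it assumes an active SparkR session):

```r
df <- suppressWarnings(createDataFrame(iris))
binomial_df <- df[df$Species %in% c("versicolor", "virginica"), ]
model <- spark.logit(binomial_df, Species ~ Sepal_Length + Sepal_Width)
pred <- predict(model, binomial_df)
# Predictions now come back as the original string labels
# ("versicolor"/"virginica") rather than indexed doubles (0.0/1.0).
head(select(pred, "prediction"))
```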

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark audit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15910.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15910


commit e025d9894fa587c9aef837e7e1bbb11b1706b539
Author: wm...@hotmail.com 
Date:   2016-11-10T22:40:57Z

remove unused import

commit f3f1f0b74a47c17d301b075c136e365917216139
Author: wm...@hotmail.com 
Date:   2016-11-11T18:23:20Z

add label StringIndexer

commit 7177dd3e0985fa69e8ea7dc4203562f03364f0b1
Author: wm...@hotmail.com 
Date:   2016-11-17T00:28:40Z

modify test case; add output label



