[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16630


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r128200757
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1458,4 +1475,167 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Coefficient matrix with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.3.0")
+  lazy val coefficientMatrix: Array[(String, Double, Double, Double, 
Double)] = {
+if (isNormalSolver) {
+  var featureNamesLocal = featureNames
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNamesLocal = featureNamesLocal :+ "(Intercept)"
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx) yield
+(featureNamesLocal(i), coefficients(i), 
coefficientStandardErrors(i),
+tValues(i), pValues(i))
+  result
+} else {
+  throw new UnsupportedOperationException(
+"No summary table available for this 
GeneralizedLinearRegressionModel")
+}
+  }
+
+  private def round(x: Double, digit: Int): String = {
+BigDecimal(x).setScale(digit, 
BigDecimal.RoundingMode.HALF_UP).toString()
+  }
+
+  private[regression] def showString(_numRows: Int, truncate: Int = 20,
+ numDigits: Int = 3): String = {
--- End diff --

Align.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r128200260
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1458,4 +1475,167 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Coefficient matrix with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.3.0")
+  lazy val coefficientMatrix: Array[(String, Double, Double, Double, 
Double)] = {
+if (isNormalSolver) {
+  var featureNamesLocal = featureNames
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNamesLocal = featureNamesLocal :+ "(Intercept)"
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx) yield
+(featureNamesLocal(i), coefficients(i), 
coefficientStandardErrors(i),
+tValues(i), pValues(i))
+  result
+} else {
+  throw new UnsupportedOperationException(
+"No summary table available for this 
GeneralizedLinearRegressionModel")
+}
+  }
+
+  private def round(x: Double, digit: Int): String = {
+BigDecimal(x).setScale(digit, 
BigDecimal.RoundingMode.HALF_UP).toString()
+  }
+
+  private[regression] def showString(_numRows: Int, truncate: Int = 20,
+ numDigits: Int = 3): String = {
+val numRows = _numRows.max(1)
+val data = coefficientMatrix.take(numRows)
+val hasMoreData = coefficientMatrix.size > numRows
+
+val colNames = Array("Feature", "Estimate", "StdError", "TValue", 
"PValue")
+val numCols = colNames.size
+
+val rows = colNames +: data.map( row => {
+  val mrow = for (cell <- row.productIterator) yield {
+val str = cell match {
+  case s: String => s
+  case n: Double => round(n, numDigits).toString
+}
+if (truncate > 0 && str.length > truncate) {
+  // do not show ellipses for strings shorter than 4 characters.
+  if (truncate < 4) str.substring(0, truncate)
+  else str.substring(0, truncate - 3) + "..."
+} else {
+  str
+}
+  }
+  mrow.toArray
+})
+
+val sb = new StringBuilder
+val colWidths = Array.fill(numCols)(3)
+
+// Compute the width of each column
+for (row <- rows) {
+  for ((cell, i) <- row.zipWithIndex) {
+colWidths(i) = math.max(colWidths(i), cell.length)
+  }
+}
+
+// Create SeparateLine
+val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", 
"+\n").toString()
+
+// column names
+rows.head.zipWithIndex.map { case (cell, i) =>
+  if (truncate > 0) {
+StringUtils.leftPad(cell, colWidths(i))
+  } else {
+StringUtils.rightPad(cell, colWidths(i))
+  }
+}.addString(sb, "|", "|", "|\n")
+sb.append(sep)
+
+// data
+rows.tail.map {
+  _.zipWithIndex.map { case (cell, i) =>
+if (truncate > 0) {
+  StringUtils.leftPad(cell.toString, colWidths(i))
+} else {
+  StringUtils.rightPad(cell.toString, colWidths(i))
+}
+  }.addString(sb, "|", "|", "|\n")
+}
+
+// For Data that has more than "numRows" records
+if (hasMoreData) {
+  sb.append("...\n")
+  sb.append(sep)
+  val rowsString = if (numRows == 1) "row" else "rows"
+  sb.append(s"only showing top $numRows $rowsString\n")
+} else {
+  sb.append(sep)
+}
+
+sb.append("\n")
+sb.append(s"(Dispersion parameter for ${family.name} family taken to 
be " +
+  round(dispersion, numDigits) + ")")
+
+sb.append("\n")
+val nd = "Null deviance: " + round(nullDeviance, numDigits) +
+  s" on $degreesOfFreedom degrees of freedom"
+val rd = "Residual deviance: " + round(deviance, numDigits) +
+  s" on $residualDegreeOfFreedom degrees of freedom"
+val l = math.max(nd.length, rd.length)
+sb.append(StringUtils.leftPad(nd, l))
+sb.append("\n")
+sb.append(StringUtils.leftPad(rd, l))
+
+if (family.name != "tweedie") {
+  sb.append("\n")
+  sb.append(s"AIC: " + round(aic, numDigits))
+}
+
+sb.toString()
+  }
  

[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r128202464
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1458,4 +1475,167 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Coefficient matrix with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.3.0")
+  lazy val coefficientMatrix: Array[(String, Double, Double, Double, 
Double)] = {
--- End diff --

This is not a matrix, so it's not appropriate to name it as 
```coefficientMatrix```. Since it's only used for generating summary string 
output, what about keep it private or inline?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r127853762
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -452,6 +452,8 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   private[regression] val epsilon: Double = 1E-16
 
+  private[regression] val Intercept: String = "(Intercept)"
--- End diff --

Removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r127844484
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
--- End diff --

Updated it as `coefficientMatrix`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r127844472
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
--- End diff --

Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r127844463
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1187,6 +1189,23 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   @Since("2.2.0")
   lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to feature column name with numbered suffix "_0", 
"_1", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureNames: Array[String] = {
--- End diff --

Made it `private[ml]` since it is used in the R wrapper.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r126872238
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
--- End diff --

Could we have a better name? What about output model summary like R? Then 
we can directly override ```toString``` function?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r126871919
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -452,6 +452,8 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
   private[regression] val epsilon: Double = 1E-16
 
+  private[regression] val Intercept: String = "(Intercept)"
--- End diff --

If this is only used once, it's better to eliminate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r126873611
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNamesLocal = featureNames
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNamesLocal = featureNamesLocal :+ Intercept
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNamesLocal(i), coefficients(i), 
coefficientStandardErrors(i),
+tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
+  import spark.implicits._
+  result.toDF("Feature", "Coefficient", "StdError", "TValue", 
"PValue").repartition(1)
--- End diff --

Could you let me know the reason of wrapping the result as a 
```DataFrame```? I think a local 2D array is enough. ```DataFrame``` adds some 
extra cost and actually it will collect as local array when you call 
```toString```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r126873706
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1187,6 +1189,23 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   @Since("2.2.0")
   lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to feature column name with numbered suffix "_0", 
"_1", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureNames: Array[String] = {
--- End diff --

Could we keep it private?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-07-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r126872077
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
--- End diff --

Bump up since version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r103004564
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1173,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNamesLocal = featureNames
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNamesLocal = featureNamesLocal :+ Intercept
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNamesLocal(i), coefficients(i), 
coefficientStandardErrors(i),
+tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
--- End diff --

@felixcheung Would you elaborate on your concern here? The reason I create 
the SparkSession is to use convert `Seq` to `DataFrame` using `toDF`. Is there 
a way we can create data frame without explicitly using spark session? Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r103003591
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala
 ---
@@ -99,37 +95,23 @@ private[r] object GeneralizedLinearRegressionWrapper
 val summary = glm.summary
 
 val rFeatures: Array[String] = if (glm.getFitIntercept) {
-  Array("(Intercept)") ++ features
+  Array("(Intercept)") ++ summary.featureNames
 } else {
-  features
+  summary.featureNames
 }
 
 val rCoefficients: Array[Double] = if (summary.isNormalSolver) {
-  val rCoefficientStandardErrors = if (glm.getFitIntercept) {
-Array(summary.coefficientStandardErrors.last) ++
-  summary.coefficientStandardErrors.dropRight(1)
-  } else {
-summary.coefficientStandardErrors
-  }
+  val rCoefficientStandardErrors =
+summary.summaryTable.select("StdError").collect.map(_.getDouble(0))
 
-  val rTValues = if (glm.getFitIntercept) {
-Array(summary.tValues.last) ++ summary.tValues.dropRight(1)
-  } else {
-summary.tValues
-  }
+  val rTValues =
+summary.summaryTable.select("TValue").collect.map(_.getDouble(0))
 
-  val rPValues = if (glm.getFitIntercept) {
-Array(summary.pValues.last) ++ summary.pValues.dropRight(1)
-  } else {
-summary.pValues
-  }
+  val rPValues =
+summary.summaryTable.select("PValue").collect.map(_.getDouble(0))
 
-  if (glm.getFitIntercept) {
-Array(glm.intercept) ++ glm.coefficients.toArray ++
-  rCoefficientStandardErrors ++ rTValues ++ rPValues
-  } else {
-glm.coefficients.toArray ++ rCoefficientStandardErrors ++ rTValues 
++ rPValues
-  }
+  
summary.summaryTable.select("Coefficient").collect.map(_.getDouble(0)) ++
+rCoefficientStandardErrors ++ rTValues ++ rPValues
--- End diff --

Yes, I checked the results, and they are the same. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-24 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r103003515
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -34,6 +35,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
+import org.apache.spark.sql.SparkSession
--- End diff --

Thanks. Sorted the imports. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-22 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r102564640
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -34,6 +35,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
+import org.apache.spark.sql.SparkSession
--- End diff --

we generally try to sort the import...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-22 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r102566039
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1173,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNamesLocal = featureNames
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNamesLocal = featureNamesLocal :+ Intercept
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNamesLocal(i), coefficients(i), 
coefficientStandardErrors(i),
+tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
--- End diff --

I'm concerned about this - the current session might not always be the 
right one. 
If we need an instance of SparkSession it would be preferable in the way 
MLReader/BaseReadWrite does:

https://github.com/apache/spark/blob/04ee8cf633e17b6bf95225a8dd77bf2e06980eb3/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L59



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-22 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r102564507
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala
 ---
@@ -99,37 +95,23 @@ private[r] object GeneralizedLinearRegressionWrapper
 val summary = glm.summary
 
 val rFeatures: Array[String] = if (glm.getFitIntercept) {
-  Array("(Intercept)") ++ features
+  Array("(Intercept)") ++ summary.featureNames
 } else {
-  features
+  summary.featureNames
 }
 
 val rCoefficients: Array[Double] = if (summary.isNormalSolver) {
-  val rCoefficientStandardErrors = if (glm.getFitIntercept) {
-Array(summary.coefficientStandardErrors.last) ++
-  summary.coefficientStandardErrors.dropRight(1)
-  } else {
-summary.coefficientStandardErrors
-  }
+  val rCoefficientStandardErrors =
+summary.summaryTable.select("StdError").collect.map(_.getDouble(0))
 
-  val rTValues = if (glm.getFitIntercept) {
-Array(summary.tValues.last) ++ summary.tValues.dropRight(1)
-  } else {
-summary.tValues
-  }
+  val rTValues =
+summary.summaryTable.select("TValue").collect.map(_.getDouble(0))
 
-  val rPValues = if (glm.getFitIntercept) {
-Array(summary.pValues.last) ++ summary.pValues.dropRight(1)
-  } else {
-summary.pValues
-  }
+  val rPValues =
+summary.summaryTable.select("PValue").collect.map(_.getDouble(0))
 
-  if (glm.getFitIntercept) {
-Array(glm.intercept) ++ glm.coefficients.toArray ++
-  rCoefficientStandardErrors ++ rTValues ++ rPValues
-  } else {
-glm.coefficients.toArray ++ rCoefficientStandardErrors ++ rTValues 
++ rPValues
-  }
+  
summary.summaryTable.select("Coefficient").collect.map(_.getDouble(0)) ++
+rCoefficientStandardErrors ++ rTValues ++ rPValues
--- End diff --

Could you run a quick check to see if the values in the SparkR summary are 
the same before and after this change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101822942
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("features_0", "features_1"), 
Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureNames.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("features_0", "features_1"),
+  Array("(Intercept)", "features_0", "features_1"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
--- End diff --

Thanks. Added in R code. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-17 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101822971
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("features_0", "features_1"), 
Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureNames.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("features_0", "features_1"),
+  Array("(Intercept)", "features_0", "features_1"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
+  Vectors.dense(0.7903, 0.2258, 0.4677))
+val expectedStdError = Seq(Vectors.dense(1.724, 0.3787),
+  Vectors.dense(4.0129, 2.1153, 0.5815))
+val expectedTValue = Seq(Vectors.dense(0.1673, 1.4205),
+  Vectors.dense(0.1969, 0.1067, 0.8043))
+val expectedPValue = Seq(Vectors.dense(0.8778, 0.2506),
+  Vectors.dense(0.8621, 0.9247, 0.5056))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val trainer = new GeneralizedLinearRegression()
+.setFamily("gaussian")
+.setFitIntercept(fitIntercept)
+  val model = trainer.fit(dataset)
+  val summaryTable = model.summary.summaryTable
+
+  summaryTable.select("Feature").collect.map(_.getString(0))
+.zip(expectedFeature(idx)).foreach{ x => assert(x._1 === x._2,
+"Feature name mismatch in summaryTable") }
+  
assert(Vectors.dense(summaryTable.select("Coefficient").rdd.collect.map(_.getDouble(0)))
+~== expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in 
summaryTable")
+  
assert(Vectors.dense(summaryTable.select("StdError").rdd.collect.map(_.getDouble(0)))
+~== expectedStdError(idx) absTol 1E-3, "Standard error mismatch in 
summaryTable")
+  
assert(Vectors.dense(summaryTable.select("TValue").rdd.collect.map(_.getDouble(0)))
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-15 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101362237
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("features_0", "features_1"), 
Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureNames.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("features_0", "features_1"),
+  Array("(Intercept)", "features_0", "features_1"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
+  Vectors.dense(0.7903, 0.2258, 0.4677))
+val expectedStdError = Seq(Vectors.dense(1.724, 0.3787),
+  Vectors.dense(4.0129, 2.1153, 0.5815))
+val expectedTValue = Seq(Vectors.dense(0.1673, 1.4205),
+  Vectors.dense(0.1969, 0.1067, 0.8043))
+val expectedPValue = Seq(Vectors.dense(0.8778, 0.2506),
+  Vectors.dense(0.8621, 0.9247, 0.5056))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val trainer = new GeneralizedLinearRegression()
+.setFamily("gaussian")
+.setFitIntercept(fitIntercept)
+  val model = trainer.fit(dataset)
+  val summaryTable = model.summary.summaryTable
+
+  summaryTable.select("Feature").collect.map(_.getString(0))
+.zip(expectedFeature(idx)).foreach{ x => assert(x._1 === x._2,
+"Feature name mismatch in summaryTable") }
+  
assert(Vectors.dense(summaryTable.select("Coefficient").rdd.collect.map(_.getDouble(0)))
+~== expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in 
summaryTable")
+  
assert(Vectors.dense(summaryTable.select("StdError").rdd.collect.map(_.getDouble(0)))
+~== expectedStdError(idx) absTol 1E-3, "Standard error mismatch in 
summaryTable")
+  
assert(Vectors.dense(summaryTable.select("TValue").rdd.collect.map(_.getDouble(0)))
--- End diff --

it looks like for all of these below you can just call collect instead of 
rdd.collect?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-15 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101362084
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("features_0", "features_1"), 
Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureNames.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("features_0", "features_1"),
+  Array("(Intercept)", "features_0", "features_1"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
--- End diff --

is this comparing the summary to the results of R?  If so, in general you 
should add the R code in a comment that was used to generate the expected 
results so that the expected values are reproducible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-15 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101357001
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -915,6 +919,23 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to feature column name with numbered suffix "_0", 
"_1", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureNames: Array[String] = {
+val featureAttrs = AttributeGroup.fromStructField(
+  dataset.schema(model.getFeaturesCol)).attributes
+if (featureAttrs == None) {
+  Array.tabulate[String](origModel.numFeatures)(
+(x: Int) => (model.getFeaturesCol + "_" + x))
--- End diff --

in general I would have preferred to create a platform-level function (or 
use one if it exists) to format the strings in the same way, so there is no 
duplicate code in VectorAssembler vs here that can diverge (and which other 
functions in spark can generally use).  However, this seems a bit out of scope 
of this code review, so I don't think you need to do this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-15 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101356475
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1173,33 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNamesLocal = featureNames
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNamesLocal = featureNamesLocal :+ Intercept
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNamesLocal(i), coefficients(i), 
coefficientStandardErrors(i),
+tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
+  import spark.implicits._
+  result.toDF("Feature", "Coefficient", "StdError", "TValue", 
"PValue").repartition(1)
--- End diff --

Sorry, I didn't realize that R uses Estimate instead of coefficient - if 
you feel strongly about using Estimate here instead you can change this back.  
Up to you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-15 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101356243
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("V1", "V2"), Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureName.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("V1", "V2"),
+  Array("Intercept", "V1", "V2"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
+  Vectors.dense(0.7903, 0.2258, 0.4677))
+val expectedStdError = Seq(Vectors.dense(1.724, 0.3787),
+  Vectors.dense(4.0129, 2.1153, 0.5815))
+val expectedTValue = Seq(Vectors.dense(0.1673, 1.4205),
+  Vectors.dense(0.1969, 0.1067, 0.8043))
+val expectedPValue = Seq(Vectors.dense(0.8778, 0.2506),
+  Vectors.dense(0.8621, 0.9247, 0.5056))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val trainer = new GeneralizedLinearRegression()
+.setFamily("gaussian")
--- End diff --

I would usually prefer to use variables wherever possible as it is much 
easier to update through various editors and in general it is much easier to 
catch compile time vs runtime errors.  But it is a minor point, and it looks 
like this is consistent with most of the spark codebase.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101220747
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("V1", "V2"), Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureName.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("V1", "V2"),
+  Array("Intercept", "V1", "V2"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
+  Vectors.dense(0.7903, 0.2258, 0.4677))
+val expectedStdError = Seq(Vectors.dense(1.724, 0.3787),
+  Vectors.dense(4.0129, 2.1153, 0.5815))
+val expectedTValue = Seq(Vectors.dense(0.1673, 1.4205),
+  Vectors.dense(0.1969, 0.1067, 0.8043))
+val expectedPValue = Seq(Vectors.dense(0.8778, 0.2506),
+  Vectors.dense(0.8621, 0.9247, 0.5056))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val trainer = new GeneralizedLinearRegression()
+.setFamily("gaussian")
--- End diff --

Guassian.name.toLowerCase (or Guassian.name since it is converted to 
lowercase later) would be generally the approach.

but this is test suite, I think it's ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101220572
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
+val featureAttrs = AttributeGroup.fromStructField(
+  dataset.schema(model.getFeaturesCol)).attributes
+if (featureAttrs == None) {
+  Array.tabulate[String](origModel.numFeatures)((x: Int) => ("V" + (x 
+ 1)))
--- End diff --

quite possibly - could you check what would be new or removed with that 
approach?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101160217
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNames = featureName
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNames = featureNames :+ "Intercept"
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNames(i), coefficients(i), coefficientStandardErrors(i), 
tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
+  import spark.implicits._
+  result.toDF("Feature", "Estimate", "StdError", "TValue", 
"PValue").repartition(1)
--- End diff --

R was using 'Estimate'. I changed it to 'Coefficient' now. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101160069
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
--- End diff --

OK, changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101159255
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("V1", "V2"), Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureName.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("V1", "V2"),
+  Array("Intercept", "V1", "V2"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
+  Vectors.dense(0.7903, 0.2258, 0.4677))
+val expectedStdError = Seq(Vectors.dense(1.724, 0.3787),
+  Vectors.dense(4.0129, 2.1153, 0.5815))
+val expectedTValue = Seq(Vectors.dense(0.1673, 1.4205),
+  Vectors.dense(0.1969, 0.1067, 0.8043))
+val expectedPValue = Seq(Vectors.dense(0.8778, 0.2506),
+  Vectors.dense(0.8621, 0.9247, 0.5056))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val trainer = new GeneralizedLinearRegression()
+.setFamily("gaussian")
--- End diff --

Indeed, there is object `Gaussian` and one can use `Gaussian.name` for the 
string name. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101159105
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
+val featureAttrs = AttributeGroup.fromStructField(
+  dataset.schema(model.getFeaturesCol)).attributes
+if (featureAttrs == None) {
--- End diff --

@imatiach-msft This makes sense. I now changed the code to mirror the same 
logic. When attritubes are missing, the default name is set to be the feature 
name with suffix "_0", "_1" etc. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101159146
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNames = featureName
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNames = featureNames :+ "Intercept"
--- End diff --

Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101158825
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNames = featureName
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNames = featureNames :+ "Intercept"
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNames(i), coefficients(i), coefficientStandardErrors(i), 
tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
+  import spark.implicits._
--- End diff --

I was using the spark session and implicits to be able to use `toDF` to 
create data frame with names from `Seq`. Could you explain how this `import 
dataset.sparkSession.implicits._` works? Could not import it in spark shell.
```
:56: error: not found: value dataset
   import dataset.sparkSession.implicits._
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-14 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r101158362
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
+val featureAttrs = AttributeGroup.fromStructField(
+  dataset.schema(model.getFeaturesCol)).attributes
+if (featureAttrs == None) {
+  Array.tabulate[String](origModel.numFeatures)((x: Int) => ("V" + (x 
+ 1)))
--- End diff --

@felixcheung The feature names were not available prior to this PR, right? 
One other place I see that does similar summary is the 
`GeneralizedLinearRegressionWrapper` for R. Do you think we should consolidate 
the two, e.g., update the `Wrapper` to use the summary table directly? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-13 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r100933747
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
--- End diff --

minor comment: it looks like this is an array so should be plural, as in 
"featureNames" instead of "featureName" without the s


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-13 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r100933207
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNames = featureName
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNames = featureNames :+ "Intercept"
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNames(i), coefficients(i), coefficientStandardErrors(i), 
tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
+  import spark.implicits._
+  result.toDF("Feature", "Estimate", "StdError", "TValue", 
"PValue").repartition(1)
--- End diff --

question: is "Estimate" the better term to use here as opposed to 
"Coefficient"?  Are there other libraries which use this specific term in this 
case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-13 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r100932561
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNames = featureName
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNames = featureNames :+ "Intercept"
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNames(i), coefficients(i), coefficientStandardErrors(i), 
tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
+  import spark.implicits._
+  result.toDF("Feature", "Estimate", "StdError", "TValue", 
"PValue").repartition(1)
+} else {
+  throw new UnsupportedOperationException(
+"No summary table available for this 
GeneralizedLinearRegressionModel")
--- End diff --

minor suggestion: it would be nice to add a test to verify this exception 
is thrown (and with the right error message using the withClue() check)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-13 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r100932182
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
   .fit(datasetGaussianIdentity.as[LabeledPoint])
   }
 
+
+  test("glm summary: feature name") {
+// dataset1 with no attribute
+val dataset1 = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+// dataset2 with attribute
+val datasetTmp = Seq(
+  (2.0, 1.0, 0.0, 5.0),
+  (8.0, 2.0, 1.0, 7.0),
+  (3.0, 3.0, 2.0, 11.0),
+  (9.0, 4.0, 3.0, 13.0),
+  (2.0, 5.0, 2.0, 3.0)
+).toDF("y", "w", "x1", "x2")
+val formula = new RFormula().setFormula("y ~ x1 + x2")
+val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+val expectedFeature = Seq(Array("V1", "V2"), Array("x1", "x2"))
+
+var idx = 0
+for (dataset <- Seq(dataset1, dataset2)) {
+  val model = new GeneralizedLinearRegression().fit(dataset)
+  model.summary.featureName.zip(expectedFeature(idx))
+.foreach{ x => assert(x._1 === x._2) }
+  idx += 1
+}
+  }
+
+  test("glm summary: summaryTable") {
+val dataset = Seq(
+  Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+  Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+  Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+).toDF()
+
+val expectedFeature = Seq(Array("V1", "V2"),
+  Array("Intercept", "V1", "V2"))
+val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
+  Vectors.dense(0.7903, 0.2258, 0.4677))
+val expectedStdError = Seq(Vectors.dense(1.724, 0.3787),
+  Vectors.dense(4.0129, 2.1153, 0.5815))
+val expectedTValue = Seq(Vectors.dense(0.1673, 1.4205),
+  Vectors.dense(0.1969, 0.1067, 0.8043))
+val expectedPValue = Seq(Vectors.dense(0.8778, 0.2506),
+  Vectors.dense(0.8621, 0.9247, 0.5056))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val trainer = new GeneralizedLinearRegression()
+.setFamily("gaussian")
--- End diff --

not related to this code review, but it's unfortunate that these aren't 
constants that can be referenced from the model, it's messy to have to type 
strings like this everywhere as opposed to referencing variables


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-13 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r100931568
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNames = featureName
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNames = featureNames :+ "Intercept"
--- End diff --

would it be possible to move this to a constant ("Intercept")


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-13 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r100931445
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary 
private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
+val featureAttrs = AttributeGroup.fromStructField(
+  dataset.schema(model.getFeaturesCol)).attributes
+if (featureAttrs == None) {
--- End diff --

if I do the example below in spark-shell:

import org.apache.spark.ml.feature.HashingTF
val tf = new HashingTF().setInputCol("x").setOutputCol("hash")
val df = spark.createDataFrame(Seq(Tuple3(0.0,Array("a", "b"), 4), 
Tuple3(1.0, Array("b", "c"), 6), Tuple3(1.0, Array("a", "c"), 7), Tuple3(0.0, 
Array("b","c"), 7))).toDF("y", "x", "z")
val dfres = tf.transform(df)

when doing show():
scala> dfres.show
+---+--+---++
|  y| x|  z|hash|
+---+--+---++
|0.0|[a, b]|  4|(262144,[30913,22...|
|1.0|[b, c]|  6|(262144,[28698,30...|
|1.0|[a, c]|  7|(262144,[28698,22...|
|0.0|[b, c]|  7|(262144,[28698,30...|
+---+--+---++

but, when I look at schema:
import org.apache.spark.ml.attribute.AttributeGroup
scala> AttributeGroup.fromStructField(dfres.schema("hash")).attributes
res5: Option[Array[org.apache.spark.ml.attribute.Attribute]] = None

scala> AttributeGroup.fromStructField(dfres.schema("hash"))
res6: org.apache.spark.ml.attribute.AttributeGroup = 
{"ml_attr":{"num_attrs":262144}}

but in this case the name should be of the form: hash_{#}
instead of V{#}
for example, when using VectorAssembler on the above:
import org.apache.spark.ml.feature.VectorAssembler
val va = new 
VectorAssembler().setInputCols(Array("y","z","hash")).setOutputCol("outputs")
scala> va.transform(dfres).show()
+---+--+---+++
|  y| x|  z|hash| outputs|
+---+--+---+++
|0.0|[a, b]|  4|(262144,[30913,22...|(262146,[1,30915,...|
|1.0|[b, c]|  6|(262144,[28698,30...|(262146,[0,1,2870...|
|1.0|[a, c]|  7|(262144,[28698,22...|(262146,[0,1,2870...|
|0.0|[b, c]|  7|(262144,[28698,30...|(262146,[1,28700,...|
+---+--+---+++

scala> 
print(AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get)
[Lorg.apache.spark.ml.attribute.Attribute;@4416197b
scala> 
AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get
res22: Array[org.apache.spark.ml.attribute.Attribute] = 
Array({"type":"numeric","idx":0,"name":"y"}, 
{"type":"numeric","idx":1,"name":"z"}, 
{"type":"numeric","idx":2,"name":"hash_0"}, 
{"type":"numeric","idx":3,"name":"hash_1"}, 
{"type":"numeric","idx":4,"name":"hash_2"}, 
{"type":"numeric","idx":5,"name":"hash_3"}, 
{"type":"numeric","idx":6,"name":"hash_4"}, 
{"type":"numeric","idx":7,"name":"hash_5"}, 
{"type":"numeric","idx":8,"name":"hash_6"}, 
{"type":"numeric","idx":9,"name":"hash_7"}, 
{"type":"numeric","idx":10,"name":"hash_8"}, 
{"type":"numeric","idx":11,"name":"hash_9"}, 
{"type":"numeric","idx":12,"name":"hash_10"}, 
{"type":"numeric","idx":13,"name":"hash_11"}, 
{"type":"numeric","idx":14,"name":"hash_12"}, 
{"type":"numeric","idx":15,"name":"hash_13"}, {"type":"numeric","idx":16,"nam...

you can see that the attributes are given the column name followed by the 
index.
This seems like a bug in the VectorAssembler, because it is making the 
schema dense when it should be sparse, but regardless this seems to be the more 
official way to represent the name of the attributes instead of using a "V" 
followed by index - unless you have seen the "V" + index used elsewhere?







---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-02-13 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16630#discussion_r100927396
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary 
private[regression] (
 "No p-value available for this GeneralizedLinearRegressionModel")
 }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+if (isNormalSolver) {
+  var featureNames = featureName
+  var coefficients = model.coefficients.toArray
+  var idx = Array.range(0, coefficients.length)
+  if (model.getFitIntercept) {
+featureNames = featureNames :+ "Intercept"
+coefficients = coefficients :+ model.intercept
+// Reorder so that intercept comes first
+idx = (coefficients.length - 1) +: idx
+  }
+  val result = for (i <- idx.toSeq) yield
+(featureNames(i), coefficients(i), coefficientStandardErrors(i), 
tValues(i), pValues(i))
+
+  val spark = SparkSession.builder().getOrCreate()
+  import spark.implicits._
--- End diff --

minor comment: might it be better to simplify this as "import 
dataset.sparkSession.implicits._", or is there a reason to prefer the 
SparkSession.builder().getOrCreate()?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

2017-01-17 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/16630

[SPARK-19270][ML] Add summary table to GLM summary

## What changes were proposed in this pull request?

Add R-like summary table to GLM summary, which includes feature name (if 
exist), parameter estimate, standard error, t-stat and p-value. This allows 
scala users to easily gather these commonly used inference results.

@srowen @yanboliang 

## How was this patch tested?
New tests. One for testing feature Name, and one for testing the summary 
Table. 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark glmTable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16630






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org