[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16630 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r128200757

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1458,4 +1475,167 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Coefficient matrix with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.3.0")
+  lazy val coefficientMatrix: Array[(String, Double, Double, Double, Double)] = {
+    if (isNormalSolver) {
+      var featureNamesLocal = featureNames
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNamesLocal = featureNamesLocal :+ "(Intercept)"
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx) yield
+        (featureNamesLocal(i), coefficients(i), coefficientStandardErrors(i),
+          tValues(i), pValues(i))
+      result
+    } else {
+      throw new UnsupportedOperationException(
+        "No summary table available for this GeneralizedLinearRegressionModel")
+    }
+  }
+
+  private def round(x: Double, digit: Int): String = {
+    BigDecimal(x).setScale(digit, BigDecimal.RoundingMode.HALF_UP).toString()
+  }
+
+  private[regression] def showString(_numRows: Int, truncate: Int = 20,
+      numDigits: Int = 3): String = {

--- End diff --

Align.
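The `round` helper quoted above leans on `scala.math.BigDecimal` with half-up rounding to render values at a fixed number of decimal places. A minimal standalone sketch of that idea (the helper itself is quoted from the diff; the sample inputs are illustrative):

```scala
// Round a Double to `digits` decimal places, rounding half away from zero,
// and return the result as a String -- mirrors the helper in the diff above.
def round(x: Double, digits: Int): String =
  BigDecimal(x).setScale(digits, BigDecimal.RoundingMode.HALF_UP).toString

println(round(2.3456, 3))  // 2.346
println(round(-1.0005, 3)) // -1.001 (HALF_UP rounds away from zero)
println(round(5.0, 2))     // 5.00 (trailing zeros kept by setScale)
```

Note that `setScale` pads with trailing zeros, so all rows of the summary table line up on the same number of decimals.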
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r128200260

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1458,4 +1475,167 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Coefficient matrix with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.3.0")
+  lazy val coefficientMatrix: Array[(String, Double, Double, Double, Double)] = {
+    if (isNormalSolver) {
+      var featureNamesLocal = featureNames
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNamesLocal = featureNamesLocal :+ "(Intercept)"
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx) yield
+        (featureNamesLocal(i), coefficients(i), coefficientStandardErrors(i),
+          tValues(i), pValues(i))
+      result
+    } else {
+      throw new UnsupportedOperationException(
+        "No summary table available for this GeneralizedLinearRegressionModel")
+    }
+  }
+
+  private def round(x: Double, digit: Int): String = {
+    BigDecimal(x).setScale(digit, BigDecimal.RoundingMode.HALF_UP).toString()
+  }
+
+  private[regression] def showString(_numRows: Int, truncate: Int = 20,
+      numDigits: Int = 3): String = {
+    val numRows = _numRows.max(1)
+    val data = coefficientMatrix.take(numRows)
+    val hasMoreData = coefficientMatrix.size > numRows
+
+    val colNames = Array("Feature", "Estimate", "StdError", "TValue", "PValue")
+    val numCols = colNames.size
+
+    val rows = colNames +: data.map( row => {
+      val mrow = for (cell <- row.productIterator) yield {
+        val str = cell match {
+          case s: String => s
+          case n: Double => round(n, numDigits).toString
+        }
+        if (truncate > 0 && str.length > truncate) {
+          // do not show ellipses for strings shorter than 4 characters.
+          if (truncate < 4) str.substring(0, truncate)
+          else str.substring(0, truncate - 3) + "..."
+        } else {
+          str
+        }
+      }
+      mrow.toArray
+    })
+
+    val sb = new StringBuilder
+    val colWidths = Array.fill(numCols)(3)
+
+    // Compute the width of each column
+    for (row <- rows) {
+      for ((cell, i) <- row.zipWithIndex) {
+        colWidths(i) = math.max(colWidths(i), cell.length)
+      }
+    }
+
+    // Create SeparateLine
+    val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString()
+
+    // column names
+    rows.head.zipWithIndex.map { case (cell, i) =>
+      if (truncate > 0) {
+        StringUtils.leftPad(cell, colWidths(i))
+      } else {
+        StringUtils.rightPad(cell, colWidths(i))
+      }
+    }.addString(sb, "|", "|", "|\n")
+    sb.append(sep)
+
+    // data
+    rows.tail.map {
+      _.zipWithIndex.map { case (cell, i) =>
+        if (truncate > 0) {
+          StringUtils.leftPad(cell.toString, colWidths(i))
+        } else {
+          StringUtils.rightPad(cell.toString, colWidths(i))
+        }
+      }.addString(sb, "|", "|", "|\n")
+    }
+
+    // For Data that has more than "numRows" records
+    if (hasMoreData) {
+      sb.append("...\n")
+      sb.append(sep)
+      val rowsString = if (numRows == 1) "row" else "rows"
+      sb.append(s"only showing top $numRows $rowsString\n")
+    } else {
+      sb.append(sep)
+    }
+
+    sb.append("\n")
+    sb.append(s"(Dispersion parameter for ${family.name} family taken to be " +
+      round(dispersion, numDigits) + ")")
+
+    sb.append("\n")
+    val nd = "Null deviance: " + round(nullDeviance, numDigits) +
+      s" on $degreesOfFreedom degrees of freedom"
+    val rd = "Residual deviance: " + round(deviance, numDigits) +
+      s" on $residualDegreeOfFreedom degrees of freedom"
+    val l = math.max(nd.length, rd.length)
+    sb.append(StringUtils.leftPad(nd, l))
+    sb.append("\n")
+    sb.append(StringUtils.leftPad(rd, l))
+
+    if (family.name != "tweedie") {
+      sb.append("\n")
+      sb.append(s"AIC: " + round(aic, numDigits))
+    }
+
+    sb.toString()
+  }
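The `showString` code quoted above combines two mechanisms: truncate each cell to a budget with a trailing ellipsis, then right-align every column to its widest cell (minimum width 3) with commons-lang padding. A self-contained sketch of the same two steps, using plain Scala instead of commons-lang `StringUtils` (the function names here are illustrative, not from the PR):

```scala
// Truncate a cell to `truncate` chars, appending "..." unless the budget
// is too small for an ellipsis -- the same rule as in the quoted diff.
def truncateCell(str: String, truncate: Int): String =
  if (truncate > 0 && str.length > truncate) {
    if (truncate < 4) str.substring(0, truncate)
    else str.substring(0, truncate - 3) + "..."
  } else str

// Right-align each column to its widest cell (minimum width 3) and draw
// +---+ separators around the header, as showString does.
def formatTable(rows: Seq[Seq[String]]): String = {
  val numCols = rows.head.length
  val widths = Array.fill(numCols)(3)
  for (row <- rows; (cell, i) <- row.zipWithIndex)
    widths(i) = math.max(widths(i), cell.length)
  val sep = widths.map("-" * _).mkString("+", "+", "+")
  val body = rows.map(_.zipWithIndex.map { case (c, i) =>
    " " * (widths(i) - c.length) + c // left-pad, i.e. right-align
  }.mkString("|", "|", "|"))
  (sep +: body.head +: sep +: body.tail).mkString("", "\n", "\n" + sep)
}

println(formatTable(Seq(Seq("Feature", "PValue"), Seq("x1", "0.05"))))
```

The `rows.head.zipWithIndex.map { ... }` and `rows.tail.map { ... }` calls in the quoted diff are used purely for their side effect on `sb`, which is why reviewers would typically suggest `foreach` there.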
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r128202464

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1458,4 +1475,167 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Coefficient matrix with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.3.0")
+  lazy val coefficientMatrix: Array[(String, Double, Double, Double, Double)] = {

--- End diff --

This is not a matrix, so it's not appropriate to name it ```coefficientMatrix```. Since it's only used for generating the summary string output, what about keeping it private or inlining it?
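The reordering trick in the quoted diff prepends the intercept's index so it prints first without rebuilding the arrays. A tiny standalone illustration (plain Scala; the coefficient values are taken from the test expectations later in this thread, the variable names are illustrative):

```scala
// Features plus an intercept appended at the end, as in the quoted code.
val names = Array("x1", "x2", "(Intercept)")
val coefs = Array(0.2258, 0.4677, 0.7903)

// Index vector that visits the last slot (the intercept) first,
// then the original feature order.
val idx = (coefs.length - 1) +: Array.range(0, coefs.length - 1)

val ordered = idx.map(i => (names(i), coefs(i)))
// ordered: ("(Intercept)", 0.7903), ("x1", 0.2258), ("x2", 0.4677)
```

This keeps `coefficients`, `coefficientStandardErrors`, `tValues`, and `pValues` in their natural positions and only changes the order of traversal.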
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r127853762

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -452,6 +452,8 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
   private[regression] val epsilon: Double = 1E-16
+
+  private[regression] val Intercept: String = "(Intercept)"

--- End diff --

Removed.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r127844484

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {

--- End diff --

Updated it as `coefficientMatrix`.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r127844472

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")

--- End diff --

Done
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r127844463

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1187,6 +1189,23 @@ class GeneralizedLinearRegressionSummary private[regression] (
   @Since("2.2.0")
   lazy val numInstances: Long = predictions.count()
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to feature column name with numbered suffix "_0", "_1", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureNames: Array[String] = {

--- End diff --

Made it `private[ml]` since it is used in the R wrapper.
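Per the doc comment above, when attribute names can't be read from the features column's metadata, the summary falls back to the column name with a numbered suffix. A sketch of that fallback in plain Scala (the helper name and signature are hypothetical, not the PR's actual implementation):

```scala
// Fall back to "<featuresCol>_0", "<featuresCol>_1", ... when no attribute
// names are available in the ML column metadata.
def defaultFeatureNames(featuresCol: String, numFeatures: Int): Array[String] =
  Array.tabulate(numFeatures)(i => s"${featuresCol}_$i")

defaultFeatureNames("features", 2) // Array("features_0", "features_1")
```

These are exactly the names the test suite later in this thread expects for the attribute-free dataset (`features_0`, `features_1`).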
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r126872238

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {

--- End diff --

Could we have a better name? What about outputting a model summary like R does? Then we could directly override the ```toString``` function.
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r126871919

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -452,6 +452,8 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
   private[regression] val epsilon: Double = 1E-16
+
+  private[regression] val Intercept: String = "(Intercept)"

--- End diff --

If this is only used once, it's better to eliminate it.
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r126873611

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNamesLocal = featureNames
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNamesLocal = featureNamesLocal :+ Intercept
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNamesLocal(i), coefficients(i), coefficientStandardErrors(i),
+          tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()
+      import spark.implicits._
+      result.toDF("Feature", "Coefficient", "StdError", "TValue", "PValue").repartition(1)

--- End diff --

Could you let me know the reason for wrapping the result as a ```DataFrame```? I think a local 2D array is enough. ```DataFrame``` adds extra cost, and it will be collected to a local array anyway when you call ```toString```.
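The point in the comment above is that a local array of tuples carries the same information as the `DataFrame` with no Spark job, and columns can still be pulled out with a plain `map`. A sketch of that alternative (the estimate and standard-error values are taken from the test expectations later in this thread; the tValue/pValue slots are 0.0 placeholders for illustration):

```scala
// One row per coefficient: (feature, estimate, stdError, tValue, pValue).
val table: Array[(String, Double, Double, Double, Double)] = Array(
  ("(Intercept)", 0.7903, 4.0129, 0.0, 0.0), // tValue/pValue: placeholders
  ("x1",          0.2258, 2.1153, 0.0, 0.0),
  ("x2",          0.4677, 0.5815, 0.0, 0.0)
)

// Local equivalent of summaryTable.select("StdError").collect.map(_.getDouble(0)):
val stdErrors = table.map(_._3)
// stdErrors: Array(4.0129, 2.1153, 0.5815)
```

No `SparkSession`, no `toDF`, no `repartition(1)`; the whole table lives on the driver, which is all `toString` needs.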
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r126873706

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1187,6 +1189,23 @@ class GeneralizedLinearRegressionSummary private[regression] (
   @Since("2.2.0")
   lazy val numInstances: Long = predictions.count()
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to feature column name with numbered suffix "_0", "_1", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureNames: Array[String] = {

--- End diff --

Could we keep it private?
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r126872077

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1441,4 +1460,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")

--- End diff --

Bump up the since version.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103004564

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1152,4 +1173,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNamesLocal = featureNames
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNamesLocal = featureNamesLocal :+ Intercept
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNamesLocal(i), coefficients(i), coefficientStandardErrors(i),
+          tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()

--- End diff --

@felixcheung Would you elaborate on your concern here? The reason I create the SparkSession is to convert the `Seq` to a `DataFrame` using `toDF`. Is there a way we can create a data frame without explicitly using a Spark session? Thanks.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103003591

--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala ---

@@ -99,37 +95,23 @@ private[r] object GeneralizedLinearRegressionWrapper
     val summary = glm.summary
     val rFeatures: Array[String] = if (glm.getFitIntercept) {
-      Array("(Intercept)") ++ features
+      Array("(Intercept)") ++ summary.featureNames
     } else {
-      features
+      summary.featureNames
     }
     val rCoefficients: Array[Double] = if (summary.isNormalSolver) {
-      val rCoefficientStandardErrors = if (glm.getFitIntercept) {
-        Array(summary.coefficientStandardErrors.last) ++
-          summary.coefficientStandardErrors.dropRight(1)
-      } else {
-        summary.coefficientStandardErrors
-      }
+      val rCoefficientStandardErrors =
+        summary.summaryTable.select("StdError").collect.map(_.getDouble(0))

-      val rTValues = if (glm.getFitIntercept) {
-        Array(summary.tValues.last) ++ summary.tValues.dropRight(1)
-      } else {
-        summary.tValues
-      }
+      val rTValues =
+        summary.summaryTable.select("TValue").collect.map(_.getDouble(0))

-      val rPValues = if (glm.getFitIntercept) {
-        Array(summary.pValues.last) ++ summary.pValues.dropRight(1)
-      } else {
-        summary.pValues
-      }
+      val rPValues =
+        summary.summaryTable.select("PValue").collect.map(_.getDouble(0))

-      if (glm.getFitIntercept) {
-        Array(glm.intercept) ++ glm.coefficients.toArray ++
-          rCoefficientStandardErrors ++ rTValues ++ rPValues
-      } else {
-        glm.coefficients.toArray ++ rCoefficientStandardErrors ++ rTValues ++ rPValues
-      }
+      summary.summaryTable.select("Coefficient").collect.map(_.getDouble(0)) ++
+        rCoefficientStandardErrors ++ rTValues ++ rPValues

--- End diff --

Yes, I checked the results, and they are the same.
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r103003515

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -34,6 +35,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
+import org.apache.spark.sql.SparkSession

--- End diff --

Thanks. Sorted the imports.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r102564640

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -34,6 +35,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
+import org.apache.spark.sql.SparkSession

--- End diff --

we generally try to sort the imports...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r102566039

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1152,4 +1173,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
        "No p-value available for this GeneralizedLinearRegressionModel")
    }
  }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNamesLocal = featureNames
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNamesLocal = featureNamesLocal :+ Intercept
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNamesLocal(i), coefficients(i), coefficientStandardErrors(i),
+          tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()

--- End diff --

I'm concerned about this - the current session might not always be the right one. If we need an instance of SparkSession, it would be preferable to obtain it the way MLReader/BaseReadWrite does: https://github.com/apache/spark/blob/04ee8cf633e17b6bf95225a8dd77bf2e06980eb3/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L59
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r102564507

--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala ---

@@ -99,37 +95,23 @@ private[r] object GeneralizedLinearRegressionWrapper
     val summary = glm.summary
     val rFeatures: Array[String] = if (glm.getFitIntercept) {
-      Array("(Intercept)") ++ features
+      Array("(Intercept)") ++ summary.featureNames
     } else {
-      features
+      summary.featureNames
     }
     val rCoefficients: Array[Double] = if (summary.isNormalSolver) {
-      val rCoefficientStandardErrors = if (glm.getFitIntercept) {
-        Array(summary.coefficientStandardErrors.last) ++
-          summary.coefficientStandardErrors.dropRight(1)
-      } else {
-        summary.coefficientStandardErrors
-      }
+      val rCoefficientStandardErrors =
+        summary.summaryTable.select("StdError").collect.map(_.getDouble(0))

-      val rTValues = if (glm.getFitIntercept) {
-        Array(summary.tValues.last) ++ summary.tValues.dropRight(1)
-      } else {
-        summary.tValues
-      }
+      val rTValues =
+        summary.summaryTable.select("TValue").collect.map(_.getDouble(0))

-      val rPValues = if (glm.getFitIntercept) {
-        Array(summary.pValues.last) ++ summary.pValues.dropRight(1)
-      } else {
-        summary.pValues
-      }
+      val rPValues =
+        summary.summaryTable.select("PValue").collect.map(_.getDouble(0))

-      if (glm.getFitIntercept) {
-        Array(glm.intercept) ++ glm.coefficients.toArray ++
-          rCoefficientStandardErrors ++ rTValues ++ rPValues
-      } else {
-        glm.coefficients.toArray ++ rCoefficientStandardErrors ++ rTValues ++ rPValues
-      }
+      summary.summaryTable.select("Coefficient").collect.map(_.getDouble(0)) ++
+        rCoefficientStandardErrors ++ rTValues ++ rPValues

--- End diff --

Could you run a quick check to see if the values in the SparkR summary are the same before and after this change?
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101822942

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---

@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
       .fit(datasetGaussianIdentity.as[LabeledPoint])
   }

+  test("glm summary: feature name") {
+    // dataset1 with no attribute
+    val dataset1 = Seq(
+      Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+      Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+      Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+      Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+      Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+    ).toDF()
+
+    // dataset2 with attribute
+    val datasetTmp = Seq(
+      (2.0, 1.0, 0.0, 5.0),
+      (8.0, 2.0, 1.0, 7.0),
+      (3.0, 3.0, 2.0, 11.0),
+      (9.0, 4.0, 3.0, 13.0),
+      (2.0, 5.0, 2.0, 3.0)
+    ).toDF("y", "w", "x1", "x2")
+    val formula = new RFormula().setFormula("y ~ x1 + x2")
+    val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+    val expectedFeature = Seq(Array("features_0", "features_1"), Array("x1", "x2"))
+
+    var idx = 0
+    for (dataset <- Seq(dataset1, dataset2)) {
+      val model = new GeneralizedLinearRegression().fit(dataset)
+      model.summary.featureNames.zip(expectedFeature(idx))
+        .foreach { x => assert(x._1 === x._2) }
+      idx += 1
+    }
+  }
+
+  test("glm summary: summaryTable") {
+    val dataset = Seq(
+      Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+      Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+      Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+      Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+      Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+    ).toDF()
+
+    val expectedFeature = Seq(Array("features_0", "features_1"),
+      Array("(Intercept)", "features_0", "features_1"))
+    val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),

--- End diff --

Thanks. Added in R code.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101822971

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
[... same hunk as quoted above ...]
+      summaryTable.select("Feature").collect.map(_.getString(0))
+        .zip(expectedFeature(idx)).foreach { x => assert(x._1 === x._2,
+          "Feature name mismatch in summaryTable") }
+      assert(Vectors.dense(summaryTable.select("Coefficient").rdd.collect.map(_.getDouble(0)))
+        ~== expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in summaryTable")
+      assert(Vectors.dense(summaryTable.select("StdError").rdd.collect.map(_.getDouble(0)))
+        ~== expectedStdError(idx) absTol 1E-3, "Standard error mismatch in summaryTable")
+      assert(Vectors.dense(summaryTable.select("TValue").rdd.collect.map(_.getDouble(0)))
--- End diff --

Fixed.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101362237

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
[... same hunk as quoted above ...]
+      assert(Vectors.dense(summaryTable.select("StdError").rdd.collect.map(_.getDouble(0)))
+        ~== expectedStdError(idx) absTol 1E-3, "Standard error mismatch in summaryTable")
+      assert(Vectors.dense(summaryTable.select("TValue").rdd.collect.map(_.getDouble(0)))
--- End diff --

it looks like for all of these below you can just call collect instead of rdd.collect?
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101362084

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
[... same hunk as quoted above ...]
+    val expectedFeature = Seq(Array("features_0", "features_1"),
+      Array("(Intercept)", "features_0", "features_1"))
+    val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
--- End diff --

is this comparing the summary to the results of R? If so, in general you should add the R code in a comment that was used to generate the expected results, so that the expected values are reproducible.
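The convention being asked for here can be sketched as follows. The R snippet is illustrative only and has not been re-run against this test's data; the expected numbers are the ones already quoted in the PR.

```scala
/*
  Illustrative R code of the kind usually pasted next to expected values in
  the Spark ML test suites (not verified against this exact test):

  df <- data.frame(y = c(2, 8, 3, 9, 2),
                   x1 = c(0, 1, 2, 3, 2),
                   x2 = c(5, 7, 11, 13, 3))
  model <- glm(y ~ x1 + x2, data = df)
  print(summary(model)$coefficients)
*/
// Expected estimates quoted in the PR for the fitIntercept = true case.
val expectedEstimate = Array(0.7903, 0.2258, 0.4677)
```

Keeping the generating script in a comment means anyone can re-derive the constants when the test data changes.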
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101357001

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -915,6 +919,23 @@ class GeneralizedLinearRegressionSummary private[regression] (
   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
 
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to feature column name with numbered suffix "_0", "_1", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureNames: Array[String] = {
+    val featureAttrs = AttributeGroup.fromStructField(
+      dataset.schema(model.getFeaturesCol)).attributes
+    if (featureAttrs == None) {
+      Array.tabulate[String](origModel.numFeatures)(
+        (x: Int) => (model.getFeaturesCol + "_" + x))
--- End diff --

in general I would have preferred to create a platform-level function (or use one if it exists) to format the strings in the same way, so there is no duplicate code in VectorAssembler vs here that can diverge (and which other functions in spark can generally use). However, this seems a bit out of scope of this code review, so I don't think you need to do this.
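A shared helper along the lines suggested might look like this sketch; defaultFeatureNames is a hypothetical name, not an existing Spark function:

```scala
// Hypothetical shared helper: one place that formats default feature names,
// so VectorAssembler-style code and the GLM summary cannot diverge.
def defaultFeatureNames(featuresCol: String, numFeatures: Int): Array[String] =
  Array.tabulate(numFeatures)(i => s"${featuresCol}_$i")

val names = defaultFeatureNames("features", 3)
// names: Array("features_0", "features_1", "features_2")
```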
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101356475

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -1152,4 +1173,33 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
[...]
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNamesLocal = featureNames
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNamesLocal = featureNamesLocal :+ Intercept
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNamesLocal(i), coefficients(i), coefficientStandardErrors(i),
+          tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()
+      import spark.implicits._
+      result.toDF("Feature", "Coefficient", "StdError", "TValue", "PValue").repartition(1)
--- End diff --

Sorry, I didn't realize that R uses Estimate instead of Coefficient; if you feel strongly about using Estimate here instead, you can change this back. Up to you.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101356243

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
[... same hunk as quoted above ...]
+    var idx = 0
+    for (fitIntercept <- Seq(false, true)) {
+      val trainer = new GeneralizedLinearRegression()
+        .setFamily("gaussian")
--- End diff --

I would usually prefer to use variables wherever possible, as it is much easier to update through various editors and in general it is much easier to catch compile-time vs runtime errors. But it is a minor point, and it looks like this is consistent with most of the spark codebase.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101220747

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
[... same hunk as quoted above ...]
+    var idx = 0
+    for (fitIntercept <- Seq(false, true)) {
+      val trainer = new GeneralizedLinearRegression()
+        .setFamily("gaussian")
--- End diff --

Gaussian.name.toLowerCase (or Gaussian.name, since it is converted to lowercase later) would generally be the approach. but this is a test suite, I think it's ok
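The point about preferring Gaussian.name over a bare string literal can be sketched with a stand-in object; Spark's real Gaussian family object lives inside GeneralizedLinearRegression, so the definition below is only a mock:

```scala
// Mock of the family object; in Spark the real one is defined in
// GeneralizedLinearRegression's companion and is package-private.
object Gaussian { val name: String = "Gaussian" }

// Referring to the object gives a compile-time check against typos that a
// repeated "gaussian" string literal only surfaces at runtime.
val family = Gaussian.name.toLowerCase
```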
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101220572

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (
[...]
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
+    val featureAttrs = AttributeGroup.fromStructField(
+      dataset.schema(model.getFeaturesCol)).attributes
+    if (featureAttrs == None) {
+      Array.tabulate[String](origModel.numFeatures)((x: Int) => ("V" + (x + 1)))
--- End diff --

quite possibly - could you check what would be new or removed with that approach?
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101160217

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
[...]
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNames = featureName
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNames = featureNames :+ "Intercept"
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNames(i), coefficients(i), coefficientStandardErrors(i), tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()
+      import spark.implicits._
+      result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
--- End diff --

R was using 'Estimate'. I changed it to 'Coefficient' now.
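The intercept-reordering step in this hunk is plain Scala and can be checked without Spark. This sketch mirrors the PR's variables; the coefficient values are made up for illustration:

```scala
// Mirrors the summaryTable reordering: idx is built before the intercept is
// appended, then the intercept's position is prepended so its row comes first.
var featureNames = Array("x1", "x2")
var coefficients = Array(0.2258, 0.4677)  // made-up values
var idx = Array.range(0, coefficients.length)

val fitIntercept = true                   // stands in for model.getFitIntercept
if (fitIntercept) {
  featureNames = featureNames :+ "Intercept"
  coefficients = coefficients :+ 0.7903   // stands in for model.intercept
  // Reorder so that intercept comes first
  idx = (coefficients.length - 1) +: idx
}

val rows = idx.map(i => (featureNames(i), coefficients(i)))
// rows.head is ("Intercept", 0.7903)
```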
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101160069

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (
[...]
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
--- End diff --

OK, changed.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159255

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
[... same hunk as quoted above ...]
+    var idx = 0
+    for (fitIntercept <- Seq(false, true)) {
+      val trainer = new GeneralizedLinearRegression()
+        .setFamily("gaussian")
--- End diff --

Indeed, there is an object Gaussian, and one can use Gaussian.name for the string name.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159105

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (
[...]
+  lazy val featureName: Array[String] = {
+    val featureAttrs = AttributeGroup.fromStructField(
+      dataset.schema(model.getFeaturesCol)).attributes
+    if (featureAttrs == None) {
--- End diff --

@imatiach-msft This makes sense. I have now changed the code to mirror the same logic: when attributes are missing, the default name is set to the feature column name with suffix "_0", "_1", etc.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101159146

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
[...]
+      if (model.getFitIntercept) {
+        featureNames = featureNames :+ "Intercept"
--- End diff --

Done
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101158825

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
[...]
+      val spark = SparkSession.builder().getOrCreate()
+      import spark.implicits._
--- End diff --

I was using the spark session and implicits to be able to use toDF to create a data frame with names from a Seq. Could you explain how this import dataset.sparkSession.implicits._ works? Could not import it in the spark shell:

```
:56: error: not found: value dataset
       import dataset.sparkSession.implicits._
```
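For what it's worth, import dataset.sparkSession.implicits._ is an ordinary path-dependent import: it works inside the summary class because dataset is a stable val in scope there, and fails in the shell precisely because no value named dataset exists. A pure-Scala mock shows the mechanism; the class names below are stand-ins, not Spark's:

```scala
// Stand-ins for Spark's SparkSession/Dataset, just to show the import rule.
class MockSparkSession {
  object implicits {
    implicit class RichInt(private val i: Int) {
      def doubled: Int = i * 2
    }
  }
}
class MockDataset(val sparkSession: MockSparkSession)

// `dataset` is a stable identifier, so its nested members can be imported,
// exactly as the summary class can import from its `dataset` field.
val dataset = new MockDataset(new MockSparkSession)
import dataset.sparkSession.implicits._

val y = 21.doubled
```

In the PR's code, SparkSession.builder().getOrCreate() plus import spark.implicits._ achieves the same thing; importing via dataset.sparkSession would simply reuse the session already attached to the dataset.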
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16630#discussion_r101158362

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (
[...]
+    if (featureAttrs == None) {
+      Array.tabulate[String](origModel.numFeatures)((x: Int) => ("V" + (x + 1)))
--- End diff --

@felixcheung The feature names were not available prior to this PR, right? One other place I see that does a similar summary is the GeneralizedLinearRegressionWrapper for R. Do you think we should consolidate the two, e.g., update the Wrapper to use the summary table directly?
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100933747

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (

   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
--- End diff --

Minor comment: since this returns an array, the name should be plural — "featureNames" instead of "featureName".
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100933207

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
         "No p-value available for this GeneralizedLinearRegressionModel")
     }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNames = featureName
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNames = featureNames :+ "Intercept"
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNames(i), coefficients(i), coefficientStandardErrors(i), tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()
+      import spark.implicits._
+      result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
--- End diff --

Question: is "Estimate" the better term to use here, as opposed to "Coefficient"? Are there other libraries which use this specific term in this case?
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100932561

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
         "No p-value available for this GeneralizedLinearRegressionModel")
     }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNames = featureName
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNames = featureNames :+ "Intercept"
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNames(i), coefficients(i), coefficientStandardErrors(i), tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()
+      import spark.implicits._
+      result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
+    } else {
+      throw new UnsupportedOperationException(
+        "No summary table available for this GeneralizedLinearRegressionModel")
--- End diff --

Minor suggestion: it would be nice to add a test to verify this exception is thrown (and with the right error message, using the withClue() check).
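A sketch of the kind of check being suggested, in the ScalaTest style the suite already uses. How to construct a summary without a normal solver is not shown in this diff, so `summary` below is a placeholder for such a summary object:

```scala
// Sketch only: `summary` stands for a GLM training summary produced by a
// solver for which no summary table exists (construction not shown here).
withClue("summaryTable should fail with the documented message: ") {
  val e = intercept[UnsupportedOperationException] {
    summary.summaryTable
  }
  assert(e.getMessage.contains(
    "No summary table available for this GeneralizedLinearRegressionModel"))
}
```

`withClue` attaches the extra context string to the failure report if the `intercept` or `assert` inside it fails, which makes the broken expectation obvious in CI logs.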
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100932182

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---

@@ -1104,6 +1103,83 @@ class GeneralizedLinearRegressionSuite
       .fit(datasetGaussianIdentity.as[LabeledPoint])
   }

+
+  test("glm summary: feature name") {
+    // dataset1 with no attribute
+    val dataset1 = Seq(
+      Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+      Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+      Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+      Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+      Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+    ).toDF()
+
+    // dataset2 with attribute
+    val datasetTmp = Seq(
+      (2.0, 1.0, 0.0, 5.0),
+      (8.0, 2.0, 1.0, 7.0),
+      (3.0, 3.0, 2.0, 11.0),
+      (9.0, 4.0, 3.0, 13.0),
+      (2.0, 5.0, 2.0, 3.0)
+    ).toDF("y", "w", "x1", "x2")
+    val formula = new RFormula().setFormula("y ~ x1 + x2")
+    val dataset2 = formula.fit(datasetTmp).transform(datasetTmp)
+
+    val expectedFeature = Seq(Array("V1", "V2"), Array("x1", "x2"))
+
+    var idx = 0
+    for (dataset <- Seq(dataset1, dataset2)) {
+      val model = new GeneralizedLinearRegression().fit(dataset)
+      model.summary.featureName.zip(expectedFeature(idx))
+        .foreach { x => assert(x._1 === x._2) }
+      idx += 1
+    }
+  }
+
+  test("glm summary: summaryTable") {
+    val dataset = Seq(
+      Instance(2.0, 1.0, Vectors.dense(0.0, 5.0)),
+      Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
+      Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
+      Instance(9.0, 4.0, Vectors.dense(3.0, 13.0)),
+      Instance(2.0, 5.0, Vectors.dense(2.0, 3.0))
+    ).toDF()
+
+    val expectedFeature = Seq(Array("V1", "V2"),
+      Array("Intercept", "V1", "V2"))
+    val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
+      Vectors.dense(0.7903, 0.2258, 0.4677))
+    val expectedStdError = Seq(Vectors.dense(1.724, 0.3787),
+      Vectors.dense(4.0129, 2.1153, 0.5815))
+    val expectedTValue = Seq(Vectors.dense(0.1673, 1.4205),
+      Vectors.dense(0.1969, 0.1067, 0.8043))
+    val expectedPValue = Seq(Vectors.dense(0.8778, 0.2506),
+      Vectors.dense(0.8621, 0.9247, 0.5056))
+
+    var idx = 0
+    for (fitIntercept <- Seq(false, true)) {
+      val trainer = new GeneralizedLinearRegression()
+        .setFamily("gaussian")
--- End diff --

Not related to this code review, but it's unfortunate that these aren't constants that can be referenced from the model; it's messy to have to type strings like this everywhere as opposed to referencing variables.
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100931568

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
         "No p-value available for this GeneralizedLinearRegressionModel")
     }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNames = featureName
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNames = featureNames :+ "Intercept"
--- End diff --

Would it be possible to move this string literal ("Intercept") to a constant?
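One possible shape for that suggestion; the object and constant names below are purely illustrative, not part of the PR:

```scala
// Illustrative sketch: hoist the literal into a single constant so the
// intercept label is defined in one place and reused at every call site.
object GlmSummaryLabels {
  val InterceptLabel: String = "Intercept"
}

// Use site inside summaryTable would then read:
//   featureNames = featureNames :+ GlmSummaryLabels.InterceptLabel
```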
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100931445

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (

   /** Number of instances in DataFrame predictions. */
   private[regression] lazy val numInstances: Long = predictions.count()
+
+  /**
+   * Name of features. If the name cannot be retrieved from attributes,
+   * set default names to "V1", "V2", and so on.
+   */
+  @Since("2.2.0")
+  lazy val featureName: Array[String] = {
+    val featureAttrs = AttributeGroup.fromStructField(
+      dataset.schema(model.getFeaturesCol)).attributes
+    if (featureAttrs == None) {
--- End diff --

If I do the example below in spark-shell:

    import org.apache.spark.ml.feature.HashingTF
    val tf = new HashingTF().setInputCol("x").setOutputCol("hash")
    val df = spark.createDataFrame(Seq(
      Tuple3(0.0, Array("a", "b"), 4),
      Tuple3(1.0, Array("b", "c"), 6),
      Tuple3(1.0, Array("a", "c"), 7),
      Tuple3(0.0, Array("b", "c"), 7))).toDF("y", "x", "z")
    val dfres = tf.transform(df)

when doing show():

    scala> dfres.show
    +---+------+---+--------------------+
    |  y|     x|  z|                hash|
    +---+------+---+--------------------+
    |0.0|[a, b]|  4|(262144,[30913,22...|
    |1.0|[b, c]|  6|(262144,[28698,30...|
    |1.0|[a, c]|  7|(262144,[28698,22...|
    |0.0|[b, c]|  7|(262144,[28698,30...|
    +---+------+---+--------------------+

but, when I look at the schema:

    import org.apache.spark.ml.attribute.AttributeGroup

    scala> AttributeGroup.fromStructField(dfres.schema("hash")).attributes
    res5: Option[Array[org.apache.spark.ml.attribute.Attribute]] = None

    scala> AttributeGroup.fromStructField(dfres.schema("hash"))
    res6: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"num_attrs":262144}}

In this case the name should be of the form hash_{#} instead of V{#}. For example, when using VectorAssembler on the above:

    import org.apache.spark.ml.feature.VectorAssembler
    val va = new VectorAssembler().setInputCols(Array("y", "z", "hash")).setOutputCol("outputs")

    scala> va.transform(dfres).show()
    +---+------+---+--------------------+--------------------+
    |  y|     x|  z|                hash|             outputs|
    +---+------+---+--------------------+--------------------+
    |0.0|[a, b]|  4|(262144,[30913,22...|(262146,[1,30915,...|
    |1.0|[b, c]|  6|(262144,[28698,30...|(262146,[0,1,2870...|
    |1.0|[a, c]|  7|(262144,[28698,22...|(262146,[0,1,2870...|
    |0.0|[b, c]|  7|(262144,[28698,30...|(262146,[1,28700,...|
    +---+------+---+--------------------+--------------------+

    scala> print(AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get)
    [Lorg.apache.spark.ml.attribute.Attribute;@4416197b

    scala> AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get
    res22: Array[org.apache.spark.ml.attribute.Attribute] = Array({"type":"numeric","idx":0,"name":"y"},
      {"type":"numeric","idx":1,"name":"z"}, {"type":"numeric","idx":2,"name":"hash_0"},
      {"type":"numeric","idx":3,"name":"hash_1"}, {"type":"numeric","idx":4,"name":"hash_2"},
      {"type":"numeric","idx":5,"name":"hash_3"}, {"type":"numeric","idx":6,"name":"hash_4"},
      {"type":"numeric","idx":7,"name":"hash_5"}, {"type":"numeric","idx":8,"name":"hash_6"},
      {"type":"numeric","idx":9,"name":"hash_7"}, {"type":"numeric","idx":10,"name":"hash_8"},
      {"type":"numeric","idx":11,"name":"hash_9"}, {"type":"numeric","idx":12,"name":"hash_10"},
      {"type":"numeric","idx":13,"name":"hash_11"}, {"type":"numeric","idx":14,"name":"hash_12"},
      {"type":"numeric","idx":15,"name":"hash_13"}, {"type":"numeric","idx":16,"nam...

You can see that the attributes are given the column name followed by the index. This seems like a bug in the VectorAssembler, because it is making the schema dense when it should be sparse, but regardless this seems to be the more official way to represent the names of the attributes, instead of using a "V" followed by the index. Unless you have seen "V" + index used elsewhere?
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100927396

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

@@ -1152,4 +1170,32 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] (
         "No p-value available for this GeneralizedLinearRegressionModel")
     }
   }
+
+  /**
+   * Summary table with feature name, coefficient, standard error,
+   * tValue and pValue.
+   */
+  @Since("2.2.0")
+  lazy val summaryTable: DataFrame = {
+    if (isNormalSolver) {
+      var featureNames = featureName
+      var coefficients = model.coefficients.toArray
+      var idx = Array.range(0, coefficients.length)
+      if (model.getFitIntercept) {
+        featureNames = featureNames :+ "Intercept"
+        coefficients = coefficients :+ model.intercept
+        // Reorder so that intercept comes first
+        idx = (coefficients.length - 1) +: idx
+      }
+      val result = for (i <- idx.toSeq) yield
+        (featureNames(i), coefficients(i), coefficientStandardErrors(i), tValues(i), pValues(i))
+
+      val spark = SparkSession.builder().getOrCreate()
+      import spark.implicits._
--- End diff --

Minor comment: might it be better to simplify this as `import dataset.sparkSession.implicits._`, or is there a reason to prefer `SparkSession.builder().getOrCreate()`?
[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...
GitHub user actuaryzhang opened a pull request:
https://github.com/apache/spark/pull/16630

[SPARK-19270][ML] Add summary table to GLM summary

## What changes were proposed in this pull request?

Add an R-like summary table to the GLM summary, which includes feature names (if they exist), parameter estimates, standard errors, t-statistics and p-values. This allows Scala users to easily gather these commonly used inference results.

@srowen @yanboliang

## How was this patch tested?

New tests: one for testing the feature names, and one for testing the summary table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark glmTable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16630.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16630