[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r123925697

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
    @@ -22,22 +22,22 @@ package org.apache.spark.mllib.evaluation.binary
      */
     private[evaluation] trait BinaryConfusionMatrix {
       /** number of true positives */
    -  def numTruePositives: Long
    +  def numTruePositives: Double
    --- End diff --

    good idea, updated the names of the variables

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r123925523

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
    @@ -146,11 +160,13 @@ class BinaryClassificationMetrics @Since("1.3.0") (
       private lazy val (
         cumulativeCounts: RDD[(Double, BinaryLabelCounter)],
         confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
    -    // Create a bin for each distinct score value, count positives and negatives within each bin,
    -    // and then sort by score values in descending order.
    -    val counts = scoreAndLabels.combineByKey(
    -      createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
    -      mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
    +    // Create a bin for each distinct score value, count weighted positives and
    +    // negatives within each bin, and then sort by score values in descending order.
    +    val counts = scoreAndLabelsWithWeights.combineByKey(
    +      createCombiner = (labelAndWeight: (Double, Double)) =>
    +        new BinaryLabelCounter(0L, 0L) += (labelAndWeight._1, labelAndWeight._2),
    --- End diff --

    updated, thanks!
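The weighted counting this thread discusses can be sketched without Spark. The sketch below is a simplified, local stand-in for the `combineByKey` logic: `BinaryLabelCounter` and `countsByScore` here are illustrative names and not the actual MLlib internals, which operate on RDDs.

```scala
// Spark-free sketch of the weighted per-score counting in the diff above.
// Assumption: a simplified BinaryLabelCounter that accumulates weighted
// positive and negative counts as Doubles (the real class differs).
class BinaryLabelCounter(var weightedPositives: Double = 0.0,
                         var weightedNegatives: Double = 0.0) {
  // Accumulate one (label, weight) observation; label > 0.5 counts as positive.
  def +=(label: Double, weight: Double): this.type = {
    if (label > 0.5) weightedPositives += weight else weightedNegatives += weight
    this
  }
  // Merge another counter, as combineByKey's mergeCombiners step would.
  def merge(other: BinaryLabelCounter): this.type = {
    weightedPositives += other.weightedPositives
    weightedNegatives += other.weightedNegatives
    this
  }
}

object WeightedCounts {
  // Group (score, (label, weight)) rows by score and build one counter per
  // distinct score value, mimicking combineByKey on a local Seq.
  def countsByScore(rows: Seq[(Double, (Double, Double))]): Map[Double, BinaryLabelCounter] =
    rows.groupBy(_._1).map { case (score, group) =>
      val c = new BinaryLabelCounter()
      group.foreach { case (_, (label, weight)) => c += (label, weight) }
      score -> c
    }
}
```

With weights present, each bin holds a weighted sum per label rather than an integer tally, which is why the counter fields are `Double`.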
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r123925437

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
    @@ -41,13 +41,27 @@ import org.apache.spark.sql.DataFrame
      *    partition boundaries.
      */
     @Since("1.0.0")
    -class BinaryClassificationMetrics @Since("1.3.0") (
    -    @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
    -    @Since("1.3.0") val numBins: Int) extends Logging {
    +class BinaryClassificationMetrics @Since("2.2.0") (
    +    val numBins: Int,
    +    @Since("2.2.0") val scoreAndLabelsWithWeights: RDD[(Double, (Double, Double))])
    +  extends Logging {

       require(numBins >= 0, "numBins must be nonnegative")

       /**
    +   * Retrieves the score and labels (for binary compatibility).
    +   * @return The score and labels.
    +   */
    +  @Since("1.0.0")
    --- End diff --

    good catch, updated the version annotation for both:
    1.) def scoreAndLabels: RDD[(Double, Double)]
    2.) def this(@Since("1.3.0") scoreAndLabels: RDD[(Double, Double)],
        @Since("1.3.0") numBins: Int)
    ... to 1.3.0
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r123925177

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
    @@ -77,12 +87,16 @@ class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override va
         SchemaUtils.checkNumericType(schema, $(labelCol))

         // TODO: When dataset metadata has been implemented, check rawPredictionCol vector length = 2.
    -    val scoreAndLabels =
    -      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
    -        case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1), label)
    -        case Row(rawPrediction: Double, label: Double) => (rawPrediction, label)
    +    val scoreAndLabelsWithWeights =
    +      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType),
    +        if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)))
    --- End diff --

    added a check for numeric type and a cast to Double, similar to labelCol
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r123924688

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
    @@ -36,12 +36,18 @@ import org.apache.spark.sql.types.DoubleType
     @Since("1.2.0")
     @Experimental
     class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override val uid: String)
    -  extends Evaluator with HasRawPredictionCol with HasLabelCol with DefaultParamsWritable {
    +  extends Evaluator with HasRawPredictionCol with HasLabelCol
    +    with HasWeightCol with DefaultParamsWritable {

       @Since("1.2.0")
       def this() = this(Identifiable.randomUID("binEval"))

       /**
    +   * Default number of bins to use for binary classification evaluation.
    +   */
    +  val defaultNumberOfBins = 1000
    --- End diff --

    It seemed like a good default value: for graphing the ROC curve, 1000 bins
    is not too large for most plots, but not so small that the curve would look
    jagged. The user can always override the default. However, sorting over
    every distinct label/score value is usually a bad idea: the dataset will
    probably be very large, the operation will be slow, and the extra
    resolution makes no visible difference when plotting. The default should
    therefore steer users toward down-sampling into a bounded number of bins.
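The down-sampling rationale above can be illustrated locally. The grouping scheme below is a sketch of the idea, not the exact MLlib binning code: consecutive (score, weighted-count) pairs, already sorted by descending score, are merged so that at most `numBins` points remain for plotting.

```scala
object BinDownsample {
  // Illustrative down-sampling (assumption: a simplified scheme, not the
  // actual MLlib implementation). Input pairs are (score, weightedCount),
  // sorted by descending score; output has at most numBins pairs.
  def downsample(sorted: Seq[(Double, Double)], numBins: Int): Seq[(Double, Double)] =
    if (numBins == 0 || sorted.size <= numBins) sorted
    else {
      // Merge every `grouping` consecutive scores into one bin.
      val grouping = math.ceil(sorted.size.toDouble / numBins).toInt
      sorted.grouped(grouping).map { bin =>
        // Keep the bin's first (highest) score and sum its weighted counts.
        (bin.head._1, bin.map(_._2).sum)
      }.toSeq
    }
}
```

Note that `numBins == 0` means "no down-sampling", matching the `numBins >= 0` contract in `BinaryClassificationMetrics`.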
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r116364047

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
    @@ -77,12 +87,16 @@ class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override va
         SchemaUtils.checkNumericType(schema, $(labelCol))

         // TODO: When dataset metadata has been implemented, check rawPredictionCol vector length = 2.
    -    val scoreAndLabels =
    -      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
    -        case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1), label)
    -        case Row(rawPrediction: Double, label: Double) => (rawPrediction, label)
    +    val scoreAndLabelsWithWeights =
    +      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType),
    +        if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)))
    --- End diff --

    Check that weightCol is double?
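The `lit(1.0)` fallback discussed here, where rows get weight 1.0 when no weight column is set, can be sketched without Spark. `withWeights` below is a hypothetical helper, written only to show that the unweighted and weighted code paths agree when every weight defaults to 1.0:

```scala
object WeightFallback {
  // Sketch of the lit(1.0) fallback (assumption: names and signature are
  // illustrative; the real code builds a Spark Column, not a Seq).
  // When no weights are supplied, every row gets weight 1.0, so the
  // weighted metrics reduce to the unweighted ones.
  def withWeights(scoreAndLabels: Seq[(Double, Double)],
                  weights: Option[Seq[Double]]): Seq[(Double, (Double, Double))] = {
    val w = weights.getOrElse(Seq.fill(scoreAndLabels.size)(1.0))
    scoreAndLabels.zip(w).map { case ((score, label), weight) => (score, (label, weight)) }
  }
}
```

The reviewer's point still applies on the Spark side: a user-supplied weight column should be checked as numeric and cast to `DoubleType`, the same way `labelCol` is.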
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r116364179

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
    @@ -22,22 +22,22 @@ package org.apache.spark.mllib.evaluation.binary
      */
     private[evaluation] trait BinaryConfusionMatrix {
       /** number of true positives */
    -  def numTruePositives: Long
    +  def numTruePositives: Double
    --- End diff --

    I feel it may be better to create new attributes like `weightedTruePositives`
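To see why these counts become `Double` once sample weights exist, a small Spark-free sketch. The `weighted*` names follow the reviewer's suggestion; the class itself is hypothetical and not the actual MLlib trait:

```scala
// Sketch of a weighted confusion matrix (illustrative, not MLlib's trait).
// With fractional sample weights, a "count" is a sum of weights (Double),
// not an integer tally (Long), which motivates the type change above.
case class WeightedConfusionMatrix(rows: Seq[(Double, Double, Double)]) {
  // Each row is (label, prediction, weight); labels/predictions are 0.0 or 1.0.
  def weightedTruePositives: Double  = rows.collect { case (1.0, 1.0, w) => w }.sum
  def weightedFalsePositives: Double = rows.collect { case (0.0, 1.0, w) => w }.sum
  def weightedFalseNegatives: Double = rows.collect { case (1.0, 0.0, w) => w }.sum
  def weightedTrueNegatives: Double  = rows.collect { case (0.0, 0.0, w) => w }.sum
}
```

With all weights equal to 1.0 the weighted sums coincide with the old unweighted counts, which is what keeps the change backward compatible in behavior.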
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r116364061

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
    @@ -36,12 +36,18 @@ import org.apache.spark.sql.types.DoubleType
     @Since("1.2.0")
     @Experimental
     class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override val uid: String)
    -  extends Evaluator with HasRawPredictionCol with HasLabelCol with DefaultParamsWritable {
    +  extends Evaluator with HasRawPredictionCol with HasLabelCol
    +    with HasWeightCol with DefaultParamsWritable {

       @Since("1.2.0")
       def this() = this(Identifiable.randomUID("binEval"))

       /**
    +   * Default number of bins to use for binary classification evaluation.
    +   */
    +  val defaultNumberOfBins = 1000
    --- End diff --

    Why 1000?
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r116364140

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
    @@ -41,13 +41,27 @@ import org.apache.spark.sql.DataFrame
      *    partition boundaries.
      */
     @Since("1.0.0")
    -class BinaryClassificationMetrics @Since("1.3.0") (
    -    @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
    -    @Since("1.3.0") val numBins: Int) extends Logging {
    +class BinaryClassificationMetrics @Since("2.2.0") (
    +    val numBins: Int,
    +    @Since("2.2.0") val scoreAndLabelsWithWeights: RDD[(Double, (Double, Double))])
    +  extends Logging {

       require(numBins >= 0, "numBins must be nonnegative")

       /**
    +   * Retrieves the score and labels (for binary compatibility).
    +   * @return The score and labels.
    +   */
    +  @Since("1.0.0")
    --- End diff --

    Inconsistent annotation: this was 1.3.0.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17084#discussion_r116364224

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
    @@ -146,11 +160,13 @@ class BinaryClassificationMetrics @Since("1.3.0") (
       private lazy val (
         cumulativeCounts: RDD[(Double, BinaryLabelCounter)],
         confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
    -    // Create a bin for each distinct score value, count positives and negatives within each bin,
    -    // and then sort by score values in descending order.
    -    val counts = scoreAndLabels.combineByKey(
    -      createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
    -      mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
    +    // Create a bin for each distinct score value, count weighted positives and
    +    // negatives within each bin, and then sort by score values in descending order.
    +    val counts = scoreAndLabelsWithWeights.combineByKey(
    +      createCombiner = (labelAndWeight: (Double, Double)) =>
    +        new BinaryLabelCounter(0L, 0L) += (labelAndWeight._1, labelAndWeight._2),
    --- End diff --

    `new BinaryLabelCounter(0.0, 0.0)`? It is defined to take double parameters below.
GitHub user imatiach-msft opened a pull request:

    https://github.com/apache/spark/pull/17084

    [SPARK-18693][ML][MLLIB] ML Evaluators should use weight column - added weight column for binary classification evaluator

    ## What changes were proposed in this pull request?

    The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and
    MulticlassClassificationEvaluator and the corresponding metrics classes
    BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics
    should use sample weight data. I've closed the PR
    https://github.com/apache/spark/pull/16557 as recommended, in favor of
    creating three pull requests, one for each of the evaluators
    (binary/regression/multiclass), to make them easier to review and update.

    ## How was this patch tested?

    I added tests to the metrics and evaluator classes.

    You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/imatiach-msft/spark ilmat/binary-evalute

    Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/spark/pull/17084.patch

    To close this pull request, make a commit to your master/trunk branch
    with (at least) the following in the commit message:

        This closes #17084