[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-06-25 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r123925697
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
@@ -22,22 +22,22 @@ package org.apache.spark.mllib.evaluation.binary
  */
 private[evaluation] trait BinaryConfusionMatrix {
   /** number of true positives */
-  def numTruePositives: Long
+  def numTruePositives: Double
--- End diff --

good idea, updated the names of the variables
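For context, the change discussed here turns the long-valued confusion counts into weighted (double-valued) totals. A minimal plain-Python sketch of the idea, kept deliberately simple; the names `weighted_confusion` and the dictionary keys are illustrative, not Spark's actual API:

```python
def weighted_confusion(triples, threshold=0.5):
    """Accumulate weighted confusion counts from (score, label, weight) triples."""
    tp = fp = tn = fn = 0.0
    for score, label, weight in triples:
        predicted_positive = score >= threshold
        if predicted_positive and label == 1.0:
            tp += weight  # weighted true positive
        elif predicted_positive:
            fp += weight  # weighted false positive
        elif label == 1.0:
            fn += weight  # weighted false negative
        else:
            tn += weight  # weighted true negative
    return {"weightedTruePositives": tp, "weightedFalsePositives": fp,
            "weightedTrueNegatives": tn, "weightedFalseNegatives": fn}
```

With unit weights this reduces to the ordinary integer counts, which is why renaming the members rather than silently changing `Long` to `Double` makes the semantics explicit.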


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-06-25 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r123925523
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -146,11 +160,13 @@ class BinaryClassificationMetrics @Since("1.3.0") (
   private lazy val (
 cumulativeCounts: RDD[(Double, BinaryLabelCounter)],
 confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
-// Create a bin for each distinct score value, count positives and negatives within each bin,
-// and then sort by score values in descending order.
-val counts = scoreAndLabels.combineByKey(
-  createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
-  mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
+// Create a bin for each distinct score value, count weighted positives and
+// negatives within each bin, and then sort by score values in descending order.
+val counts = scoreAndLabelsWithWeights.combineByKey(
+  createCombiner = (labelAndWeight: (Double, Double)) =>
+    new BinaryLabelCounter(0L, 0L) += (labelAndWeight._1, labelAndWeight._2),
--- End diff --

updated, thanks!
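The `combineByKey` in the diff above groups `(label, weight)` pairs by distinct score. A hedged plain-Python analogue of that accumulation, where a dict stands in for the keyed RDD and a `[pos, neg]` pair approximates `BinaryLabelCounter`:

```python
def bin_by_score(score_label_weight):
    """Group weighted labels by distinct score, analogous to combineByKey
    over (score, (label, weight)) pairs."""
    bins = {}  # score -> [weighted_positives, weighted_negatives]
    for score, (label, weight) in score_label_weight:
        counter = bins.setdefault(score, [0.0, 0.0])
        if label > 0.5:
            counter[0] += weight
        else:
            counter[1] += weight
    # Sort by score in descending order, as the Spark code does before
    # computing cumulative counts.
    return sorted(bins.items(), key=lambda kv: -kv[0])
```

Cumulative sums over this descending list then yield one confusion matrix per threshold, which is how the ROC/PR curves are derived downstream.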





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-06-25 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r123925437
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -41,13 +41,27 @@ import org.apache.spark.sql.DataFrame
  *partition boundaries.
  */
 @Since("1.0.0")
-class BinaryClassificationMetrics @Since("1.3.0") (
-@Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+class BinaryClassificationMetrics @Since("2.2.0") (
+val numBins: Int,
+@Since("2.2.0") val scoreAndLabelsWithWeights: RDD[(Double, (Double, Double))])
+  extends Logging {
 
   require(numBins >= 0, "numBins must be nonnegative")
 
   /**
+   * Retrieves the score and labels (for binary compatibility).
+   * @return The score and labels.
+   */
+  @Since("1.0.0")
--- End diff --

good catch, updated the version for both:
1.) def scoreAndLabels: RDD[(Double, Double)] =
2.) def this(@Since("1.3.0") scoreAndLabels: RDD[(Double, Double)], @Since("1.3.0") numBins: Int) = ...
to 1.3.0





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-06-25 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r123925177
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -77,12 +87,16 @@ class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override va
 SchemaUtils.checkNumericType(schema, $(labelCol))
 
 // TODO: When dataset metadata has been implemented, check 
rawPredictionCol vector length = 2.
-val scoreAndLabels =
-  dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
-case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1), label)
-case Row(rawPrediction: Double, label: Double) => (rawPrediction, label)
+val scoreAndLabelsWithWeights =
+  dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType),
+if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)))
--- End diff --

added check for numeric type and did cast to Double, similar to labelCol
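The diff defaults the weight to `lit(1.0)` when no weight column is set, and the follow-up adds a numeric-type check plus a cast to double. A plain-Python sketch of that selection logic, with row dicts standing in for DataFrame rows; the column names and helper name are illustrative only:

```python
def extract_score_label_weight(rows, weight_col=None):
    """Return (score, (label, weight)) tuples, defaulting weight to 1.0
    when no weight column is supplied."""
    out = []
    for row in rows:
        raw = row["rawPrediction"]
        # A vector-valued rawPrediction contributes its positive-class score;
        # a scalar is used directly.
        score = raw[1] if isinstance(raw, (list, tuple)) else float(raw)
        # Cast to double, mirroring the cast applied to labelCol/weightCol.
        weight = float(row[weight_col]) if weight_col else 1.0
        out.append((score, (float(row["label"]), weight)))
    return out
```

Defaulting to a constant weight of 1.0 keeps the unweighted path exactly equivalent to the original behavior.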





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-06-25 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r123924688
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -36,12 +36,18 @@ import org.apache.spark.sql.types.DoubleType
 @Since("1.2.0")
 @Experimental
 class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override val uid: String)
-  extends Evaluator with HasRawPredictionCol with HasLabelCol with DefaultParamsWritable {
+  extends Evaluator with HasRawPredictionCol with HasLabelCol
+with HasWeightCol with DefaultParamsWritable {
 
   @Since("1.2.0")
   def this() = this(Identifiable.randomUID("binEval"))
 
   /**
+   * Default number of bins to use for binary classification evaluation.
+   */
+  val defaultNumberOfBins = 1000
--- End diff --

It seemed like a good default value to use: for graphing an ROC curve, 1000 bins is not too large for most plots, but also not so small that the curve looks jagged. The user can always specify a value to override the default. However, it is usually not a good idea to sort over the entire set of label/score values: the dataset will probably be very large, the operation will be very slow, and when visualizing the data there won't be any visible difference, so by default we should encourage the user to down-sample into a bounded number of bins.
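The down-sampling described above can be sketched as merging the descending list of per-score bins into roughly `num_bins` chunks. This is a simplified take on the grouping (Spark's actual implementation partitions slightly differently and produces approximately that many bins), with illustrative names:

```python
def downsample(sorted_counts, num_bins):
    """Merge a descending list of (score, pos, neg) bins into ~num_bins chunks."""
    if num_bins <= 0 or len(sorted_counts) <= num_bins:
        return sorted_counts  # nothing to merge
    grouping = len(sorted_counts) // num_bins
    merged = []
    for i in range(0, len(sorted_counts), grouping):
        chunk = sorted_counts[i:i + grouping]
        score = chunk[0][0]  # keep the first (largest) score of the chunk
        merged.append((score,
                       sum(c[1] for c in chunk),   # summed positives
                       sum(c[2] for c in chunk)))  # summed negatives
    return merged
```

With millions of distinct scores, plotting all of them adds no visible detail, while merging to ~1000 points bounds both the sort cost and the driver-side data.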





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r116364047
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -77,12 +87,16 @@ class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override va
 SchemaUtils.checkNumericType(schema, $(labelCol))
 
 // TODO: When dataset metadata has been implemented, check 
rawPredictionCol vector length = 2.
-val scoreAndLabels =
-  dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
-case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1), label)
-case Row(rawPrediction: Double, label: Double) => (rawPrediction, label)
+val scoreAndLabelsWithWeights =
+  dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType),
+if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)))
--- End diff --

Check weightCol is double?





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r116364179
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
@@ -22,22 +22,22 @@ package org.apache.spark.mllib.evaluation.binary
  */
 private[evaluation] trait BinaryConfusionMatrix {
   /** number of true positives */
-  def numTruePositives: Long
+  def numTruePositives: Double
--- End diff --

I feel it may be better to create new attributes like `weightedTruePositives`





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r116364061
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -36,12 +36,18 @@ import org.apache.spark.sql.types.DoubleType
 @Since("1.2.0")
 @Experimental
 class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override val uid: String)
-  extends Evaluator with HasRawPredictionCol with HasLabelCol with DefaultParamsWritable {
+  extends Evaluator with HasRawPredictionCol with HasLabelCol
+with HasWeightCol with DefaultParamsWritable {
 
   @Since("1.2.0")
   def this() = this(Identifiable.randomUID("binEval"))
 
   /**
+   * Default number of bins to use for binary classification evaluation.
+   */
+  val defaultNumberOfBins = 1000
--- End diff --

Why 1000?





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r116364140
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -41,13 +41,27 @@ import org.apache.spark.sql.DataFrame
  *partition boundaries.
  */
 @Since("1.0.0")
-class BinaryClassificationMetrics @Since("1.3.0") (
-@Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+class BinaryClassificationMetrics @Since("2.2.0") (
+val numBins: Int,
+@Since("2.2.0") val scoreAndLabelsWithWeights: RDD[(Double, (Double, Double))])
+  extends Logging {
 
   require(numBins >= 0, "numBins must be nonnegative")
 
   /**
+   * Retrieves the score and labels (for binary compatibility).
+   * @return The score and labels.
+   */
+  @Since("1.0.0")
--- End diff --

Inconsistent annotation: this was 1.3.0 before.





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-05-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17084#discussion_r116364224
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -146,11 +160,13 @@ class BinaryClassificationMetrics @Since("1.3.0") (
   private lazy val (
 cumulativeCounts: RDD[(Double, BinaryLabelCounter)],
 confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
-// Create a bin for each distinct score value, count positives and negatives within each bin,
-// and then sort by score values in descending order.
-val counts = scoreAndLabels.combineByKey(
-  createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
-  mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
+// Create a bin for each distinct score value, count weighted positives and
+// negatives within each bin, and then sort by score values in descending order.
+val counts = scoreAndLabelsWithWeights.combineByKey(
+  createCombiner = (labelAndWeight: (Double, Double)) =>
+    new BinaryLabelCounter(0L, 0L) += (labelAndWeight._1, labelAndWeight._2),
--- End diff --

`new BinaryLabelCounter(0.0, 0.0)`? Defined to take double parameters below.





[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...

2017-02-27 Thread imatiach-msft
GitHub user imatiach-msft opened a pull request:

https://github.com/apache/spark/pull/17084

[SPARK-18693][ML][MLLIB] ML Evaluators should use weight column - added weight column for binary classification evaluator

## What changes were proposed in this pull request?

The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics, and MulticlassMetrics should use sample weight data.

I've closed the PR https://github.com/apache/spark/pull/16557 as recommended, in favor of creating three pull requests (one for each of the binary/regression/multiclass evaluators) to make them easier to review and update.

## How was this patch tested?
I added tests to the metrics and evaluators classes.
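One common sanity check for weighted metrics (not necessarily the exact tests added in this PR) is that an integer weight w on a row should give the same result as repeating that row w times with unit weight. A minimal sketch of that invariant on weighted label counts:

```python
def weighted_positive_count(pairs):
    """Sum the weights of positively labeled (label, weight) examples."""
    return sum(w for label, w in pairs if label == 1.0)
```

Any aggregate metric built from such weighted counts (precision, recall, AUC) inherits this duplication equivalence, which makes it a cheap end-to-end test.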


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/imatiach-msft/spark ilmat/binary-evalute

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17084.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17084





