Dong Wang created SPARK-29812:
---------------------------------

             Summary: Missing persist on predictionAndLabels in MulticlassClassificationEvaluator
                 Key: SPARK-29812
                 URL: https://issues.apache.org/jira/browse/SPARK-29812
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.4.3
            Reporter: Dong Wang


The RDD predictionAndLabels in 
ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be 
persisted. When MulticlassMetrics uses predictionAndLabels to initialize its 
fields, at least five actions are executed on predictionAndLabels. Without 
persistence, each of those actions recomputes the RDD from the dataset.
{code:scala}
  override def evaluate(dataset: Dataset[_]): Double = {
    val schema = dataset.schema
    SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
    SchemaUtils.checkNumericType(schema, $(labelCol))
    // Needs to be persisted
    val predictionAndLabels =
      dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
        case Row(prediction: Double, label: Double) => (prediction, label)
      }
    // The initialization uses predictionAndLabels multiple times, in different actions.
    val metrics = new MulticlassMetrics(predictionAndLabels)
    val metric = $(metricName) match {
      case "f1" => metrics.weightedFMeasure
      case "weightedPrecision" => metrics.weightedPrecision
      case "weightedRecall" => metrics.weightedRecall
      case "accuracy" => metrics.accuracy
    }
    metric
  }
{code}
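
One possible shape of the fix, as a minimal sketch rather than an actual patch: persist the RDD before MulticlassMetrics touches it, and unpersist it once the metric has been read. The object PersistBeforeMetrics and the helper weightedF1 are hypothetical names used only for illustration, and MEMORY_AND_DISK is an assumed storage level. The guard on getStorageLevel is there because persist() throws if the RDD already has a different storage level assigned.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistBeforeMetrics {
  // Hypothetical helper (not part of Spark): keeps the RDD cached only
  // while the metric is being computed.
  def weightedF1(predictionAndLabels: RDD[(Double, Double)]): Double = {
    // Only manage persistence if the caller has not already cached the RDD.
    val handlePersistence = predictionAndLabels.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) {
      // Assumed storage level; MEMORY_ONLY would also avoid recomputation.
      predictionAndLabels.persist(StorageLevel.MEMORY_AND_DISK)
    }
    try {
      // Reading the metric here triggers the actions on the cached RDD,
      // so it must happen before unpersist() is called.
      new MulticlassMetrics(predictionAndLabels).weightedFMeasure
    } finally {
      if (handlePersistence) predictionAndLabels.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    try {
      // Toy (prediction, label) pairs, made up for this example.
      val data = spark.sparkContext.parallelize(
        Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)))
      println(weightedF1(data))
    } finally {
      spark.stop()
    }
  }
}
```

The metric is read inside the try block because the fields of MulticlassMetrics are evaluated lazily. Calling unpersist() before the value is read would bring back the recomputation this change is meant to avoid.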

This issue was reported by our tool CacheCheck, which dynamically detects 
misuses of the persist()/unpersist() API.


