Dong Wang created SPARK-29812:
---------------------------------

             Summary: Missing persist on predictionAndLabels in MulticlassClassificationEvaluator
                 Key: SPARK-29812
                 URL: https://issues.apache.org/jira/browse/SPARK-29812
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.4.3
            Reporter: Dong Wang

The RDD predictionAndLabels in ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be persisted. When MulticlassMetrics uses predictionAndLabels to initialize its fields, at least five actions are executed on predictionAndLabels. Because the RDD is not persisted, each of those actions recomputes it from the source dataset.

{code:scala}
override def evaluate(dataset: Dataset[_]): Double = {
  val schema = dataset.schema
  SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
  SchemaUtils.checkNumericType(schema, $(labelCol))

  // Needs to be persisted
  val predictionAndLabels =
    dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
      case Row(prediction: Double, label: Double) => (prediction, label)
    }
  // The initialization uses predictionAndLabels multiple times, in different actions.
  val metrics = new MulticlassMetrics(predictionAndLabels)
  val metric = $(metricName) match {
    case "f1" => metrics.weightedFMeasure
    case "weightedPrecision" => metrics.weightedPrecision
    case "weightedRecall" => metrics.weightedRecall
    case "accuracy" => metrics.accuracy
  }
  metric
}
{code}

This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
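
One possible shape for the fix is sketched below. It is not the patch that was applied to Spark. The object name PersistedEvaluation, its evaluate signature, and the choice of StorageLevel.MEMORY_AND_DISK are assumptions made for illustration; the prediction column is assumed to already be of DoubleType, as the schema check in the original method requires. Since the metrics may be computed lazily, the RDD is unpersisted only after the requested metric has been read.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark: computes one metric while keeping
// predictionAndLabels cached for the duration of the computation.
object PersistedEvaluation {
  def evaluate(
      dataset: Dataset[_],
      predictionCol: String,
      labelCol: String,
      metricName: String): Double = {
    // Assumes predictionCol is already DoubleType; the label is cast as in the original code.
    val predictionAndLabels = dataset
      .select(col(predictionCol), col(labelCol).cast(DoubleType))
      .rdd
      .map { case Row(prediction: Double, label: Double) => (prediction, label) }

    // Assumption: MEMORY_AND_DISK is an acceptable storage level for this data.
    predictionAndLabels.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      val metrics = new MulticlassMetrics(predictionAndLabels)
      metricName match {
        case "f1" => metrics.weightedFMeasure
        case "weightedPrecision" => metrics.weightedPrecision
        case "weightedRecall" => metrics.weightedRecall
        case "accuracy" => metrics.accuracy
        case other => throw new IllegalArgumentException(s"Unsupported metric: $other")
      }
    } finally {
      // Release the cached blocks once the metric has been computed.
      predictionAndLabels.unpersist()
    }
  }
}
```

The try/finally block keeps the cache from leaking if metric computation throws. Whether persisting pays off depends on how expensive the upstream dataset is to recompute relative to the cost of caching it.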