Typically you pass the result of a model transform to the evaluator. So:

    val model = estimator.fit(data)
    val auc = evaluator.evaluate(model.transform(testData))
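A slightly fuller sketch of that flow, for reference. The classifier choice (`LogisticRegression`) and the column names `label`/`features` are illustrative assumptions, not from the thread; any binary classifier that emits `rawPrediction` works the same way:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Assumes `data` and `testData` are DataFrames with "label" and "features"
// columns (e.g. produced by a VectorAssembler).
val estimator = new LogisticRegression()
val model = estimator.fit(data)

// evaluate() reads the "rawPrediction" and "label" columns by default;
// both names are configurable via setRawPredictionCol / setLabelCol.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
val auc = evaluator.evaluate(model.transform(testData))
```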
Check the Scala API docs for the details:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

On Mon, 14 Nov 2016 at 20:02 Bhaarat Sharma <bhaara...@gmail.com> wrote:

Can you please suggest how I can use BinaryClassificationEvaluator? I tried:

    scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    scala> val evaluator = new BinaryClassificationEvaluator()
    evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_0d57372b7579

Try 1:

    scala> evaluator.evaluate(testScoreAndLabel.rdd)
    <console>:105: error: type mismatch;
     found   : org.apache.spark.rdd.RDD[(Double, Double)]
     required: org.apache.spark.sql.Dataset[_]
           evaluator.evaluate(testScoreAndLabel.rdd)

Try 2:

    scala> evaluator.evaluate(testScoreAndLabel)
    java.lang.IllegalArgumentException: Field "rawPrediction" does not exist.
      at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)

Try 3:

    scala> evaluator.evaluate(testScoreAndLabel.select("Label","ModelProbability"))
    org.apache.spark.sql.AnalysisException: cannot resolve '`Label`' given input columns: [_1, _2];
      at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

On Mon, Nov 14, 2016 at 1:44 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just use the spark.ml evaluators, which work with DataFrames. Try BinaryClassificationEvaluator.

On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma <bhaara...@gmail.com> wrote:

I am getting scala.MatchError in the code below. I'm not able to see why this would be happening.
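For what it's worth, Try 2 and Try 3 both fail because the evaluator looks up its input columns by name ("rawPrediction" and "label" by default), and the mapped Dataset only has `_1`/`_2`. One way around it, sketched here with the column names taken from the thread, is to point the evaluator at the original DataFrame's columns instead of mapping at all:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Point the evaluator at the columns this DataFrame actually has.
// The raw-prediction column may be a length-2 vector (e.g. a probability
// vector) or a plain double score.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("Label")
  .setRawPredictionCol("ModelProbability")

val auc = evaluator.evaluate(testResults)
```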
I am using Spark 2.0.1.

    scala> testResults.columns
    res538: Array[String] = Array(TopicVector, subject_id, hadm_id, isElective, isNewborn, isUrgent, isEmergency, isMale, isFemale, oasis_score, sapsii_score, sofa_score, age, hosp_death, test, ModelFeatures, Label, rawPrediction, ModelProbability, ModelPrediction)

    scala> testResults.select("Label","ModelProbability").take(1)
    res542: Array[org.apache.spark.sql.Row] = Array([0.0,[0.737304818744076,0.262695181255924]])

    scala> val testScoreAndLabel = testResults.
         |   select("Label","ModelProbability").
         |   map { case Row(l: Double, p: Vector) => (p(1), l) }
    testScoreAndLabel: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: double]

    scala> testScoreAndLabel
    res539: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: double]

    scala> testScoreAndLabel.columns
    res540: Array[String] = Array(_1, _2)

    scala> val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel.rdd)
    testMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@36e780d1

The code below gives the error:

    val auROC = testMetrics.areaUnderROC() // this line gives the error

    Caused by: scala.MatchError: [0.0,[0.7316583497453766,0.2683416502546234]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
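A likely cause, though the thread doesn't show the imports so this is an assumption: if `Vector` in that `map` resolved to `org.apache.spark.mllib.linalg.Vector`, the pattern can never match the `org.apache.spark.ml.linalg.Vector` values that Spark 2.x DataFrames carry, and the partial match throws `scala.MatchError` when the job runs. The mechanism, stripped of Spark (the `MlVec`/`MllibVec` types are hypothetical stand-ins):

```scala
// A minimal, Spark-free illustration of the failure mode: `case ...` compiles
// as a partial pattern match, and a value whose runtime type does not fit the
// pattern throws scala.MatchError at execution time.
sealed trait Vec
case class MlVec(values: Array[Double]) extends Vec     // stands in for ml.linalg.Vector
case class MllibVec(values: Array[Double]) extends Vec  // stands in for mllib.linalg.Vector

def secondProb(v: Vec): Double = v match {
  case MlVec(vs) => vs(1)  // only the "ml" flavour is handled
}

// secondProb(MlVec(Array(0.7, 0.3))) returns 0.3;
// secondProb(MllibVec(Array(0.7, 0.3))) throws scala.MatchError.
```

If this is indeed the cause, importing `org.apache.spark.ml.linalg.Vector` (not the `mllib` one) before the `map` should let the pattern match succeed.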