Hi, I am trying to run the ML BinaryClassificationEvaluator to compare the rating column with the predicted values and get the area under the ROC curve.
My dataframe has two columns: rating as an int (I have binarized it) and prediction as a float. When I pass it to the ML evaluator method I get the error shown below. Can someone help me get this sorted out? I appreciate all the help.

Stackoverflow post: http://stackoverflow.com/questions/40408898/converting-the-float-column-in-spark-dataframe-to-vectorudt

I was trying to use the pyspark.ml.evaluation binary classification metric like below:

    evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
    print evaluator.evaluate(predictions)

My predictions data frame looks like this:

    predictions.select('rating','prediction')
    predictions.show()

    +------+------------+
    |rating|  prediction|
    +------+------------+
    |     1|  0.14829934|
    |     1|-0.017862909|
    |     1|   0.4951505|
    |     1|0.0074382657|
    |     1|-0.002562912|
    |     1|   0.0208337|
    |     1| 0.049362548|
    |     1|  0.09693333|
    |     1|  0.17998546|
    |     1| 0.019649783|
    |     1| 0.031353004|
    |     1|  0.03657037|
    |     1|  0.23280995|
    |     1| 0.033190556|
    |     1|  0.35569906|
    |     1| 0.030974165|
    |     1|   0.1422375|
    |     1|  0.19786166|
    |     1|  0.07740938|
    |     1|  0.33970386|
    +------+------------+
    only showing top 20 rows

The datatype of each column is as follows:

    predictions.printSchema()

    root
     |-- rating: integer (nullable = true)
     |-- prediction: float (nullable = true)

Now I get an error with the above ML code saying the prediction column is FloatType where a VectorUDT was expected:
    /Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
        811         answer = self.gateway_client.send_command(command)
        812         return_value = get_return_value(
    --> 813             answer, self.gateway_client, self.target_id, self.name)
        814
        815         for temp_arg in temp_args:

    /Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
         51                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
         52             if s.startswith('java.lang.IllegalArgumentException: '):
    ---> 53                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
         54             raise
         55         return deco

    IllegalArgumentException: u'requirement failed: Column prediction must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually FloatType.'

So I thought of converting the prediction column from float to VectorUDT by applying a schema to the dataframe:

    from pyspark.sql.types import IntegerType, StructType, StructField

    schema = StructType([
        StructField("rating", IntegerType, True),
        StructField("prediction", VectorUDT(), True)
    ])

    predictions_dtype = sqlContext.createDataFrame(prediction, schema)

But now I get this error:

    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-30-8fce6c4bbeb4> in <module>()
          4
          5 schema = StructType([
    ----> 6     StructField("rating", IntegerType, True),
          7     StructField("prediction", VectorUDT(), True)
          8 ])

    /Users/i854319/spark/python/pyspark/sql/types.pyc in __init__(self, name, dataType, nullable, metadata)
        401         False
        402         """
    --> 403         assert isinstance(dataType, DataType), "dataType should be DataType"
        404         if not isinstance(name, str):
        405             name = name.encode('utf-8')

    AssertionError: dataType should be DataType

It takes so much time to run an ML algorithm with the Spark libraries, and there are so many weird errors. I even tried MLlib with RDD data; that gives "ValueError: Null pointer exception".