
I am trying to use the ML BinaryClassificationEvaluator to compare the
rating with the predicted values and get the area under the ROC curve.

My dataframe has two columns: rating, an int that I have binarized, and
prediction, which is a float.

When I pass it to the ML evaluator method I get an error as shown below:

Can someone help me get this sorted out? I appreciate all the help.

Stackoverflow post:


I was trying to use the BinaryClassificationEvaluator from
pyspark.ml.evaluation as below:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
print evaluator.evaluate(predictions)

My predictions DataFrame looks like this:

+------+------------+
|rating|  prediction|
+------+------------+
|     1|  0.14829934|
|     1|-0.017862909|
|     1|   0.4951505|
|     1|0.0074382657|
|     1|-0.002562912|
|     1|   0.0208337|
|     1| 0.049362548|
|     1|  0.09693333|
|     1|  0.17998546|
|     1| 0.019649783|
|     1| 0.031353004|
|     1|  0.03657037|
|     1|  0.23280995|
|     1| 0.033190556|
|     1|  0.35569906|
|     1| 0.030974165|
|     1|   0.1422375|
|     1|  0.19786166|
|     1|  0.07740938|
|     1|  0.33970386|
+------+------------+
only showing top 20 rows

The datatype of each column is as follows:

root
 |-- rating: integer (nullable = true)
 |-- prediction: float (nullable = true)
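For reference, the area under the ROC curve itself only needs one score and one 0/1 label per row, which is what a (rating, prediction) pair provides. The sketch below is plain Python, not Spark's implementation: it uses the textbook rank-sum (Mann-Whitney) identity, and the function name `area_under_roc` and the sample values are made up for illustration. The 20 rows shown above all have rating 1, and the metric is undefined unless both classes are present.

```python
def area_under_roc(labels, scores):
    """AUC via the rank-sum identity, using average ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of rows that share the same score.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Rows i..j-1 hold 1-based ranks i+1..j; give each the average rank.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Made-up ratings and predictions, not taken from the DataFrame above.
auc = area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value near 1.0 means positive ratings tend to receive higher predictions than negative ones, and 0.5 corresponds to random ordering.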

Now the BinaryClassificationEvaluator call fails with an error saying the
prediction column is a Float where a VectorUDT was expected:

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)

    815         for temp_arg in temp_args:

/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     51                 raise AnalysisException(s.split(': ', 1)[1],
     52             if s.startswith('java.lang.IllegalArgumentException: '):
---> 53                 raise IllegalArgumentException(s.split(': ', 1)[1],
     54             raise
     55     return deco

IllegalArgumentException: u'requirement failed: Column prediction must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually
So I thought of converting the predictions column from float to VectorUDT
as below:

*Applying the schema to the dataframe to convert the float column type to

from pyspark.sql.types import IntegerType, StructType, StructField

schema = StructType([
    StructField("rating", IntegerType, True),
    StructField("prediction", VectorUDT(), True)
])



But now I get this error:


AssertionError                            Traceback (most recent call last)
<ipython-input-30-8fce6c4bbeb4> in <module>()

      5 schema = StructType([
----> 6     StructField("rating", IntegerType, True),
      7     StructField("prediction", VectorUDT(), True)
      8 ])

/Users/i854319/spark/python/pyspark/sql/types.pyc in __init__(self, name, dataType, nullable, metadata)
    401         False
    402         """
--> 403         assert isinstance(dataType, DataType), "dataType should be
    404         if not isinstance(name, str):
    405             name = name.encode('utf-8')

AssertionError: dataType should be DataType
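The assertion in this traceback is an isinstance check, so it fails when the class object IntegerType is passed rather than an instance of it. Below is a minimal sketch of that behaviour. The classes are plain-Python stand-ins that only mimic the names, not the real pyspark.sql.types classes, and `check_data_type` is a hypothetical helper with the same shape as the check in the traceback.

```python
class DataType(object):
    """Stand-in for a Spark SQL type base class (hypothetical, not pyspark)."""


class IntegerType(DataType):
    """Stand-in for a concrete type (hypothetical, not pyspark)."""


def check_data_type(data_type):
    # Same shape of check as the assertion shown in the traceback.
    if not isinstance(data_type, DataType):
        raise AssertionError("dataType should be DataType")
    return data_type


# Passing the class itself fails: a class object is not an instance of DataType.
try:
    check_data_type(IntegerType)
    class_accepted = True
except AssertionError:
    class_accepted = False

# Passing an instance (note the parentheses) satisfies the check.
instance_accepted = isinstance(check_data_type(IntegerType()), DataType)
```

By the same reasoning, writing IntegerType() with parentheses in the StructField should get past this particular assertion. Whether a float column can then be read under a VectorUDT schema is a separate question.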

It takes a lot of time to get an ML algorithm running with the Spark
libraries because of so many odd errors. I also tried MLlib with RDD data,
but that gives a ValueError: Null pointer exception.


