Fabian Boehnlein created SPARK-6573:
---------------------------------------

             Summary: expect pandas null values as numpy.nan (not only as None)
                 Key: SPARK-6573
                 URL: https://issues.apache.org/jira/browse/SPARK-6573
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Fabian Boehnlein


In pandas it is common to use numpy.nan as the null value for missing data.

http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
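
For illustration, a minimal sketch with a made-up single-column pandas DataFrame, showing how np.nan ends up in an object (string) column:
{code}
import numpy as np
import pandas as pd

# Minimal sketch: pandas uses np.nan for missing data,
# even in an object (string) column.
df_ = pd.DataFrame({'name': ['alice', np.nan]})
print(df_.isnull())            # the missing cell is reported as null
print(type(df_['name'][1]))    # float -- np.nan is a float, not None
{code}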

createDataFrame, however, only works with None as the null value, parsing it as None in the underlying RDD.

I suggest adding support for np.nan values in pandas DataFrames.
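
Until then, a possible workaround (only a sketch, reusing the df_, schema and sqlCtx names that appear in the traceback below) is to replace np.nan with None before calling createDataFrame:
{code}
import pandas as pd

# Sketch of a workaround: replace NaN cells with None so that
# PySpark's _verify_type sees NoneType instead of float.
df_clean = df_.where(pd.notnull(df_), None)
sqldf = sqlCtx.createDataFrame(df_clean, schema=schema)
{code}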

Current stack trace when calling createDataFrame on a pandas DataFrame whose object-type columns contain np.nan values (which are floats):
{code}
TypeError                                 Traceback (most recent call last)
<ipython-input-38-34f0263f0bf4> in <module>()
----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
    340 
--> 341         return self.applySchema(data, schema)
    342 
    343     def registerDataFrameAsTable(self, rdd, tableName):

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
    246 
    247         for row in rows:
--> 248             _verify_type(row, schema)
    249 
    250         # convert python objects to sql data

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
   1065         for v, f in zip(obj, dataType.fields):
-> 1066             _verify_type(v, f.dataType)
   1067 
   1068 _cached_cls = weakref.WeakValueDictionary()

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1048     if type(obj) not in _acceptable_types[_type]:
   1049         raise TypeError("%s can not accept object in type %s"
-> 1050                         % (dataType, type(obj)))
   1051 
   1052     if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'float'>
{code}
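
One possible direction, sketched here purely for illustration (this is not the actual Spark code): _verify_type could treat a float NaN the same way it already treats None, e.g. via a small helper:
{code}
import math

# Hypothetical helper, illustration only: treat float('nan') like None
# when verifying a value against the schema.
def _is_null(obj):
    return obj is None or (isinstance(obj, float) and math.isnan(obj))
{code}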


