Guilherme Berger created SPARK-22566:
----------------------------------------

             Summary: Better error message for `_merge_type` in Pandas to Spark DF conversion
                 Key: SPARK-22566
                 URL: https://issues.apache.org/jira/browse/SPARK-22566
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.2.0
            Reporter: Guilherme Berger
            Priority: Minor


When creating a Spark DF from a Pandas DF without specifying a schema, schema inference is used. This inference can fail when a column contains values of two different types; that in itself is fine. The problem is that the error message does not say in which column this happened.

When this happens, debugging is painful because the error message is too vague: it names the conflicting types but not the offending column.

I plan to submit a PR that fixes this by providing a better error message for such cases, containing the column name (and possibly the problematic values too).

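For example (a minimal sketch; the column name "amount" is hypothetical), a single column mixing ints and strings is enough to trigger the failure:

>>> import pandas as pd
>>> pandas_df = pd.DataFrame({"amount": [1, 2, "three"]})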
>>> spark_session.createDataFrame(pandas_df)
  File "redacted/pyspark/sql/session.py", line 541, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "redacted/pyspark/sql/types.py", line 1124, in _merge_type
    for f in a.fields]
  File "redacted/pyspark/sql/types.py", line 1118, in _merge_type
    raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
TypeError: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
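
One possible shape for the fix (a sketch only, assuming the current structure of types.py; the actual PR may differ, and ArrayType/MapType handling is omitted for brevity): thread the field name through _merge_type so the TypeError can say which column failed.

from pyspark.sql.types import NullType, StructType, StructField

def _merge_type(a, b, name=None):
    def new_msg(msg):
        # Prefix the offending field's name when we know it.
        return msg if name is None else "field %s: %s" % (name, msg)

    if isinstance(a, NullType):
        return b
    elif isinstance(b, NullType):
        return a
    elif type(a) is not type(b):
        raise TypeError(new_msg("Can not merge type %s and %s"
                                % (type(a), type(b))))
    elif isinstance(a, StructType):
        # Recurse per field, passing each field's name down so the
        # conflict is reported with the column that caused it.
        nfs = dict((f.name, f.dataType) for f in b.fields)
        return StructType([
            StructField(f.name,
                        _merge_type(f.dataType, nfs.get(f.name, NullType()),
                                    name=f.name))
            for f in a.fields])
    else:
        return a

With something along these lines, the example above could report the offending column, e.g. "field amount: Can not merge type ..." instead of only the two types.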