Guilherme Berger created SPARK-22566:
----------------------------------------
             Summary: Better error message for `_merge_type` in Pandas to Spark DF conversion
                 Key: SPARK-22566
                 URL: https://issues.apache.org/jira/browse/SPARK-22566
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.2.0
            Reporter: Guilherme Berger
            Priority: Minor


When creating a Spark DataFrame from a Pandas DataFrame without specifying a schema, schema inference is used. This inference can fail when a column contains values of two different types; that in itself is fine. The problem is that the error message does not say in which column the conflict occurred. When this happens, it is painful to debug, since the error message is too vague.

I plan on submitting a PR which fixes this, providing a better error message for such cases, containing the column name (and possibly the problematic values too).

    >>> spark_session.createDataFrame(pandas_df)
      File "redacted/pyspark/sql/session.py", line 541, in createDataFrame
        rdd, schema = self._createFromLocal(map(prepare, data), schema)
      File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal
        struct = self._inferSchemaFromList(data)
      File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList
        schema = reduce(_merge_type, map(_infer_schema, data))
      File "redacted/pyspark/sql/types.py", line 1124, in _merge_type
        for f in a.fields]
      File "redacted/pyspark/sql/types.py", line 1118, in _merge_type
        raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
    TypeError: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
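For illustration, here is a minimal, self-contained sketch of the kind of fix described above: threading a field name through the type-merging recursion so the TypeError can name the offending column. All names here (merge_type, the `name` parameter, the toy type classes) are hypothetical stand-ins, not the actual pyspark.sql.types internals.

    # Hypothetical sketch: a simplified type-merge that carries the field
    # name so the error message can say which column had mixed types.
    # These classes only mimic the shape of pyspark.sql.types for the demo.

    class DataType:
        def __repr__(self):
            return type(self).__name__

    class LongType(DataType):
        pass

    class StringType(DataType):
        pass

    class StructField:
        def __init__(self, name, dataType):
            self.name = name
            self.dataType = dataType

    class StructType(DataType):
        def __init__(self, fields):
            self.fields = fields

    def merge_type(a, b, name=None):
        """Merge two inferred types; `name` tracks the column for errors."""
        if type(a) is not type(b):
            where = " in field %s" % name if name else ""
            # The column name is appended to the otherwise vague message.
            raise TypeError("Can not merge type %s and %s%s"
                            % (type(a), type(b), where))
        if isinstance(a, StructType):
            b_fields = {f.name: f.dataType for f in b.fields}
            return StructType([
                StructField(f.name,
                            merge_type(f.dataType,
                                       b_fields.get(f.name, f.dataType),
                                       name=f.name))
                for f in a.fields])
        return a

    # Two inferred row schemas where column "age" has conflicting types.
    row1 = StructType([StructField("age", LongType())])
    row2 = StructType([StructField("age", StringType())])
    try:
        merge_type(row1, row2)
    except TypeError as e:
        print(e)  # the message now names the offending column "age"

With a message like this, a user converting a large Pandas DataFrame can go straight to the problematic column instead of bisecting the data by hand.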