Bryan Cutler created SPARK-29040:
------------------------------------

             Summary: Support pyspark.createDataFrame from a pyarrow.Table
                 Key: SPARK-29040
                 URL: https://issues.apache.org/jira/browse/SPARK-29040
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.0.0
            Reporter: Bryan Cutler

PySpark {{createDataFrame}} currently supports creating a Spark DataFrame from a pandas DataFrame, using Arrow if enabled. This could be extended to accept a {{pyarrow.Table}}, which has the added benefit of handling columns with nested struct types efficiently. It is possible to convert a {{pyarrow.Table}} with nested columns into a {{pandas.DataFrame}}, but the struct values become Python dictionaries, which is not a performant way to parallelize the data.

Time/date columns would need special handling, since PySpark currently uses pandas to convert Arrow data of these types to the required Spark internal format.

This follows from a mailing list discussion at http://apache-spark-user-list.1001560.n3.nabble.com/question-about-pyarrow-Table-to-pyspark-DataFrame-conversion-td36110.html