[jira] [Created] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

Pavel Ganelin (Jira) Wed, 24 Feb 2021 07:03:13 -0800

Pavel Ganelin created SPARK-34521:
-------------------------------------

             Summary: spark.createDataFrame does not support Pandas StringDtype 
extension type
                 Key: SPARK-34521
                 URL: https://issues.apache.org/jira/browse/SPARK-34521
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.1
            Reporter: Pavel Ganelin



The following test case demonstrates the problem:
{code:java}
import pandas as pd
from pyspark.sql import SparkSession, types

spark = SparkSession.builder.appName(__file__)\
    .config("spark.sql.execution.arrow.pyspark.enabled","true") \
    .getOrCreate()

good = pd.DataFrame([["abc"]], columns=["col"])

schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(good, schema=schema)

df.show()

bad = good.copy()
bad["col"]=bad["col"].astype("string")

schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(bad, schema=schema)

df.show(){code}
The error:
{code:java}
C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: 
UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Cannot specify a mask or a size when passing an object that is converted with 
the __arrow_array__ protocol.
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

Reply via email to