[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-25351: Assignee: Jalpan Randeri > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Assignee: Jalpan Randeri >Priority: Major > Labels: bulk-closed > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25351: Assignee: (was: Apache Spark) > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Priority: Major > Labels: bulk-closed > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25351: Assignee: Apache Spark > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Major > Labels: bulk-closed > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org