subject:"\[jira\] \[Assigned\] \(SPARK\-25351\) Handle Pandas category type when converting from Python with Arrow"

[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-05-27 Thread Bryan Cutler (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25351:


Assignee: Jalpan Randeri

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Jalpan Randeri
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-04-30 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25351:


Assignee: (was: Apache Spark)

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-04-30 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25351:


Assignee: Apache Spark

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

3 matches

Site Navigation

Mail list logo

Footer information