[jira] [Created] (SPARK-23929) pandas_udf schema mapped by position and not by name

Omri (JIRA) Sun, 08 Apr 2018 23:52:21 -0700

Omri created SPARK-23929:
----------------------------

             Summary: pandas_udf schema mapped by position and not by name
                 Key: SPARK-23929
                 URL: https://issues.apache.org/jira/browse/SPARK-23929
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.0
         Environment: PySpark, Spark 2.3.0
            Reporter: Omri


The fields returned by a pandas_udf should be mapped to the provided schema by 
name. Currently they are mapped by position instead. Consider these two 
examples, where the only change is the order of the fields in the provided 
schema struct:
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

# `spark` is the SparkSession provided by the PySpark shell.
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# The schema lists "v" first; the returned pandas DataFrame has columns (id, v).
@pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).show()
{code}
and this one:
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

# `spark` is the SparkSession provided by the PySpark shell.
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# The schema lists "id" first, matching the column order of the returned DataFrame.
@pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).show()
{code}
The results should be the same, but they differ.

Output of the first example:
{code:java}
+---+---+
|  v| id|
+---+---+
|1.0|  0|
|1.0|  0|
|2.0|  0|
|2.0|  0|
|2.0|  1|
+---+---+
{code}
Output of the second example:
{code:java}
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+
{code}
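
The first output is consistent with a purely positional mapping: the "v" column holds the original id values as doubles, and the "id" column holds the normalized values truncated to long. The sketch below reproduces that effect with pandas alone, without Spark. The helper map_by_position and the dtype pairs in schema are hypothetical stand-ins for whatever Spark does internally; they only illustrate the relabel-and-cast behaviour assumed here. The last step shows a possible workaround: reorder the returned columns to match the declared schema order.

```python
import pandas as pd

# Same data as in the report, as a plain pandas DataFrame (no Spark needed).
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})


def normalize(group):
    # Same logic as the UDF in the report; column order stays ("id", "v").
    v = group.v
    return group.assign(v=(v - v.mean()) / v.std())


def map_by_position(result, schema_fields):
    # Hypothetical stand-in for positional mapping (an assumption, not Spark
    # code): the i-th returned column is relabelled with the i-th schema
    # field name and cast to that field's type.
    out = pd.DataFrame(index=result.index)
    for pos, (name, dtype) in enumerate(schema_fields):
        out[name] = result.iloc[:, pos].astype(dtype)
    return out


# Mirrors the schema string "v double,id long" from the first example.
schema = [("v", "float64"), ("id", "int64")]

# Apply the UDF logic per group, as groupby("id").apply(normalize) would.
normalized = pd.concat(normalize(g) for _, g in pdf.groupby("id"))

# Positional mapping: "v" receives the id values, "id" the truncated v values.
swapped = map_by_position(normalized, schema)

# Workaround sketch: reorder the returned columns to the schema order first.
field_names = [name for name, _ in schema]
fixed = map_by_position(normalized[field_names], schema)

print(swapped)
print(fixed)
```

Under these assumptions, the equivalent change inside the UDF would be to return pdf.assign(...)[["v", "id"]], so that the column order of the returned DataFrame matches the schema string.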



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
