Omri created SPARK-23929: ---------------------------- Summary: pandas_udf schema mapped by position and not by name Key: SPARK-23929 URL: https://issues.apache.org/jira/browse/SPARK-23929 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Environment: PySpark
Spark 2.3.0 Reporter: Omri The return struct of a pandas_udf should be mapped to the provided schema by name. Currently it's not the case. Consider these two examples, where the only change is the order of the fields in the provided schema struct: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType df = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")) @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP) def normalize(pdf): v = pdf.v return pdf.assign(v=(v - v.mean()) / v.std()) df.groupby("id").apply(normalize).show() {code} and this one: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType df = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")) @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP) def normalize(pdf): v = pdf.v return pdf.assign(v=(v - v.mean()) / v.std()) df.groupby("id").apply(normalize).show() {code} The results should be the same but they are different: For the first code: {code:java} +---+---+ | v| id| +---+---+ |1.0| 0| |1.0| 0| |2.0| 0| |2.0| 0| |2.0| 1| +---+---+ {code} For the second code: {code:java} +---+-------------------+ | id| v| +---+-------------------+ | 1|-0.7071067811865475| | 1| 0.7071067811865475| | 2|-0.8320502943378437| | 2|-0.2773500981126146| | 2| 1.1094003924504583| +---+-------------------+ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org