[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-26 Thread Tr3wory (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453880#comment-16453880
 ] 

Tr3wory commented on SPARK-23929:
-

Yes, the documentation is a must, even for 2.3 if possible (it's a new feature 
of 2.3, right?).

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-25 Thread Tr3wory (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452282#comment-16452282
 ] 

Tr3wory commented on SPARK-23929:
-

I think the problem is even more nuanced: in python the order of the values in 
a dict are not defined, so generally if you create a new DataFrame in your 
function from a dict (like in the previous example) you need to specify the 
order manually.

So
{code}
@pandas_udf(schema, PandasUDFType.GROUPED_MAP) 
def constants(grp):
return pd.DataFrame({"id":grp.iloc[0]['id'], "ones":1, "zeros":0},index = [0])

df.groupby("id").apply(constants).show()
{code}
and
{code}
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  
def constants(grp):
return pd.DataFrame({"id":grp.iloc[0]['id'], "zeros":0, "ones":1},index = 
[0])

df.groupby("id").apply(constants).show()
{code}
gives you the exact same wrong results.
 The fact that this happens without any error or warning is even more worrying 
(but only if the types are compatible, if any of them a string, you get strange 
errors).

You need to specify the order manually:
{code}
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  
def constants(grp):
return pd.DataFrame({"id":grp.iloc[0]['id'], "zeros":0, "ones":1}, 
columns=['id', 'zeros', 'ones'], index=[0])

df.groupby("id").apply(constants).show()
{code}
I think the current implementation is hard to use correctly and very easy to 
use incorrectly which greatly outweighs the small flexibility of not specify 
the names...

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-25 Thread Tr3wory (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452459#comment-16452459
 ] 

Tr3wory commented on SPARK-23929:
-

Yes, but that's not simpler than using "columns=[...]":
{code:python}
from collections import OrderedDict

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  
def constants(grp):
return pd.DataFrame(OrderedDict([("id",grp.iloc[0]['id']), ("zeros",0), 
("ones",1)]),index = [0])

df.groupby("id").apply(constants).show()
{code}

 

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org