[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-12-30 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005500#comment-17005500
 ] 

Maciej Szymkiewicz commented on SPARK-28264:


Thanks [~hyukjin.kwon].

In general I think this proposal is pretty good, as long as this variant of 
the API is optional and an alternative, old-style path is provided (this is, 
for example, how 
[{{@functools.singledispatch}}|https://docs.python.org/3.8/library/functools.html#functools.singledispatch]
 has worked since Python 3.7). 
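
For reference, a minimal sketch of the dual-path pattern mentioned above: since Python 3.7, {{functools.singledispatch}} accepts the dispatch type either as an explicit argument to {{register}} (old style) or from the type annotation of the first parameter (new style). The {{describe}} function and its return values are made up for illustration only.

```python
from functools import singledispatch


@singledispatch
def describe(value):
    # Fallback used when no more specific implementation is registered.
    return "other"


@describe.register(int)  # old style: dispatch type passed explicitly
def _(value):
    return "int"


@describe.register  # new style (3.7+): dispatch type taken from the annotation
def _(value: str):
    return "str"
```

Both registration styles coexist on the same function, which is the kind of optional, type-hint-driven path plus explicit fallback described in the comment.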


> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> In the past two years, the pandas UDFs are perhaps the most important changes 
> to Spark for Python data science. However, these functionalities have evolved 
> organically, leading to some inconsistencies and confusions among users. This 
> document revisits UDF definition and naming, as a result of discussions among 
> Xiangrui, Li Jin, Hyukjin, and Reynold.
> -See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]-
>  New proposal: 
> https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-12-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005228#comment-17005228
 ] 

Hyukjin Kwon commented on SPARK-28264:
--

I came up with a new proposal. Please take a look if you find some time.







[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-12-08 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990958#comment-16990958
 ] 

Reynold Xin commented on SPARK-28264:
-

Sounds good. Thanks for doing this [~hyukjin.kwon]!

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> In the past two years, the pandas UDFs are perhaps the most important changes 
> to Spark for Python data science. However, these functionalities have evolved 
> organically, leading to some inconsistencies and confusions among users. This 
> document revisits UDF definition and naming, as a result of discussions among 
> Xiangrui, Li Jin, Hyukjin, and Reynold.
>  
> See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]
>  






[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990872#comment-16990872
 ] 

Hyukjin Kwon commented on SPARK-28264:
--

[~rxin], I sent an email to the dev list, but I'm leaving a comment here as 
well to make sure.
I'll take this over since it's kind of stuck in the middle for now. I think 
it's worth making the changes within 3.0.







[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-07-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893209#comment-16893209
 ] 

Bryan Cutler commented on SPARK-28264:
--

It's great to be taking another look at this; I think some aspects are really 
confusing. I left some comments in the doc, but to sum it up, I think anything 
we can do to reduce the number of arguments and options will make it more user 
friendly. Replacing the pandas udf types with other options would make things 
more flexible, but I'm not sure it makes it any easier to understand.

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> In the past two years, the pandas UDFs are perhaps the most important changes 
> to Spark for Python data science. However, these functionalities have evolved 
> organically, leading to some inconsistencies and confusions among users. This 
> document revisits UDF definition and naming, as a result of discussions among 
> Xiangrui, Li Jin, Hyukjin, and Reynold.
>  
> See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]
>  






[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-07-15 Thread Holden Karau (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885586#comment-16885586
 ] 

Holden Karau commented on SPARK-28264:
--

I'd love to know [~bryanc]'s thoughts on this proposal as well; I'm sure he has 
something interesting to say.







[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-07-15 Thread Holden Karau (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885581#comment-16885581
 ] 

Holden Karau commented on SPARK-28264:
--

Maybe I missed it in the document, but I don't currently see a solution to 
problem #6 ("GROUPED_MAP, GROUPED_AGG and COGROUPED_MAP are potentially a 
recipe for disaster"). I left some suggestions, but they're early thoughts. I 
think it's a worthwhile thing to keep in mind. Ideally our APIs should make 
scalable solutions the path of least resistance.







[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-07-11 Thread Maciej Szymkiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882774#comment-16882774
 ] 

Maciej Szymkiewicz commented on SPARK-28264:


Personally I fail to see why some UDF types are needed at all. With "classic" 
UDFs (standard Python UDF, SCALAR, GROUPED_AGG), intended for use with the SQL 
API (GroupedData.agg, DataFrame.select, ...), the returned object could be 
described as:

 
{code:python}
class UserDefinedFunctionLike(Protocol):
    def __call__(self, *_: ColumnOrName) -> Column: ...
{code}

That makes perfect sense as the result is called explicitly by the end user. 
Additionally, metadata has to be attached to such a UDF to be compatible with 
the SQL API.

Now GROUPED_MAP, MAP_ITER and COGROUPED_MAP are quite different beasts: they 
are not intended to be called directly by the end user. The API is still the 
same, and one can even call the object and get a Column, but such a Column is 
useless in the current API. That's confusing at best.

Unlike "SQL" UDFs, MAP_* and *_MAP are used in specialized contexts. This 
creates an opportunity. If we think about functions with the following 
interfaces


{code:python}
class PandasGroupedMapFunction(Protocol):
    def __call__(
        self, _: pandas.core.frame.DataFrame
    ) -> pandas.core.frame.DataFrame: ...

class PandasMapIterFunction(Protocol):
    def __call__(
        self, _: Generator[pandas.core.frame.DataFrame]
    ) -> Generator[pandas.core.frame.DataFrame]: ...

class PandasCogroupFunction(Protocol):
    def __call__(
        self,
        left: pandas.core.frame.DataFrame,
        right: pandas.core.frame.DataFrame,
    ) -> Generator[pandas.core.frame.DataFrame]: ...
{code}
 

Then UDFish metadata can be provided directly in apply / mapPartitionsInPandas 
methods:



{code:python}
class CoGroupedData():
    def apply(
        self, udf: PandasCogroupFunction, returnType: DataTypeOrString
    ) -> DataFrame: ...

class GroupedData():
    def apply(
        self, udf: PandasGroupedMapFunction, returnType: DataTypeOrString
    ) -> DataFrame: ...

    def mapPartitionsInPandas(
        self, udf: PandasMapIterFunction, returnType: DataTypeOrString
    ) -> DataFrame: ...
{code}
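
To make the idea concrete, here is a toy, pure-Python sketch of metadata supplied at the call site rather than attached to the function. It is not PySpark code: {{ToyGroupedData}}, the list-of-dicts rows, and the simplified {{"name type, name type"}} schema string are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]


class ToyGroupedData:
    """Hypothetical stand-in for GroupedData that works on plain lists of dicts."""

    def __init__(self, rows: List[Row], key: str) -> None:
        self._groups: Dict[object, List[Row]] = defaultdict(list)
        for row in rows:
            self._groups[row[key]].append(row)

    def apply(
        self, func: Callable[[List[Row]], List[Row]], returnType: str
    ) -> List[Row]:
        # The schema arrives with the call, so func needs no UDF wrapper.
        expected = [field.split()[0] for field in returnType.split(",")]
        out: List[Row] = []
        for group in self._groups.values():
            for row in func(group):
                if list(row) != expected:
                    raise ValueError(f"expected columns {expected}, got {list(row)}")
                out.append(row)
        return out


def total_per_key(group: List[Row]) -> List[Row]:
    # An ordinary function: no decorator and no function type attached.
    return [{"k": group[0]["k"], "total": sum(r["v"] for r in group)}]


rows = [{"k": "a", "v": 1}, {"k": "b", "v": 5}, {"k": "a", "v": 2}]
result = ToyGroupedData(rows, "k").apply(total_per_key, "k string, total long")
```

In this sketch {{total_per_key}} stays a plain callable that cannot be mistaken for something usable in a SQL expression; only {{apply}} knows about the return schema.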

Such a design could curb the proliferation of UDF types and provide a clear 
distinction between the SQL and non-SQL APIs.

That leaves us with the SCALAR_ITER thingy... If the current interface is to be 
preserved, one possible solution is to choose convention over configuration:


{code:python}
class PandasScalarFunction(Protocol):
    def __call__(
        self, *_: pandas.core.series.Series
    ) -> pandas.core.series.Series: ...

class PandasIterScalarFunction(Protocol):
    # Note Generator in the return type.
    def __call__(
        self, _: Iterator[pandas.core.series.Series]
    ) -> Generator[pandas.core.series.Series]: ...
{code}

In such a case we could avoid exposing new UDF types and simply distinguish 
between both cases based on the type of the input objects:


{code:python}
import inspect

if functionType is PandasUDFType.SCALAR and inspect.isgeneratorfunction(f):
    ...  # Go SQL_SCALAR_PANDAS_ITER_UDF path

elif functionType is PandasUDFType.SCALAR:
    ...  # Go SQL_SCALAR_PANDAS_UDF path
{code}
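
A self-contained sketch of that convention, runnable without Spark or pandas, may help. The helper {{choose_eval_path}} and the three sample functions are hypothetical; plain lists stand in for pandas Series, and the returned strings only echo the path names used in the snippet above.

```python
import inspect


def choose_eval_path(f):
    """Return the name of the evaluation path a SCALAR function would take."""
    if inspect.isgeneratorfunction(f):
        return "SQL_SCALAR_PANDAS_ITER_UDF"
    return "SQL_SCALAR_PANDAS_UDF"


def plus_one(batch):
    # Regular function: one batch in, one batch out.
    return [x + 1 for x in batch]


def plus_one_iter(batches):
    # Generator function: consumes an iterator of batches and yields results.
    for batch in batches:
        yield [x + 1 for x in batch]


def plus_one_lazy(batches):
    # Returns an iterator but contains no yield, so it is not a generator function.
    return map(plus_one, batches)
```

One consequence of relying on {{inspect.isgeneratorfunction}} is that a function such as {{plus_one_lazy}}, which returns an iterator without using {{yield}}, would be routed to the non-iterator path. That is the price of choosing convention over configuration here.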

A similar approach can be used if other _ITER variants are introduced in the 
future.

If the API is to be simplified:

{code:python}
class PandasIterScalarFunction(Protocol):
    def __call__(
        self, _: pandas.core.series.Series
    ) -> pandas.core.series.Series: ...
{code}

_ITER could simply be pushed into the pandas_udf signature:

{code:python}
def pandas_udf(f=None, returnType=None, functionType=None, iterate=False): ...
{code}









[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-07-07 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879980#comment-16879980
 ] 

Sean Owen commented on SPARK-28264:
---

I generally like the rationalization of the various UDF types, as they do 
different things, things which aren't so obvious from the names. Anything we 
can do to clarify is a win.



