[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005500#comment-17005500 ]

Maciej Szymkiewicz commented on SPARK-28264:

Thanks [~hyukjin.kwon]. In general I think that this proposal is pretty good, as long as this variant of the API is optional and an alternative, old-style path is provided (this is, for example, how [{{@functools.singledispatch}}|https://docs.python.org/3.8/library/functools.html#functools.singledispatch] works since Python 3.7).

> Revisiting Python / pandas UDF
> ------------------------------
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Priority: Blocker
>
> In the past two years, the pandas UDFs are perhaps the most important changes
> to Spark for Python data science. However, these functionalities have evolved
> organically, leading to some inconsistencies and confusions among users. This
> document revisits UDF definition and naming, as a result of discussions among
> Xiangrui, Li Jin, Hyukjin, and Reynold.
> -See document here:
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]-
> New proposal:
> https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
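For readers unfamiliar with the precedent cited in the comment above, here is a minimal sketch of how `functools.singledispatch` offers both paths side by side: the annotation-based registration available since Python 3.7 and the older explicit-type form. The function name `describe` and its return strings are made up for illustration and have nothing to do with the Spark proposal itself.

```python
from functools import singledispatch


@singledispatch
def describe(value):
    """Fallback used when no more specific implementation is registered."""
    return f"object: {value!r}"


# New style (Python 3.7+): the dispatch type is read from the annotation.
@describe.register
def _(value: int):
    return f"int: {value}"


# Old style: the dispatch type is passed explicitly and is still supported.
@describe.register(str)
def _(value):
    return f"str: {value}"
```

Both registrations coexist on the same dispatcher, which is the property the comment asks for: the type-hint variant is a convenience, and the explicit path remains available.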
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005228#comment-17005228 ]

Hyukjin Kwon commented on SPARK-28264:

I came up with a new proposal. Please take a look if you find some time.
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990958#comment-16990958 ]

Reynold Xin commented on SPARK-28264:

Sounds good. Thanks for doing this [~hyukjin.kwon]!

> Revisiting Python / pandas UDF
> ------------------------------
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Priority: Critical
>
> In the past two years, the pandas UDFs are perhaps the most important changes
> to Spark for Python data science. However, these functionalities have evolved
> organically, leading to some inconsistencies and confusions among users. This
> document revisits UDF definition and naming, as a result of discussions among
> Xiangrui, Li Jin, Hyukjin, and Reynold.
>
> See document here:
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990872#comment-16990872 ]

Hyukjin Kwon commented on SPARK-28264:

[~rxin], I sent an email to the dev list, but I'm leaving a comment here as well to make sure. I'll take over this since it's kind of stuck in the middle for now. I think it's worth making the changes within 3.0.
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893209#comment-16893209 ]

Bryan Cutler commented on SPARK-28264:

It's great to be taking another look at this; I think some aspects are really confusing. I left some comments in the doc, but to sum it up, I think anything we can do to reduce the number of arguments and options will make it more user friendly. While replacing the pandas UDF types with other options would make things more flexible, I'm not sure it would make them any easier to understand.

> Revisiting Python / pandas UDF
> ------------------------------
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Priority: Major
>
> In the past two years, the pandas UDFs are perhaps the most important changes
> to Spark for Python data science. However, these functionalities have evolved
> organically, leading to some inconsistencies and confusions among users. This
> document revisits UDF definition and naming, as a result of discussions among
> Xiangrui, Li Jin, Hyukjin, and Reynold.
>
> See document here:
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885586#comment-16885586 ]

Holden Karau commented on SPARK-28264:

I'd love to know [~bryanc]'s thoughts on this proposal as well, I'm sure he has something interesting to say.
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885581#comment-16885581 ]

Holden Karau commented on SPARK-28264:

Maybe I missed it in the document, but I don't currently see a solution to problem #6 ("GROUPED_MAP, GROUPED_AGG and COGROUPED_MAP are potentially a recipe for disaster"). I left some suggestions, but they're early thoughts. I think it's a worthwhile thing to keep in mind. Ideally our APIs should make scalable solutions the path of least resistance.
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882774#comment-16882774 ]

Maciej Szymkiewicz commented on SPARK-28264:

Personally I fail to see why some UDF types are needed at all.

With "classic" UDFs (standard Python UDF, SCALAR, GROUPED_AGG) intended for use with the SQL API (GroupedData.agg, DataFrame.select, ...), the returned object could be described as:

{code:python}
from typing import Protocol  # Python 3.8+; typing_extensions on older versions

class UserDefinedFunctionLike(Protocol):
    def __call__(self, *_: ColumnOrName) -> Column: ...
{code}

That makes perfect sense, as the result is called explicitly by the end user. Additionally, metadata has to be attached to such a UDF to make it compatible with the SQL API.

Now GROUPED_MAP, MAP_ITER and COGROUPED_MAP are quite different beasts: they are not intended to be called directly by the end user. The API is still the same, and one can even call the object and get a Column, but such a Column is useless in the current API. That's confusing at best.

Unlike "SQL" UDFs, MAP_* and *_MAP are used in specialized contexts. This creates an opportunity. If we think about functions with the following interfaces

{code:python}
from typing import Generator

import pandas

class PandasGroupedMapFunction(Protocol):
    def __call__(self, _: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame: ...

class PandasMapIterFunction(Protocol):
    def __call__(self, _: Generator[pandas.core.frame.DataFrame, None, None]) -> Generator[pandas.core.frame.DataFrame, None, None]: ...

class PandasCogroupFunction(Protocol):
    def __call__(self, left: pandas.core.frame.DataFrame, right: pandas.core.frame.DataFrame) -> Generator[pandas.core.frame.DataFrame, None, None]: ...
{code}

then the UDF-ish metadata can be provided directly in the apply / mapPartitionsInPandas methods:

{code:python}
class CoGroupedData():
    def apply(self, udf: PandasCogroupFunction, returnType: DataTypeOrString) -> DataFrame: ...

class GroupedData():
    def apply(self, udf: PandasGroupedMapFunction, returnType: DataTypeOrString) -> DataFrame: ...

    def mapPartitionsInPandas(self, udf: PandasMapIterFunction, returnType: DataTypeOrString) -> DataFrame: ...
{code}

Such a design could curb the proliferation of UDF types and provide a clear distinction between the SQL and non-SQL APIs.

That leaves us with the SCALAR_ITER thingy... If the current interface is to be preserved, one possible solution is to choose convention over configuration:

{code:python}
from typing import Iterator

class PandasScalarFunction(Protocol):
    def __call__(self, *_: pandas.core.series.Series) -> pandas.core.series.Series: ...

class PandasIterScalarFunction(Protocol):
    def __call__(self, _: Iterator[pandas.core.series.Series]) -> Generator[pandas.core.series.Series, None, None]: ...  # Note Generator in the return type.
{code}

In such a case we could avoid exposing new UDF types and simply distinguish between the two cases based on whether the supplied function is a generator function:

{code:python}
import inspect

if functionType is PandasUDFType.SCALAR and inspect.isgeneratorfunction(f):
    ...  # Go SQL_SCALAR_PANDAS_ITER_UDF path
elif functionType is PandasUDFType.SCALAR:
    ...  # Go SQL_SCALAR_PANDAS_UDF path
{code}

A similar approach can be used if other _ITER variants are introduced in the future.

If the API is to be simplified:

{code:python}
class PandasIterScalarFunction(Protocol):
    def __call__(self, _: pandas.core.series.Series) -> pandas.core.series.Series: ...
{code}

_ITER could simply be pushed into the pandas_udf signature:

{code:python}
def pandas_udf(f=None, returnType=None, functionType=None, iterate=False): ...
{code}
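The "convention over configuration" idea from the comment above (choosing the scalar or iterator path with `inspect.isgeneratorfunction`) can be tried out without Spark or pandas. The following is only a rough sketch under stated assumptions: `EvalType`, `resolve_eval_type`, `plus_one` and `plus_one_iter` are hypothetical names invented here, the enum values merely echo the path names mentioned in the comment, and plain Python lists stand in for `pandas.Series` batches.

```python
import inspect
from enum import Enum
from typing import Callable, Iterator, List


class EvalType(Enum):
    """Hypothetical stand-ins for the two evaluation paths named in the comment."""
    SCALAR = "SQL_SCALAR_PANDAS_UDF"
    SCALAR_ITER = "SQL_SCALAR_PANDAS_ITER_UDF"


def resolve_eval_type(f: Callable) -> EvalType:
    """Pick the evaluation path from the shape of the function itself."""
    if inspect.isgeneratorfunction(f):
        return EvalType.SCALAR_ITER
    return EvalType.SCALAR


def plus_one(batch: List[int]) -> List[int]:
    # Ordinary function: one batch in, one batch out.
    return [x + 1 for x in batch]


def plus_one_iter(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    # Generator function: consumes an iterator of batches, yields batches.
    for batch in batches:
        yield [x + 1 for x in batch]
```

One caveat with this convention: `inspect.isgeneratorfunction` only recognizes functions whose body contains `yield`. A regular function that builds and returns a generator object (for example, by returning a generator expression) would be classified as the plain scalar case, so the convention would need to be documented for users.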
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879980#comment-16879980 ]

Sean Owen commented on SPARK-28264:

I generally like the rationalization of the various UDF types, as they do different things, things which aren't so obvious from the names. Anything we can do to clarify is a win.