[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900 @icexelloss as a daily user of `pandas_udf`, the inability to use keyword arguments, and the difficulties around default arguments (due in part to the magic that converts string arguments to `pd.Series`, which doesn't apply to default args), are much more annoying to me than the lack of support for partials and callables, which are more peripheral issues. (Take this as just one data point, certainly; others may have differing opinions.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
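The default-argument pitfall described above can be sketched without pyspark at all. Below is a simplified, hypothetical analogy (`convert_args` and `describe` are illustrative names, not pyspark code): a wrapper converts every argument *passed at the call site*, the way `pandas_udf` wraps incoming column values, but a declared default never passes through that conversion.

```python
import functools

# Hypothetical stand-in for pandas_udf's argument handling: each passed
# argument is converted (here: wrapped in a list, standing in for the
# string-column -> pd.Series conversion), but defaults are left untouched.
def convert_args(func):
    @functools.wraps(func)
    def wrapper(*args):
        return func(*[[a] for a in args])
    return wrapper

@convert_args
def describe(x, default="n/a"):
    # x arrives converted; `default` keeps its raw declared value
    return type(x).__name__, type(default).__name__

print(describe("hello"))  # ('list', 'str'): the default was never converted
```

The asymmetry is the point: a caller who expects `default` to behave like any other column-valued argument will be surprised, which is exactly why defaults are awkward with `pandas_udf`.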
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900 Partials (and callable objects) are supported in UDF but not `pandas_udf`; kw args are not supported by either.
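One reason partials are harder for `pandas_udf` than for plain UDFs, sketched in pure Python as an assumption about why the introspection path differs: a `functools.partial` is callable, but it is not a plain function object, so code paths that expect function-only attributes (as `getargspec`-style introspection does) break on it.

```python
import functools

def add(a, b):
    return a + b

p = functools.partial(add, b=1)

print(callable(p))             # True: it can be invoked like a function
print(hasattr(p, "__code__"))  # False: it lacks the attributes that
                               # function-only introspection relies on
```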
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900 Many (though not all; I don't think `callable`s are affected) of the limitations of `pandas_udf` relative to UDF in this domain stem from the fact that `pandas_udf` doesn't allow keyword arguments at the call site. This obviously affects plain function-based `pandas_udf`s, but also partial fns, where one would typically need to specify the partially-applied argument by name.

In the incremental commits of this PR as of https://github.com/apache/spark/pull/20900/commits/9ea2595f0cecb0cd05e0e6b99baf538679332e8b you can see the kind of things I was investigating to try to fix that case. Indeed, my original PR was (ambitiously) titled something about enabling keyword arguments for all `pandas_udf`s. This is actually very easy to do for *functions* on python3 (and maybe for plain functions in py2 also, though I suspect that is rather tricky, since `getargspec` is pretty unhelpful when it comes to some of the keyword-argument metadata one would need). However, it is rather harder for the partial function case: one quickly gets into stacktraces from places like `python/pyspark/worker.py`, where the current strategy does not realize that a column from the arguments list may already be "accounted for", so one runs into duplicate arguments passed for `a`, for example.

My summary is that the change to allow keyword arguments for functions is simple (at least in py3; indeed my incremental commit referenced above does this), but for partial fns maybe not so much. I'm pretty confident I'm most of the way to accomplishing the former, but not the latter. However, I have no substantial knowledge of the pyspark codebase, so you will likely have better luck there, should you go down that route :)

**TL;DR**: I could work on a PR to allow keyword arguments for python3 functions (not partials, and not py2), but that is likely too narrow a goal given the broader context.
One general question: how do we tend to think about the py2/py3 split for API quirks/features? Must everything added for py3 also be functional in py2?
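The "duplicate arguments passed for `a`" failure mode described above can be reproduced in plain Python, with no Spark involved. Assuming (as the comment describes for `python/pyspark/worker.py`) that the dispatcher forwards every column positionally without knowing that the partial already bound one of them by keyword:

```python
import functools

def multiply(a, b):
    return a * b

# Partially apply `a` by keyword, as one typically must with partials:
half = functools.partial(multiply, a=0.5)

# A positional-only dispatcher forwards both columns positionally, not
# realizing `a` is already accounted for; the call below is effectively
# multiply(10, 2, a=0.5):
try:
    half(10, 2)
except TypeError as e:
    print(e)  # multiply() got multiple values for argument 'a'
```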
[GitHub] spark pull request #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `p...
Github user mstewart141 closed the pull request at: https://github.com/apache/spark/pull/20798
[GitHub] spark issue #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20798 See https://github.com/apache/spark/pull/20900
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900 @HyukjinKwon the old PR (https://github.com/apache/spark/pull/20798) was a disaster from a git-cleanliness perspective, so I've updated here.
[GitHub] spark pull request #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `p...
GitHub user mstewart141 opened a pull request: https://github.com/apache/spark/pull/20900

[SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_udf` with keyword args

## What changes were proposed in this pull request?

Add documentation about the limitations of `pandas_udf` with keyword arguments and related concepts, like `functools.partial` fn objects.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mstewart141/spark udfkw2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20900.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20900

commit 048570f7e5f421288b7c297e4d2e3873626a6adc
Author: Michael (Stu) Stewart
Date: 2018-03-11T20:38:29Z
[SPARK-23645][PYTHON] Allow python udfs to be called with keyword arguments

commit 9ea2595f0cecb0cd05e0e6b99baf538679332e8b
Author: Michael (Stu) Stewart
Date: 2018-03-18T18:04:21Z
Incomplete / Show issue with partial fn in pandas_udf

commit acd1cbe53dc7d1bf83b1022a7e36652cd9530b58
Author: Michael (Stu) Stewart
Date: 2018-03-18T18:13:53Z
Add note RE no keyword args in python UDFs

commit bc49c3cc5ae2e23da5cc7b6d7e1a779e9d012c8c
Author: Michael (Stu) Stewart
Date: 2018-03-24T17:30:15Z
Address comments
[GitHub] spark issue #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20798 All that makes sense; I will update.
[GitHub] spark issue #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20798 @HyukjinKwon thanks again. I've updated this PR to add documentation. I dug pretty deep into the bigger issue around kwargs/partial functions, and you can see what I did in the commit: https://github.com/apache/spark/pull/20798/commits/969f9073ee06d2a5641f78247b75e30d9ad1679a Basically, throughout the udf and arrow serialization code there is no notion of kwargs being supported, which makes wiring everything together more challenging than I anticipated. Definitely not impossible, but not a small undertaking either.
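Part of what the plumbing would need is the keyword-argument metadata that the legacy `getargspec` API cannot see. As a sketch of what is available in principle: `inspect.getfullargspec` (python3) does expose keyword-only arguments and their defaults, so the information exists even though the udf/arrow serialization paths do not carry it today.

```python
import inspect

# A function with a keyword-only argument (python3-only syntax):
def agg(col, *, scale=1.0):
    return col * scale

spec = inspect.getfullargspec(agg)
print(spec.kwonlyargs)      # ['scale']
print(spec.kwonlydefaults)  # {'scale': 1.0}
```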
[GitHub] spark issue #20798: [SPARK-23645][PYTHON] Allow python udfs to be called wit...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20798 [WIP] cc @HyukjinKwon. I'd love to run tests here to make sure I haven't broken something. I will update the PR with new tests once I set up testing better on my local box.
[GitHub] spark pull request #20798: [SPARK-23645][PYTHON] Allow python udfs to be cal...
GitHub user mstewart141 opened a pull request: https://github.com/apache/spark/pull/20798

[SPARK-23645][PYTHON] Allow python udfs to be called with keyword arguments

## [WIP]

## What changes were proposed in this pull request?

Currently one cannot pass keyword arguments to python UDFs. This patch allows keyword arguments to be mixed arbitrarily with positional arguments, as in normal python functions. UDFs accepting an arbitrary (undefined) number of columns are a different matter, and are not addressed here.

## How was this patch tested?

I will add unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mstewart141/spark udfkw

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20798.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20798

commit 5ec810a7c36691df1877ffc11e6f06392d438485
Author: Michael (Stu) Stewart
Date: 2018-03-11T20:38:29Z
[SPARK-23645][PYTHON] Allow python udfs to be called with keyword arguments
[GitHub] spark issue #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work with pyth...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20728 Your test definitely makes sense; yeah, the syntax error in py2 is why I wasn't sure how to go about testing this in the first place. This certainly gets the job done.
[GitHub] spark pull request #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work wi...
Github user mstewart141 commented on a diff in the pull request: https://github.com/apache/spark/pull/20728#discussion_r172063118

--- Diff: python/pyspark/sql/udf.py ---
@@ -42,10 +42,15 @@ def _create_udf(f, returnType, evalType):
                     PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF):
         import inspect
+        import sys
         from pyspark.sql.utils import require_minimum_pyarrow_version

         require_minimum_pyarrow_version()
-        argspec = inspect.getargspec(f)
+
+        if sys.version_info[0] < 3:
+            argspec = inspect.getargspec(f)
+        else:
+            argspec = inspect.getfullargspec(f)
--- End diff --

can do.
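The version-dispatch pattern in the diff above can be sketched as a standalone helper (`get_argspec` is an illustrative name, not the actual pyspark function). Note that `inspect.getargspec` was deprecated in python3 and removed in 3.11, so on any python3 interpreter only the `getfullargspec` branch runs:

```python
import inspect
import sys

def get_argspec(f):
    # getargspec is the python2.7 API; getfullargspec is python3-only
    # and additionally understands annotations and keyword-only args.
    if sys.version_info[0] < 3:
        return inspect.getargspec(f)
    return inspect.getfullargspec(f)

spec = get_argspec(lambda col, scale=2: col * scale)
print(spec.args)  # ['col', 'scale']
```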
[GitHub] spark issue #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work with pyth...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20728 What should the next step be here?
[GitHub] spark issue #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work with pyth...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20728 cc @HyukjinKwon
[GitHub] spark pull request #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work wi...
GitHub user mstewart141 opened a pull request: https://github.com/apache/spark/pull/20728

[SPARK-23569][PYTHON] Allow pandas_udf to work with python3 style type-annotated functions

## What changes were proposed in this pull request?

Check the python version to determine whether to use `inspect.getargspec` or `inspect.getfullargspec` before applying `pandas_udf` core logic to a function. The former is for python2.7 (and deprecated in python3); the latter is for python3.x. The latter correctly accounts for type annotations, which are syntax errors in python2.x.

## How was this patch tested?

Locally, on python 2.7 and 3.6.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mstewart141/spark pandas_udf_fix

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20728.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20728

commit 3cd53f39f23ebd1b9b4134a9ac22348b301f8bd4
Author: Michael (Stu) Stewart
Date: 2018-03-03T21:54:53Z
[SPARK-23569][PYTHON] Allow pandas_udf to work with python3 style type-annotated functions
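The annotation case the PR description is about can be demonstrated directly: `inspect.getfullargspec` reads a type-annotated function without trouble, and exposes the annotations. (String annotations are used below only to keep the example self-contained without importing pandas; the `plus_one` function is illustrative, not Spark code.)

```python
import inspect

def plus_one(v: "pd.Series") -> "pd.Series":
    # annotation syntax like this is a SyntaxError under python2,
    # and legacy getargspec-based introspection cannot report it
    return v + 1

spec = inspect.getfullargspec(plus_one)
print(spec.args)         # ['v']
print(spec.annotations)  # {'v': 'pd.Series', 'return': 'pd.Series'}
```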