[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-27 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
@icexelloss as a daily user of `pandas_udf`, the inability to use keyword 
arguments and the difficulties around default arguments (due in part to the 
magic that converts string arguments to `pd.Series`, which doesn't apply to 
default args) are much more annoying to me than the lack of support for 
partials and callables, which are more peripheral issues.

(Take this as just one data point, certainly; others may have differing 
opinions.)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-27 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20900
  
Created https://issues.apache.org/jira/browse/SPARK-23800





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-27 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20900
  
@HyukjinKwon Thanks for the explanation. I will create a Jira for partial 
functions and callable objects in Pandas UDF. I am happy to take a look at it.







[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
The issue itself (SPARK-23645) describes support for keyword arguments at 
the call site in both UDFs and Pandas UDFs. That does not seem to work, but 
the fix looks like it would be quite invasive and large, so I suggested fixing 
the documentation for now. Maybe we should revisit this in the future. Let's 
monitor the mailing list and JIRAs.

https://github.com/apache/spark/pull/20900#issuecomment-376356469 together with 
https://github.com/apache/spark/pull/20900#issuecomment-376357750 is a separate 
issue, about partial functions and callable objects in Pandas UDFs, which I 
found during review.






[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
To be clear, I think both functions below

```python
class F(object):
    def __call__(self, *args):
        ...

func = F()  # a callable object
```

```python
from functools import partial

def naive_func(a, b):
    ...

func = partial(naive_func, a=1)  # a partial function
```

should work with Pandas UDF, but they do not seem to work given my test 
https://github.com/apache/spark/pull/20900#issuecomment-375949480





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
@icexelloss, yup ^ is correct. IIRC, we have some tests for normal udfs 
with callable objects and partial functions separately, but it seems the 
problem is in Pandas UDF. I think the fix itself will be relatively minimal 
(just my wild guess).

Would you be interested in doing this?





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-26 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
Partials (and callable objects) are supported in UDF but not `pandas_udf`; 
kw args are not supported by either.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-26 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20900
  
Thank you @mstewart141 for looking into this.

@HyukjinKwon should we open Jira for supporting kw args and partial 
functions in python UDFs? If I understand correctly, this is related to both 
regular python UDFs and pandas UDFs, is that right?





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
Merged to master and branch-2.3 anyway.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
I think we should generally make everything work in both Python 2 and 
Python 3, but I also want to know if there are any special cases that I am 
missing.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20900
  
Merged build finished. Test PASSed.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20900
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88573/
Test PASSed.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20900
  
**[Test build #88573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88573/testReport)** for PR 20900 at commit [`a3da39c`](https://github.com/apache/spark/commit/a3da39ca62f69fd4e3a4c417ed28613edd15924f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20900
  
> One general question: how do we tend to think about the py2/3 split for 
api quirks/features? Must everything that is added for py3 also be functional 
in py2?

Ideally. Is there something you have in mind that would not work in py2?






[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
Many (though not all; I don't think `callable`s are impacted) of the 
limitations of `pandas_udf` relative to UDF in this domain are due to the fact 
that `pandas_udf` doesn't allow for keyword arguments at the call site. This 
obviously impacts plain old function-based `pandas_udf`s but also partial fns, 
where one would typically need to specify the argument (that one was partially 
applying) by name.

In the incremental commits of this PR as at:

https://github.com/apache/spark/pull/20900/commits/9ea2595f0cecb0cd05e0e6b99baf538679332e8b

You can see the kind of things I was investigating to try and fix that 
case. Indeed, my original PR was (ambitiously) titled something about enabling 
kw args for all pandas_udfs. This is actually very easy to do for *functions* 
on python3 (and maybe for plain functions in py2 also, but I suspect that is 
rather tricky, as `getargspec` is pretty unhelpful when it comes to some of 
the kw-arg metadata one would need). However, it is rather harder for the 
partial function case: one quickly gets into stacktraces from places like 
`python/pyspark/worker.py`, where the current strategy does not realize that a 
column from the arguments list may already be "accounted for", and as a result 
one runs into duplicate arguments passed for `a`, for example.

My summary is that the change to allow kw for functions is simple (at least 
in py3 -- indeed my incremental commit referenced above does this), but for 
partial fns maybe not so much. I'm pretty confident I'm most of the way to 
accomplishing the former, but not the latter.

However, I have no substantial knowledge of the pyspark codebase so you 
will likely have better luck there, should you go down that route :)

**TL;DR**: I could work on a PR to allow keyword arguments for python3 
functions (not partials, and not py2), but that is likely too narrow a goal 
given the broader context.

One general question: how do we tend to think about the py2/3 split for api 
quirks/features? Must everything that is added for py3 also be functional in 
py2?





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20900
  
**[Test build #88573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88573/testReport)** for PR 20900 at commit [`a3da39c`](https://github.com/apache/spark/commit/a3da39ca62f69fd4e3a4c417ed28613edd15924f).





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
LGTM except https://github.com/apache/spark/pull/20900#discussion_r176930776





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
From a very quick look at the case "Try to be sneaky and don't use 
keywords with partial:".

It seems to be due to a type mismatch. This seems to work fine (in Python 3):

```
>>> spark.range(1).withColumn('ok', pandas_udf(f=partial(test_func, 2), returnType='bigint')('id')).show()
+---+---+
| id| ok|
+---+---+
|  0|  2|
+---+---+
```

I think we can take this example out of the description.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
@mstewart141, just to be clear, the error:

```
ValueError: Function has keyword-only parameters or annotations, use getfullargspec() API which can support them
```

comes from using the deprecated `getargspec` instead of `getfullargspec`, 
which you have fixed. The current error looks like this:

```
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/functions.py", line 2380, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 51, in _create_udf
    argspec = _get_argspec(f)
  File "/.../spark/python/pyspark/util.py", line 60, in _get_argspec
    argspec = inspect.getargspec(f)
  File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 818, in getargspec
    raise TypeError('{!r} is not a Python function'.format(func))
TypeError:  is not a Python function
```

with the reproducer below:

```python
from functools import partial
from pyspark.sql.functions import pandas_udf

def test_func(a, b):
    return a + b

pandas_udf(partial(test_func, b='id'), "string")
```

I think this should work like a normal udf:

```python
from functools import partial
from pyspark.sql.functions import udf

def test_func(a, b):
    return a + b

normal_udf = udf(partial(test_func, b='id'), "string")
df = spark.createDataFrame([["a"]])
df.select(normal_udf("_1")).show()
```

So, I think we should add support for callable objects / partial 
functions in Pandas UDFs. Would you be interested in filing JIRA(s) and 
proceeding? If you are busy, I am willing to do it as well.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20900
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88566/
Test PASSed.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20900
  
Merged build finished. Test PASSed.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20900
  
**[Test build #88566 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88566/testReport)** for PR 20900 at commit [`bc49c3c`](https://github.com/apache/spark/commit/bc49c3cc5ae2e23da5cc7b6d7e1a779e9d012c8c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20900
  
**[Test build #88566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88566/testReport)** for PR 20900 at commit [`bc49c3c`](https://github.com/apache/spark/commit/bc49c3cc5ae2e23da5cc7b6d7e1a779e9d012c8c).





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20900
  
ok to test





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
@HyukjinKwon the old PR, https://github.com/apache/spark/pull/20798, was a 
disaster from a git-cleanliness perspective, so I've updated here.





[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20900
  
Can one of the admins verify this patch?




