GitHub user e-dorigatti opened a pull request:
https://github.com/apache/spark/pull/21383
[SPARK-23754][Python] Re-raising StopIteration in client code
## What changes were proposed in this pull request?
Make sure that `StopIteration`s raised in users' code do not silently
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r189823961
--- Diff: python/pyspark/shuffle.py ---
@@ -67,6 +67,19 @@ def get_used_memory():
return 0
+def safe_iter(f
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r189863049
--- Diff: python/pyspark/rdd.py ---
@@ -173,6 +173,7 @@ def ignore_unicode_prefix(f):
return f
+
--- End diff --
I
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r189872060
--- Diff: python/pyspark/shuffle.py ---
@@ -67,6 +67,19 @@ def get_used_memory():
return 0
+def safe_iter(f
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r189890760
--- Diff: python/pyspark/shuffle.py ---
@@ -67,6 +67,19 @@ def get_used_memory():
return 0
+def safe_iter(f
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r190153031
--- Diff: python/pyspark/shuffle.py ---
@@ -67,6 +67,19 @@ def get_used_memory():
return 0
+def safe_iter(f
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21383
I am thinking to use `fail_on_stopiteration` in the worker instead of in
the `UserDefinedFunction`. I don't really like this solution, since you have to
fix every other place that uses a udf
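The wrapper under discussion turns a `StopIteration` raised inside user code into a `RuntimeError`, so it can no longer be mistaken for normal iterator exhaustion. A minimal sketch of such a helper (the name matches the `fail_on_stopiteration` mentioned above; the exact error message in PySpark may differ, and `bad_udf` is just an illustrative buggy function):

```python
def fail_on_stopiteration(f):
    """Wrap f so that a StopIteration raised inside it surfaces as a
    RuntimeError instead of silently ending the surrounding iteration."""
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError(
                "Caught StopIteration thrown from user's code; failing the task"
            ) from exc
    return wrapper


def bad_udf(x):
    raise StopIteration()  # a buggy user function

# The wrapped call now fails loudly instead of truncating iteration:
wrapped = fail_on_stopiteration(bad_udf)
# wrapped(1) raises RuntimeError rather than StopIteration
```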
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r190603773
--- Diff: python/pyspark/tests.py ---
@@ -1246,6 +1277,25 @@ def test_pipe_unicode(self):
result = rdd.pipe('cat').collect
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r190567010
--- Diff: python/pyspark/tests.py ---
@@ -1246,6 +1277,31 @@ def test_pipe_unicode(self):
result = rdd.pipe('cat').collect
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r190656662
--- Diff: python/pyspark/tests.py ---
@@ -1246,6 +1277,25 @@ def test_pipe_unicode(self):
result = rdd.pipe('cat').collect
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21383
I must be doing something wrong, because tests pass on my machine. For now I
have no other way but to keep pushing here
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21383
I am really not sure how moving and renaming the function broke the `udf`
test; I'm looking into it right now
GitHub user e-dorigatti opened a pull request:
https://github.com/apache/spark/pull/21538
[SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from
driver to executor
SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user
function, but this required
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21538
Seems like it skipped the Pandas tests, for both Python 2.7 and PyPy:
```
Will skip Pandas related features against Python executable
```
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21467#discussion_r194041622
--- Diff: python/pyspark/worker.py ---
@@ -140,15 +139,18 @@ def read_single_udf(pickleSer, infile, eval_type):
else
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
I updated the description, so that it will be included in the commit
message, and slightly clarified some comments
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21538
@HyukjinKwon thank you so much for your patience :)
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user e-dorigatti closed the pull request at:
https://github.com/apache/spark/pull/21538
Github user e-dorigatti closed the pull request at:
https://github.com/apache/spark/pull/21461
Github user e-dorigatti closed the pull request at:
https://github.com/apache/spark/pull/21463
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21463
@HyukjinKwon go ahead, no problem
GitHub user e-dorigatti opened a pull request:
https://github.com/apache/spark/pull/21467
[SPARK-23754][Python] Fix UDF hack
## What changes were proposed in this pull request?
SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user
function, but this required
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21463
@HyukjinKwon haha no probs :) I have to open a new PR though, as yesterday I
messed up my repos.
GitHub user e-dorigatti opened a pull request:
https://github.com/apache/spark/pull/21461
[SPARK-23754][Python] Backport bugfix
Fix for master was already accepted in [another pull
request](https://github.com/apache/spark/pull/21383), but there were conflicts
while merging
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21383
@HyukjinKwon [here](https://github.com/apache/spark/pull/21461)
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r191418719
--- Diff: python/pyspark/sql/udf.py ---
@@ -157,7 +157,17 @@ def _create_judf(self):
spark = SparkSession.builder.getOrCreate
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r191427825
--- Diff: python/pyspark/sql/udf.py ---
@@ -157,7 +157,17 @@ def _create_judf(self):
spark = SparkSession.builder.getOrCreate
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
Does Jenkins show _all_ the failed tests, or just the first one?
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
I'm the least qualified to answer here, but technically the bug is fixed
and this is just to clean up the code a bit, so it shouldn't be a blocker (but
don't quote me on that)
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
@viirya we only want to revert `udf.py` and the hack in `_get_argspec`. Did
I miss anything there?
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21467#discussion_r192502611
--- Diff: python/pyspark/util.py ---
@@ -53,16 +53,11 @@ def _get_argspec(f):
"""
Get argspec of a function. Support
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
`PyArrow` again
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
@viirya for RDDs, we cannot check for `StopIteration` in the executor,
because this bug is introduced in the driver by rewriting the methods. Consider
`rdd.map`:
```
def map(self, f, preservesPartitioning=False):
```
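The failure mode at stake here can be demonstrated with the builtin `map`, which is essentially what `RDD.map` applies per partition: a `StopIteration` escaping the user function is indistinguishable from normal exhaustion, so downstream consumers simply stop early. A small illustration (`f` is a hypothetical buggy user function, not code from the PR):

```python
def f(x):
    if x == 3:
        raise StopIteration()  # a bug in user code
    return x * 2

# list() treats the StopIteration leaking out of map() as "iterator finished",
# so the remaining elements are silently dropped instead of raising an error.
result = list(map(f, [1, 2, 3, 4, 5]))
print(result)  # [2, 4] -- silent data loss, no exception
```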
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
messed up, of course :) fixing
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
retest this please
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
@HyukjinKwon thank you, I thought it was reserved to members
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21383
Yes, the problem was that the signature is lost when the function is
wrapped, and the worker needs the signature to know whether the function needs
keys together with values
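The signature loss described above can be reproduced with `inspect.getfullargspec`, which is the kind of check the worker relies on to learn the parameter names: a plain wrapper hides them. (The toy names below are illustrative, not taken from the PR.)

```python
import inspect

def udf(key, value):           # the worker inspects the args to decide
    return value               # whether to pass keys along with values

def wrap(f):
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

print(inspect.getfullargspec(udf).args)        # ['key', 'value']
print(inspect.getfullargspec(wrap(udf)).args)  # [] -- the signature is gone
```

This is why the argspec has to be taken from the original function before wrapping, rather than from the wrapper afterwards.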
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r191113977
--- Diff: python/pyspark/util.py ---
@@ -89,6 +93,33 @@ def majorMinorVersion(sparkVersion):
" version nu
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r191472884
--- Diff: python/pyspark/sql/udf.py ---
@@ -157,7 +157,17 @@ def _create_judf(self):
spark = SparkSession.builder.getOrCreate
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21383#discussion_r191473968
--- Diff: python/pyspark/sql/tests.py ---
@@ -900,6 +900,22 @@ def __call__(self, x):
self.assertEqual(f, f_.func
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21467#discussion_r192114498
--- Diff: python/pyspark/worker.py ---
@@ -69,6 +69,7 @@ def chain(f, g):
def wrap_udf(f, return_type):
+f
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21467#discussion_r192116000
--- Diff: python/pyspark/worker.py ---
@@ -69,6 +69,7 @@ def chain(f, g):
def wrap_udf(f, return_type):
+f
GitHub user e-dorigatti opened a pull request:
https://github.com/apache/spark/pull/21463
[SPARK-23754][BRANCH-2.3][PYTHON] Re-raising StopIteration in client code
## What changes were proposed in this pull request?
Make sure that `StopIteration`s raised in users' code do not silently
interrupt
Github user e-dorigatti commented on a diff in the pull request:
https://github.com/apache/spark/pull/21467#discussion_r192455582
--- Diff: python/pyspark/worker.py ---
@@ -140,15 +139,20 @@ def read_single_udf(pickleSer, infile, eval_type):
else
Github user e-dorigatti commented on the issue:
https://github.com/apache/spark/pull/21467
Seems like a problem in Jenkins:
```
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
```