[GitHub] spark issue #21397: [SPARK-24334] Fix race condition in ArrowPythonRunner ca...

2018-06-26 Thread vamaral1
Github user vamaral1 commented on the issue:

https://github.com/apache/spark/pull/21397
  
Thanks for the quick responses. I rebuilt everything from scratch and am 
still getting the error on large datasets. With a few tens of GB of input 
there's no problem, but once the input reaches a couple hundred GB I start 
seeing the issue. I will try to create a reproducible example and post it here 
shortly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21397: [SPARK-24334] Fix race condition in ArrowPythonRunner ca...

2018-06-26 Thread vamaral1
Github user vamaral1 commented on the issue:

https://github.com/apache/spark/pull/21397
  
Thanks for the fix. I was hitting the memory leak described in 
[JIRA](https://issues.apache.org/jira/browse/SPARK-24334) when working with 
pandas UDFs, and upgrading my Spark version to pick up the patch fixed it. 
However, I'm now getting an error related to the serializer, and I'm having 
trouble debugging it and understanding the stack trace. Any ideas?

```
INFO TaskSetManager: Lost task [...] 
org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/home/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/home/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/spark-current/python/lib/pyspark.zip/pyspark/serializers.py", line 260, in dump_stream
    for series in iterator:
  File "/home/spark-current/python/lib/pyspark.zip/pyspark/serializers.py", line 279, in load_stream
    for batch in reader:
  File "ipc.pxi", line 268, in __iter__
  File "ipc.pxi", line 284, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1
```
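
One possible reading of the trace, offered as a sketch rather than a diagnosis: `load_stream` appears underneath `dump_stream`, which suggests the output serializer is lazily pulling from the input iterator, so the failure would be in reading the incoming Arrow batches even though the outermost serializer frame is the write side. The plain-Python example below only illustrates that frame ordering. The function bodies are hypothetical stand-ins named after the frames in the trace, not Spark's or pyarrow's actual code, and `failing_reader` is invented for the illustration.

```python
import traceback


def failing_reader():
    # Hypothetical input source: yields one batch, then fails on the next read.
    yield [1, 2, 3]
    raise IOError("read length must be positive or -1")


def load_stream(reader):
    # Stand-in for the read side: a generator, so nothing is read
    # until a consumer asks for the next batch.
    for batch in reader:
        yield batch


def dump_stream(iterator, out):
    # Stand-in for the write side: iterating here is what drives the reads.
    for series in iterator:
        out.append(series)


def frames_of_failure():
    # Run the pipeline and collect the function names in the traceback.
    out = []
    try:
        dump_stream(load_stream(failing_reader()), out)
    except IOError as exc:
        names = [f.name for f in traceback.extract_tb(exc.__traceback__)]
        return out, names
    return out, []


written, frame_names = frames_of_failure()
print(written)
print(frame_names)
```

In this sketch the first batch is written before the read error surfaces, and `dump_stream` is listed above `load_stream` in the collected frames, which matches the shape of the trace above.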

