Hello, I'm running Spark 1.5.1 on EMR with Python 3. I have a PySpark job that does some simple joins and `reduceByKey` operations. It works fine most of the time, but occasionally it fails with the following error:
```
15/11/09 03:00:53 WARN TaskSetManager: Lost task 2.0 in stage 4.0 (TID 69, ip-172-31-8-142.ap-southeast-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1723, in add_shuffle_key
OverflowError: cannot convert float infinity to integer

	at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
	at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
	at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
	at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:914)
	at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
```

The line in question is: https://github.com/apache/spark/blob/4f894dd6906311cb57add6757690069a18078783/python/pyspark/rdd.py#L1723

I'm having a hard time seeing how `batch` could ever be set to infinity, and the error is also hard to reproduce consistently. Help?
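For what it's worth, here is a minimal standalone sketch (plain Python, no Spark) showing that a float grown by repeated multiplication does eventually saturate to `inf`, and that passing such a value to `int()` raises exactly this `OverflowError`. The variable names and the starting value are my own and only mirror the `batch` variable in the linked code:

```python
# Repeatedly multiplying a float by 1.5 eventually overflows past the
# largest representable double (~1.8e308) and saturates to inf.
batch = 1000.0  # starting value chosen for illustration only
steps = 0
while batch != float("inf"):
    batch *= 1.5
    steps += 1

print(batch)  # inf

# Comparisons and arithmetic on inf are fine, but int() conversion is not:
try:
    int(batch)
except OverflowError as e:
    print(e)  # cannot convert float infinity to integer
```

So if anything in that code path kept growing `batch` multiplicatively without a cap, a later `int(...)` on it would blow up exactly like the traceback shows. Whether that is what actually happens in `add_shuffle_key` is the question.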