Hello,

I am running Spark 1.5.1 on EMR using Python 3.
I have a PySpark job that does some simple joins and reduceByKey
operations.
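To give a rough idea of its shape, the job is essentially the following
(simplified, with made-up paths and names, not the actual code):

    # sc is the SparkContext created by spark-submit
    left = sc.textFile("s3://bucket/left/") \
             .map(lambda line: tuple(line.split("\t", 1)))
    right = sc.textFile("s3://bucket/right/") \
              .map(lambda line: tuple(line.split("\t", 1)))

    joined = left.join(right)                        # simple join on the key
    counts = joined.map(lambda kv: (kv[0], 1)) \
                   .reduceByKey(lambda a, b: a + b)  # simple aggregation

    counts.saveAsTextFile("s3://bucket/output/")

It works fine most of the time, but sometimes I get the following error: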

15/11/09 03:00:53 WARN TaskSetManager: Lost task 2.0 in stage 4.0 (TID 69, ip-172-31-8-142.ap-southeast-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1723, in add_shuffle_key
OverflowError: cannot convert float infinity to integer

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
        at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
        at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:914)
        at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
        at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
        at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


The line in question is:
https://github.com/apache/spark/blob/4f894dd6906311cb57add6757690069a18078783/python/pyspark/rdd.py#L1723

I'm having a hard time seeing how `batch` could ever be set to infinity.
The error is also hard to reproduce consistently.
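
For what it's worth, the OverflowError itself is easy to reproduce in
plain Python, with no Spark involved: multiplying a float past its
maximum does not raise, it silently becomes float('inf'), and calling
int() on an infinite float raises exactly this message. So if something
kept growing `batch` as a float on a long-running task without ever
shrinking it, it would eventually overflow to infinity. A minimal
sketch (the starting value, growth factor, and loop count are
arbitrary, just enough to force the overflow):

    batch = 1000.0
    for _ in range(2000):
        batch *= 1.5   # silently overflows to float('inf') after ~1700 iterations

    print(batch)       # inf
    int(batch)         # OverflowError: cannot convert float infinity to integer

What I still don't see is what would drive `batch` that high in
practice, or why it only happens some of the time.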

Help?
