Hi all,

I'm running Spark 1.2.0 on a 20-node YARN EMR cluster. I've noticed that
whenever I run a heavy computation job in parallel with other running jobs,
I get these kinds of exceptions:

* [task-result-getter-2] INFO  org.apache.spark.scheduler.TaskSetManager-
Lost task 820.0 in stage 175.0 (TID 11327) on executor xxxxxxx:
java.io.IOException (Failed to connect to xxxxxxxxxx:35194) [duplicate 12]

* org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 12

* org.apache.spark.shuffle.FetchFailedException: Failed to connect to
xxxxxxxxxxxxxxxxx:35194
Caused by: java.io.IOException: Failed to connect to xxxxxxxxxxxxxxxxx:35194

When I run the heavy job alone on the cluster, I don't get any errors. My
guess is that the Spark contexts of different applications don't share
information about which ports are already taken, and therefore collide on
specific ports, causing the stage (and eventually the job) to fail. Is there
a way to assign a specific set of executors to a specific Spark job via
spark-submit, or is there a way to define a range of ports to be used by
each application?
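For example, is something along these lines the right direction? (This is
just my guess at the relevant properties; the jar name and port values are
placeholders, and I'm not sure these settings behave this way on YARN.)

    # hypothetical submit for the heavy job; my-heavy-job.jar is a placeholder
    spark-submit \
      --master yarn-cluster \
      --num-executors 10 \
      --conf spark.driver.port=40000 \
      --conf spark.blockManager.port=40010 \
      --conf spark.port.maxRetries=32 \
      my-heavy-job.jar

The idea would be to give each concurrent application its own port range so
their block managers can't step on each other, if that's actually what's
happening here.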

Thanks!
Tomer
