Never mind - I don't know what I was thinking with the below. It's just
maxTaskFailures causing the job to fail.
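
A minimal sketch of raising that limit, assuming it maps to the
spark.task.maxFailures setting (default 4); the app name below is a placeholder:

    from pyspark import SparkConf, SparkContext

    # Assumption: the maxTaskFailures limit is driven by spark.task.maxFailures
    # (default 4). Raising it lets a flaky task be retried more times before
    # the whole job is aborted.
    conf = (SparkConf()
            .setAppName("s3-parse-job")           # placeholder app name
            .set("spark.task.maxFailures", "8"))
    sc = SparkContext(conf=conf)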

From: Griffiths, Michael (NYC-RPM) [mailto:michael.griffi...@reprisemedia.com]
Sent: Monday, November 10, 2014 4:48 PM
To: user@spark.apache.org
Subject: Spark Master crashes job on task failure

Hi,

I'm running Spark in standalone mode: 1 master, 15 slaves. I started the
cluster with the ec2 script, and I'm currently breaking the job into many small
parts (~2,000) to better examine progress and failures.

Pretty basic - I'm submitting a PySpark job (via spark-submit) to the cluster.
The job consists of loading a file from S3, performing minor parsing, and
storing the results in an RDD. The results are then written to Hadoop with
saveAsTextFile.
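
For context, a minimal sketch of that kind of job; the bucket, parse logic, and
output path are placeholders, not the actual ones:

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-parse-job")

    # Load the raw lines from S3 (bucket and key pattern are placeholders).
    lines = sc.textFile("s3n://example-bucket/input/*")

    # Minor parsing: strip and split each line into fields (illustrative only).
    records = lines.map(lambda line: line.strip().split("\t"))

    # Write the parsed results out to HDFS on the cluster.
    records.saveAsTextFile("hdfs:///user/root/output")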

Unfortunately, it keeps crashing. A small number of the jobs fail - timeout
errors, I believe - and over half of the jobs that fail succeed when they are
re-run. Still, a task failing shouldn't crash the entire job: Spark should just
retry it up to four times and then give up on it.

However, the entire job does crash. I was wondering why, and I believe that
when a job is assigned to SPARK_MASTER and fails multiple times, it throws a
SparkException and brings down the Spark master. If it were a slave, it would
be OK - it could either re-register and continue, or not, but the entire job
would continue (to completion).

I've run the job a few times now, and the point at which it crashes depends on
when one of the failing jobs gets assigned to the master.

The short-term solution would be to exclude the master from running jobs, but I
don't see that option. Does it exist? Can I exclude the master from accepting
tasks in Spark standalone mode?

The long-term solution, of course, is figuring out what part of the job (or
which file in S3) is causing the error and fixing it. But right now I'd just
like to get the first results back, knowing I'll be missing 0.25% of the data.

Thanks,
Michael
