Yes. If the work keeps failing (4 times in this case), Spark assumes there is a problem with the job itself, fails it, and notifies the user. In return, the engine can put its resources toward serving other jobs.
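For reference, that "4" is spark.task.maxFailures, whose default is 4: the number of attempts a single task gets before Spark aborts the stage and fails the job. A minimal sketch of changing it, assuming a Scala Spark Streaming app (app name, master, and batch interval here are only illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RetryConfigSketch {
  def main(args: Array[String]): Unit = {
    // Raising spark.task.maxFailures only buys more retries per task;
    // it does not fix whatever is making the task fail in the first place.
    val conf = new SparkConf()
      .setAppName("my-streaming-app")        // illustrative name
      .setMaster("local[2]")                 // only so the sketch runs locally;
                                             // the setting matters on a real cluster
      .set("spark.task.maxFailures", "8")
    val ssc = new StreamingContext(conf, Seconds(10))

    // ... define your DStreams here ...

    ssc.start()
    ssc.awaitTermination()
  }
}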
I would try the following until things improve:

1. Figure out what is wrong with the job that is failing.
2. Make exception handling more functional: http://oneeyedmen.com/failure-is-not-an-option-part-4.html (Kotlin, but the ideas still apply).

Why do #2? Because it will force you to decide which exceptions your code should handle, which Spark should handle, and which should cause job failure (the current behavior). By analogy to the halting problem <https://en.wikipedia.org/wiki/Chaitin%27s_constant#Relationship_to_the_halting_problem>, I believe that expecting a program to handle all possible exceptional states is unreasonable. (A rough sketch of this approach follows the quoted message below.)

Jm2c

Jason

On Tue, May 21, 2019 at 9:30 AM bsikander <behro...@gmail.com> wrote:

> umm, i am not sure if I got this fully.
>
> Is it a design decision to not have context.stop() right after
> awaitTermination throws an exception?
>
> So the idea is that if a task fails after n tries (default 4), Spark
> should fail fast and let the user know? Is this correct?
>
> As you mentioned, there are many error classes and the chances of getting
> an exception are quite high. If the above is correct, then it makes it
> really hard to keep the job up and running all the time, especially in
> streaming cases.
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

--
Thanks,
Jason
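P.S. To make #2 concrete, here is a rough sketch in Scala rather than Kotlin, under the assumption that records arrive as comma-separated strings. Record, parsePayload, and the error types are hypothetical stand-ins for your own code; the point is deciding per exception class what should happen instead of letting every exception burn one of the task's attempts:

import scala.util.{Try, Success, Failure}

object SafeParsing {
  // Hypothetical record/error types standing in for your own domain.
  final case class Record(id: Long, payload: String)

  sealed trait ParseError
  final case class Recoverable(msg: String) extends ParseError // log/count and drop
  final case class Fatal(cause: Throwable)  extends ParseError // rethrow -> job fails

  // Classify failures explicitly rather than letting them all escape.
  def parsePayload(raw: String): Either[ParseError, Record] =
    Try {
      val Array(id, payload) = raw.split(",", 2)
      Record(id.trim.toLong, payload)
    } match {
      case Success(rec)                      => Right(rec)
      case Failure(_: NumberFormatException) => Left(Recoverable(s"bad id in: $raw"))
      case Failure(_: MatchError)            => Left(Recoverable(s"unexpected shape: $raw"))
      case Failure(e)                        => Left(Fatal(e)) // anything unanticipated
    }

  // Inside a DStream/RDD transformation: keep good records, drop (and count/log)
  // recoverable ones, and rethrow only genuinely unexpected errors so Spark's
  // retry/abort behavior is reserved for those.
  def handle(raw: String): Option[Record] = parsePayload(raw) match {
    case Right(rec)           => Some(rec)
    case Left(Recoverable(_)) => None // e.g. increment a metric here
    case Left(Fatal(e))       => throw e
  }
}

Used as, say, stream.flatMap(SafeParsing.handle), bad records no longer take the whole job down, while truly unexpected errors still fail fast the way Spark intends.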