Hi, I don't know if this is a bug or a feature, but one thing in streaming error handling seems confusing to me. I create a streaming context, start it, and call #awaitTermination like this:
```java
try {
    ssc.awaitTermination();
} catch (Exception e) {
    LoggerFactory.getLogger(getClass()).error("Job failed. Stopping JVM", e);
    System.exit(-1);
}
```

I expect the JVM to terminate as soon as any job fails, and no further jobs to be started. But this is actually not the case: before the exception is caught, another job starts. This is caused by the design of the JobScheduler event loop:

```scala
private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable => reportError("Error in job scheduler", e)
  }
}
```

If an error happens, handleError wakes up a lock in ContextWaiter and notifies my main thread. But in the meantime the scheduler starts the next job, and sometimes there is even enough time for it to complete! I have several jobs in each batch and want each of them to run if and only if the previous one completed successfully. From an API user's point of view this behavior is confusing, and you cannot guess how it works without looking into the source code. What do you think about adding another Spark configuration parameter, 'stopOnError', that stops the streaming context when an error happens and does not allow the next job to run?
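To make the race concrete, here is a minimal stdlib-only sketch (not actual Spark code; all class and method names are hypothetical) of the pattern described above: an event loop notifies a waiter about a failure but keeps processing the next queued job, so by the time the waiting thread reacts, the next job has already run.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, modeling the JobScheduler/ContextWaiter interaction.
public class SchedulerRaceDemo {
    static boolean nextJobRunsAfterError() throws InterruptedException {
        BlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();
        CountDownLatch errorReported = new CountDownLatch(1);
        AtomicBoolean nextJobStarted = new AtomicBoolean(false);

        jobs.add(() -> { throw new RuntimeException("job 1 failed"); });
        jobs.add(() -> nextJobStarted.set(true)); // runs despite the failure

        Thread eventLoop = new Thread(() -> {
            Runnable job;
            while ((job = jobs.poll()) != null) {
                try {
                    job.run();
                } catch (Throwable t) {
                    errorReported.countDown(); // like notifying the waiter
                    // note: no stop here -- the loop simply continues with job 2
                }
            }
        });
        eventLoop.start();
        errorReported.await(); // like awaitTermination observing the error
        eventLoop.join();
        return nextJobStarted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("next job ran after error: " + nextJobRunsAfterError());
    }
}
```

Because the loop has finished both jobs by the time join() returns, the method deterministically reports that the second job ran even though the error was already signalled.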
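Until something like 'stopOnError' exists, one possible user-side workaround (a sketch only, not a Spark API; the class and method names here are my own invention) is to wrap each job body in a guard that checks a shared failure flag, so later jobs become no-ops once any earlier job has thrown:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard: skips a job once any previously guarded job has failed.
public class StopOnErrorGuard {
    private final AtomicBoolean failed = new AtomicBoolean(false);

    Runnable guard(Runnable job) {
        return () -> {
            if (failed.get()) return; // an earlier job failed: skip this one
            try {
                job.run();
            } catch (RuntimeException e) {
                failed.set(true); // remember the failure for later jobs
            }
        };
    }

    public static void main(String[] args) {
        StopOnErrorGuard g = new StopOnErrorGuard();
        List<String> ran = new ArrayList<>();
        g.guard(() -> ran.add("job1")).run();
        g.guard(() -> { throw new RuntimeException("job2 failed"); }).run();
        g.guard(() -> ran.add("job3")).run(); // skipped because job2 failed
        System.out.println(ran); // prints [job1]
    }
}
```

In a streaming job one would apply the guard inside the function passed to each output operation; it does not stop the scheduler itself, but it at least prevents later jobs from doing work after a failure.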