Hi, I don't know if this is a bug or a feature, but one thing in streaming error handling seems confusing to me. I create a streaming context, start it, and call #awaitTermination like this:
```java
try {
    ssc.awaitTermination();
} catch (Exception e) {
    LoggerFactory.getLogger(getClass()).error("Job failed. Stopping JVM", e);
    System.exit(-1);
}
```

I expect the JVM to terminate as soon as any job fails, and no further jobs to be started. But this is actually not the case: before the exception is caught, another job starts. This is caused by the design of the JobScheduler event loop:

```scala
private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable => reportError("Error in job scheduler", e)
  }
}
```

If an error happens, handleError wakes up a lock in ContextWaiter and notifies my main thread. But in the meantime the scheduler starts the next job, and sometimes there is even enough time for it to complete! I have several jobs in each batch and want each of them to run if and only if the previous one completed successfully. From an API user's point of view this behavior is confusing, and you cannot guess how it works without looking into the source code. What do you think about adding another Spark configuration parameter, 'stopOnError', that stops the streaming context when an error happens and does not allow the next job to run?
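To make the race concrete, here is a minimal stdlib-only sketch (not actual Spark code; all class and method names are hypothetical) of the pattern described above: an event loop notifies a waiter about a failure but keeps processing the next queued job, so by the time the waiting thread reacts, the next job has already run.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, modeling the JobScheduler/ContextWaiter interaction.
public class SchedulerRaceDemo {
    static boolean nextJobRunsAfterError() throws InterruptedException {
        BlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();
        CountDownLatch errorReported = new CountDownLatch(1);
        AtomicBoolean nextJobStarted = new AtomicBoolean(false);

        jobs.add(() -> { throw new RuntimeException("job 1 failed"); });
        jobs.add(() -> nextJobStarted.set(true)); // runs despite the failure

        Thread eventLoop = new Thread(() -> {
            Runnable job;
            while ((job = jobs.poll()) != null) {
                try {
                    job.run();
                } catch (Throwable t) {
                    errorReported.countDown(); // like notifying the waiter
                    // note: no stop here -- the loop simply continues with job 2
                }
            }
        });
        eventLoop.start();
        errorReported.await(); // like awaitTermination observing the error
        eventLoop.join();
        return nextJobStarted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("next job ran after error: " + nextJobRunsAfterError());
    }
}
```

Because the loop has finished both jobs by the time join() returns, the method deterministically reports that the second job ran even though the error was already signalled.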
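Until something like 'stopOnError' exists, one possible user-side workaround (a sketch only, not a Spark API; the class and method names here are my own invention) is to wrap each job body in a guard that checks a shared failure flag, so later jobs become no-ops once any earlier job has thrown:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard: skips a job once any previously guarded job has failed.
public class StopOnErrorGuard {
    private final AtomicBoolean failed = new AtomicBoolean(false);

    Runnable guard(Runnable job) {
        return () -> {
            if (failed.get()) return; // an earlier job failed: skip this one
            try {
                job.run();
            } catch (RuntimeException e) {
                failed.set(true); // remember the failure for later jobs
            }
        };
    }

    public static void main(String[] args) {
        StopOnErrorGuard g = new StopOnErrorGuard();
        List<String> ran = new ArrayList<>();
        g.guard(() -> ran.add("job1")).run();
        g.guard(() -> { throw new RuntimeException("job2 failed"); }).run();
        g.guard(() -> ran.add("job3")).run(); // skipped because job2 failed
        System.out.println(ran); // prints [job1]
    }
}
```

In a streaming job one would apply the guard inside the function passed to each output operation; it does not stop the scheduler itself, but it at least prevents later jobs from doing work after a failure.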