I don't think deleting the checkpoint directory is a good way to restart the streaming job; stop the Spark context, or at the very least kill the driver process, then restart.
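A minimal sketch of that kill-and-resubmit approach: stop the driver process cleanly and resubmit the job, leaving the checkpoint directory in place so the restarted job can recover from it. The submit command below is a placeholder, not the actual job's command line.

```python
import subprocess

def stop_driver(proc, grace_seconds=30):
    """Ask the driver process to exit (SIGTERM), escalating to SIGKILL if it
    does not shut down within the grace period; returns the exit code."""
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode

def restart_job(proc, submit_cmd=("spark-submit", "streaming-job.jar")):
    """Stop the old driver, then resubmit; submit_cmd is a placeholder for
    the real job's spark-submit invocation. Returns the new process handle."""
    stop_driver(proc)
    return subprocess.Popen(list(submit_cmd))
```

The point is that the checkpoint directory is never touched, so the resubmitted job can pick up from its last checkpointed state.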
On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:

> Hi Cody,
>
> Our job is our failsafe as we don't have control over the Kafka stream as
> of now. Can setting rebalance max retries help? We do not have any monitors
> set up as of now. We need to set up the monitors.
>
> My idea is to have some kind of cron job that queries the Streaming API
> for monitoring, like every 5 minutes, and then sends an email alert and
> automatically restarts the Streaming job by deleting the checkpoint
> directory. Would that help?
>
> Thanks!
>
> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> The direct stream will fail the task if there is a problem with the Kafka
>> broker. Spark will retry failed tasks automatically, which should handle
>> broker rebalances that happen in a timely fashion. spark.task.maxFailures
>> controls the maximum number of retries before failing the job. The direct
>> stream isn't any different from any other Spark task in that regard.
>>
>> The question of what kind of monitoring you need is more a question for
>> your particular infrastructure and what you're already using for
>> monitoring. We put all metrics (application level or system level) into
>> Graphite and alert from there.
>>
>> I will say that if you've regularly got problems with Kafka falling over
>> for half an hour, I'd look at fixing that before worrying about Spark
>> monitoring...
>>
>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How can Kafka Direct be made to recover automatically when there is a
>>> problem with the Kafka brokers? Sometimes our Kafka brokers get messed up
>>> and the entire streaming job blows up, unlike some other consumers which
>>> do recover automatically. How can I make sure that Kafka Direct recovers
>>> automatically when the broker fails for some time, say 30 minutes? What
>>> kind of monitors should be in place to recover the job?
>>>
>>> Thanks,
>>> Swetha
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
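The cron-style monitor described in the thread could be sketched roughly like this: poll the driver's REST API (`/api/v1/applications` exists in Spark 1.4+) and restart the job when it stops looking healthy. The host, port, application name, and restart command below are placeholders, and the restart resubmits the job rather than deleting the checkpoint directory.

```python
import json
import subprocess
import urllib.error
import urllib.request

# Placeholder URL; point this at the actual driver's UI host and port.
MONITOR_URL = "http://driver-host:4040/api/v1/applications"

def fetch_applications(url, timeout=10):
    """Return the parsed application list, or None if the UI is unreachable
    (an unreachable UI usually means the driver itself is down)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError):
        return None

def is_healthy(apps, app_name="MyStreamingJob"):
    """Simple health rule: the named application is present and has at least
    one attempt that has not completed (i.e. it is still running)."""
    if not apps:
        return False
    for app in apps:
        if app.get("name") == app_name:
            return any(not attempt.get("completed", False)
                       for attempt in app.get("attempts", []))
    return False

def check_and_restart():
    """Run this from cron every few minutes; the restart script is a
    placeholder for whatever resubmits the job (and could also send the
    email alert)."""
    if not is_healthy(fetch_applications(MONITOR_URL)):
        subprocess.call(["./restart-streaming-job.sh"])
```

Alerting could be added where the restart happens; the key difference from the plan quoted above is that nothing here deletes the checkpoint directory.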