I don't think deleting the checkpoint directory is a good way to restart
the streaming job. You should stop the Spark context, or at the very least
kill the driver process, and then restart.
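
If you do want the kind of cron-driven monitor described below, a rough sketch might look like this. This assumes a Spark version whose driver UI exposes the streaming REST API; the URL, port, application id, and the 5-minute delay threshold are all assumptions for illustration, not anything from this thread:

```python
# Hypothetical monitor: run from cron, poll the driver's streaming REST API,
# and alert when batches fall behind. Restart logic (e.g. killing the driver
# and resubmitting) would go where the alert is raised -- without deleting
# the checkpoint directory, so the restarted driver can recover from it.
import json
import urllib.request

SPARK_UI = "http://localhost:4040"  # assumption: default driver UI port


def is_stalled(batches, max_delay_ms=5 * 60 * 1000):
    """True if any recent batch's scheduling delay exceeds the cap."""
    return any(b.get("schedulingDelay", 0) > max_delay_ms for b in batches)


def fetch_batches(app_id):
    # Streaming endpoint; only present in Spark versions that expose the
    # streaming REST API under /api/v1.
    url = "%s/api/v1/applications/%s/streaming/batches" % (SPARK_UI, app_id)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    batches = fetch_batches("app-12345")  # hypothetical application id
    if is_stalled(batches):
        print("streaming job stalled; alert and restart the driver")
```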

On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:

> Hi Cody,
>
> Our job is our failsafe, as we don't have control over the Kafka stream as
> of now. Can setting rebalance max retries help? We don't have any monitors
> set up yet; we need to set them up.
>
> My idea is to have some kind of cron job that queries the Streaming API
> for monitoring, say every 5 minutes, then sends an email alert and
> automatically restarts the Streaming job by deleting the checkpoint
> directory. Would that help?
>
>
>
> Thanks!
>
> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> The direct stream will fail the task if there is a problem with the Kafka
>> broker.  Spark will retry failed tasks automatically, which should handle
>> broker rebalances that happen in a timely fashion. spark.task.maxFailures
>> controls the maximum number of retries before failing the job.  The direct
>> stream isn't any different from any other Spark task in that regard.
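>>
>> For example (a hypothetical submit line; the retry count of 8 is an
>> arbitrary illustration, not a recommendation):
>>
>>     spark-submit --conf spark.task.maxFailures=8 ...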
>>
>> The question of what kind of monitoring you need is more a question for
>> your particular infrastructure and what you're already using for
>> monitoring.  We put all metrics (application level or system level) into
>> Graphite and alert from there.
>>
>> I will say that if your Kafka cluster is regularly falling over for half
>> an hour, I'd look at fixing that before worrying about Spark
>> monitoring...
>>
>>
>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> How can Kafka Direct recover automatically when there is a problem with
>>> the Kafka brokers? Sometimes our Kafka brokers get messed up and the
>>> entire streaming job blows up, unlike some other consumers, which do
>>> recover automatically. How can I make sure that Kafka Direct recovers
>>> automatically when a broker fails for some time, say 30 minutes? What
>>> kind of monitors should be in place to recover the job?
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
