Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
Hi Cody,

What if the offsets that are tracked are not present in Kafka? How do I
skip those offsets and go to the next offset? Also, would specifying
rebalance.backoff.ms be of any help?

Thanks,
Swetha

On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger <c...@koeninger.org> wrote:

> To be blunt, if you care about being able to recover from weird
> situations, you should be tracking offsets yourself and specifying offsets
> on job start, not relying on checkpoints.
>
> On Tue, Nov 10, 2015 at 3:54 AM, Adrian Tanase <atan...@adobe.com> wrote:
>
>> I’ve seen this before during an extreme outage on the cluster, where the
>> kafka offsets checkpointed by the directstreamRdd were bigger than what
>> kafka reported. The checkpoint was therefore corrupted.
>> I don’t know the root cause but since I was stressing the cluster during
>> a reliability test I can only assume that one of the Kafka partitions was
>> restored from an out-of-sync replica and did not contain all the data.
>> Seems extreme but I don’t have another idea.
>>
>> @Cody – do you know of a way to recover from a situation like this? Can
>> someone manually delete folders from the checkpoint folder to help the job
>> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>>
>> -adrian
>>
>> From: swetha kasireddy
>> Date: Monday, November 9, 2015 at 10:40 PM
>> To: Cody Koeninger
>> Cc: "user@spark.apache.org"
>> Subject: Re: Kafka Direct does not recover automatically when the Kafka
>> Stream gets messed up?
>>
>> OK. But, one thing that I observed is that when there is a problem with
>> Kafka Stream, unless I delete the checkpoint directory the Streaming job
>> does not restart. I guess it tries to retry the failed tasks and if it's
>> not able to recover, it fails again. Sometimes, it fails with a
>> StackOverflowError.
>>
>> Why does the Streaming job not restart from checkpoint directory when the
>> job failed earlier with Kafka Brokers getting messed up? We have the
>> checkpoint directory in our hdfs.
>>
>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't think deleting the checkpoint directory is a good way to restart
>>> the streaming job, you should stop the spark context or at the very least
>>> kill the driver process, then restart.
>>>
>>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Our job is our failsafe as we don't have Control over Kafka Stream as
>>>> of now. Can setting rebalance max retries help? We do not have any monitors
>>>> setup as of now. We need to setup the monitors.
>>>>
>>>> My idea is to have some kind of Cron job that queries the Streaming
>>>> API for monitoring like every 5 minutes and then send an email alert and
>>>> automatically restart the Streaming job by deleting the Checkpoint
>>>> directory. Would that help?
>>>>
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> The direct stream will fail the task if there is a problem with the
>>>>> kafka broker.  Spark will retry failed tasks automatically, which should
>>>>> handle broker rebalances that happen in a timely fashion.
>>>>> spark.task.maxFailures controls the maximum number of retries before
>>>>> failing
>>>>> the job.  Direct stream isn't any different from any other spark task in
>>>>> that regard.
>>>>>
>>>>> The question of what kind of monitoring you need is more a question
>>>>> for your particular infrastructure and what you're already using for
>>>>> monitoring.  We put all metrics (application level or system level) into
>>>>> graphite and alert from there.
>>>>>
>>>>> I will say that if you've regularly got problems with kafka falling
>>>>> over for half an hour, I'd look at fixing that before worrying about spark
>>>>> monitoring...
>>>>>
>>>>>
>>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How to recover Kafka Direct automatically when there is a problem
>>>

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread Cody Koeninger
You'd need to get the earliest or latest available offsets from kafka,
whichever is most appropriate for your situation.

The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can
try adjusting that to get a longer sleep before retrying the task.
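
For reference, a minimal Scala sketch of that approach for the Spark 1.x
direct API (Kafka 0.8): when no explicit fromOffsets are passed,
"auto.offset.reset" decides whether the stream falls back to the earliest
("smallest") or latest ("largest") offsets Kafka still has, and
refresh.leader.backoff.ms can be passed in the same map. The broker list
below is a placeholder, not anything from this thread.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamFromEarliest {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("direct-stream-earliest"), Seconds(10))

        val kafkaParams = Map[String, String](
          "metadata.broker.list" -> "broker1:9092,broker2:9092", // placeholder brokers
          // start from the earliest offsets Kafka still has when no
          // fromOffsets are supplied; use "largest" for the latest instead
          "auto.offset.reset" -> "smallest",
          // longer sleep before the KafkaRDD retries after losing a leader
          "refresh.leader.backoff.ms" -> "5000")

        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("test_stream"))

        stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))

        ssc.start()
        ssc.awaitTermination()
      }
    }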

On Mon, Nov 30, 2015 at 1:50 PM, swetha kasireddy <swethakasire...@gmail.com
> wrote:

> Hi Cody,
>
> What if the offsets that are tracked are not present in Kafka? How do I
> skip those offsets and go to the next Offset? Also would specifying
> rebalance.backoff.ms be of any help?
>
> Thanks,
> Swetha
>
> On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> To be blunt, if you care about being able to recover from weird
>> situations, you should be tracking offsets yourself and specifying offsets
>> on job start, not relying on checkpoints.
>>
>> On Tue, Nov 10, 2015 at 3:54 AM, Adrian Tanase <atan...@adobe.com> wrote:
>>
>>> I’ve seen this before during an extreme outage on the cluster, where the
>>> kafka offsets checkpointed by the directstreamRdd were bigger than what
>>> kafka reported. The checkpoint was therefore corrupted.
>>> I don’t know the root cause but since I was stressing the cluster during
>>> a reliability test I can only assume that one of the Kafka partitions was
>>> restored from an out-of-sync replica and did not contain all the data.
>>> Seems extreme but I don’t have another idea.
>>>
>>> @Cody – do you know of a way to recover from a situation like this? Can
>>> someone manually delete folders from the checkpoint folder to help the job
>>> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>>>
>>> -adrian
>>>
>>> From: swetha kasireddy
>>> Date: Monday, November 9, 2015 at 10:40 PM
>>> To: Cody Koeninger
>>> Cc: "user@spark.apache.org"
>>> Subject: Re: Kafka Direct does not recover automatically when the Kafka
>>> Stream gets messed up?
>>>
>>> OK. But, one thing that I observed is that when there is a problem with
>>> Kafka Stream, unless I delete the checkpoint directory the Streaming job
>>> does not restart. I guess it tries to retry the failed tasks and if it's
>>> not able to recover, it fails again. Sometimes, it fails with a
>>> StackOverflowError.
>>>
>>> Why does the Streaming job not restart from checkpoint directory when
>>> the job failed earlier with Kafka Brokers getting messed up? We have the
>>> checkpoint directory in our hdfs.
>>>
>>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> I don't think deleting the checkpoint directory is a good way to
>>>> restart the streaming job, you should stop the spark context or at the very
>>>> least kill the driver process, then restart.
>>>>
>>>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>>>> swethakasire...@gmail.com> wrote:
>>>>
>>>>> Hi Cody,
>>>>>
>>>>> Our job is our failsafe as we don't have Control over Kafka Stream as
>>>>> of now. Can setting rebalance max retries help? We do not have any 
>>>>> monitors
>>>>> setup as of now. We need to setup the monitors.
>>>>>
>>>>> My idea is to have some kind of Cron job that queries the Streaming
>>>>> API for monitoring like every 5 minutes and then send an email alert and
>>>>> automatically restart the Streaming job by deleting the Checkpoint
>>>>> directory. Would that help?
>>>>>
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> The direct stream will fail the task if there is a problem with the
>>>>>> kafka broker.  Spark will retry failed tasks automatically, which should
>>>>>> handle broker rebalances that happen in a timely fashion.
>>>>>> spark.task.maxFailures controls the maximum number of retries before
>>>>>> failing
>>>>>> the job.  Direct stream isn't any different from any other spark task in
>>>>>> that regard.
>>>>>>
>>>>>> The question of what kind of monitoring you need is more a question
>>>>>> for your particular infrastructure and what you're already 

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
So, our Streaming job fails with the following errors. As you can see from
the errors below, they are all related to Kafka losing offsets and
OffsetOutOfRangeException.

What are the options we have other than fixing Kafka? We would like to do
something like the following. How can we achieve 1 and 2 with Spark Kafka
Direct?

1. We need a way to skip some offsets if they are not available after the
max retries are reached; in that case there might be data loss (see the
sketch after this list).

2. Catch that exception and somehow force things to "reset" for that
partition. And how would it handle the offsets already calculated in the
backlog (if there is one)?

3. Track the offsets separately, and restart the job by providing the offsets.

4. Or, a straightforward approach would be to monitor the log for this error,
and if it occurs more than X times, kill the job, remove the checkpoint
directory, and restart.
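
For 1., a rough sketch of how the earliest offset still available for a
partition can be fetched through Kafka 0.8's SimpleConsumer API and used to
clamp offsets you track yourself before restarting. The broker host/port,
client id and the clamp helper are illustrative assumptions, not something
the direct stream does for you:

    import java.util.{HashMap => JHashMap}

    import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
    import kafka.common.TopicAndPartition
    import kafka.javaapi.consumer.SimpleConsumer

    object EarliestOffsets {
      // Earliest offset Kafka still has for one partition. The broker asked
      // is assumed to be (or to proxy for) the partition leader.
      def earliestOffset(host: String, port: Int, topic: String, partition: Int): Long = {
        val consumer = new SimpleConsumer(host, port, 100000, 64 * 1024, "offset-lookup")
        try {
          val tp = TopicAndPartition(topic, partition)
          val requestInfo = new JHashMap[TopicAndPartition, PartitionOffsetRequestInfo]()
          requestInfo.put(tp, PartitionOffsetRequestInfo(OffsetRequest.EarliestTime, 1))
          val request = new kafka.javaapi.OffsetRequest(
            requestInfo, OffsetRequest.CurrentVersion, "offset-lookup")
          consumer.getOffsetsBefore(request).offsets(topic, partition).head
        } finally {
          consumer.close()
        }
      }

      // Clamp externally tracked offsets so the restarted job never asks for
      // data Kafka no longer has (accepting the data loss mentioned in 1.).
      def clamp(tracked: Map[TopicAndPartition, Long],
                host: String, port: Int): Map[TopicAndPartition, Long] =
        tracked.map { case (tp, offset) =>
          tp -> math.max(offset, earliestOffset(host, port, tp.topic, tp.partition))
        }
    }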

ERROR DirectKafkaInputDStream: ArrayBuffer(kafka.common.UnknownException,
org.apache.spark.SparkException: Couldn't find leader offsets for
Set([test_stream,5]))


java.lang.ClassNotFoundException:
kafka.common.NotLeaderForPartitionException

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)


java.util.concurrent.RejectedExecutionException: Task
org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@a48c5a8
rejected from java.util.concurrent.ThreadPoolExecutor@543258e0[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
12112]


org.apache.spark.SparkException: Job aborted due to stage failure: Task 10
in stage 52.0 failed 4 times, most recent failure: Lost task 10.3 in stage
52.0 (TID 255, 172.16.97.97): UnknownReason

Exception in thread "streaming-job-executor-0" java.lang.Error:
java.lang.InterruptedException

Caused by: java.lang.InterruptedException

java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)


org.apache.spark.SparkException: Job aborted due to stage failure: Task 7
in stage 33.0 failed 4 times, most recent failure: Lost task 7.3 in stage
33.0 (TID 283, 172.16.97.103): UnknownReason

java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)

java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)








On Mon, Nov 30, 2015 at 12:23 PM, Cody Koeninger <c...@koeninger.org> wrote:

> You'd need to get the earliest or latest available offsets from kafka,
> whichever is most appropriate for your situation.
>
> The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can
> try adjusting that to get a longer sleep before retrying the task.
>
> On Mon, Nov 30, 2015 at 1:50 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> What if the offsets that are tracked are not present in Kafka? How do I
>> skip those offsets and go to the next Offset? Also would specifying
>> rebalance.backoff.ms be of any help?
>>
>> Thanks,
>> Swetha
>>
>> On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> To be blunt, if you care about being able to recover from weird
>>> situations, you should be tracking offsets yourself and specifying offsets
>>> on job start, not relying on checkpoints.
>>>
>>> On Tue, Nov 10, 2015 at 3:54 AM, Adrian Tanase <atan...@adobe.com>
>>> wrote:
>>>
>>>> I’ve seen this before during an extreme outage on the cluster, where
>>>> the kafka offsets checkpointed by the directstreamRdd were bigger than what
>>>> kafka reported. The checkpoint was therefore corrupted.
>>>> I don’t know the root cause but since I was stressing the cluster
>>>> during a reliability test I can only assume that one of the Kafka
>>>> partitions was restored from an out-of-sync replica and did not contain all
>>>> the data. Seems extreme but I don’t have another idea.
>>>>
>>>> @Cody – do you know of a way to recover from a situation like this? Can
>>>> someone manually delete folders from the checkpoint folder to help the job
>>>> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>>>>
>>>> -adrian
>>>>
>>>> From: swetha kasireddy
>>>> Date: Monday, November 9, 2015 at 10:40 PM
>>>> To: Cody Koeninger
>>>> Cc: "user@spark.apache.org"
>>>> Subject: Re: Kafka Direct does not recover automatically when the
>>>> Kafka Stream gets messed up?
>>>>
>>>

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-12 Thread Cody Koeninger
To be blunt, if you care about being able to recover from weird situations,
you should be tracking offsets yourself and specifying offsets on job
start, not relying on checkpoints.
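
A minimal sketch of that pattern for the Spark 1.x direct API, assuming
Kafka 0.8 and String keys/values; loadOffsets/saveOffsets are placeholders
for whatever external store you pick (ZooKeeper, HBase, an RDBMS, ...):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    object TrackOffsetsYourself {
      // Placeholders: load/save per-partition offsets in your own store.
      def loadOffsets(): Map[TopicAndPartition, Long] = ???
      def saveOffsets(offsets: Map[TopicAndPartition, Long]): Unit = ???

      def createStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
        // Start exactly where the previous run left off, not from a checkpoint.
        val fromOffsets = loadOffsets()
        val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder, (String, String)](
          ssc, kafkaParams, fromOffsets, handler)

        stream.foreachRDD { rdd =>
          // The exact offsets this batch covers; persist them only after the
          // batch's output has been written, so a restart resumes from here.
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          // ... process rdd and write output ...
          saveOffsets(ranges.map(r =>
            TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap)
        }
        stream
      }
    }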

On Tue, Nov 10, 2015 at 3:54 AM, Adrian Tanase <atan...@adobe.com> wrote:

> I’ve seen this before during an extreme outage on the cluster, where the
> kafka offsets checkpointed by the directstreamRdd were bigger than what
> kafka reported. The checkpoint was therefore corrupted.
> I don’t know the root cause but since I was stressing the cluster during a
> reliability test I can only assume that one of the Kafka partitions was
> restored from an out-of-sync replica and did not contain all the data.
> Seems extreme but I don’t have another idea.
>
> @Cody – do you know of a way to recover from a situation like this? Can
> someone manually delete folders from the checkpoint folder to help the job
> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>
> -adrian
>
> From: swetha kasireddy
> Date: Monday, November 9, 2015 at 10:40 PM
> To: Cody Koeninger
> Cc: "user@spark.apache.org"
> Subject: Re: Kafka Direct does not recover automatically when the Kafka
> Stream gets messed up?
>
> OK. But, one thing that I observed is that when there is a problem with
> Kafka Stream, unless I delete the checkpoint directory the Streaming job
> does not restart. I guess it tries to retry the failed tasks and if it's
> not able to recover, it fails again. Sometimes, it fails with a
> StackOverflowError.
>
> Why does the Streaming job not restart from checkpoint directory when the
> job failed earlier with Kafka Brokers getting messed up? We have the
> checkpoint directory in our hdfs.
>
> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> I don't think deleting the checkpoint directory is a good way to restart
>> the streaming job, you should stop the spark context or at the very least
>> kill the driver process, then restart.
>>
>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Hi Cody,
>>>
>>> Our job is our failsafe as we don't have Control over Kafka Stream as of
>>> now. Can setting rebalance max retries help? We do not have any monitors
>>> setup as of now. We need to setup the monitors.
>>>
>>> My idea is to have some kind of Cron job that queries the Streaming
>>> API for monitoring like every 5 minutes and then send an email alert and
>>> automatically restart the Streaming job by deleting the Checkpoint
>>> directory. Would that help?
>>>
>>>
>>>
>>> Thanks!
>>>
>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> The direct stream will fail the task if there is a problem with the
>>>> kafka broker.  Spark will retry failed tasks automatically, which should
>>>> handle broker rebalances that happen in a timely fashion.
>>>> spark.task.maxFailures controls the maximum number of retries before failing
>>>> the job.  Direct stream isn't any different from any other spark task in
>>>> that regard.
>>>>
>>>> The question of what kind of monitoring you need is more a question for
>>>> your particular infrastructure and what you're already using for
>>>> monitoring.  We put all metrics (application level or system level) into
>>>> graphite and alert from there.
>>>>
>>>> I will say that if you've regularly got problems with kafka falling
>>>> over for half an hour, I'd look at fixing that before worrying about spark
>>>> monitoring...
>>>>
>>>>
>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How to recover Kafka Direct automatically when there is a problem
>>>>> with
>>>>> Kafka brokers? Sometimes our Kafka Brokers gets messed up and the
>>>>> entire
>>>>> Streaming job blows up unlike some other consumers which do recover
>>>>> automatically. How can I make sure that Kafka Direct recovers
>>>>> automatically
>>>>> when the broker fails for sometime say 30 minutes? What kind of
>>>>> monitors
>>>>> should be in place to recover the job?
>>>>>
>>>>> Thanks,
>>>>> Swetha
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
Can you be a bit more specific about what “blow up” means? Also what do you 
mean by “messed up” brokers? Imbalance? Broker(s) dead?

We’re also using the direct consumer and so far nothing dramatic happened:
- on READ it automatically reads from backups if leader is dead (machine gone)
- on READ, if there is a huge imbalance (partitions/leaders), the job might slow
down if you don’t have enough cores on the machine with many partitions
- on WRITE - we’ve seen a weird delay of ~7 seconds that I don’t know how to 
re-configure, there’s a timeout that delays the job but it eventually writes 
data to a replica
- it only died when there are no more brokers left and there are partitions 
without a leader. This happened when almost half the cluster was dead during a 
reliability test

Regardless, I would look at the source and try to monitor the kafka cluster for 
things like partitions without leaders or big imbalances.

Hope this helps,
-adrian





On 11/9/15, 8:26 PM, "swetha" <swethakasire...@gmail.com> wrote:

>Hi,
>
>How to recover Kafka Direct automatically when there is a problem with
>Kafka brokers? Sometimes our Kafka Brokers gets messed up and the entire
>Streaming job blows up unlike some other consumers which do recover
>automatically. How can I make sure that Kafka Direct recovers automatically
>when the broker fails for sometime say 30 minutes? What kind of monitors
>should be in place to recover the job?
>
>Thanks,
>Swetha 
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
I’ve seen this before during an extreme outage on the cluster, where the kafka 
offsets checkpointed by the directstreamRdd were bigger than what kafka 
reported. The checkpoint was therefore corrupted.
I don’t know the root cause but since I was stressing the cluster during a 
reliability test I can only assume that one of the Kafka partitions was 
restored from an out-of-sync replica and did not contain all the data. Seems 
extreme but I don’t have another idea.

@Cody – do you know of a way to recover from a situation like this? Can someone 
manually delete folders from the checkpoint folder to help the job recover? 
E.g. Go 2 steps back, hoping that kafka has those offsets.

-adrian

From: swetha kasireddy
Date: Monday, November 9, 2015 at 10:40 PM
To: Cody Koeninger
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Kafka Direct does not recover automatically when the Kafka Stream 
gets messed up?

OK. But, one thing that I observed is that when there is a problem with Kafka 
Stream, unless I delete the checkpoint directory the Streaming job does not 
restart. I guess it tries to retry the failed tasks and if it's not able to 
recover, it fails again. Sometimes, it fails with a StackOverflowError.

Why does the Streaming job not restart from checkpoint directory when the job 
failed earlier with Kafka Brokers getting messed up? We have the checkpoint 
directory in our hdfs.

On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger 
<c...@koeninger.org<mailto:c...@koeninger.org>> wrote:
I don't think deleting the checkpoint directory is a good way to restart the 
streaming job, you should stop the spark context or at the very least kill the 
driver process, then restart.

On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy 
<swethakasire...@gmail.com<mailto:swethakasire...@gmail.com>> wrote:
Hi Cody,

Our job is our failsafe as we don't have Control over Kafka Stream as of now. 
Can setting rebalance max retries help? We do not have any monitors setup as of 
now. We need to setup the monitors.

My idea is to have some kind of Cron job that queries the Streaming API for 
monitoring like every 5 minutes and then send an email alert and automatically 
restart the Streaming job by deleting the Checkpoint directory. Would that help?



Thanks!

On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger 
<c...@koeninger.org<mailto:c...@koeninger.org>> wrote:
The direct stream will fail the task if there is a problem with the kafka 
broker.  Spark will retry failed tasks automatically, which should handle 
broker rebalances that happen in a timely fashion. spark.task.maxFailures 
controls the maximum number of retries before failing the job.  Direct stream 
isn't any different from any other spark task in that regard.

The question of what kind of monitoring you need is more a question for your 
particular infrastructure and what you're already using for monitoring.  We put 
all metrics (application level or system level) into graphite and alert from 
there.

I will say that if you've regularly got problems with kafka falling over for 
half an hour, I'd look at fixing that before worrying about spark monitoring...


On Mon, Nov 9, 2015 at 12:26 PM, swetha 
<swethakasire...@gmail.com<mailto:swethakasire...@gmail.com>> wrote:
Hi,

How to recover Kafka Direct automatically when there is a problem with
Kafka brokers? Sometimes our Kafka Brokers gets messed up and the entire
Streaming job blows up unlike some other consumers which do recover
automatically. How can I make sure that Kafka Direct recovers automatically
when the broker fails for sometime say 30 minutes? What kind of monitors
should be in place to recover the job?

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail: 
user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>







Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread Cody Koeninger
The direct stream will fail the task if there is a problem with the kafka
broker.  Spark will retry failed tasks automatically, which should handle
broker rebalances that happen in a timely fashion. spark.task.maxFailures
controls the maximum number of retries before failing the job.  Direct
stream isn't any different from any other spark task in that regard.

The question of what kind of monitoring you need is more a question for
your particular infrastructure and what you're already using for
monitoring.  We put all metrics (application level or system level) into
graphite and alert from there.

I will say that if you've regularly got problems with kafka falling over
for half an hour, I'd look at fixing that before worrying about spark
monitoring...
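
For reference, the retry budget mentioned above is an ordinary Spark setting;
a one-line sketch (the value 8 is only an example, the default is 4):

    import org.apache.spark.SparkConf

    // Give flaky brokers more room to come back before the job is failed.
    val conf = new SparkConf()
      .setAppName("kafka-direct-job")
      .set("spark.task.maxFailures", "8")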


On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com> wrote:

> Hi,
>
> How to recover Kafka Direct automatically when there is a problem with
> Kafka brokers? Sometimes our Kafka Brokers gets messed up and the entire
> Streaming job blows up unlike some other consumers which do recover
> automatically. How can I make sure that Kafka Direct recovers automatically
> when the broker fails for sometime say 30 minutes? What kind of monitors
> should be in place to recover the job?
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread Cody Koeninger
I don't think deleting the checkpoint directory is a good way to restart
the streaming job, you should stop the spark context or at the very least
kill the driver process, then restart.

On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:

> Hi Cody,
>
> Our job is our failsafe as we don't have Control over Kafka Stream as of
> now. Can setting rebalance max retries help? We do not have any monitors
> setup as of now. We need to setup the monitors.
>
> My idea is to have some kind of Cron job that queries the Streaming API
> for monitoring like every 5 minutes and then send an email alert and
> automatically restart the Streaming job by deleting the Checkpoint
> directory. Would that help?
>
>
>
> Thanks!
>
> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> The direct stream will fail the task if there is a problem with the kafka
>> broker.  Spark will retry failed tasks automatically, which should handle
>> broker rebalances that happen in a timely fashion. spark.task.maxFailures
>> controls the maximum number of retries before failing the job.  Direct
>> stream isn't any different from any other spark task in that regard.
>>
>> The question of what kind of monitoring you need is more a question for
>> your particular infrastructure and what you're already using for
>> monitoring.  We put all metrics (application level or system level) into
>> graphite and alert from there.
>>
>> I will say that if you've regularly got problems with kafka falling over
>> for half an hour, I'd look at fixing that before worrying about spark
>> monitoring...
>>
>>
>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> How to recover Kafka Direct automatically when there is a problem
>>> with
>>> Kafka brokers? Sometimes our Kafka Brokers gets messed up and the entire
>>> Streaming job blows up unlike some other consumers which do recover
>>> automatically. How can I make sure that Kafka Direct recovers
>>> automatically
>>> when the broker fails for sometime say 30 minutes? What kind of monitors
>>> should be in place to recover the job?
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
OK. But, one thing that I observed is that when there is a problem with
Kafka Stream, unless I delete the checkpoint directory the Streaming job
does not restart. I guess it tries to retry the failed tasks and if it's
not able to recover, it fails again. Sometimes, it fails with a
StackOverflowError.

Why does the Streaming job not restart from checkpoint directory when the
job failed earlier with Kafka Brokers getting messed up? We have the
checkpoint directory in our HDFS.

On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org> wrote:

> I don't think deleting the checkpoint directory is a good way to restart
> the streaming job, you should stop the spark context or at the very least
> kill the driver process, then restart.
>
> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> Our job is our failsafe as we don't have Control over Kafka Stream as of
>> now. Can setting rebalance max retries help? We do not have any monitors
>> setup as of now. We need to setup the monitors.
>>
>> My idea is to have some kind of Cron job that queries the Streaming
>> API for monitoring like every 5 minutes and then send an email alert and
>> automatically restart the Streaming job by deleting the Checkpoint
>> directory. Would that help?
>>
>>
>>
>> Thanks!
>>
>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> The direct stream will fail the task if there is a problem with the
>>> kafka broker.  Spark will retry failed tasks automatically, which should
>>> handle broker rebalances that happen in a timely fashion.
>>> spark.task.maxFailures controls the maximum number of retries before failing
>>> the job.  Direct stream isn't any different from any other spark task in
>>> that regard.
>>>
>>> The question of what kind of monitoring you need is more a question for
>>> your particular infrastructure and what you're already using for
>>> monitoring.  We put all metrics (application level or system level) into
>>> graphite and alert from there.
>>>
>>> I will say that if you've regularly got problems with kafka falling over
>>> for half an hour, I'd look at fixing that before worrying about spark
>>> monitoring...
>>>
>>>
>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> How to recover Kafka Direct automatically when there is a problem
>>>> with
>>>> Kafka brokers? Sometimes our Kafka Brokers gets messed up and the entire
>>>> Streaming job blows up unlike some other consumers which do recover
>>>> automatically. How can I make sure that Kafka Direct recovers
>>>> automatically
>>>> when the broker fails for sometime say 30 minutes? What kind of monitors
>>>> should be in place to recover the job?
>>>>
>>>> Thanks,
>>>> Swetha
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>


Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread Cody Koeninger
Without knowing more about what's being stored in your checkpoint directory
/ what the log output is, it's hard to say.  But either way, just deleting
the checkpoint directory probably isn't sufficient to restart the job...

On Mon, Nov 9, 2015 at 2:40 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:

> OK. But, one thing that I observed is that when there is a problem with
> Kafka Stream, unless I delete the checkpoint directory the Streaming job
> does not restart. I guess it tries to retry the failed tasks and if it's
> not able to recover, it fails again. Sometimes, it fails with a
> StackOverflowError.
>
> Why does the Streaming job not restart from checkpoint directory when the
> job failed earlier with Kafka Brokers getting messed up? We have the
> checkpoint directory in our hdfs.
>
> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> I don't think deleting the checkpoint directory is a good way to restart
>> the streaming job, you should stop the spark context or at the very least
>> kill the driver process, then restart.
>>
>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Hi Cody,
>>>
>>> Our job is our failsafe as we don't have Control over Kafka Stream as of
>>> now. Can setting rebalance max retries help? We do not have any monitors
>>> setup as of now. We need to setup the monitors.
>>>
>>> My idea is to have some kind of Cron job that queries the Streaming
>>> API for monitoring like every 5 minutes and then send an email alert and
>>> automatically restart the Streaming job by deleting the Checkpoint
>>> directory. Would that help?
>>>
>>>
>>>
>>> Thanks!
>>>
>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> The direct stream will fail the task if there is a problem with the
>>>> kafka broker.  Spark will retry failed tasks automatically, which should
>>>> handle broker rebalances that happen in a timely fashion.
>>>> spark.task.maxFailures controls the maximum number of retries before failing
>>>> the job.  Direct stream isn't any different from any other spark task in
>>>> that regard.
>>>>
>>>> The question of what kind of monitoring you need is more a question for
>>>> your particular infrastructure and what you're already using for
>>>> monitoring.  We put all metrics (application level or system level) into
>>>> graphite and alert from there.
>>>>
>>>> I will say that if you've regularly got problems with kafka falling
>>>> over for half an hour, I'd look at fixing that before worrying about spark
>>>> monitoring...
>>>>
>>>>
>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How to recover Kafka Direct automatically when there is a problem
>>>>> with
>>>>> Kafka brokers? Sometimes our Kafka Brokers gets messed up and the
>>>>> entire
>>>>> Streaming job blows up unlike some other consumers which do recover
>>>>> automatically. How can I make sure that Kafka Direct recovers
>>>>> automatically
>>>>> when the broker fails for sometime say 30 minutes? What kind of
>>>>> monitors
>>>>> should be in place to recover the job?
>>>>>
>>>>> Thanks,
>>>>> Swetha
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
I store some metrics and the RDD that is the output of updateStateByKey in
my checkpoint directory. I will retest and check for the error that I get.
But it's mostly the StackOverflowError that I get. So, would increasing the
stack size help?
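
If you do want to experiment with a larger thread stack, a hedged sketch
(the -Xss value is only an example; for the driver the equivalent option
usually has to be passed at submit time, e.g. via --driver-java-options,
since the driver JVM is already running when SparkConf is read):

    import org.apache.spark.SparkConf

    // Example only: run executor threads with a 4 MB stack.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-Xss4m")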

On Mon, Nov 9, 2015 at 12:45 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Without knowing more about what's being stored in your checkpoint
> directory / what the log output is, it's hard to say.  But either way, just
> deleting the checkpoint directory probably isn't sufficient to restart the
> job...
>
> On Mon, Nov 9, 2015 at 2:40 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> OK. But, one thing that I observed is that when there is a problem with
>> Kafka Stream, unless I delete the checkpoint directory the Streaming job
>> does not restart. I guess it tries to retry the failed tasks and if it's
>> not able to recover, it fails again. Sometimes, it fails with a
>> StackOverflowError.
>>
>> Why does the Streaming job not restart from checkpoint directory when the
>> job failed earlier with Kafka Brokers getting messed up? We have the
>> checkpoint directory in our hdfs.
>>
>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't think deleting the checkpoint directory is a good way to restart
>>> the streaming job, you should stop the spark context or at the very least
>>> kill the driver process, then restart.
>>>
>>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Our job is our failsafe as we don't have Control over Kafka Stream as
>>>> of now. Can setting rebalance max retries help? We do not have any monitors
>>>> setup as of now. We need to setup the monitors.
>>>>
>>>> My idea is to have some kind of Cron job that queries the Streaming
>>>> API for monitoring like every 5 minutes and then send an email alert and
>>>> automatically restart the Streaming job by deleting the Checkpoint
>>>> directory. Would that help?
>>>>
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> The direct stream will fail the task if there is a problem with the
>>>>> kafka broker.  Spark will retry failed tasks automatically, which should
>>>>> handle broker rebalances that happen in a timely fashion.
>>>>> spark.task.maxFailures controls the maximum number of retries before
>>>>> failing
>>>>> the job.  Direct stream isn't any different from any other spark task in
>>>>> that regard.
>>>>>
>>>>> The question of what kind of monitoring you need is more a question
>>>>> for your particular infrastructure and what you're already using for
>>>>> monitoring.  We put all metrics (application level or system level) into
>>>>> graphite and alert from there.
>>>>>
>>>>> I will say that if you've regularly got problems with kafka falling
>>>>> over for half an hour, I'd look at fixing that before worrying about spark
>>>>> monitoring...
>>>>>
>>>>>
>>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How to recover Kafka Direct automatically when there is a problem
>>>>>> with
>>>>>> Kafka brokers? Sometimes our Kafka Brokers gets messed up and the
>>>>>> entire
>>>>>> Streaming job blows up unlike some other consumers which do recover
>>>>>> automatically. How can I make sure that Kafka Direct recovers
>>>>>> automatically
>>>>>> when the broker fails for sometime say 30 minutes? What kind of
>>>>>> monitors
>>>>>> should be in place to recover the job?
>>>>>>
>>>>>> Thanks,
>>>>>> Swetha
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>