You can try running the driver in the cluster manager with --supervise, but
that's basically the same as restarting it when it fails.
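
For reference, that looks roughly like the following; the master URL, class and
jar names here are placeholders, and --supervise only takes effect with cluster
deploy mode (e.g. on a standalone cluster):

  spark-submit --master spark://master:7077 \
    --deploy-mode cluster --supervise \
    --class com.example.StreamingJob streaming-job.jar

That restarts the driver whenever it exits non-zero, nothing more.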

There is no reasonable automatic "recovery" when something is fundamentally
wrong with your kafka cluster.

On Wed, Oct 21, 2015 at 12:46 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:

> Hi Cody,
>
> What other options do I have other than monitoring and restarting the job?
> Can the job recover automatically?
>
> Thanks,
> Sweth
>
> On Thu, Oct 1, 2015 at 7:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Did you check your kafka broker logs to see what was going on during that
>> time?
>>
>> The direct stream will handle normal leader loss / rebalance by retrying
>> tasks.
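>>
>> The knobs involved are spark.task.maxFailures (how many times a failed task
>> is retried) and spark.streaming.kafka.maxRetries (how many times the driver
>> retries the offset lookup when planning a batch); the 200ms sleep in your
>> log should be the kafka consumer setting refresh.leader.backoff.ms (default
>> 200). A minimal sketch, assuming the Spark 1.x spark-streaming-kafka direct
>> API (broker list, topic and batch interval are placeholders):
>>
>> import kafka.serializer.StringDecoder
>> import org.apache.spark.SparkConf
>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>> import org.apache.spark.streaming.kafka.KafkaUtils
>>
>> val conf = new SparkConf()
>>   .setAppName("x_stream-consumer")               // placeholder name
>>   .set("spark.task.maxFailures", "8")            // task-level retries
>>   .set("spark.streaming.kafka.maxRetries", "3")  // driver-side offset lookup retries
>> val ssc = new StreamingContext(conf, Seconds(10))
>>
>> val kafkaParams = Map(
>>   "metadata.broker.list" -> "broker1:9092,broker2:9092",
>>   "refresh.leader.backoff.ms" -> "1000"          // back off longer after a lost leader
>> )
>> val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>>   ssc, kafkaParams, Set("x_stream"))
>>
>> That only helps with transient leader loss, not with the offset problem below.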
>>
>> But the exception you got indicates that something with kafka was wrong,
>> such that offsets were being re-used.
>>
>> i.e. your job had already processed up through beginning offset 15027734702,
>>
>> but when it asked kafka for the highest available offset, kafka returned
>> ending offset 15027725493,
>>
>> which is lower; in other words, kafka lost messages. This might happen
>> because you lost a leader and recovered from a replica that wasn't in sync,
>> or someone manually screwed up a topic, or ... ?
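>>
>> In code terms the check that fails is essentially this (offsets taken from
>> your stack trace below):
>>
>> // what the direct stream asserts for each partition of a batch, in essence
>> val fromOffset  = 15027734702L  // where your job had already processed to
>> val untilOffset = 15027725493L  // latest offset kafka currently reports
>> assert(fromOffset <= untilOffset,
>>   s"Beginning offset $fromOffset is after the ending offset $untilOffset")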
>>
>> If you really want to just blindly "recover" from this situation (even
>> though something is probably wrong with your data), the most
>> straightforward thing to do is monitor and restart your job.
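>>
>> If you want to approximate that from inside the driver rather than with an
>> external monitor, a rough sketch is a restart loop like this (createContext()
>> is a placeholder for your own setup code; without a checkpoint the new
>> context starts from whatever auto.offset.reset gives it, not where it left
>> off):
>>
>> var keepRunning = true
>> while (keepRunning) {
>>   val ssc = createContext()      // builds the SparkContext and DStreams
>>   ssc.start()
>>   try {
>>     ssc.awaitTermination()       // rethrows whatever exception killed the job
>>     keepRunning = false          // clean shutdown was requested
>>   } catch {
>>     case e: Throwable =>
>>       ssc.stop(stopSparkContext = true, stopGracefully = false)
>>       // loop around and build a fresh context with fresh offsets
>>   }
>> }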
>>
>> On Wed, Sep 30, 2015 at 4:31 PM, swetha <swethakasire...@gmail.com>
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> I see this sometimes with the Kafka direct approach in our Streaming job.
>>> How do we make sure that the job recovers from such errors and works
>>> normally thereafter?
>>>
>>> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 19, sleeping for 200ms
>>> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 5, sleeping for 200ms
>>>
>>> Followed by every task failing with something like this:
>>>
>>> 15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0 (TID 818804)
>>> kafka.common.NotLeaderForPartitionException
>>>
>>> And:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
>>> in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
>>> stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
>>> assertion failed: Beginning offset 15027734702 is after the ending offset
>>> 15027725493 for topic hubble_stream partition 12. You either provided an
>>> invalid fromOffset, or the Kafka topic has been damaged
>>>
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>
>
