Re: Understanding Restart Strategy

Aljoscha Krettek Fri, 26 Jan 2018 02:31:13 -0800
Thanks for the update!

> On 25. Jan 2018, at 04:12, Ashish Pokharel <ashish...@yahoo.com> wrote:
> 
> FYI,
> 
> I think I have gotten to the bottom of this situation. For anyone who might
> be in the same situation, hopefully my observations will help.
> 
> In my case, it had nothing to do with the Flink restart strategy; it was
> doing its thing as expected. The real issue was the Kafka Producer timeout
> counters. As I mentioned in the other thread, we have a capacity issue with
> our Kafka cluster that ends up causing some timeouts in our Flink
> applications (we do have throttling in place in Kafka to manage it better,
> but we still run into timeouts pretty often right now, unfortunately).
> 
> We had set our Kafka Producer retries to 10. It seems like that retry
> counter never gets reset. So over the life of an app, if it hits 10
> timeouts, it basically couldn't start and went to a Failed state. I have yet
> to dig into whether this can be solved from the Flink Kafka wrapper or not.
> But for now we have set the retries to 0, and hopefully this situation will
> not happen again.
> 
> If anyone has any similar observations, please feel free to share.
> 
> Thanks, Ashish
> 
>> On Jan 19, 2018, at 2:43 PM, ashish pok <ashish...@yahoo.com 
>> <mailto:ashish...@yahoo.com>> wrote:
>> 
>> Team,
>> 
>> Hopefully, this is a quick one. 
>> 
>> We have set up the restart strategy as follows in pretty much all of our apps:
>> 
>>     env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));
>> 
>> This seems pretty straightforward. The app should retry starting 10 times,
>> once every 30 seconds, so for about 5 minutes. Either we are not
>> understanding this or it seems inconsistent. Some of the applications
>> restart and come back fine on issues like Kafka timeouts (which I will come
>> back to later), but in some cases the same issues pretty much shut the app
>> down.
>> 
>> My first guess here was that the total count of 10 is not reset after the
>> app has recovered normally. Is there a need to manually reset the counter in
>> an app?
>> I doubt Flink would be treating it like a counter that spans the life of an 
>> App instead of resetting on successful start-up - but not sure how else to 
>> explain the behavior.
>> 
>> Along the same lines, what actually constitutes a "restart"? Our Kafka
>> cluster has known performance bottlenecks during certain times of day that
>> we are working to resolve. I do notice Kafka producer timeouts quite a few
>> times during these periods. When the app hits these timeouts, it does
>> recover fine, but I don't necessarily see the entire application restarting,
>> as I don't see the bootstrap logs of my app. Does something like this count
>> as a restart of the app from the restart strategy's perspective as well, as
>> opposed to things like app crashes or Yarn killing the application, where
>> the app is actually restarted from scratch?
>> 
>> We are really liking Flink, just need to hash out these operational issues 
>> to make it prime time for all streaming apps we have in our cluster.
>> 
>> Thanks,
>> 
>> Ashish
>
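
For readers following this thread, here is a minimal sketch of the two ideas discussed above: the "10 attempts, 30 seconds apart, so about 5 minutes" arithmetic, and the difference between a failure counter that spans the life of an app and one that resets after a successful run. This is a toy model only. It is not Flink or Kafka code, and the class and method names (RetryBudgetSketch, restartWindow, exhaustedAt) are hypothetical, as is the assumption that the budget is exhausted on the first failure beyond the configured count.

```java
import java.time.Duration;

/** Toy model only: not Flink or Kafka code. All names are hypothetical. */
public class RetryBudgetSketch {

    /** Total delay spent if every one of the attempts is used back to back. */
    public static Duration restartWindow(int attempts, Duration delay) {
        return delay.multipliedBy(attempts);
    }

    /**
     * Walks a sequence of events (true = failure such as a timeout,
     * false = a successful run) and returns the index of the failure that
     * exceeds the budget, or -1 if the budget is never exceeded.
     *
     * Assumption: with a budget of N, the (N+1)-th counted failure is fatal.
     */
    public static int exhaustedAt(boolean[] events, int budget, boolean resetOnSuccess) {
        int used = 0;
        for (int i = 0; i < events.length; i++) {
            if (events[i]) {
                used++;
                if (used > budget) {
                    return i; // budget exceeded at this failure
                }
            } else if (resetOnSuccess) {
                used = 0; // a healthy run clears the counter
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 12 isolated failures, each followed by a successful run.
        boolean[] events = new boolean[24];
        for (int i = 0; i < events.length; i += 2) {
            events[i] = true;
        }

        System.out.println("Restart window: "
                + restartWindow(10, Duration.ofSeconds(30)));
        System.out.println("Lifetime counter exhausted at event: "
                + exhaustedAt(events, 10, false));
        System.out.println("Resetting counter exhausted at event: "
                + exhaustedAt(events, 10, true));
    }
}
```

In this model, isolated timeouts that are each followed by a recovery eventually exhaust a lifetime counter but never a resetting one. That matches the symptom described above, where an app that had recovered many times eventually went to a Failed state.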