Thanks for the update!
> On 25. Jan 2018, at 04:12, Ashish Pokharel <ashish...@yahoo.com> wrote:
>
> FYI,
>
> I think I have gotten to the bottom of this situation. For anyone who might
> be in the same situation, hopefully my observations will help.
>
> In my case, it had nothing to do with the Flink restart strategy; it was
> doing its thing as expected. The real issue was the Kafka producer timeout
> counters. As I mentioned in another thread, we have a capacity issue with our
> Kafka cluster that ends up causing timeouts in our Flink applications (we do
> have throttling in place in Kafka to manage it better, but unfortunately we
> still run into timeouts pretty often right now).
>
> We had set our Kafka producer retries to 10. It seems that this retry counter
> never gets reset, so once an app hit 10 timeouts over its lifetime, it could
> no longer start and went to a Failed state. I have yet to dig into whether
> this can be solved from the Flink Kafka wrapper. For now we have set the
> retries to 0, and hopefully this situation will not happen again.
>
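For readers who want to try the same workaround, here is a minimal sketch of the producer properties described above. It uses only java.util.Properties; "retries" and "bootstrap.servers" are standard Kafka producer settings, while the class name, the helper method, and the broker address are hypothetical and only there for illustration. I am assuming the resulting Properties object is what gets handed to the Flink Kafka producer.

```java
import java.util.Properties;

// Sketch only: builds the producer properties described in the mail above.
// ProducerRetryConfig and producerProperties are hypothetical names.
class ProducerRetryConfig {

    // brokerList is supplied by the caller; no real cluster is assumed here.
    static Properties producerProperties(String brokerList, int retries) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokerList);
        // "retries" is the Kafka producer setting discussed above (was 10, now 0).
        props.setProperty("retries", Integer.toString(retries));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address, for illustration only.
        Properties props = producerProperties("broker-1:9092", 0);
        System.out.println("retries = " + props.getProperty("retries"));
    }
}
```

With retries at 0 the producer does not retry a failed send at all, so a timeout presumably surfaces as a failure right away and is left to the Flink restart strategy to handle.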
> If anyone has similar observations, please feel free to share.
>
> Thanks, Ashish
>
>> On Jan 19, 2018, at 2:43 PM, ashish pok <ashish...@yahoo.com> wrote:
>>
>> Team,
>>
>> Hopefully, this is a quick one.
>>
>> We have set up the restart strategy as follows in pretty much all of our
>> apps:
>>
>> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10,
>> Time.of(30, TimeUnit.SECONDS)));
>>
>> This seems pretty straightforward: the app should retry starting up to 10
>> times, 30 seconds apart, so for about 5 minutes. Either we are not
>> understanding this or the behavior is inconsistent. Some of the applications
>> restart and come back fine on issues like Kafka timeouts (which I will come
>> back to later), but in some cases the same issues pretty much shut the app
>> down.
>>
>> My first guess was that the total count of 10 is not reset after the app
>> recovers normally. Is there a need to manually reset the counter in an app?
>> I doubt Flink would treat it as a counter that spans the life of an app
>> instead of resetting it on successful start-up, but I am not sure how else
>> to explain the behavior.
>>
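To make the two readings of the counter concrete, here is a small toy model. It is not Flink's implementation and makes no claim about what Flink actually does; RestartBudgetModel and all of its methods are hypothetical names. It compares a budget of 10 attempts that spans the life of the app with one that is reset after each successful recovery, and it also reproduces the "about 5 minutes" figure (10 attempts, 30 seconds apart).

```java
import java.time.Duration;

// Toy model of a restart budget. This is NOT Flink's implementation;
// it only illustrates the two hypotheses discussed in the mail above.
class RestartBudgetModel {
    private final int maxAttempts;
    private final boolean resetOnRecovery;
    private int attemptsUsed = 0;

    RestartBudgetModel(int maxAttempts, boolean resetOnRecovery) {
        this.maxAttempts = maxAttempts;
        this.resetOnRecovery = resetOnRecovery;
    }

    // Called on each failure; returns true if another restart is allowed.
    boolean onFailure() {
        if (attemptsUsed >= maxAttempts) {
            return false;
        }
        attemptsUsed++;
        return true;
    }

    // Called once the app is running normally again.
    void onRecovered() {
        if (resetOnRecovery) {
            attemptsUsed = 0;
        }
    }

    // Total waiting time between attempts if every attempt is used up.
    static Duration worstCaseDelay(int maxAttempts, Duration delay) {
        return delay.multipliedBy(maxAttempts);
    }

    // Feeds isolated failures, each followed by a successful recovery,
    // and returns how many of them the model survived.
    static int survivedFailures(RestartBudgetModel model, int failures) {
        int survived = 0;
        for (int i = 0; i < failures; i++) {
            if (!model.onFailure()) {
                break;
            }
            model.onRecovered();
            survived++;
        }
        return survived;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseDelay(10, Duration.ofSeconds(30)).toMinutes()); // 5
        System.out.println(survivedFailures(new RestartBudgetModel(10, false), 25)); // 10
        System.out.println(survivedFailures(new RestartBudgetModel(10, true), 25));  // 25
    }
}
```

Under the lifetime reading, an app that sees occasional timeouts eventually exhausts its budget even though each individual recovery succeeded, which would match the inconsistent behavior described above. Under the reset reading, it keeps running indefinitely.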
>> Along the same lines, what actually constitutes a "restart"? Our Kafka
>> cluster has known performance bottlenecks during certain times of day that
>> we are working to resolve. I notice Kafka producer timeouts quite a few
>> times during these periods. When an app hits these timeouts, it does recover
>> fine, but I don't necessarily see the entire application restarting, as I
>> don't see the bootstrap logs of my app. Does something like this also count
>> as a restart from the restart strategy's perspective, as opposed to things
>> like app crashes or Yarn killing the application, where the app is actually
>> restarted from scratch?
>>
>> We really like Flink; we just need to hash out these operational issues to
>> make it ready for prime time for all the streaming apps in our cluster.
>>
>> Thanks,
>>
>> Ashish
>