Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu Tue, 24 Sep 2019 11:31:14 -0700

Zhu Zhu,

Sorry, I was using different terminology. Yes, a Flink Meter is what I was
talking about regarding "fullRestarts" for threshold-based alerting.
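
To make the alerting concern concrete, here is a minimal sketch in plain
Java of the kind of rolling-window check I have in mind. It does not use any
Flink API; the class name RollingRestartAlert, the method observe, and the
sample values are all hypothetical. It assumes the only input is a
cumulative restart count sampled periodically, and it alerts on the increase
within the window rather than on the raw value.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: alert on the increase of a cumulative count within a rolling window. */
final class RollingRestartAlert {
    private final long windowMillis;
    private final long threshold;
    // Retained samples as {timestampMillis, cumulativeCount}, oldest first.
    private final Deque<long[]> samples = new ArrayDeque<>();

    RollingRestartAlert(long windowMillis, long threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /**
     * Records one reading of the cumulative count and returns true when the
     * increase over the retained window is greater than the threshold.
     */
    boolean observe(long timestampMillis, long cumulativeCount) {
        samples.addLast(new long[] {timestampMillis, cumulativeCount});
        // Drop readings that fell out of the window; always keep at least one as the baseline.
        while (samples.size() > 1 && samples.peekFirst()[0] < timestampMillis - windowMillis) {
            samples.removeFirst();
        }
        long increase = cumulativeCount - samples.peekFirst()[1];
        return increase > threshold;
    }

    public static void main(String[] args) {
        // Assumed settings: 5-minute window, alert when more than 1 restart occurs in it.
        RollingRestartAlert alert = new RollingRestartAlert(300_000L, 1L);
        System.out.println(alert.observe(0L, 0L));        // no restarts yet
        System.out.println(alert.observe(60_000L, 1L));   // one restart in the window
        System.out.println(alert.observe(120_000L, 2L));  // two restarts in the window
        System.out.println(alert.observe(600_000L, 2L));  // count unchanged, old readings expired
    }
}
```

A plain "value > 1" check on the cumulative number would stay true at the
last reading above, while the windowed increase drops back to zero once the
older readings expire.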


On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <reed...@gmail.com> wrote:

> Steven,
>
> In my mind, Flink counter only stores its accumulated count and reports
> that value. Are you using an external counter directly?
> Maybe Flink Meter/MeterView is what you need? It stores the count and
> calculates the rate. And it will report its "count" as well as "rate" to
> external metric services.
>
> The counter "task_failures" only works if the individual failover strategy
> is enabled. However, it is not a public interface and is not recommended
> for use, as fine-grained recovery (region failover) now supersedes it.
> I've opened a ticket[1] to add a metric that shows failovers and respects
> fine-grained recovery.
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu
>
> Steven Wu <stevenz...@gmail.com> wrote on Tue, Sep 24, 2019 at 6:41 AM:
>
>>
>> When we set up an alert like "fullRestarts > 1" over some rolling window,
>> we want to use a counter. If it is a Gauge, "fullRestarts" will never go
>> below 1 after the first full restart, so the alert condition will always be
>> true after the first job restart. If we can apply a derivative to the Gauge
>> value, I guess the alert can probably work. I can explore whether that is
>> an option.
>>
>> Yeah. Understood that "fullRestart" won't increment when fine-grained
>> recovery happens. I think the "task_failures" counter already exists in Flink.
>>
>>
>>
>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> Steven,
>>>
>>> Thanks for the information. If we can determine that this is a common
>>> issue, we can solve it in Flink core.
>>> To get to that state, I have two questions which need your help:
>>> 1. Why is a gauge not good for alerting? The metric "fullRestart" is a
>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>> Gauge<Long> to external services in different ways? Or can anything else
>>> differ due to the metric type?
>>> 2. Is the "number of restarts" what you actually need, rather than
>>> the "fullRestart" count? If so, I believe we will have such a counter
>>> metric in 1.10, since the previous "fullRestart" metric value is not the
>>> number of restarts when fine-grained recovery (a feature added in 1.9.0)
>>> is enabled.
>>>     "fullRestart" reveals how many times the entire job graph has been
>>> restarted. If fine-grained recovery is enabled, the graph is not
>>> restarted when task failures happen, and the "fullRestart" value does
>>> not increment in such cases.
>>>
>>> I'd appreciate it if you could help with these questions so that we can
>>> make better decisions for Flink.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> Steven Wu <stevenz...@gmail.com> wrote on Sun, Sep 22, 2019 at 3:31 AM:
>>>
>>>> Zhu Zhu,
>>>>
>>>> The Flink fullRestart metric is a Gauge, which is not good for alerting.
>>>> We publish an equivalent Counter metric for alerting purposes.
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>
>>>>> Thanks Steven for the feedback!
>>>>> Could you share more information about the metrics you added in your
>>>>> customized restart strategy?
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> Steven Wu <stevenz...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>>>>>
>>>>>> We do use a config like "restart-strategy:
>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>>>> beyond the ones Flink provides.
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks everyone for the input.
>>>>>>>
>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>> interface, as it is not explicitly documented.
>>>>>>> Since the feedback from this survey indicates that it is not used,
>>>>>>> I'll conclude that we do not need to support customized
>>>>>>> RestartStrategy for the new scheduler in Flink 1.10.
>>>>>>>
>>>>>>> Other usages are still supported, including all the strategies and
>>>>>>> configuration methods described in
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>> .
>>>>>>>
>>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> Zhu Zhu <reed...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>>>>
>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>
>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>> themselves and use it through a configuration like "restart-strategy:
>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>
>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>> with the new scheduler.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>>>>
>>>>>>>>> Hi Zhu,
>>>>>>>>>
>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>
>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Oytun Tez
>>>>>>>>>
>>>>>>>>> *M O T A W O R D*
>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>>
>>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>>>> redesign the interfaces for the new restart strategies (the
>>>>>>>>>> so-called RestartBackoffTimeStrategy). Existing customized
>>>>>>>>>> RestartStrategy implementations will no longer work with the new
>>>>>>>>>> scheduler.
>>>>>>>>>>
>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>
>>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>>> using a customized RestartStrategy. That will be valuable for us
>>>>>>>>>> in making decisions.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>
