Zhu Zhu,

Sorry, I was using different terminology. Yes, a Flink Meter is what I had in mind when I talked about using "fullRestarts" for threshold-based alerting.
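
To make the alerting point concrete, here is a minimal plain-Java sketch. It does not use any Flink API; the class name, method names, and sample values are all hypothetical. It compares a threshold applied directly to a cumulative value, as a Gauge would report it, with the same threshold applied to the increase over a rolling window, which is roughly what a rate or derivative gives the alerting system.

```java
/** Sketch only: cumulative threshold vs. rolling-window delta (hypothetical names, not Flink API). */
final class RestartAlertSketch {

    /** Alert evaluated directly on the cumulative value at the given sample index. */
    static boolean cumulativeAlert(long[] samples, int index, long threshold) {
        return samples[index] > threshold;
    }

    /** Alert evaluated on the increase over the last `window` samples. */
    static boolean windowedAlert(long[] samples, int index, int window, long threshold) {
        int start = Math.max(0, index - window); // clamp at the first sample
        long delta = samples[index] - samples[start];
        return delta > threshold;
    }

    public static void main(String[] args) {
        // Hypothetical periodic samples of a cumulative restart count:
        // two full restarts happen between the second and third sample.
        long[] samples = {0, 0, 2, 2, 2, 2, 2, 2};
        for (int i = 0; i < samples.length; i++) {
            System.out.printf("t=%d cumulative=%b windowed=%b%n",
                    i,
                    cumulativeAlert(samples, i, 1),
                    windowedAlert(samples, i, 3, 1));
        }
    }
}
```

With these samples the cumulative check stays true from the third sample onward, while the windowed check clears once the restarts are more than three samples old.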

On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <reed...@gmail.com> wrote:

> Steven,
>
> In my mind, a Flink counter only stores its accumulated count and
> reports that value. Are you using an external counter directly?
> Maybe Flink Meter/MeterView is what you need? It stores the count and
> calculates the rate, and it reports both its "count" and its "rate" to
> external metric services.
>
> The counter "task_failures" only works if the individual failover
> strategy is enabled. However, it is not a public interface and its use
> is not recommended, as fine grained recovery (region failover) now
> supersedes it.
> I've opened a ticket[1] to add a metric that shows failovers and
> respects fine grained recovery.
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu
>
> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> When we set up an alert like "fullRestarts > 1" for some rolling
>> window, we want to use a counter. If it is a Gauge, "fullRestarts"
>> will never go below 1 after the first full restart, so the alert
>> condition will always be true after the first job restart. If we can
>> apply a derivative to the Gauge value, I guess the alert can probably
>> work. I can explore whether that is an option.
>>
>> Yeah. Understood that "fullRestart" won't increment when fine grained
>> recovery happens. I think the "task_failures" counter already exists
>> in Flink.
>>
>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> Steven,
>>>
>>> Thanks for the information. If we can determine that this is a
>>> common issue, we can solve it in Flink core.
>>> To get to that state, I have two questions which need your help:
>>> 1. Why is a gauge not good for alerting? The metric "fullRestart" is
>>> a Gauge<Long>. Does the metric reporter you use report Counter and
>>> Gauge<Long> to external services in different ways? Or can anything
>>> else differ due to the metric type?
>>> 2. Is the "number of restarts" what you actually need, rather than
>>> the "fullRestart" count? If so, I believe we will have such a counter
>>> metric in 1.10, since the previous "fullRestart" metric value is not
>>> the number of restarts when fine grained recovery (a feature added in
>>> 1.9.0) is enabled.
>>> "fullRestart" reveals how many times the entire job graph has been
>>> restarted. If fine grained recovery is enabled, the graph is not
>>> restarted when task failures happen, and the "fullRestart" value does
>>> not increment in such cases.
>>>
>>> I'd appreciate it if you could help with these questions so that we
>>> can make better decisions for Flink.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> Zhu Zhu,
>>>>
>>>> The Flink fullRestart metric is a Gauge, which is not good for
>>>> alerting on. We publish an equivalent Counter metric for alerting
>>>> purposes.
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>
>>>>> Thanks Steven for the feedback!
>>>>> Could you share more information about the metrics you add in
>>>>> your customized restart strategy?
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> We do use a config like "restart-strategy:
>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add
>>>>>> metrics beyond the ones Flink provides.
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks everyone for the input.
>>>>>>>
>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>> interface, as it is not explicitly documented.
>>>>>>> Since the feedback from this survey indicates it is not used,
>>>>>>> I'll conclude that we do not need to support customized
>>>>>>> RestartStrategy for the new scheduler in Flink 1.10.
>>>>>>>
>>>>>>> Other usages are still supported, including all the strategies
>>>>>>> and ways of configuring them described in
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>
>>>>>>> Feel free to share in this thread if you have any concerns
>>>>>>> about it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>
>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>> themselves and use it through a configuration like
>>>>>>>> "restart-strategy: org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>
>>>>>>>> The usage of restart strategies you mentioned will keep
>>>>>>>> working with the new scheduler.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Zhu,
>>>>>>>>>
>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>
>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1), Time.minutes(10)));
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Oytun Tez
>>>>>>>>>
>>>>>>>>> *M O T A W O R D*
>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>> oy...@motaword.com
>>>>>>>>> www.motaword.com
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I wanted to reach out to you and ask how many of you are
>>>>>>>>>> using a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>
>>>>>>>>>> We are currently developing the new Flink scheduler[2],
>>>>>>>>>> which interacts with restart strategies in a different way.
>>>>>>>>>> We have to re-design the interfaces for the new restart
>>>>>>>>>> strategies (so-called RestartBackoffTimeStrategy). Existing
>>>>>>>>>> customized RestartStrategy implementations will no longer
>>>>>>>>>> work with the new scheduler.
>>>>>>>>>>
>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>
>>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>>> using a customized RestartStrategy. That will be valuable
>>>>>>>>>> for us in making decisions.
>>>>>>>>>>
>>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu