[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033428#comment-17033428 ]

Gary Yao commented on FLINK-15918:
----------------------------------

Yes, ticket is left open because I need to cherry-pick the changes to master.

> Uptime Metric not reset on Job Restart
> --------------------------------------
>
>                 Key: FLINK-15918
>                 URL: https://issues.apache.org/jira/browse/FLINK-15918
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.2, 1.10.0
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Description*
> The {{uptime}} metric is not reset when the job restarts, which is a change
> in behavior compared to Flink 1.8.
> This change of behavior exists since 1.9.0 if
> {{jobmanager.execution.failover-strategy: region}} is configured,
> which we do in the default flink-conf.yaml.
> *Workarounds*
> Users that find this behavior problematic can set {{jobmanager.scheduler:
> legacy}} and unset {{jobmanager.execution.failover-strategy: region}} in
> their {{flink-conf.yaml}}
> *How to reproduce*
> trivial
> *Expected behavior*
> This is up for discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033427#comment-17033427 ]

Yu Li commented on FLINK-15918:
-------------------------------

Are we leaving this JIRA open to cherry-pick the changes to master branch? [~gjy] Thanks.
[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032672#comment-17032672 ]

Gary Yao commented on FLINK-15918:
----------------------------------

1.10: 1268a7b9a3a3d07f76ea1fe78a0b1a6a7d0ef7eb
[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031002#comment-17031002 ]

Lakshmi Rao commented on FLINK-15918:
-------------------------------------

Just adding to what [~thw] and [~shriya_a] mentioned: our users today are alerted when the uptime metric is zero. If the metric is not reset (so there are no periods where it reads zero while the job is not running), our users may not get alerts when their job is down.

> Uptime Metric not reset on Job Restart
> --------------------------------------
>
>                 Key: FLINK-15918
>                 URL: https://issues.apache.org/jira/browse/FLINK-15918
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.2, 1.10.0
>            Reporter: Gary Yao
>            Priority: Major
>             Fix For: 1.10.1, 1.11.0
>
>
> *Description*
> The {{uptime}} metric is not reset when the job restarts, which is a change
> in behavior compared to Flink 1.8.
> This change of behavior exists since 1.9.0 if
> {{jobmanager.execution.failover-strategy: region}} is configured,
> which we do in the default flink-conf.yaml.
> *Workarounds*
> Users that find this behavior problematic can set {{jobmanager.scheduler:
> legacy}} and unset {{jobmanager.execution.failover-strategy: region}} in
> their {{flink-conf.yaml}}
> *How to reproduce*
> trivial
> *Expected behavior*
> This is up for discussion.
[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030960#comment-17030960 ]

Shriya Arora commented on FLINK-15918:
--------------------------------------

[~trohrmann] We have, in the past, experienced failure scenarios where a job crashed silently and abruptly. By that I mean it did not exhibit other unhealthy symptoms such as failing checkpoints or frequent restarts. In that case, uptime is an important metric to rely on to know whether the job is actually running.
[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030901#comment-17030901 ]

Steven Zhen Wu commented on FLINK-15918:
----------------------------------------

Quoting an answer from one of our users, which is similar to what Thomas said:

"to detect a job that is not running. the reason to use uptime is to catch the case where the job is continually restarting, so it is mostly “up”, but never for a long time"
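
For illustration only, the restart-loop check described in the quoted answer could be sketched as below. This is a minimal Python sketch under stated assumptions: the uptime samples are hypothetical values in milliseconds collected at a fixed interval, and `count_restarts`, `is_restart_looping`, and the `max_restarts` threshold are illustrative names, not part of Flink or of any particular monitoring system. It assumes the Flink 1.8 behavior, in which the uptime value drops whenever the job restarts.

```python
from typing import Sequence


def count_restarts(uptime_ms: Sequence[int]) -> int:
    """Count how often the uptime value drops between consecutive samples."""
    return sum(1 for prev, cur in zip(uptime_ms, uptime_ms[1:]) if cur < prev)


def is_restart_looping(uptime_ms: Sequence[int], max_restarts: int = 1) -> bool:
    """Flag a job whose uptime dropped more than max_restarts times (hypothetical threshold)."""
    return count_restarts(uptime_ms) > max_restarts


# Hypothetical samples: the job restarts twice, and uptime is reset each time.
samples_with_reset = [0, 10_000, 20_000, 0, 10_000, 0, 10_000]

# Hypothetical samples for the same job when uptime is not reset on restart.
samples_without_reset = [0, 10_000, 20_000, 30_000, 40_000, 50_000, 60_000]

looping_with_reset = is_restart_looping(samples_with_reset)
looping_without_reset = is_restart_looping(samples_without_reset)
```

When the metric is not reset on restart, the series never drops, so a check of this kind cannot fire. That is the concern raised in this thread.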
[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030857#comment-17030857 ]

Thomas Weise commented on FLINK-15918:
--------------------------------------

[~trohrmann] our users rely on the metric to monitor their jobs. Continuously increasing uptime is interpreted as necessary for a healthy job. This assumption is broken with the recent change.

> This change of behavior exists since 1.9.0 if
> {{jobmanager.execution.failover-strategy: region}} is configured, which we do
> in the default flink-conf.yaml.

And the change of behavior is not present in 1.9 when the above setting is not present in flink-conf.yaml. It is, therefore, a regression for 1.10.
[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart
    [ https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030776#comment-17030776 ]

Till Rohrmann commented on FLINK-15918:
---------------------------------------

[~stevenz3wu] [~thw] do you use the {{uptime}} metric to monitor Flink jobs? If yes, then it would be helpful to understand how exactly you use it.