[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-10 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033428#comment-17033428
 ] 

Gary Yao commented on FLINK-15918:
--

Yes, the ticket is left open because I still need to cherry-pick the changes to master.

> Uptime Metric not reset on Job Restart
> --
>
> Key: FLINK-15918
> URL: https://issues.apache.org/jira/browse/FLINK-15918
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Description*
> The {{uptime}} metric is not reset when the job restarts, which is a change 
> in behavior compared to Flink 1.8.
> This change of behavior exists since 1.9.0 if 
> {{jobmanager.execution.failover-strategy: region}} is configured,
> which we do in the default flink-conf.yaml.
> *Workarounds*
> Users that find this behavior problematic can set {{jobmanager.scheduler: 
> legacy}} and unset {{jobmanager.execution.failover-strategy: region}} in 
> their {{flink-conf.yaml}}
> *How to reproduce*
> trivial
> *Expected behavior*
> This is up for discussion. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-10 Thread Yu Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033427#comment-17033427
 ] 

Yu Li commented on FLINK-15918:
---

Are we leaving this JIRA open to cherry-pick the changes to the master branch? 
[~gjy] Thanks.



[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-07 Thread Gary Yao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032672#comment-17032672
 ] 

Gary Yao commented on FLINK-15918:
--

1.10: 1268a7b9a3a3d07f76ea1fe78a0b1a6a7d0ef7eb



[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-05 Thread Lakshmi Rao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031002#comment-17031002
 ] 

Lakshmi Rao commented on FLINK-15918:
-

Just adding to what [~thw] and [~shriya_a] mentioned: our users today get 
alerted when the uptime metric is zero. If the metric is not reset (meaning 
there are no periods where it is zero while the job is not running), our 
users may not get alerted when their job is down. 
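
To illustrate the alerting pattern described above, here is a minimal Python sketch. It assumes uptime samples arrive as (timestamp, uptime) pairs in milliseconds; the sample format, the function name zero_uptime_alerts, and the values are hypothetical and not part of Flink or of any particular monitoring system.

```python
from typing import Iterable, List, Tuple

# Hypothetical sample format: (timestamp_ms, uptime_ms).
Sample = Tuple[int, int]


def zero_uptime_alerts(samples: Iterable[Sample]) -> List[int]:
    """Return the timestamps at which uptime was reported as zero or less.

    Values <= 0 are treated as "job not running" in this sketch.
    """
    return [ts for ts, uptime in samples if uptime <= 0]
```

With samples such as [(0, 5000), (1000, 6000), (2000, 0)] the rule fires at timestamp 2000. If uptime keeps growing across a restart, the returned list stays empty and no alert is raised, which is the concern here.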

> Uptime Metric not reset on Job Restart
> --
>
> Key: FLINK-15918
> URL: https://issues.apache.org/jira/browse/FLINK-15918
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Gary Yao
>Priority: Major
> Fix For: 1.10.1, 1.11.0
>
>
> *Description*
> The {{uptime}} metric is not reset when the job restarts, which is a change 
> in behavior compared to Flink 1.8.
> This change of behavior exists since 1.9.0 if 
> {{jobmanager.execution.failover-strategy: region}} is configured,
> which we do in the default flink-conf.yaml.
> *Workarounds*
> Users that find this behavior problematic can set {{jobmanager.scheduler: 
> legacy}} and unset {{jobmanager.execution.failover-strategy: region}} in 
> their {{flink-conf.yaml}}
> *How to reproduce*
> trivial
> *Expected behavior*
> This is up for discussion. 





[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-05 Thread Shriya Arora (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030960#comment-17030960
 ] 

Shriya Arora commented on FLINK-15918:
--

[~trohrmann] We have, in the past, experienced failure scenarios where a job 
crashed silently and abruptly. By that I mean it did not exhibit other 
unhealthy symptoms such as failing checkpoints or frequent restarts. In that 
case, uptime is an important metric to rely on to know whether the job is 
actually running. 



[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-05 Thread Steven Zhen Wu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030901#comment-17030901
 ] 

Steven Zhen Wu commented on FLINK-15918:


Quoting an answer from one of our users, which is similar to what Thomas said:

"To detect a job that is not running. The reason to use uptime is to catch the 
case where the job is continually restarting, so it is mostly “up”, but never 
for a long time."
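
The case in the quote (a job that is mostly "up" but never for long) can be sketched as a check over a window of uptime readings. This is only an illustration that assumes uptime is reset on restart and is reported in milliseconds; the function name is_restart_looping and the threshold parameter are hypothetical.

```python
from typing import Sequence


def is_restart_looping(uptimes_ms: Sequence[int], min_stable_ms: int) -> bool:
    """Return True if no reading in the window reached min_stable_ms.

    An empty window returns False; missing data should be handled by a
    separate check.
    """
    if not uptimes_ms:
        return False
    return max(uptimes_ms) < min_stable_ms
```

The window should cover more time than min_stable_ms, otherwise a freshly started healthy job would also be flagged. If uptime is not reset on restart, the maximum keeps growing and this check never triggers.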



[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-05 Thread Thomas Weise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030857#comment-17030857
 ] 

Thomas Weise commented on FLINK-15918:
--

[~trohrmann] our users rely on the metric to monitor their jobs. Continuously 
increasing uptime is interpreted as necessary for a healthy job. This 
assumption is broken with the recent change.

> This change of behavior exists since 1.9.0 if 
> {{jobmanager.execution.failover-strategy: region}} is configured, which we do 
> in the default flink-conf.yaml.

And the change of behavior is not present in 1.9 when the above setting is not 
present in flink-conf.yaml. It is, therefore, a regression for 1.10.
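
As a rough illustration of the "continuously increasing uptime" assumption, the following Python sketch counts how often the reported value drops between consecutive readings. It assumes readings are ordered by time; the function name count_resets and the example values are hypothetical.

```python
from typing import Sequence


def count_resets(uptimes_ms: Sequence[int]) -> int:
    """Count the readings that are lower than the reading before them.

    Each drop is interpreted as a job restart that reset the metric.
    """
    return sum(1 for prev, cur in zip(uptimes_ms, uptimes_ms[1:]) if cur < prev)
```

A job whose readings only ever increase yields 0. When the metric is not reset on restart, the result is also 0 even though the job restarted, so monitoring built on this assumption no longer sees the restart.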



[jira] [Commented] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030776#comment-17030776
 ] 

Till Rohrmann commented on FLINK-15918:
---

[~stevenz3wu] [~thw] do you use the {{uptime}} metric to monitor Flink jobs? If 
yes, then it would be helpful to understand how exactly you use it.
