[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063352#comment-18063352
]
Qinghui Xu commented on FLINK-21309:
I just dug a little deeper into the Flink metrics code base (especially
`MetricReporter`). It seems that attributes such as `task_id` or `tm_id` are not
exposed to the reporter, and there is no easy way to expose them. So here is a
second way to avoid suffixing the `job` label:
* We keep using HTTP PUT, so that old (and thus discontinued) metrics are
always dropped when new values are reported.
* Instead of suffixing the `job` label, we use the random UUID as a grouping
key "reporter_id" (or maybe you have a better suggestion for the name). The
`job` label then stays a meaningful literal that is easy to aggregate on and
is shared among the taskmanagers and the jobmanager, while they no longer
conflict with each other when writing metrics.
* [OPT] We may still want to keep the old behavior of suffixing the `job` label;
in that case we use a feature flag to switch between the two ways of grouping
metrics (suffixed `job` label vs. "reporter_id" grouping key).
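To make the second option concrete, here is a rough sketch of the push paths it would produce. It assumes the Pushgateway path layout `/metrics/job/<job>{/<label>/<value>}`; the helper `push_path` and the use of a hex UUID are illustrative only and are not taken from the Flink reporter code.

```python
import uuid


def push_path(job, grouping):
    """Build a Pushgateway push path: /metrics/job/<job>{/<label>/<value>}.

    Assumes label names and values contain no "/" (those would need
    extra encoding on a real Pushgateway).
    """
    parts = ["metrics", "job", job]
    for label, value in grouping.items():
        parts += [label, value]
    return "/" + "/".join(parts)


# Hypothetical: each reporter instance picks its own random id at startup.
jobmanager_path = push_path("myjob", {"reporter_id": uuid.uuid4().hex})
taskmanager_path = push_path("myjob", {"reporter_id": uuid.uuid4().hex})

print(jobmanager_path)
print(taskmanager_path)
```

Both paths share the `job` value `myjob`, so aggregation on `job` stays simple, but they address different groups, so a PUT from one reporter should not replace the metrics of the other.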
> Metrics of JobManager and TaskManager overwrite each other in pushgateway
> -
>
> Key: FLINK-21309
> URL: https://issues.apache.org/jira/browse/FLINK-21309
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
>Affects Versions: 1.9.0, 1.10.0, 1.11.0
> Environment: 1. Components :
> Flink 1.9.0/1.10.0/1.11.0 + Prometheus + Pushgateway + Yarn
> 2. Metrics Configuration in flink-conf.yaml :
> {code:java}
> metrics.reporter.promgateway.class:
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.jobName: myjob
> metrics.reporter.promgateway.randomJobNameSuffix: false{code}
>
>Reporter: jiguodai
>Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: image-2021-02-05-21-07-42-292.png
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> When a Flink job runs on YARN, the metrics of the jobmanager and the
> taskmanagers overwrite each other. At one moment the Pushgateway web UI shows
> only jobmanager metrics; a second later it shows only taskmanager metrics, and
> the two sets keep alternating. As a result, a taskmanager metric in Grafana
> appears only intermittently, as below (it disappears whenever the jobmanager
> metrics overwrite the taskmanager metrics):
> !image-2021-02-05-21-07-42-292.png!
> The root cause is that Flink's PrometheusPushGatewayReporter pushes metrics
> to the Pushgateway with PUT rather than POST and, in addition, the
> taskmanagers and the jobmanager all use the same jobName (the only grouping
> key), which is configured in flink-conf.yaml.
> Although the REST URL is the same for both methods, as shown below,
> {code:java}
> /metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}
> {code}
> PUT and POST produce different results:
> * PUT is used to push a group of metrics. All metrics with the grouping key
> specified in the URL are replaced by the metrics pushed with PUT.
> * POST works exactly like the PUT method but only metrics with the same name
> as the newly pushed metrics are replaced.
> For these reasons, it is better to push metrics to the Pushgateway with POST,
> so that jobmanager and taskmanager metrics no longer overwrite each other and
> the graphs in Grafana are continuous. One might object that we can set
> {code:java}
> metrics.reporter.promgateway.randomJobNameSuffix: true{code}
> in flink-conf.yaml, so that the jobName on each node gets a random suffix and
> metrics no longer overwrite each other. However, most users filter on jobName
> in PromQL, and matching the jobName with a regular expression slows down
> queries in Prometheus.
> Every time somebody asks on the Flink mailing list why their metrics in
> Grafana are discontinuous, I tell them to change the push method from PUT to
> POST and then repackage the flink-metrics-prometheus module. So why don't we
> solve the problem permanently now? I sincerely hope to have the chance to fix
> it.
> Related links:
> [https://github.com/prometheus/pushgateway#put-method]
> [https://github.com/prometheus/pushgateway/issues/308]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063304#comment-18063304
]
Qinghui Xu commented on FLINK-21309:
Hello, I just came across the same issue as described by [~jiguodai]: metrics
are overwritten among the taskmanagers and the jobmanager within the same Flink
cluster. It happened when I tried to disable the random suffix of the `job`
label, because the suffix prevents easy aggregation over that label.
I think I fully understand [~chesnay]'s concern about potentially overwhelming
the pushgateway with an ever-growing metric cardinality produced by a
long-running Flink cluster if we use POST instead of PUT. On the other hand,
using PUT requires appending a random suffix to the `job` label (otherwise it
is unusable), which goes against the [prometheus guideline on its
usage|https://www.robustperception.io/what-is-a-job-label-for/] and makes
aggregation inconvenient (or I have to use some relabel configs as a
workaround, which does not seem ideal either).
Here's my suggestion for a proper fix:
* HTTP POST to the pushgateway instead of PUT (without the random job suffix).
* `PrometheusPushGatewayReporter` should `DELETE` metrics from the pushgateway
when they are unregistered, e.g. when a task is removed from a taskmanager.
** Technically, as [~chesnay] mentioned, we can `DELETE` only by grouping key
on the pushgateway, so we will slightly change how we push metrics: we will
use `task_id` and `subtask_index` as grouping keys (instead of plain labels),
so that we can `DELETE` the corresponding metric group on the pushgateway
(dropping the metrics of the whole subtask altogether).
Please let me know what you think.
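The idea of deleting per subtask can be illustrated with a toy in-memory model of Pushgateway groups. This is only a sketch under assumptions: `FakePushgateway`, the metric names and the label values are made up for illustration, and the model only mirrors the documented rules that POST replaces same-named metrics within a group and that DELETE removes exactly the group addressed by the full grouping key.

```python
class FakePushgateway:
    """Toy in-memory model of Pushgateway groups; not the real server."""

    def __init__(self):
        # grouping key (frozen set of label pairs) -> {metric name: value}
        self.groups = {}

    @staticmethod
    def _key(job, **labels):
        return frozenset({"job": job, **labels}.items())

    def post(self, metrics, job, **labels):
        # POST replaces only metrics whose names are pushed again.
        self.groups.setdefault(self._key(job, **labels), {}).update(metrics)

    def delete(self, job, **labels):
        # DELETE drops exactly one group, identified by its full grouping key.
        self.groups.pop(self._key(job, **labels), None)


gw = FakePushgateway()
gw.post({"numRecordsIn": 10}, "myjob", task_id="a", subtask_index="0")
gw.post({"numRecordsIn": 7}, "myjob", task_id="a", subtask_index="1")
gw.post({"uptime": 99}, "myjob")  # e.g. jobmanager-level metrics

# Subtask 0 is removed from its taskmanager: drop just its group.
gw.delete("myjob", task_id="a", subtask_index="0")
print(len(gw.groups))  # 2
```

In this model the group of subtask 1 and the plain `job` group survive the delete, which is the behavior the proposal relies on.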
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335914#comment-17335914
]
Flink Jira Bot commented on FLINK-21309:
This issue was labeled "stale-major" 7 days ago and has not received any updates so
it is being deprioritized. If this ticket is actually Major, please raise the
priority and ask a committer to assign you the issue or revive the public
discussion.
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327347#comment-17327347
]
Flink Jira Bot commented on FLINK-21309:
This major issue is unassigned and itself and all of its Sub-Tasks have not
been updated for 30 days. So, it has been labeled "stale-major". If this ticket
is indeed "major", please either assign yourself or give an update. Afterwards,
please remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281058#comment-17281058
]
Chesnay Schepler commented on FLINK-21309:
--
Imagine a Flink session cluster against which new jobs are continuously
submitted, or any cluster with a long-running streaming job.
Whenever a new job is submitted, or a restart occurs, the number of metrics
stored in the PushGateway grows. Existing metrics are never deleted, as this
only occurs when the JM/TM process (== the prometheus "job") shuts down. That
job you ran 2 months ago? Its metrics are still around. Your job restarted a
thousand times? Well, those metrics from the first run are also still there.
If enough of these events occur, the PushGateway will crash, be it due to
running out of memory or disk space.
We cannot guard against this by cleaning up metrics (because you can only
delete by grouping key, not by label), nor can users guard against it, because
the PushGateway provides no hooks to clean up stale data.
The underlying issue is that the PushGateway is not meant for long-running
applications, but Flink jobs are usually exactly that. It is not a really good
fit; as such, some friction and inconveniences are to be expected, and this
is one of them.
What you are proposing essentially boils down to consciously leaking resources
on the _assumption_ that it won't crash. And indeed it may work fine, until a
certain point, where it fails in the worst possible way.
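The growth described above can be sketched with a deliberately simplified model of a single Pushgateway group, reduced to a dict from metric name to value. The assumption, stated loudly, is that every run contributes at least one metric name that later pushes no longer include; the names `job_<n>_numRecordsIn` are invented for the illustration.

```python
def put(group, metrics):
    """PUT: the pushed metrics replace the whole group."""
    group.clear()
    group.update(metrics)


def post(group, metrics):
    """POST: only metrics with the same name are replaced; others stay."""
    group.update(metrics)


put_group, post_group = {}, {}
for run in range(1000):
    # Hypothetical: every submitted job exposes one metric with a unique name.
    metrics = {f"job_{run}_numRecordsIn": run}
    put(put_group, metrics)
    post(post_group, metrics)

print(len(put_group), len(post_group))  # 1 1000
```

With PUT the group always holds only the latest push, while with POST every name that was ever pushed lingers until the group itself is deleted.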
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281043#comment-17281043
]
jiguodai commented on FLINK-21309:
--
[~chesnay] It doesn't matter that "randomSuffix" is enabled by default; we can
set _*put*_ to true by default.
What puzzles me is why it would blow up days/weeks/months down the line when
"deleteOnShutdown" is enabled. In my view, when "deleteOnShutdown" is set to
true, the metrics of a specific Flink job will be deleted after the job is
canceled.
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281023#comment-17281023
]
Chesnay Schepler commented on FLINK-21309:
--
I would be against it because it is one of those things that work initially but
can blow up days/weeks/months down the line.
FYI: I seem to have misremembered. The random suffix and deleteOnShutdown are
actually enabled by default, and as far as I can tell this was always the case.
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281014#comment-17281014
]
jiguodai commented on FLINK-21309:
--
[~chesnay] Got it. What about adding a configuration option in flink-conf.yaml
that lets users choose the _*put*_ or _*post*_ style? When users choose
_*post*_, "metrics.reporter.promgateway.deleteOnShutdown" would be forced to
true. We can give a detailed explanation of this new option in
flink-conf.yaml. :D
> Metrics of JobManager and TaskManager overwrite each other in pushgateway
> -
>
> Key: FLINK-21309
> URL: https://issues.apache.org/jira/browse/FLINK-21309
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
>Affects Versions: 1.9.0, 1.10.0, 1.11.0
> Environment: 1. Components :
> Flink 1.9.0/1.10.0/1.11.0 + Prometheus + Pushgateway + Yarn
> 2. Metrics Configuration in flink-conf.yaml :
> {code:java}
> metrics.reporter.promgateway.class:
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.jobName: myjob
> metrics.reporter.promgateway.randomJobNameSuffix: false{code}
>
>Reporter: jiguodai
>Priority: Major
> Attachments: image-2021-02-05-21-07-42-292.png
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> When a flink job run on yarn, metrics of jobmanager and taskmanagers will
> overwrite each other. The phenomenon is that on one second you can find only
> jobmanager metrics on pushgateway web ui, while on the next second you can
> find only taskmanager metrics on pushgateway web ui, these two kinds of
> metrics appear alternately. One metric of taskmanager on grafana will be like
> below intermittently (this taskmanager metric disappear on grafana when
> jobmanager metrics overwrite taskmanager metrics):
> !image-2021-02-05-21-07-42-292.png!
> The real reason is that Flink PrometheusPushGatewayReporter use PUT style
> instead of POST style to push metrics to pushgateway, what's more,
> taskmanagers and jobmanager use the same jobName (the only grouping key)
> which we configured in flink-conf.yaml.
> Althought REST URLs are same as below,
> {code:java}
> /metrics/job/{//}
> {code}
> PUT and POST caused different results, as we can see below :
> * PUT is used to push a group of metrics. All metrics with the grouping key
> specified in the URL are replaced by the metrics pushed with PUT.
> * POST works exactly like the PUT method but only metrics with the same name
> as the newly pushed metrics are replaced.
> For these reasons, it's better to use POST to push metrics to the
> pushgateway, to prevent jobmanager metrics and taskmanager metrics from
> overwriting each other, so that we get continuous graphs in Grafana. Maybe
> you will say that we can set
> {code:java}
> metrics.reporter.promgateway.randomJobNameSuffix: true{code}
> in flink-conf.yaml, so that the jobName from each node gets a
> random suffix and metrics no longer overwrite each other. But we
> should be aware that most users tend to use jobName as a filter condition in
> PromQL, and using regular expressions to match the exact jobName will slow
> down data retrieval in Prometheus.
> Every time somebody asks on the Flink mailing list why metrics in Grafana are
> discontinuous, I tell them to change the way metrics are pushed to the
> pushgateway from PUT to POST and then repackage the
> flink-metrics-prometheus module. So why don't we solve the problem
> permanently now? I sincerely hope to have the chance to solve it.
> Related links:
> [https://github.com/prometheus/pushgateway#put-method]
> [https://github.com/prometheus/pushgateway/issues/308]
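The difference between PUT and POST described in the issue can be illustrated with a small in-memory model of a single pushgateway group. This is only a sketch of the semantics quoted above, not pushgateway code; the class and the metric names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** In-memory model of one Pushgateway group (one grouping key, e.g. job="myjob"). */
final class PushGroupModel {
    private final Map<String, Double> stored = new TreeMap<>();

    /** PUT: everything stored under the grouping key is replaced by the pushed metrics. */
    void put(Map<String, Double> pushed) {
        stored.clear();
        stored.putAll(pushed);
    }

    /** POST: only metrics with the same name as a pushed metric are replaced. */
    void post(Map<String, Double> pushed) {
        stored.putAll(pushed);
    }

    /** Returns a sorted copy of the metrics currently stored in the group. */
    Map<String, Double> snapshot() {
        return new TreeMap<>(stored);
    }

    public static void main(String[] args) {
        // Hypothetical metric names, one per process type.
        Map<String, Double> fromJobManager = Map.of("jm_metric", 1.0);
        Map<String, Double> fromTaskManager = Map.of("tm_metric", 2.0);

        PushGroupModel viaPut = new PushGroupModel();
        viaPut.put(fromJobManager);
        viaPut.put(fromTaskManager); // wipes the jobmanager metric

        PushGroupModel viaPost = new PushGroupModel();
        viaPost.post(fromJobManager);
        viaPost.post(fromTaskManager); // keeps the jobmanager metric

        System.out.println("after PUT:  " + viaPut.snapshot().keySet());
        System.out.println("after POST: " + viaPost.snapshot().keySet());
    }
}
```

Because both processes push to the same group, the PUT variant only ever holds the metrics of whichever process pushed last, which matches the alternating behavior reported in the issue.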
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280981#comment-17280981
]
Chesnay Schepler commented on FLINK-21309:
--
The random suffix option is not enabled by default because it was introduced
after the initial PushGateway reporter. Enabling it by default now would
break existing setups, for example where users provide custom job names to
every process.
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280979#comment-17280979
]
jiguodai commented on FLINK-21309:
--
[~chesnay] If so, "randomJobNameSuffix" is effectively mandatory rather than
optional, which is a problem when somebody wants to use jobName as a filter
condition in PromQL. The solution is confusing, and we have no choice but to
use regular expressions to find metrics for an exact jobName, which will slow
down data retrieval in Prometheus.
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280971#comment-17280971
]
jiguodai commented on FLINK-21309:
--
[~chesnay] If so, "randomJobNameSuffix" is effectively mandatory rather than
optional, which is a problem when somebody wants to use jobName as a filter
condition in PromQL. The solution is confusing, and we have no choice but to
use regular expressions to find metrics for an exact jobName, which will slow
down data retrieval in Prometheus.
What about invoking a callback to delete the metrics in the pushgateway when
the JVM stops or Flink jobs are canceled? We could define an optional
configuration in flink-conf.yaml to support deleting metrics by jobName.
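A rough sketch of such a callback, assuming the pushgateway's DELETE endpoint on /metrics/job/<jobName> and a plain JVM shutdown hook. The class name, the base URL and the way the hook is registered are illustrative assumptions, not the reporter's actual implementation. Job names containing "/" need special handling on the pushgateway side and are not covered here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Sketch only: deletes the metrics of one job from a Pushgateway when the JVM stops. */
final class PushgatewayCleanup {

    /** Builds a DELETE request for /metrics/job/<jobName> under the given base URL. */
    static HttpRequest deleteRequest(String baseUrl, String jobName) {
        // URLEncoder targets form encoding, so spaces become '+'; use %20 in a path.
        String encoded =
                URLEncoder.encode(jobName, StandardCharsets.UTF_8).replace("+", "%20");
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/metrics/job/" + encoded))
                .DELETE()
                .build();
    }

    /** Registers a JVM shutdown hook that sends the DELETE request once. */
    static void registerShutdownHook(String baseUrl, String jobName) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                HttpClient.newHttpClient().send(
                        deleteRequest(baseUrl, jobName),
                        HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                // Best effort: the JVM is going down, so only report the failure.
                System.err.println(
                        "Failed to delete metrics for job " + jobName + ": " + e);
            }
        }));
    }
}
```

Note that a shutdown hook does not run when the process is killed hard, so stale groups can still remain and would need an external cleanup.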
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280881#comment-17280881
]
Chesnay Schepler commented on FLINK-21309:
--
1) Use the option that appends a random suffix.
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280718#comment-17280718
]
jiguodai commented on FLINK-21309:
--
[~chesnay] Thanks for your reply.
The first problem is that metrics of JobManager and TaskManager overwrite each
other in the pushgateway, so the graph in Grafana is discontinuous.
The second problem is that metrics in the pushgateway keep piling up over time.
As far as I am concerned, the first problem is much more urgent to solve;
for the second, we can always develop a callback script or crontab script to
delete metrics in the pushgateway.
What's more, I think what you described, "various labels increase in
cardinality over time", is normal and expected. Different processes
or taskmanagers should have different metrics in the pushgateway at the same
time instead of overwriting each other.
So, how do you solve the first problem?
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280706#comment-17280706
]
Chesnay Schepler commented on FLINK-21309:
--
Doing a POST is a dangerous approach. Various labels increase in cardinality
over time, like process IDs, (Flink) job IDs (for session clusters) or attempt
numbers.
With a POST we'd end up adding more and more metrics to the PushGateway,
eventually overloading it, because nothing is ever deleted. As such, I would
recommend that you stop instructing other people to use POST, as it may impact
the stability of their infrastructure.
PUT allows us to remove old metrics from the PushGateway that no longer exist.
Unfortunately the PushGateway does not provide us with any better means of
deleting old metrics, like a time-to-live parameter or being able to delete
metrics by label.
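The growth concern can be sketched with a deliberately simplified model that tracks only metric names inside one group and ignores label sets. The class and the names in the usage note are hypothetical; this illustrates the argument and is not a measurement.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: counts the metric names left in one group after a series of pushes. */
final class StaleMetricGrowth {

    /**
     * Replays the pushes in order and returns how many metric names remain stored.
     * With PUT each push replaces the whole group; with POST names that are no
     * longer pushed are never removed.
     */
    static int storedAfter(List<Set<String>> pushes, boolean usePost) {
        Set<String> stored = new HashSet<>();
        for (Set<String> pushed : pushes) {
            if (!usePost) {
                stored.clear(); // PUT: drop everything that was stored before
            }
            stored.addAll(pushed); // POST: only matching names are overwritten
        }
        return stored.size();
    }
}
```

For example, if three consecutive pushes report the names "job_a_records", "job_b_records" and "job_c_records" respectively, storedAfter returns 3 with POST and 1 with PUT, since the names that stopped being reported are only dropped by PUT.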
[jira] [Commented] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280055#comment-17280055
]
jiguodai commented on FLINK-21309:
--
My solution is as below:
{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter
        implements Scheduled {
    @Override
    public void report() {
        try {
            // Changed push (HTTP PUT) to pushAdd (HTTP POST).
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn(
                    "Failed to push metrics to PushGateway with jobName {}, groupingKey {}.",
                    jobName, groupingKey, e);
        }
    }
}
{code}
