[jira] [Comment Edited] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063352#comment-18063352
]
Qinghui Xu edited comment on FLINK-21309 at 3/5/26 9:02 PM:
I just dug a little deeper into the Flink metrics code base (especially
`MetricReporter`). It seems that attributes such as `task_id` or `tm_id` are
not exposed to the reporter, and there is no easy way of exposing them. So
here is a second way to avoid suffixing the `job` label:
* We keep using HTTP PUT so that old (and thus discontinued) metrics are
always dropped when new values are reported, keeping cardinality stable on the
Prometheus Pushgateway.
* Instead of suffixing the `job` label, we use the random UUID as a grouping
key "reporter_id" (or maybe you have a better suggestion for the naming), so
that `job` labels are meaningful literals (and easy to aggregate), shared among
the taskmanagers and the jobmanager, while there is no conflict among them when
writing metrics.
* [OPT] We may still want to keep the old behavior of suffixing the `job`
label; in that case we use a feature flag to switch between the two ways of
metric grouping (`job` label suffixed vs. using the "reporter_id" grouping key).
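To illustrate the grouping-key idea, here is a minimal sketch (not the actual reporter code). It assumes the Pushgateway path layout `/metrics/job/<job>/<label>/<value>`; the class name `ReporterGroupingKey` and both helper methods are hypothetical, and label values are not URL-escaped, which a real client would have to do.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class ReporterGroupingKey {

    /** Builds the Pushgateway group path: /metrics/job/<job>/<label>/<value>... */
    public static String groupPath(String job, Map<String, String> groupingKey) {
        StringBuilder path = new StringBuilder("/metrics/job/").append(job);
        for (Map.Entry<String, String> e : groupingKey.entrySet()) {
            path.append('/').append(e.getKey()).append('/').append(e.getValue());
        }
        return path.toString();
    }

    /** Hypothetical helper: one random "reporter_id" per reporter instance. */
    public static Map<String, String> newReporterGroupingKey() {
        Map<String, String> key = new LinkedHashMap<>();
        key.put("reporter_id", UUID.randomUUID().toString());
        return key;
    }

    public static void main(String[] args) {
        String job = "myjob"; // shared, unsuffixed job label
        String jobManagerPath = groupPath(job, newReporterGroupingKey());
        String taskManagerPath = groupPath(job, newReporterGroupingKey());
        System.out.println(jobManagerPath);
        System.out.println(taskManagerPath);
        // The two reporters target different groups, so a PUT from one
        // does not replace the metrics pushed by the other.
        System.out.println(jobManagerPath.equals(taskManagerPath));
    }
}
```

Both paths keep `myjob` as the `job` label, so aggregating over `job` stays an exact match while each reporter still owns its own group.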
was (Author: q.xu):
I just dug a little deeper into the Flink metrics code base (especially
`MetricReporter`). It seems that attributes such as `task_id` or `tm_id` are
not exposed to the reporter, and there is no easy way of exposing them. So
here is a second way to avoid suffixing the `job` label:
* We keep using HTTP PUT so that old (and thus discontinued) metrics are
always dropped when new values are reported.
* Instead of suffixing the `job` label, we use the random UUID as a grouping
key "reporter_id" (or maybe you have a better suggestion for the naming), so
that `job` labels are meaningful literals (and easy to aggregate), shared among
the taskmanagers and the jobmanager, while there is no conflict among them when
writing metrics.
* [OPT] We may still want to keep the old behavior of suffixing the `job`
label; in that case we use a feature flag to switch between the two ways of
metric grouping (`job` label suffixed vs. using the "reporter_id" grouping key).
> Metrics of JobManager and TaskManager overwrite each other in pushgateway
> -
>
> Key: FLINK-21309
> URL: https://issues.apache.org/jira/browse/FLINK-21309
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
>Affects Versions: 1.9.0, 1.10.0, 1.11.0
> Environment: 1. Components :
> Flink 1.9.0/1.10.0/1.11.0 + Prometheus + Pushgateway + Yarn
> 2. Metrics Configuration in flink-conf.yaml :
> {code:java}
> metrics.reporter.promgateway.class:
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.jobName: myjob
> metrics.reporter.promgateway.randomJobNameSuffix: false{code}
>
>Reporter: jiguodai
>Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor
> Attachments: image-2021-02-05-21-07-42-292.png
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> When a Flink job runs on YARN, the metrics of the jobmanager and taskmanagers
> overwrite each other. At one moment you can find only jobmanager metrics on
> the Pushgateway web UI, and at the next moment only taskmanager metrics; the
> two kinds of metrics appear alternately. As a result, a taskmanager metric in
> Grafana looks intermittent, as in the image below (the taskmanager metric
> disappears from Grafana whenever jobmanager metrics overwrite taskmanager
> metrics):
> !image-2021-02-05-21-07-42-292.png!
> The root cause is that Flink's PrometheusPushGatewayReporter uses PUT
> instead of POST to push metrics to the Pushgateway and, in addition, the
> taskmanagers and the jobmanager use the same jobName (the only grouping key),
> which we configured in flink-conf.yaml.
> Although the REST URL is the same for both methods,
> {code:java}
> /metrics/job/{//}
> {code}
> PUT and POST produce different results:
> * PUT is used to push a group of metrics. All metrics with the grouping key
> specified in the URL are replaced by the metrics pushed with PUT.
> * POST works exactly like the PUT method but only metrics with the same name
> as the newly pushed metrics are replaced.
> For these reasons, it is better to use POST to push metrics to the
> Pushgateway, so that jobmanager and taskmanager metrics do not overwrite
> each other and we get continuous graphs in Grafana. You may say that we can
> set
> {code:java}
> metrics.reporter.promgateway.randomJobNameSuffix: true{code}
> in flink-conf.yaml so that the jobName from different nodes has a
> random
[jira] [Comment Edited] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063304#comment-18063304
]
Qinghui Xu edited comment on FLINK-21309 at 3/5/26 6:39 PM:
Hello, I just came across the same issue as described by [~jiguodai]: metrics
are erased among the taskmanagers and the jobmanager within the same Flink
cluster. I hit it when I tried to disable the random suffix of the `job` label,
because that suffix prevents easy aggregation over the label.
I think I fully understand [~chesnay]'s concern about potentially overwhelming
the Pushgateway with an ever-growing metric cardinality produced by a
long-running Flink cluster if we use POST instead of PUT. But on the other
hand, using PUT requires appending a random suffix to the `job` label
(otherwise it is unusable), which goes against the [Prometheus guideline on its
usage|https://www.robustperception.io/what-is-a-job-label-for/] and makes
aggregation inconvenient (or I have to use some relabel configs as a
workaround, which does not seem ideal to me either).
Here's my suggestion for a proper fix:
* HTTP POST to the Pushgateway instead of PUT (without using the random `job`
suffix)
* `PrometheusPushGatewayReporter` should `DELETE` metrics from the Pushgateway
when metrics are unregistered, e.g. when a task is removed from a taskmanager.
** Technically, as [~chesnay] mentioned, we can `DELETE` only by grouping keys
on the Pushgateway, so we will slightly change how we push metrics: we will use
`task_id` and `subtask_index` as grouping keys (instead of plain labels) so
that we can `DELETE` the corresponding metric group on the Pushgateway
(dropping the metrics of the whole subtask altogether).
* [OPT] We can use a feature flag to choose `PUT` or `POST` (I believe we
should always append the random suffix when using `PUT`, i.e. remove the
`randomJobNameSuffix` setting, if we keep `PUT` as an option)
Please let me know what you think.
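To make the POST plus `DELETE` proposal more concrete, here is a small in-memory model of the idea, offered only as a sketch. It is not Pushgateway or Flink code: the class `GroupedGatewayModel`, the metric name `records_in`, and the `task_id`/`subtask_index` values are all made up for illustration. It only shows that when each subtask has its own grouping key, deleting that group drops all of the subtask's metrics at once.

```java
import java.util.Map;
import java.util.TreeMap;

public class GroupedGatewayModel {
    // group path (job + grouping key) -> (metric name -> value)
    private final Map<String, Map<String, Double>> groups = new TreeMap<>();

    /** POST semantics: within the group, only same-named metrics are replaced. */
    public void post(String groupPath, Map<String, Double> pushed) {
        groups.computeIfAbsent(groupPath, k -> new TreeMap<>()).putAll(pushed);
    }

    /** DELETE semantics: removes the whole group identified by the grouping key. */
    public void delete(String groupPath) {
        groups.remove(groupPath);
    }

    /** Total number of metrics currently held across all groups. */
    public int seriesCount() {
        int count = 0;
        for (Map<String, Double> group : groups.values()) {
            count += group.size();
        }
        return count;
    }

    public static void main(String[] args) {
        GroupedGatewayModel gateway = new GroupedGatewayModel();
        // Hypothetical group paths, one per subtask.
        String subtask0 = "/metrics/job/myjob/task_id/t1/subtask_index/0";
        String subtask1 = "/metrics/job/myjob/task_id/t1/subtask_index/1";

        gateway.post(subtask0, Map.of("records_in", 10.0));
        gateway.post(subtask1, Map.of("records_in", 20.0));
        System.out.println(gateway.seriesCount()); // 2

        // Subtask 0 is removed from the taskmanager: drop its group.
        gateway.delete(subtask0);
        System.out.println(gateway.seriesCount()); // 1
    }
}
```

Without the `delete` call the model keeps every group ever posted, which is the cardinality growth the POST-only approach would risk.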
was (Author: q.xu):
Hello, I just came across the same issue as described by [~jiguodai]: metrics
are erased among the taskmanagers and the jobmanager within the same Flink
cluster. I hit it when I tried to disable the random suffix of the `job` label,
because that suffix prevents easy aggregation over the label.
I think I fully understand [~chesnay]'s concern about potentially overwhelming
the Pushgateway with an ever-growing metric cardinality produced by a
long-running Flink cluster if we use POST instead of PUT. But on the other
hand, using PUT requires appending a random suffix to the `job` label
(otherwise it is unusable), which goes against the [Prometheus guideline on its
usage|https://www.robustperception.io/what-is-a-job-label-for/] and makes
aggregation inconvenient (or I have to use some relabel configs as a
workaround, which does not seem ideal to me either).
Here's my suggestion for a proper fix:
* HTTP POST to the Pushgateway instead of PUT (without using the random `job`
suffix)
* `PrometheusPushGatewayReporter` should `DELETE` metrics from the Pushgateway
when metrics are unregistered, e.g. when a task is removed from a taskmanager.
** Technically, as [~chesnay] mentioned, we can `DELETE` only by grouping keys
on the Pushgateway, so we will slightly change how we push metrics: we will use
`task_id` and `subtask_index` as grouping keys (instead of plain labels) so
that we can `DELETE` the corresponding metric group on the Pushgateway
(dropping the metrics of the whole subtask altogether).
Please let me know what you think.
[jira] [Comment Edited] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063304#comment-18063304
]
Qinghui Xu edited comment on FLINK-21309 at 3/5/26 6:39 PM:
Hello, I just came across the same issue as described by [~jiguodai]: metrics
are erased among the taskmanagers and the jobmanager within the same Flink
cluster. I hit it when I tried to disable the random suffix of the `job` label,
because that suffix prevents easy aggregation over the label.
I think I fully understand [~chesnay]'s concern about potentially overwhelming
the Pushgateway with an ever-growing metric cardinality produced by a
long-running Flink cluster if we use POST instead of PUT. But on the other
hand, using PUT requires appending a random suffix to the `job` label
(otherwise it is unusable), which goes against the [Prometheus guideline on its
usage|https://www.robustperception.io/what-is-a-job-label-for/] and makes
aggregation inconvenient (or I have to use some relabel configs as a
workaround, which does not seem ideal to me either).
Here's my suggestion for a proper fix:
* HTTP POST to the Pushgateway instead of PUT (without using the random `job`
suffix)
* `PrometheusPushGatewayReporter` should `DELETE` metrics from the Pushgateway
when metrics are unregistered, e.g. when a task is removed from a taskmanager.
** Technically, as [~chesnay] mentioned, we can `DELETE` only by grouping keys
on the Pushgateway, so we will slightly change how we push metrics: we will use
`task_id` and `subtask_index` as grouping keys (instead of plain labels) so
that we can `DELETE` the corresponding metric group on the Pushgateway
(dropping the metrics of the whole subtask altogether).
* [OPT] We can use a feature flag to choose `PUT` or `POST` (I believe we
should always append the random suffix when using `PUT`, i.e. remove the
`randomJobNameSuffix` setting, if we keep `PUT` as an option)
Please let me know what you think.
was (Author: q.xu):
Hello, I just came across the same issue as described by [~jiguodai]: metrics
are erased among the taskmanagers and the jobmanager within the same Flink
cluster. I hit it when I tried to disable the random suffix of the `job` label,
because that suffix prevents easy aggregation over the label.
I think I fully understand [~chesnay]'s concern about potentially overwhelming
the Pushgateway with an ever-growing metric cardinality produced by a
long-running Flink cluster if we use POST instead of PUT. But on the other
hand, using PUT requires appending a random suffix to the `job` label
(otherwise it is unusable), which goes against the [Prometheus guideline on its
usage|https://www.robustperception.io/what-is-a-job-label-for/] and makes
aggregation inconvenient (or I have to use some relabel configs as a
workaround, which does not seem ideal to me either).
Here's my suggestion for a proper fix:
* HTTP POST to the Pushgateway instead of PUT (without using the random `job`
suffix)
* `PrometheusPushGatewayReporter` should `DELETE` metrics from the Pushgateway
when metrics are unregistered, e.g. when a task is removed from a taskmanager.
** Technically, as [~chesnay] mentioned, we can `DELETE` only by grouping keys
on the Pushgateway, so we will slightly change how we push metrics: we will use
`task_id` and `subtask_index` as grouping keys (instead of plain labels) so
that we can `DELETE` the corresponding metric group on the Pushgateway
(dropping the metrics of the whole subtask altogether).
* [OPT] We can use a feature flag to choose `PUT` or `POST` (I believe we
should always append the random suffix when using `PUT`, i.e. remove the
`randomJobNameSuffix` setting, if we keep `PUT` as an option)
Please let me know what you think.
[jira] [Comment Edited] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280055#comment-17280055
]
jiguodai edited comment on FLINK-21309 at 2/6/21, 3:05 AM:
---
[~chesnay] my solution is as below :
{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter
        implements Scheduled {
    @Override
    public void report() {
        try {
            // change push to pushAdd
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn(
                    "Failed to push metrics to PushGateway with jobName {}, groupingKey {}.",
                    jobName,
                    groupingKey,
                    e);
        }
    }
}
{code}
If this solution is useful, I would be happy to take this Jira ticket. Thanks.
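For readers unfamiliar with the difference between `push` (HTTP PUT) and `pushAdd` (HTTP POST), the following is a minimal in-memory sketch of the two semantics for a single Pushgateway group. It is a model only, not Pushgateway or Flink code; the class `PushVsPushAdd` and the metric names `jm_metric` and `tm_metric` are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class PushVsPushAdd {
    // One Pushgateway group (same job, same grouping key): metric name -> value.
    private final Map<String, Double> group = new TreeMap<>();

    /** PUT semantics: the pushed metrics replace everything in the group. */
    public void put(Map<String, Double> pushed) {
        group.clear();
        group.putAll(pushed);
    }

    /** POST semantics: only metrics with the same name are replaced. */
    public void post(Map<String, Double> pushed) {
        group.putAll(pushed);
    }

    /** Returns a sorted copy of the group's current contents. */
    public Map<String, Double> snapshot() {
        return new TreeMap<>(group);
    }

    public static void main(String[] args) {
        // Jobmanager and taskmanager share one group because jobName is identical.
        PushVsPushAdd withPut = new PushVsPushAdd();
        withPut.put(Map.of("jm_metric", 1.0));
        withPut.put(Map.of("tm_metric", 2.0));
        System.out.println(withPut.snapshot()); // {tm_metric=2.0}

        PushVsPushAdd withPost = new PushVsPushAdd();
        withPost.post(Map.of("jm_metric", 1.0));
        withPost.post(Map.of("tm_metric", 2.0));
        System.out.println(withPost.snapshot()); // {jm_metric=1.0, tm_metric=2.0}
    }
}
```

With PUT, whichever process pushed last owns the group, which matches the alternating metrics described in the issue; with POST, both sets of metrics coexist, at the cost that nothing is ever dropped automatically.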
was (Author: jiguodai):
[~chesnay] my solution is as below :
{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter
        implements Scheduled {
    @Override
    public void report() {
        try {
            // change push to pushAdd
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn(
                    "Failed to push metrics to PushGateway with jobName {}, groupingKey {}.",
                    jobName,
                    groupingKey,
                    e);
        }
    }
}
{code}
> Metrics of JobManager and TaskManager overwrite each other in pushgateway
> -
>
> Key: FLINK-21309
> URL: https://issues.apache.org/jira/browse/FLINK-21309
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
>Affects Versions: 1.9.0, 1.10.0, 1.11.0
> Environment: 1. Components :
> Flink 1.9.0/1.10.0/1.11.0 + Prometheus + Pushgateway + Yarn
> 2. Metrics Configuration in flink-conf.yaml :
> {code:java}
> metrics.reporter.promgateway.class:
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.jobName: myjob
> metrics.reporter.promgateway.randomJobNameSuffix: false{code}
>
>Reporter: jiguodai
>Priority: Major
> Attachments: image-2021-02-05-21-07-42-292.png
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> When a Flink job runs on YARN, the metrics of the jobmanager and taskmanagers
> overwrite each other. At one moment you can find only jobmanager metrics on
> the Pushgateway web UI, and at the next moment only taskmanager metrics; the
> two kinds of metrics appear alternately. As a result, a taskmanager metric in
> Grafana looks intermittent, as in the image below (the taskmanager metric
> disappears from Grafana whenever jobmanager metrics overwrite taskmanager
> metrics):
> !image-2021-02-05-21-07-42-292.png!
> The root cause is that Flink's PrometheusPushGatewayReporter uses PUT
> instead of POST to push metrics to the Pushgateway and, in addition, the
> taskmanagers and the jobmanager use the same jobName (the only grouping key),
> which we configured in flink-conf.yaml.
> Although the REST URL is the same for both methods,
> {code:java}
> /metrics/job/{//}
> {code}
> PUT and POST produce different results:
> * PUT is used to push a group of metrics. All metrics with the grouping key
> specified in the URL are replaced by the metrics pushed with PUT.
> * POST works exactly like the PUT method but only metrics with the same name
> as the newly pushed metrics are replaced.
> For these reasons, it is better to use POST to push metrics to the
> Pushgateway, so that jobmanager and taskmanager metrics do not overwrite
> each other and we get continuous graphs in Grafana. You may say that we can
> set
> {code:java}
> metrics.reporter.promgateway.randomJobNameSuffix: true{code}
> in flink-conf.yaml so that the jobName from different nodes has a
> random suffix and metrics no longer overwrite each other. However, we
> should be aware that most users tend to use jobName as a filter condition in
> PromQL, and using regular expressions to match the exact jobName slows down
> data retrieval in Prometheus.
> Every time somebody asks on the Flink mailing list why metrics in Grafana are
> discontinuous, I tell them to change the way metrics are pushed to the
> Pushgateway from PUT to POST and then repackage the
> flink-metrics-prometheus module. So why don't we solve the problem
> permanently now? I would sincerely like the chance to solve it.
> Related links:
> [https://github.com/prometheus/pushgateway#put-method]
> [https://github.com/prometheus/pushgateway/issues/308]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280055#comment-17280055
]
jiguodai edited comment on FLINK-21309 at 2/6/21, 3:03 AM:
---
[~chesnay] my solution is as below :
{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter
        implements Scheduled {
    @Override
    public void report() {
        try {
            // change push to pushAdd
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn(
                    "Failed to push metrics to PushGateway with jobName {}, groupingKey {}.",
                    jobName,
                    groupingKey,
                    e);
        }
    }
}
{code}
was (Author: jiguodai):
[~chesnay] my solution is as below :
{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter
        implements Scheduled {
    @Override
    public void report() {
        try {
            // change push to pushAdd
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn(
                    "Failed to push metrics to PushGateway with jobName {}, groupingKey {}.",
                    jobName,
                    groupingKey,
                    e);
        }
    }
}
{code}
[jira] [Comment Edited] (FLINK-21309) Metrics of JobManager and TaskManager overwrite each other in pushgateway
[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280055#comment-17280055
]
jiguodai edited comment on FLINK-21309 at 2/6/21, 2:51 AM:
---
[~chesnay] my solution is as below :
{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter
        implements Scheduled {
    @Override
    public void report() {
        try {
            // change push to pushAdd
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn(
                    "Failed to push metrics to PushGateway with jobName {}, groupingKey {}.",
                    jobName,
                    groupingKey,
                    e);
        }
    }
}
{code}
was (Author: jiguodai):
my solution is as below :
{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter
        implements Scheduled {
    @Override
    public void report() {
        try {
            // change push to pushAdd
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn(
                    "Failed to push metrics to PushGateway with jobName {}, groupingKey {}.",
                    jobName,
                    groupingKey,
                    e);
        }
    }
}
{code}
