I think you are misunderstanding a few things.
a) when you include a variable in the scope format, then Flink fills
that in /before/ it reaches Datadog. If you set it to
"flink.<job_name>", then what we send to Datadog is "flink.myAwesomeJob".
b) the warnings you see are not coming from Datadog. They occur because,
based on the configured scope formats, metrics from different jobs
running in the same JobManager resolve to the same name (the standby
jobmanager is irrelevant). Flink rejects these metrics because, if it
were to send them out, you'd get funny results in Datadog since all
jobs would report the same metric.
In short, you need to include the job id or job name (the <job_id> or
<job_name> variable) in the metrics.scope.jm.job scope format.
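For example, a minimal sketch of the relevant flink-conf.yaml entries,
assuming you want to keep your existing flink.jobmanager prefix (the
prefix is only an illustration; any format should work as long as it
contains <job_name> or <job_id>):

```yaml
# JobManager-level scope, unchanged from your current setup.
metrics.scope.jm: flink.jobmanager
# Job-level scope on the JobManager: the variable makes the name unique per job.
# Flink substitutes <job_name> itself before anything is sent to Datadog.
metrics.scope.jm.job: flink.jobmanager.<job_name>
```

With a job named myAwesomeJob, the checkpoint metrics should then be
reported under flink.jobmanager.myAwesomeJob instead of colliding under
flink.jobmanager.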
On 13/10/2021 06:39, Clemens Valiente wrote:
Hi,
we are using datadog as our metrics reporter as documented here:
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/metric_reporters/#datadog
our jobmanager scope is
metrics.scope.jm: flink.jobmanager
metrics.scope.jm.job: flink.jobmanager
since Datadog doesn't allow placeholders in metric names, we cannot
include the <host> or <job_name> placeholder in the scope.
This setup worked nicely on our standalone kubernetes application
deployment without using HA.
But when we set up HA, we lost checkpointing metrics in datadog, and
see this warning in the jobmanager log:
2021-10-01 04:22:09,920 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'totalNumberOfCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,920 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfInProgressCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,920 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfCompletedCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfFailedCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointRestoreTimestamp'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointSize'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointDuration'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointProcessedData'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointPersistedData'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointExternalPath'. Metric will not be reported.[flink, jobmanager]
I assume this fails because we now have two jobmanager pods (one active,
one standby) and both report these metrics. But we cannot use the
<host> placeholder in the scope; otherwise we won't be able to build
Datadog dashboards conveniently.
My question:
- did anyone else encounter this problem?
- how could we solve this to have checkpointing metrics again in HA
mode without needing the <host> placeholder?
Thanks a lot
Clemens