Re: [External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-20 Thread Chesnay Schepler
What version are you using, and if you are using 1.13+, are you using 
the adaptive scheduler or reactive mode?
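(Assuming the standard configuration keys, the adaptive scheduler would be
enabled via "jobmanager.scheduler: adaptive" and reactive mode via
"scheduler-mode: reactive" in flink-conf.yaml.)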


On 20/10/2021 07:39, Clemens Valiente wrote:

[quoted message trimmed; see the full message of 2021-10-19 below]

Re: [External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-19 Thread Clemens Valiente
Hi Chesnay,
thanks a lot for the clarification.
We managed to resolve the collision, and isolated the problem to the metrics
themselves.

Using the REST API at /jobs/<jobid>/metrics?get=uptime
the response is [{"id":"uptime","value":"-1"}]
despite the job having been running and processing data for 5 days at that
point. All task-, taskmanager-, and jobmanager-related metrics seem fine;
only the job metrics are incorrect. None of the following report correct values:

[{"id":"numberOfFailedCheckpoints"},{"id":"lastCheckpointSize"},{"id":"lastCheckpointExternalPath"},{"id":"totalNumberOfCheckpoints"},{"id":"lastCheckpointRestoreTimestamp"},{"id":"uptime"},{"id":"restartingTime"},{"id":"numberOfInProgressCheckpoints"},{"id":"downtime"},{"id":"numberOfCompletedCheckpoints"},{"id":"lastCheckpointProcessedData"},{"id":"fullRestarts"},{"id":"lastCheckpointDuration"},{"id":"lastCheckpointPersistedData"}]

Looking at the Gauge implementation, the only way it can return -1 is
when isTerminalState() is true, which I don't think can be the case in a
running application.
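
Sketched out, the pattern being described looks roughly like this (an
illustrative stand-in, not Flink's actual source; JobStatus here is a minimal
substitute for Flink's enum of the same name, and the status supplier is
hypothetical):

    import java.util.function.Supplier;

    public class UptimeGaugeSketch {
        enum JobStatus {
            RUNNING(false), FINISHED(true), CANCELED(true), FAILED(true);
            private final boolean terminal;
            JobStatus(boolean terminal) { this.terminal = terminal; }
            boolean isTerminalState() { return terminal; }
        }

        private final Supplier<JobStatus> status; // hypothetical status source
        private final long runningSince;          // when the job entered RUNNING

        UptimeGaugeSketch(Supplier<JobStatus> status, long runningSince) {
            this.status = status;
            this.runningSince = runningSince;
        }

        // Reports elapsed running time, or the -1 sentinel once the observed
        // status is terminal -- the value the REST API returned above.
        long getValue() {
            if (status.get().isTerminalState()) {
                return -1L;
            }
            return System.currentTimeMillis() - runningSince;
        }
    }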
Do you know where we can check on what went wrong?

Best Regards
Clemens


On Thu, Oct 14, 2021 at 8:55 PM Chesnay Schepler wrote:

> [quoted message trimmed; see the full message of 2021-10-14 below]

Re: [External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-14 Thread Chesnay Schepler

I think you are misunderstanding a few things.

a) when you include a variable in the scope format, then Flink fills 
that in /before/ it reaches Datadog. If you set it to 
"flink.<job_name>", then what we send to Datadog is "flink.myAwesomeJob".
b) the exception you see is not coming from Datadog. It occurs because, 
based on the configured scope formats, metrics from different jobs 
running in the same JobManager resolve to the same name (the standby 
jobmanager is irrelevant). Flink rejects these metrics, because if we 
were to send them out you'd get funny results in Datadog, as all jobs 
would try to report the same metric.
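
To make (b) concrete: with metrics.scope.jm.job fixed to "flink.jobmanager",
every job's checkpoint metrics on that JobManager resolve to the same
fully-qualified name, so only the first registration survives (the job names
here are made up):

    job "orders":   flink.jobmanager.numberOfCompletedCheckpoints
    job "payments": flink.jobmanager.numberOfCompletedCheckpoints  <- collision, dropped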


In short, you need to include the job id or job name in the 
metrics.scope.jm.job scope format.
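
For example, a minimal change along those lines (assuming your job names
contain only characters Datadog accepts):

    metrics.scope.jm: flink.jobmanager
    metrics.scope.jm.job: flink.jobmanager.<job_name>

Each job then reports under a distinct prefix, e.g.
flink.jobmanager.myAwesomeJob.numberOfCompletedCheckpoints.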


On 13/10/2021 06:39, Clemens Valiente wrote:

[quoted message trimmed; see the original message of 2021-10-12 below]

[External] metric collision using datadog and standalone Kubernetes HA mode

2021-10-12 Thread Clemens Valiente
Hi,

we are using Datadog as our metrics reporter, as documented here:
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/metric_reporters/#datadog

our jobmanager scope is
metrics.scope.jm: flink.jobmanager
metrics.scope.jm.job: flink.jobmanager
since Datadog doesn't allow placeholders in metric names, we cannot include
the <job_id> or <job_name> placeholder in the scope.

This setup worked nicely on our standalone Kubernetes application
deployment without using HA.
But when we set up HA, we lost checkpointing metrics in Datadog, and see
this warning in the jobmanager log:

2021-10-01 04:22:09,920 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'totalNumberOfCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,920 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfInProgressCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,920 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfCompletedCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfFailedCheckpoints'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointRestoreTimestamp'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointSize'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointDuration'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointProcessedData'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointPersistedData'. Metric will not be reported.[flink, jobmanager]
2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointExternalPath'. Metric will not be reported.[flink, jobmanager]


I assume this is because we now have two jobmanager pods (one active, one
standby) and they both report this metric, so it fails. But we cannot use the
<host> scope in the group, otherwise we won't be able to build Datadog
dashboards conveniently.

My questions:
- Did anyone else encounter this problem?
- How could we solve this to have checkpointing metrics again in HA mode
without needing the <host> placeholder?

Thanks a lot
Clemens
