The reported exception looks quite similar to the one in this thread <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Datadog-reporter-timeout-amp-OOM-issue-tt40997.html#a41010>, which was supposedly caused by Datadog rate limits, although I don't think that was ever thoroughly investigated. Bear in mind that each container has its own reporter; with the default reporting interval of 10 seconds, you quickly reach fairly high reports/second rates across the cluster.

Alternatively it could just be plain connectivity issues.

If the issues do not persist for a long time, then no metrics /should/ be lost, however, so you may be able to ignore them.
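
If rate limits do turn out to be the cause, one thing you could try is a longer reporting interval, so that each container reports less often. A rough sketch, reusing the reporter name dghttp from your config; the 60 seconds is only an illustrative value, not a tested recommendation:

```yaml
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <DD_API_KEY>
# Assumption: 60 SECONDS is an arbitrary example; the default is 10 seconds.
metrics.reporter.dghttp.interval: 60 SECONDS
```

With N containers this works out to roughly N/60 reports per second instead of N/10, at the cost of coarser metrics.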


On 2/2/2021 7:31 PM, Claude M wrote:

Hello,

I have a Flink jobmanager and taskmanagers deployed in a Kubernetes cluster.  I integrated it with Datadog by having the following specified in the flink-conf.yaml.

metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <DD_API_KEY>

However, I'm seeing random timeouts in the log and don't know why this is occurring or how to solve the issue. Please see the attached file showing the error.


Thanks




