Hey,

A few months back, I had a very similar problem with Datadog when I tried to do a proof of concept using it with Flink. I had quite a lot of user-defined metrics, I got similar exceptions, and the metrics didn't end up in Datadog. Without much deeper analysis, I assumed Datadog was throttling the incoming traffic.
Back then it was also difficult (?) to configure the Datadog region (EU/US). If I remember correctly, the region was more or less hardcoded to US. That seems to be fixed now; there is the parameter metrics.reporter.dghttp.dataCenter to define the region (see the configuration sketch below the quoted message).

Regards,
Juha

On Wed, Jan 27, 2021 at 6:53, Xingcan Cui (<xingc...@gmail.com>) wrote:

> Hi all,
>
> Recently, I tried to use the Datadog reporter to collect some
> user-defined metrics. Sometimes when reaching traffic peaks (which are
> also peaks for the metrics), the HTTP client will throw the following
> exception:
>
> ```
> [OkHttp https://app.datadoghq.com/...] WARN
> org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog
> java.net.SocketTimeoutException: timeout
>     at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
>     at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
>     at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
>     at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120)
>     at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>     at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
>     at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> ```
>
> I guess this may be caused by rate limiting on the Datadog server, since
> too many HTTP requests look like a kind of "attack". The real problem is
> that after the above exceptions are thrown, the JVM heap size of the
> taskmanager starts to increase and finally causes an OOM. I'm curious
> whether this may be caused by metrics accumulation, i.e., for some reason
> the client can't reconnect to the Datadog server and send the metrics, so
> the metrics data is buffered in memory and causes the OOM.
>
> I'm running Flink 1.11.2 on EMR-6.2.0 with flink-metrics-datadog-1.11.2.jar.
>
> Thanks,
> Xingcan
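
For reference, here is a rough sketch of what the reporter configuration in flink-conf.yaml could look like with the region set to EU. Apart from metrics.reporter.dghttp.dataCenter, the key names and placeholder values below are from memory and just assumptions on my part, so please verify them against the metrics documentation for your Flink version:

```
# Sketch of a Datadog reporter setup in flink-conf.yaml.
# Assumed key names; double-check against the Flink docs for your version.
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
# Placeholder, replace with your real Datadog API key.
metrics.reporter.dghttp.apikey: <your-datadog-api-key>
# EU or US; US is (as far as I remember) the default.
metrics.reporter.dghttp.dataCenter: EU
# Optional, hypothetical example tags.
metrics.reporter.dghttp.tags: env:poc,service:flink
```

With dataCenter set to EU, the reporter should post the metrics to Datadog's EU endpoint instead of the US one, which was the only option back when I tried this.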