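IIRC this can be caused by the Carbon MAX_CREATES_PER_MINUTE setting, which
limits how many new Whisper files Carbon creates per minute; data points for
metrics whose file has not been created yet are dropped.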
I would deem it unlikely that the reporter thread is busy for 30 seconds.
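To rule the reporter out, it may help to capture what Flink actually sends and check that for gaps. Below is a minimal sketch in Python, assuming the raw Graphite plaintext lines ("path value timestamp") have been captured to a file, for example by temporarily pointing the reporter at a plain TCP listener instead of Carbon. The find_gaps function and its tolerance parameter are my own illustration and not part of Flink or Graphite.

```python
import sys
from collections import defaultdict


def find_gaps(lines, interval=10, tolerance=2):
    """Map each metric path to the (previous_ts, next_ts) pairs that are
    more than interval + tolerance seconds apart."""
    timestamps = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        path, _value, ts = parts
        try:
            timestamps[path].append(int(float(ts)))
        except ValueError:
            continue  # skip lines whose timestamp is not numeric
    gaps = {}
    for path, seen in timestamps.items():
        seen.sort()
        missed = [(a, b) for a, b in zip(seen, seen[1:])
                  if b - a > interval + tolerance]
        if missed:
            gaps[path] = missed
    return gaps


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python find_gaps.py captured_metrics.txt
    with open(sys.argv[1]) as fh:
        for path, missed in find_gaps(fh).items():
            print(path, missed)
```

If the capture shows no gaps while Graphite still does, the data points are most likely being dropped on the Carbon side rather than by the reporter.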
On 11/08/2020 16:57, Nikola Hrusov wrote:
Hello,
I am doing some tests with Flink 1.11.1 and I have noticed something
strange going on with the exported metrics.
I have a configuration like this:
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
metrics.reporter.graphite.host: graphite
metrics.reporter.graphite.port: 8080
metrics.reporter.graphite.protocol: tcp
metrics.reporter.graphite.interval: 10 SECONDS
which should report metrics to Graphite every 10 seconds.
That works with low parallelism (e.g. <= 20): we get all metrics,
every 10 seconds.
However, when I scale my job to a parallelism of 200 or more, the metrics
are not sent every 10 seconds. Sometimes they are missing for up to 3
reporting cycles.
I have had a brief look in the code here:
https://github.com/apache/flink/blob/release-1.11.1/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java#L107-L144 and
it looks like the reporters run on a separate thread. My first guess was
that this thread has too much work to do.
I have tried lowering the reporting interval from 10 SECONDS to 6-7
SECONDS, but even then some metrics are missing. This happens even for
simple jobs such as "source -> map -> sink" at higher parallelism.
What can I do to further debug/make this work? Has anyone come across
this before?
Regards,
Nikola Hrusov