Hello,

The MetricQueryService is used by the webUI to fetch metrics from the JobManager and all TaskManagers. It is only used when the
webUI is accessed.

Based on the logs you gave, the TaskManager wasn't killed by the JobManager; instead, the JobManager only detected that the TaskManager had shut down.

It is highly unlikely that the MetricQueryService is the cause of this; the exception you are seeing is due to the TaskManager no longer being reachable. Metrics can't be fetched when the TaskManager isn't there anymore.
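
To illustrate the ordering, here is a small Python sketch that uses only the timestamps from the JobManager log quoted further down in this thread. The event labels are shortened by hand and the names `EVENTS`, `parse`, and `seconds_between` are made up for this example; it is not part of Flink.

```python
from datetime import datetime

# Timestamps copied from the JobManager log quoted in this thread;
# the event labels are abbreviated by hand for readability.
EVENTS = [
    ("2017-04-19 20:56:52,230", "association failed (Disassociated)"),
    ("2017-04-19 20:56:53,986", "Fetching metrics failed"),
    ("2017-04-19 20:57:43,584", "association failed (Connection refused)"),
    ("2017-04-19 20:57:49,517", "Detected unreachable"),
    ("2017-04-19 20:57:49,517", "Task manager terminated"),
]


def parse(timestamp):
    """Parse a log timestamp of the form 'YYYY-MM-DD HH:MM:SS,mmm'."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(first, second):
    """Seconds elapsed from the timestamp `first` to the timestamp `second`."""
    return (parse(second) - parse(first)).total_seconds()


# The connection was already lost before the metrics fetch was reported as failed.
gap = seconds_between(EVENTS[0][0], EVENTS[1][0])
print(f"Metrics failure logged {gap:.3f} s after the first disassociation")
```

In the quoted log the disassociation comes first and the failed metrics fetch second, which fits the failed fetch being a symptom rather than the cause.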

How do you manage the Flink cluster (YARN, etc.)? Given that no exception appears in the log, I would assume that the TaskManager JVM was killed from the outside.
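
One possible outside cause (an assumption on my part, nothing in the logs confirms it) is the Linux kernel's OOM killer. A rough Python sketch for scanning kernel log text for such entries follows; the marker strings are typical of kernel OOM messages, but the exact wording varies between kernel versions, and `find_oom_lines` is a name made up for this example.

```python
# Substrings that Linux kernel OOM-killer messages typically contain.
# Assumption: the exact message format differs between kernel versions,
# so this is a loose, case-insensitive match.
OOM_MARKERS = ("out of memory", "killed process")


def find_oom_lines(log_text):
    """Return the lines of log_text that look like OOM-killer entries."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line.lower() for marker in OOM_MARKERS)
    ]
```

You could pass it the text of the kernel log on the TaskManager host (for example the output of `dmesg`) and check whether any match mentions the TaskManager's java process around the time it disappeared.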

Regards,
Chesnay

On 20.04.2017 18:42, Jason Brelloch wrote:
Hey all,

So we are doing some experimenting around large keyed state in Flink 1.2 on a single task manager and we keep having our task manager killed by the job manager after about 10 minutes due to this exception:

Fetching metrics failed.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/MetricQueryService_0f7bba0b16b18e83b69c4a50e657bb1f]] after [10000 ms]

The task manager logs show nothing out of the ordinary, but the job manager logs show this:

2017-04-19 20:56:52,230 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-04-19 20:56:53,986 Fetching metrics failed.
2017-04-19 20:57:43,584 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]] Caused by: [Connection refused: flink-s-load-uscen-a-c001-n011/10.34.48.40:37244]
2017-04-19 20:57:49,517 Detected unreachable: [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]
2017-04-19 20:57:49,517 Task manager akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/taskmanager terminated.

The weird part is, we have not set up any metrics reporters or anything, so I am not really sure why the job manager is asking the task manager about them. Is there a way to disable these metrics requests, or does anyone know what is causing them?

Thanks,
--
*Jason Brelloch* | Product Developer
3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305

