Hello,
the MetricQueryService is used by the webUI to fetch metrics from
the JobManager and all TaskManagers. It is only used when the
webUI is accessed.
Based on the logs you gave, the TaskManager wasn't killed by the
JobManager; instead, the JobManager only detected that the TaskManager
had shut down.
It is highly unlikely that the MetricQueryService is the cause of this;
the exception you are seeing is due to the TaskManager no longer being
reachable. Metrics can't be fetched once the TaskManager is gone.
How do you manage the Flink cluster? (YARN etc.) Given that no exception
appears in the log, I would assume that the TaskManager JVM was killed
from the outside.
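As a rough illustration of the ordering in the JobManager log quoted below, here is a minimal Python sketch that parses the timestamps and checks that the failed metrics fetch comes after the first disassociation. The helper names (parse_events, first_event) are made up for this example and are not part of Flink; the log lines are the ones from your mail, joined onto single lines.

```python
from datetime import datetime

# JobManager log lines from the quoted mail, unwrapped to one event per line.
LOG = """\
2017-04-19 20:56:52,230 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-04-19 20:56:53,986 Fetching metrics failed.
2017-04-19 20:57:43,584 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]] Caused by: [Connection refused: flink-s-load-uscen-a-c001-n011/10.34.48.40:37244]
2017-04-19 20:57:49,517 Detected unreachable: [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]
2017-04-19 20:57:49,517 Task manager akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/taskmanager terminated.
"""


def parse_events(log):
    """Return a list of (timestamp, message) tuples, one per non-empty line."""
    events = []
    for line in log.splitlines():
        if not line.strip():
            continue
        date, time, message = line.split(" ", 2)
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S,%f")
        events.append((stamp, message))
    return events


def first_event(events, needle):
    """Return the timestamp of the first message containing `needle`."""
    for stamp, message in events:
        if needle in message:
            return stamp
    raise ValueError(f"no event containing {needle!r}")


events = parse_events(LOG)
disassociated = first_event(events, "Disassociated")
fetch_failed = first_event(events, "Fetching metrics failed")

# The connection was already lost before the metrics fetch failed.
gap = (fetch_failed - disassociated).total_seconds()
print(f"metrics fetch failed {gap:.3f} s after the first disassociation")
```

This only shows the order of events in the log; it does not tell you why the TaskManager went away.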
Regards,
Chesnay
On 20.04.2017 18:42, Jason Brelloch wrote:
Hey all,
So we are doing some experimenting around large keyed state in Flink
1.2 on a single task manager and we keep having our task manager
killed by the job manager after about 10 minutes due to this exception:
Fetching metrics failed.
akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/MetricQueryService_0f7bba0b16b18e83b69c4a50e657bb1f]]
after [10000 ms]
The task manager logs show nothing out of the ordinary, but the job
manager logs show this:
2017-04-19 20:56:52,230 Association with remote system
[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed,
address is now gated for [5000] ms. Reason: [Disassociated]
2017-04-19 20:56:53,986 Fetching metrics failed.
2017-04-19 20:57:43,584 Association with remote system
[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]] Caused by:
[Connection refused: flink-s-load-uscen-a-c001-n011/10.34.48.40:37244]
2017-04-19 20:57:49,517 Detected unreachable:
[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]
2017-04-19 20:57:49,517 Task manager
akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/taskmanager
terminated.
The weird part is, we have not set up any metrics reporters or
anything so I am not really sure why the Job Manager is asking the
task manager about them. Is there a way to disable these metrics
requests, or does anyone know what is causing them?
Thanks,
--
*Jason Brelloch* | Product Developer
3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
<http://www.bettercloud.com/>