[jira] [Commented] (FLINK-7340) Taskmanager hung after temporary DNS outage

2017-10-16 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206053#comment-16206053
 ] 

Till Rohrmann commented on FLINK-7340:
--

Thanks for the pointer [~hadronzoo],

I think this problem could be related to Akka, since we always store the 
addresses of remote components as Strings. For Akka {{2.4}} I could find DNS 
cache configuration options; for Akka {{<=2.3}} these options are missing. 
Since we have so far been using Flakka, which is based on Akka {{2.3}}, this 
could be the cause, and it will hopefully be solved by upgrading to a more 
recent Akka version with FLINK-7810.

In any case, we have to verify whether the problem still occurs with the new 
Akka version and, if so, see how we can solve it.
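
Independent of any Akka-level setting, the JVM keeps its own cache of name 
lookups. The following is only a minimal sketch, assuming the stale address 
lives in the JVM's {{InetAddress}} cache (which has not been confirmed for 
this issue). The class and method names are hypothetical; the two security 
properties are standard JDK networking properties.

```java
import java.security.Security;

public class DnsCacheTtl {

    // Hypothetical helper: caps how long the JVM caches successful and
    // failed name lookups. The values are read when the JVM's address
    // cache policy is first initialized, so call this before any lookup.
    static void capDnsCache(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        // 30s mirrors the record TTL mentioned in the issue description;
        // the negative TTL of 5s is an arbitrary illustrative value.
        capDnsCache(30, 5);
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

Whether this has any effect on the quarantine behaviour depends on where the 
address is actually cached, which still has to be verified.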

> Taskmanager hung after temporary DNS outage
> ---
>
> Key: FLINK-7340
> URL: https://issues.apache.org/jira/browse/FLINK-7340
> Project: Flink
>  Issue Type: Bug
>  Components: Core, Distributed Coordination
>Affects Versions: 1.3.1
> Environment: Non-HA Flink running in Kubernetes.
>Reporter: Joshua Griffith
>
> After a Kubernetes node failure, several TaskManagers and the DNS system were 
> automatically restarted. One TaskManager was unable to connect to the 
> JobManager and continually logged the following errors:
> {quote}
> 2017-08-01 18:58:06.707 [flink-akka.actor.default-dispatcher-823] INFO  
> org.apache.flink.runtime.taskmanager.TaskManager  - Trying to register at 
> JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 595, 
> timeout: 3 milliseconds)
> 2017-08-01 18:58:06.713 [flink-akka.actor.default-dispatcher-834] INFO  
> Remoting flink-akka.remote.default-remote-dispatcher-240 - Quarantined 
> address [akka.tcp://flink@jobmanager:6123] is still unreachable or has not 
> been restarted. Keeping it quarantined.
> {quote}
> After exec'ing into the container, I was able to {{telnet jobmanager 6123}} 
> successfully and {{dig jobmanager}} showed the correct IP in DNS. I suspect 
> that the TaskManager cached a bad IP address for the JobManager when the DNS 
> system was restarting and it used that cached address rather than respecting 
> the 30s TTL and getting a new one for the next request. It may be a good idea 
> for the TaskManager to explicitly perform a DNS lookup after JobManager 
> connection failures.
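
The suggestion to look the name up again after a connection failure can be 
sketched as follows. This is an illustration under stated assumptions, not 
Flink's or Akka's actual connection code: {{ReResolvingTarget}} is a 
hypothetical class that keeps only the host name and port and performs a new 
lookup for each attempt, instead of holding on to an already resolved address.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class ReResolvingTarget {

    private final String host;
    private final int port;

    ReResolvingTarget(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Performs a name lookup on every call rather than reusing a stored
    // InetSocketAddress. The result is still subject to the JVM's own
    // lookup cache (networkaddress.cache.ttl).
    InetSocketAddress resolve() throws UnknownHostException {
        return new InetSocketAddress(InetAddress.getByName(host), port);
    }

    public static void main(String[] args) throws Exception {
        // An IP literal is used here so the example runs without DNS.
        ReResolvingTarget target = new ReResolvingTarget("127.0.0.1", 6123);
        System.out.println(target.resolve().getPort());
    }
}
```

A caller would invoke {{resolve()}} before each registration attempt, so that 
a changed DNS record is picked up once the cached entry expires.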



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (FLINK-7340) Taskmanager hung after temporary DNS outage

2017-10-14 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204715#comment-16204715
 ] 

Stephan Ewen commented on FLINK-7340:
-

Thanks, that is a good pointer!

[~till.rohrmann] Can we take this into account in the Akka / HA address 
management?
