Hi Gary,

I’ve attached the relevant portions of the JM and TM logs.
Job Manager Logs:

2019-03-14 11:38:28,257 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2019-03-14 11:38:28,309 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
2019-03-14 11:38:28,309 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
2019-03-14 11:38:28,527 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at cluster:8080
2019-03-14 11:38:28,527 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2019-03-14 11:38:28,574 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://cluster:8080.
2019-03-14 11:38:28,613 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-03-14 11:38:28,674 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2019-03-14 11:38:28,691 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2019-03-14 11:38:28,694 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:38:28,698 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2019-03-14 11:38:28,700 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-03-14 11:38:28,818 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
2019-03-14 11:39:09,010 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://cluster:8080 was granted leadership with leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
2019-03-14 11:39:09,010 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
2019-03-14 11:39:09,011 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager.
2019-03-14 11:39:09,012 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
2019-03-14 11:39:09,017 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.

Task Manager Logs:

2019-03-14 11:42:35,790 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill files.
2019-03-14 11:42:35,820 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2019-03-14 11:42:35,839 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-03-14 11:42:35,853 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:42:35,854 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
2019-03-14 11:42:35,855 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
2019-03-14 11:42:35,871 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
2019-03-14 11:42:35,963 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:31794] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not known]
2019-03-14 11:42:35,964 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@cluster:31794/user/resourcemanager..
2019-03-14 11:47:35,895 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error occurred in TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,897 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopping TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
2019-03-14 11:47:35,908 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
2019-03-14 11:47:35,908 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down the network environment and its components.
2019-03-14 11:47:35,914 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful shutdown (took 5 ms).
2019-03-14 11:47:35,917 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful shutdown (took 2 ms).
2019-03-14 11:47:35,925 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Stop job leader service.
2019-03-14 11:47:35,931 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopped TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,931 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
2019-03-14 11:47:35,933 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-03-14 11:47:35,943 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2019-03-14 11:47:35,950 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x26977a24c4e0018 closed
2019-03-14 11:47:35,950 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x26977a24c4e0018
2019-03-14 11:47:35,950 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-03-14 11:47:35,952 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-03-14 11:47:35,952 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,959 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-03-14 11:47:35,966 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,983 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-03-14 11:47:35,984 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-03-14 11:47:35,992 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.

From: Gary Yao <g...@ververica.com>
Date: Thursday, 14 March 2019 at 9:06 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <user@flink.apache.org>
Subject: [External] Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Harshith,

Can you share JM and TM logs?

Best,
Gary

On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:

Hi all,

I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2. When I bring up the cluster, the task managers refuse to connect to the job managers with the following error.

2019-03-14 10:34:41,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not known]

Now, this works correctly if I add the following line into the /etc/hosts file.

x.x.x.x job-manager-address.com cluster

Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink 1.4.2 used to have the job manager's address instead of the word cluster.

Thanks,
Harshith