Hi Gary,
The job manager was indeed being invoked with a second parameter.
${Flink_HOME}/bin/jobmanager.sh start cluster
I removed the second argument and everything works fine now. I really
appreciate your help. Thanks a lot :-)
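For anyone who starts the job manager from a wrapper script, a small guard can catch this mistake before the cluster comes up. This is only a minimal sketch: check_jm_args is a hypothetical helper, not part of Flink, and it assumes no real host is named "local" or "cluster".

```shell
# Hypothetical guard (not part of Flink): flag a pre-1.5 execution mode
# passed as the second argument, since the script would treat it as a host.
check_jm_args() {
    case "${2:-}" in
        local|cluster)
            echo "warning: '$2' looks like a pre-1.5 execution mode; it would be used as the host name" >&2
            return 1
            ;;
    esac
    return 0
}

# Example: the legacy invocation is flagged, so the wrapper can stop here.
check_jm_args start cluster || echo "not starting: remove the second argument"
```

A wrapper would call check_jm_args with the same arguments it is about to pass to jobmanager.sh and only continue when it returns 0.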
Regards,
Harshith
From: Gary Yao <[email protected]>
Date: Friday, 15 March 2019 at 12:41 PM
To: Harshith Kumar Bolar <[email protected]>
Cc: user <[email protected]>
Subject: [External] Re: Re: Re: Flink 1.7.2: Task Manager not able to connect
to Job Manager
I forgot to add line numbers to the first link in my previous email:
https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh#L21-L25
On Fri, Mar 15, 2019 at 8:08 AM Gary Yao
<[email protected]> wrote:
Hi Harshith,
In the jobmanager.sh script, the 2nd argument is assigned to the HOST variable
[1]. How are you invoking jobmanager.sh? Prior to 1.5, the script expected an
execution mode (local or cluster) but this is no longer the case [2].
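The effect can be reproduced with a simplified sketch of the argument handling. This is not the real jobmanager.sh: build_jm_args is a hypothetical name, /opt/flink/conf is a placeholder default, and only the shape of the host check follows the script.

```shell
# Simplified sketch, NOT the real jobmanager.sh.
# build_jm_args is a hypothetical helper; it runs in a subshell so its
# variables do not leak into the caller.
build_jm_args() (
    host="${2:-}"   # the second positional argument is treated as the host
    args="--configDir ${FLINK_CONF_DIR:-/opt/flink/conf}"

    # Only add --host when a second argument was given.
    if [ -n "${host}" ]; then
        args="${args} --host ${host}"
    fi

    printf '%s\n' "${args}"
)

build_jm_args start            # no second argument: no --host is added
build_jm_args start cluster    # legacy mode word ends up as the host name
```

With the second call, the word "cluster" is passed on as the host, which would match a job manager advertising itself under that name.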
Best,
Gary
[1] https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh
[2] https://github.com/apache/flink/commit/d61664ca64bcb82c4e8ddf03a2ed38fe8edafa98
On Fri, Mar 15, 2019 at 3:36 AM Kumar Bolar, Harshith
<[email protected]> wrote:
Hi Gary,
An update: I noticed the line "--host cluster" in the program arguments section
of the job manager logs. So I commented out the following section in
jobmanager.sh, and the task manager is now able to connect to the job manager
without issues.
if [ ! -z $HOST ]; then
    args+=("--host")
    args+=("${HOST}")
fi
Task manager logs after commenting those lines:
2019-03-14 22:31:02,863 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
- Starting RPC endpoint for
org.apache.flink.runtime.taskexecutor.TaskExecutor at
akka://flink/user/taskmanager_0 .
2019-03-14 22:31:02,875 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 22:31:02,876 INFO
org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job
leader service.
2019-03-14 22:31:02,877 INFO org.apache.flink.runtime.filecache.FileCache
- User file cache uses directory
/tmp/flink-dist-cache-12d5905f-d694-46f6-9359-3a636188b008
2019-03-14 22:31:02,884 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to
ResourceManager
akka.tcp://[email protected]:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213).
2019-03-14 22:31:03,109 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Resolved
ResourceManager address, beginning registration
2019-03-14 22:31:03,110 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at
ResourceManager attempt 1 (timeout=100ms)
2019-03-14 22:31:03,228 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at
ResourceManager attempt 2 (timeout=200ms)
2019-03-14 22:31:03,266 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Successful
registration at resource manager
akka.tcp://flink@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager
under registration id 170ee6a00f80ee02ead0e88710093d77.
Thanks,
Harshith
From: Harshith Kumar Bolar <[email protected]>
Date: Friday, 15 March 2019 at 7:38 AM
To: Gary Yao <[email protected]>
Cc: user <[email protected]>
Subject: Re: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect
to Job Manager
Hi Gary,
Here are the full job manager and task manager logs. In the job manager logs, I
see it says “starting StandaloneSessionClusterEntrypoint”, whereas in Flink
1.4.2, it used to say “starting JobManager”. Is this correct?
Job manager logs:
https://paste.ubuntu.com/p/DCVzsQdpHq/
Task Manager logs:
https://paste.ubuntu.com/p/wbvYFZxdT8/
Thanks,
Harshith
From: Gary Yao <[email protected]>
Date: Thursday, 14 March 2019 at 10:11 PM
To: Harshith Kumar Bolar <[email protected]>
Cc: user <[email protected]>
Subject: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to
Job Manager
Hi Harshith,
The truncated log is not enough. Can you share the complete logs? If that's
not possible, I'd like to see the beginning of the log files where the cluster
configuration is logged.
The TaskManager tries to connect to the leader that is advertised in
ZooKeeper. In your case the "cluster" hostname is advertised, which hints at a
problem in your Flink configuration.
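One way to check this from the TaskManager machine is to pull the advertised host out of a log line and see whether it resolves. A rough sketch, assuming a Linux host where getent is available; advertised_host and resolves are hypothetical helper names, and the sample line is shortened from the logs in this thread:

```shell
# advertised_host (hypothetical helper): print the host part of an
# akka.tcp://flink@HOST:PORT address found on stdin (first matching line).
advertised_host() {
    sed -n 's|.*akka\.tcp://flink@\([^:]*\):[0-9]*.*|\1|p' | head -n 1
}

# resolves (hypothetical helper): succeed if the name resolves on this machine.
# Assumes getent exists (glibc-based Linux); elsewhere this reports failure.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

sample='ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership'
host=$(printf '%s\n' "${sample}" | advertised_host)
echo "advertised host: ${host}"

if resolves "${host}"; then
    echo "${host} resolves on this machine"
else
    echo "${host} does not resolve on this machine"
fi
```

If the advertised name is not a real hostname in your environment, the value most likely comes from the configuration or from the way the job manager was started, rather than from DNS.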
Best,
Gary
On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith
<[email protected]> wrote:
Hi Gary,
I’ve attached the relevant portions of the JM and TM logs.
Job Manager Logs:
2019-03-14 11:38:28,257 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: CONNECTED
2019-03-14 11:38:28,309 INFO
org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined
location of main cluster component log file:
/opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
2019-03-14 11:38:28,309 INFO
org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined
location of main cluster component stdout file:
/opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
2019-03-14 11:38:28,527 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint
listening at cluster:8080
2019-03-14 11:38:28,527 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2019-03-14 11:38:28,574 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend
listening at
http://cluster:8080.
2019-03-14 11:38:28,613 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
- Starting RPC endpoint for
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
akka://flink/user/resourcemanager .
2019-03-14 11:38:28,674 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
- Starting RPC endpoint for
org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
akka://flink/user/dispatcher .
2019-03-14 11:38:28,691 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2019-03-14 11:38:28,694 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:38:28,698 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService -
Starting ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2019-03-14 11:38:28,700 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-03-14 11:38:28,818 WARN akka.remote.ReliableDeliverySupervisor
- Association with remote system [akka.tcp://flink@cluster:22671]
has failed, address is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@cluster:22671]] Caused by: [cluster]
2019-03-14 11:39:09,010 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint -
http://cluster:8080
was granted leadership with
leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
2019-03-14 11:39:09,010 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted
leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
2019-03-14 11:39:09,011 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting
the SlotManager.
2019-03-14 11:39:09,012 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher
akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with
fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
2019-03-14 11:39:09,017 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all
persisted jobs.
Task Manager Logs:
2019-03-14 11:42:35,790 INFO
org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager
uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill
files.
2019-03-14 11:42:35,820 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have
a max timeout of 10000 ms
2019-03-14 11:42:35,839 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
- Starting RPC endpoint for
org.apache.flink.runtime.taskexecutor.TaskExecutor at
akka://flink/user/taskmanager_0 .
2019-03-14 11:42:35,853 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:42:35,854 INFO
org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job
leader service.
2019-03-14 11:42:35,855 INFO org.apache.flink.runtime.filecache.FileCache
- User file cache uses directory
/tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
2019-03-14 11:42:35,871 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to
ResourceManager
akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
2019-03-14 11:42:35,963 WARN akka.remote.ReliableDeliverySupervisor
- Association with remote system [akka.tcp://flink@cluster:31794]
has failed, address is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not
known]
2019-03-14 11:42:35,964 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not
resolve ResourceManager address
akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms:
Could not connect to rpc endpoint under address
akka.tcp://flink@cluster:31794/user/resourcemanager..
2019-03-14 11:47:35,895 ERROR
org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error
occurred in TaskExecutor
akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
Could not register at the ResourceManager within the specified maximum
registration duration 300000 ms. This indicates a problem with this instance.
Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,897 ERROR
org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error
occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
Could not register at the ResourceManager within the specified maximum
registration duration 300000 ms. This indicates a problem with this instance.
Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,904 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopping
TaskExecutor
akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,904 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:47:35,904 INFO
org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting
down TaskExecutorLocalStateStoresManager.
2019-03-14 11:47:35,908 INFO
org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager
removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
2019-03-14 11:47:35,908 INFO
org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down
the network environment and its components.
2019-03-14 11:47:35,914 INFO
org.apache.flink.runtime.io.network.netty.NettyClient - Successful
shutdown (took 5 ms).
2019-03-14 11:47:35,917 INFO
org.apache.flink.runtime.io.network.netty.NettyServer - Successful
shutdown (took 2 ms).
2019-03-14 11:47:35,925 INFO
org.apache.flink.runtime.taskexecutor.JobLeaderService - Stop job leader
service.
2019-03-14 11:47:35,931 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopped
TaskExecutor
akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,931 INFO org.apache.flink.runtime.blob.PermanentBlobCache
- Shutting down BLOB cache
2019-03-14 11:47:35,933 INFO org.apache.flink.runtime.blob.TransientBlobCache
- Shutting down BLOB cache
2019-03-14 11:47:35,943 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
- backgroundOperationsLoop exiting
2019-03-14 11:47:35,950 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session:
0x26977a24c4e0018 closed
2019-03-14 11:47:35,950 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn -
EventThread shut down for session: 0x26977a24c4e0018
2019-03-14 11:47:35,950 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
- Stopping Akka RPC service.
2019-03-14 11:47:35,952 INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down
remote daemon.
2019-03-14 11:47:35,952 INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon
shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,959 INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down
remote daemon.
2019-03-14 11:47:35,966 INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon
shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,983 INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut
down.
2019-03-14 11:47:35,984 INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut
down.
2019-03-14 11:47:35,992 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
- Stopped Akka RPC service.
From: Gary Yao <[email protected]>
Date: Thursday, 14 March 2019 at 9:06 PM
To: Harshith Kumar Bolar <[email protected]>
Cc: user <[email protected]>
Subject: [External] Re: Flink 1.7.2: Task Manager not able to connect to Job
Manager
Hi Harshith,
Can you share JM and TM logs?
Best,
Gary
On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith
<[email protected]> wrote:
Hi all,
I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2.
When I bring up the cluster, the task managers refuse to connect to the job
managers with the following error:
2019-03-14 10:34:41,551 WARN akka.remote.ReliableDeliverySupervisor
- Association with remote system [akka.tcp://flink@cluster:22671] has
failed, address is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not
known]
Now, this works correctly if I add the following line into the /etc/hosts file.
x.x.x.x job-manager-address.com cluster
Why is Flink 1.7.2 connecting to the JM using "cluster" in the address? Flink
1.4.2 used the job manager's address instead of the word "cluster".
Thanks,
Harshith