Hi Aleksandar, base on your log:
taskmanager_1 | 2019-08-22 00:05:03,713 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000) . taskmanager_1 | 2019-08-22 00:05:04,137 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager.. it looks like you return a jobmanager address on retrieval service of resource manager. Please check the implementation carefully or share it on mailing list that others can help for investigation. Best, tison. Zhu Zhu <reed...@gmail.com> 于2019年8月22日周四 上午10:11写道: > Hi Aleksandar, > > The resource manager address is retrieved from the HA services. > Would you check whether your customized HA services is returning the > right LeaderRetrievalService and whether the LeaderRetrievalService is > really retrieving the right leader's address? > Or is it possible that the stored resource manager address in HA is > replaced by jobmanager address in any case? > > Thanks, > Zhu Zhu > > Aleksandar Mastilovic <amastilo...@sightmachine.com> 于2019年8月22日周四 > 上午8:16写道: > >> Hi all, >> >> I’m experimenting with using my own implementation of HA services instead >> of ZooKeeper that would persist JobManager information on a Kubernetes >> volume instead of in ZooKeeper. >> >> I’ve set the high-availability option in flink-conf.yaml to the FQN of my >> factory class, and started the docker ensemble as I usually do (i.e. with >> no special “cluster” arguments or scripts.) >> >> What’s happening now is that TaskManager is unable to connect to >> ResourceManager, because it seems it’s trying to use the /user/jobmanager >> path instead of /user/resourcemanager. >> >> Here’s what I found in the logs: >> >> >> jobmanager_1 | 2019-08-22 00:05:00,963 INFO akka.remote.Remoting >> - Remoting started; listening on >> addresses :[akka.tcp://flink@jobmanager:6123] >> jobmanager_1 | 2019-08-22 00:05:00,975 INFO >> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor >> system started at akka.tcp://flink@jobmanager:6123 >> >> jobmanager_1 | 2019-08-22 00:05:02,380 INFO >> org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting >> RPC endpoint for >> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at >> akka://flink/user/resourcemanager . >> >> jobmanager_1 | 2019-08-22 00:05:03,138 INFO >> org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting >> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher >> at akka://flink/user/dispatcher . >> >> jobmanager_1 | 2019-08-22 00:05:03,211 INFO >> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - >> ResourceManager akka.tcp://flink@jobmanager:6123/user/resourcemanager >> was granted leadership with fencing token 00000000000000000000000000000000 >> >> jobmanager_1 | 2019-08-22 00:05:03,292 INFO >> org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher >> akka.tcp://flink@jobmanager:6123/user/dispatcher was granted leadership >> with fencing token 00000000-0000-0000-0000-000000000000 >> >> taskmanager_1 | 2019-08-22 00:05:03,713 INFO >> org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting >> to ResourceManager >> akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000) >> . >> taskmanager_1 | 2019-08-22 00:05:04,137 INFO >> org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not >> resolve ResourceManager address >> akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: >> Could not connect to rpc endpoint under address >> akka.tcp://flink@jobmanager:6123/user/jobmanager.. >> >> Is this a known bug? I’d appreciate any help I can get. >> >> Thanks, >> Aleksandar Mastilovic >> >