Hello, I am having trouble when deploying Apache Flink 1.16.1 on 2 Google Cloud instances with Docker Swarm. The JobManager is deployed on the manager node and the TaskManager is deployed in the worker node. The TaskManager seems to have trouble to communicate with ResourceManager on JobManager via akka.tcp. Here is the log on the TaskManager
flink_taskmanager.1.e7sxy43ls...@workernode.gcp | 2023-03-29 13:50:34,061 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Start job leader service. flink_taskmanager.1.e7sxy43ls...@workernode.gcp | 2023-03-29 13:50:34,066 INFO org.apache.flink.runtime.filecache.FileCache [] - User file cache uses directory /tmp/flink-dist-cache-705fdca5-c285-4875-9b68-556ccd1b56c3 flink_taskmanager.1.e7sxy43ls...@workernode.gcp | 2023-03-29 13:50:34,073 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000). flink_taskmanager.1.e7sxy43ls...@workernode.gcp | 2023-03-29 13:50:34,420 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Resolved ResourceManager address, beginning registration flink_taskmanager.1.e7sxy43ls...@workernode.gcp | 2023-03-29 13:55:34,086 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Fatal error occurred in TaskExecutor akka.tcp://flink@10.0.1.14:6127/user/rpc/taskmanager_0. flink_taskmanager.1.e7sxy43ls...@workernode.gcp | org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration PT5M. This indicates a problem with this instance. Terminating now. About firewall rules, these instances are deployed in the same VPC, same subnet, so I suppose it can communicate without any troubles. In fact, I can ping or curl on both instances. Before deploying to Google Cloud, I have successfully deployed the same setup on my private Microstack cloud without any problem. Here is my docker-compose file that I used docker stack deploy: version: '3.8' services: jobmanager: image: halo93/fixed-ports-flink-docker:1.16.1-scala_2.12-java11-custom deploy: replicas: 1 placement: constraints: [node.hostname == managernode.gcp] ports: - "8081:8081" - "6123:6123" - "6124:6124" - "6125:6125" command: jobmanager environment: - FLINK_PROPERTIES=${FLINK_PROPERTIES} networks: - flink-network taskmanager: image: halo93/fixed-ports-flink-docker:1.16.1-scala_2.12-java11-custom deploy: replicas: 1 placement: constraints: [node.hostname == workernode.gcp] depends_on: - jobmanager ports: - "6121:6121" - "6122:6122" - "6126:6126" - "6127:6127" - "6128:6128" - "5005:5005/udp" command: - taskmanager environment: - FLINK_PROPERTIES=${FLINK_PROPERTIES} networks: - flink-network networks: flink-network: driver: overlay attachable: true FLINK_PROPERTIES FLINK_PROPERTIES=$'\njobmanager.rpc.address: jobmanager\nparallelism.default: 2\n' I am using a customized flink docker image to fix taskmanager.data.port and taskmanager.rpc.port to 6126 and 6127. I have tried to change jobmanager.rpc.address with the private IP and zonal DNS, ResourceManager can register the taskmanager. However, by doing so, flink-metrics is unable to work. I expect that I can successfully deploy flink cluster on Google Cloud instances with Docker Swarm, same with what I did on Microstack.