Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-23 Thread tison
Hi, It seems the leader info has been published, but since you haven't turned on DEBUG logging for org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService we can still only *guess* that the retrieval service in JobMaster doesn't get notified, and I don't even see an INFO level log Starting

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-18 Thread Yang Wang
It seems that your ZooKeeper service is not stable. From the log I find that the ResourceManager leader is granted and the TaskManager could register with the ResourceManager successfully. That means the ResourceManager address has been published to ZK successfully. Also a

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-17 Thread tison
Sorry, I mixed up the log; it belongs to a previous failure. Could you try to reproduce the problem with DEBUG level logs? From the log we know that JM & RM had been elected as leader but the listener didn't work. However, we don't know whether it is because the leader didn't publish the leader info or

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-16 Thread Xintong Song
Hi Abhinav, I think you are right. The log confirms that JobMaster has not tried to connect to ResourceManager. Most likely the JobMaster requested the RM address but never received it. I would suggest you check the ZK logs to see if the request from JM for the RM address has been received and

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-05 Thread Xintong Song
Hi Abhinav, Thanks for the log. However, the attached log seems to be incomplete. The NoResourceAvailableException cannot be found in this log. Regarding connecting to ResourceManager, the log suggests that: - ZK was back to life and connected at 06:29:56. 2020-02-27 06:29:56.539

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-04 Thread Xintong Song
Hi Abhinav, Do you mind sharing the complete 'jobmanager.log'? "org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected." Sometimes you see this log because the ResourceManager is not yet connected when the slot request arrives the

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-04 Thread Bajaj, Abhinav
While I set up to reproduce the issue with debug logs, I would like to share more information I noticed in the INFO logs. Below is the sequence of events/exceptions I noticed during the time ZooKeeper was disrupted. I apologize in advance as they are a bit verbose. * Zookeeper seems to be down

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-04 Thread Bajaj, Abhinav
Thanks Xintong for pointing that out. I will dig deeper and get back with my findings. ~ Abhinav Bajaj From: Xintong Song Date: Tuesday, March 3, 2020 at 7:36 PM To: "Bajaj, Abhinav" Cc: "user@flink.apache.org" Subject: Re: JobMaster does not register with ResourceManager in high

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-03 Thread Xintong Song
Hi Abhinav, The JobMaster log "Connecting to ResourceManager ..." is printed after JobMaster retrieves the ResourceManager address from ZooKeeper. In your case, I assume there's some ZK problem such that JM cannot resolve the RM address. Have you confirmed whether the ZK pods recovered after the second

JobMaster does not register with ResourceManager in high availability setup

2020-03-03 Thread Bajaj, Abhinav
Hi, We recently came across an issue where JobMaster does not register with ResourceManager in a Flink high availability setup. Let me share the details below. Setup * Flink 1.7.1 * K8s * High availability mode with a single JobManager and 3 ZooKeeper nodes in quorum. Scenario *