Hi,
It seems the leader info has been published, but since you haven't turned on
DEBUG logging for
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService,
we can only *guess* that the retrieval service in the JobMaster doesn't get
notified, and I don't even see an INFO-level log
Starting
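For reference, here is a minimal sketch of how that DEBUG logging could be enabled. It assumes the log4j 1.x `conf/log4j.properties` file that Flink 1.7 uses by default. On K8s, that file is typically mounted from a ConfigMap, which is an assumption about this particular setup. The election-service line is an optional addition that is not named in the thread. It would show whether the leader info was actually written to ZK.

```properties
# Assumes Flink 1.7's default log4j 1.x conf/log4j.properties; the root logger can stay at INFO.
# Raise only the ZooKeeper leader retrieval service (the JobMaster side) to DEBUG.
log4j.logger.org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService=DEBUG
# Optional (not named in the thread): the publishing side of leader election.
log4j.logger.org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService=DEBUG
```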
It seems that your ZooKeeper service is not stable. From the log I find
that ResourceManager leadership is granted and the TaskManager could register
with the ResourceManager successfully. That means the ResourceManager address
has been published to ZK successfully.
Also a
Sorry, I mixed up the logs; that one belongs to the previous failure.
Could you try to reproduce the problem with DEBUG-level logging?
From the log we know that the JM & RM had been elected as leaders, but the
listener didn't work. However, we don't know whether that is because the
leader didn't publish the leader info or
Hi Abhinav,
I think you are right. The log confirms that the JobMaster has not tried to
connect to the ResourceManager. Most likely the JobMaster requested the RM
address but never received it.
I would suggest checking the ZK logs to see whether the request from the JM
for the RM address has been received and
Hi Abhinav,
Thanks for the log. However, the attached log seems to be incomplete.
The NoResourceAvailableException cannot be found in this log.
Regarding the connection to the ResourceManager, the log suggests that:
- ZK came back to life and was connected at 06:29:56.
2020-02-27 06:29:56.539
Hi Abhinav,
Do you mind sharing the complete 'jobmanager.log'?
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected.
>
Sometimes you see this log because the ResourceManager is not yet connected
when the slot request arrives the
While I set up a reproduction of the issue with debug logs, I would like to
share more information I noticed in the INFO logs.
Below is the sequence of events/exceptions I noticed while ZooKeeper
was disrupted.
I apologize in advance as they are a bit verbose.
* ZooKeeper seems to be down
Thanks, Xintong, for pointing that out.
I will dig deeper and get back with my findings.
~ Abhinav Bajaj
From: Xintong Song
Date: Tuesday, March 3, 2020 at 7:36 PM
To: "Bajaj, Abhinav"
Cc: "user@flink.apache.org"
Subject: Re: JobMaster does not register with ResourceManager in high
Hi Abhinav,
The JobMaster log "Connecting to ResourceManager ..." is printed after the
JobMaster retrieves the ResourceManager address from ZooKeeper. In your case,
I assume there's some ZK problem that prevents the JM from resolving the RM
address.
Have you confirmed whether the ZK pods recovered after the second
Hi,
We recently came across an issue where the JobMaster does not register with
the ResourceManager in a Flink high availability setup.
Let me share the details below.
Setup
* Flink 1.7.1
* K8s
* High availability mode with a single JobManager and 3 ZooKeeper nodes in quorum.
Scenario
*