Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-04-13 Thread Lu Niu
as disconnected.
>> We run on AWS and this seems to be AWS related.
>>
>> *From:* Xintong Song
>> *Sent:* Sunday, January 31, 2021 9:23 PM
>> *To:* user
>> *Subject:* Re: Flink 1.11 job hit error "Job leader

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-03-31 Thread Lu Niu
, January 31, 2021 9:23 PM
> *To:* user
> *Subject:* Re: Flink 1.11 job hit error "Job leader lost leadership" or
> "ResourceManager leader changed to new address null"
>
> *This email is from an external source - **exercise caution regarding
> links

RE: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-03-27 Thread Colletta, Edward
to be AWS related. From: Xintong Song Sent: Sunday, January 31, 2021 9:23 PM To: user Subject: Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null" This email is from an external source - exercise caution regarding li

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-31 Thread Xintong Song
ror on Flink 1.11.2 this past
> week.
>
> *From:* Xintong Song
> *Sent:* Friday, January 29, 2021 7:34 PM
> *To:* user
> *Subject:* Re: Flink 1.11 job hit error "Job leader lost leadership" or
> "ResourceManager leader changed to new address null

RE: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-30 Thread Colletta, Edward
“but I'm not aware of any similar issue reported since the upgrading” For the record, we experienced this same error on Flink 1.11.2 this past week. From: Xintong Song Sent: Friday, January 29, 2021 7:34 PM To: user Subject: Re: Flink 1.11 job hit error "Job leader lost leade

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-29 Thread Xintong Song
Thank you~ Xintong Song
On Sat, Jan 30, 2021 at 8:27 AM Xintong Song wrote:
> There's indeed a ZK version upgrading during 1.9 and 1.11, but I'm not
> aware of any similar issue reported since the upgrading.
> I would suggest the following:
> - Turn on the DEBUG log see if there's any

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-29 Thread Xintong Song
There was indeed a ZK version upgrade between 1.9 and 1.11, but I'm not aware of any similar issue reported since the upgrading. I would suggest the following:
- Turn on the DEBUG log to see if there are any valuable details
- Maybe try asking in the Apache ZooKeeper community, see if this is a known

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-28 Thread Xintong Song
The ZK client side uses a 15s connection timeout and a 60s session timeout in Flink. There's nothing similar to a heartbeat interval configured, which I assume is up to ZK's internal implementation. These things have not changed in Flink since at least 2017. If both ZK client and server complain
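
To relate the 60s session timeout above to the "have not heard from server in 40020ms" warning quoted elsewhere in this thread, here is a minimal Python sketch. It relies on the ZooKeeper Java client's behavior of treating the server as unresponsive after two thirds of the session timeout, and it assumes the server granted the full 60s that the client requested (a server-side cap would lower the number). The helper names `read_timeout_ms` and `parse_silence_ms` are made up for illustration and are not part of Flink or ZooKeeper.

```python
import re

# Timeouts stated in the thread for Flink's ZooKeeper client (milliseconds).
SESSION_TIMEOUT_MS = 60_000
CONNECTION_TIMEOUT_MS = 15_000

# TaskManager log line quoted in the thread.
LOG_LINE = (
    "2021-01-25 14:01:49,600 WARN "
    "org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn - "
    "Client session timed out, have not heard from server in 40020ms "
    "for sessionid 0x404f9ca531a5d6f"
)


def read_timeout_ms(session_timeout_ms: int) -> int:
    """Silence threshold of the ZooKeeper Java client: 2/3 of the session timeout.

    Assumption: the negotiated session timeout equals the requested one.
    """
    return session_timeout_ms * 2 // 3


def parse_silence_ms(line: str):
    """Extract the reported silence in ms from a ClientCnxn warning, or None."""
    match = re.search(r"have not heard from server in (\d+)ms", line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    threshold = read_timeout_ms(SESSION_TIMEOUT_MS)
    observed = parse_silence_ms(LOG_LINE)
    print(f"expected silence threshold: {threshold} ms")
    print(f"observed silence in log:    {observed} ms")
    print(f"threshold exceeded:         {observed >= threshold}")
```

Under these assumptions the warning fires just past the 40s mark, which suggests the TaskManager heard nothing from ZooKeeper (not even ping replies) for that whole window, rather than a timeout setting having changed between Flink versions.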

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2021-01-28 Thread Lu Niu
After checking the log I found the root cause is a ZK client timeout on the TM: ``` 2021-01-25 14:01:49,600 WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 40020ms for sessionid 0x404f9ca531a5d6f 2021-01-25 14:01:49,610

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-17 Thread Xintong Song
I'm not aware of any significant changes to the HA components between 1.9/1.11. Would you mind sharing the complete jobmanager/taskmanager logs? Thank you~ Xintong Song On Fri, Dec 18, 2020 at 8:53 AM Lu Niu wrote: > Hi, Xintong > > Thanks for replying and your suggestion. I did check the

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-17 Thread Lu Niu
Hi, Xintong Thanks for replying and your suggestion. I did check the ZK side but there is nothing interesting. The error message actually shows that only one TM thought JM lost leadership while others ran fine. Also, this happened only after we migrated from 1.9 to 1.11. Is it possible this is

Re: Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-16 Thread Xintong Song
Hi Lu, I assume you are using ZooKeeper as the HA service? A common cause of unexpected leadership loss is instability of the HA service. E.g., if ZK does not receive a heartbeat from the Flink RM for a certain period of time, it will revoke the leadership and notify other components. You can look

Flink 1.11 job hit error "Job leader lost leadership" or "ResourceManager leader changed to new address null"

2020-12-16 Thread Lu Niu
Hi, Flink users Recently we migrated to Flink 1.11 and are seeing exceptions like: ``` 2020-12-15 12:41:01,199 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: USER_MATERIALIZED_EVENT_SIGNAL-user_context-event ->