Re: Issue with single job yarn flink cluster HA

2020-08-05 Thread Ken Krugler
Hi Dinesh, Did updating to Flink 1.10 resolve the issue? Thanks, — Ken > Hi Andrey, > Sure, we will try Flink 1.10 to see if the HA issues we are facing are fixed > and update this thread. > > Thanks, > Dinesh > > On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin

Re: Issue with single job yarn flink cluster HA

2020-04-03 Thread Dinesh J
Hi Andrey, Sure, we will try Flink 1.10 to see if the HA issues we are facing are fixed and update this thread. Thanks, Dinesh On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin wrote: > Hi Dinesh, > > Thanks for sharing the logs. There have been a couple of HA fixes since 1.7, e.g. > [1] and [2]. > I

Re: Issue with single job yarn flink cluster HA

2020-04-02 Thread Andrey Zagrebin
Hi Dinesh, Thanks for sharing the logs. There have been a couple of HA fixes since 1.7, e.g. [1] and [2]. I would suggest trying Flink 1.10. If the problem persists, could you also find the logs of the failed Job Manager before the failover? Best, Andrey [1] https://jira.apache.org/jira/browse/FLINK-14
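One way to collect the failed Job Manager's logs is via YARN log aggregation. The sketch below assumes log aggregation is enabled on the cluster; the application ID and file names are placeholders, not values from this thread:

    # Fetch the aggregated container logs for the Flink YARN application
    yarn logs -applicationId application_1585000000000_0042 > flink_app.log
    # Look for the JobManager failure and leader-election related messages
    grep -n "Connection refused\|leader" flink_app.log | head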

Re: Issue with single job yarn flink cluster HA

2020-03-30 Thread Dinesh J
Hi Yang, I am attaching one full jobmanager log for a job which I reran today. This is a job that tries to read from a savepoint. The same error message, "leader election ongoing", is displayed, and it stays the same even after 30 minutes. If I leave the job without a yarn kill, it stays that way forever. Base
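The usual way to retry such a run is to kill the stuck YARN application and resubmit the single-job cluster from the savepoint. A rough sketch, where the application ID, savepoint path, and job jar are placeholders:

    # Kill the stuck per-job cluster
    yarn application -kill application_1585000000000_0042
    # Resubmit as a single-job YARN cluster, restoring from the savepoint
    flink run -m yarn-cluster -s hdfs:///flink/savepoints/savepoint-abc123 ./my-job.jar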

Re: Issue with single job yarn flink cluster HA

2020-03-30 Thread Yang Wang
I think your problem is not about the akka timeout. Increasing the timeout could help in a heavily loaded cluster, especially when the network is not very good. However, that is not your case now. I am not sure about the "never recovers" part. Do you mean the logs "Connection refused" keep going and do not have o

Re: Issue with single job yarn flink cluster HA

2020-03-30 Thread Dinesh J
Hi Yang, Thanks for the clarification and suggestion. But my problem is that recovery never happens and "leader election ongoing" is the message displayed forever. Do you think increasing akka.ask.timeout and akka.tcp.timeout will help in the case of a heavy/high-load cluster, as this i
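For reference, both timeouts are plain flink-conf.yaml options. The values below are only illustrative, not a recommendation for this cluster:

    # flink-conf.yaml (illustrative values)
    akka.ask.timeout: 60 s
    akka.tcp.timeout: 60 s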

Re: Issue with single job yarn flink cluster HA

2020-03-30 Thread Yang Wang
Hi Dinesh, First, I think the error message you provided is not a problem. It just indicates that the leader election is still ongoing. When it finishes, the new leader will start a new dispatcher to provide the web UI and REST service. From your jobmanager logs "Connection refused: host1/ip
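When HA is backed by ZooKeeper, the election state can also be checked directly in the HA znodes. A sketch, assuming the default high-availability.zookeeper.path.root of /flink and the YARN application ID as the cluster-id; the exact znode layout varies by Flink version and the application ID is a placeholder:

    # From a host with the ZooKeeper CLI
    zkCli.sh -server zk1:2181
    ls /flink
    ls /flink/application_1585000000000_0042/leader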

Re: Issue with single job yarn flink cluster HA

2020-03-25 Thread Dinesh J
Hi Andrey, Yes. Sometimes the job does not restart after the current leader fails. Below is the message displayed when trying to reach the application master URL via the YARN UI, and the message remains the same even if the YARN job has been running for 2 days. During this time, even the current yarn application

Re: Issue with single job yarn flink cluster HA

2020-03-24 Thread Andrey Zagrebin
Hi Dinesh, If the current leader crashes (e.g. due to network failures), then getting these messages does not look like a problem during the leader re-election. They look to me just like warnings about the failure that caused the failover. Do you observe any problem with your application? Does the failover not work, e.g. n

Re: Issue with single job yarn flink cluster HA

2020-03-22 Thread Dinesh J
Attaching the job manager log for reference. 2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher. 2020-03-22 11:39:02,724 WARN akka.re

Issue with single job yarn flink cluster HA

2020-03-22 Thread Dinesh J
Hi all, We have a single-job YARN Flink cluster setup with High Availability. Sometimes, after a job manager failure, the next attempt successfully restarts from the current checkpoint. But occasionally we get the error below. {"errors":["Service temporarily unavailable due to an ongoing leader election. Please r
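For context, a minimal ZooKeeper-based HA setup for such a per-job YARN cluster looks roughly like the following; hosts, paths, and the attempts count are placeholders, not the actual settings used here:

    # flink-conf.yaml (illustrative values)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    high-availability.storageDir: hdfs:///flink/ha/
    yarn.application-attempts: 10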