Re: Delay in REST/UI readiness during JM recovery

2018-08-11 Thread Joey Echeverria
I wanted to follow up on this thread one last time, as we found a solution for the recovery time that worked well for us. Originally, we were running the job using a jar that shaded in all of our dependencies. We switched to a more lightweight jar for the job itself and made the dependency jar

Re: Delay in REST/UI readiness during JM recovery

2018-08-06 Thread vino yang
Hi Joey, Thank you for finding these issues and creating them. Thanks, vino.
2018-08-07 8:18 GMT+08:00 Joey Echeverria:
> Thanks for the ping Vino.
>
> I created two JIRAs for the first two items:
>
> 1) https://issues.apache.org/jira/browse/FLINK-10077
> 2)

Re: Delay in REST/UI readiness during JM recovery

2018-08-06 Thread Joey Echeverria
Thanks for the ping Vino. I created two JIRAs for the first two items: 1) https://issues.apache.org/jira/browse/FLINK-10077 2) https://issues.apache.org/jira/browse/FLINK-10078 Regarding (3) we’re doing some testing with different options for the state storage. I’ll report back if we find

Re: Delay in REST/UI readiness during JM recovery

2018-08-06 Thread vino yang
Hi Joey, Did you create these JIRA issues based on Till's suggestion? If you didn't create them or you don't know how to do it, I can do it for you. But I won't do it right away; I will wait for a while. Thanks, vino.
2018-08-03 17:23 GMT+08:00 Till Rohrmann:
> Hi Joey,
>
> your analysis is

Re: Delay in REST/UI readiness during JM recovery

2018-08-03 Thread Till Rohrmann
Hi Joey, your analysis is correct. Currently, the Dispatcher will first try to recover all jobs before it confirms the leadership. 1) The Dispatcher provides much of the relevant information you see in the web-ui. Without a leading Dispatcher, the web-ui cannot show much information. But this

Re: Delay in REST/UI readiness during JM recovery

2018-08-02 Thread vino yang
Hi Joey, Good question! I will copy it to Till and Chesnay who know this part of the implementation. Thanks, vino.
2018-08-03 11:09 GMT+08:00 Joey Echeverria:
> I don’t have logs available yet, but I do have some information from ZK.
>
> The culprit appears to be the

Re: Delay in REST/UI readiness during JM recovery

2018-08-02 Thread Joey Echeverria
I don’t have logs available yet, but I do have some information from ZK. The culprit appears to be the /flink/default/leader/dispatcher_lock znode. I took a look at the dispatcher code here:

Re: Delay in REST/UI readiness during JM recovery

2018-08-02 Thread Joey Echeverria
Thanks for the tips Gary and Vino. I’ll try to reproduce it with test data and see if I can post some logs. I’ll also watch the leader znode to see if the election isn’t happening or if it’s not being retrieved. Thanks! -Joey
On Aug 1, 2018, at 11:19 PM, Gary Yao

Re: Delay in REST/UI readiness during JM recovery

2018-08-02 Thread Gary Yao
Hi Joey, If the other components (e.g., Dispatcher, ResourceManager) are able to finish the leader election in a timely manner, I currently do not see a reason why it should take the REST server 20 - 45 minutes. You can check the contents of znode /flink/.../leader/rest_server_lock to see if

Re: Delay in REST/UI readiness during JM recovery

2018-08-01 Thread vino yang
Hi Joey, Currently the REST endpoints are hosted in the JM. Your scenario is a JM failover, and your cluster is running many jobs. Here, it takes a certain amount of time for ZK to conduct the leader election. Then the JM needs to wait for the TM registration. So many jobs need to be restored and start

Re: Delay in REST/UI readiness during JM recovery

2018-08-01 Thread Joey Echeverria
Sorry to ping my own thread, but has anyone else encountered this? -Joey
> On Jul 30, 2018, at 11:10 AM, Joey Echeverria wrote:
>
> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single Job Manager running. I’m using Zookeeper to store the fencing/leader information