I wanted to follow up on this thread one last time as we found a solution for
the recovery time that worked well for us.
Originally, we were running the job using a jar that shaded in all of our
dependencies. We switched to a more lightweight jar for the job itself and made
the dependency jar
Hi Joey,
Thank you for finding these issues and creating the JIRAs for them.
Thanks, vino.
2018-08-07 8:18 GMT+08:00 Joey Echeverria :
Thanks for the ping Vino.
I created two JIRAs for the first two items:
1) https://issues.apache.org/jira/browse/FLINK-10077
2) https://issues.apache.org/jira/browse/FLINK-10078
Regarding (3) we’re doing some testing with different options for the state
storage. I’ll report back if we find
Hi Joey,
Did you create these JIRA issues based on Till's suggestion?
If you haven't created them, or you don't know how to, I can do it for you. I
won't do it right away, though; I'll wait for a while.
Thanks, vino.
2018-08-03 17:23 GMT+08:00 Till Rohrmann :
Hi Joey,
your analysis is correct. Currently, the Dispatcher will first try to
recover all jobs before it confirms its leadership.
1) The Dispatcher provides much of the relevant information you see in the
web-ui. Without a leading Dispatcher, the web-ui cannot show much
information. But this
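As a toy illustration of the ordering Till describes (made-up names below, not
Flink's actual classes), leadership is confirmed only after every persisted
job has been recovered:

    import java.util.List;
    import java.util.UUID;

    // Toy model of the recovery ordering described above; not Flink's real code.
    public class DispatcherRecoveryOrder {

        // Hypothetical stand-in for the store of persisted job graphs.
        interface JobGraphStore {
            List<String> recoverJobIds() throws Exception;
        }

        static void grantLeadership(UUID sessionId, JobGraphStore store) throws Exception {
            // Step 1: recover every persisted job. With many jobs, this loop
            // dominates the failover time.
            for (String jobId : store.recoverJobIds()) {
                System.out.println("recovering job " + jobId);
            }
            // Step 2: leadership is confirmed only now, so the web UI / REST
            // endpoints cannot answer leader-dependent requests until recovery
            // has finished.
            System.out.println("confirmed leadership for session " + sessionId);
        }

        public static void main(String[] args) throws Exception {
            grantLeadership(UUID.randomUUID(), () -> List.of("job-1", "job-2"));
        }
    }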
Hi Joey,
Good question!
I will cc Till and Chesnay, who know this part of the implementation.
Thanks, vino.
2018-08-03 11:09 GMT+08:00 Joey Echeverria :
I don’t have logs available yet, but I do have some information from ZK.
The culprit appears to be the /flink/default/leader/dispatcher_lock znode.
I took a look at the dispatcher code here:
Thanks for the tips Gary and Vino. I’ll try to reproduce it with test data and
see if I can post some logs.
I’ll also watch the leader znode to see if the election isn’t happening or if
it’s not being retrieved.
Thanks!
-Joey
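A minimal sketch of watching a leader znode with the plain ZooKeeper client
(the quorum address and znode path below are placeholders for your setup):

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: print the next event on a Flink leader znode. ZooKeeper watches
    // are one-shot, so re-register after each event to keep watching.
    public class WatchLeaderZnode {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            // exists() registers the watch whether or not the node exists yet.
            zk.exists("/flink/default/leader/dispatcher_lock", event ->
                    System.out.println(event.getType() + " on " + event.getPath()));
            Thread.sleep(Long.MAX_VALUE); // keep the process alive for the callback
        }
    }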
On Aug 1, 2018, at 11:19 PM, Gary Yao wrote:
Hi Joey,
If the other components (e.g., Dispatcher, ResourceManager) are able to finish
the leader election in a timely manner, I currently do not see a reason why it
should take the REST server 20 - 45 minutes.
You can check the contents of znode /flink/.../leader/rest_server_lock to see if
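Picking up Gary's suggestion, a minimal sketch of dumping that znode's
contents with the plain ZooKeeper client (placeholder quorum address; in
Flink 1.5 the payload is Java-serialized as the leader address followed by
the leader session UUID, so if your version writes a different format, fall
back to printing the raw bytes):

    import java.io.ByteArrayInputStream;
    import java.io.ObjectInputStream;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: read the leader info Flink stores in a *_lock znode.
    public class ReadRestServerLock {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            // Throws KeeperException.NoNodeException if no leader was confirmed.
            byte[] data = zk.getData("/flink/default/leader/rest_server_lock", false, null);
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
                System.out.println("leader address:    " + ois.readUTF());
                System.out.println("leader session id: " + ois.readObject());
            }
        }
    }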
Hi Joey,
Currently, the REST endpoints are hosted in the JM. Your scenario is a JM
failover while the cluster is running a large number of jobs. It takes a
certain amount of time for ZK to conduct the leader election, then the JM
needs to wait for the TMs to register, and then all of those jobs need to be
restored and start
Sorry to ping my own thread, but has anyone else encountered this?
-Joey
> On Jul 30, 2018, at 11:10 AM, Joey Echeverria wrote:
>
> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single Job
> Manager running. I’m using Zookeeper to store the fencing/leader information
>
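For reference, a sketch of the ZooKeeper HA settings such a setup typically
uses in flink-conf.yaml (hostnames and storage directory are placeholders;
the root and cluster-id values shown are the ones that produce the
/flink/default/leader/... znode paths discussed above):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.storageDir: hdfs:///flink/ha/
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: /default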