Re: Delay in REST/UI readiness during JM recovery

Joey Echeverria Sat, 11 Aug 2018 22:21:24 -0700

I wanted to follow up on this thread one last time as we found a solution for 
the recovery time that worked well for us.

Originally, we were running each job from a jar that shaded in all of our
dependencies. We switched to a more lightweight jar for the job itself and made
the dependency jar an extra element on the class path. That sped up recovery
significantly, to around 1 minute for 250 jobs.
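
As a rough back-of-the-envelope check on the figures in this thread (20-45
minutes before the change, about 1 minute after, for roughly 250 jobs), the
average recovery cost per job can be estimated as below. This is only a sketch:
it assumes jobs are recovered one after another at roughly equal cost, which
the thread does not confirm, and the function name is made up for illustration.

```python
# Rough per-job recovery cost from the figures quoted in this thread.
# Assumption (not confirmed in the thread): jobs are recovered sequentially
# and each job costs about the same.

JOBS = 250  # approximate number of jobs reported in the thread


def seconds_per_job(total_minutes: float, jobs: int = JOBS) -> float:
    """Average recovery time per job, in seconds."""
    return total_minutes * 60 / jobs


before_low = seconds_per_job(20)   # shaded jar, lower bound of the reported range
before_high = seconds_per_job(45)  # shaded jar, upper bound of the reported range
after = seconds_per_job(1)         # lightweight jar plus dependency jar on the class path

print(f"before: {before_low:.2f}-{before_high:.2f} s/job, after: {after:.2f} s/job")
print(f"speedup: {before_low / after:.0f}x-{before_high / after:.0f}x")
```

Under those assumptions, the change moved recovery from roughly 5-11 seconds
per job to about a quarter of a second per job.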

In case anyone else hits this again, this is something they can try.

-Joey

On Aug 6, 2018, at 7:10 PM, vino yang
<yanghua1...@gmail.com> wrote:

Hi Joey,

Thank you for finding these issues and creating the JIRAs for them.

Thanks, vino.

2018-08-07 8:18 GMT+08:00 Joey Echeverria
<jechever...@splunk.com>:
Thanks for the ping Vino.

I created two JIRAs for the first two items:

1) https://issues.apache.org/jira/browse/FLINK-10077
2) https://issues.apache.org/jira/browse/FLINK-10078

Regarding (3) we’re doing some testing with different options for the state
storage. I’ll report back if we find anything significant there.

-Joey

On Aug 6, 2018, at 8:47 AM, vino yang
<yanghua1...@gmail.com> wrote:

Hi Joey,

Did you create these JIRA issues based on Till's suggestion?

If you didn't create them or you don't know how to do it, I can do it for you.
But I won't do it right away; I will wait for a while.

Thanks, vino.

2018-08-03 17:23 GMT+08:00 Till Rohrmann
<trohrm...@apache.org>:
Hi Joey,

Your analysis is correct. Currently, the Dispatcher will first try to recover
all jobs before it confirms the leadership.

1) The Dispatcher provides much of the relevant information you see in the
web-ui. Without a leading Dispatcher, the web-ui cannot show much information.
But this could be changed so that, while no Dispatcher is the leader, we simply
do not display certain information (number of running jobs, job details, etc.).
Could you create a JIRA issue to fix this problem?

2) The reason the Dispatcher recovers the jobs before confirming leadership is
that it first restores its internal state before it becomes accessible to other
components and, thus, subject to state changes. For example, the following
problem could arise: Assume that you submit a job to the cluster. The cluster
receives the JobGraph and persists it in ZooKeeper. Before the Dispatcher can
acknowledge the job submission, it fails. The client sees the failure and tries
to re-submit the job. Now the Dispatcher is restarted and starts recovering the
persisted jobs. If we don't wait for this to complete, then the retried job
submission could succeed first simply because it is faster. This would,
however, cause the job recovery to fail because the Dispatcher is already
executing this job (due to the re-submission) and the assumption is that
recovered jobs are submitted first.

The same applies if you submit a modified job with the same JobID as a
persisted job. Which job should the system then execute: the old one or the
newly submitted job? By waiting for the recovery to complete first, we give
precedence to the persisted jobs.
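
The ordering problem described above can be illustrated with a small toy model.
This is not Flink code; every class and function name in it is hypothetical,
and it only captures the one rule that a JobID can be running at most once.

```python
# Toy model (not Flink code) of the race between job recovery and a client
# retry that uses the same JobID. All names are hypothetical.


class DuplicateJobError(Exception):
    """Raised when a JobID is started while it is already running."""


class RaceModelDispatcher:
    def __init__(self):
        self.running = {}  # job_id -> origin ("recovered" or "resubmitted")

    def run_job(self, job_id, origin):
        if job_id in self.running:
            raise DuplicateJobError(f"{job_id} already running ({self.running[job_id]})")
        self.running[job_id] = origin


def replay(events):
    """Apply (job_id, origin) events in order; return final state and failed origins."""
    dispatcher = RaceModelDispatcher()
    failures = []
    for job_id, origin in events:
        try:
            dispatcher.run_job(job_id, origin)
        except DuplicateJobError:
            failures.append(origin)
    return dispatcher.running, failures


# Recovery completes before the retry is accepted: the persisted job wins.
recovery_first = replay([("job-1", "recovered"), ("job-1", "resubmitted")])

# The retry lands first: recovery of the persisted job is the one that fails.
retry_first = replay([("job-1", "resubmitted"), ("job-1", "recovered")])

print(recovery_first)
print(retry_first)
```

In the second ordering it is the recovery step that fails, which is the
situation that waiting for recovery to complete is meant to rule out.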

One could also solve this problem slightly differently, by only blocking the
job submission while a recovery is happening. However, one should check that no
other RPCs change the internal state in such a way that it interferes with the
job recovery.
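
A minimal sketch of that alternative, assuming a gate that holds back only job
submissions while recovery is running. This is a toy model, not Flink's
Dispatcher; the class and method names are hypothetical, and a real
implementation would also have to gate any other RPC that mutates state, as
noted above.

```python
import threading


class GatedDispatcher:
    """Toy model: answers read-only calls at once, holds submissions until recovery is done."""

    def __init__(self, persisted_jobs):
        self._persisted = list(persisted_jobs)
        self._recovered = threading.Event()  # set once recovery has finished
        self._lock = threading.Lock()
        self.running = {}

    def recover(self):
        with self._lock:
            for job_id in self._persisted:
                self.running[job_id] = "recovered"
        self._recovered.set()  # open the gate for submissions

    def list_jobs(self):
        # Read-only call: not gated, so it can be served during recovery.
        with self._lock:
            return sorted(self.running)

    def submit(self, job_id, timeout=5.0):
        # State-changing call: wait until recovery has completed.
        if not self._recovered.wait(timeout):
            raise TimeoutError("recovery still in progress")
        with self._lock:
            if job_id in self.running:
                return "rejected: already running"
            self.running[job_id] = "submitted"
            return "accepted"


dispatcher = GatedDispatcher(["job-1"])
results = []
client = threading.Thread(target=lambda: results.append(dispatcher.submit("job-1")))
client.start()                       # the retry arrives before recovery has run
early_view = dispatcher.list_jobs()  # read-only call returns without waiting
dispatcher.recover()
client.join()

print(early_view, results, dispatcher.running)
```

The retried submission blocks until the gate opens and is then rejected, so the
persisted job keeps precedence while read-only calls stay responsive.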

Could you maybe open a JIRA issue for solving this problem?

3) The job recovery is mainly limited by the connection to your persistent
storage system (HDFS or S3 I assume) where the JobGraphs are stored. Besides
improving that connection, you could split the jobs across multiple Flink
clusters in order to decrease the number of jobs which need to be recovered in
case of a failure.

Thanks a lot for reporting and analysing this problem. This is definitely
something we should improve!

Cheers,
Till

On Fri, Aug 3, 2018 at 5:48 AM vino yang
<yanghua1...@gmail.com> wrote:
Hi Joey,

Good question!
I will copy it to Till and Chesnay who know this part of the implementation.

Thanks, vino.

2018-08-03 11:09 GMT+08:00 Joey Echeverria
<jechever...@splunk.com>:
I don’t have logs available yet, but I do have some information from ZK.

The culprit appears to be the /flink/default/leader/dispatcher_lock znode.

I took a look at the dispatcher code here:
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L762-L785

And it looks to me that when leadership is granted it will perform job recovery
on all jobs before it writes the new leader information to the
/flink/default/leader/dispatcher_lock znode.
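
To make the suspected ordering concrete, here is a toy model of it. This is not
Flink code: the class, the function, and the placeholder address are
hypothetical, and the 503 simply mirrors the response reported earlier in this
thread.

```python
# Toy model (not Flink code): leadership is granted, every persisted job is
# recovered, and only then is the leader information published.
# All names here are hypothetical.


class OrderingModelDispatcher:
    def __init__(self, persisted_jobs):
        self.persisted_jobs = list(persisted_jobs)
        self.running = []
        self.leader_info = None  # stands in for the content of the dispatcher_lock znode

    def recover_steps(self):
        """Recover jobs one by one, yielding after each so callers can observe progress."""
        for job_id in self.persisted_jobs:
            self.running.append(job_id)  # stands in for fetching and restarting the job
            yield job_id
        # Leader information is written only after all jobs have been recovered.
        self.leader_info = "dispatcher-address"


def rest_status(dispatcher):
    """HTTP status that a REST handler in this model returns for any request."""
    return 200 if dispatcher.leader_info is not None else 503


dispatcher = OrderingModelDispatcher(f"job-{i}" for i in range(250))

# Poll the REST endpoint once after each recovered job.
statuses = [rest_status(dispatcher) for _ in dispatcher.recover_steps()]

print(set(statuses), rest_status(dispatcher))
```

In this model every poll during recovery sees a 503, and the endpoint only
becomes usable once the last of the 250 jobs has been recovered.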

So this leaves me with three questions:

1) Why does the web monitor specifically have to wait for the dispatcher?
2) Is there a reason why the dispatcher can’t write the lock until after job
recovery?
3) Is there anything I can/should be doing to speed up job recovery?

Thanks!

-Joey

On Aug 2, 2018, at 9:24 AM, Joey Echeverria
<jechever...@splunk.com> wrote:

Thanks for the tips Gary and Vino. I’ll try to reproduce it with test data and
see if I can post some logs.

I’ll also watch the leader znode to see if the election isn’t happening or if
it’s not being retrieved.

Thanks!

-Joey

On Aug 1, 2018, at 11:19 PM, Gary Yao
<g...@data-artisans.com> wrote:

Hi Joey,

If the other components (e.g., Dispatcher, ResourceManager) are able to finish
the leader election in a timely manner, I currently do not see a reason why it
should take the REST server 20 - 45 minutes to become available.

You can check the contents of znode /flink/.../leader/rest_server_lock to see
if there is indeed no leader, or if the leader information cannot be retrieved
from ZooKeeper.

If you can reproduce this in a staging environment with some test jobs, I'd
like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).

Best,
Gary

On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria
<jechever...@splunk.com> wrote:
I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single Job
Manager running. I’m using ZooKeeper to store the fencing/leader information
and S3 to store the job manager state. We’ve been running around 250 or so
streaming jobs and we’ve noticed that if the job manager pod is deleted, it
takes something like 20-45 minutes for the job manager’s REST endpoints and web
UI to become available. Until it becomes available, we get a 503 response from
the HTTP server with the message "Could not retrieve the redirect address of
the current leader. Please try to refresh."

Has anyone else run into this?

Are there any configuration settings I should be looking at to speed up the
availability of the HTTP endpoints?

Thanks!

-Joey
