Hi Joey,

Did you create the JIRA issues Till suggested?

If you haven't created them, or you aren't sure how to, I can do it for
you. I'll wait a little while before doing so, though.

Thanks, vino.

2018-08-03 17:23 GMT+08:00 Till Rohrmann <trohrm...@apache.org>:

> Hi Joey,
>
> your analysis is correct. Currently, the Dispatcher will first try to
> recover all jobs before it confirms the leadership.
>
> 1) The Dispatcher provides much of the relevant information you see in the
> web UI. Without a leading Dispatcher, the web UI cannot show much
> information. This could be changed, however, so that when no Dispatcher is
> leader, the web UI keeps working and simply cannot display certain
> information (number of running jobs, job details, etc.). Could you create a
> JIRA issue to fix this problem?
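>
> As a rough illustration of that change (just a sketch with made-up handler
> and gateway names, not the actual REST handler code), a handler could
> degrade gracefully instead of failing when no Dispatcher is leader:
>
>     import java.util.Optional;
>     import java.util.concurrent.CompletableFuture;
>
>     // Hypothetical sketch: if no Dispatcher leader is known, return a reduced
>     // answer instead of an error so the web UI keeps working.
>     abstract class JobOverviewHandlerSketch {
>
>         CompletableFuture<String> handleRequest() {
>             Optional<Object> dispatcherGateway = currentDispatcherLeaderGateway();
>             if (!dispatcherGateway.isPresent()) {
>                 // No leader: job counts and job details are simply unavailable.
>                 return CompletableFuture.completedFuture(
>                         "{\"jobs-running\": null, \"note\": \"no Dispatcher leader\"}");
>             }
>             return requestJobOverview(dispatcherGateway.get());
>         }
>
>         abstract Optional<Object> currentDispatcherLeaderGateway();
>
>         abstract CompletableFuture<String> requestJobOverview(Object dispatcherGateway);
>     }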
>
> 2) The reason the Dispatcher tries to recover the jobs before confirming
> leadership is that it wants to restore its internal state before it becomes
> accessible to other components and, thus, before that state can be changed.
> For example, the following problem could arise: Assume that you submit a
> job to the cluster. The cluster receives the JobGraph and persists it in
> ZooKeeper. Before the Dispatcher can acknowledge the job submission, it
> fails. The client sees the failure and tries to re-submit the job. Now the
> Dispatcher is restarted and starts recovering the persisted jobs. If we
> don't wait for this recovery to complete, the retried job submission could
> succeed first simply because it is faster. This would, however, make the
> job recovery fail, because the Dispatcher is already executing this job
> (due to the re-submission) and the assumption is that recovered jobs are
> submitted first.
>
> The same applies if you submit a modified job with the same JobID as a
> persisted job. Which job should the system then execute, the old one or
> the newly submitted one? By waiting for the recovery to complete first, we
> give precedence to the persisted jobs.
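>
> To make the ordering concrete, here is a minimal sketch of the pattern
> described above; the class and method names are illustrative only, not the
> actual Dispatcher API:
>
>     import java.util.Collection;
>     import java.util.UUID;
>
>     // Illustrative sketch: recover the persisted jobs *before* confirming
>     // leadership, so recovered jobs win over concurrent re-submissions.
>     abstract class RecoverThenConfirmSketch {
>
>         void grantLeadership(UUID leaderSessionId) {
>             // 1) Restore the internal state from the persisted JobGraphs first.
>             for (Object jobGraph : recoverPersistedJobGraphs()) {
>                 runRecoveredJob(jobGraph);
>             }
>             // 2) Only then confirm leadership, which makes the Dispatcher
>             //    reachable for clients and the web UI.
>             confirmLeadership(leaderSessionId);
>         }
>
>         abstract Collection<?> recoverPersistedJobGraphs();
>
>         abstract void runRecoveredJob(Object jobGraph);
>
>         abstract void confirmLeadership(UUID leaderSessionId);
>     }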
>
> One could also solve this problem slightly differently, by only blocking
> job submissions while a recovery is in progress. However, one would then
> have to make sure that no other RPCs change the internal state in a way
> that interferes with the job recovery.
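>
> To illustrate that alternative (again only a sketch with made-up names,
> not concrete Dispatcher code): leadership would be confirmed right away,
> and job submissions would wait on a "recovery finished" future:
>
>     import java.util.UUID;
>     import java.util.concurrent.CompletableFuture;
>
>     // Illustrative sketch: confirm leadership immediately, but gate new job
>     // submissions until recovery has finished so recovered jobs still win.
>     abstract class ConfirmThenGateSubmissionsSketch {
>
>         private final CompletableFuture<Void> recoveryFinished = new CompletableFuture<>();
>
>         void grantLeadership(UUID leaderSessionId) {
>             confirmLeadership(leaderSessionId);   // web UI sees a leader immediately
>             recoverAndRunPersistedJobs();         // may take a while with many jobs
>             recoveryFinished.complete(null);      // from now on, act on new submissions
>         }
>
>         CompletableFuture<Void> submitJob(Object jobGraph) {
>             // Wait (asynchronously) for the recovery before acting on the submission.
>             return recoveryFinished.thenCompose(ignored -> internalSubmitJob(jobGraph));
>         }
>
>         abstract void confirmLeadership(UUID leaderSessionId);
>
>         abstract void recoverAndRunPersistedJobs();
>
>         abstract CompletableFuture<Void> internalSubmitJob(Object jobGraph);
>     }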
>
> Could you maybe open a JIRA issue for solving this problem?
>
> 3) The job recovery is mainly limited by the connection to your persistent
> storage system (HDFS or S3, I assume) where the JobGraphs are stored.
> Alternatively, you could split your jobs across multiple Flink clusters in
> order to decrease the number of jobs which need to be recovered in case of
> a failure.
>
> Thanks a lot for reporting and analysing this problem. This is definitely
> something we should improve!
>
> Cheers,
> Till
>
> On Fri, Aug 3, 2018 at 5:48 AM vino yang <yanghua1...@gmail.com> wrote:
>
>> Hi Joey,
>>
>> Good question!
>> I will copy it to Till and Chesnay, who know this part of the
>> implementation.
>>
>> Thanks, vino.
>>
>> 2018-08-03 11:09 GMT+08:00 Joey Echeverria <jechever...@splunk.com>:
>>
>>> I don’t have logs available yet, but I do have some information from ZK.
>>>
>>> The culprit appears to be the /flink/default/leader/dispatcher_lock
>>> znode.
>>>
>>> I took a look at the dispatcher code here:
>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L762-L785
>>>
>>> It looks to me like, when leadership is granted, the Dispatcher performs
>>> recovery on all jobs before it writes the new leader information to
>>> the /flink/default/leader/dispatcher_lock znode.
>>>
>>> So this leaves me with three questions:
>>>
>>> 1) Why does the web monitor specifically have to wait for the dispatcher?
>>> 2) Is there a reason why the dispatcher can’t write the lock until after
>>> job recovery?
>>> 3) Is there anything I can/should be doing to speed up job recovery?
>>>
>>> Thanks!
>>>
>>> -Joey
>>>
>>>
>>> On Aug 2, 2018, at 9:24 AM, Joey Echeverria <jechever...@splunk.com>
>>> wrote:
>>>
>>> Thanks for the tips, Gary and Vino. I’ll try to reproduce it with test
>>> data and see if I can post some logs.
>>>
>>> I’ll also watch the leader znode to see if the election isn’t happening
>>> or if it’s not being retrieved.
>>>
>>> Thanks!
>>>
>>> -Joey
>>>
>>> On Aug 1, 2018, at 11:19 PM, Gary Yao <g...@data-artisans.com> wrote:
>>>
>>> Hi Joey,
>>>
>>> If the other components (e.g., Dispatcher, ResourceManager) are able to
>>> finish the leader election in a timely manner, I currently do not see a
>>> reason why it should take the REST server 20-45 minutes.
>>>
>>> You can check the contents of znode /flink/.../leader/rest_server_lock
>>> to see if there is indeed no leader, or if the leader information cannot
>>> be retrieved from ZooKeeper.
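>>>
>>> In case it helps, here is a small Curator-based snippet you could use to
>>> inspect that znode. The connect string and path are placeholders you
>>> would adjust to your setup, and the interpretation of the payload (Flink's
>>> serialized leader information) is an assumption:
>>>
>>>     import org.apache.curator.framework.CuratorFramework;
>>>     import org.apache.curator.framework.CuratorFrameworkFactory;
>>>     import org.apache.curator.retry.ExponentialBackoffRetry;
>>>     import org.apache.zookeeper.data.Stat;
>>>
>>>     public class CheckRestServerLeaderZnode {
>>>         public static void main(String[] args) throws Exception {
>>>             // Placeholders: adjust to your ZooKeeper quorum and HA root path.
>>>             String connectString = "localhost:2181";
>>>             String path = "/flink/default/leader/rest_server_lock";
>>>
>>>             CuratorFramework client = CuratorFrameworkFactory.newClient(
>>>                     connectString, new ExponentialBackoffRetry(1000, 3));
>>>             client.start();
>>>             try {
>>>                 Stat stat = client.checkExists().forPath(path);
>>>                 if (stat == null) {
>>>                     System.out.println("znode missing -> no leader has been confirmed");
>>>                 } else {
>>>                     // Flink stores its serialized leader information (address +
>>>                     // session id) here; an empty payload suggests leadership has
>>>                     // not been confirmed yet.
>>>                     byte[] data = client.getData().forPath(path);
>>>                     System.out.println("znode exists, payload: " + data.length + " bytes");
>>>                 }
>>>             } finally {
>>>                 client.close();
>>>             }
>>>         }
>>>     }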
>>>
>>> If you can reproduce this in a staging environment with some test jobs,
>>> I'd like to see the ClusterEntrypoint/JobManager logs (perhaps on debug
>>> level).
>>>
>>> Best,
>>> Gary
>>>
>>> On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com> wrote:
>>>
>>>> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a
>>>> single Job Manager running. I’m using Zookeeper to store the fencing/leader
>>>> information and S3 to store the job manager state. We’ve been running
>>>> around 250 or so streaming jobs and we’ve noticed that if the job manager
>>>> pod is deleted, it takes something like 20-45 minutes for the job manager’s
>>>> REST endpoints and web UI to become available. Until it becomes available,
>>>> we get a 503 response from the HTTP server with the message "Could not
>>>> retrieve the redirect address of the current leader. Please try to
>>>> refresh.”.
>>>>
>>>> Has anyone else run into this?
>>>>
>>>> Are there any configuration settings I should be looking at to speed up
>>>> the availability of the HTTP endpoints?
>>>>
>>>> Thanks!
>>>>
>>>> -Joey
>>>
>>
