Hi Yanjie,

The observed exception in the logs is just a side effect of the shutdown
procedure. It is a bug that shutting down the Dispatcher results in a
fatal exception from the ApplicationDispatcherBootstrap. I've created a
ticket to fix it [1].

The true reason for stopping the SessionDispatcherLeaderProcess is that the
DefaultDispatcherRunner lost its leadership. Unfortunately, we don't log
this event at INFO level; if you enable the DEBUG log level, you should see
it. When the Dispatcher loses leadership, the Dispatcher component is
stopped. I will improve the logging of the DefaultDispatcherRunner to state
more clearly when it gains and loses leadership [2]. I hope this will make
the logs easier to understand.
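
Until then, you can enable DEBUG for just the dispatcher runner package
instead of the whole root logger, e.g. in conf/log4j-console.properties
(the logger label "dispatcherrunner" is an arbitrary name I picked):

```
logger.dispatcherrunner.name = org.apache.flink.runtime.dispatcher.runner
logger.dispatcherrunner.level = DEBUG
```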

The second job manager log shows effectively the same thing, except that
the ResourceManager loses its leadership first. It looks as if the cause of
the leadership loss could be that 172.18.0.1:443 (probably the K8s API
server) is no longer reachable.
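
To verify that from inside the pod, a quick generic TCP probe would do
(nothing Flink-specific; the host and port are just taken from your log):

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical check against the address from your log:
# print(is_reachable("172.18.0.1", 443))
```

Running `nc -vz 172.18.0.1 443` via `kubectl exec` in the pod would tell
you the same thing.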

[1] https://issues.apache.org/jira/browse/FLINK-23946
[2] https://issues.apache.org/jira/browse/FLINK-23947

Cheers,
Till

On Tue, Aug 24, 2021 at 9:56 AM yanjie <gyj199...@qq.com> wrote:

> Hi all,
>
> I run an Application Cluster on Azure K8s. The job works fine for a
> while, then the jobmanager catches an exception:
>
> org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
>
> at
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:415)
> ~[flink-dist_2.11-1.13.0.jar:1.13.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962)
> ~[flink-dist_2.11-1.13.0.jar:1.13.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926)
> ~[flink-dist_2.11-1.13.0.jar:1.13.0]
> at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398)
> ~[flink-dist_2.11-1.13.0.jar:1.13.0]
>     ... (omitted)
>
>
> with no other exception before it. The jobmanager then executes its
> stopping steps and shuts down.
> Because there is no earlier exception, I don't know why
> 'AdaptiveScheduler is being stopped'.
>
> *My question:*
> What causes this issue (flink-jobmanager-1593852-jgwjt.log)?
> Could a network issue have caused this exception (as encountered in
> flink-jobmanager-1593852-kr22z.log)?
> Why doesn't the first jobmanager (flink-jobmanager-1593852-jgwjt) throw
> any exception beforehand?
>
> *Logs:*
> The attached log files contain the jobmanager's and taskmanager's logs. I
> configured K8s HA with the jobmanager's parallelism=1 (the issue
> reproduces whether the jobmanager's parallelism is set to 1 or 2).
> *flink-jobmanager-1593852-jgwjt.log*:
> Works fine until '2021-08-23 05:08:25'.
>
> *flink-jobmanager-1593852-kr22z.log*:
> Starts at '2021-08-23 05:08:35' and restores my job; works fine for a
> while, then at '2021-08-23 14:24:15' the jobmanager appears to hit a
> network issue (maybe an Azure K8s network issue that prevents Flink from
> operating on the ConfigMap, so it loses leadership after the K8s HA lease
> duration expires).
> At '2021-08-23 14:24:32', this jobmanager catches the 'AdaptiveScheduler
> is being stopped' exception again, and then shuts down.
>
> *flink--taskexecutor-0-flink-taskmanager-1593852-56dfcd95bc-hvnps.log*:
> Contains the taskmanager's logs from the beginning to '2021-08-23
> 09:15:24', covering the first jobmanager's (flink-jobmanager-1593852-jgwjt)
> lifecycle.
>
>
> *Background*:
> *Deployment&Configuration*
> I followed this page:
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deploy-application-cluster
> to deploy an Application Cluster that runs my job, and added
> configurations for high availability on Kubernetes and the reactive
> scheduler mode.
> The attached YAML files contain the 'flink-config', 'flink-jobmanager',
> and 'flink-taskmanager' configurations.
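>
> For reference, the HA- and scheduler-related keys in flink-conf.yaml look
> roughly like this (a sketch; the cluster id and storage path below are
> placeholders, not my actual values):
>
> ```yaml
> kubernetes.cluster-id: my-flink-cluster   # placeholder name
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> high-availability.storageDir: <durable-storage-uri>   # e.g. blob/object storage
> scheduler-mode: reactive
> ```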
>
> *Other experiences*
> In a previous test, when deploying my Flink job on the Azure K8s cluster,
> I encountered a network issue once. It meant the leading jobmanager could
> not renew the ConfigMap for a while, so the standby jobmanager was
> elected leader; when the previous leader's network recovered, it saw that
> it was no longer the leader and shut down. Because of K8s's default
> configuration '*backoffLimit=6*', my Flink job was eventually removed.
> I am addressing this by increasing the K8s HA timing configurations, as
> this official document describes:
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#advanced-high-availability-kubernetes-options
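>
> Concretely, the timing options from that page go into flink-conf.yaml,
> e.g. (a sketch; the values below are illustrative, not recommendations):
>
> ```yaml
> high-availability.kubernetes.leader-election.lease-duration: 60 s
> high-availability.kubernetes.leader-election.renew-deadline: 60 s
> high-availability.kubernetes.leader-election.retry-period: 5 s
> ```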
>
>
> *My analysis*:
> Both jobmanagers' log files contain the same exception: 'AdaptiveScheduler
> is being stopped'. The first jobmanager prints no exception beforehand,
> while the second prints network exceptions, which suggests a network issue
> as the cause.
> I did encounter a network issue in the previous test, and that fix is
> still in progress, so maybe this exception is also caused by a network
> issue.
> The reason I raised this question is that the first jobmanager prints no
> information beforehand, and I wonder why that happens.
>
