@Till Rohrmann, thanks for your clear explanation.



------------------ Original Message ------------------
From: "Till Rohrmann" <trohrm...@apache.org>;
Date: Tue, Aug 24, 2021, 8:51 PM
To: "yanjie" <gyj199...@qq.com>;
Cc: "user" <user@flink.apache.org>;
Subject: Re: AdaptiveScheduler stopped without exception



Hi Yanjie,

The observed exception in the logs is just a side effect of the shutdown procedure. It is a bug that shutting down the Dispatcher will result in a fatal exception coming from the ApplicationDispatcherBootstrap. I've created a ticket to fix it [1].


The true reason for stopping the SessionDispatcherLeaderProcess is that the DefaultDispatcherRunner lost its leadership. Unfortunately, we don't log this event at INFO level. If you enable the DEBUG log level then you should see it. What happens when the Dispatcher loses leadership is that the Dispatcher component will be stopped. I will improve the logging of the DefaultDispatcherRunner to better state when it gains and loses leadership [2]. I hope this will make the logs easier to understand.
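
As an illustration, a minimal sketch of how DEBUG could be enabled for just the dispatcher runner classes (assuming the log4j2 properties file shipped with the distribution, e.g. conf/log4j-console.properties, which in this setup lives in the flink-config ConfigMap; the logger id 'dispatcherrunner' is an arbitrary name chosen here):

    # Hedged sketch: raise only the dispatcher runner package to DEBUG,
    # keeping the rest of the logs at their configured level
    logger.dispatcherrunner.name = org.apache.flink.runtime.dispatcher.runner
    logger.dispatcherrunner.level = DEBUG

With that in place, the leadership grant/revoke events of the DefaultDispatcherRunner should show up without turning the whole log to DEBUG.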


In the second job manager log, it is effectively the same. Just with the 
difference that first the ResourceManager loses its leadership. It seems as if 
the cause for the leadership loss could be that 172.18.0.1:443 is no longer 
reachable (probably the K8s API server).


[1] https://issues.apache.org/jira/browse/FLINK-23946
[2] https://issues.apache.org/jira/browse/FLINK-23947


Cheers,
Till


On Tue, Aug 24, 2021 at 9:56 AM yanjie <gyj199...@qq.com> wrote:

Hi all,


I run an Application Cluster on Azure K8s. The job works fine for a while, then the jobmanager catches an exception:


org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
        at org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:415) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
        ... (rest of stack trace omitted)

There is no other exception before it. The jobmanager then executes its stopping steps and shuts down.
Because there is no earlier exception, I don't know why 'AdaptiveScheduler is being stopped'.


My questions:
What causes this issue (flink-jobmanager-1593852-jgwjt.log)?
Did a network issue cause this exception (as encountered in flink-jobmanager-1593852-kr22z.log)?
Why doesn't the first jobmanager (flink-jobmanager-1593852-jgwjt) throw any exception beforehand?


Logs:
The attached log files contain the jobmanager's and taskmanager's logs. I configured Kubernetes HA with the jobmanager's parallelism set to 1 (the issue recurs whether the jobmanager parallelism is set to 1 or 2).
flink-jobmanager-1593852-jgwjt.log:
works fine until '2021-08-23 05:08:25'


flink-jobmanager-1593852-kr22z.log:
starts from '2021-08-23 05:08:35' and restores my job; works fine for a while, then at '2021-08-23 14:24:15' the jobmanager appears to hit a network issue (maybe an Azure K8s network issue, which prevents Flink from operating on the ConfigMap, so it loses leadership after the K8s HA lease duration).
At '2021-08-23 14:24:32', this jobmanager catches the 'AdaptiveScheduler is being stopped' exception again, and then shuts down.


flink--taskexecutor-0-flink-taskmanager-1593852-56dfcd95bc-hvnps.log:
Contains the taskmanager's logs from the beginning until '2021-08-23 09:15:24', covering the first jobmanager's (flink-jobmanager-1593852-jgwjt) lifecycle.




Background:
Deployment & Configuration
I followed this page:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deploy-application-cluster
to deploy an Application Cluster to run my job, and added the configuration for high availability on Kubernetes and the reactive scheduler mode.
The attached yaml files contain the 'flink-config', 'flink-jobmanager', and 'flink-taskmanager' configurations.
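
For reference, a minimal sketch of what these settings look like in flink-conf.yaml (the values below are placeholders, not the actual ones; the real settings are in the attached flink-config yaml):

    kubernetes.cluster-id: flink-cluster-1593852          # placeholder cluster id
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://flink-ha/recovery  # placeholder durable storage path
    scheduler-mode: reactive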


Other experiences
In a previous test, when deploying my Flink job on the Azure K8s cluster, I encountered a 'network issue' once. This issue meant the leader jobmanager could not renew the ConfigMap for a while, so the standby jobmanager was elected as leader; when the previous leader's network recovered, it saw that it was no longer the leader and shut down. Because of K8s's default configuration 'backoffLimit=6', my Flink job was finally removed.
I'm fixing this issue by increasing the Kubernetes HA configuration values, as this official document describes:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#advanced-high-availability-kubernetes-options
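
Roughly, the options being increased look like this (a sketch only; the concrete values are placeholders still being tuned, and the defaults in the comments are my reading of the 1.13 documentation linked above):

    high-availability.kubernetes.leader-election.lease-duration: 60 s   # default 15 s
    high-availability.kubernetes.leader-election.renew-deadline: 60 s   # default 15 s
    high-availability.kubernetes.leader-election.retry-period: 10 s     # default 5 s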




My analysis:
Both jobmanagers' log files contain the same exception: 'AdaptiveScheduler is being stopped'. The first jobmanager doesn't print any exception before it.
The second jobmanager prints a network exception, which suggests that this is caused by a network issue.

I did encounter a 'network issue' in the previous test, and that fix is still in progress, so maybe this exception is also caused by the 'network issue'.


The reason I raised this is that the first jobmanager doesn't print any information beforehand, and I wonder why that happens.
