Hi devs,

In current yarn-client mode, we have several problems:

1. When the AM loses its connection with the driver, it simply finishes the application with a final status of SUCCESS. YarnClientSchedulerBackend.MonitorThread then receives an application report with that SUCCESS final status and calls sc.stop(); the SparkContext stops and the program exits with exit code 0. Scheduler systems generally rely on the exit code to judge whether an application succeeded, so neither the scheduler system nor the user learns that the job actually failed.

2. Even when YarnClientSchedulerBackend.MonitorThread receives a YARN report with a FAILED or KILLED final status, it just calls sc.stop(), and the program still exits with code 0. So when some user kills the wrong application, the real owner of the killed application still sees an incorrect SUCCESS status for their job.

There is some historical discussion of these two problems in SPARK-3627 <https://issues.apache.org/jira/browse/SPARK-3627> and SPARK-1516 <https://issues.apache.org/jira/browse/SPARK-1516>, but that was the result of a very early discussion. Spark is now widely used by various companies, and a lot of Spark-related job scheduling systems have been developed accordingly. These problems confuse users and make it hard for them to manage their jobs.

I hope to get more feedback from the developers, or to hear whether there is a good way to avoid these problems. Below is my related PR about these two problems:

https://github.com/apache/spark/pull/33780

Best regards
Angers
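To make the impact concrete: most scheduling systems wrap the launcher in something like the sketch below and judge success purely by the process exit code, which is why an unconditional exit 0 hides failures. This is a minimal hypothetical wrapper; `run_spark_job` stands in for the real spark-submit invocation, and `return 1` simulates a driver that exits non-zero on failure:

```shell
#!/bin/sh
# Hypothetical scheduler-side wrapper: the job's success or failure is
# decided only by the exit code of the launcher process.
run_spark_job() {
  # Stand-in for: spark-submit --master yarn --deploy-mode client app.jar
  # Simulate a driver process that exits non-zero when the job fails.
  return 1
}

if run_spark_job; then
  echo "job SUCCEEDED"
else
  echo "job FAILED"
fi
```

With the current yarn-client behavior, the real launcher would exit 0 even for a FAILED or KILLED application, so a wrapper like this would always print "job SUCCEEDED".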