[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872720#comment-15872720 ]

Joshua Caplan commented on SPARK-3877:
--------------------------------------

Done, as SPARK-19649.

> The exit code of spark-submit is still 0 when an yarn application fails
> -----------------------------------------------------------------------
>
> Key: SPARK-3877
> URL: https://issues.apache.org/jira/browse/SPARK-3877
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.1.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Labels: yarn
> Fix For: 1.1.1, 1.2.0
>
> When an yarn application fails (yarn-cluster mode), the exit code of
> spark-submit is still 0. It's hard for people to write some automatic scripts
> to run spark jobs in yarn because the failure can not be detected in these
> scripts.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826517#comment-15826517 ]

Marcelo Vanzin commented on SPARK-3877:
---------------------------------------

[~j_caplan] can you open a new bug for that issue?
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826444#comment-15826444 ]

Joshua Caplan commented on SPARK-3877:
--------------------------------------

See also https://issues.apache.org/jira/browse/MAPREDUCE-6091
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812752#comment-15812752 ]

Joshua Caplan commented on SPARK-3877:
--------------------------------------

I think this fix has created a race condition, which I am encountering about 50% of the time using Spark 1.6.3.

I have configured YARN not to keep *any* recent jobs in memory, as some of my jobs get pretty large.

yarn-site
    yarn.resourcemanager.max-completed-applications 0

The once-per-second call to getApplicationReport may thus encounter a RUNNING application followed by a "not found" application, and report a false negative: the job succeeded, but the client reports it as killed.

(typical) Executor log:

17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at http://10.0.0.168:37046
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/09 19:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

Client log:

17/01/09 19:31:23 INFO Client: Application report for application_1483983939941_0056 (state: RUNNING)
17/01/09 19:31:24 ERROR Client: Application application_1483983939941_0056 not found.
Exception in thread "main" org.apache.spark.SparkException: Application application_1483983939941_0056 is killed
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1038)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
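
The race described in this comment can be modelled without a cluster. What follows is only a minimal sketch, not Spark's Client code: Report, Outcome and PollSketch.outcome are hypothetical names invented for illustration. It assumes one possible policy, in which a "not found" report that follows a RUNNING report is treated as unknown instead of as a kill, and a "not found" report with no earlier RUNNING report is treated as a failure.

```scala
// Hypothetical model of a once-per-second polling loop.
// None of these types are Spark or YARN APIs.
sealed trait Report
case object Running extends Report   // application still executing
case object Finished extends Report  // final status SUCCEEDED
case object Failed extends Report    // final status FAILED or KILLED
case object NotFound extends Report  // ResourceManager no longer knows the application

sealed trait Outcome
case object Success extends Outcome
case object Failure extends Outcome
case object Unknown extends Outcome

object PollSketch {
  /** Walks the reports in polling order and stops at the first terminal one. */
  @annotation.tailrec
  def outcome(reports: List[Report], sawRunning: Boolean = false): Outcome =
    reports match {
      case Nil              => Unknown                 // polling ended without a final report
      case Running :: rest  => outcome(rest, sawRunning = true)
      case Finished :: _    => Success
      case Failed :: _      => Failure
      // Assumed policy: the application may have finished and been evicted
      // between two polls, so "not found" after RUNNING is not proof of a kill.
      case NotFound :: _    => if (sawRunning) Unknown else Failure
    }
}
```

With the sequence from the client log above, PollSketch.outcome(List(Running, NotFound)) yields Unknown, which a caller could resolve by other means (for example, a history server) before choosing an exit code.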
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014683#comment-15014683 ]

sam commented on SPARK-3877:
----------------------------

Actually ignore, as per comment in duplicate, can't seem to reproduce.
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011690#comment-15011690 ]

sam commented on SPARK-3877:
----------------------------

Is this really fixed?? I'm getting this on 1.5.0 using EMR.

[~tgraves] [~vanzin] [~zsxwing]
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180628#comment-14180628 ]

Apache Spark commented on SPARK-3877:
-------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/2748
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174999#comment-14174999 ]

Thomas Graves commented on SPARK-3877:
--------------------------------------

[~vanzin] I agree. The user code should be exiting with non-zero or throwing on failure. If it isn't, then there is nothing we can do about it, other than tell users to change their code to exit properly if they want to see the failure status. Perhaps we should better document what they should do on failure too.

It's basically the same as what I did for the exit codes in ApplicationMaster. It relies on user code exiting non-zero or throwing.

The only other option would be for us to look at the details in the scheduler ourselves to try to determine what happened, i.e. we see that stage X failed or Y tasks failed, etc. I would say we do that later if it's needed.

The exit code of spark-submit is still 0 when an yarn application fails
-----------------------------------------------------------------------

Key: SPARK-3877
URL: https://issues.apache.org/jira/browse/SPARK-3877
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Shixiong Zhu
Priority: Minor
Labels: yarn

When an yarn application fails (yarn-cluster mode), the exit code of spark-submit is still 0. It's hard for people to write some automatic scripts to run spark jobs in yarn because the failure can not be detected in these scripts.
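
The advice in this comment, that user code should exit non-zero or throw on failure, might look roughly like the sketch below. It is an illustration under stated assumptions, not documented Spark guidance: JobMain, runJob, exitCodeFor and the --fail flag are hypothetical names, and the Spark-specific parts (creating the SparkContext and calling sc.stop() in a finally block) are left out so the example stays self-contained.

```scala
import scala.util.control.NonFatal

object JobMain {
  // Hypothetical stand-in for the real job body; not a Spark API.
  def runJob(fail: Boolean): Unit =
    if (fail) throw new RuntimeException("job failed")

  /** Runs the job and maps its result to a process exit code. */
  def exitCodeFor(job: () => Unit): Int =
    try {
      job()
      0
    } catch {
      // Log the failure, but still surface it through the exit code
      // instead of swallowing it.
      case NonFatal(e) =>
        System.err.println(s"Job failed: ${e.getMessage}")
        1
    }

  def main(args: Array[String]): Unit = {
    val code = exitCodeFor(() => runJob(fail = args.contains("--fail")))
    // Only call exit on failure, so a successful run returns normally.
    if (code != 0) sys.exit(code)
  }
}
```

Rethrowing the exception from main() after logging it would serve the same purpose; the point is only that the failure must leave main() in a form the launcher can observe.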
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174507#comment-14174507 ]

Marcelo Vanzin commented on SPARK-3877:
---------------------------------------

[~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does.

For example, a job like this would report an error today:

{code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw new MyJobFailedException()
} finally {
  sc.stop()
}
{code}

But this one wouldn't:

{code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw new MyJobFailedException()
} catch {
  // The exception is swallowed here, so main() returns normally.
  case e: Exception => logError("Oops, something bad happened.", e)
} finally {
  sc.stop()
}
{code}

yarn-client mode will abruptly stop the SparkContext when the YARN app fails. But depending on how the user's main() deals with errors, that still may not result in a non-zero exit status.
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164993#comment-14164993 ]

Apache Spark commented on SPARK-3877:
-------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/2732
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165124#comment-14165124 ]

Thomas Graves commented on SPARK-3877:
--------------------------------------

This looks like a dup of SPARK-2167. Or actually perhaps a subset of that, since I think you only handle the YARN mode. Does this cover both client and cluster mode?