Prabhu Joseph created SPARK-13182:
-------------------------------------

             Summary: Spark Executor retries infinitely
                 Key: SPARK-13182
                 URL: https://issues.apache.org/jira/browse/SPARK-13182
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.5.2
            Reporter: Prabhu Joseph
            Priority: Minor
             Fix For: 1.5.2

When a Spark job (Spark 1.5.2) is submitted with a single executor and the user passes invalid JVM arguments in spark.executor.extraJavaOptions, the first executor fails. The job then keeps retrying, creating a new executor that fails every time, until CTRL-C is pressed.

    ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" /SPARK/SimpleApp.jar

Here the user submits the job with ConcGCThreads=16, which is greater than ParallelGCThreads, so the executor JVM crashes. But the job does not exit; it keeps creating executors and retrying.

..........
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now RUNNING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2846 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2846
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2847 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2847
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now RUNNING

Spark should not fall into a trap on this kind of user error on a production cluster.
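
One way to read the request in this report is as a cap on back-to-back executor failures. The following is a minimal Python sketch of that idea only. It is not Spark's scheduler code, and every name in it (`run_with_retry_cap`, `launch_executor`, `max_consecutive_failures`, `ExecutorRetryLimitExceeded`) is hypothetical, introduced here for illustration. The default cap of 10 is likewise an arbitrary assumption.

```python
from typing import Callable


class ExecutorRetryLimitExceeded(RuntimeError):
    """Raised when too many executors fail in a row (hypothetical name)."""


def run_with_retry_cap(launch_executor: Callable[[int], int],
                       max_consecutive_failures: int = 10) -> int:
    """Relaunch executors until one exits cleanly or the cap is reached.

    launch_executor is a stand-in for "start an executor and wait for it":
    it receives an executor ID and returns the process exit code.
    Returns the ID of the first executor that exits with code 0.
    """
    executor_id = 0
    failures = 0
    while True:
        exit_code = launch_executor(executor_id)
        if exit_code == 0:
            return executor_id
        failures += 1
        if failures >= max_consecutive_failures:
            # Give up instead of looping forever, as in the log above.
            raise ExecutorRetryLimitExceeded(
                f"{failures} executors failed in a row; "
                f"last exit code was {exit_code}"
            )
        executor_id += 1
```

With a launcher that always returns exit code 1, as the misconfigured JVM does in the log, this loop raises after `max_consecutive_failures` attempts rather than creating executors until CTRL-C is pressed.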