Prabhu Joseph created SPARK-13182:
-------------------------------------

             Summary: Spark Executor retries infinitely
                 Key: SPARK-13182
                 URL: https://issues.apache.org/jira/browse/SPARK-13182
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.5.2
            Reporter: Prabhu Joseph
            Priority: Minor
             Fix For: 1.5.2


When a Spark job (Spark 1.5.2) is submitted with a single executor and the user
passes invalid JVM arguments through spark.executor.extraJavaOptions, the first
executor fails. The job, however, keeps retrying: it creates a new executor
that fails every time, until CTRL-C is pressed.

./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" \
  /SPARK/SimpleApp.jar
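Because every GC option travels inside a single --conf value, it helps to see which HotSpot -XX flags each executor actually receives. The following is a minimal sketch that assumes shell-style splitting of the option string. `parse_xx_flags` is a hypothetical helper and not a Spark API.

```python
import shlex


def parse_xx_flags(java_options: str) -> dict:
    """Parse HotSpot -XX flags out of an extraJavaOptions-style string.

    -XX:+Name -> True, -XX:-Name -> False, -XX:Name=value -> "value".
    Tokens that are not -XX flags (e.g. -Xmx2g) are ignored.
    Assumption: the string is split like a shell command line.
    """
    flags = {}
    for token in shlex.split(java_options):
        if not token.startswith("-XX:"):
            continue
        body = token[len("-XX:"):]
        if body.startswith("+"):
            flags[body[1:]] = True
        elif body.startswith("-"):
            flags[body[1:]] = False
        elif "=" in body:
            name, value = body.split("=", 1)
            flags[name] = value
    return flags


# The option string from the spark-submit command in this report.
opts = ("-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC "
        "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16")
print(parse_xx_flags(opts))
```

Note that ParallelGCThreads does not appear in the result, so the executor JVM falls back to its own default for it.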

Here the user submits the job with ConcGCThreads=16, which is greater than
ParallelGCThreads, so the executor JVM crashes. But the job does not exit; it
keeps creating new executors and retrying.
..........
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now RUNNING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2846 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2846
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2847 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2847
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now RUNNING
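The log shows one executor after another exiting with code 1 for the same application. The following is a minimal sketch of a watchdog that counts such exits per application in a driver log where each entry sits on a single line. The regular expression targets the AppClient$ClientEndpoint lines quoted in this report. `exited_executor_counts`, `stuck_apps`, and the default threshold of 3 are hypothetical choices, not Spark features.

```python
import re
from collections import Counter

# Targets single-line entries such as:
#   ... AppClient$ClientEndpoint: Executor updated:
#   app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
EXITED = re.compile(r"Executor updated: (?P<app>app-[\d-]+)/\d+ is now EXITED")


def exited_executor_counts(lines) -> Counter:
    """Count 'is now EXITED' executor updates per application ID."""
    counts = Counter()
    for line in lines:
        match = EXITED.search(line)
        if match:
            counts[match.group("app")] += 1
    return counts


def stuck_apps(lines, threshold: int = 3) -> list:
    """Return IDs of applications with at least `threshold` exited executors."""
    counts = exited_executor_counts(lines)
    return sorted(app for app, n in counts.items() if n >= threshold)


# Three entries taken from the log in this report.
log = [
    "16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: "
    "app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)",
    "16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: "
    "app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)",
    "16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: "
    "app-20160201065319-0014/2848 is now RUNNING",
]
print(exited_executor_counts(log))
print(stuck_apps(log, threshold=2))
```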

Spark should not get stuck in an endless retry loop because of this kind of
user error on a production cluster.
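One way to avoid the trap is to cap consecutive executor failures and fail the application once the cap is reached. The following is a minimal sketch of such a policy. It is not Spark's actual standalone Master logic, and `ExecutorRetryPolicy`, `simulate`, and the default cap of 10 are hypothetical. In the log above, executor 2846 reports RUNNING before it exits, so the sketch resets the counter only on real progress (a completed task) and not on a RUNNING update.

```python
class ExecutorRetryPolicy:
    """Hypothetical policy: stop relaunching executors after too many
    consecutive failures instead of retrying forever."""

    def __init__(self, max_consecutive_failures: int = 10):
        if max_consecutive_failures < 1:
            raise ValueError("max_consecutive_failures must be >= 1")
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record_executor_failure(self) -> None:
        self.consecutive_failures += 1

    def record_task_success(self) -> None:
        # Only real progress resets the counter; an executor that merely
        # reports RUNNING and then exits does not.
        self.consecutive_failures = 0

    def should_relaunch(self) -> bool:
        return self.consecutive_failures < self.max_consecutive_failures


def simulate(policy: ExecutorRetryPolicy, outcomes) -> str:
    """Replay launch outcomes (True = executor completed a task,
    False = executor exited with an error) and return the app's state."""
    for completed_task in outcomes:
        if completed_task:
            policy.record_task_success()
            return "RUNNING"
        policy.record_executor_failure()
        if not policy.should_relaunch():
            return "FAILED"
    return "RETRYING"


# Every launch fails, as with -XX:ConcGCThreads=16 in this report.
print(simulate(ExecutorRetryPolicy(max_consecutive_failures=3), [False] * 100))
```

The sketch leaves one design choice open: whether an executor that stays alive for some minimum time should also reset the counter.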



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
