[ 
https://issues.apache.org/jira/browse/SPARK-30529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30529:
---------------------------------
    Description: 
Currently, when you give the executor a bad configuration for accelerator-aware 
scheduling, the executor can die, but it's hard for the user to know why.  The 
executor logs what went wrong in its log files before it dies, but those logs 
are often hard to find because the executor hasn't registered yet.  Since it 
hasn't registered, the executor doesn't show up in the UI, where its log files 
could otherwise be viewed.

One specific example is a discovery script that doesn't find all the GPUs:

{code}
20/01/16 08:59:24 INFO YarnCoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.28.9.112:44403
20/01/16 08:59:24 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Resource: gpu, with addresses: 0 is less than what the user requested: 2)
 at scala.Predef$.require(Predef.scala:281)
 at org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1(ResourceUtils.scala:251)
 at org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1$adapted(ResourceUtils.scala:248)
{code}
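
As a rough illustration of the failure in the log above, the sketch below checks a discovery script's output before a job is submitted. It assumes the script prints JSON with a "name" field and an "addresses" array, which is the format Spark documents for resource discovery scripts. The function name check_discovery_output and its parameters are hypothetical and are not part of Spark; this is not how Spark performs the check internally.

```python
import json
from typing import List


def check_discovery_output(raw_output: str, resource_name: str, requested: int) -> List[str]:
    """Parse a discovery script's stdout and verify it reports enough addresses.

    Hypothetical helper for illustration only; it mirrors the requirement
    that failed in the executor log but is not Spark code.
    """
    info = json.loads(raw_output)

    # The script must describe the resource the user actually asked for.
    if info.get("name") != resource_name:
        raise ValueError(
            f"Expected resource '{resource_name}' but the script reported '{info.get('name')}'"
        )

    addresses = info.get("addresses", [])

    # Fail early if fewer addresses were discovered than requested.
    if len(addresses) < requested:
        raise ValueError(
            f"Resource: {resource_name}, with addresses: {','.join(addresses)} "
            f"is less than what the user requested: {requested}"
        )
    return addresses
```

Running a check like this against the script's output on a worker node would surface the mismatch (one GPU address found, two requested) without having to locate the logs of an executor that never registered.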

 

Figure out a better way of logging the error, or of letting the user know what 
error occurred, when the executor dies before registering.


> Improve error messages when Executor dies before registering with driver
> ------------------------------------------------------------------------
>
>                 Key: SPARK-30529
>                 URL: https://issues.apache.org/jira/browse/SPARK-30529
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Priority: Major


