[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135938#comment-16135938 ]

Steven Rand commented on YARN-4227:
-----------------------------------

I'm seeing a similar issue on what's roughly branch-2 (CDH 5.11.0), with the 
error being:

{code}
2017-06-27 16:32:39,381 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Preemption 
Timer,5,main] threw an Exception.
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:687)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread$PreemptContainersTask.run(FSPreemptionThread.java:230)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
{code}

This error, which causes the FSPreemptionThread to die and thereby crashes the 
RM, seems to be correlated with NodeManagers being marked unhealthy due to a lack 
of local disk space during large shuffles. I haven't confirmed this, but presumably 
the unhealthy nodes are removed while we're waiting for the lock and no longer 
exist by the time we call {{releaseContainer}}.
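
To illustrate what I suspect is happening, here is a minimal, self-contained 
sketch; it is not the actual FairScheduler code, and the class and method names 
are made up for illustration. The idea is that a node is removed from the 
scheduler's node map while a container-completion/preemption event is still 
pending, so the later lookup returns null and dereferencing it throws the NPE. 
A null check along these lines would at least keep the thread alive:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the suspected race, not actual YARN code.
// A node is removed from the scheduler's node map while a
// container-completed event is still queued; the later lookup
// returns null and dereferencing it throws the NPE.
public class NodeRemovalRaceSketch {

    // Stand-in for the scheduler's nodeId -> node tracking map.
    private static final Map<String, FakeNode> nodes = new ConcurrentHashMap<>();

    // Hypothetical node object, just enough state to dereference.
    private static class FakeNode {
        final String name;
        FakeNode(String name) { this.name = name; }
        void releaseContainer(String containerId) {
            System.out.println("Released " + containerId + " on " + name);
        }
    }

    // Unguarded handler: mirrors the failure mode in the stack trace above.
    static void completedContainerUnguarded(String nodeId, String containerId) {
        FakeNode node = nodes.get(nodeId);  // null if the node was already removed
        node.releaseContainer(containerId); // NPE here when node == null
    }

    // Guarded variant: one possible mitigation, log and drop the stale event.
    static void completedContainerGuarded(String nodeId, String containerId) {
        FakeNode node = nodes.get(nodeId);
        if (node == null) {
            System.out.println("Node " + nodeId
                + " already removed; ignoring stale event for " + containerId);
            return;
        }
        node.releaseContainer(containerId);
    }

    public static void main(String[] args) {
        nodes.put("nm-1", new FakeNode("nm-1"));

        // Simulate the NM being marked unhealthy and removed before the
        // queued container event is processed.
        nodes.remove("nm-1");

        completedContainerGuarded("nm-1", "container_x");   // handled cleanly
        completedContainerUnguarded("nm-1", "container_x"); // throws NullPointerException
    }
}
{code}

Whether silently dropping the stale event is the right behavior is a separate 
question; the point is only that the lookup can legitimately return null once 
the node is gone.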

I'm curious whether others are also seeing this on recent versions; if so, maybe 
this issue is worth reopening.

> FairScheduler: RM quits processing expired container from a removed node
> ------------------------------------------------------------------------
>
>                 Key: YARN-4227
>                 URL: https://issues.apache.org/jira/browse/YARN-4227
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.3.0, 2.5.0, 2.7.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed, causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_000012 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_000012 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_000012 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>    OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_000012
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.


