Steven Rand created YARN-8903:
---------------------------------

             Summary: when NM becomes unhealthy due to local disk usage, have 
option to kill application using most space instead of releasing all containers 
on node
                 Key: YARN-8903
                 URL: https://issues.apache.org/jira/browse/YARN-8903
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 3.1.1
            Reporter: Steven Rand


We sometimes experience an issue in which a single application, usually a Spark 
job, causes at least one node in a YARN cluster to become unhealthy by filling 
up the local dir(s) on that node past the threshold at which the node is 
considered unhealthy.

When this happens, the impact is potentially large depending on what else is 
running on that node, as all containers on that node are lost. Sometimes not 
much else is running on the node and it's fine, but other times we lose AM 
containers from other apps and/or non-AM containers with long-running tasks.

I thought that it would be helpful to add an option (default false) whereby if 
a node is going to become unhealthy due to full local disk(s), it instead 
identifies the application that's using the most local disk space on that node, 
and kills that application. (Roughly analogous to how the OOM killer in Linux 
picks one process to kill rather than letting the machine crash.)

The benefit is that only one application is impacted, and no other application 
loses any containers. This prevents one user's poorly written code that 
shuffles/spills huge amounts of data from negatively impacting other users.

The downside is that we're killing the entire application, not just the task(s) 
responsible for the local disk usage. I believe it's necessary to kill the 
whole application instead of identifying the container running the relevant 
task(s), because doing so would require more knowledge of the internal state of 
aux services responsible for shuffling than what YARN has according to my 
understanding.

If this seems reasonable, I can work on the implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to