[
https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565511#comment-16565511
]
Jason Lowe commented on YARN-8609:
----------------------------------
Thanks for the report and patch!
IMHO any truncation should not be tied to recovery, as the NM could OOM just
tracking container diagnostics. Recovery involves reloading what was already
in memory before the crash/restart. If the diagnostics of a container were 27M
in the recovery file then that means it was 27M in the NM heap before it
recovered as well.
Recovery does take more memory than normal operation, and the work in
YARN-8242 will help reduce that load. Rather than forcing a rather
draconian truncation (27M down to 5000 bytes is extreme), the limit should
be a configurable setting applied when diagnostics are added to a container
rather than upon recovery; see ContainerImpl#addDiagnostics. Otherwise
reported container statuses will suddenly change when the NM restarts,
which is counter to the goals of the NM recovery feature.
> NM oom because of large container statuses
> ------------------------------------------
>
> Key: YARN-8609
> URL: https://issues.apache.org/jira/browse/YARN-8609
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Xianghao Lu
> Priority: Major
> Attachments: YARN-8609.001.patch, contain_status.jpg, oom.jpeg
>
>
> Sometimes the NodeManager sends large container statuses to the
> ResourceManager when it starts with recovery enabled; as a result, the
> NodeManager fails to start because of an OOM.
> In my case, the container statuses payload is 135M and contains 11
> container statuses. I found that the diagnostics of 5 containers are very
> large (27M), so I truncate the container diagnostics as in the patch.