[jira] [Commented] (YARN-8609) NM oom because of large container statuses
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573232#comment-16573232 ]

Jason Lowe commented on YARN-8609:
----------------------------------

This JIRA does mention all those things, and it now points to YARN-3998 as the fix (I just linked the two JIRAs). However, if we resolve it as fixed with a patch that only truncates individual diagnostic messages, that will not prevent an OOM when something adds a large number of separate diagnostic messages to a container. Such a patch would be a partial fix for the OOM, while YARN-3998 is a complete fix.

> NM oom because of large container statuses
> -------------------------------------------
>
>                 Key: YARN-8609
>                 URL: https://issues.apache.org/jira/browse/YARN-8609
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Xianghao Lu
>            Priority: Major
>         Attachments: YARN-8609.001.patch, contain_status.jpg, oom.jpeg
>
> Sometimes the NodeManager will send large container statuses to the
> ResourceManager when it starts up with recovery; as a result, the
> NodeManager can fail to start because of an OOM.
> In my case, the container statuses totaled 135M across 11 container
> statuses, and I found that the diagnostics of 5 of those containers were
> very large (27M), so I truncated the container diagnostics as in the patch.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572565#comment-16572565 ]

Xianghao Lu commented on YARN-8609:
-----------------------------------

{quote}
Those looking for a JIRA and finding the summary matching their symptoms should be directed to YARN-3998, since that alone is sufficient to address that problem.
{quote}
YARN-3998 did solve the problem. However, I was worried that YARN-3998 does not mention OOM, overly large diagnostic info, large container statuses, etc. in its summary or description.

Closed as a duplicate.
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571732#comment-16571732 ]

Jason Lowe commented on YARN-8609:
----------------------------------

bq. Indeed, it would not take up too much memory if running with YARN-3998.

Then I propose this be closed as a duplicate. Those looking for a JIRA and finding a summary matching their symptoms should be directed to YARN-3998, since that alone is sufficient to address the problem.

bq. if we do truncation in for loop, all kinds of diagnostic info will retain. This is what I want to say and it is a small improvement.

We can add the ability to truncate individual diagnostic messages in a separate improvement JIRA. However, as I mentioned above, 5000 may be too small a default, since it could end up truncating a critical "Caused by" towards the end of a large stack trace.
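To illustrate the concern above: a truncation that keeps only the head of a long stack trace drops the trailing "Caused by" clause, while keeping the tail (as the whole-buffer truncation in YARN-3998 does) preserves it. A minimal hypothetical sketch, with a made-up helper name that does not appear in the YARN codebase:

```java
// Hypothetical helper illustrating tail-keeping truncation; the class
// and method names are invented for this example.
public final class TailKeep {
    private TailKeep() {}

    // Keep at most 'limit' characters from the END of the message, so a
    // trailing "Caused by" in a stack trace survives truncation.
    public static String keepTail(String msg, int limit) {
        return msg.length() <= limit ? msg : msg.substring(msg.length() - limit);
    }
}
```

With a head-keeping truncation the "Caused by" line would be the first thing lost; tail-keeping loses the outermost exception message instead, which is usually the less diagnostic half.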
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571033#comment-16571033 ]

Xianghao Lu commented on YARN-8609:
-----------------------------------

Thank you for your patience! I see what you mean; my Hadoop version is 2.7.2, which does not contain the change from YARN-3998. Indeed, it would not take up too much memory if running with YARN-3998. However, if the raw diagnostics are largeExceptionMessage + fixedString + fixedString + ..., for example, all of the meaningful fixedString parts will be discarded. If we instead do the truncation inside the for loop, every kind of diagnostic info will be retained. That is what I wanted to say, and it is a small improvement.
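The per-message idea described above could be sketched as follows (a simplified illustration with invented names, not the actual patch or a YARN API): each message is capped individually inside the loop, so short fixed strings appended after a huge exception message are not pushed out by a whole-buffer cap.

```java
// Hypothetical sketch of per-message truncation; the class, method, and
// limit parameter are invented for this example.
public final class PerMessageTruncation {
    private PerMessageTruncation() {}

    // Cap each message individually so a single huge exception message
    // cannot crowd out the short fixed strings appended after it.
    public static String addAll(int perMessageLimit, String... messages) {
        StringBuilder sb = new StringBuilder();
        for (String msg : messages) {
            if (msg.length() > perMessageLimit) {
                sb.append(msg, 0, perMessageLimit).append("...[truncated]");
            } else {
                sb.append(msg);
            }
        }
        return sb.toString();
    }
}
```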
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570805#comment-16570805 ]

Jason Lowe commented on YARN-8609:
----------------------------------

bq. As far as I know, there are two kinds of diagnostics info, one is fixed string, such as "Container is killed before being launched.\n", the other is exception message which may be very large, so I think we should just truncate exception message rather than the entire string made by for loop.

There should be only one way to store/update a container's diagnostics for recovery, and that's NMStateStoreService#storeContainerDiagnostics. That method does not append but replaces the diagnostics. The only caller of that method is ContainerImpl#addDiagnostics, which after YARN-3998 trims the diagnostics to the maximum configured length, keeping the most recently added characters. The for loop is just for adding all the messages, since the method is implemented with variable arguments. The most memory this method could take is diagnosticsMaxSize + size_of_new_diagnostics, which is then truncated to diagnosticsMaxSize at the end. It will not persist, either in memory or in the state store, diagnostics beyond diagnosticsMaxSize.

If you're not running with YARN-3998 in your build, then it appears the necessary changes are already addressed by YARN-3998. It certainly looks as if that JIRA should have addressed your issue if it's configured with the default or another reasonable limit. Are you running on a version that contains that change? If so, I'm wondering how you were able to get a 27MB diagnostic message into the state store.
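The truncation behavior described above can be sketched roughly like this (a simplified stand-in with invented names, not the actual ContainerImpl or NMStateStoreService code): each varargs call appends every message, then trims from the front so that only the most recently added characters, up to the configured limit, are kept.

```java
// Hypothetical sketch of the YARN-3998-style truncation described in the
// comment above; class and field names are made up for this example.
public class DiagnosticsSketch {
    private final int diagnosticsMaxSize;   // the configured maximum length
    private final StringBuilder diagnostics = new StringBuilder();

    public DiagnosticsSketch(int diagnosticsMaxSize) {
        this.diagnosticsMaxSize = diagnosticsMaxSize;
    }

    // Varargs, like the for loop described above: append all messages,
    // then delete from the front so at most diagnosticsMaxSize chars
    // (the most recently added ones) remain. Peak memory is therefore
    // diagnosticsMaxSize + the size of the newly added messages.
    public void addDiagnostics(String... messages) {
        for (String msg : messages) {
            diagnostics.append(msg);
        }
        if (diagnostics.length() > diagnosticsMaxSize) {
            diagnostics.delete(0, diagnostics.length() - diagnosticsMaxSize);
        }
    }

    public String getDiagnostics() {
        return diagnostics.toString();
    }
}
```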
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567706#comment-16567706 ]

Xianghao Lu commented on YARN-8609:
-----------------------------------

Thanks for your comment. I have updated my patch according to your suggestion, but I find that the same parameter (NM_CONTAINER_DIAGNOSTICS_MAXIMUM_SIZE) already exists in YARN-3998. As far as I know, there are two kinds of diagnostic info: one is a fixed string, such as "Container is killed before being launched.\n"; the other is an exception message, which may be very large. So I think we should truncate just the exception message rather than the entire string built by the for loop.
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565511#comment-16565511 ]

Jason Lowe commented on YARN-8609:
----------------------------------

Thanks for the report and patch! IMHO any truncation should not be tied to recovery, as the NM could OOM just tracking container diagnostics. Recovery involves reloading what was already in memory before the crash/restart: if the diagnostics of a container were 27M in the recovery file, then they were also 27M in the NM heap before it restarted. Recovery does take more memory than normal operation, and YARN-8242 and the work there will help reduce that load.

Rather than forcing a rather draconian truncation (27M to 5000 bytes is rather extreme), this should be a configurable setting applied when diagnostics are added to a container rather than upon recovery; see ContainerImpl#addDiagnostics. Otherwise reported container statuses will suddenly change when the NM restarts, and that runs counter to the goals of the NM recovery feature.
[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565195#comment-16565195 ]

Xianghao Lu commented on YARN-8609:
-----------------------------------

This seems to be related to https://issues.apache.org/jira/browse/YARN-2115. [~jianhe], would you like to review the patch?