[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-08 Jason Lowe (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573232#comment-16573232 ]

Jason Lowe commented on YARN-8609:
--

This JIRA does mention all those things, and now it points to YARN-3998 as the 
fix (I just linked the two JIRAs).  If we resolve it as fixed with a patch that 
only truncates individual diagnostic messages, that will not prevent an OOM if 
something adds a ton of separate diagnostic messages to a container.  It would 
be a partial fix to the OOM while YARN-3998 is a complete fix.
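To make the distinction concrete, here is a minimal sketch (hypothetical names, not the NM source) of why a per-message cap alone leaves the total unbounded while a YARN-3998-style total cap does not:

{code:java}
import java.util.Arrays;
import java.util.List;

public class DiagnosticsCapSketch {
  public static void main(String[] args) {
    List<String> messages = Arrays.asList("diag one...", "diag two...", "diag three...");
    StringBuilder diagnostics = new StringBuilder();
    final int perMessageLimit = 5000;
    final int totalLimit = 10000;

    // Per-message truncation alone: N messages can still grow the buffer
    // to N * perMessageLimit characters, so memory is unbounded in N.
    for (String msg : messages) {
      diagnostics.append(msg, 0, Math.min(msg.length(), perMessageLimit));
    }

    // A YARN-3998-style total cap bounds the buffer regardless of N,
    // keeping only the most recently appended characters.
    if (diagnostics.length() > totalLimit) {
      diagnostics.delete(0, diagnostics.length() - totalLimit);
    }
    System.out.println("final diagnostics length: " + diagnostics.length());
  }
}
{code}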


> NM oom because of large container statuses
> --
>
> Key: YARN-8609
> URL: https://issues.apache.org/jira/browse/YARN-8609
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Xianghao Lu
> Priority: Major
> Attachments: YARN-8609.001.patch, contain_status.jpg, oom.jpeg
>
>
> Sometimes, the NodeManager will send large container statuses to the 
> ResourceManager when it starts with recovery; as a result, the NodeManager 
> can fail to start because of an OOM.
> In my case, the container statuses totaled 135M across 11 container 
> statuses, and I found that the diagnostics of 5 of the containers were 
> very large (27M), so I truncate the container diagnostics in the patch.






[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-07 Xianghao Lu (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572565#comment-16572565 ]

Xianghao Lu commented on YARN-8609:
---

{quote}
Those looking for a JIRA and finding the summary matching their symptoms should 
be directed to YARN-3998 since that alone is sufficient to address that problem.
{quote}

YARN-3998 did solve the problem. However, I am worried that YARN-3998 does not 
mention OOM, too much diagnostic info, large container statuses, etc. in its 
summary or description.

Closed as a duplicate.




[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-07 Jason Lowe (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571732#comment-16571732 ]

Jason Lowe commented on YARN-8609:
--

bq. Indeed, it would not take up too much memory if running with YARN-3998.

Then I propose this be closed as a duplicate.  Those looking for a JIRA and 
finding the summary matching their symptoms should be directed to YARN-3998 
since that alone is sufficient to address that problem.

bq. if we do truncation in the for loop, all kinds of diagnostic info will be 
retained. This is what I want to say, and it is a small improvement.

We can add the ability to truncate individual diagnostic messages in a separate 
improvement JIRA. However, as I mentioned above, 5000 may be too small a 
default since it could end up truncating a critical "Caused by" towards the end 
of a large stack trace.
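If that follow-up JIRA is filed, one way to avoid losing a trailing "Caused by" would be to keep both the head and the tail of each oversized message instead of only the head. A hedged sketch, with a hypothetical helper name:

{code:java}
// Hypothetical helper, not existing NM code: truncate a single diagnostic
// message while keeping its beginning and end, so that a "Caused by" near
// the bottom of a long stack trace survives. The omission marker adds a
// few characters beyond the limit.
static String truncateDiagnosticMessage(String msg, int limit) {
  if (msg.length() <= limit) {
    return msg;
  }
  int head = limit / 2;
  int tail = limit - head;
  return msg.substring(0, head)
      + "\n... [" + (msg.length() - limit) + " characters omitted] ...\n"
      + msg.substring(msg.length() - tail);
}
{code}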




[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-06 Xianghao Lu (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571033#comment-16571033 ]

Xianghao Lu commented on YARN-8609:
---

Thank you for your patience!
I understand what you're saying, and my Hadoop version is 2.7.2, which doesn't 
contain the change from YARN-3998. Indeed, it would not take up too much memory 
if running with YARN-3998.
However, if the raw diagnostic is, for example, largeExceptionMessage + 
fixedString + fixedString + ..., all of the meaningful fixedStrings will be 
discarded. So, if we do truncation in the for loop, all kinds of diagnostic 
info will be retained. This is what I want to say, and it is a small improvement.




[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-06 Jason Lowe (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570805#comment-16570805 ]

Jason Lowe commented on YARN-8609:
--

bq. As far as I know, there are two kinds of diagnostics info: one is a fixed 
string, such as "Container is killed before being launched.\n"; the other is an 
exception message, which may be very large. So I think we should just truncate 
the exception message rather than the entire string built by the for loop.

There should be only one way to store/update a container's diagnostics for 
recovery, and that's NMStateStoreService#storeContainerDiagnostics. That 
method does not append but replaces the diagnostics. The only caller of that 
method is ContainerImpl#addDiagnostics, which after YARN-3998 trims the 
diagnostics to the maximum configured length, keeping the most recently added 
characters. The for loop is just for adding all the messages, since the method 
is implemented with variable arguments. The most memory this method could take 
is diagnosticsMaxSize + size_of_new_diagnostics, which is then truncated to 
diagnosticsMaxSize at the end. It will not persist, either in memory or in the 
state store, diagnostics beyond diagnosticsMaxSize.
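A condensed sketch of that flow as described above (paraphrased; field and method shapes are approximate, not copied from the Hadoop source):

{code:java}
// Paraphrased sketch of ContainerImpl#addDiagnostics after YARN-3998,
// based on the description in this comment rather than the actual code.
private final StringBuilder diagnostics = new StringBuilder();
private long diagnosticsMaxSize; // the maximum configured length

private void addDiagnostics(String... diags) {
  // The for loop exists only because of the variable arguments.
  for (String msg : diags) {
    diagnostics.append(msg);
  }
  // Transient peak is at most diagnosticsMaxSize + size_of_new_diagnostics;
  // trimming keeps the most recently added characters.
  if (diagnostics.length() > diagnosticsMaxSize) {
    diagnostics.delete(0, (int) (diagnostics.length() - diagnosticsMaxSize));
  }
  // The result is then persisted via
  // NMStateStoreService#storeContainerDiagnostics, which replaces
  // (not appends) the stored diagnostics.
}
{code}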

If you're not running with YARN-3998 in your build, then it appears the 
necessary changes are already covered by YARN-3998. It certainly looks as if 
that JIRA should have addressed your issue if it's configured with the default 
or a reasonable limit. Are you running on a version that contains that change? 
If so, then I'm wondering how you were able to get a 27MB diagnostic message 
into the state store.





[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-02 Xianghao Lu (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567706#comment-16567706 ]

Xianghao Lu commented on YARN-8609:
---

Thanks for your comment.
I have updated my patch according to your suggestion, but I find there is the 
same parameter (NM_CONTAINER_DIAGNOSTICS_MAXIMUM_SIZE) in YARN-3998.
As far as I know, there are two kinds of diagnostics info: one is a fixed 
string, such as "Container is killed before being launched.\n"; the other is an 
exception message, which may be very large. So I think we should just truncate 
the exception message rather than the entire string built by the for loop.




[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-01 Jason Lowe (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565511#comment-16565511 ]

Jason Lowe commented on YARN-8609:
--

Thanks for the report and patch!

IMHO any truncation should not be tied to recovery, as the NM could OOM just 
tracking container diagnostics.  Recovery involves reloading what was already 
in memory before the crash/restart.  If the diagnostics of a container were 27M 
in the recovery file then that means it was 27M in the NM heap before it 
recovered as well.

Recovery does take more memory than normal operation, and the work in 
YARN-8242 will help reduce that load. Rather than forcing a draconian 
truncation (27M to 5000 bytes is rather extreme), this should be a configurable 
setting, applied when diagnostics are added to a container rather than upon 
recovery. See ContainerImpl#addDiagnostics. Otherwise reported container 
statuses will suddenly change when the NM restarts, and that is counter to the 
goals of the NM recovery feature.
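For reference, a hedged sketch of what configuring such a limit might look like, assuming the NM_CONTAINER_DIAGNOSTICS_MAXIMUM_SIZE constant mentioned elsewhere in this thread lives in YarnConfiguration (verify the property name and default against your Hadoop version):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DiagnosticsLimitSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Assumption: the constant referenced in this thread maps to a
    // yarn.nodemanager.* property; the value below (in characters) is
    // illustrative, not a recommendation.
    conf.set(YarnConfiguration.NM_CONTAINER_DIAGNOSTICS_MAXIMUM_SIZE, "10000");
  }
}
{code}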





[jira] [Commented] (YARN-8609) NM oom because of large container statuses

2018-08-01 Xianghao Lu (JIRA)


[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565195#comment-16565195 ]

Xianghao Lu commented on YARN-8609:
---

Seems to be related to https://issues.apache.org/jira/browse/YARN-2115. 
[~jianhe], would you like to review the patch?
