[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2024-01-18 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274
 ] 

Adam Binford commented on YARN-4771:


{quote}
{quote}However it may have issues with very long-running apps that churn a lot 
of containers, since the container state won't be released until the 
application completes.{quote}
This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not as NM memory usage. We have 
run-forever Spark Structured Streaming applications that use dynamic allocation 
to grab resources when they need them. After restarting our Node Managers, the 
recovery process can end up DoS'ing our Resource Manager, especially if we 
restart a large number of nodes at once, since there can be thousands of 
tracked "completed" containers. We're also seeing the servers running the Node 
Managers sometimes die during the recovery process. It seems like there are 
multiple issues here, but they mostly stem from keeping every container for an 
active application in the state store for all time:
 * As part of the recovery process, the NM seems to send a "container released" 
message to the RM, which the RM just logs as "Thanks, I don't know what this 
container is though". This is what can DoS the RM.
 * On the NM itself, part of the recovery process seems to actually try to 
allocate resources for completed containers, causing the server to run out of 
memory. We've only seen this a couple of times, so we're still trying to track 
down exactly what's happening. Our metrics show spikes of up to 100x the 
resources the NM actually has (i.e. the NM reports terabytes of memory 
allocated, while the node only has ~300 GiB of memory). The metrics might just 
be a harmless side effect of the recovery process, but the nodes dying is 
what's concerning.

I'm still trying to track down all the moving pieces here, as tracing through 
the event-passing system isn't easy. So far I've only tracked down why 
containers are never removed from the state store until an application 
finishes. We use rolling log aggregation, so I'm currently looking at whether 
we can use that mechanism to release containers from the state store once their 
logs have been aggregated. But this would also be a non-issue if I could figure 
out why the other issues are happening and how to prevent them.
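
As a rough sketch of the idea (not a patch - NMStateStoreService#removeContainer 
is the existing state-store call I'd expect to use, but the hook and how it 
would be wired into the rolling-aggregation cycle are assumptions on my part):

{code:java}
import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService;

// Sketch of the idea only: once a rolling-aggregation cycle has uploaded the
// logs for a set of finished containers, drop those containers from the NM
// state store so they no longer have to be kept (and replayed) across an NM
// restart.
class ReleaseAggregatedContainers {
  static void onRollingCycleFinished(Set<ContainerId> aggregatedContainers,
      NMStateStoreService stateStore) throws IOException {
    for (ContainerId containerId : aggregatedContainers) {
      // Assumption: removing the record here is safe because the container's
      // logs are already in the aggregated log file, so a later restart no
      // longer needs to recover this container for log aggregation.
      stateStore.removeContainer(containerId);
    }
  }
}
{code}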

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart






[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-27 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165736#comment-17165736
 ] 

Jim Brennan commented on YARN-4771:
---

Thanks [~ebadger]!

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart






[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-24 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164670#comment-17164670
 ] 

Hudson commented on YARN-4771:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18470 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18470/])
YARN-4771. Some containers can be skipped during log aggregation after 
(ebadger: rev ac5f21dbef0f0ad4210e4027f53877760fa606a5)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java


> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart






[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163705#comment-17163705
 ] 

Jim Brennan commented on YARN-4771:
---

I think this is ready for review.
cc: [~epayne], [~ebadger]

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart






[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162728#comment-17162728
 ] 

Jim Brennan commented on YARN-4771:
---

I don't think the TestFederationInterceptor unit test failure is related to 
this change.

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart






[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-21 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162385#comment-17162385
 ] 

Hadoop QA commented on YARN-4771:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 22s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
20s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
17s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 0 new + 121 unchanged - 1 fixed = 121 total (was 122) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 36s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
21s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 22m 17s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 81m 55s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26300/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-4771 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13008118/YARN-4771.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 388d1e38b552 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / d23cc9d85d8 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | 
https:/

[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-04-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090800#comment-17090800
 ] 

Eric Payne commented on YARN-4771:
--

bq. This is going to be problematic, impacting NM memory usage.
My company has been using this feature for 4 years in production and we have 
not seen problems with it.
bq. I think the right solution is to decouple log-aggregation state completely 
from the rest of the container-state, and persist that separately in 
state-store etc irrespective of container / application state.
Perhaps, but that would be a somewhat invasive re-architecture. I think that 
can be addressed in a different JIRA.
I would like to see this JIRA (YARN-4771) committed into YARN for branches 2.10 
to trunk.
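
To illustrate the decoupling being discussed, a purely hypothetical sketch - 
nothing like these interfaces exists in the NM today:

{code:java}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical sketch of the decoupling idea quoted above: keep a per-container
// log-aggregation record in the state store that survives independently of the
// container's own lifecycle state, and clear it only once the container's logs
// have actually been uploaded.
interface LogAggregationStateStore {
  /** Record that this container's logs still need to be aggregated. */
  void storeLogAggregationPending(ContainerId containerId) throws IOException;

  /** Clear the record once the container's logs have been uploaded. */
  void removeLogAggregationPending(ContainerId containerId) throws IOException;
}
{code}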

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.2
>Reporter: Jason Darrell Lowe
>Priority: Major
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart






[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2016-04-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231386#comment-15231386
 ] 

Junping Du commented on YARN-4771:
--

002 patch LGTM.
An additional fix: we'd be better off using a monotonic clock to replace 
System.currentTimeMillis() for tracking the timeout - just an optional comment; 
we can address it here or in a separate jira.
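
For illustration, a minimal sketch of that suggestion using Hadoop's 
org.apache.hadoop.util.Time#monotonicNow (the class and field names below are 
illustrative, not the actual NM code):

{code:java}
import org.apache.hadoop.util.Time;

// Minimal sketch of the suggestion: track the expiry of a stopped container
// against a monotonic clock instead of the wall clock, so NTP or manual clock
// adjustments can't shorten or extend the tracking window the way they can
// with System.currentTimeMillis().
class StoppedContainerTracker {
  private final long durationToTrackMs;
  private final long stoppedAtMonotonicMs;

  StoppedContainerTracker(long durationToTrackMs) {
    this.durationToTrackMs = durationToTrackMs;
    this.stoppedAtMonotonicMs = Time.monotonicNow();
  }

  boolean isExpired() {
    return Time.monotonicNow() - stoppedAtMonotonicMs > durationToTrackMs;
  }
}
{code}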

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.2
>Reporter: Jason Lowe
>Priority: Critical
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart





[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2016-03-08 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184991#comment-15184991
 ] 

Jason Lowe commented on YARN-4771:
--

The problem occurs because removeVeryOldStoppedContainersFromCache will remove 
containers from the state store that have completed at least 
yarn.nodemanager.duration-to-track-stopped-containers milliseconds ago.  Once 
the container state is removed from the state store there's nothing to recover 
for that container when the NM restarts.  With no information about that 
container to recover, the log aggregation service doesn't know it needs to 
aggregate the logs for that container, so the container is skipped during log 
aggregation.
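
To make the failure mode concrete, the pruning is essentially of this shape (a 
simplified sketch of the check described above, with illustrative names rather 
than the actual NodeStatusUpdaterImpl code):

{code:java}
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerId;

// Simplified sketch of the pruning described above. The map holds completed
// containers with the wall-clock time at which they should stop being tracked
// (completion time plus yarn.nodemanager.duration-to-track-stopped-containers).
// Once that time passes, the container is dropped from both the cache and the
// recovery state store, so a later NM restart has nothing left to tell the log
// aggregation service about it.
class StoppedContainerPruner {

  // Stand-in for the NM state store, to keep the sketch self-contained.
  interface NmStateStore {
    void removeContainer(ContainerId containerId);
  }

  void removeVeryOldStoppedContainers(
      Map<ContainerId, Long> recentlyStoppedContainers, NmStateStore stateStore) {
    long now = System.currentTimeMillis();
    Iterator<Map.Entry<ContainerId, Long>> it =
        recentlyStoppedContainers.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<ContainerId, Long> entry = it.next();
      if (entry.getValue() < now) {
        // Removing the record is what makes the container invisible to log
        // aggregation after a work-preserving restart.
        it.remove();
        stateStore.removeContainer(entry.getKey());
      }
    }
  }
}
{code}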

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.2
>Reporter: Jason Lowe
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart


