[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274 ]

Adam Binford commented on YARN-4771:
------------------------------------

{quote}However it may have issues with very long-running apps that churn a lot of containers, since the container state won't be released until the application completes.{quote}
{quote}This is going to be problematic, impacting NM memory usage.{quote}

We just started encountering this issue, though not in the form of NM memory usage. We have run-forever Spark Structured Streaming applications that use dynamic allocation to grab resources when they need them. After restarting our Node Managers, the recovery process can end up DoS'ing our Resource Manager, especially if we restart a large number at once, since there can be thousands of tracked "completed" containers. We're also seeing the servers running the Node Managers sometimes die during the recovery process.

There seem to be multiple issues here, but they mostly stem from keeping all containers for active applications in the state store indefinitely:
* As part of the recovery process, the NM seems to send a "container released" message to the RM for each recovered container, which the RM just logs as "Thanks, I don't know what this container is though". This is what can DoS the RM.
* On the NM itself, part of the recovery process seems to try to allocate resources for completed containers, causing the server to run out of memory. We've only seen this a couple of times, so we're still trying to track down exactly what's happening. Our metrics show spikes of up to 100x the resources the NM actually has (i.e. the NM reports terabytes of allocated memory on a node with only ~300 GiB). The metrics might be a harmless side effect of the recovery process, but the nodes dying is what's concerning.

I'm still trying to track down all the moving pieces here, as traversing the event-passing system isn't easy to follow. So far I've only worked out why containers are never removed from the state store until an application finishes. We use rolling log aggregation, so I'm currently trying to see if we can use that mechanism to release containers from the state store once their logs have been aggregated (see the sketch after this message). This would also be a non-issue if I could figure out why the other problems are happening and how to prevent them.

> Some containers can be skipped during log aggregation after NM restart
> -----------------------------------------------------------------------
>
>                 Key: YARN-4771
>                 URL: https://issues.apache.org/jira/browse/YARN-4771
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.10.0, 3.2.1, 3.1.3
>            Reporter: Jason Darrell Lowe
>            Assignee: Jim Brennan
>            Priority: Major
>             Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
>         Attachments: YARN-4771.001.patch, YARN-4771.002.patch, YARN-4771.003.patch
>
> A container can be skipped during log aggregation after a work-preserving nodemanager restart if the following events occur:
> # Container completes more than yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the restart
> # At least one other container completes after the above container and before the restart
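A minimal sketch of the idea in the last paragraph of the comment above: release a completed container's state-store entry once its logs have been aggregated, instead of holding it until the application finishes. Everything here (class, interface, and method names) is invented for illustration and is not existing NodeManager code.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: tie state-store cleanup to log aggregation rather
// than application completion, so run-forever apps do not accumulate
// thousands of completed-container entries across NM restarts.
class AggregationAwareContainerState {
  // Stand-in for the NM recovery state store; illustrative only.
  interface StateStore {
    void removeContainer(String containerId);
  }

  private final StateStore stateStore;
  // Completed containers whose logs have not been uploaded yet.
  private final Map<String, Boolean> pendingAggregation = new ConcurrentHashMap<>();

  AggregationAwareContainerState(StateStore stateStore) {
    this.stateStore = stateStore;
  }

  void containerCompleted(String containerId) {
    // Keep the recovery entry until the logs are durable, so a restart in
    // the meantime can still recover and aggregate this container.
    pendingAggregation.put(containerId, Boolean.TRUE);
  }

  // Invoked by (hypothetical) rolling log aggregation after a successful
  // upload cycle: once the logs are safely uploaded, the container's
  // recovery state can be released without reintroducing the skipped-logs
  // bug this JIRA fixes.
  void logsAggregated(String containerId) {
    if (pendingAggregation.remove(containerId) != null) {
      stateStore.removeContainer(containerId);
    }
  }
}
{code}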
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165736#comment-17165736 ]

Jim Brennan commented on YARN-4771:
-----------------------------------

Thanks [~ebadger]!
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164670#comment-17164670 ]

Hudson commented on YARN-4771:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18470 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18470/])
YARN-4771. Some containers can be skipped during log aggregation after (ebadger: rev ac5f21dbef0f0ad4210e4027f53877760fa606a5)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163705#comment-17163705 ]

Jim Brennan commented on YARN-4771:
-----------------------------------

I think this is ready for review. cc: [~epayne], [~ebadger]
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162728#comment-17162728 ]

Jim Brennan commented on YARN-4771:
-----------------------------------

I don't think the TestFederationInterceptor unit test failure is related to this change.
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162385#comment-17162385 ]

Hadoop QA commented on YARN-4771:
---------------------------------

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 41s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 20m 35s | trunk passed |
| +1 | compile | 1m 10s | trunk passed |
| +1 | checkstyle | 0m 33s | trunk passed |
| +1 | mvnsite | 0m 43s | trunk passed |
| +1 | shadedclient | 15m 22s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 32s | trunk passed |
|  0 | spotbugs | 1m 20s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 1m 17s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 37s | the patch passed |
| +1 | compile | 1m 0s | the patch passed |
| +1 | javac | 1m 0s | the patch passed |
| +1 | checkstyle | 0m 24s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 121 unchanged - 1 fixed = 121 total (was 122) |
| +1 | mvnsite | 0m 35s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 36s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 27s | the patch passed |
| +1 | findbugs | 1m 21s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 22m 17s | hadoop-yarn-server-nodemanager in the patch failed. |
| +1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
|  |  | 81m 55s |  |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |

|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26300/artifact/out/Dockerfile |
| JIRA Issue | YARN-4771 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13008118/YARN-4771.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 388d1e38b552 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / d23cc9d85d8 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | https:/
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090800#comment-17090800 ]

Eric Payne commented on YARN-4771:
----------------------------------

bq. This is going to be problematic, impacting NM memory usage.

My company has been using this feature for 4 years in production and we have not seen problems with it.

bq. I think the right solution is to decouple log-aggregation state completely from the rest of the container-state, and persist that separately in state-store etc irrespective of container / application state.

Perhaps, but that would be a somewhat invasive re-architecture. I think that can be addressed in a different JIRA. I would like to see this JIRA (YARN-4771) committed into YARN for branches 2.10 through trunk.
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231386#comment-15231386 ]

Junping Du commented on YARN-4771:
----------------------------------

002 patch LGTM. One additional fix: we'd better use MonotonicTime in place of System.currentTimeMillis() for tracking the timeout. Just an optional comment; we can address it here or in a separate JIRA.
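A minimal sketch of the suggestion, assuming the monotonic clock meant here is Hadoop's org.apache.hadoop.util.Time.monotonicNow(); the surrounding class and field names are hypothetical, not the actual NodeManager code.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.util.Time;

// Hypothetical tracker illustrating the suggestion: Time.monotonicNow() is
// not affected by wall-clock adjustments (NTP steps, manual changes), so a
// timeout measured with it cannot fire early or late when the clock jumps.
// Its values are only meaningful as differences within a single process.
public class StoppedContainerTracker {
  private final long durationToTrackMs;
  // containerId -> monotonic deadline after which the entry may be dropped
  private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

  public StoppedContainerTracker(long durationToTrackMs) {
    this.durationToTrackMs = durationToTrackMs;
  }

  public void containerStopped(String containerId) {
    deadlines.put(containerId, Time.monotonicNow() + durationToTrackMs);
  }

  public void removeVeryOldStoppedContainers() {
    final long now = Time.monotonicNow();
    deadlines.entrySet().removeIf(e -> e.getValue() <= now);
  }
}
{code}

One trade-off worth noting: monotonic readings are not comparable across process restarts, so any deadline persisted to the state store and recovered after an NM restart would still need a wall-clock representation.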
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184991#comment-15184991 ]

Jason Lowe commented on YARN-4771:
----------------------------------

The problem occurs because removeVeryOldStoppedContainersFromCache will remove containers from the state store that have completed at least yarn.nodemanager.duration-to-track-stopped-containers milliseconds ago. Once the container state is removed from the state store, there's nothing to recover for that container when the NM restarts. With no information about that container to recover, the log aggregation service doesn't know it needs to aggregate the logs for that container, so the container is skipped during log aggregation.
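To make the failure mode concrete, here is a simplified, self-contained model of the eviction path described above. The real logic lives in NodeStatusUpdaterImpl and the NM state store; the names below are illustrative, not the actual implementation.

{code:java}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the eviction described above. Entries older than the
// tracking window are dropped from both the in-memory cache and the
// recovery state store. After that, a work-preserving NM restart has no
// record of the container, so the log aggregation service never learns it
// must aggregate that container's logs: the bug in this JIRA.
class StoppedContainerCache {
  // Stand-in for the NM recovery state store; illustrative only.
  interface StateStore {
    void removeContainer(String containerId);
  }

  // containerId -> wall-clock time (ms) at which tracking may stop
  private final Map<String, Long> recentlyStopped = new LinkedHashMap<>();

  synchronized void containerStopped(String containerId, long durationToTrackMs) {
    recentlyStopped.put(containerId, System.currentTimeMillis() + durationToTrackMs);
  }

  synchronized void removeVeryOldStoppedContainers(StateStore stateStore) {
    long now = System.currentTimeMillis();
    Iterator<Map.Entry<String, Long>> it = recentlyStopped.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> e = it.next();
      if (e.getValue() < now) {
        it.remove();                            // forgotten in memory
        stateStore.removeContainer(e.getKey()); // forgotten across restarts
      }
    }
  }
}
{code}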