[ https://issues.apache.org/jira/browse/YARN-11698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849809#comment-17849809 ]
ASF GitHub Bot commented on YARN-11698:
---------------------------------------

Kimahriman opened a new pull request, #6845:
URL: https://github.com/apache/hadoop/pull/6845

### Description of PR

Stores containers pending log aggregation in the NodeManager state store so that logs can still be aggregated for completed containers after a NodeManager restart. This undoes and replaces https://issues.apache.org/jira/browse/YARN-4771 with a finer-grained approach that avoids storing containers indefinitely until the application finishes.

The original approach has several issues, some of which were raised in the JIRA but judged acceptable at the time:

- Long-running applications can lead to a large number of containers being stored indefinitely in the state store, as well as in memory on the NodeManager.
- On restart, the NodeManager has to do a lot of work fully recovering all of these completed containers just so they can be registered for log aggregation again.
- This leads to large heartbeat messages to the ResourceManager that can DoS or OOM it.
- It ignores the fact that users may not have log aggregation enabled, or may have rolling log aggregation enabled, so containers are stored even when there is no longer any need to aggregate their logs.

Instead, this adds a new state store entry for containers pending log aggregation. This solves all of the above issues while still providing the same guarantees about logs being aggregated after a NodeManager restart.

### How was this patch tested?

New UTs added

### For code changes:

- [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [x] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?


> Finished containers shouldn't be stored indefinitely in the NM state store
> --------------------------------------------------------------------------
>
>                 Key: YARN-11698
>                 URL: https://issues.apache.org/jira/browse/YARN-11698
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>   Affects Versions: 3.4.0
>            Reporter: Adam Binford
>            Priority: Major
>
> https://issues.apache.org/jira/browse/YARN-4771 updated the container
> tracking in the state store to only remove containers when their application
> ends, in order to make sure all container logs get aggregated even across NM
> restarts. This can lead to a significant number of containers building up in
> the state store and a lot of state to recover. Since this was purely for
> making sure logs get aggregated, it could be done in a smarter way that takes
> into account both rolling log aggregation and log aggregation not being
> enabled at all.
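
The idea in the PR description (record only the containers that still await log aggregation, and drop each record once its logs are aggregated) can be illustrated with a minimal sketch. This is not the code from the pull request: the class name, method names, and key prefix below are all hypothetical, and an in-memory map stands in for the persistent NodeManager state store.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch only; none of these names come from Hadoop. */
class PendingLogAggregationSketch {
    // Hypothetical key prefix for "pending log aggregation" entries.
    static final String PREFIX = "LogAggregation/pending/";

    // Stand-in for the persistent key-value state store.
    private final Map<String, String> store;

    PendingLogAggregationSketch(Map<String, String> store) {
        this.store = store;
    }

    /** Record a finished container only if its logs will be aggregated later. */
    void containerFinished(String containerId, String appId,
                           boolean logAggregationEnabled) {
        if (logAggregationEnabled) {
            store.put(PREFIX + containerId, appId);
        }
    }

    /** Drop the entry as soon as the container's logs have been aggregated. */
    void logsAggregated(String containerId) {
        store.remove(PREFIX + containerId);
    }

    /** After a restart, recover just the containers still awaiting aggregation. */
    Set<String> recoverPending() {
        Set<String> pending = new TreeSet<>();
        for (String key : store.keySet()) {
            if (key.startsWith(PREFIX)) {
                pending.add(key.substring(PREFIX.length()));
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        Map<String, String> disk = new TreeMap<>();
        PendingLogAggregationSketch nm = new PendingLogAggregationSketch(disk);
        nm.containerFinished("container_1", "app_1", true);
        nm.containerFinished("container_2", "app_1", true);
        nm.logsAggregated("container_1");

        // Simulate a restart: a new instance reads the same underlying store.
        PendingLogAggregationSketch restarted = new PendingLogAggregationSketch(disk);
        System.out.println(restarted.recoverPending());
    }
}
```

Under these assumptions, the amount of state to recover on restart scales with the number of containers whose logs are still outstanding, not with the number of containers the application has ever run on the node.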