[ 
https://issues.apache.org/jira/browse/YARN-11698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849809#comment-17849809
 ] 

ASF GitHub Bot commented on YARN-11698:
---------------------------------------

Kimahriman opened a new pull request, #6845:
URL: https://github.com/apache/hadoop/pull/6845

   
   ### Description of PR
   Stores containers pending log aggregation in the NodeManager state store so 
logs can still be aggregated for completed containers after a NodeManager 
restart. This undoes and replaces 
https://issues.apache.org/jira/browse/YARN-4771 with a finer-grained approach 
that doesn't keep every finished container in the state store until the 
application finishes. 
   
   The original approach has several issues, some of which were raised in the 
JIRA but considered acceptable at the time:
   - Long-running applications can lead to a large number of containers being 
stored indefinitely in the state store as well as in memory on the NodeManager
   - On restart, the NodeManager has to do a lot of work fully recovering all 
of these completed containers just so they can be registered for log 
aggregation again
   - The retained containers also lead to large heartbeat messages to the 
ResourceManager that can DoS or OOM it
   - It ignores the fact that users may not have log aggregation enabled, or 
may have rolling log aggregation enabled, meaning containers stay stored even 
when there are no logs left to aggregate in the future
   
   Instead, this adds a new state store entry for containers pending log 
aggregation. This addresses all of the issues above while still providing the 
same guarantees about logs being aggregated after a NodeManager restart.
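   
   To illustrate the idea only, here is a minimal sketch of tracking 
containers pending log aggregation under their own key prefix. It is not the 
code in this PR: the class name, method names, and the 
`LogAggregation/pending/` prefix are all hypothetical, and a plain `Map` stands 
in for the persistent state store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the PR's implementation.
public class PendingLogAggregationSketch {
    // Assumed key prefix, chosen for illustration only.
    static final String PENDING_PREFIX = "LogAggregation/pending/";

    // Stand-in for the persistent key-value store; it outlives this object,
    // which is how the sketch models a restart.
    private final Map<String, byte[]> db;

    public PendingLogAggregationSketch(Map<String, byte[]> db) {
        this.db = db;
    }

    // Record a finished container whose logs still need to be aggregated.
    public void storePending(String containerId) {
        db.put(PENDING_PREFIX + containerId, new byte[0]);
    }

    // Drop the entry once the container's logs have been aggregated.
    public void removePending(String containerId) {
        db.remove(PENDING_PREFIX + containerId);
    }

    // On recovery, read back only the containers still awaiting aggregation.
    public List<String> loadPending() {
        List<String> pending = new ArrayList<>();
        for (String key : db.keySet()) {
            if (key.startsWith(PENDING_PREFIX)) {
                pending.add(key.substring(PENDING_PREFIX.length()));
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        Map<String, byte[]> disk = new TreeMap<>();

        PendingLogAggregationSketch beforeRestart =
            new PendingLogAggregationSketch(disk);
        beforeRestart.storePending("container_1");
        beforeRestart.storePending("container_2");
        beforeRestart.removePending("container_1"); // logs aggregated

        // Simulated restart: a new instance over the same backing store.
        PendingLogAggregationSketch afterRestart =
            new PendingLogAggregationSketch(disk);
        System.out.println(afterRestart.loadPending()); // prints [container_2]
    }
}
```

   In this sketch the number of stored entries tracks the number of 
containers still waiting on aggregation rather than the number of containers 
the application has ever run on the node.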
   
   ### How was this patch tested?
   New unit tests added.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [x] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   

> Finished containers shouldn't be stored indefinitely in the NM state store
> --------------------------------------------------------------------------
>
>                 Key: YARN-11698
>                 URL: https://issues.apache.org/jira/browse/YARN-11698
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 3.4.0
>            Reporter: Adam Binford
>            Priority: Major
>
> https://issues.apache.org/jira/browse/YARN-4771 updated the container 
> tracking in the state store to only remove containers when their application 
> ends, in order to make sure all containers' logs get aggregated even across 
> NM restarts. This can lead to a significant number of containers building up 
> in the state store and a lot of state to recover. Since this was purely for 
> making sure logs get aggregated, it could be done in a smarter way that 
> accounts for both rolling log aggregation and log aggregation not being 
> enabled at all.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
