[jira] [Created] (YARN-11698) Finished containers shouldn't be stored indefinitely in the NM state store

2024-05-21 Thread Adam Binford (Jira)
Adam Binford created YARN-11698:
---

 Summary: Finished containers shouldn't be stored indefinitely in 
the NM state store
 Key: YARN-11698
 URL: https://issues.apache.org/jira/browse/YARN-11698
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.4.0
Reporter: Adam Binford


https://issues.apache.org/jira/browse/YARN-4771 updated the container tracking 
in the state store to only remove containers when their application ends, in 
order to make sure all container logs get aggregated even across NM restarts. 
This can lead to a significant number of containers building up in the state 
store and a lot of state to recover. Since this was done purely to make sure 
logs get aggregated, it could be handled more intelligently by taking into 
account both rolling log aggregation and the case where log aggregation is not 
enabled at all.
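
For illustration, a minimal sketch of the kind of check such a smarter policy 
might make. The class and method below are hypothetical and not existing 
NodeManager code, although the two property names are the real YARN 
configuration keys:

{code:java}
import org.apache.hadoop.conf.Configuration;

/** Hypothetical sketch only; not part of the NodeManager. */
public class FinishedContainerRetentionSketch {
  /**
   * Returns true if a finished container could, in principle, be dropped from
   * the state store before its application finishes.
   */
  static boolean canRemoveFinishedContainerEarly(Configuration conf) {
    boolean logAggregationEnabled =
        conf.getBoolean("yarn.log-aggregation-enable", false);
    long rollingIntervalSecs = conf.getLong(
        "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds", -1);
    // No log aggregation: there are no logs to wait for, so the finished
    // container could be removed immediately. Rolling aggregation: it could be
    // removed once its logs have been uploaded instead of at application end.
    return !logAggregationEnabled || rollingIntervalSecs > 0;
  }
}
{code}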



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2024-01-18 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274
 ] 

Adam Binford edited comment on YARN-4771 at 1/18/24 3:54 PM:
-

 
{quote}However it may have issues with very long-running apps that churn a lot 
of containers, since the container state won't be released until the 
application completes.
{quote}
{quote}This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not in the form of NM memory 
usage. We have run-forever Spark Structured Streaming applications that use 
dynamic allocation to grab resources when they need them. After restarting our 
Node Managers, the recovery process can end up DoS'ing our Resource Manager, 
especially if we restart a large number of nodes at once, since there can be 
thousands of tracked "completed" containers. We're also seeing the servers 
running the Node Managers sometimes die during the recovery process. It seems 
like there are multiple issues here, but they mostly stem from keeping all 
containers for active applications in the state store indefinitely:
 * As part of the recovery process, the NM seems to send a "container released" 
message to the RM, which the RM just logs as "Thanks, I don't know what this 
container is though". This is what can cause the DoS'ing of the RM.
 * On the NM itself, part of the recovery process seems to actually try to 
allocate resources for completed containers, resulting in the server running 
out of memory. We've only seen this a couple of times, so we're still trying to 
track down exactly what's happening. Our metrics show spikes of up to 100x the 
resources the NM actually has (i.e. the NM reports terabytes of memory 
allocated, but the node only has ~300 GiB of memory). The metrics might be a 
harmless side effect of the recovery process, but the nodes dying is what's 
concerning.

I'm still trying to track down all the moving pieces here, as traversing the 
event-passing system isn't easy to follow. So far I've only tracked down why 
containers are never removed from the state store until an application 
finishes. We use rolling log aggregation, so I'm currently trying to see if we 
can use that mechanism to release containers from the state store once their 
logs have been aggregated. But this would also be a non-issue if I could figure 
out why the other issues are happening and how to prevent them.
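
A rough, purely hypothetical sketch of the rolling-log-aggregation idea in the 
previous paragraph: release a container from the state store once its logs have 
been uploaded. NMStateStoreService#removeContainer exists in the NM, but the 
listener class and hook method here are invented for illustration:

{code:java}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService;

/** Hypothetical hook; not existing log aggregation code. */
public class ContainerLogUploadListenerSketch {
  private final NMStateStoreService stateStore;

  public ContainerLogUploadListenerSketch(NMStateStoreService stateStore) {
    this.stateStore = stateStore;
  }

  /** Called (hypothetically) after a container's logs have been uploaded. */
  public void onContainerLogsUploaded(ContainerId containerId) throws IOException {
    // Drop the finished container so it is not replayed on NM recovery.
    stateStore.removeContainer(containerId);
  }
}
{code}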


was (Author: kimahriman):
{quote}{quote}However it may have issues with very long-running apps that churn 
a lot of containers, since the container state won't be released until the 
application completes.
{quote}
This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not in the form of NM memory 
usage. We have run-forever Spark Structured Streaming applications that use 
dynamic allocation to grab resources when they need them. After restarting our 
Node Managers, the recovery process can end up DoS'ing our Resource Manager, 
especially if we restart a large number of nodes at once, since there can be 
thousands of tracked "completed" containers. We're also seeing the servers 
running the Node Managers sometimes die during the recovery process. It seems 
like there are multiple issues here, but they mostly stem from keeping all 
containers for active applications in the state store indefinitely:
 * As part of the recovery process, the NM seems to send a "container released" 
message to the RM, which the RM just logs as "Thanks, I don't know what this 
container is though". This is what can cause the DoS'ing of the RM.
 * On the NM itself, part of the recovery process seems to actually try to 
allocate resources for completed containers, resulting in the server running 
out of memory. We've only seen this a couple of times, so we're still trying to 
track down exactly what's happening. Our metrics show spikes of up to 100x the 
resources the NM actually has (i.e. the NM reports terabytes of memory 
allocated, but the node only has ~300 GiB of memory). The metrics might be a 
harmless side effect of the recovery process, but the nodes dying is what's 
concerning.

I'm still trying to track down all the moving pieces here, as traversing the 
event-passing system isn't easy to follow. So far I've only tracked down why 
containers are never removed from the state store until an application 
finishes. We use rolling log aggregation, so I'm currently trying to see if we 
can use that mechanism to release containers from the state store once their 
logs have been aggregated. But this would also be a non-issue if I could figure 
out why the other issues are happening and how to prevent them.


[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2024-01-18 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274
 ] 

Adam Binford commented on YARN-4771:


{quote}{quote}However it may have issues with very long-running apps that churn 
a lot of containers, since the container state won't be released until the 
application completes.
{quote}
This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not in the form of NM memory 
usage. We have run-forever Spark Structured Streaming applications that use 
dynamic allocation to grab resources when they need them. After restarting our 
Node Managers, the recovery process can end up DoS'ing our Resource Manager, 
especially if we restart a large number of nodes at once, since there can be 
thousands of tracked "completed" containers. We're also seeing the servers 
running the Node Managers sometimes die during the recovery process. It seems 
like there are multiple issues here, but they mostly stem from keeping all 
containers for active applications in the state store indefinitely:
 * As part of the recovery process, the NM seems to send a "container released" 
message to the RM, which the RM just logs as "Thanks, I don't know what this 
container is though". This is what can cause the DoS'ing of the RM.
 * On the NM itself, part of the recovery process seems to actually try to 
allocate resources for completed containers, resulting in the server running 
out of memory. We've only seen this a couple of times, so we're still trying to 
track down exactly what's happening. Our metrics show spikes of up to 100x the 
resources the NM actually has (i.e. the NM reports terabytes of memory 
allocated, but the node only has ~300 GiB of memory). The metrics might be a 
harmless side effect of the recovery process, but the nodes dying is what's 
concerning.

I'm still trying to track down all the moving pieces here, as traversing the 
event-passing system isn't easy to follow. So far I've only tracked down why 
containers are never removed from the state store until an application 
finishes. We use rolling log aggregation, so I'm currently trying to see if we 
can use that mechanism to release containers from the state store once their 
logs have been aggregated. But this would also be a non-issue if I could figure 
out why the other issues are happening and how to prevent them.

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart
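
Restated as a condition (a sketch only, not NodeManager code; the helper and 
parameter names are invented), a container is at risk of being skipped when 
both of the quoted events hold:

{code:java}
/** Sketch of the timing condition described above; not actual NM code. */
public class LogAggregationSkipConditionSketch {
  static boolean atRiskOfBeingSkipped(long containerFinishTimeMs,
                                      long nmRestartTimeMs,
                                      long durationToTrackStoppedContainersMs,
                                      boolean anotherContainerFinishedBeforeRestart) {
    // 1. The container finished more than
    //    yarn.nodemanager.duration-to-track-stopped-containers ms before the restart.
    boolean finishedTooLongBeforeRestart =
        containerFinishTimeMs < nmRestartTimeMs - durationToTrackStoppedContainersMs;
    // 2. At least one other container finished after it and before the restart.
    return finishedTooLongBeforeRestart && anotherContainerFinishedBeforeRestart;
  }
}
{code}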



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10863) CGroupElasticMemoryController does not work

2022-01-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470163#comment-17470163
 ] 

Adam Binford commented on YARN-10863:
-

Just hit this as well; you're effectively forced into strict memory control if 
you're using elastic memory control.

> CGroupElasticMemoryController does not work
> -
>
> Key: YARN-10863
> URL: https://issues.apache.org/jira/browse/YARN-10863
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.3.1
>Reporter: LuoGe
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10863.001-1.patch, YARN-10863.002.patch, 
> YARN-10863.004.patch, YARN-10863.005.patch, YARN-10863.006.patch, 
> YARN-10863.007.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When following the 
> [documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCGroupsMemory.html]
>  to configure elastic memory resource control, 
> yarn.nodemanager.elastic-memory-control.enabled is set to true, 
> yarn.nodemanager.resource.memory.enforced is set to false, 
> yarn.nodemanager.pmem-check-enabled is set to true, and 
> yarn.nodemanager.resource.memory.enabled is set to true so that cgroups 
> control memory, but elastic memory control does not work.
> Looking at ContainersMonitorImpl.java, the skip logic in the checkLimit 
> function has a problem. The early return only happens when 
> strictMemoryEnforcement is true and elasticMemoryEnforcement is false. So, 
> when elastic memory control is configured as the documentation describes, the 
> check continues, and a container whose memory usage goes over its limit will 
> be killed by checkLimit. 
> {code:java}
> if (strictMemoryEnforcement && !elasticMemoryEnforcement) {
>   // When cgroup-based strict memory enforcement is used alone without
>   // elastic memory control, the oom-kill would take care of it.
>   // However, when elastic memory control is also enabled, the oom killer
>   // would be disabled at the root yarn container cgroup level (all child
>   // cgroups would inherit that setting). Hence, we fall back to the
>   // polling-based mechanism.
>   return;
> }
> {code}
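
To spell out what that guard means for the configurations discussed above, here 
is a small sketch that restates the quoted condition (not the actual method):

{code:java}
/** Restates the guard from ContainersMonitorImpl#checkLimit shown above (sketch only). */
public class ElasticMemoryGuardSketch {
  /** True means checkLimit returns early and the polling-based kill is skipped. */
  static boolean skipsPollingKill(boolean strictMemoryEnforcement,
                                  boolean elasticMemoryEnforcement) {
    return strictMemoryEnforcement && !elasticMemoryEnforcement;
  }

  public static void main(String[] args) {
    // Strict only: rely on the cgroup OOM killer, no polling-based kill.
    System.out.println(skipsPollingKill(true, false));   // true
    // Documented elastic setup (enforced=false, elastic=true): the polling-based
    // kill still runs, which is the behaviour reported in this issue.
    System.out.println(skipsPollingKill(false, true));   // false
    // Strict + elastic: the polling-based kill also still runs (the intended fallback).
    System.out.println(skipsPollingKill(true, true));    // false
  }
}
{code}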



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org