[ 
https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744545#comment-17744545
 ] 

ASF GitHub Bot commented on YARN-9877:
--------------------------------------

K0K0V0K commented on code in PR #5784:
URL: https://github.com/apache/hadoop/pull/5784#discussion_r1267902588


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java:
##########
@@ -1088,8 +1088,15 @@ public void transition(RMAppImpl app, RMAppEvent event) {
       // otherwise, add it to ranNodes for further process
       app.ranNodes.add(nodeAddedEvent.getNodeId());
 
-      app.logAggregation.addReportIfNecessary(
-          nodeAddedEvent.getNodeId(), app.getApplicationId());
+      if (!nodeAddedEvent.isCreatedFromAcquiredState()) {
+        app.logAggregation.addReportIfNecessary(
+            nodeAddedEvent.getNodeId(), app.getApplicationId());
+      } else {
+        LOG.warn(String.format("Not considering container for log aggregation "
+                + "while app is transitioning from ACQUIRED directly to RELEASED "
+                + "for nodeId: %s and appId: %s",
+            nodeAddedEvent.getNodeId(), app.getApplicationId()));
+      }

Review Comment:
   I agree with both.
   
However, I added the log message because I think this if-based solution is a 
bit fishy, and it could cause trouble for someone debugging here. The warning 
level is too high, though, so I will change it to debug.
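
For illustration only (this is not the change in the PR), here is a minimal 
sketch of what the debug-level variant could look like, assuming the class 
uses an SLF4J logger; parameterized logging skips building the message 
entirely when debug is disabled. The class and helper names below are 
hypothetical.

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DebugLogSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(DebugLogSketch.class);

  // Hypothetical stand-ins for nodeAddedEvent.getNodeId() and
  // app.getApplicationId(); the real types live in the RM code base.
  static void logSkippedNode(String nodeId, String appId) {
    // Parameterized logging: the message is only built when debug is
    // actually enabled, unlike an eager String.format(...).
    LOG.debug("Not considering container for log aggregation while app is"
        + " transitioning from ACQUIRED directly to RELEASED"
        + " for nodeId: {} and appId: {}", nodeId, appId);
  }

  public static void main(String[] args) {
    logSkippedNode("node-1:8041", "application_1690000000000_0001");
  }
}
{code}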





> Intermittent TIME_OUT of LogAggregationReport
> ---------------------------------------------
>
>                 Key: YARN-9877
>                 URL: https://issues.apache.org/jira/browse/YARN-9877
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, resourcemanager, yarn
>    Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-9877.001.patch
>
>
> I noticed intermittent TIME_OUT statuses in some downstream log-aggregation 
> based tests.
> Steps to reproduce:
> - Let's run an MR job:
> {code}
> hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
> {code}
> - Suppose the AM requests more containers, but as soon as they are allocated, 
> the AM realizes it does not need them. The containers' state transitions are 
> ALLOCATED -> ACQUIRED -> RELEASED. Let's suppose these extra containers are 
> allocated on a different node from the node that hosts the other 21 
> containers (AM + 10 mappers + 10 reducers).
> - All the containers finish successfully and the app finishes successfully as 
> well. The log aggregation status for the whole app appears to be stuck in the 
> RUNNING state.
> - After a while the final log aggregation status for the app changes to 
> TIME_OUT.
> Root cause:
> - As the unused containers go through these state transitions in the RM's 
> internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s 
> transition function is called. This calls 
> {{RMAppLogAggregation$addReportIfNecessary}}, which forcefully adds a 
> "NOT_START" LogAggregationStatus for this NodeId to the app, even though the 
> app has no running container on that node.
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the 
> NodeManager, because the node has no running container for the app (note that 
> the AM released the containers immediately after acquisition). The 
> LogAggregationStatus remains NOT_START until the timeout is reached. At that 
> point the RM aggregates the LogAggregationReports for all the nodes, and 
> though all the containers have finished with SUCCEEDED state, this particular 
> node still has NOT_START, so the final log aggregation status will be 
> TIME_OUT.
> (I crawled the RM UI for the log aggregation statuses, and it was always 
> NOT_START for this particular node.)
> This situation is highly unlikely, but it has an estimated ~0.8% failure rate 
> based on roughly 1500 runs over a year on an unstressed cluster.
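
For illustration, a minimal, hypothetical sketch of the roll-up behavior 
described above (not the actual RMAppImpl/RMAppLogAggregation code): a single 
node whose report stays NOT_START forces the aggregated status to TIME_OUT 
once the timeout elapses. All class and method names below are invented for 
the example.

{code}
import java.util.Map;

// Hypothetical illustration only; the real roll-up lives in RMAppImpl /
// RMAppLogAggregation and is more involved than this.
public class LogAggregationRollupSketch {

  enum Status { NOT_START, RUNNING, SUCCEEDED, TIME_OUT }

  // If every node reports SUCCEEDED, the app-level status is SUCCEEDED.
  // A node that never progresses past NOT_START keeps the app-level status
  // at RUNNING until the timeout elapses, after which it becomes TIME_OUT.
  static Status finalStatus(Map<String, Status> perNodeReports,
      boolean timedOut) {
    boolean allSucceeded = perNodeReports.values().stream()
        .allMatch(s -> s == Status.SUCCEEDED);
    if (allSucceeded) {
      return Status.SUCCEEDED;
    }
    return timedOut ? Status.TIME_OUT : Status.RUNNING;
  }

  public static void main(String[] args) {
    // The node that only ever held ACQUIRED -> RELEASED containers is stuck
    // at NOT_START, mirroring the scenario described above.
    Map<String, Status> reports = Map.of(
        "node-with-real-containers", Status.SUCCEEDED,
        "node-with-released-only-containers", Status.NOT_START);
    System.out.println(finalStatus(reports, false)); // RUNNING
    System.out.println(finalStatus(reports, true));  // TIME_OUT
  }
}
{code}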



