[ https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737571#comment-17737571 ]
ASF GitHub Bot commented on YARN-9877:
--------------------------------------
K0K0V0K opened a new pull request, #5784:
URL: https://github.com/apache/hadoop/pull/5784
In case of an ACQUIRED -> RELEASED transition, log aggregation will time out
for the container.
(Based on Adam Antal's work, thanks for it.)
### Description of PR
### How was this patch tested?
### For code changes:
- [ ] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> Intermittent TIME_OUT of LogAggregationReport
> ---------------------------------------------
>
> Key: YARN-9877
> URL: https://issues.apache.org/jira/browse/YARN-9877
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation, resourcemanager, yarn
> Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
> Reporter: Adam Antal
> Assignee: Adam Antal
> Priority: Major
> Attachments: YARN-9877.001.patch
>
>
> I noticed intermittent TIME_OUT results in some downstream
> log-aggregation-based tests.
> Steps to reproduce:
> - Let's run an MR job:
> {code}
> hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep
> -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
> {code}
> - Suppose the AM requests more containers, but as soon as they're
> allocated, the AM realizes it doesn't need them. These containers' state
> transitions are: ALLOCATED -> ACQUIRED -> RELEASED.
> Let's suppose these extra containers are allocated on a different node from
> the one running the other 21 containers (AM + 10 mappers + 10 reducers).
> - All the containers finish successfully and the app finishes successfully
> as well, yet the log aggregation status for the whole app appears stuck in
> the RUNNING state.
> - After a while the final log aggregation status for the app changes to
> TIME_OUT.
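>
> To observe this without crawling the RM UI, the per-app log aggregation
> status can also be polled from the RM REST API. Below is a minimal
> illustrative sketch (not part of the reproduction itself), assuming an
> unsecured ResourceManager webapp on localhost:8088 and the
> {{logAggregationStatus}} field of the {{ws/v1/cluster/apps/<appId>}}
> response:
> {code}
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
>
> // Illustrative only: polls the RM REST API until log aggregation leaves
> // RUNNING/NOT_START. Assumes an unsecured RM webapp on localhost:8088.
> public class LogAggregationStatusPoller {
>   public static void main(String[] args) throws Exception {
>     String appId = args[0]; // e.g. application_1570000000000_0001
>     HttpClient client = HttpClient.newHttpClient();
>     HttpRequest request = HttpRequest.newBuilder()
>         .uri(URI.create("http://localhost:8088/ws/v1/cluster/apps/" + appId))
>         .header("Accept", "application/json")
>         .build();
>     String key = "\"logAggregationStatus\":\"";
>     while (true) {
>       String body = client.send(request,
>           HttpResponse.BodyHandlers.ofString()).body();
>       // Crude field extraction to avoid pulling in a JSON dependency.
>       int i = body.indexOf(key);
>       String status = "UNKNOWN";
>       if (i >= 0) {
>         int start = i + key.length();
>         status = body.substring(start, body.indexOf('"', start));
>       }
>       System.out.println("logAggregationStatus = " + status);
>       if (!"RUNNING".equals(status) && !"NOT_START".equals(status)) {
>         break; // SUCCEEDED, FAILED or TIME_OUT
>       }
>       Thread.sleep(10_000L);
>     }
>   }
> }
> {code}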
> Root cause:
> - As the unused containers go through these state transitions in the RM's
> internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s
> transition function is called. This calls
> {{RMAppLogAggregation$addReportIfNecessary}}, which forcefully adds a
> "NOT_START" LogAggregationStatus for this NodeId to the app, even though
> the app has no running container on that node.
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the
> NodeManager because no container ever ran on it (note that the AM released
> the containers immediately after acquisition). The LogAggregationStatus
> remains NOT_START until the timeout is reached. At that point the RM
> aggregates the LogAggregationReports for all the nodes, and although all
> the containers are in the SUCCEEDED state, this particular node is still
> NOT_START, so the final log aggregation status will be TIME_OUT.
> (I crawled the RM UI for the log aggregation statuses, and it was always
> NOT_START for this particular node).
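>
> The bookkeeping described above can be summarized with a small standalone
> model. This is only an illustration of the described behaviour, not the
> actual {{RMAppImpl}}/{{RMAppLogAggregation}} code: every node the app is
> reported to run on gets a NOT_START entry, only nodes that actually launch
> containers ever report SUCCEEDED, and the roll-up degrades to TIME_OUT once
> the timeout expires while any entry is still NOT_START.
> {code}
> import java.util.HashMap;
> import java.util.Map;
>
> // Standalone illustration of the reported behaviour; not the actual RM code.
> public class LogAggregationRollupModel {
>   enum Status { NOT_START, RUNNING, SUCCEEDED, TIME_OUT }
>
>   private final Map<String, Status> perNode = new HashMap<>();
>
>   // Called for every node the app is reported to be "running on", including
>   // nodes whose containers go ALLOCATED -> ACQUIRED -> RELEASED without
>   // ever launching.
>   void appRunningOnNode(String nodeId) {
>     perNode.putIfAbsent(nodeId, Status.NOT_START);
>   }
>
>   // Only nodes that actually launched containers send this update.
>   void nodeManagerReports(String nodeId, Status status) {
>     perNode.put(nodeId, status);
>   }
>
>   // Roll-up after the app finishes: any node still NOT_START past the
>   // timeout drags the whole app to TIME_OUT.
>   Status finalStatus(boolean timeoutExpired) {
>     if (perNode.containsValue(Status.NOT_START)) {
>       return timeoutExpired ? Status.TIME_OUT : Status.RUNNING;
>     }
>     return Status.SUCCEEDED;
>   }
>
>   public static void main(String[] args) {
>     LogAggregationRollupModel app = new LogAggregationRollupModel();
>     app.appRunningOnNode("node-1");  // runs the AM, mappers and reducers
>     app.appRunningOnNode("node-2");  // only ever held released containers
>     app.nodeManagerReports("node-1", Status.SUCCEEDED);
>     System.out.println(app.finalStatus(false)); // RUNNING (seemingly stuck)
>     System.out.println(app.finalStatus(true));  // TIME_OUT
>   }
> }
> {code}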
> This situation is highly unlikely, but it has an estimated failure rate of
> ~0.8%, based on 1500 runs over a year on an unstressed cluster.