[
https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brahma Reddy Battula updated YARN-9877:
---------------------------------------
Target Version/s: 3.4.0 (was: 3.0.4, 3.3.0, 3.2.2, 3.1.4)
Bulk update: moved all 3.3.0 non-blocker issues, please move back if it is a
blocker.
> Intermittent TIME_OUT of LogAggregationReport
> ---------------------------------------------
>
> Key: YARN-9877
> URL: https://issues.apache.org/jira/browse/YARN-9877
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation, resourcemanager, yarn
> Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
> Reporter: Adam Antal
> Assignee: Adam Antal
> Priority: Major
> Attachments: YARN-9877.001.patch
>
>
> I noticed intermittent TIME_OUT statuses in some downstream
> log-aggregation-based tests.
> Steps to reproduce:
> - Run an MR job:
> {code}
> hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep
> -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
> {code}
> - Suppose the AM requests more containers than it needs and releases the
> extras as soon as they are allocated. Each extra container goes through the
> states ALLOCATED -> ACQUIRED -> RELEASED.
> Suppose also that these extra containers are allocated on a different node
> than the one hosting the other 21 (AM + 10 mapper + 10 reducer) containers.
> - All the containers finish successfully, and so does the app. The log
> aggregation status for the whole app appears to be stuck in the RUNNING
> state.
> - After a while the final log aggregation status for the app changes to
> TIME_OUT.
> Root cause:
> - As the unused containers go through these state transitions in the RM's
> internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s
> transition function is called. This calls
> {{RMAppLogAggregation$addReportIfNecessary}}, which adds a "NOT_START"
> LogAggregationStatus for this NodeId to the app, regardless of whether the
> app has any running container on that node.
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the
> NodeManager, because the node has no running container of the app (the AM
> released the containers immediately after acquisition). The
> LogAggregationStatus remains NOT_START until the timeout is reached. At that
> point the RM aggregates the LogAggregationReports for all the nodes, and
> although all the containers have SUCCEEDED state, one particular node has
> NOT_START, so the final log aggregation status will be TIME_OUT.
> (I crawled the RM UI for the log aggregation statuses, and it was always
> NOT_START for this particular node).
> This situation is rare, but has an estimated failure rate of ~0.8%, based
> on 1500 runs over a year on an unstressed cluster.
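The root cause described above can be illustrated with a small, simplified model. This is only a sketch under stated assumptions: the class, enum, and method names below are hypothetical and are not the real Hadoop classes, and the status-combining rule is reduced to what the description states (RUNNING until every node reports SUCCEEDED, TIME_OUT once the timeout passes with a node still not finished).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the behaviour described in the issue.
// All names here are hypothetical; this is not the actual RM code.
public class LogAggregationTimeoutSketch {

    public enum Status { NOT_START, RUNNING, SUCCEEDED, TIME_OUT }

    // Per-node log aggregation status as tracked by the modelled RM.
    private final Map<String, Status> reports = new LinkedHashMap<>();

    // Models the "add report if necessary" step: a NOT_START entry is
    // created for a node as soon as a container of the app is placed on it.
    void appRunningOnNode(String nodeId) {
        reports.putIfAbsent(nodeId, Status.NOT_START);
    }

    // Models a NodeManager reporting its log aggregation status.
    void nodeReports(String nodeId, Status status) {
        reports.put(nodeId, status);
    }

    // Models the app-level status: SUCCEEDED only if every tracked node
    // succeeded; otherwise RUNNING before the timeout and TIME_OUT after it.
    Status appStatus(boolean timeoutReached) {
        for (Status s : reports.values()) {
            if (s != Status.SUCCEEDED) {
                return timeoutReached ? Status.TIME_OUT : Status.RUNNING;
            }
        }
        return Status.SUCCEEDED;
    }

    // Replays the scenario from the description. When registerUnusedNode is
    // true, node-2 gets a NOT_START entry for containers that were released
    // right after acquisition, and its NodeManager never reports back.
    public static Status simulate(boolean registerUnusedNode, boolean timeoutReached) {
        LogAggregationTimeoutSketch app = new LogAggregationTimeoutSketch();
        app.appRunningOnNode("node-1");          // hosts the 21 real containers
        if (registerUnusedNode) {
            app.appRunningOnNode("node-2");      // ALLOCATED -> ACQUIRED -> RELEASED
        }
        app.nodeReports("node-1", Status.SUCCEEDED);
        return app.appStatus(timeoutReached);
    }

    public static void main(String[] args) {
        System.out.println("before timeout: " + simulate(true, false));    // RUNNING
        System.out.println("after timeout: " + simulate(true, true));      // TIME_OUT
        System.out.println("without stale entry: " + simulate(false, true)); // SUCCEEDED
    }
}
```

In this model the only difference between the TIME_OUT and SUCCEEDED outcomes is the NOT_START entry for the node that never ran a container, which matches the observation that the RM UI always showed NOT_START for that node.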