Adam Antal created YARN-9877:
--------------------------------

             Summary: Intermittent TIME_OUT of LogAggregationReport
                 Key: YARN-9877
                 URL: https://issues.apache.org/jira/browse/YARN-9877
             Project: Hadoop YARN
          Issue Type: Bug
          Components: log-aggregation, resourcemanager, yarn
    Affects Versions: 3.1.3, 3.2.1, 3.0.3, 3.3.0
            Reporter: Adam Antal


I noticed intermittent TIME_OUT log aggregation statuses in some downstream log-aggregation based 
tests.

Steps to reproduce:
- Let's run an MR job:
{code}
hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep \
  -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
{code}
- Suppose the AM requests more containers, but as soon as they are allocated, the AM 
realizes it does not need them and releases them. These containers go through the state 
transitions ALLOCATED -> ACQUIRED -> RELEASED.
Let's suppose these extra containers are allocated on a different node than the one hosting 
the other 21 containers (AM + 10 mappers + 10 reducers).
- All the containers finish successfully and the app finishes successfully as well, but the 
log aggregation status for the whole app appears to be stuck in the RUNNING state.
- After a while the final log aggregation status for the app changes to TIME_OUT.

Root cause:
- As the unused containers go through these state transitions in the RM's internal 
representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s transition function is called. 
This calls {{RMAppLogAggregation$addReportIfNecessary}}, which forcefully adds a "NOT_START" 
LogAggregationStatus for this NodeId to the app, even though the app has no running 
container on that node.
- The node's LogAggregationStatus is never updated to "SUCCEEDED" by the NodeManager, 
because the app has no running container on it (note that the AM released the containers 
immediately after acquisition). The LogAggregationStatus therefore remains NOT_START until 
the timeout is reached. At that point the RM aggregates the LogAggregationReports of all the 
nodes, and although every container finished in the SUCCEEDED state, this particular node is 
still NOT_START, so the final log aggregation status becomes TIME_OUT. A simplified sketch 
of this roll-up is shown below.
(I crawled the RM UI for the log aggregation statuses, and it was always NOT_START for this 
particular node.)
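For illustration only, here is a minimal, self-contained sketch of the roll-up behaviour 
described above. It is not the actual {{RMAppImpl}} code; the helper name and the simplified 
per-node input map are made up for the example, and only the {{LogAggregationStatus}} enum is 
the real YARN record type:

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.LogAggregationStatus;

public class LogAggregationRollupSketch {

  // Hypothetical helper: roll up per-node log aggregation statuses once the
  // aggregation timeout has expired (simplified, not the RM implementation).
  static LogAggregationStatus finalStatusAfterTimeout(
      Map<String, LogAggregationStatus> perNodeStatus) {
    boolean anyUnfinished = false;
    for (LogAggregationStatus status : perNodeStatus.values()) {
      if (status == LogAggregationStatus.FAILED) {
        return LogAggregationStatus.FAILED;
      }
      // A node stuck in NOT_START (or still RUNNING) counts as unfinished
      // once the timeout has been reached.
      if (status == LogAggregationStatus.NOT_START
          || status == LogAggregationStatus.RUNNING) {
        anyUnfinished = true;
      }
    }
    // Even if every other node reports SUCCEEDED, one leftover NOT_START
    // node turns the app's final status into TIME_OUT.
    return anyUnfinished
        ? LogAggregationStatus.TIME_OUT
        : LogAggregationStatus.SUCCEEDED;
  }

  public static void main(String[] args) {
    Map<String, LogAggregationStatus> reports = new HashMap<>();
    reports.put("node-1:8041", LogAggregationStatus.SUCCEEDED); // AM + mappers + reducers
    reports.put("node-2:8041", LogAggregationStatus.NOT_START); // only the released containers
    System.out.println(finalStatusAfterTimeout(reports)); // prints TIME_OUT
  }
}
{code}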

This situation is quite unlikely, but it has an estimated failure rate of ~0.8%, based on 
roughly 1500 runs over a year on an unstressed cluster.


