[ https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948811#comment-16948811 ]

Adam Antal commented on YARN-9877:
----------------------------------

Some comments on the patch: 
- I didn't want to add new events/signals to the communication between RMApp and 
RMContainer, so I added a new field to {{RMAppRunningOnNodeEvent}} which is 
only set when the event is triggered from an {{AcquiredTransition}} in 
{{RMContainerImpl}} (see the sketch below).
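
To illustrate the idea, here is a minimal, self-contained sketch. The class, 
field and method names are simplified stand-ins, not the actual YARN classes or 
the code in the patch:

{code}
// Simplified, hypothetical sketch of the extra "from acquisition" flag on the
// running-on-node event. All names are illustrative only.
public class RunningOnNodeEventSketch {

  /** Stand-in for a YARN NodeId. */
  static final class NodeId {
    final String host;
    NodeId(String host) { this.host = host; }
  }

  /** Simplified running-on-node event carrying the proposed new field. */
  static class AppRunningOnNodeEvent {
    private final NodeId nodeId;
    // New field: true only when the event is fired from the container's
    // AcquiredTransition; false for the recovery/reconnect/heartbeat paths.
    private final boolean fromContainerAcquisition;

    AppRunningOnNodeEvent(NodeId nodeId) {
      this(nodeId, false);
    }

    AppRunningOnNodeEvent(NodeId nodeId, boolean fromContainerAcquisition) {
      this.nodeId = nodeId;
      this.fromContainerAcquisition = fromContainerAcquisition;
    }

    NodeId getNodeId() { return nodeId; }
    boolean isFromContainerAcquisition() { return fromContainerAcquisition; }
  }

  public static void main(String[] args) {
    // The AcquiredTransition would be the only caller of the two-argument form.
    AppRunningOnNodeEvent fromAcquire =
        new AppRunningOnNodeEvent(new NodeId("node-1"), true);
    System.out.println("from acquisition: "
        + fromAcquire.isFromContainerAcquisition());
  }
}
{code}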

The RM has a concept of "an app is running on a node" 
({{RMAppRunningOnNodeEvent}}), which is triggered in the following cases:
- a container is recovered
- a node reconnects with a running container associated with the application
- a status update comes in from a node that references a container belonging 
to the application
- an AM acquires a container on a new node

We don't want to disturb any of these cases, but unfortunately the last one is 
problematic. The reasoning: a container is acquired by the AM but then 
released. When the acquisition happens, we consider the app "running on 
that node", so a default "NOT_START" log aggregation status is associated with 
that node - this is what we should avoid. I think it is sufficient to add a 
default log aggregation status only when a container is mentioned in a Node 
Heartbeat (see the sketch below). That happens a bit later than the 
acquisition, but it is safer to assume that the container is running at that 
point.
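
A similarly simplified sketch of the guard that follows from this reasoning; 
the interface and method names below are illustrative, not the real 
{{RMAppImpl}}/{{RMAppLogAggregation}} code:

{code}
// Hypothetical sketch: only seed the default NOT_START status for node
// heartbeat-driven events, not for the pure container-acquisition case.
public class AppRunningOnNodeTransitionSketch {

  /** Stand-in for the app's log aggregation tracker. */
  interface LogAggregation {
    // Records a default NOT_START report for the node unless one exists.
    void addDefaultNotStartReport(String nodeId);
  }

  static void transition(LogAggregation logAggregation,
                         String nodeId,
                         boolean fromContainerAcquisition) {
    // Skip the acquisition case: the container may be released before it ever
    // runs on that node, and the node would then never report SUCCEEDED.
    if (!fromContainerAcquisition) {
      logAggregation.addDefaultNotStartReport(nodeId);
    }
  }

  public static void main(String[] args) {
    LogAggregation tracker =
        node -> System.out.println("NOT_START recorded for " + node);
    transition(tracker, "node-1", false); // heartbeat path: default report added
    transition(tracker, "node-2", true);  // acquisition path: no default report
  }
}
{code}
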
I should also note that the internal data structure of 
{{RMAppImpl$logAggregation}}, which is an {{RMAppLogAggregation}} object, can 
handle the case when the log aggregation status was not set beforehand. During 
the short period before the container is sent along with the NM heartbeat, the 
RM assumes nothing about the log aggregation status (so, for example, in the 
RM UI the Log Aggregation Status will be N/A instead of NOT_START), but I think 
we can live with this limitation.

I hope someone with enough technical depth can take a look at this reasoning 
and also review the patch, maybe [[email protected]], [~Prabhu Joseph]. It 
would be appreciated a lot.

> Intermittent TIME_OUT of LogAggregationReport
> ---------------------------------------------
>
>                 Key: YARN-9877
>                 URL: https://issues.apache.org/jira/browse/YARN-9877
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, resourcemanager, yarn
>    Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: YARN-9877.001.patch
>
>
> I noticed intermittent TIME_OUT results in some downstream log-aggregation 
> based tests.
> Steps to reproduce:
> - Let's run a MR job
> {code}
> hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep 
> -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
> {code}
> - Suppose the AM requests more containers, but as soon as they're allocated 
> the AM realizes it doesn't need them. The containers' state changes are: 
> ALLOCATED -> ACQUIRED -> RELEASED. 
> Let's suppose these extra containers are allocated on a different node from 
> the node of the other 21 containers (AM + 10 mappers + 10 reducers).
> - All the containers finish successfully and the app finishes successfully 
> as well. The log aggregation status for the whole app appears to be stuck in 
> the RUNNING state.
> - After a while the final log aggregation status for the app changes to 
> TIME_OUT.
> Root cause:
> - As the unused containers go through the state transitions in the RM's 
> internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s 
> transition function is called. This calls 
> {{RMAppLogAggregation$addReportIfNecessary}}, which forcefully adds the 
> "NOT_START" LogAggregationStatus associated with this NodeId for the app, 
> even though the node does not run any container of the app.
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the 
> NodeManager because it does not have any running container on it (note that 
> the AM released the containers immediately after acquisition). The 
> LogAggregationStatus remains NOT_START until the timeout is reached. After 
> that point the RM aggregates the LogAggregationReports for all the nodes, 
> and although all the containers are in the SUCCEEDED state, one particular 
> node has NOT_START, so the final log aggregation status will be TIME_OUT. 
> (I crawled the RM UI for the log aggregation statuses, and it was always 
> NOT_START for this particular node.)
> This situation is highly unlikely, but it has an estimated failure rate of 
> ~0.8%, based on 1500 runs over a year on an unstressed cluster.


