[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566890#comment-16566890
 ] 

Szilard Nemeth edited comment on YARN-4946 at 8/2/18 3:26 PM:
--------------------------------------------------------------

DEV NOTES: 
An initial implementation could have looked like this: 
The very first line of the transition checks whether log aggregation is 
finished. 
If it is not, do nothing and return from the method.

To make sure apps become completed once log aggregation is finished, the 
APP_COMPLETED event needs to be dispatched when log aggregation finishes.
In my understanding, this is the sequence of events:
1. RM receives NM heartbeat in ResourceTrackerService.nodeUpdate
2. An RmNodeEvent is created with type STATUS_UPDATE
3. RmNodeImpl.StatusUpdateWhenHealthyTransition.transition handles the node 
status update
4. If there are any log aggregation reports then 
{{RmNode#handleLogAggregationStatus}} is called
5. This ultimately calls rmApp.aggregateLogReport

In rmApp.aggregateLogReport, I needed to check whether log aggregation has 
finished and, if so, send the APP_COMPLETED event.

An issue with this approach:
If a {{FinalTransition}} runs because the app was killed, finished or 
rejected, e.g. RMAppImpl goes from the RUNNING to the FINISHED state 
(RMAppEventType.ATTEMPT_FINISHED), the app will reach a terminal state 
(FINISHED in this case) no matter what happens in {{FinalTransition}}.
If I returned early as described above, the app would already be in the 
FINISHED state, which is not right because the rest of the code in the 
transition could never run again.
So with my implementation, all the code in {{FinalTransition}} runs as 
before, and if log aggregation is not finished yet, I don't send the 
APP_COMPLETED event to the {{RMAppManager}}.
When log aggregation finishes for an app, 
{{RMAppImpl#aggregateLogReport}} is called. 
In this method, I added a piece of code that sends the APP_COMPLETED event to 
the {{RMAppManager}} if the application is in a final state.
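
To illustrate the idea, here is a minimal sketch of the deferred dispatch. 
It is not the code from the patch: App, AppManager and the two enums are 
simplified, hypothetical stand-ins for RMAppImpl, RMAppManager and the real 
state types, and the completedSent guard against sending the event twice is 
my own assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model; none of these types are the real YARN classes.
class DeferredCompletionSketch {

    enum AppState { RUNNING, FINISHED }

    enum LogAggregationState { RUNNING, SUCCEEDED, FAILED, TIME_OUT }

    /** Stand-in for the RMAppManager: records which apps received APP_COMPLETED. */
    static class AppManager {
        final List<String> completedApps = new ArrayList<>();

        void handleAppCompleted(String appId) {
            completedApps.add(appId);
        }
    }

    /** Stand-in for RMAppImpl. */
    static class App {
        private final String appId;
        private final AppManager manager;
        private AppState state = AppState.RUNNING;
        private LogAggregationState logState = LogAggregationState.RUNNING;
        private boolean completedSent = false; // assumption: avoid duplicate events

        App(String appId, AppManager manager) {
            this.appId = appId;
            this.manager = manager;
        }

        /** Models FinalTransition: always runs fully, only the event is deferred. */
        void finalTransition() {
            // ... the rest of the final-transition bookkeeping would run here ...
            state = AppState.FINISHED;
            sendAppCompletedIfReady();
        }

        /** Models aggregateLogReport: called when a node reports log aggregation status. */
        void aggregateLogReport(LogAggregationState reported) {
            logState = reported;
            sendAppCompletedIfReady();
        }

        private boolean isLogAggregationFinished() {
            return logState != LogAggregationState.RUNNING;
        }

        // APP_COMPLETED is sent only when the app is final AND log aggregation is done.
        private void sendAppCompletedIfReady() {
            if (state == AppState.FINISHED && isLogAggregationFinished() && !completedSent) {
                completedSent = true;
                manager.handleAppCompleted(appId);
            }
        }
    }

    public static void main(String[] args) {
        AppManager manager = new AppManager();
        App app = new App("app_1", manager);

        app.finalTransition();                     // app is FINISHED, logs still aggregating
        System.out.println(manager.completedApps); // prints []

        app.aggregateLogReport(LogAggregationState.SUCCEEDED);
        System.out.println(manager.completedApps); // prints [app_1]
    }
}
```

Both entry points go through the same check, so the event is dispatched by 
whichever of the two conditions becomes true last.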




> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-4946.001.patch, YARN-4946.002.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

