[jira] [Comment Edited] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-08-02 Thread Steven Rand (JIRA)


[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898599#comment-16898599 ]

Steven Rand edited comment on YARN-4946 at 8/2/19 6:17 AM:
---

I noticed after upgrading a cluster to 3.2.0 that RM recovery now takes about 
20 minutes, whereas before it took less than one minute.

I checked the RM's logs and noticed that it hits the code path added in this 
patch more than 18 million times:
{code:java}
# Log rotation keeps only 20 log files, so this count understates the real total.
$ grep 'but not removing' hadoop--resourcemanager-.log* | wc -l
18092893
{code}
I checked in ZK, and according to {{./zkCli.sh ls 
/rmstore/ZKRMStateRoot/RMAppRoot}}, I have 9,755 apps in the RM state store, 
even though the configured max is 1,000.

I think that what happens when RM recovery starts is:
 * Some number of apps in the state store cause us to handle an 
{{APP_COMPLETED}} event during recovery. I'm not sure exactly how many – 
presumably just those that are finished?
 * Each time we handle one of these events, we call 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}}, 
and in both cases we realize that there are more apps both in ZK and in memory 
than is allowed (limit for both is 1,000).
 * So for each of these events, we go through the for loops in both 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}} 
that try to remove apps from ZK and from memory.
 * For whatever reason – probably a separate issue on this cluster – log 
aggregation isn't complete for any of these apps, so the for loops never 
manage to delete anything. And since the loops iterate in a deterministic 
order, they consider the same apps every time and never make progress (see the 
sketch below).
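
To make the loop behavior concrete, here is a minimal sketch of the shape of 
those loops. This is not the actual {{RMAppManager}} code: the class, the 
{{AppInfo}} type, and the log message are simplified assumptions.
{code:java}
import java.util.Deque;
import java.util.Iterator;

// Hypothetical sketch only; names are simplified assumptions, not the
// real RMAppManager implementation.
class CompletedAppCleaner {

  interface AppInfo {
    String getId();
    boolean isLogAggregationFinished();
  }

  // Runs once per APP_COMPLETED event handled during recovery.
  void removeCompletedApps(Deque<AppInfo> completedApps, int maxCompletedApps) {
    // On this cluster: 9,755 - 1,000 = 8,755 excess apps per invocation.
    int excess = completedApps.size() - maxCompletedApps;
    Iterator<AppInfo> it = completedApps.iterator();
    for (int i = 0; i < excess && it.hasNext(); i++) {
      AppInfo app = it.next();
      if (!app.isLogAggregationFinished()) {
        // Deterministic iteration order plus stuck log aggregation means
        // the same apps are skipped, and logged, on every event.
        System.out.println("but not removing app " + app.getId()
            + " because log aggregation is not finished");
        continue;
      }
      it.remove(); // never reached when aggregation is stuck for every app
    }
  }
}
{code}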

And I think the repetition of these for loops for each {{APP_COMPLETED}} event 
explains the 18 million number – if we can have at most 9,755 finished apps in 
the state store, and for each of those apps we trigger 2 for loops that can 
have at most 8,755 iterations, we very quickly wind up with a lot of iterations.
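
As a rough upper bound: 9,755 events, each triggering 2 loops, each loop 
skipping up to 9,755 - 1,000 = 8,755 apps, gives 9,755 × 2 × 8,755 ≈ 171 
million iterations, so 18 million logged lines surviving in only 20 rotated 
log files is entirely consistent with this.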

Because this change can lead to much longer RM recovery times in circumstances 
like this one, I think I prefer option {{a}} of the two listed above.

Alternatively, I think it's also reasonable to modify the patch from YARN-9571 
to use a hardcoded TTL.


[jira] [Comment Edited] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2018-08-02 Thread Szilard Nemeth (JIRA)


[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566890#comment-16566890 ]

Szilard Nemeth edited comment on YARN-4946 at 8/2/18 3:27 PM:
--

DEV NOTES: 
The initial implementation could have looked like this: the very first step of 
the transition would be to check whether log aggregation has finished, and if 
it hasn't, do nothing and return from the method.

To make sure apps still become completed once log aggregation finishes, the 
APP_COMPLETED event needs to be dispatched when log aggregation finishes.
In my understanding, this is the sequence of events:
1. RM receives an NM heartbeat in {{ResourceTrackerService.nodeUpdate}}
2. An {{RMNodeEvent}} is created with type {{STATUS_UPDATE}}
3. {{RMNodeImpl.StatusUpdateWhenHealthyTransition.transition}} handles the node 
status update
4. If there are any log aggregation reports, 
{{RMNode#handleLogAggregationStatus}} is called
5. This ultimately calls {{RMApp#aggregateLogReport}}
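
Condensed into code, steps 4 and 5 might look roughly like the sketch below; 
the wrapper class is hypothetical, the signatures are simplified, and the real 
{{RMNodeImpl}} logic carries more bookkeeping.
{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.api.protocolrecords.LogAggregationReport;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;

// Hypothetical wrapper, not the real RMNodeImpl code.
class LogAggregationReportForwarder {
  private final RMContext context;
  private final NodeId nodeId;

  LogAggregationReportForwarder(RMContext context, NodeId nodeId) {
    this.context = context;
    this.nodeId = nodeId;
  }

  // Step 4: called while handling a node status update carrying reports.
  void handleLogAggregationStatus(List<LogAggregationReport> reports) {
    for (LogAggregationReport report : reports) {
      RMApp app = context.getRMApps().get(report.getApplicationId());
      if (app != null) {
        // Step 5: forward the per-node report to the application object.
        app.aggregateLogReport(nodeId, report);
      }
    }
  }
}
{code}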

In {{RMApp#aggregateLogReport}}, I needed to check whether log aggregation has 
finished and then send the APP_COMPLETED event.

An issue with this approach:
If a {{FinalTransition}} runs because the app was killed, finished, or 
rejected, e.g. {{RMAppImpl}} goes from the RUNNING to the FINISHED state 
({{RMAppEventType.ATTEMPT_FINISHED}}), then no matter what happens in 
{{FinalTransition}}, the app reaches a terminal state (FINISHED in this case).
If I used a break statement as described above, the app would be left in the 
FINISHED state, which is not right, because the rest of the code in the 
transition could never run.
So with my implementation, all the code in {{FinalTransition}} runs as before, 
and if log aggregation has not finished yet, I don't send the APP_COMPLETED 
event to the {{RMAppManager}}.
When log aggregation finishes for an app, {{RMAppImpl#aggregateLogReport}} is 
called. In this method, I added a piece of code that sends the APP_COMPLETED 
event to the {{RMAppManager}} if the application is in a final state.
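
A minimal sketch of that check follows; the helper methods are assumed 
placeholders, not the real {{RMAppImpl}} API.
{code:java}
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.api.protocolrecords.LogAggregationReport;

// Hypothetical sketch of the change described above.
class RMAppImplSketch {

  public void aggregateLogReport(NodeId nodeId, LogAggregationReport report) {
    recordReport(nodeId, report); // existing bookkeeping, unchanged

    // New: once log aggregation is done and the app has already gone
    // through FinalTransition, tell the RMAppManager the app can be
    // treated as completed.
    if (isInFinalState() && isLogAggregationFinished()) {
      sendAppCompletedEventToRMAppManager();
    }
  }

  // Placeholders standing in for the real implementation details.
  private void recordReport(NodeId nodeId, LogAggregationReport report) {}
  private boolean isInFinalState() { return false; }
  private boolean isLogAggregationFinished() { return false; }
  private void sendAppCompletedEventToRMAppManager() {}
}
{code}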



> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed 
