[
https://issues.apache.org/jira/browse/YARN-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763670#comment-13763670
]
Zhijie Shen commented on YARN-1149:
-----------------------------------
Conducted some investigation on the problem:
1. The following transition seems to be unnecessary, because
APPLICATION_LOG_HANDLING_FINISHED cannot be emitted until after
APPLICATION_STARTED has been handled. By then the Application is already at
INITING, so the event should never arrive at NEW.
{code}
+ .addTransition(ApplicationState.NEW, ApplicationState.FINISHED,
+ ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED,
+ new AppShutDownTransition())
{code}
2. The following message seems not to cover all the cases:
{code}
+ LOG.info("Application " + app.getAppId() +
+ " is shutted down since NodeManager has been killed.");
{code}
In the normal case, APPLICATION_LOG_HANDLING_FINISHED is emitted after
APPLICATION_FINISHED is handled, when Application is already at FINISHED. There
are two exceptions:
(a) The NM is stopping, and the running log aggregation job is signaled to stop
early. In this case, the log message makes sense.
(b) The running log aggregation job is interrupted. See the following code:
{code}
while (!this.appFinishing.get()) {
  synchronized(this) {
    try {
      wait(THREAD_SLEEP_TIME);
    } catch (InterruptedException e) {
      LOG.warn("PendingContainers queue is interrupted");
      this.appFinishing.set(true);
    }
  }
}
{code}
In this case, the NodeManager has not necessarily been killed, so the message
seems not to be correct.
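To illustrate the distinction between the two exceptional cases, here is a minimal sketch of one possible approach: record why the wait loop ended so that the log message can match the actual cause. The class AggregatorWaitSketch, the StopReason enum, and the finish/awaitFinish/describe methods are all hypothetical names made up for this example. They are not part of AppLogAggregatorImpl or of the patch.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the aggregator's wait loop; not the real AppLogAggregatorImpl.
class AggregatorWaitSketch {
    // Assumed set of reasons for which the wait loop can end.
    enum StopReason { APP_FINISHED, NM_STOPPING, INTERRUPTED }

    private final AtomicBoolean appFinishing = new AtomicBoolean(false);
    private final AtomicReference<StopReason> stopReason = new AtomicReference<>();

    // Ends the wait. Only the first reported reason is kept.
    synchronized void finish(StopReason reason) {
        stopReason.compareAndSet(null, reason);
        appFinishing.set(true);
        notifyAll();
    }

    // Same shape as the loop quoted above, but the interrupt is recorded as its own reason.
    StopReason awaitFinish(long sleepMillis) {
        while (!appFinishing.get()) {
            synchronized (this) {
                try {
                    wait(sleepMillis);
                } catch (InterruptedException e) {
                    finish(StopReason.INTERRUPTED);
                    // Restore the interrupt status for callers further up the stack.
                    Thread.currentThread().interrupt();
                }
            }
        }
        return stopReason.get();
    }

    // Picks a log message that matches the recorded reason.
    static String describe(String appId, StopReason reason) {
        switch (reason) {
            case NM_STOPPING:
                return "Application " + appId + " is shut down since NodeManager is stopping.";
            case INTERRUPTED:
                return "Application " + appId + " is shut down since log aggregation was interrupted.";
            default:
                return "Application " + appId + " finished log handling.";
        }
    }
}
```

With something along these lines, the shutdown transition could log describe(appId, reason) instead of a single fixed message. How the reason would actually be carried from the aggregator to the Application event is left open here.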
3. Should we do the following in AppShutDownTransition as well? Once
APPLICATION_LOG_HANDLING_FINISHED is consumed by this transition, there will be
no FINISHED->FINISHED transition on APPLICATION_LOG_HANDLING_FINISHED, so the
app will never be removed from the context.
{code}
app.context.getApplications().remove(appId);
app.aclsManager.removeApplication(appId);
{code}
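The concern in point 3 can be sketched with a toy model, assuming a much simplified application object and context. AppShutdownCleanupSketch, its Context and App classes, and the handle method are hypothetical and only mimic the shape of the real ApplicationImpl state machine. The point is that whichever transition consumes APPLICATION_LOG_HANDLING_FINISHED also has to do the cleanup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the cleanup concern; none of these types are the real NodeManager classes.
class AppShutdownCleanupSketch {
    enum State { NEW, INITING, RUNNING, FINISHED }
    enum Event { APPLICATION_LOG_HANDLING_FINISHED }

    // Hypothetical stand-in for the NM context that tracks live applications.
    static final class Context {
        final Map<String, App> applications = new ConcurrentHashMap<>();
    }

    static final class App {
        final String appId;
        final Context context;
        State state;

        App(String appId, Context context, State initial) {
            this.appId = appId;
            this.context = context;
            this.state = initial;
            context.applications.put(appId, this);
        }

        void handle(Event event) {
            if (event != Event.APPLICATION_LOG_HANDLING_FINISHED) {
                return;
            }
            if (state == State.NEW) {
                // Per point 1, this event is not expected before INITING.
                throw new IllegalStateException("Invalid event: " + event + " at " + state);
            }
            // The event is consumed here whether we come from FINISHED (normal path)
            // or from INITING/RUNNING (shutdown path), so the cleanup must happen here too.
            state = State.FINISHED;
            context.applications.remove(appId);
        }
    }
}
```

In this model an application that receives the event while still RUNNING ends up FINISHED and is no longer in the context, which is the behavior the two lines above are meant to guarantee.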
> NM throws InvalidStateTransitonException: Invalid event:
> APPLICATION_LOG_HANDLING_FINISHED at RUNNING
> -----------------------------------------------------------------------------------------------------
>
> Key: YARN-1149
> URL: https://issues.apache.org/jira/browse/YARN-1149
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ramya Sunil
> Assignee: Xuan Gong
> Fix For: 2.1.1-beta
>
> Attachments: YARN-1149.1.patch
>
>
> When nodemanager receives a kill signal when an application has finished
> execution but log aggregation has not kicked in,
> InvalidStateTransitonException: Invalid event:
> APPLICATION_LOG_HANDLING_FINISHED at RUNNING is thrown
> {noformat}
> 2013-08-25 20:45:00,875 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:finishLogAggregation(254)) - Application just
> finished : application_1377459190746_0118
> 2013-08-25 20:45:00,876 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:uploadLogsForContainer(105)) - Starting aggregate
> log-file for app application_1377459190746_0118 at
> /app-logs/foo/logs/application_1377459190746_0118/<host>_45454.tmp
> 2013-08-25 20:45:00,876 INFO logaggregation.LogAggregationService
> (LogAggregationService.java:stopAggregators(151)) - Waiting for aggregation
> to complete for application_1377459190746_0118
> 2013-08-25 20:45:00,891 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:uploadLogsForContainer(122)) - Uploading logs for
> container container_1377459190746_0118_01_000004. Current good log dirs are
> /tmp/yarn/local
> 2013-08-25 20:45:00,915 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:doAppLogAggregation(182)) - Finished aggregate
> log-file for app application_1377459190746_0118
> 2013-08-25 20:45:00,925 WARN application.Application
> (ApplicationImpl.java:handle(427)) - Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> APPLICATION_LOG_HANDLING_FINISHED at RUNNING
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:425)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:59)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:697)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:689)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
> at java.lang.Thread.run(Thread.java:662)
> 2013-08-25 20:45:00,926 INFO application.Application
> (ApplicationImpl.java:handle(430)) - Application
> application_1377459190746_0118 transitioned from RUNNING to null
> 2013-08-25 20:45:00,927 WARN monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:run(463)) -
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> is interrupted. Exiting.
> 2013-08-25 20:45:00,938 INFO ipc.Server (Server.java:stop(2437)) - Stopping
> server on 8040
> {noformat}