[
https://issues.apache.org/jira/browse/YARN-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702456#comment-13702456
]
Omkar Vinit Joshi commented on YARN-299:
----------------------------------------
The patch looks good overall; however, we need one additional fix for a case
that might also occur. The root cause is more evident in the YARN-820 logs.
A container requests multiple resources, and RESOURCE_LOCALIZED /
RESOURCE_FAILED events might occur for one or more of those resources between
the time the container receives its first RESOURCE_FAILED event and the time it
deregisters itself from the remaining resources. Therefore we might see
RESOURCE_FAILED / RESOURCE_LOCALIZED events (for different resources) sent to
ContainerImpl when the container is already in the DONE state. So, like
RESOURCE_FAILED, we should also ignore the RESOURCE_LOCALIZED event there.
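To make the race concrete, here is a minimal, self-contained sketch of the proposed behavior. It is not the ContainerImpl state machine: the class LateEventSketch, its reduced set of states, and the pending-resource counter are all hypothetical simplifications made up for illustration. It only shows that once the container is DONE, late RESOURCE_LOCALIZED and RESOURCE_FAILED events for other resources are dropped instead of being treated as invalid transitions.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical toy model, NOT the real ContainerImpl.
class LateEventSketch {
    enum State { LOCALIZING, LOCALIZED, DONE }
    enum Event { RESOURCE_LOCALIZED, RESOURCE_FAILED }

    // Events that can still arrive for other resources after the container is DONE.
    static final Set<Event> IGNORED_AT_DONE =
        EnumSet.of(Event.RESOURCE_LOCALIZED, Event.RESOURCE_FAILED);

    private State state = State.LOCALIZING;
    private int pendingResources;

    LateEventSketch(int pendingResources) {
        this.pendingResources = pendingResources;
    }

    State getState() {
        return state;
    }

    void handle(Event event) {
        switch (state) {
            case LOCALIZING:
                if (event == Event.RESOURCE_FAILED) {
                    // Simplification: the first failure moves straight to DONE.
                    state = State.DONE;
                } else if (--pendingResources == 0) {
                    state = State.LOCALIZED;
                }
                break;
            case DONE:
                if (IGNORED_AT_DONE.contains(event)) {
                    return; // late event for another resource: ignore it
                }
                throw new IllegalStateException("Invalid event: " + event + " at " + state);
            default:
                throw new IllegalStateException("Invalid event: " + event + " at " + state);
        }
    }
}
```

With three pending resources, a RESOURCE_FAILED followed by a late RESOURCE_LOCALIZED and a second RESOURCE_FAILED leaves the sketch in DONE without throwing.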
I could see one more issue in the logs, and it would be great to fix it as
part of this JIRA since it looks like a quick change. The LOG.info call here
invokes toString on LocalizedResource, which is not thread-safe for ref
(a LinkedList is used internally). I guess grabbing the write lock inside
toString will protect it from such exceptions. We need to check the other
state machines as well.
{code}
} catch (ExecutionException e) {
  // The string concatenation implicitly calls LocalizedResource.toString()
  LOG.info("Failed to download rsrc " + assoc.getResource(),
      e.getCause());
  LocalResourceRequest req = assoc.getResource().getRequest();
  publicRsrc.handle(new ResourceFailedLocalizationEvent(req,
      e.getMessage()));
  assoc.getResource().unlock();
{code}
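A rough sketch of the toString idea follows. GuardedResource is a hypothetical stand-in, not the real LocalizedResource; its fields and the addRef method are assumptions made up for illustration. This sketch takes the read lock in toString, which is enough as long as every mutation of the list happens under the write lock of the same ReentrantReadWriteLock; taking the write lock as suggested above would work too, at the cost of serializing concurrent readers.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a resource that tracks references in a LinkedList.
class GuardedResource {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> ref = new LinkedList<>();
    private final String path;

    GuardedResource(String path) {
        this.path = path;
    }

    // All mutations of the list go through the write lock.
    void addRef(String containerId) {
        lock.writeLock().lock();
        try {
            ref.add(containerId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Iterating the list while building the string is guarded by the read lock,
    // so a concurrent addRef cannot modify the list mid-iteration.
    @Override
    public String toString() {
        lock.readLock().lock();
        try {
            return "{ " + path + ", refs=" + ref + " }";
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

A logging call such as LOG.info("Failed to download rsrc " + resource) would then go through the guarded toString without any change at the call site.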
Any thoughts?
> Node Manager throws
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> RESOURCE_FAILED at DONE
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-299
> URL: https://issues.apache.org/jira/browse/YARN-299
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.0.1-alpha, 2.0.0-alpha
> Reporter: Devaraj K
> Assignee: Mayank Bansal
> Attachments: YARN-299-trunk-1.patch
>
>
> {code:xml}
> 2012-12-31 10:36:27,844 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Can't handle this event at current state: Current: [DONE], eventType: [RESOURCE_FAILED]
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at DONE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:819)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:71)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:504)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:497)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> 	at java.lang.Thread.run(Thread.java:662)
> 2012-12-31 10:36:27,845 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1356792558130_0002_01_000001 transitioned from DONE to null
> {code}