[
https://issues.apache.org/jira/browse/YARN-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022203#comment-16022203
]
Rohith Sharma K S commented on YARN-6555:
-----------------------------------------
bq. Do you think we should preserve as much flow context information as
possible? The patch stores flow context in the state store only if all
three fields of flow context are present. We could sanitize the flow context,
fill in default values for whatever field is missing, and then just check that
flowContext != null before storing application state.
My 2 cents:
# IMO, we should NOT set default values for flow context. There are 2 cases:
## Master container launched: the RM sets the flow context in the container
launch context and starts it. This needs to be recovered during NM restart.
## AM launches containers: flow context details are not set, so there is
nothing to store or recover during NM restart, and doing so would serve no purpose.
# The additional null check for strings before creating the proto is there
because the proto's string setters throw an NPE if flowName or flowVersion is null.
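
A minimal sketch of the null check described in the second point, not the code from the patch. The {{FlowContext}} class, its three fields, the {{FlowContextProtoBuilder}} stand-in and the {{toProto}} helper are all hypothetical names chosen for illustration. The only behaviour taken as given is that generated protobuf string setters reject null, which the stand-in builder mimics with {{Objects.requireNonNull}}.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical stand-ins for illustration; not the actual NodeManager classes.
final class FlowContextGuardSketch {

  /** Assumed shape of a flow context: two nullable strings and a run id. */
  static final class FlowContext {
    final String flowName;
    final String flowVersion;
    final long flowRunId;

    FlowContext(String flowName, String flowVersion, long flowRunId) {
      this.flowName = flowName;
      this.flowVersion = flowVersion;
      this.flowRunId = flowRunId;
    }
  }

  /** Mimics a generated protobuf builder: string setters throw NPE on null. */
  static final class FlowContextProtoBuilder {
    private String flowName;
    private String flowVersion;
    private long flowRunId;

    FlowContextProtoBuilder setFlowName(String value) {
      this.flowName = Objects.requireNonNull(value);
      return this;
    }

    FlowContextProtoBuilder setFlowVersion(String value) {
      this.flowVersion = Objects.requireNonNull(value);
      return this;
    }

    FlowContextProtoBuilder setFlowRunId(long value) {
      this.flowRunId = value;
      return this;
    }

    /** Returns a plain string standing in for the serialized proto. */
    String build() {
      return flowName + "/" + flowVersion + "/" + flowRunId;
    }
  }

  /**
   * Builds the stand-in proto only when the context and both strings are
   * non-null. No defaults are filled in, so an absent flow context simply
   * results in nothing being stored.
   */
  static Optional<String> toProto(FlowContext ctx) {
    if (ctx == null || ctx.flowName == null || ctx.flowVersion == null) {
      return Optional.empty();
    }
    return Optional.of(new FlowContextProtoBuilder()
        .setFlowName(ctx.flowName)
        .setFlowVersion(ctx.flowVersion)
        .setFlowRunId(ctx.flowRunId)
        .build());
  }

  public static void main(String[] args) {
    // Complete context: a value is produced and could be stored.
    System.out.println(toProto(new FlowContext("wordcount", "1", 42L)).isPresent());
    // Missing flowName: skipped instead of hitting the NPE in the setter.
    System.out.println(toProto(new FlowContext(null, "1", 42L)).isPresent());
  }
}
```

Without the guard, the second call would fail inside {{setFlowName}} rather than being skipped.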
bq. FlowContext.toString(). Can we do something like {k1=v1, k2=v2, k3=v3} for
better readability in the log?
Makes sense. I will change it in the next patch after Vrushali reviews it.
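
For reference, a sketch of what the suggested {{toString()}} format could look like. The class name and field names here are assumptions for illustration, not the actual FlowContext implementation.

```java
// Hypothetical illustration of the {k1=v1, k2=v2, k3=v3} log format.
final class FlowContextToStringSketch {
  private final String flowName;
  private final String flowVersion;
  private final long flowRunId;

  FlowContextToStringSketch(String flowName, String flowVersion, long flowRunId) {
    this.flowName = flowName;
    this.flowVersion = flowVersion;
    this.flowRunId = flowRunId;
  }

  @Override
  public String toString() {
    // Key=value pairs inside braces, separated by ", ".
    return "{flowName=" + flowName
        + ", flowVersion=" + flowVersion
        + ", flowRunId=" + flowRunId + "}";
  }

  public static void main(String[] args) {
    // prints {flowName=wordcount, flowVersion=1, flowRunId=42}
    System.out.println(new FlowContextToStringSketch("wordcount", "1", 42L));
  }
}
```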
> Enable flow context read (& corresponding write) for recovering application
> with NM restart
> --------------------------------------------------------------------------------------------
>
> Key: YARN-6555
> URL: https://issues.apache.org/jira/browse/YARN-6555
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-5355, YARN-5355-branch-2, 3.0.0-alpha3
> Reporter: Vrushali C
> Assignee: Rohith Sharma K S
> Attachments: YARN-6555.001.patch, YARN-6555.002.patch
>
>
> If timeline service v2 is enabled and NM is restarted with recovery enabled,
> then NM fails to start and throws the error "flow context cannot be null".
> This happens because the flow context did not exist before, but now that
> timeline service v2 is enabled, ApplicationImpl expects it to exist.
> This would also happen even if the flow context existed before: since we are
> not persisting it or reading it during
> ContainerManagerImpl#recoverApplication, it does not get passed in to
> ApplicationImpl.
> Full stack trace:
> {code}
> 2017-05-03 21:51:52,178 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
> java.lang.IllegalArgumentException: flow context cannot be null
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.<init>(ApplicationImpl.java:104)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.<init>(ApplicationImpl.java:90)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverApplication(ContainerManagerImpl.java:318)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:280)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:267)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:276)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:588)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:649)
> {code}