[
https://issues.apache.org/jira/browse/YARN-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Morty Zhong updated YARN-9380:
------------------------------
Description:
When recovery is enabled, FederationInterceptor rebuilds the map of containerId
to subClusterId (field containerIdToSubClusterIdMap) by fetching containers from
the RMs (home and secondary). However, this can fail in the following scenario,
where both the RM and the NM restart:
# The RM restarts (recovery is enabled) and recovers tokens and apps, but no
containers; it waits for the NMs to report their containers on resync.
# While the RM is waiting for the NM to resync, the NM itself restarts.
# Before resyncing with the RM, the NM recovers itself, and FederationInterceptor
pulls containers from the RM. At that moment the RM does not yet know about
containers on NMs that have not resynced, so the returned list omits them.
Could storing the containerId to subClusterId map in the NMStateStore solve this?
was:
When recovery is enabled, FederationInterceptor rebuilds the map of containerId
to subClusterId (field containerIdToSubClusterIdMap) by fetching containers from
the RMs (home and secondary). However, this can fail in the following scenario,
where both the RM and the NM restart:
# The RM restarts (recovery is enabled) and recovers tokens and apps, but no
containers; it waits for the NMs to report their containers on resync.
# While the RM is waiting for the NM to resync, the NM itself restarts.
# Before resyncing with the RM, the NM recovers itself, and FederationInterceptor
pulls containers from the RM. At that moment the RM has no containers, so an
empty list is returned.
Could storing the containerId to subClusterId map in the NMStateStore solve this?
Summary: FederationInterceptor get Containers from RM may return not
all the containers when RM/NM restart (was: FederationInterceptor get
Containers from RM may return empty list when RM/NM restart)
> FederationInterceptor get Containers from RM may return not all the
> containers when RM/NM restart
> -------------------------------------------------------------------------------------------------
>
> Key: YARN-9380
> URL: https://issues.apache.org/jira/browse/YARN-9380
> Project: Hadoop YARN
> Issue Type: Bug
> Components: federation
> Reporter: Morty Zhong
> Priority: Major
>
> When recovery is enabled, FederationInterceptor rebuilds the map of
> containerId to subClusterId (field containerIdToSubClusterIdMap) by fetching
> containers from the RMs (home and secondary). However, this can fail in the
> following scenario, where both the RM and the NM restart:
> # The RM restarts (recovery is enabled) and recovers tokens and apps, but no
> containers; it waits for the NMs to report their containers on resync.
> # While the RM is waiting for the NM to resync, the NM itself restarts.
> # Before resyncing with the RM, the NM recovers itself, and
> FederationInterceptor pulls containers from the RM. At that moment the RM
> does not yet know about containers on NMs that have not resynced, so the
> returned list omits them.
> Could storing the containerId to subClusterId map in the NMStateStore solve
> this?
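
The idea proposed in the description can be illustrated with a minimal,
self-contained sketch. It is not the FederationInterceptor or NMStateStore
code: the class name ContainerMapRecoverySketch, the recover method, and the
plain java.util.Map standing in for the persisted state are all hypothetical,
chosen only to show the merge. The assumption is that the containerId to
subClusterId map is persisted locally before the restart, so recovery can
combine it with whatever the RMs report at that moment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names are real Hadoop APIs. */
class ContainerMapRecoverySketch {

    /**
     * Rebuilds containerId -> subClusterId from two sources:
     * the mapping persisted locally before the restart (assumed to exist),
     * and the containers the RMs currently report (possibly incomplete).
     */
    static Map<String, String> recover(Map<String, String> persisted,
                                       Map<String, String> reportedByRms) {
        // Start from the persisted view so containers the RMs do not know
        // about yet are not lost.
        Map<String, String> recovered = new HashMap<>(persisted);
        // Entries reported by the RMs overwrite persisted ones.
        recovered.putAll(reportedByRms);
        return recovered;
    }

    public static void main(String[] args) {
        // State assumed to have been saved before the NM restarted.
        Map<String, String> persisted = new HashMap<>();
        persisted.put("container_A", "subcluster-home");
        persisted.put("container_B", "subcluster-secondary");

        // The RM has restarted and only knows container_A so far, because
        // the NM hosting container_B has not resynced yet.
        Map<String, String> reportedByRms = new HashMap<>();
        reportedByRms.put("container_A", "subcluster-home");

        Map<String, String> recovered = recover(persisted, reportedByRms);
        System.out.println("recovered entries: " + recovered.size());
        System.out.println("container_B -> " + recovered.get("container_B"));
    }
}
```

With only the RM-reported map, container_B would be missing after recovery;
merging in the persisted map keeps it. Whether RM-reported entries should
override persisted ones is a design choice made here for illustration only.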