[ https://issues.apache.org/jira/browse/YARN-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446382#comment-16446382 ]
Eric Yang commented on YARN-7939:
---------------------------------

After reverting YARN-7973, the error messages disappeared, and I see that the container started a new instance, which is running. However, the existing instance was not shut down. The AM's log doesn't show that a new container has been allocated, and the RM doesn't show a new allocation either. I see this on the node:

{code}
hbase 8413 0.0 0.0 15060 1500 ? Ss 17:45 0:00 /bin/bash -c sleep 90000 1>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524245796717_0002/container_1524245796717_0002_01_000004/stdout.txt 2>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524245796717_0002/container_1524245796717_0002_01_000004/stderr.txt
hbase 8435 0.0 0.0 7712 604 ? S 17:45 0:00 sleep 90000
hbase 8820 0.0 0.0 115244 1460 ? Ss 20:21 0:00 /bin/bash -c sleep 1200000 1>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524248642708_0001/container_1524248642708_0001_01_000002/stdout.txt 2>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524248642708_0001/container_1524248642708_0001_01_000002/stderr.txt
{code}

In the current implementation, the AM is notified of changes only after the operations are done. If the change was not successful, or something fails in the middle, the AM is stuck in a component instance upgrade.

We might need a timer that starts at the point the container is instructed to perform the upgrade and waits for a timeout value. If the stop and start do not come back within a reasonable timeframe, a new instance should be launched to replace the lost instance. This avoids getting stuck in the middle if the node manager did not report back with a successful state, or if the node manager was lost during the upgrade. It would make the upgrade framework more robust and solve the problem that I encountered.
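
To make the timer idea concrete, here is a minimal sketch of the bookkeeping it would need. It is not taken from any YARN patch: the class `UpgradeWatchdog` and its methods are hypothetical names, and the caller supplies the current time in milliseconds so the logic can be checked without real timers. It only tracks when an instance was told to upgrade and reports the ones that have not confirmed within the timeout. Launching the replacement instance is left to the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch, not YARN code: remembers when each component
 * instance was instructed to upgrade and reports those that have not
 * confirmed within the timeout.
 */
class UpgradeWatchdog {
    private final long timeoutMs;
    // container id -> time (ms) at which the upgrade instruction was sent
    private final Map<String, Long> pending = new HashMap<>();

    UpgradeWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Record that the container was instructed to upgrade at nowMs. */
    synchronized void upgradeStarted(String containerId, long nowMs) {
        pending.put(containerId, nowMs);
    }

    /** The stop and start came back successfully; stop tracking it. */
    synchronized void upgradeConfirmed(String containerId) {
        pending.remove(containerId);
    }

    /**
     * Remove and return the containers whose upgrade has been pending for
     * at least timeoutMs. The caller would launch replacement instances.
     */
    synchronized List<String> drainExpired(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

In this sketch the AM would call `upgradeStarted` when it sends the upgrade instruction, `upgradeConfirmed` when the node manager reports success, and `drainExpired` periodically. Each returned ID is reported once, so a replacement is not requested twice for the same lost instance.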
> Yarn Service Upgrade: add support to upgrade a component instance
> ------------------------------------------------------------------
>
>                 Key: YARN-7939
>                 URL: https://issues.apache.org/jira/browse/YARN-7939
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>         Attachments: YARN-7939.001.patch, YARN-7939.002.patch,
> YARN-7939.003.patch, YARN-7939.004.patch, YARN-7939.005.patch,
> YARN-7939.006.patch, YARN-7939.007.patch, YARN-7939.008.patch, serviceam.log
>
>
> Yarn core supports in-place upgrade of containers. A yarn service can
> leverage that to provide in-place upgrade of component instances. Please see
> YARN-7512 for details.
> Will add support to upgrade a single component instance first and then
> iteratively add other APIs and features.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)