[ 
https://issues.apache.org/jira/browse/YARN-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446382#comment-16446382
 ] 

Eric Yang commented on YARN-7939:
---------------------------------

After reverting YARN-7973, the error messages disappeared, and I can see that 
the container started a new instance, which is running.  However, the existing 
instance was not shut down.

The AM's log doesn't show that a new container has been allocated, and the RM 
doesn't show a new allocation either.  I see this on the node:

{code}
hbase     8413  0.0  0.0  15060  1500 ?        Ss   17:45   0:00 /bin/bash -c sleep 90000 1>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524245796717_0002/container_1524245796717_0002_01_000004/stdout.txt 2>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524245796717_0002/container_1524245796717_0002_01_000004/stderr.txt
hbase     8435  0.0  0.0   7712   604 ?        S    17:45   0:00 sleep 90000
hbase     8820  0.0  0.0 115244  1460 ?        Ss   20:21   0:00 /bin/bash -c sleep 1200000 1>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524248642708_0001/container_1524248642708_0001_01_000002/stdout.txt 2>/usr/local/hadoop-3.2.0-SNAPSHOT/logs/userlogs/application_1524248642708_0001/container_1524248642708_0001_01_000002/stderr.txt
{code}

In the current implementation, the AM is only notified of changes after the 
operations are done.  If the change was not successful, or something fails in 
the middle, the AM is stuck in a component instance upgrade.  We might need a 
timer that starts when the container is instructed to perform the upgrade and 
waits for a timeout value.  If the stop and start do not come back within a 
reasonable timeframe, a new instance should be launched to replace the lost 
instance.  This would avoid getting stuck midway if the node manager did not 
report back a successful state, or if the node manager was lost during the 
upgrade.  It would increase the robustness of the upgrade framework and solve 
the problem that I encountered.
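
A minimal sketch of the timer idea, to make the proposal concrete.  This is 
not code from any of the attached patches: the class UpgradeTimeoutTracker, 
its method names, and the instance names are all hypothetical, and the caller 
is assumed to supply the clock and to decide how a replacement instance is 
actually launched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of the YARN service AM): tracks component
 * instances that were told to upgrade and reports the ones that have not
 * confirmed completion within a timeout.
 */
final class UpgradeTimeoutTracker {
    private final long timeoutMillis;
    // instance name -> time (ms) at which the upgrade instruction was sent
    private final Map<String, Long> pending = new HashMap<>();

    UpgradeTimeoutTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Call at the point the container is instructed to perform the upgrade. */
    synchronized void upgradeRequested(String instance, long nowMillis) {
        pending.put(instance, nowMillis);
    }

    /** Call when the stop and start come back; true if it was being tracked. */
    synchronized boolean upgradeCompleted(String instance) {
        return pending.remove(instance) != null;
    }

    /**
     * Returns the instances whose upgrade has been pending for at least the
     * timeout and stops tracking them. The caller would then launch a new
     * instance for each one to replace the lost instance.
     */
    synchronized List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

The AM could call expire() from a periodic task and treat every returned 
instance as lost.  Passing the current time in explicitly, rather than reading 
the system clock inside the class, is only a choice to keep the sketch easy to 
test.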

> Yarn Service Upgrade: add support to upgrade a component instance 
> ------------------------------------------------------------------
>
>                 Key: YARN-7939
>                 URL: https://issues.apache.org/jira/browse/YARN-7939
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>         Attachments: YARN-7939.001.patch, YARN-7939.002.patch, 
> YARN-7939.003.patch, YARN-7939.004.patch, YARN-7939.005.patch, 
> YARN-7939.006.patch, YARN-7939.007.patch, YARN-7939.008.patch, serviceam.log
>
>
> Yarn core supports in-place upgrade of containers. A yarn service can 
> leverage that to provide in-place upgrade of component instances. Please see 
> YARN-7512 for details.
> Will add support to upgrade a single component instance first and then 
> iteratively add other APIs and features.
>  


