[ https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16601938#comment-16601938 ]
Tao Yang commented on YARN-8729:
--------------------------------

Thanks [~ebadger], [~cheersyang] for your replies.
{quote}
Setting isStopped to false before the whole NM startup has finished would introduce a race condition where the NM says it's running, but it isn't fully up yet.
{quote}
I think isStopped is used here as a switch that controls whether the async updater thread runs or stops, not as a state. It is therefore reasonable to set isStopped to false before starting the async updater thread, just as the current code sets isStopped to true to stop the updater thread.
{quote}
I'm not sure if there is a functional reason, since the statusUpdater thread loops on isStopped until it's false.
{quote}
The updater thread exits if isStopped is true; it does not wait. Related code in StatusUpdaterRunnable:
{noformat}
public void run() {
  int lastHeartbeatID = 0;
  while (!isStopped) {
    ...
  }
}
{noformat}
{quote}
could you please upload another patch to trigger jenkins?
{quote}
Done.

> Node status updater thread could be lost after it restarted
> -----------------------------------------------------------
>
>                 Key: YARN-8729
>                 URL: https://issues.apache.org/jira/browse/YARN-8729
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-8729.001.patch, YARN-8729.001.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed
> after the thread was restarted. In
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the
> isStopped flag is not set to false before {{statusUpdater.start()}} is
> executed, so if the new thread runs immediately and finds isStopped==true,
> it exits without logging anything.
> Key code in
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
> statusUpdater.join();
> registerWithRM();
> statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
> statusUpdater.start();
> this.isStopped = false; // this line should be moved before statusUpdater.start()
> LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}
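
The ordering problem described in the issue can be reproduced in isolation. What follows is only a minimal sketch under stated assumptions, not the NodeStatusUpdaterImpl code: the class StopFlagRestartSketch, the method restart, and the thread name are hypothetical. In the second branch, join() deliberately forces the worst-case interleaving, where the new thread checks the flag before the caller clears it, so that the outcome does not depend on scheduling.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class StopFlagRestartSketch {
    // Run/stop switch for the loop, playing the role of isStopped in the issue.
    // It starts as true, as it would be after the previous thread was stopped.
    private volatile boolean isStopped = true;

    /**
     * Restarts a hypothetical updater thread and reports whether its loop
     * body ran at least once.
     */
    public static boolean restart(boolean clearFlagBeforeStart)
            throws InterruptedException {
        StopFlagRestartSketch s = new StopFlagRestartSketch();
        CountDownLatch firstIteration = new CountDownLatch(1);

        Thread updater = new Thread(() -> {
            // Same shape as the loop quoted from StatusUpdaterRunnable:
            // if the flag is already true, the thread exits silently.
            while (!s.isStopped) {
                firstIteration.countDown();
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "Hypothetical Updater");

        if (clearFlagBeforeStart) {
            // Proposed order: open the switch, then start the thread.
            s.isStopped = false;
            updater.start();
            boolean ran = firstIteration.await(5, TimeUnit.SECONDS);
            s.isStopped = true; // stop the loop so the sketch terminates
            updater.join();
            return ran;
        } else {
            // Reported order: start the thread, then open the switch.
            updater.start();
            updater.join();      // force the worst case: thread runs first
            s.isStopped = false; // too late, the thread has already exited
            return firstIteration.getCount() == 0;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("flag cleared before start, loop ran: " + restart(true));
        System.out.println("flag cleared after start, loop ran: " + restart(false));
    }
}
```

With the flag cleared before start(), the loop is entered; with the reported order and the forced interleaving, the thread exits without running the loop body, which matches the silent exit described in the issue.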