[ https://issues.apache.org/jira/browse/YARN-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650823#comment-16650823 ]
Wangda Tan commented on YARN-8876:
----------------------------------

Thanks [~liuxun323],

The termination of each job should be handled by the AM; the Jira https://issues.apache.org/jira/browse/YARN-8489 is targeted at solving this issue. I think we can keep this Jira open in case we find any customized handling needed for TF jobs that is not covered by YARN-8489.

> [Submarine] Job monitor long-running service of submarine
> ---------------------------------------------------------
>
>                 Key: YARN-8876
>                 URL: https://issues.apache.org/jira/browse/YARN-8876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Xun Liu
>            Assignee: Xun Liu
>            Priority: Major
>
> h1. Job monitor long-running service of submarine
> After training, the monitoring program needs to automatically close the PS (parameter server) service. Other deep learning frameworks may also require custom processing when tasks are in different states.
> Submarine needs to provide a long-running resident service that monitors each job.
> This monitoring service can handle training tasks differently depending on the deep learning framework type.
> For example: when TensorFlow performs distributed training, the PS service cannot stop automatically once training completes. At that point, the PS needs to be actively stopped by the monitoring service.
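To make the monitoring idea in the issue description concrete, here is a minimal sketch of a per-framework policy that decides which components a monitor should stop. Everything in it is an assumption for illustration: the `State` enum, the `"worker"`/`"ps"` role names, `tensorflow_policy`, `POLICIES`, and `components_to_stop` are hypothetical names, not part of Submarine or of YARN-8489, and the sketch calls no YARN API. It only shows the decision logic; actually stopping a component would be up to the AM or the monitoring service.

```python
from enum import Enum


class State(Enum):
    # Hypothetical component states; real YARN service states may differ.
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    STOPPED = "stopped"


FINISHED = {State.SUCCEEDED, State.FAILED, State.STOPPED}


def tensorflow_policy(components):
    """Return names of PS components to stop once every worker has finished.

    `components` maps a component name to a (role, State) tuple.
    """
    worker_states = [state for role, state in components.values() if role == "worker"]
    # Do nothing while any worker is still active, or if there are no workers.
    if not worker_states or any(state not in FINISHED for state in worker_states):
        return []
    return sorted(
        name
        for name, (role, state) in components.items()
        if role == "ps" and state is State.RUNNING
    )


# Other frameworks could register their own policy here (assumption).
POLICIES = {"tensorflow": tensorflow_policy}


def components_to_stop(framework, components):
    """Dispatch to the framework-specific policy; unknown frameworks stop nothing."""
    policy = POLICIES.get(framework)
    return policy(components) if policy else []
```

A monitor loop would call `components_to_stop` periodically with the current job snapshot and request termination for whatever names it returns. Keeping the policy a pure function of the snapshot makes the per-framework behavior easy to test separately from any cluster interaction.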