[jira] [Commented] (YARN-9192) Deletion Tasks will be picked up to delete running containers
[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807236#comment-16807236 ]

Rayman commented on YARN-9192:
------------------------------

[~sihai] This is probably because you have set yarn.nodemanager.recovery.enabled to true and yarn.nodemanager.recovery.supervised to false.
[https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html]

> Deletion Tasks will be picked up to delete running containers
> --------------------------------------------------------------
>
>                 Key: YARN-9192
>                 URL: https://issues.apache.org/jira/browse/YARN-9192
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>    Affects Versions: 2.9.1
>            Reporter: Sihai Ke
>            Priority: Major
>
> I suspect there is a bug in the YARN deletion task service; below are my repro steps:
> # First, set yarn.nodemanager.delete.debug-delay-sec=3600, which means that when an app finishes, its binary/container folders are deleted 3600 seconds later.
> # While application App1 (a long-running service) is running on machine1 and machine1 shuts down, ContainerManagerImpl#serviceStop() is called -> ContainerManagerImpl#cleanUpApplicationsOnNMShutDown, an ApplicationFinishEvent is sent, and deletion tasks are created; they are stored in the DB and will be picked up for execution after 3600 seconds.
> # 100 seconds later, machine1 comes back, the same app is assigned to run on this machine, and its containers are created and work well.
> # The deletion tasks created in step 2 are later picked up and delete the containers created in step 3.
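For reference, a minimal sketch of the configuration combination discussed above, assembled with a plain Hadoop Configuration object. The property names come from the comment and the repro steps; the values merely reproduce the reported setup and are not a recommendation:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class NmRecoveryConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Property names taken from the comment/repro above; values are illustrative only.
        // Work-preserving NM restart is enabled...
        conf.setBoolean("yarn.nodemanager.recovery.enabled", true);

        // ...but the NM is told it is NOT supervised, so on shutdown it cleans up
        // applications (cleanUpApplicationsOnNMShutDown) and schedules deletion tasks.
        conf.setBoolean("yarn.nodemanager.recovery.supervised", false);

        // Deletions are deferred by an hour, so recovered tasks fire long after the
        // restarted NM has launched new containers for the same application.
        conf.setLong("yarn.nodemanager.delete.debug-delay-sec", 3600);

        System.out.println("recovery.enabled       = "
                + conf.getBoolean("yarn.nodemanager.recovery.enabled", false));
        System.out.println("recovery.supervised    = "
                + conf.getBoolean("yarn.nodemanager.recovery.supervised", false));
        System.out.println("delete.debug-delay-sec = "
                + conf.getLong("yarn.nodemanager.delete.debug-delay-sec", 0));
    }
}
{code}

Per the linked NodeManager documentation, yarn.nodemanager.recovery.supervised is meant to be set to true only when something external (e.g., a supervisor process) restarts the NM immediately; in that mode the NM skips the shutdown-time application cleanup that creates these deferred deletion tasks.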
[jira] [Issue Comment Deleted] (YARN-9192) Deletion Tasks will be picked up to delete running containers
[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rayman updated YARN-9192:
-------------------------
    Comment: was deleted

(was: I'm observing a similar issue when running Samza over YARN. When bouncing an NM, the NM being killed writes LevelDB state for the deletion-service to act on. The "new" NM reads it and acts upon it, but ends up deleting directories for running containers. This happens when containers are long-running and are placed on a fixed host.
I also observed this in the log:
*[INFO] [shutdown-hook-0] containermanager.ContainerManagerImpl.cleanUpApplicationsOnNMShutDown(ContainerManagerImpl.java:718) - Waiting for Applications to be Finished*)
[jira] [Comment Edited] (YARN-9192) Deletion Tasks will be picked up to delete running containers
[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807102#comment-16807102 ]

Rayman edited comment on YARN-9192 at 4/1/19 9:33 PM:
------------------------------------------------------

I'm observing a similar issue when running Samza over YARN. When bouncing an NM, the NM being killed writes LevelDB state for the deletion-service to act on. The "new" NM reads it and acts upon it, but ends up deleting directories for running containers. This happens when containers are long-running and are placed on a fixed host.

I also observed this in the log:
*[INFO] [shutdown-hook-0] containermanager.ContainerManagerImpl.cleanUpApplicationsOnNMShutDown(ContainerManagerImpl.java:718) - Waiting for Applications to be Finished*

was (Author: rayman7718):
I'm observing a similar issue when running Samza over YARN. When bouncing an NM, the NM being killed writes LevelDB state for the deletion-service to act on. The "new" NM reads it and acts upon it, but ends up deleting directories for running containers. This happens when containers are long-running and are placed on a fixed host.
[jira] [Commented] (YARN-9192) Deletion Tasks will be picked up to delete running containers
[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807102#comment-16807102 ]

Rayman commented on YARN-9192:
------------------------------

I'm observing a similar issue when running Samza over YARN. When bouncing an NM, the NM being killed writes LevelDB state for the deletion-service to act on. The "new" NM reads it and acts upon it, but ends up deleting directories for running containers. This happens when containers are long-running and are placed on a fixed host.
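To make the failure mode described above concrete: the danger is that a deletion task recovered from the NM's LevelDB state targets a directory that a freshly relaunched container is now using. Below is a minimal, hypothetical sketch of the kind of guard that would avoid that; the class and method names are invented for illustration and are not part of YARN's actual DeletionService or state-store API:

{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

// Hypothetical guard: skip recovered deletion tasks whose target overlaps a live container dir.
class RecoveredDeletionGuard {
    private final Set<Path> activeContainerDirs;  // populated from containers still known to be running

    RecoveredDeletionGuard(Set<Path> activeContainerDirs) {
        this.activeContainerDirs = activeContainerDirs;
    }

    /** Returns true only if the recovered target is not inside (or equal to) any active container dir. */
    boolean safeToDelete(String recoveredTarget) {
        Path target = Paths.get(recoveredTarget).toAbsolutePath().normalize();
        for (Path active : activeContainerDirs) {
            Path activeDir = active.toAbsolutePath().normalize();
            if (target.startsWith(activeDir) || activeDir.startsWith(target)) {
                return false;  // deleting this path would remove data a running container still needs
            }
        }
        return true;
    }
}
{code}

In this sketch, safeToDelete would be consulted once per recovered deletion task, with activeContainerDirs built from whatever containers the restarted NM has recovered as still running.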
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773513#comment-16773513 ]

Rayman commented on YARN-3554:
------------------------------

The RetryUpToMaximumTimeWithFixedSleep policy takes a maxTime and a sleepTime as input, and internally is implemented as a RetryUpToMaximumCountWithFixedSleep with maxCount = maxTime / sleepTime.

This has a problem: it does not account for the time spent performing the actual retry attempt. For example, RetryUpToMaximumTimeWithFixedSleep with maxTime = 30 sec and sleepTime = 1 sec will take up to 90 seconds if each attempt (e.g., a connection timeout) takes 2 seconds to return: 30 * (2 + 1).

A policy claiming to be RetryUpToMaximumTimeWithFixedSleep should *actually* respect the *maximum time*, e.g., by recording a timestamp/timer.

> Default value for maximum nodemanager connect wait time is too high
> --------------------------------------------------------------------
>
>                 Key: YARN-3554
>                 URL: https://issues.apache.org/jira/browse/YARN-3554
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Naganarasimha G R
>            Priority: Major
>              Labels: BB2015-05-RFC, newbie
>             Fix For: 2.8.0, 2.7.1, 2.6.2, 3.0.0-alpha1
>
>         Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch
>
> The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes.
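A minimal sketch of the timestamp-based alternative suggested in the comment above: instead of precomputing maxCount = maxTime / sleepTime, the loop records its start time and stops once the wall-clock budget is spent, so the duration of each failed attempt counts against the maximum. This is a standalone illustration and deliberately does not plug into Hadoop's org.apache.hadoop.io.retry.RetryPolicy interface:

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Standalone sketch: retry with a fixed sleep, bounded by elapsed wall-clock time
// rather than by a precomputed attempt count (maxTime / sleepTime).
class DeadlineBoundedRetry {

    static <T> T callWithDeadline(Callable<T> op, long maxTimeMs, long sleepMs) throws Exception {
        final long start = System.nanoTime();
        Exception last = null;
        while (TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) < maxTimeMs) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;  // the attempt's own duration is implicitly charged against the budget
                long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                long remaining = maxTimeMs - elapsed;
                if (remaining <= 0) {
                    break;  // budget spent, do not sleep again
                }
                Thread.sleep(Math.min(sleepMs, remaining));
            }
        }
        throw last != null ? last : new IllegalStateException("no attempt made within " + maxTimeMs + " ms");
    }
}
{code}

With maxTime = 30 s and sleepTime = 1 s, this returns or gives up after roughly 30 seconds of wall-clock time even when each attempt takes 2 seconds to fail, instead of the 90 seconds described above.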