[jira] [Commented] (YARN-9192) Deletion Tasks will be picked up to delete running containers

2019-04-01 Thread Rayman (JIRA)


[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807236#comment-16807236 ]

Rayman commented on YARN-9192:
--

[~sihai]
This is probably because you have set yarn.nodemanager.recovery.enabled to true and yarn.nodemanager.recovery.supervised to false.

[https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html]
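For reference, a minimal sketch of those two properties read through Hadoop's Configuration API (the harness is illustrative only; the property names come from yarn-default.xml and would normally be set in yarn-site.xml):

{code:java}
// Illustrative only: the NM recovery settings referenced above, set and read
// via Hadoop's Configuration API rather than yarn-site.xml.
import org.apache.hadoop.conf.Configuration;

public class NmRecoverySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The suspected combination: recovery enabled, but not supervised,
        // so a stopping NM cleans up its applications instead of expecting
        // to be restarted and to reattach to live containers.
        conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
        conf.setBoolean("yarn.nodemanager.recovery.supervised", false);

        System.out.println("recovery.enabled    = "
                + conf.getBoolean("yarn.nodemanager.recovery.enabled", false));
        System.out.println("recovery.supervised = "
                + conf.getBoolean("yarn.nodemanager.recovery.supervised", false));
    }
}
{code}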

> Deletion Tasks will be picked up to delete running containers
> 
>
> Key: YARN-9192
> URL: https://issues.apache.org/jira/browse/YARN-9192
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 2.9.1
>Reporter: Sihai Ke
>Priority: Major
>
> I suspect there is a bug in the YARN deletion task service; below are my repro steps:
>  # First, set yarn.nodemanager.delete.debug-delay-sec=3600, which means that when an app 
> finishes, its binary/container folders are deleted 3600 seconds later.
>  # While application App1 (a long-running service) is running on machine1, machine1 shuts 
> down. ContainerManagerImpl#serviceStop() is called -> 
> ContainerManagerImpl#cleanUpApplicationsOnNMShutDown, an ApplicationFinishEvent is sent, 
> and deletion tasks are created; they are stored in the DB and will be picked up for 
> execution 3600 seconds later.
>  # 100 seconds later, machine1 comes back, the same app is assigned to run on this 
> machine, and its container is created and works well.
>  # The deletion tasks created in step 2 are then picked up and delete the containers 
> created in step 3.
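
Read as a timeline, the race looks like this (plain Java sketch; the numbers are taken from the repro steps above, not measured values):

{code:java}
// Timeline sketch of the repro: the deletion task persisted at NM shutdown
// fires 3600 s later, long after the restarted NM has reused the same
// container directories.
public class DeletionRaceTimeline {
    public static void main(String[] args) {
        long debugDelaySec = 3600; // yarn.nodemanager.delete.debug-delay-sec (step 1)
        long shutdownSec = 0;      // step 2: NM stops, deletion tasks stored in the DB
        long restartSec = 100;     // step 3: NM back, same app on the same host

        long deletionFiresSec = shutdownSec + debugDelaySec; // step 4

        System.out.printf("dirs reused from t=%ds, stale deletion fires at t=%ds%n",
                restartSec, deletionFiresSec);
        // -> the recovered task deletes directories of a container that is
        //    still running.
    }
}
{code}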






[jira] [Issue Comment Deleted] (YARN-9192) Deletion Tasks will be picked up to delete running containers

2019-04-01 Thread Rayman (JIRA)


[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rayman updated YARN-9192:
-
Comment: was deleted

(was: I'm observing a similar issue when running Samza over YARN.
When bouncing an NM, the NM being killed writes LevelDB state for the deletion service to act on.
The "new" NM reads that state and acts on it, but ends up deleting directories for running containers.
This happens when containers are long-running and are placed on a fixed host.

I also observed this in the log:
*[INFO] [shutdown-hook-0] containermanager.ContainerManagerImpl.cleanUpApplicationsOnNMShutDown(ContainerManagerImpl.java:718) - Waiting for Applications to be Finished*)







[jira] [Comment Edited] (YARN-9192) Deletion Tasks will be picked up to delete running containers

2019-04-01 Thread Rayman (JIRA)


[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807102#comment-16807102 ]

Rayman edited comment on YARN-9192 at 4/1/19 9:33 PM:
--

I'm observing a similar issue when running Samza over YARN.
When bouncing an NM, the NM being killed writes LevelDB state for the deletion service to act on.
The "new" NM reads that state and acts on it, but ends up deleting directories for running containers.
This happens when containers are long-running and are placed on a fixed host.

I also observed this in the log:
*[INFO] [shutdown-hook-0] containermanager.ContainerManagerImpl.cleanUpApplicationsOnNMShutDown(ContainerManagerImpl.java:718) - Waiting for Applications to be Finished*


was (Author: rayman7718):
I'm observing a similar issue when running Samza over YARN.
When bouncing an NM, the NM being killed writes LevelDB state for the deletion service to act on.
The "new" NM reads that state and acts on it, but ends up deleting directories for running containers.
This happens when containers are long-running and are placed on a fixed host.







[jira] [Commented] (YARN-9192) Deletion Tasks will be picked up to delete running containers

2019-04-01 Thread Rayman (JIRA)


[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807102#comment-16807102 ]

Rayman commented on YARN-9192:
--

I'm observing a similar issue when running Samza over YARN.
When bouncing an NM, the NM being killed writes LevelDB state for the deletion service to act on.
The "new" NM reads that state and acts on it, but ends up deleting directories for running containers.
This happens when containers are long-running and are placed on a fixed host.







[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high

2019-02-20 Thread Rayman (JIRA)


[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773513#comment-16773513 ]

Rayman commented on YARN-3554:
--

The RetryUpToMaximumTimeWithFixedSleep policy takes a maxTime and a sleepTime as 
input, and is internally implemented as a RetryUpToMaximumCountWithFixedSleep with 
maxCount = maxTime / sleepTime.

This has a problem: it does not account for the time spent performing the actual 
retry attempt. For example, RetryUpToMaximumTimeWithFixedSleep with maxTime = 30 sec 
and sleepTime = 1 sec will take up to 90 seconds if each attempt (e.g., a connection 
timeout) takes 2 seconds to return: 30 * (2 + 1) = 90.

A policy claiming to be RetryUpToMaximumTimeWithFixedSleep should *actually* 
respect the *maximum time*, e.g., by recording a timestamp/timer.
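
For illustration, a self-contained sketch of that arithmetic (plain Java, not the Hadoop implementation; the 2-second attempt cost is an assumed figure):

{code:java}
// Self-contained sketch of the worst-case arithmetic above; not Hadoop code.
public class RetryBudgetSketch {
    public static void main(String[] args) {
        long maxTimeSec = 30; // requested wall-clock budget
        long sleepSec = 1;    // fixed sleep between attempts
        long attemptSec = 2;  // assumed cost of each failing attempt (e.g., a connect timeout)

        long maxCount = maxTimeSec / sleepSec;                  // 30 attempts
        long worstCaseSec = maxCount * (attemptSec + sleepSec); // 30 * (2 + 1) = 90 s

        System.out.println("maxCount = " + maxCount);
        System.out.println("worst case = " + worstCaseSec + " s vs. budget of " + maxTimeSec + " s");
        // A deadline-based policy would record a start timestamp and stop
        // retrying once the elapsed time exceeds maxTime, regardless of count.
    }
}
{code}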

> Default value for maximum nodemanager connect wait time is too high
> ---
>
> Key: YARN-3554
> URL: https://issues.apache.org/jira/browse/YARN-3554
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Naganarasimha G R
>Priority: Major
>  Labels: BB2015-05-RFC, newbie
> Fix For: 2.8.0, 2.7.1, 2.6.2, 3.0.0-alpha1
>
> Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch
>
>
> The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 
> msec or 15 minutes, which is way too high. The default container expiry time 
> from the RM and the default task timeout in MapReduce are both only 10 
> minutes.


