[jira] [Commented] (YARN-3161) Containers' information are lost in some cases when RM restart

2015-02-09 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313383#comment-14313383 ] sandflee commented on YARN-3161: if the NM machine crashes while the RM restarts, it seems we'll

[jira] [Updated] (YARN-3327) if NMClientAsync stopContainer failed because of IOException, there's no chance to stopContainer again

2015-03-10 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3327: --- Summary: if NMClientAsync stopContainer failed because of IOException, there's no chance to stopContainer

[jira] [Created] (YARN-3327) if NMClientAsync stopContainer failed because of IOException, there's no change to stopContainer again

2015-03-10 Thread sandflee (JIRA)
sandflee created YARN-3327: -- Summary: if NMClientAsync stopContainer failed because of IOException, there's no change to stopContainer again Key: YARN-3327 URL: https://issues.apache.org/jira/browse/YARN-3327
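The failure mode reported above is a stopContainer lost to an IOException with no further retry. A minimal, self-contained sketch of the bounded-retry workaround an AM could apply; the failing RPC is simulated here, and wiring this into NMClientAsync.CallbackHandler#onStopContainerError is an assumption about the AM's own code, not something the thread specifies:

```java
import java.io.IOException;

public class StopContainerRetry {
    // Simulated stopContainer RPC: fails with IOException on the first two calls.
    static int calls = 0;
    static void stopContainer() throws IOException {
        calls++;
        if (calls < 3) {
            throw new IOException("connection reset");
        }
    }

    // Bounded retry loop. In a real AM this logic would be driven from the
    // NMClientAsync error callback rather than a blocking loop.
    static boolean stopWithRetry(int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                stopContainer();
                return true; // container stop acknowledged
            } catch (IOException e) {
                System.out.println("stopContainer attempt " + attempt
                        + " failed: " + e.getMessage());
            }
        }
        return false; // give up; container may leak until NM/RM expiry kicks in
    }

    public static void main(String[] args) {
        System.out.println("stopped=" + stopWithRetry(5));
    }
}
```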

[jira] [Commented] (YARN-3329) There's no way to rebuild containers Managed by NMClientAsync If AM restart

2015-03-10 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355064#comment-14355064 ] sandflee commented on YARN-3329: the same as YARN-3328, close it There's no way to

[jira] [Updated] (YARN-3328) There's no way to rebuild containers Managed by NMClientAsync If AM restart

2015-03-10 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3328: --- Description: If work preserving is enabled and the AM restarts, the AM couldn't stop containers launched by the previous AM,

[jira] [Commented] (YARN-3328) There's no way to rebuild containers Managed by NMClientAsync If AM restart

2015-03-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356384#comment-14356384 ] sandflee commented on YARN-3328: Is it necessary to keep container info in

[jira] [Created] (YARN-3328) There's no way to rebuild containers Managed by NMClientAsync If AM restart

2015-03-10 Thread sandflee (JIRA)
sandflee created YARN-3328: -- Summary: There's no way to rebuild containers Managed by NMClientAsync If AM restart Key: YARN-3328 URL: https://issues.apache.org/jira/browse/YARN-3328 Project: Hadoop YARN

[jira] [Created] (YARN-3329) There's no way to rebuild containers Managed by NMClientAsync If AM restart

2015-03-10 Thread sandflee (JIRA)
sandflee created YARN-3329: -- Summary: There's no way to rebuild containers Managed by NMClientAsync If AM restart Key: YARN-3329 URL: https://issues.apache.org/jira/browse/YARN-3329 Project: Hadoop YARN

[jira] [Resolved] (YARN-3329) There's no way to rebuild containers Managed by NMClientAsync If AM restart

2015-03-10 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-3329. Resolution: Done Release Note: the same as YARN-3328, sorry for creating it twice There's no way to

[jira] [Created] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-03-23 Thread sandflee (JIRA)
sandflee created YARN-3387: -- Summary: container complete message couldn't pass to am if am restarted and rm changed Key: YARN-3387 URL: https://issues.apache.org/jira/browse/YARN-3387 Project: Hadoop YARN

[jira] [Commented] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-03-23 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377019#comment-14377019 ] sandflee commented on YARN-3387: yes container complete message couldn't pass to am if am

[jira] [Updated] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-04-20 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3387: --- Attachment: YARN-3387.002.patch unit test added container complete message couldn't pass to am if am restarted and

[jira] [Created] (YARN-3519) registerApplicationMaster couldn't get all running containers if rm is rebuilding container info while am is relaunched

2015-04-21 Thread sandflee (JIRA)
sandflee created YARN-3519: -- Summary: registerApplicationMaster couldn't get all running containers if rm is rebuilding container info while am is relaunched Key: YARN-3519 URL:

[jira] [Created] (YARN-3518) default rm/am expire interval should less than default resourcemanager connect wait time

2015-04-21 Thread sandflee (JIRA)
sandflee created YARN-3518: -- Summary: default rm/am expire interval should less than default resourcemanager connect wait time Key: YARN-3518 URL: https://issues.apache.org/jira/browse/YARN-3518 Project:
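The mismatch YARN-3518 describes can be seen directly in the configuration: clients keep retrying the RM for longer than the liveness monitors wait, so an AM or NM can be expired while still retrying. A hedged yarn-site.xml sketch that restores the intended ordering by lowering the connect wait below the expiry intervals; the 15-minute and 10-minute defaults match the values quoted later in this thread, and the 5-minute choice is illustrative, not a recommendation:

```xml
<configuration>
  <!-- Default is 900000 ms (15 min): how long clients keep retrying the RM. -->
  <property>
    <name>yarn.resourcemanager.connect.max-wait.ms</name>
    <value>300000</value> <!-- 5 min: below both expiry intervals -->
  </property>
  <!-- Defaults are 600000 ms (10 min): the RM expires a silent AM/NM after this. -->
  <property>
    <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
</configuration>
```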

[jira] [Created] (YARN-3546) AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it

2015-04-25 Thread sandflee (JIRA)
sandflee created YARN-3546: -- Summary: AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it Key: YARN-3546 URL: https://issues.apache.org/jira/browse/YARN-3546

[jira] [Commented] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled

2015-04-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512769#comment-14512769 ] sandflee commented on YARN-3533: getApplicationAttempt seems confusing, I just opened

[jira] [Commented] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-04-22 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507255#comment-14507255 ] sandflee commented on YARN-3387: There seems to be a bug in LaunchAM in MockRM.java, in LaunchAM: 1,

[jira] [Commented] (YARN-3519) registerApplicationMaster couldn't get all running containers if rm is rebuilding container info while am is relaunched

2015-04-21 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505894#comment-14505894 ] sandflee commented on YARN-3519: yes, the same issue registerApplicationMaster couldn't

[jira] [Resolved] (YARN-3519) registerApplicationMaster couldn't get all running containers if rm is rebuilding container info while am is relaunched

2015-04-21 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-3519. Resolution: Duplicate registerApplicationMaster couldn't get all running containers if rm is rebuilding

[jira] [Commented] (YARN-3519) registerApplicationMaster couldn't get all running containers if rm is rebuilding container info while am is relaunched

2015-04-21 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506269#comment-14506269 ] sandflee commented on YARN-3519: not easy to fix, I'll think more

[jira] [Commented] (YARN-2038) Revisit how AMs learn of containers from previous attempts

2015-04-22 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507266#comment-14507266 ] sandflee commented on YARN-2038: If the NM registers to the RM in a short time, we can add a

[jira] [Commented] (YARN-3387) Previous AM's container complete message couldn't pass to current am if am restarted and rm changed

2015-04-24 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512065#comment-14512065 ] sandflee commented on YARN-3387: Thanks He Jian and Anubhav Previous AM's container

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-04-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Attachment: YARN-3518.001.patch I don't know why DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS is 15min, just

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-04-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Summary: default rm/am expire interval should not less than default resourcemanager connect wait time (was:

[jira] [Commented] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled

2015-04-22 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508476#comment-14508476 ] sandflee commented on YARN-3533: thanks for your patch. 1, waitForSchedulerAppAttemptAdded

[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high

2015-05-02 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525143#comment-14525143 ] sandflee commented on YARN-3554: set this to a bigger value maybe based on network

[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high

2015-05-02 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525264#comment-14525264 ] sandflee commented on YARN-3554: Hi [~Naganarasimha], 3 mins seems dangerous. If the RM fails
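For context on the values being debated in YARN-3554, the knobs involved are the NM-connect retry settings in yarn-site.xml. A sketch; the property names are the standard client-side NM connection settings, and the concrete values (including the 3 minutes discussed above) are illustrative, not a recommendation:

```xml
<configuration>
  <!-- Total time an AM/client keeps retrying a connection to an NM. -->
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>180000</value> <!-- 3 min, the value discussed above -->
  </property>
  <!-- Interval between connection attempts. -->
  <property>
    <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
    <value>10000</value>
  </property>
</configuration>
```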

[jira] [Commented] (YARN-3546) AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it

2015-04-30 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522456#comment-14522456 ] sandflee commented on YARN-3546: ok, close it now, thanks [~jianhe]

[jira] [Resolved] (YARN-3546) AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it

2015-04-30 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-3546. Resolution: Not A Problem AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're

[jira] [Commented] (YARN-3546) AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it

2015-04-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520706#comment-14520706 ] sandflee commented on YARN-3546: [~jianhe], thanks for your explanation, I still have one

[jira] [Commented] (YARN-3546) AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it

2015-04-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520819#comment-14520819 ] sandflee commented on YARN-3546: sorry, my explanation was unclear. Let's consider the situation below,

[jira] [Commented] (YARN-3546) AbstractYarnScheduler.getApplicationAttempt seems misleading, and there're some misuse of it

2015-04-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520838#comment-14520838 ] sandflee commented on YARN-3546: The implementation of

[jira] [Commented] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527706#comment-14527706 ] sandflee commented on YARN-3518: agreed, we should set separate defaults for nm, am, and client

[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532755#comment-14532755 ] sandflee commented on YARN-3480: one benefit of [~hex108]'s work is that we wouldn't worry about

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-18 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548123#comment-14548123 ] sandflee commented on YARN-3668: thanks [~stevel], we're using our own AM, not Slider, and

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547434#comment-14547434 ] sandflee commented on YARN-3668: seems not enough. If the AM crashed on launch because of the AM's

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547496#comment-14547496 ] sandflee commented on YARN-3668: I don't want the service to terminate if the AM goes down,

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-18 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Attachment: YARN-3518.002.patch replace RESOURCEMANAGER_CONNECT_MAX_WAIT_MS with

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-18 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549506#comment-14549506 ] sandflee commented on YARN-3668: yes, I agree it's purely a problem of the AM, but it seems a

[jira] [Commented] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-04-12 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491469#comment-14491469 ] sandflee commented on YARN-3387: Jian He, thanks for the review. Yes, they're the same, right

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546971#comment-14546971 ] sandflee commented on YARN-3644: If RM is down, NM's connection will be reset by RM

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547155#comment-14547155 ] sandflee commented on YARN-3644: [~raju.bairishetti] thanks for your reply, If RM HA is

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547159#comment-14547159 ] sandflee commented on YARN-3644: In our cluster we also have to face this problem, I'd like

[jira] [Created] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
sandflee created YARN-3668: -- Summary: Long run service shouldn't be killed even if Yarn crashed Key: YARN-3668 URL: https://issues.apache.org/jira/browse/YARN-3668 Project: Hadoop YARN Issue Type:

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547165#comment-14547165 ] sandflee commented on YARN-3668: If all RM crashed, all running containers will be killed,
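The scenario above, where an RM outage kills all running containers, is what YARN's work-preserving recovery settings are meant to mitigate: containers survive RM and NM restarts, though they do not cover the permanent-crash case this thread is about. A hedged yarn-site.xml sketch of those settings; the recovery directory path is illustrative:

```xml
<configuration>
  <!-- RM-side recovery: restore application state after an RM restart/failover. -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- NM-side recovery: containers survive an NM restart. -->
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value>/var/yarn/nm-recovery</value> <!-- illustrative path -->
  </property>
  <!-- NM recovery requires a fixed port rather than an ephemeral one. -->
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
</configuration>
```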

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547168#comment-14547168 ] sandflee commented on YARN-3668: If the AM crashes and reaches the AM max failure count, applications

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Attachment: YARN-3518.004.patch remove checkstyle warning default rm/am expire interval should not less than

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560549#comment-14560549 ] sandflee commented on YARN-3668: when the AM restarts its JARs are re-downloaded from HDFS.

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-26 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Attachment: YARN-3518.003.patch default rm/am expire interval should not less than default resourcemanager

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-26 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560111#comment-14560111 ] sandflee commented on YARN-3644: Thanks [~vinodkv], my concern is long-running

[jira] [Commented] (YARN-2038) Revisit how AMs learn of containers from previous attempts

2015-08-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692709#comment-14692709 ] sandflee commented on YARN-2038: I thought it was the same issue as YARN-3519, but it seems

[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-08-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.01.patch ContainerKillEvent is lost when container is In New State and is recovering

[jira] [Created] (YARN-4050) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

2015-08-13 Thread sandflee (JIRA)
sandflee created YARN-4050: -- Summary: NM event dispatcher may blocked by LogAggregationService if NameNode is slow Key: YARN-4050 URL: https://issues.apache.org/jira/browse/YARN-4050 Project: Hadoop YARN

[jira] [Created] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-08-13 Thread sandflee (JIRA)
sandflee created YARN-4051: -- Summary: ContainerKillEvent is lost when container is In New State and is recovering Key: YARN-4051 URL: https://issues.apache.org/jira/browse/YARN-4051 Project: Hadoop YARN

[jira] [Resolved] (YARN-4040) container complete msg should passed to AM,even if the container is released.

2015-08-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-4040. Resolution: Not A Problem If the AM releases a container, the complete msg (released by AM) is stored by

[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-08-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.03.patch pend the kill event while the container is recovering, and just act like

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-08-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699745#comment-14699745 ] sandflee commented on YARN-4051: if recovered as REQUESTED, try to cleanup container

[jira] [Commented] (YARN-3987) am container complete msg ack to NM once RM receive it

2015-07-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645136#comment-14645136 ] sandflee commented on YARN-3987: yes, we set getKeepContainersAcrossApplicationAttempts

[jira] [Commented] (YARN-3987) am container complete msg ack to NM once RM receive it

2015-07-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645341#comment-14645341 ] sandflee commented on YARN-3987: The AM crashes before it registers to the RM am container

[jira] [Updated] (YARN-3987) am container complete msg ack to NM once RM receive it

2015-07-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3987: --- Attachment: YARN-3987.002.patch am container complete msg ack to NM once RM receive it

[jira] [Commented] (YARN-4005) Completed container whose app is finished is not removed from NMStateStore

2015-08-02 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650644#comment-14650644 ] sandflee commented on YARN-4005: seems there's no need to add to recentlyStoppedContainers,

[jira] [Created] (YARN-3987) am container complete msg ack to NM once RM receive it

2015-07-28 Thread sandflee (JIRA)
sandflee created YARN-3987: -- Summary: am container complete msg ack to NM once RM receive it Key: YARN-3987 URL: https://issues.apache.org/jira/browse/YARN-3987 Project: Hadoop YARN Issue Type: Bug

[jira] [Updated] (YARN-3987) am container complete msg ack to NM once RM receive it

2015-07-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3987: --- Attachment: YARN-3987.001.patch am container complete msg ack to NM once RM receive it

[jira] [Assigned] (YARN-4050) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

2015-08-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-4050: -- Assignee: sandflee NM event dispatcher may blocked by LogAggregationService if NameNode is slow

[jira] [Commented] (YARN-3987) am container complete msg ack to NM once RM receive it

2015-08-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696433#comment-14696433 ] sandflee commented on YARN-3987: Thanks [~jianhe]! am container complete msg ack to NM

[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-08-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.02.patch fix checkstyle errors ContainerKillEvent is lost when container is In New

[jira] [Created] (YARN-4040) container complete msg should passed to AM,even if the container is released.

2015-08-10 Thread sandflee (JIRA)
sandflee created YARN-4040: -- Summary: container complete msg should passed to AM,even if the container is released. Key: YARN-4040 URL: https://issues.apache.org/jira/browse/YARN-4040 Project: Hadoop YARN

[jira] [Created] (YARN-4020) Exception happens while stopContainer in AM

2015-08-05 Thread sandflee (JIRA)
sandflee created YARN-4020: -- Summary: Exception happens while stopContainer in AM Key: YARN-4020 URL: https://issues.apache.org/jira/browse/YARN-4020 Project: Hadoop YARN Issue Type: Bug

[jira] [Commented] (YARN-3327) if NMClientAsync stopContainer failed because of IOException, there's no chance to stopContainer again

2015-07-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626528#comment-14626528 ] sandflee commented on YARN-3327: There are no logs any more; it's been a long time and I just fix

[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.04.patch The NM registers to the RM after all containers are recovered by default, and the user could

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001725#comment-15001725 ] sandflee commented on YARN-4051: thanks [~jlowe] Should the value be infinite by default? The concern is

[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.05.patch set default timeout to 2min, since default nm expire timeout is 10min

[jira] [Commented] (YARN-4050) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

2015-11-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001734#comment-15001734 ] sandflee commented on YARN-4050: There may be 2 problems: 1, the NM dispatcher may be blocked by log aggregation

[jira] [Updated] (YARN-4050) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

2015-11-09 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4050: --- Assignee: (was: sandflee) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996345#comment-14996345 ] sandflee commented on YARN-4051: Is it possible for the finish application or complete container requests

[jira] [Commented] (YARN-4020) Exception happens while stopContainer in AM

2015-11-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984590#comment-14984590 ] sandflee commented on YARN-4020: seems the new master key is synced to the NM but not to the AM. I'll try to fix it.

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984580#comment-14984580 ] sandflee commented on YARN-4051: Thanks Jason, sorry, I just noticed your reply. It's more reasonable

[jira] [Created] (YARN-4277) containers would be leaked if nm crashed and rm failover

2015-10-19 Thread sandflee (JIRA)
sandflee created YARN-4277: -- Summary: containers would be leaked if nm crashed and rm failover Key: YARN-4277 URL: https://issues.apache.org/jira/browse/YARN-4277 Project: Hadoop YARN Issue Type:

[jira] [Commented] (YARN-4277) containers would be leaked if nm crashed and rm failover

2015-10-19 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964388#comment-14964388 ] sandflee commented on YARN-4277: thanks [~jlowe], you have explained it very clearly. containers would be

[jira] [Commented] (YARN-4277) containers would be leaked if nm crashed and rm failover

2015-10-19 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964387#comment-14964387 ] sandflee commented on YARN-4277: yes, this is a problem in our cluster; our NM hangs for a long time because

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-08-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710716#comment-14710716 ] sandflee commented on YARN-4051: could anyone help to review it? ContainerKillEvent is

[jira] [Created] (YARN-4426) unhealthy disk makes NM LOST

2015-12-06 Thread sandflee (JIRA)
sandflee created YARN-4426: -- Summary: unhealthy disk makes NM LOST Key: YARN-4426 URL: https://issues.apache.org/jira/browse/YARN-4426 Project: Hadoop YARN Issue Type: Bug Reporter:

[jira] [Resolved] (YARN-4426) unhealthy disk makes NM LOST

2015-12-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-4426. Resolution: Duplicate unhealthy disk makes NM LOST Key:

[jira] [Commented] (YARN-4426) unhealthy disk makes NM LOST

2015-12-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046203#comment-15046203 ] sandflee commented on YARN-4426: Thanks [~suda], they are caused by a hung mkdir > unhealthy disk makes NM

[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046252#comment-15046252 ] sandflee commented on YARN-4301: it may change the behaviour of NM_MIN_HEALTHY_DISKS_FRACTION, could we

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061577#comment-15061577 ] sandflee commented on YARN-1197: seems complicated for the AM to do this, especially since we added disk, network to

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061347#comment-15061347 ] sandflee commented on YARN-1197: user applications (long running) are running on our yarn platform, they

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057229#comment-15057229 ] sandflee commented on YARN-4138: got it, thanks for your explanation! > Roll back container resource

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055330#comment-15055330 ] sandflee commented on YARN-4138: Hi, [~mding], consider this situation: 1) AM sends an increase request to

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063576#comment-15063576 ] sandflee commented on YARN-4138: {quote} We should not update lastConfirmedResource in this scenario. This

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052468#comment-15052468 ] sandflee commented on YARN-4138: if AM increases container size successfully in NM, but resource increase

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059332#comment-15059332 ] sandflee commented on YARN-1197: seems it doesn't support increasing memory and decreasing cpu cores at the same time? >

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059362#comment-15059362 ] sandflee commented on YARN-1197: got it, thanks [~leftnoteasy]! > Support changing resources of an

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059777#comment-15059777 ] sandflee commented on YARN-4138: 1, use Resources.fitsIn(targetResource, lastConfirmedResource)?

[jira] [Updated] (YARN-4520) FinishAppEvent is leaked in leveldb if no app's container running on this node

2016-01-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4520: --- Description: once we restart the nodemanager we see many logs like: 2015-12-28 11:59:18,725 WARN

[jira] [Updated] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4528: --- Attachment: YARN-4528.01.patch 1, pending container decrease msg until next heartbeat. 2, nodemanager#allocate
