[jira] [Updated] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4940: --- Attachment: YARN-4940.03.patch > yarn node -list -all failed if RM start with decommissioned node > ---

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241470#comment-15241470 ] sandflee commented on YARN-4924: Thanks [~jlowe], I have caught all exceptions in cleanupDepre

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241476#comment-15241476 ] sandflee commented on YARN-4924: leveldbIterator may also throw DBException, yes? > NM re

[jira] [Updated] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4924: --- Attachment: YARN-4924.05.patch > NM recovery race can lead to container not cleaned up > --

[jira] [Commented] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241508#comment-15241508 ] sandflee commented on YARN-4939: Thanks [~templedf], I have added a test to the patch, and t

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242113#comment-15242113 ] sandflee commented on YARN-4924: thanks [~jlowe] for viewing and suggest, thanks [~nroberts

[jira] [Created] (YARN-4962) support filling up containers on node one by one

2016-04-15 Thread sandflee (JIRA)
sandflee created YARN-4962: -- Summary: support filling up containers on node one by one Key: YARN-4962 URL: https://issues.apache.org/jira/browse/YARN-4962 Project: Hadoop YARN Issue Type: Improveme

[jira] [Commented] (YARN-4962) support filling up containers on node one by one

2016-04-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242589#comment-15242589 ] sandflee commented on YARN-4962: one simple way is to enable continuous scheduling, allocate

[jira] [Updated] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4940: --- Attachment: YARN-4940.04.patch > yarn node -list -all failed if RM start with decommissioned node > ---

[jira] [Updated] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4940: --- Attachment: YARN-4940.05.patch > yarn node -list -all failed if RM start with decommissioned node > ---

[jira] [Commented] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242768#comment-15242768 ] sandflee commented on YARN-4940: Thanks [~templedf] [~kshukla], updated the patch, and I do

[jira] [Commented] (YARN-4962) support filling up containers on node one by one

2016-04-21 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253030#comment-15253030 ] sandflee commented on YARN-4962: Thanks [~templedf], node labels seem unable to solve our

[jira] [Created] (YARN-5022) NM shutdown takes too much time

2016-05-01 Thread sandflee (JIRA)
sandflee created YARN-5022: -- Summary: NM shutdown takes too much time Key: YARN-5022 URL: https://issues.apache.org/jira/browse/YARN-5022 Project: Hadoop YARN Issue Type: Improvement Rep

[jira] [Updated] (YARN-5022) NM shutdown takes too much time

2016-05-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5022: --- Attachment: nm.log > NM shutdown takes too much time > --- > > Key:

[jira] [Updated] (YARN-5022) NM shutdown takes too much time

2016-05-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5022: --- Attachment: YARN-5022.01.patch > NM shutdown takes too much time > --- > >

[jira] [Commented] (YARN-5022) NM shutdown takes too much time

2016-05-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266110#comment-15266110 ] sandflee commented on YARN-5022: this is mainly caused by NonAggregatingLogHandler, bq. 1,

[jira] [Commented] (YARN-5022) NM shutdown takes too much time

2016-05-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266112#comment-15266112 ] sandflee commented on YARN-5022: correct: applications couldn't be removed until log deleter

[jira] [Created] (YARN-5027) NM should clean up app log dirs after NM restart

2016-05-02 Thread sandflee (JIRA)
sandflee created YARN-5027: -- Summary: NM should clean up app log dirs after NM restart Key: YARN-5027 URL: https://issues.apache.org/jira/browse/YARN-5027 Project: Hadoop YARN Issue Type: Bug

[jira] [Commented] (YARN-5022) NM shutdown takes too much time

2016-05-02 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266614#comment-15266614 ] sandflee commented on YARN-5022: noticed that NM wouldn't clean the app log dir if NM starts with

[jira] [Commented] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-02 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266728#comment-15266728 ] sandflee commented on YARN-5023: {code} // launch next AM in nm2 nm2.nodeHeartbeat(

[jira] [Created] (YARN-5031) Add a conf to disable container reservation

2016-05-02 Thread sandflee (JIRA)
sandflee created YARN-5031: -- Summary: Add a conf to disable container reservation Key: YARN-5031 URL: https://issues.apache.org/jira/browse/YARN-5031 Project: Hadoop YARN Issue Type: Improvement

[jira] [Commented] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268848#comment-15268848 ] sandflee commented on YARN-5023: one thing maybe we could improve: even the first nodeHeart

[jira] [Commented] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268840#comment-15268840 ] sandflee commented on YARN-5023: Hi [~sunilg], I think the main problem is we shouldn't send

[jira] [Updated] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5023: --- Attachment: YARN-5023.01.patch > TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure >

[jira] [Commented] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269819#comment-15269819 ] sandflee commented on YARN-5023: updated the patch, simply removing the first nodeHeartbeat bef

[jira] [Created] (YARN-5037) TestRMRestart#testQueueMetricsOnRMRestart random failure

2016-05-03 Thread sandflee (JIRA)
sandflee created YARN-5037: -- Summary: TestRMRestart#testQueueMetricsOnRMRestart random failure Key: YARN-5037 URL: https://issues.apache.org/jira/browse/YARN-5037 Project: Hadoop YARN Issue Type: Tes

[jira] [Commented] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270010#comment-15270010 ] sandflee commented on YARN-5023: test failure not related to the patch, TestRMRestart fai

[jira] [Commented] (YARN-5037) TestRMRestart#testQueueMetricsOnRMRestart random failure

2016-05-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270014#comment-15270014 ] sandflee commented on YARN-5037: tested on the latest trunk and could also see the failure >

[jira] [Commented] (YARN-5031) Add a conf to disable container reservation

2016-05-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270245#comment-15270245 ] sandflee commented on YARN-5031: We had a GPU cluster with just one queue, 10+ machines. m

[jira] [Commented] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271523#comment-15271523 ] sandflee commented on YARN-5023: Hi [~bibinchundatt], I ran it several times and couldn't rep

[jira] [Commented] (YARN-5031) Add a conf to disable container reservation

2016-05-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271559#comment-15271559 ] sandflee commented on YARN-5031: thanks [~sunilg], I'll take a look at YARN-1769 > Add a c

[jira] [Created] (YARN-5043) TestAMRestart.testRMAppAttemptFailuresValidityInterval random fail

2016-05-05 Thread sandflee (JIRA)
sandflee created YARN-5043: -- Summary: TestAMRestart.testRMAppAttemptFailuresValidityInterval random fail Key: YARN-5043 URL: https://issues.apache.org/jira/browse/YARN-5043 Project: Hadoop YARN Iss

[jira] [Updated] (YARN-5043) TestAMRestart.testRMAppAttemptFailuresValidityInterval random fail

2016-05-05 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5043: --- Attachment: TestAMRestart-output.txt > TestAMRestart.testRMAppAttemptFailuresValidityInterval random fail > ---

[jira] [Commented] (YARN-5023) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

2016-05-05 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272139#comment-15272139 ] sandflee commented on YARN-5023: I wrote a script to auto-test and reproduce it, filed YARN-

[jira] [Created] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-13 Thread sandflee (JIRA)
sandflee created YARN-5082: -- Summary: ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit Key: YARN-5082 URL: https://issues.apache.org/jira/browse/YARN-5082 Pr

[jira] [Commented] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282911#comment-15282911 ] sandflee commented on YARN-5082: we enable continuous scheduling, so containerID increased

[jira] [Updated] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5082: --- Attachment: YARN-5082.01.patch > ContainerId rapidly increased in fair scheduler if the num of node app > res

[jira] [Updated] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5082: --- Attachment: YARN-5082.01.patch > ContainerId rapidly increased in fair scheduler if the num of node app > res

[jira] [Updated] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5082: --- Attachment: (was: YARN-5082.01.patch) > ContainerId rapidly increased in fair scheduler if the num of node

[jira] [Updated] (YARN-5027) NM should clean up app log dirs after NM restart

2016-05-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5027: --- Attachment: YARN-5027.01.patch > NM should clean up app log dirs after NM restart > --

[jira] [Updated] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5082: --- Attachment: YARN-5082.02.patch rename FSAppAttempt.priority to FSAppAttempt.appPriority to fix checkstyle warn

[jira] [Commented] (YARN-2098) App priority support in Fair Scheduler

2016-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286918#comment-15286918 ] sandflee commented on YARN-2098: Hi [~ywskycn], is this issue still being worked on? > App priority

[jira] [Commented] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-22 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295825#comment-15295825 ] sandflee commented on YARN-5082: test failures are not related, cc [~asuresh] [~kasha], could

[jira] [Commented] (YARN-5133) Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW

2016-05-24 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297910#comment-15297910 ] sandflee commented on YARN-5133: duplicate of YARN-4741? > Can't handle this event at curren

[jira] [Commented] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-24 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299085#comment-15299085 ] sandflee commented on YARN-5082: Thanks [~asuresh], 1, priority was defined in FSAppAttempt

[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2016-05-24 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299383#comment-15299383 ] sandflee commented on YARN-4599: our cluster had implemented this, the solution is similar

[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2016-05-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299606#comment-15299606 ] sandflee commented on YARN-4599: Thanks [~kasha], bq. Any metrics on how often the NM has t

[jira] [Updated] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5082: --- Attachment: YARN-5082.03.patch > ContainerId rapidly increased in fair scheduler if the num of node app > res

[jira] [Updated] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5082: --- Attachment: YARN-5082.04.patch > ContainerId rapidly increased in fair scheduler if the num of node app > res

[jira] [Created] (YARN-5157) TestZKRMStateStore randomly fail

2016-05-25 Thread sandflee (JIRA)
sandflee created YARN-5157: -- Summary: TestZKRMStateStore randomly fail Key: YARN-5157 URL: https://issues.apache.org/jira/browse/YARN-5157 Project: Hadoop YARN Issue Type: Bug Reporter:

[jira] [Commented] (YARN-5082) ContainerId rapidly increased in fair scheduler if the num of node app reserved reached the limit

2016-05-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301106#comment-15301106 ] sandflee commented on YARN-5082: TestZKRMStateStore could run locally, seems not related to

[jira] [Created] (YARN-4277) containers would be leaked if nm crashed and rm failover

2015-10-19 Thread sandflee (JIRA)
sandflee created YARN-4277: -- Summary: containers would be leaked if nm crashed and rm failover Key: YARN-4277 URL: https://issues.apache.org/jira/browse/YARN-4277 Project: Hadoop YARN Issue Type: B

[jira] [Commented] (YARN-4277) containers would be leaked if nm crashed and rm failover

2015-10-19 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964387#comment-14964387 ] sandflee commented on YARN-4277: yes, this is a problem in our cluster; our NM hangs for a l

[jira] [Commented] (YARN-4277) containers would be leaked if nm crashed and rm failover

2015-10-19 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964388#comment-14964388 ] sandflee commented on YARN-4277: thanks [~jlowe] , you have explained very clearly. > cont

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984580#comment-14984580 ] sandflee commented on YARN-4051: Thanks Jason, sorry, I only just noticed your reply. It's m

[jira] [Commented] (YARN-4277) containers would be leaked if nm crashed and rm failover

2015-11-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984585#comment-14984585 ] sandflee commented on YARN-4277: Is there any plan to store NM info? [~jlowe] [~djp] [~jia

[jira] [Commented] (YARN-4020) Exception happens while stopContainer in AM

2015-11-01 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984590#comment-14984590 ] sandflee commented on YARN-4020: seems the new master key is synced to NM but not to AM. I'll t

[jira] [Updated] (YARN-4050) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

2015-11-09 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4050: --- Assignee: (was: sandflee) > NM event dispatcher may blocked by LogAggregationService if NameNode is slow >

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996345#comment-14996345 ] sandflee commented on YARN-4051: Is it possible for the finish application or complete cont

[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.04.patch NM registers to RM after all containers are recovered by default, and the user could

[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001725#comment-15001725 ] sandflee commented on YARN-4051: thanks [~jlowe] Should the value be infinite by default?

[jira] [Commented] (YARN-4050) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

2015-11-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001734#comment-15001734 ] sandflee commented on YARN-4050: There may be 2 problems: 1, NM dispatcher may be blocked b

[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.05.patch set default timeout to 2min, since default nm expire timeout is 10min > Contai

[jira] [Created] (YARN-4426) unhealthy disk makes NM LOST

2015-12-06 Thread sandflee (JIRA)
sandflee created YARN-4426: -- Summary: unhealthy disk makes NM LOST Key: YARN-4426 URL: https://issues.apache.org/jira/browse/YARN-4426 Project: Hadoop YARN Issue Type: Bug Reporter: sand

[jira] [Resolved] (YARN-4426) unhealthy disk makes NM LOST

2015-12-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-4426. Resolution: Duplicate > unhealthy disk makes NM LOST > > > Key:

[jira] [Commented] (YARN-4426) unhealthy disk makes NM LOST

2015-12-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046203#comment-15046203 ] sandflee commented on YARN-4426: Thanks [~suda], they are caused by a hung mkdir > unhealt

[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046252#comment-15046252 ] sandflee commented on YARN-4301: it may change the behaviour of NM_MIN_HEALTHY_DISKS_FRAC

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052468#comment-15052468 ] sandflee commented on YARN-4138: if AM increases the container size successfully in NM, but resour

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055330#comment-15055330 ] sandflee commented on YARN-4138: Hi, [~mding], consider such situation: 1) AM request inc

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057229#comment-15057229 ] sandflee commented on YARN-4138: got it, thanks for your explanation! > Roll back container re

[jira] [Commented] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527706#comment-14527706 ] sandflee commented on YARN-3518: agree, we should set nm, am, client separately > default

[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-07 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532755#comment-14532755 ] sandflee commented on YARN-3480: one benefit in [~hex108]'s work is we wouldn't worry about

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546971#comment-14546971 ] sandflee commented on YARN-3644: If RM is down, NM's connection will be reset by RM machine

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547155#comment-14547155 ] sandflee commented on YARN-3644: [~raju.bairishetti] thanks for your reply, If RM HA is no

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547159#comment-14547159 ] sandflee commented on YARN-3644: In our cluster we also have to face this problem, I'd like

[jira] [Created] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
sandflee created YARN-3668: -- Summary: Long run service shouldn't be killed even if Yarn crashed Key: YARN-3668 URL: https://issues.apache.org/jira/browse/YARN-3668 Project: Hadoop YARN Issue Type: W

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547165#comment-14547165 ] sandflee commented on YARN-3668: If all RM crashed, all running containers will be killed,

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547168#comment-14547168 ] sandflee commented on YARN-3668: If AM crashes and reaches the AM max fail times, applications

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547434#comment-14547434 ] sandflee commented on YARN-3668: seems not enough, if AM crashed on launch because of AM's b

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547496#comment-14547496 ] sandflee commented on YARN-3668: I don't want the service to be terminated if AM goes down, ya

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-18 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548123#comment-14548123 ] sandflee commented on YARN-3668: thanks [~stevel], we're using our own AM, not Slider, and s

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-18 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Attachment: YARN-3518.002.patch replace RESOURCEMANAGER_CONNECT_MAX_WAIT_MS with RESOURCETRACKER_RESOURCEMANAG

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-18 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549506#comment-14549506 ] sandflee commented on YARN-3668: yes, I agree it's purely a problem of the AM, but it seems a bo

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-26 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Attachment: YARN-3518.003.patch > default rm/am expire interval should not less than default resourcemanager >

[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-26 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560111#comment-14560111 ] sandflee commented on YARN-3644: Thanks [~vinodkv], my concern is long-running contai

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560549#comment-14560549 ] sandflee commented on YARN-3668: when the AM restarts its JARs are re-downloaded from HDFS.

[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed

2015-05-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560550#comment-14560550 ] sandflee commented on YARN-3668: when the AM restarts its JARs are re-downloaded from HDFS.

[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time

2015-05-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-3518: --- Attachment: YARN-3518.004.patch remove checkstyle warning > default rm/am expire interval should not less than

[jira] [Commented] (YARN-5082) Limit ContainerId increase in fair scheduler if the num of node app reserved reached the limit

2016-06-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326131#comment-15326131 ] sandflee commented on YARN-5082: Thanks [~asuresh] for reviewing and committing! > Limit C

[jira] [Commented] (YARN-4936) FileInputStream should be closed explicitly in NMWebService#getLogs

2016-06-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326147#comment-15326147 ] sandflee commented on YARN-4936: fixed by YARN-5199, closing it. > FileInputStream should be

[jira] [Assigned] (YARN-4599) Set OOM control for memory cgroups

2016-06-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-4599: -- Assignee: sandflee (was: Karthik Kambatla) > Set OOM control for memory cgroups > -

[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2016-06-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4599: --- Attachment: YARN-4599.sandflee.patch > Set OOM control for memory cgroups > --

[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2016-06-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328950#comment-15328950 ] sandflee commented on YARN-4599: Hi [~kasha], sorry for the delay, updated an initial patch t

[jira] [Created] (YARN-5254) capacity scheduler could only allocate a container with 1 vcore if using DefaultResourceCalculator

2016-06-14 Thread sandflee (JIRA)
sandflee created YARN-5254: -- Summary: capacity scheduler could only allocate a container with 1 vcore if using DefaultResourceCalculator Key: YARN-5254 URL: https://issues.apache.org/jira/browse/YARN-5254 P

[jira] [Updated] (YARN-5254) capacity scheduler could only allocate a container with 1 vcore if using DefaultResourceCalculator

2016-06-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5254: --- Attachment: YARN-5254.01.patch > capacity scheduler could only allocate a container with 1 vcore if using >

[jira] [Commented] (YARN-5254) capacity scheduler could only allocate a container with 1 vcore if using DefaultResourceCalculator

2016-06-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331125#comment-15331125 ] sandflee commented on YARN-5254: vcore info is dropped by DefaultResourceCalculator while no

[jira] [Commented] (YARN-5254) capacity scheduler could only allocate a container with 1 vcore if using DefaultResourceCalculator

2016-06-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331538#comment-15331538 ] sandflee commented on YARN-5254: Thanks [~vvasudev], this behaviour seems a little confus

[jira] [Commented] (YARN-5254) capacity scheduler could only allocate a container with 1 vcore if using DefaultResourceCalculator

2016-06-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331542#comment-15331542 ] sandflee commented on YARN-5254: correct, bq. 2, use {color:red} DominantResourceCalculator

[jira] [Commented] (YARN-5197) RM leaks containers if running container disappears from node update

2016-06-20 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340742#comment-15340742 ] sandflee commented on YARN-5197: Hi [~jlowe], is it possible that container info disappe

[jira] [Created] (YARN-5276) print more info when event queue is blocked

2016-06-20 Thread sandflee (JIRA)
sandflee created YARN-5276: -- Summary: print more info when event queue is blocked Key: YARN-5276 URL: https://issues.apache.org/jira/browse/YARN-5276 Project: Hadoop YARN Issue Type: Improvement
