[jira] [Updated] (YARN-8547) rm may crash if nm register with too many applications

2018-07-18 Thread sandflee (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-8547:
---
Attachment: YARN-8547.01.patch

> rm may crash if nm register with too many applications
> --
>
> Key: YARN-8547
> URL: https://issues.apache.org/jira/browse/YARN-8547
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
>Priority: Major
> Attachments: YARN-8547.01.patch
>
>
> 1. Our cluster has several thousand nodes and log aggregation is disabled, so one 
> single NM may keep 10,000+ apps.
> 2. When the RM fails over, a single NM registers with 10,000+ apps, causing the 
> active RM to GC constantly and lose its connection to ZK.






[jira] [Updated] (YARN-8547) rm may crash if nm register with too many applications

2018-07-18 Thread sandflee (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-8547:
---
Description: 
1. Our cluster has several thousand nodes and log aggregation is disabled, so one 
single NM may keep 10,000+ apps.

2. When the RM fails over, a single NM registers with 10,000+ apps, causing the 
active RM to GC constantly and lose its connection to ZK.

  was:
1. Our cluster has several thousand nodes and we disable log aggregation, so a 
single NM may keep 10,000+ apps.

2. When the RM fails over, the NM registers with 10,000+ apps, causing the active 
RM to GC constantly and lose its connection to ZK.


> rm may crash if nm register with too many applications
> --
>
> Key: YARN-8547
> URL: https://issues.apache.org/jira/browse/YARN-8547
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
>Priority: Major
>
> 1. Our cluster has several thousand nodes and log aggregation is disabled, so one 
> single NM may keep 10,000+ apps.
> 2. When the RM fails over, a single NM registers with 10,000+ apps, causing the 
> active RM to GC constantly and lose its connection to ZK.






[jira] [Created] (YARN-8547) rm may crash if nm register with too many applications

2018-07-18 Thread sandflee (JIRA)
sandflee created YARN-8547:
--

 Summary: rm may crash if nm register with too many applications
 Key: YARN-8547
 URL: https://issues.apache.org/jira/browse/YARN-8547
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: sandflee
Assignee: sandflee


1. Our cluster has several thousand nodes and we disable log aggregation, so a 
single NM may keep 10,000+ apps.

2. When the RM fails over, the NM registers with 10,000+ apps, causing the active 
RM to GC constantly and lose its connection to ZK.






[jira] [Comment Edited] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher

2017-11-27 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267978#comment-16267978
 ] 

sandflee edited comment on YARN-7229 at 11/28/17 2:30 AM:
--

Yes, planned to add this to our cluster; assigning to myself.


was (Author: sandflee):
Yes, planned to add this to our cluster; assigning this to myself.

> Add a metric for the size of event queue in AsyncDispatcher
> ---
>
> Key: YARN-7229
> URL: https://issues.apache.org/jira/browse/YARN-7229
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Assignee: sandflee
>
> The size of the event queue in AsyncDispatcher is a good indicator of daemon 
> performance. Let's make it an RM metric.
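A minimal sketch of the kind of gauge this asks for, not the committed patch: the class and method names below are illustrative, and wiring the value into the RM metrics system is left out.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a dispatcher-like holder that exposes its pending-event
// count so a metrics gauge can poll it on each collection interval.
public class DispatcherQueueGauge {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();

  /** The value an RM gauge would report. */
  public int getEventQueueSize() {
    return eventQueue.size();   // O(1) for LinkedBlockingQueue
  }
}
{code}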






[jira] [Commented] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher

2017-11-27 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267978#comment-16267978
 ] 

sandflee commented on YARN-7229:


Yes, planned to add this to our cluster; assigning this to myself.

> Add a metric for the size of event queue in AsyncDispatcher
> ---
>
> Key: YARN-7229
> URL: https://issues.apache.org/jira/browse/YARN-7229
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Assignee: sandflee
>
> The size of the event queue in AsyncDispatcher is a good indicator of daemon 
> performance. Let's make it an RM metric.






[jira] [Assigned] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher

2017-11-25 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee reassigned YARN-7229:
--

Assignee: sandflee

> Add a metric for the size of event queue in AsyncDispatcher
> ---
>
> Key: YARN-7229
> URL: https://issues.apache.org/jira/browse/YARN-7229
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Assignee: sandflee
>
> The size of the event queue in AsyncDispatcher is a good indicator of daemon 
> performance. Let's make it an RM metric.






[jira] [Commented] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher

2017-11-23 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264890#comment-16264890
 ] 

sandflee commented on YARN-7229:


YARN-5276 seems to do similar things, [~asuresh].

> Add a metric for the size of event queue in AsyncDispatcher
> ---
>
> Key: YARN-7229
> URL: https://issues.apache.org/jira/browse/YARN-7229
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>
> The size of the event queue in AsyncDispatcher is a good indicator of daemon 
> performance. Let's make it an RM metric.






[jira] [Resolved] (YARN-7498) NM failed to start if the namespace of remote log dirs differs from fs.defaultFS

2017-11-14 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee resolved YARN-7498.

Resolution: Duplicate

> NM failed to start if the namespace of remote log dirs differs from 
> fs.defaultFS
> 
>
> Key: YARN-7498
> URL: https://issues.apache.org/jira/browse/YARN-7498
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
>
> fs.defaultFS is hdfs://nameservice1 and yarn.nodemanager.remote-app-log-dir 
> is hdfs://nameservice2; when the NM starts, we see these errors:
> {quote}
> java.lang.IllegalArgumentException: Wrong FS: hdfs://nameservice2/yarn-logs, 
> expected: hdfs://nameservice1
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1128)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:192)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}






[jira] [Created] (YARN-7498) NM failed to start if the namespace of remote log dirs differs from fs.defaultFS

2017-11-14 Thread sandflee (JIRA)
sandflee created YARN-7498:
--

 Summary: NM failed to start if the namespace of remote log dirs 
differs from fs.defaultFS
 Key: YARN-7498
 URL: https://issues.apache.org/jira/browse/YARN-7498
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: sandflee
Assignee: sandflee


fs.defaultFS is hdfs://nameservice1 and yarn.nodemanager.remote-app-log-dir is 
hdfs://nameservice2; when the NM starts, we see these errors:
{quote}
java.lang.IllegalArgumentException: Wrong FS: hdfs://nameservice2/yarn-logs, 
expected: hdfs://nameservice1
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1128)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:192)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
  at java.lang.Thread.run(Thread.java:745)
{quote}
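The stack trace shows the mismatch: the FileSystem was obtained for fs.defaultFS (nameservice1) and then handed a nameservice2 path, so checkPath() rejects it. Below is a minimal sketch of that distinction using the paths from this report; it is not the actual LogAggregationService code.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteLogDirCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path remoteRootLogDir = new Path("hdfs://nameservice2/yarn-logs");

    // Problematic pattern: FileSystem.get(conf) is bound to fs.defaultFS
    // (hdfs://nameservice1), so passing it a nameservice2 path fails with
    // "Wrong FS" in checkPath().
    FileSystem defaultFs = FileSystem.get(conf);
    // defaultFs.getFileStatus(remoteRootLogDir);  // IllegalArgumentException: Wrong FS

    // Resolving the filesystem from the path itself avoids the mismatch.
    FileSystem remoteFs = remoteRootLogDir.getFileSystem(conf);
    System.out.println(remoteFs.getUri());         // hdfs://nameservice2
  }
}
{code}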






[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2017-09-30 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187004#comment-16187004
 ] 

sandflee commented on YARN-4599:


Hi [~miklos.szeg...@cloudera.com], I've been busy with other work recently; feel 
free to assign this to yourself, and I'll join you when I'm less busy.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: yarn-4599-not-so-useful.patch, YARN-4599.sandflee.patch
>
>
> YARN-1856 adds memory cgroup enforcement support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.
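For context, "set OOM control" here refers to the cgroup v1 memory.oom_control knob. A hedged sketch of the write involved follows; the cgroup path and container directory are made up for illustration, and this is not the NodeManager code.
{code}
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DisableCgroupOomKiller {
  public static void main(String[] args) throws Exception {
    // Per-container memory cgroup created by the NM; this path is an assumption.
    Path oomControl = Paths.get(
        "/sys/fs/cgroup/memory/hadoop-yarn/container_01/memory.oom_control");

    // Writing "1" tells the kernel not to OOM-kill tasks in this cgroup when
    // they exceed memory.limit_in_bytes; they pause until memory is freed or
    // an external monitor decides which container to kill.
    Files.write(oomControl, "1".getBytes(StandardCharsets.UTF_8));
  }
}
{code}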






[jira] [Commented] (YARN-7035) Add health checker to ResourceManager

2017-08-20 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134636#comment-16134636
 ] 

sandflee commented on YARN-7035:


Thanks [~yufeigu]. YARN-6061 is very useful for handling critical thread exits; 
for deadlocks, we use ThreadMXBean to detect them, as sketched below.
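A self-contained sketch of that ThreadMXBean check; the wrapper class and logging are illustrative, not RM code.
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
  /** Returns true if the JVM currently has deadlocked threads. */
  public static boolean hasDeadlock() {
    ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
    long[] ids = mxBean.findDeadlockedThreads();   // null when no deadlock exists
    if (ids == null) {
      return false;
    }
    for (ThreadInfo info : mxBean.getThreadInfo(ids)) {
      System.err.println("Deadlocked thread: " + info.getThreadName());
    }
    return true;
  }
}
{code}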

> Add health checker to ResourceManager
> -
>
> Key: YARN-7035
> URL: https://issues.apache.org/jira/browse/YARN-7035
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>
> The RM may become unhealthy but stay alive, for example when the scheduling 
> thread exits or a deadlock happens. It seems useful to add a health checker 
> service: if the check fails, let the RM exit.






[jira] [Created] (YARN-7035) Add health checker to ResourceManager

2017-08-17 Thread sandflee (JIRA)
sandflee created YARN-7035:
--

 Summary: Add health checker to ResourceManager
 Key: YARN-7035
 URL: https://issues.apache.org/jira/browse/YARN-7035
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: sandflee


The RM may become unhealthy but stay alive, for example when the scheduling 
thread exits or a deadlock happens. It seems useful to add a health checker 
service: if the check fails, let the RM exit.






[jira] [Commented] (YARN-5349) TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart fail intermittently

2017-07-31 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108121#comment-16108121
 ] 

sandflee commented on YARN-5349:


Not working on this; setting it to unassigned.

> TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart  fail 
> intermittently
> -
>
> Key: YARN-5349
> URL: https://issues.apache.org/jira/browse/YARN-5349
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: sandflee
>Priority: Minor
>
> {noformat}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testUAMRecoveryOnRMWorkPreservingRestart(TestWorkPreservingRMRestart.java:1463)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/12250/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestWorkPreservingRMRestart/testUAMRecoveryOnRMWorkPreservingRestart/






[jira] [Assigned] (YARN-5349) TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart fail intermittently

2017-07-31 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee reassigned YARN-5349:
--

Assignee: (was: sandflee)

> TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart  fail 
> intermittently
> -
>
> Key: YARN-5349
> URL: https://issues.apache.org/jira/browse/YARN-5349
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: sandflee
>Priority: Minor
>
> {noformat}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testUAMRecoveryOnRMWorkPreservingRestart(TestWorkPreservingRMRestart.java:1463)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/12250/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestWorkPreservingRMRestart/testUAMRecoveryOnRMWorkPreservingRestart/






[jira] [Created] (YARN-6854) many job failed if NM couldn't detect disk error

2017-07-21 Thread sandflee (JIRA)
sandflee created YARN-6854:
--

 Summary: many job failed if NM couldn't detect disk error
 Key: YARN-6854
 URL: https://issues.apache.org/jira/browse/YARN-6854
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: sandflee
Priority: Critical


checkDiskHealthy is enabled, but it couldn't detect this error, so containers 
failed and new containers assigned to this node then failed again. The disk error 
seems to be a filesystem error: all I/O operations (such as ls) fail on 
$localdir/usercache/userFoo, with no effect on other directories.
Any suggestions?







[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928224#comment-15928224
 ] 

sandflee commented on YARN-4051:


Thanks [~jlowe] for your review and commit!

> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3
>
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch, 
> YARN-4051.08.patch-branch-2
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927733#comment-15927733
 ] 

sandflee commented on YARN-4051:


Uploaded a patch for branch-2.

> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch, 
> YARN-4051.08.patch-branch-2
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Updated] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-16 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4051:
---
Attachment: YARN-4051.08.patch-branch-2

> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch, 
> YARN-4051.08.patch-branch-2
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-15 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925696#comment-15925696
 ] 

sandflee commented on YARN-4051:


Patch updated; it also fixes the test failure.

> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Created] (YARN-6340) TestEventFlow not work as expected

2017-03-15 Thread sandflee (JIRA)
sandflee created YARN-6340:
--

 Summary: TestEventFlow not work as expected
 Key: YARN-6340
 URL: https://issues.apache.org/jira/browse/YARN-6340
 Project: Hadoop YARN
  Issue Type: Test
Reporter: sandflee


We see many exceptions in the test logs and the app/container never reaches the 
running state; surprisingly, the test still passes.






[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2017-03-15 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925683#comment-15925683
 ] 

sandflee commented on YARN-4599:


Hi [~miklos.szeg...@cloudera.com], I think the first question is whether the 
community accepts this proposal; if yes, we can push forward, if not it will be a 
burden to keep the patch in sync with trunk. Thoughts?
Using the Linux Container Executor seems to simplify the code but requires an 
additional process; I'm OK with either.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: yarn-4599-not-so-useful.patch, YARN-4599.sandflee.patch
>
>
> YARN-1856 adds memory cgroup enforcement support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Updated] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes

2017-03-15 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4051:
---
Attachment: YARN-4051.08.patch

> ContainerKillEvent lost when container is still recovering and application 
> finishes
> ---
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2017-03-13 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907544#comment-15907544
 ] 

sandflee commented on YARN-4051:


Thanks [~jlowe],
bq. I'm also wondering about the scenario where the kill event is coming in 
from an AM and not the RM.
We simply throw a YarnException when the AM stops a recovering container, but it 
seems NMClientAsyncImpl can't retry stopContainer afterwards; could we fix that in 
a new issue?
{code}
.addTransition(ContainerState.RUNNING,
    EnumSet.of(ContainerState.DONE, ContainerState.FAILED),
    ContainerEventType.STOP_CONTAINER,
    new StopContainerTransition())
{code}
The patch makes two other changes:
1. Use app.handle(new ApplicationContainerInitEvent(container)) when recovering 
containers, because there is a race condition: when the Finish event comes, the 
ApplicationContainerInitEvent may not yet be processed and the containers are not 
added to the app.
2. Use a ConcurrentHashMap to store containers in the app, because I encountered a 
ConcurrentModificationException when iterating app.getContainers(), and I also see 
the web UI and AppLogAggregator using app.getContainers() without protection (see 
the sketch below).
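A minimal sketch of change (2) under the assumption that only the map type changes; the class and field names are illustrative rather than the NM source.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AppContainersSketch {
  // Iterators over a ConcurrentHashMap are weakly consistent and never throw
  // ConcurrentModificationException, so readers such as the web UI and the log
  // aggregator can iterate while the dispatcher thread adds or removes entries.
  private final Map<String, Object> containers = new ConcurrentHashMap<>();

  public Map<String, Object> getContainers() {
    return containers;
  }
}
{code}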

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2017-03-13 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4051:
---
Attachment: YARN-4051.07.patch

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, 
> YARN-4051.06.patch, YARN-4051.07.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2017-03-07 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4051:
---
Attachment: YARN-4051.06.patch

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, YARN-4051.06.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2017-03-07 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4051:
---
Attachment: (was: YARN-4051.06.patch)

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2017-03-07 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900345#comment-15900345
 ] 

sandflee commented on YARN-4051:


Since the RM will resend FINISH_APPS/FINISH_CONTAINER if the NM reports the 
app/container as running, it seems safe to drop the event while the container is 
recovering, [~jlowe].

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, YARN-4051.06.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Assigned] (YARN-5301) NM mount cpu cgroups failed on some system

2017-03-07 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee reassigned YARN-5301:
--

Assignee: (was: sandflee)

> NM mount cpu cgroups failed on some system
> --
>
> Key: YARN-5301
> URL: https://issues.apache.org/jira/browse/YARN-5301
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>
> On Ubuntu with Linux kernel 3.19, NM start failed when auto-mounting of cgroups 
> is enabled. Trying the commands:
> ./bin/container-executor --mount-cgroups yarn-hadoop cpu=/cgroup/cpu    fails
> ./bin/container-executor --mount-cgroups yarn-hadoop cpu,cpuacct=/cgroup/cpu    succeeds






[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2017-03-07 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4051:
---
Attachment: YARN-4051.06.patch

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, YARN-4051.06.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.






[jira] [Commented] (YARN-5911) DrainDispatcher does not drain all events on stop even if setDrainEventsOnStop is true

2016-11-21 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684917#comment-15684917
 ] 

sandflee commented on YARN-5911:


Sorry, I hadn't noticed it was removed in the patch.
Patch LGTM.

> DrainDispatcher does not drain all events on stop even if 
> setDrainEventsOnStop is true
> --
>
> Key: YARN-5911
> URL: https://issues.apache.org/jira/browse/YARN-5911
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-5911.01.patch, YARN-5911.02.patch
>
>
> DrainDispatcher#serviceStop sets the stopped flag first before draining the 
> event queue.
> This means that the thread terminates as soon as it encounters stopped flag 
> as true and does not continue to process leftover events in queue, something 
> which it should do if setDrainEventsOnStop is set.
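A toy illustration of the ordering the description calls for (drain first, then signal stop), not the DrainDispatcher code itself; the field and method names are assumptions.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StopOrderingSketch {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;
  private volatile boolean drainEventsOnStop = true;

  public void serviceStop() throws InterruptedException {
    if (drainEventsOnStop) {
      // Drain first: let every queued event be handled before the consumer
      // thread is told to exit. Setting 'stopped' before this point lets the
      // consumer loop bail out with events still sitting in the queue.
      while (!eventQueue.isEmpty()) {
        Thread.sleep(10);
      }
    }
    stopped = true;   // only now signal the consumer loop to terminate
  }
}
{code}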






[jira] [Commented] (YARN-5911) DrainDispatcher does not drain all events on stop even if setDrainEventsOnStop is true

2016-11-21 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15683146#comment-15683146
 ] 

sandflee commented on YARN-5911:


It seems there is no need to keep the stopped variable in DrainDispatcher?

> DrainDispatcher does not drain all events on stop even if 
> setDrainEventsOnStop is true
> --
>
> Key: YARN-5911
> URL: https://issues.apache.org/jira/browse/YARN-5911
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-5911.01.patch, YARN-5911.02.patch
>
>
> DrainDispatcher#serviceStop sets the stopped flag first before draining the 
> event queue.
> This means that the thread terminates as soon as it encounters stopped flag 
> as true and does not continue to process leftover events in queue, something 
> which it should do if setDrainEventsOnStop is set.






[jira] [Commented] (YARN-5911) DrainDispatcher does not drain all events on stop even if setDrainEventsOnStop is true

2016-11-18 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678114#comment-15678114
 ] 

sandflee commented on YARN-5911:


Thanks [~varun_saxena], one minor comment:
if the events are not drained, we have to wait at least 1s since DrainDispatcher 
does not invoke waitForDrained.notify; could we use a shorter time?
{code}
while (!isDrained() && eventHandlingThread != null
&& eventHandlingThread.isAlive()
&& System.currentTimeMillis() < endTime) {
  waitForDrained.wait(1000);
  LOG.info("Waiting for AsyncDispatcher to drain. Thread state is :" +
  eventHandlingThread.getState());
}
  }
{code}

> DrainDispatcher does not drain all events on stop even if 
> setDrainEventsOnStop is true
> --
>
> Key: YARN-5911
> URL: https://issues.apache.org/jira/browse/YARN-5911
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-5911.01.patch
>
>
> DrainDispatcher#serviceStop sets the stopped flag first before draining the 
> event queue.
> This means that the thread terminates as soon as it encounters stopped flag 
> as true and does not continue to process leftover events in queue, something 
> which it should do if setDrainEventsOnStop is set.






[jira] [Commented] (YARN-5898) Container can not stop, because the call stopContainer NMClient method appears DIGEST-MD5 exception, onGetContainerStatusError NMClientAsync method is also the same

2016-11-17 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673309#comment-15673309
 ] 

sandflee commented on YARN-5898:


Had the AM ever been restarted?

> Container can not stop, because the call stopContainer NMClient method 
> appears DIGEST-MD5 exception, onGetContainerStatusError NMClientAsync method 
> is also the same
> 
>
> Key: YARN-5898
> URL: https://issues.apache.org/jira/browse/YARN-5898
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 2.6.0
> Environment: cdh5.5,java 7
>Reporter: gaoyanfu
>  Labels: DIGEST-MD5, getContainerStatuses, 
> onGetContainerStatusError, stopContainer
> Fix For: 2.6.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Calling the NMClientAsync getContainerStatusAsync method ends up in the 
> corresponding onGetContainerStatusError callback with a DIGEST-MD5 SaslException, 
> so the ContainerStatus cannot be obtained; calling the NMClient stopContainer 
> method raises the same exception, so the container cannot be stopped.
> ---REST API---
> request:
> http://server3.xdpp.boco:8042/ws/v1/node/containers
> response:
> {"containers":{"container":[
> {"id":"container_e07_1477704520017_0001_01_04","state":"RUNNING","exitCode":-1000,"diagnostics":"","user":"xdpp","totalMemoryNeededMB":8704,"totalVCoresNeeded":1,"containerLogsLink":"http://server3.xdpp.boco:8042/node/containerlogs/container_e07_1477704520017_0001_01_04/xdpp","nodeId":"server3.xdpp.boco:8041"},
> {"id":"container_e09_1477719748865_0003_01_25","state":"RUNNING","exitCode":-1000,"diagnostics":"","user":"xdpp","totalMemoryNeededMB":1536,"totalVCoresNeeded":1,"containerLogsLink":"http://server3.xdpp.boco:8042/node/containerlogs/container_e09_1477719748865_0003_01_25/xdpp","nodeId":"server3.xdpp.boco:8041"},
> {"id":"container_e09_1477719748865_0004_02_000103","state":"RUNNING","exitCode":-1000,"diagnostics":"","user":"xdpp","totalMemoryNeededMB":6656,"totalVCoresNeeded":1,"containerLogsLink":"http://server3.xdpp.boco:8042/node/containerlogs/container_e09_1477719748865_0004_02_000103/xdpp","nodeId":"server3.xdpp.boco:8041"}
> ]}}
> ---exception--
> 2016-11-14 11:17:12.725 ERROR containerStatusLogger 
> [ContainerManager.java:484] *Container onGetContainerStatusError deal 
> begin.containerId:container_e09_1477719748865_0003_01_25
> javax.security.sasl.SaslException: DIGEST-MD5: digest response format 
> violation. Mismatched response.
>   at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown 
> Source) ~[na:na]
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[na:1.7.0_79]
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
> ~[na:1.7.0_79]
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) 
> ~[hadoop-yarn-common-2.6.0.jar:na]
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) 
> ~[hadoop-yarn-common-2.6.0.jar:na]
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:127)
>  ~[hadoop-yarn-common-2.6.0.jar:na]
>   at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source) ~[na:na]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[na:1.7.0_79]
>   at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_79]
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  ~[hadoop-common-2.6.0.jar:na]
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  ~[hadoop-common-2.6.0.jar:na]
>   at com.sun.proxy.$Proxy23.getContainerStatuses(Unknown Source) ~[na:na]
>   at 
> org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:267)
>  ~[hadoop-yarn-client-2.6.0.jar:na]
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$ContainerEventProcessor.run(NMClientAsyncImpl.java:534)
>  ~[hadoop-yarn-client-2.6.0.jar:na]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_79]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_79]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
> Caused by: org.apache.hadoop.ipc.RemoteException: DIGEST-MD5: digest response 
> format violation. Mismatched 

[jira] [Assigned] (YARN-5897) using drainEvent to replace sleep-wait in MockRM#waitForState

2016-11-16 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee reassigned YARN-5897:
--

Assignee: sandflee

> using drainEvent to replace sleep-wait in MockRM#waitForState
> -
>
> Key: YARN-5897
> URL: https://issues.apache.org/jira/browse/YARN-5897
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
>







[jira] [Created] (YARN-5897) using drainEvent to replace sleep-wait in MockRM#waitForState

2016-11-16 Thread sandflee (JIRA)
sandflee created YARN-5897:
--

 Summary: using drainEvent to replace sleep-wait in 
MockRM#waitForState
 Key: YARN-5897
 URL: https://issues.apache.org/jira/browse/YARN-5897
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: sandflee









[jira] [Created] (YARN-5896) add drainEvents to MockAM to reduce test failure

2016-11-16 Thread sandflee (JIRA)
sandflee created YARN-5896:
--

 Summary: add drainEvents to MockAM to reduce test failure
 Key: YARN-5896
 URL: https://issues.apache.org/jira/browse/YARN-5896
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: sandflee
Assignee: sandflee









[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672982#comment-15672982
 ] 

sandflee commented on YARN-5895:


I'm OK with that; closing it as a duplicate.

> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts






[jira] [Resolved] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee resolved YARN-5895.

Resolution: Duplicate

> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts






[jira] [Comment Edited] (YARN-5548) Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672972#comment-15672972
 ] 

sandflee edited comment on YARN-5548 at 11/17/16 6:54 AM:
--

Agreed, and maybe we could try to replace all MemoryStateStore usages with 
MockRMMemoryStateStore, not only in this test.


was (Author: sandflee):
Agreed, and maybe we could try to replace all MemoryStateStore usages with 
MockRMStateStore.

> Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart
> --
>
> Key: YARN-5548
> URL: https://issues.apache.org/jira/browse/YARN-5548
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>  Labels: oct16-easy, test
> Attachments: YARN-5548.0001.patch, YARN-5548.0002.patch, 
> YARN-5548.0003.patch
>
>
> https://builds.apache.org/job/PreCommit-YARN-Build/12850/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testFinishedAppRemovalAfterRMRestart/
> {noformat}
> Error Message
> Stacktrace
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1471885197388 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1471885197417 
> application_state: RMAPP_FINISHED finish_time: 1471885197478>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1656)
> {noformat}






[jira] [Commented] (YARN-5548) Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672972#comment-15672972
 ] 

sandflee commented on YARN-5548:


Agreed, and maybe we could try to replace all MemoryStateStore usages with 
MockRMStateStore.

> Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart
> --
>
> Key: YARN-5548
> URL: https://issues.apache.org/jira/browse/YARN-5548
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>  Labels: oct16-easy, test
> Attachments: YARN-5548.0001.patch, YARN-5548.0002.patch, 
> YARN-5548.0003.patch
>
>
> https://builds.apache.org/job/PreCommit-YARN-Build/12850/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testFinishedAppRemovalAfterRMRestart/
> {noformat}
> Error Message
> Stacktrace
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1471885197388 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1471885197417 
> application_state: RMAPP_FINISHED finish_time: 1471885197478>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1656)
> {noformat}






[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672967#comment-15672967
 ] 

sandflee commented on YARN-5375:


Yes, to keep the patch simple I didn't replace MemoryRMStateStore; let's handle 
that in YARN-5548.

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, 
> YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672602#comment-15672602
 ] 

sandflee commented on YARN-5895:


Yes, I think there are two main places where events are not drained:
1, tests that explicitly use MemoryStateStore, like most of the RMRestart tests; 
we could resolve this by using MockRMMemoryStateStore.
2, tests that call MockRM.waitForState in a static way; we could either 
explicitly add rm.drainEvents() before calling MockRM.waitForState, or change 
MockRM.waitForState from static to non-static (see the sketch below). Thoughts?

bq. some of the tests in YARN-4929 need to look again.
will do
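
A minimal sketch of what those two options could look like, assuming a started 
MockRM; MockRMMemoryStateStore is the store proposed above, and the app/attempt 
variables are illustrative:
{code}
// Option 1, sketched: back MockRM with the proposed MockRMMemoryStateStore so
// that state-store events are also covered by drainEvents().
MockRM rm = new MockRM(conf, new MockRMMemoryStateStore());
rm.start();

// Option 2, sketched: drain the RM and scheduler queues explicitly before the
// existing static wait.
rm.drainEvents();
MockRM.waitForState(attempt, RMAppAttemptState.SCHEDULED);
{code}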



> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672604#comment-15672604
 ] 

sandflee commented on YARN-5895:


will do

> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672483#comment-15672483
 ] 

sandflee commented on YARN-5895:


yes. 

> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672377#comment-15672377
 ] 

sandflee edited comment on YARN-5895 at 11/17/16 1:48 AM:
--

Thanks [~wilfreds] for reporting this. Before YARN-5375, drainEvents did not 
guarantee that all events were really drained, so there was still a small chance 
of test failure, especially for the RMRestart tests, which use MemoryStateStore 
rather than MockMemoryStateStore. In this jira I'll try to resolve the 
RMRestart-related test failures completely.


was (Author: sandflee):
thanks [~wilfreds] for reporting this, before YARN-5375, drainEvents did't 
grant really drain all the events, no there is a litte change to cause test 
failure. especially for RMRestart test which is using MemoryStateStore not 
MockMemoryStateStore. at this jira. I'll try to resolve RMRestart related test 
failure completely.

> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672377#comment-15672377
 ] 

sandflee commented on YARN-5895:


Thanks [~wilfreds] for reporting this. Before YARN-5375, drainEvents did not 
guarantee that all events were really drained, so there was still a small chance 
of test failure, especially for the RMRestart tests, which use MemoryStateStore 
rather than MockMemoryStateStore. In this jira I'll try to resolve the 
RMRestart-related test failures completely.

> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee reassigned YARN-5895:
--

Assignee: sandflee

> TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey 
> ---
>
> Key: YARN-5895
> URL: https://issues.apache.org/jira/browse/YARN-5895
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha1
>Reporter: Wilfred Spiegelenburg
>Assignee: sandflee
>
> Even after YARN-5362 the test is still flaky:
> {code}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.338 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: 
> 1479326359158 } application_name: "" queue: "default" priority { priority: 0 
> } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 
> resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" 
> keep_containers_across_application_attempts: false 
> attempt_failures_validity_interval: 0 am_container_resource_request { 
> priority { priority: 0 } resource_name: "*" capability { memory: 1024 
> virtual_cores: 1 } num_containers: 0 relax_locality: true 
> node_label_expression: "" execution_type_request { execution_type: GUARANTEED 
> enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 
> application_state: RMAPP_FINISHED finish_time: 1479326359214>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
> {code}
> The test finishes with two asserts. This is the second assert that fails, 
> YARN-5362 looked at a failure on the first of the two asserts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672354#comment-15672354
 ] 

sandflee commented on YARN-5375:


Thanks Rohith, Sunil, and Varun for the review and commit! After this patch most 
of the random test failures should be resolved. I will open another jira to fix 
RMRestart and MockAM.

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, 
> YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-15 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669662#comment-15669662
 ] 

sandflee commented on YARN-5375:


update YARN-5375.12.new.patch to trigger jenkins

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, 
> YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-15 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.12.new.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, 
> YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-15 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.12.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, 
> YARN-5375.11.patch, YARN-5375.12.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-15 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15666843#comment-15666843
 ] 

sandflee commented on YARN-5375:


Updated the patch to invoke drainEventsImplicitly() at the front of 
waitForState(), and fixed a test failure caused by this:
{code:title=TestAbstractYarnScheduler.java}
  // AM crashes, and a new app-attempt gets created
  node.nodeHeartbeat(applicationAttemptOneID, 1, ContainerState.COMPLETE);
  rm.waitForState(node, am1ContainerID, RMContainerState.COMPLETED, 30 * 1000);
  RMAppAttempt rmAppAttempt2 = MockRM.waitForAttemptScheduled(rmApp, rm);
{code}
waitForState now drains all events first, so the completed container no longer 
exists in the new scheduler app attempt; the corresponding waitForState() code 
then invokes a node heartbeat, the app attempt reaches the ALLOCATED state, and 
the check for the SCHEDULED state fails.
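
A minimal sketch of that ordering, assuming a non-static MockRM#waitForState; 
the method and variable names follow the discussion above and are illustrative:
{code}
public void waitForState(ApplicationId appId, RMAppState expected)
    throws InterruptedException {
  drainEventsImplicitly();                 // flush RM and scheduler queues first
  RMApp app = getRMContext().getRMApps().get(appId);
  long deadline = System.currentTimeMillis() + 30 * 1000;
  while (app.getState() != expected && System.currentTimeMillis() < deadline) {
    Thread.sleep(100);                     // then poll for the expected state
  }
  Assert.assertEquals(expected, app.getState());
}
{code}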


> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, YARN-5375.11.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-15 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.11.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, YARN-5375.11.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-14 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663934#comment-15663934
 ] 

sandflee commented on YARN-5375:


Thanks [~rohithsharma]. The TestTokenClientRMService failure seems unrelated to 
this issue; it does not pass locally either and is tracked by YARN-5875.

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource

2016-11-10 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653433#comment-15653433
 ] 

sandflee commented on YARN-5453:


thanks [~kasha] for review and  commit !

> FairScheduler#update may skip update demand resource of child queue/app if 
> current demand reached maxResource
> -
>
> Key: YARN-5453
> URL: https://issues.apache.org/jira/browse/YARN-5453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-easy
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-5453.01.patch, YARN-5453.02.patch, 
> YARN-5453.03.patch, YARN-5453.04.patch, YARN-5453.05.patch
>
>
> {code}
>   demand = Resources.createResource(0);
>   for (FSQueue childQueue : childQueues) {
> childQueue.updateDemand();
> Resource toAdd = childQueue.getDemand();
> demand = Resources.add(demand, toAdd);
> demand = Resources.componentwiseMin(demand, maxRes);
> if (Resources.equals(demand, maxRes)) {
>   break;
> }
>   }
> {code}
> if one singe queue's demand resource exceed maxRes,  the other queue's demand 
> resource will not update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-06 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15643429#comment-15643429
 ] 

sandflee commented on YARN-5375:


Updated the patch to fix TestFairScheduler by adding drainEvents after a node is 
registered. I did not use rm.dispatcher.handle(), because most of 
TestFairScheduler does not use that approach.

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-11-06 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.09.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-medium
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch, YARN-5375.09.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource

2016-11-06 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15642979#comment-15642979
 ] 

sandflee commented on YARN-5453:


thanks [~kasha], patch updated.

> FairScheduler#update may skip update demand resource of child queue/app if 
> current demand reached maxResource
> -
>
> Key: YARN-5453
> URL: https://issues.apache.org/jira/browse/YARN-5453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-easy
> Attachments: YARN-5453.01.patch, YARN-5453.02.patch, 
> YARN-5453.03.patch, YARN-5453.04.patch, YARN-5453.05.patch
>
>
> {code}
>   demand = Resources.createResource(0);
>   for (FSQueue childQueue : childQueues) {
> childQueue.updateDemand();
> Resource toAdd = childQueue.getDemand();
> demand = Resources.add(demand, toAdd);
> demand = Resources.componentwiseMin(demand, maxRes);
> if (Resources.equals(demand, maxRes)) {
>   break;
> }
>   }
> {code}
> if one singe queue's demand resource exceed maxRes,  the other queue's demand 
> resource will not update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource

2016-11-06 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5453:
---
Attachment: YARN-5453.05.patch

> FairScheduler#update may skip update demand resource of child queue/app if 
> current demand reached maxResource
> -
>
> Key: YARN-5453
> URL: https://issues.apache.org/jira/browse/YARN-5453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-easy
> Attachments: YARN-5453.01.patch, YARN-5453.02.patch, 
> YARN-5453.03.patch, YARN-5453.04.patch, YARN-5453.05.patch
>
>
> {code}
>   demand = Resources.createResource(0);
>   for (FSQueue childQueue : childQueues) {
> childQueue.updateDemand();
> Resource toAdd = childQueue.getDemand();
> demand = Resources.add(demand, toAdd);
> demand = Resources.componentwiseMin(demand, maxRes);
> if (Resources.equals(demand, maxRes)) {
>   break;
> }
>   }
> {code}
> if one singe queue's demand resource exceed maxRes,  the other queue's demand 
> resource will not update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5276) print more info when event queue is blocked

2016-11-02 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628654#comment-15628654
 ] 

sandflee commented on YARN-5276:


Thanks [~miklos.szeg...@cloudera.com] for your detailed reply; it seems there is 
not much need to add a UT :(

> print more info when event queue is blocked
> ---
>
> Key: YARN-5276
> URL: https://issues.apache.org/jira/browse/YARN-5276
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-easy
> Attachments: YARN-5276.01.patch, YARN-5276.02.patch, 
> YARN-5276.03.patch, YARN-5276.04.patch
>
>
> we now see logs like "Size of event-queue is 498000, Size of event-queue is 
> 499000" and difficult to know which event flood the queue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5276) print more info when event queue is blocked

2016-10-27 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5276:
---
Attachment: YARN-5276.04.patch

> print more info when event queue is blocked
> ---
>
> Key: YARN-5276
> URL: https://issues.apache.org/jira/browse/YARN-5276
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-easy
> Attachments: YARN-5276.01.patch, YARN-5276.02.patch, 
> YARN-5276.03.patch, YARN-5276.04.patch
>
>
> we now see logs like "Size of event-queue is 498000, Size of event-queue is 
> 499000" and difficult to know which event flood the queue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5276) print more info when event queue is blocked

2016-10-27 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613741#comment-15613741
 ] 

sandflee commented on YARN-5276:


Thanks for your comment. The patch just adds some log info; I wonder how to 
test it?
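
A minimal sketch of the kind of log info meant here, assuming the dispatcher can 
iterate its event queue; the threshold, variable names, and message format are 
illustrative, not the actual patch:
{code}
// On every 1000-event growth step, log a per-event-type breakdown so the
// flooding event type is visible, not just the raw queue size.
if (qSize != 0 && qSize % 1000 == 0 && qSize != lastLoggedQueueSize) {
  Map<String, Long> counts = new HashMap<>();
  for (Event e : eventQueue) {
    counts.merge(e.getType().toString(), 1L, Long::sum);
  }
  LOG.info("Size of event-queue is " + qSize + ", event counts: " + counts);
  lastLoggedQueueSize = qSize;
}
{code}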

> print more info when event queue is blocked
> ---
>
> Key: YARN-5276
> URL: https://issues.apache.org/jira/browse/YARN-5276
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: sandflee
>Assignee: sandflee
>  Labels: oct16-easy
> Attachments: YARN-5276.01.patch, YARN-5276.02.patch, 
> YARN-5276.03.patch
>
>
> we now see logs like "Size of event-queue is 498000, Size of event-queue is 
> 499000" and difficult to know which event flood the queue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-10-27 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612189#comment-15612189
 ] 

sandflee commented on YARN-5375:


Thanks [~rohithsharma]. Yes, MockRM#createSchedulerEventDispatcher triggers the 
deadlock: the NodeAddEvent that used to be left in the scheduler event queue now 
gets processed. I agree that using resourcemanager.handle(nodeUpdate) could 
solve this bug, and this test seems very special in that only the RM dispatcher 
is started.
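
A sketch of that synchronous alternative, assuming the test holds the 
ResourceManager and the registered RMNode; the event class is the usual 
scheduler node-update event and may differ from the test's actual code:
{code}
// Deliver the node update to the scheduler directly instead of going through
// the async scheduler event dispatcher, so the test cannot deadlock on it.
resourceManager.getResourceScheduler()
    .handle(new NodeUpdateSchedulerEvent(node));
{code}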


> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-10-25 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605288#comment-15605288
 ] 

sandflee commented on YARN-5375:


Updated YARN-5375.08.patch to address [~rohithsharma]'s comments, and:
1. added MockRMNullStateStore, since NullStateStore is the most commonly used store
2. a simple fix for the TestFairScheduler deadlock

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-10-25 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.08.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, 
> YARN-5375.08.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-10-24 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601173#comment-15601173
 ] 

sandflee commented on YARN-5375:


Sorry for the delay; I will do this in the next few days.

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5362) TestRMRestart#testFinishedAppRemovalAfterRMRestart can fail

2016-10-08 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559054#comment-15559054
 ] 

sandflee commented on YARN-5362:


thanks [~Naganarasimha], I'll have a look.

> TestRMRestart#testFinishedAppRemovalAfterRMRestart can fail
> ---
>
> Key: YARN-5362
> URL: https://issues.apache.org/jira/browse/YARN-5362
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: sandflee
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-5362.01.patch
>
>
> Saw the following in a precommit build that only changed an unrelated unit 
> test:
> {noformat}
> Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 101.265 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 0.411 sec  <<< FAILURE!
> java.lang.AssertionError: expected null, but 
> was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotNull(Assert.java:664)
>   at org.junit.Assert.assertNull(Assert.java:646)
>   at org.junit.Assert.assertNull(Assert.java:656)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1653)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource

2016-10-06 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551422#comment-15551422
 ] 

sandflee commented on YARN-5453:


Hi [~kasha], the patch is updated; the failed test passes locally and seems 
unrelated.

> FairScheduler#update may skip update demand resource of child queue/app if 
> current demand reached maxResource
> -
>
> Key: YARN-5453
> URL: https://issues.apache.org/jira/browse/YARN-5453
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5453.01.patch, YARN-5453.02.patch, 
> YARN-5453.03.patch, YARN-5453.04.patch
>
>
> {code}
>   demand = Resources.createResource(0);
>   for (FSQueue childQueue : childQueues) {
> childQueue.updateDemand();
> Resource toAdd = childQueue.getDemand();
> demand = Resources.add(demand, toAdd);
> demand = Resources.componentwiseMin(demand, maxRes);
> if (Resources.equals(demand, maxRes)) {
>   break;
> }
>   }
> {code}
> if one singe queue's demand resource exceed maxRes,  the other queue's demand 
> resource will not update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource

2016-10-05 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5453:
---
Attachment: YARN-5453.04.patch

> FairScheduler#update may skip update demand resource of child queue/app if 
> current demand reached maxResource
> -
>
> Key: YARN-5453
> URL: https://issues.apache.org/jira/browse/YARN-5453
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5453.01.patch, YARN-5453.02.patch, 
> YARN-5453.03.patch, YARN-5453.04.patch
>
>
> {code}
>   demand = Resources.createResource(0);
>   for (FSQueue childQueue : childQueues) {
> childQueue.updateDemand();
> Resource toAdd = childQueue.getDemand();
> demand = Resources.add(demand, toAdd);
> demand = Resources.componentwiseMin(demand, maxRes);
> if (Resources.equals(demand, maxRes)) {
>   break;
> }
>   }
> {code}
> if one singe queue's demand resource exceed maxRes,  the other queue's demand 
> resource will not update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-10-01 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538216#comment-15538216
 ] 

sandflee commented on YARN-5375:


Thanks [~rohithsharma] for your review !

bq. private volatile boolean drained = true; default value has been changed. 
Would you tell why this change required?
To be consistent with AsyncDispatcher#drained; a default value of true also 
seems more reasonable.

bq. I think change in the method static to non-static not necessarily required 
in MockRM#waitForState. Lets keep it as it is. As a result, MockAM 
modifications are not at all required.
The change from static to non-static is needed to add drainEventsImplicitly(). 
If we keep it as it is, the invoker (MockAM) may have to call rm#drainEvents 
explicitly.

bq. nit: couple of changes which are not modified are appeared in patch. May be 
check those also, else patch looks very huge. Ex : MockRM class, line no 349, 
332
will do

bq. One doubt, if once * disableDrainEventsImplicitly* set then there is no way 
to enable it. Should we provide enabling method also?
I couldn't figure out a scenario where we would disable and then re-enable it, 
but I'm OK with adding an enable method.

bq. After this patch, can sleeps can be avoided ? If yes, I think we need to 
remove so that test execute faster.
Yes; after drainEvents all events are processed, so there is no need to 
sleep-wait anymore (see the sketch below).
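
A minimal sketch of that sleep removal, assuming a started MockRM; the old 
pattern is shown only for contrast:
{code}
// Before: sleep and hope the dispatcher has caught up.
Thread.sleep(1000);
Assert.assertEquals(RMAppState.ACCEPTED, app.getState());

// After: drain the RM and scheduler queues, then assert right away.
rm.drainEvents();
rm.waitForState(app.getApplicationId(), RMAppState.ACCEPTED);
{code}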

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-09-26 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522347#comment-15522347
 ] 

sandflee commented on YARN-5375:


I'm OK with either method; any thoughts?

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures related to RMApp/RMAppattempt comes to some state but 
> some event are not processed in rm event queue or scheduler event queue, 
> cause test failure, seems we could implicitly invokes drainEvents(should also 
> drain sheduler event) in some mockRM method like waitForState



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5315) Standby RM keep sending start am container request to NM

2016-09-26 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522338#comment-15522338
 ] 

sandflee commented on YARN-5315:


{quote}
shutdownNow :
Attempts to stop all actively executing tasks, halts the processing of waiting 
tasks, and returns a list of the tasks that were awaiting execution. These 
tasks are drained (removed) from the task queue upon return from this method.
{quote}
Thanks [~jianhe]. shutdownNow will interrupt the active workers and drain the 
pending tasks, so the difference is that patch 1 will not wait for the active 
workers to terminate while patch 2 will. It seems we wouldn't get much benefit 
from awaitTermination.
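
A minimal sketch of the two variants being compared, assuming launcherPool is 
the ApplicationMasterLauncher's ExecutorService; the timeout value is 
illustrative:
{code}
// Patch 1 style: interrupt active workers and discard queued launch requests,
// without waiting for in-flight launches to finish.
launcherPool.shutdownNow();

// Patch 2 style: additionally wait (bounded) for the interrupted workers.
launcherPool.shutdownNow();
if (!launcherPool.awaitTermination(10, TimeUnit.SECONDS)) {
  LOG.warn("ApplicationMasterLauncher pool did not terminate in time");
}
{code}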

> Standby RM keep sending start am container request to NM
> 
>
> Key: YARN-5315
> URL: https://issues.apache.org/jira/browse/YARN-5315
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5315.01.patch, YARN-5315.02.patch
>
>
> 1, network partitions, RM couldn't connect to NMs and start AM request pending
> 2, RM becomes standby, int ApplicatioinMasterLauncher#serviceStop, 
> launcherPool are shutdown. the launching thread are interrupted, but start AM 
> request may still left in Queue
> 3,network reconnect,  standby RM sends start AM request to NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-28 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1565#comment-1565
 ] 

sandflee edited comment on YARN-5375 at 8/29/16 1:24 AM:
-

Thanks [~varun_saxena], [~sunilg], [~rohithsharma] for your comments and 
suggestions! I updated two patches.
1, add MockRMMemoryStateStore in MockRM
The "drain patch" adds a DrainDispatcher and, on drainEvents, calls 
rm-dispatcher.await, statestore-dispatcher.await, then rm-dispatcher.await 
again; this works for almost all cases.
The "sync patch" makes state-store events processed synchronously, so 
drainEvents will drain all events; this drains some unnecessary events, but 
seems the more general way.
2, access to DrainDispatcher#drained should be protected by a mutex, or there 
will be a race condition.



was (Author: sandflee):
thanks [~varun_saxena], [~sunilg], [~rohithsharma] for your comment and suggest 
!  
1, add MockRMMemoryStateStore,
"drain patch" adds a DrainDispatcher and will call rm-dispatcher.await, 
statestore-dispatcher.await rm-dispatcher.await when drainEvents. this works 
for almost all of cases
"sync patch" makes stateStore Event processed in a sync way. so drainEvents 
will drain all events, this will drain some unnessesary events, but seems a 
more general way.
2, accessing DrainDispatcher#drained should be protected by mutex, or there 
will be a race condition.


> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-28 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1565#comment-1565
 ] 

sandflee commented on YARN-5375:


thanks [~varun_saxena], [~sunilg], [~rohithsharma] for your comments and 
suggestions!
1, add MockRMMemoryStateStore.
The "drain patch" adds a DrainDispatcher and, on drainEvents, calls 
rm-dispatcher.await, then statestore-dispatcher.await, then rm-dispatcher.await 
again. This works for almost all cases.
The "sync patch" makes state-store events be processed synchronously, so 
drainEvents drains all events. It drains some unnecessary events, but seems a 
more general approach.
2, access to DrainDispatcher#drained should be protected by a mutex, otherwise 
there will be a race condition.


> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-28 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: (was: YARN-5375.07-sync-statestore.patch)

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-28 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.07-sync-statestore.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-28 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.07-drain-statestore.patch
YARN-5375.07-sync-statestore.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, 
> YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-24 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435099#comment-15435099
 ] 

sandflee commented on YARN-5375:


thanks [~rohithsharma], even if we use the order rmDispatcher.await --> 
stateStoreDispatcher.await --> rmDispatcher.await, there is still a very small 
chance that some events are not processed in the stateStoreDispatcher or the 
rmDispatcher, agree? I prefer an accurate way that has really drained all events 
once MockRM#drainEvents returns. Based on that we could use drainEvents to 
replace the sleep-wait loop in MockRM#waitForState, which may help reduce test 
time. Thoughts?

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-24 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434730#comment-15434730
 ] 

sandflee commented on YARN-5375:


bq. One comment from earlier patch in MockRM is need to take care whenever 
drain for Stastore dispatcher, again need to wait for draining rm-dispatcher. 
This is required because state store trigger another event to rm-dispatcher.
yes, we should take care of this. But simply draining the rm-dispatcher twice may 
not help, because the RM state store may produce new events again. One approach 
is to add a timestamp to DrainDispatcher that records when the last event was 
added; after draining the state-store events we check the rm-dispatcher's 
timestamp, and if it has not changed we can be sure that all events are drained.
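
To make the timestamp idea concrete, a small self-contained sketch (toy classes and names, not the real DrainDispatcher or MockRM code): every enqueue bumps a stamp, and after draining the state-store side we re-check the RM-side stamp to decide whether another round is needed.
{code:title=timestamp-based drain check (illustrative sketch only)}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Toy dispatcher: records a stamp for the newest enqueued event so callers can
 *  detect whether new events arrived while another dispatcher was drained. */
class ToyDrainDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final AtomicLong lastEnqueueStamp = new AtomicLong();

  void dispatch(Runnable event) {
    lastEnqueueStamp.incrementAndGet();   // "timestamp" of the newest event
    queue.add(event);
  }

  /** Drain by running queued events inline (good enough for a sketch). */
  void await() {
    Runnable event;
    while ((event = queue.poll()) != null) {
      event.run();
    }
  }

  long stamp() {
    return lastEnqueueStamp.get();
  }
}

public class DrainAllEvents {
  /** Drain both dispatchers until no new RM events show up while the
   *  state-store dispatcher is being drained. */
  static void drainEvents(ToyDrainDispatcher rm, ToyDrainDispatcher stateStore) {
    while (true) {
      rm.await();
      long stampBefore = rm.stamp();
      stateStore.await();                 // may enqueue follow-up events on rm
      if (rm.stamp() == stampBefore) {
        return;                           // nothing new arrived: everything is drained
      }
    }
  }

  public static void main(String[] args) {
    ToyDrainDispatcher rm = new ToyDrainDispatcher();
    ToyDrainDispatcher stateStore = new ToyDrainDispatcher();
    // an RM event whose handler stores something, and the store handler posts back to the RM
    rm.dispatch(() -> stateStore.dispatch(() -> rm.dispatch(
        () -> System.out.println("follow-up RM event handled"))));
    drainEvents(rm, stateStore);
    System.out.println("all events drained");
  }
}
{code}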

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-23 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434024#comment-15434024
 ] 

sandflee commented on YARN-5375:


also see YARN-5043: if a state-store event is not processed, it is quite likely 
to produce another RM event.

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-22 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431785#comment-15431785
 ] 

sandflee commented on YARN-5375:


Thanks [~varun_saxena], yes, this will reduce the changes to the main classes and 
is much cleaner. Thoughts? [~sunilg] [~rohithsharma]

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked

2016-08-18 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426170#comment-15426170
 ] 

sandflee commented on YARN-5526:


thanks [~varun_saxena] for the suggestion and review!

> DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
> --
>
> Key: YARN-5526
> URL: https://issues.apache.org/jira/browse/YARN-5526
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Fix For: 2.9.0
>
> Attachments: YARN-5526.01.patch, YARN-5526.02.patch
>
>







[jira] [Commented] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked

2016-08-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423746#comment-15423746
 ] 

sandflee commented on YARN-5526:


Thanks [~varun_saxena], that's more reasonable! Updated the patch.

> DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
> --
>
> Key: YARN-5526
> URL: https://issues.apache.org/jira/browse/YARN-5526
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5526.01.patch, YARN-5526.02.patch
>
>







[jira] [Updated] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked

2016-08-16 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5526:
---
Attachment: YARN-5526.02.patch

> DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
> --
>
> Key: YARN-5526
> URL: https://issues.apache.org/jira/browse/YARN-5526
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5526.01.patch, YARN-5526.02.patch
>
>







[jira] [Updated] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked

2016-08-16 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5526:
---
Attachment: YARN-5526.01.patch

> DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
> --
>
> Key: YARN-5526
> URL: https://issues.apache.org/jira/browse/YARN-5526
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5526.01.patch
>
>







[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422698#comment-15422698
 ] 

sandflee commented on YARN-5375:


1, replaced RMStateStore's AsyncDispatcher with a DrainDispatcher; yes, the code 
is not clean, suggestions welcome!
2, set the default value of the DrainDispatcher drained flag to true, and sleep a 
while when not drained to reduce CPU usage (see the small sketch below).
3, did not invoke setDrainEventsOnStop when creating the RMStateStore 
DrainDispatcher, because the DrainDispatcher would take 300s to stop with 
setDrainEventsOnStop enabled; filed YARN-5526 to track this.
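
A tiny illustrative sketch of point 2 (hypothetical names, not the actual patch): the flag defaults to true and the waiter sleeps briefly between checks instead of spinning.
{code:title=polling await with back-off (illustrative sketch only)}
/** Toy example: poll a "drained" flag, sleeping between checks instead of
 *  busy-waiting, so the waiting thread does not burn CPU. */
class DrainedFlag {
  private volatile boolean drained = true;   // default true, as described above

  void markBusy()    { drained = false; }
  void markDrained() { drained = true; }

  void await() throws InterruptedException {
    while (!drained) {
      Thread.sleep(50);   // back off briefly rather than spinning
    }
  }
}
{code}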

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked

2016-08-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422691#comment-15422691
 ] 

sandflee commented on YARN-5526:


DrainDispatcher overrides AsyncDispatcher#createThread, so AsyncDispatcher#drained 
will never be refreshed to true.
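
The failure mode is easier to see in a stripped-down toy (illustrative only, not the YARN classes): the base class's stop path waits on a flag that only its own loop refreshes, so a subclass that swaps in its own loop leaves the flag stuck at false.
{code:title=why overriding the event loop can break drain-on-stop (toy sketch)}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Base dispatcher: only its own event loop refreshes 'drained', and a
 *  drain-on-stop waits for that flag before shutting down. */
class BaseDispatcher {
  protected final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(16);
  protected volatile boolean drained = false;
  protected volatile boolean stopped = false;
  private Thread thread;

  /** Subclasses may override this hook - which is exactly the trap. */
  protected Runnable createLoop() {
    return () -> {
      while (!stopped) {
        drained = queue.isEmpty();          // only the base loop refreshes the flag
        Runnable event = queue.poll();
        if (event != null) { event.run(); } else { Thread.yield(); }
      }
    };
  }

  void start() {
    thread = new Thread(createLoop());
    thread.start();
  }

  /** Stop that insists on draining first (like setDrainEventsOnStop). */
  void stopAndDrain(long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!drained && System.currentTimeMillis() < deadline) {
      Thread.sleep(10);                     // would block for the full wait in real code
    }
    stopped = true;
    thread.join();
    System.out.println("stopped, drained=" + drained);
  }
}

/** Subclass swaps in its own loop but never refreshes the base 'drained' flag. */
class OverridingDispatcher extends BaseDispatcher {
  @Override
  protected Runnable createLoop() {
    return () -> {
      while (!stopped) {
        Runnable event = queue.poll();      // note: 'drained' is never touched here
        if (event != null) { event.run(); } else { Thread.yield(); }
      }
    };
  }
}

public class DrainOnStopDemo {
  public static void main(String[] args) throws InterruptedException {
    BaseDispatcher dispatcher = new OverridingDispatcher();
    dispatcher.start();
    dispatcher.stopAndDrain(500);           // prints "stopped, drained=false"
  }
}
{code}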

> DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
> --
>
> Key: YARN-5526
> URL: https://issues.apache.org/jira/browse/YARN-5526
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
>







[jira] [Created] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked

2016-08-16 Thread sandflee (JIRA)
sandflee created YARN-5526:
--

 Summary: DrainDispacher#ServiceStop blocked if 
setDrainEventsOnStop invoked
 Key: YARN-5526
 URL: https://issues.apache.org/jira/browse/YARN-5526
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: sandflee
Assignee: sandflee









[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures

2016-08-16 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5375:
---
Attachment: YARN-5375.06.patch

> invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
> --
>
> Key: YARN-5375
> URL: https://issues.apache.org/jira/browse/YARN-5375
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5375.01.patch, YARN-5375.03.patch, 
> YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch
>
>
> seen many test failures where an RMApp/RMAppAttempt reaches some state but some 
> events have not yet been processed in the RM event queue or the scheduler event 
> queue, causing the test to fail; it seems we could implicitly invoke drainEvents 
> (which should also drain scheduler events) in MockRM methods like waitForState






[jira] [Commented] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly

2016-08-15 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421781#comment-15421781
 ] 

sandflee commented on YARN-5521:


Thanks [~varun_saxena], [~sunilg], [~bibinchundatt]

> TestCapacityScheduler#testKillAllAppsInQueue fails randomly
> ---
>
> Key: YARN-5521
> URL: https://issues.apache.org/jira/browse/YARN-5521
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: sandflee
> Fix For: 2.9.0
>
> Attachments: Failure.txt, YARN-5521.01.patch
>
>
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler)
>   Time elapsed: 0.146 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertTrue(Assert.java:52)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188)
> Results :
> Failed tests:
>   TestCapacityScheduler.testKillAllAppsInQueue:2188 null
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0
> {noformat}






[jira] [Updated] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly

2016-08-15 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5521:
---
Attachment: YARN-5521.01.patch

> TestCapacityScheduler#testKillAllAppsInQueue fails randomly
> ---
>
> Key: YARN-5521
> URL: https://issues.apache.org/jira/browse/YARN-5521
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: sandflee
> Attachments: Failure.txt, YARN-5521.01.patch
>
>
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler)
>   Time elapsed: 0.146 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertTrue(Assert.java:52)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188)
> Results :
> Failed tests:
>   TestCapacityScheduler.testKillAllAppsInQueue:2188 null
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0
> {noformat}






[jira] [Commented] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly

2016-08-15 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421202#comment-15421202
 ] 

sandflee commented on YARN-5521:


thanks [~bibinchundatt], this seems to be caused by the app reaching the KILLED 
state while the APP_ATTEMPT_REMOVED event has not yet been processed by the 
scheduler dispatcher. It could be simply fixed by 
{code}
rm.waitForState(app.getApplicationId(), RMAppState.KILLED);
rm.waitForAppRemovedFromScheduler(app.getApplicationId());
appsInRoot = scheduler.getAppsInQueue("root");
{code}
and YARN-5375 introduces a more general way. cc [~sunilg] [~rohithsharma]

> TestCapacityScheduler#testKillAllAppsInQueue fails randomly
> ---
>
> Key: YARN-5521
> URL: https://issues.apache.org/jira/browse/YARN-5521
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: sandflee
> Attachments: Failure.txt
>
>
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler)
>   Time elapsed: 0.146 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertTrue(Assert.java:52)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188)
> Results :
> Failed tests:
>   TestCapacityScheduler.testKillAllAppsInQueue:2188 null
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0
> {noformat}






[jira] [Assigned] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly

2016-08-15 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee reassigned YARN-5521:
--

Assignee: sandflee

> TestCapacityScheduler#testKillAllAppsInQueue fails randomly
> ---
>
> Key: YARN-5521
> URL: https://issues.apache.org/jira/browse/YARN-5521
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: sandflee
> Attachments: Failure.txt
>
>
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler)
>   Time elapsed: 0.146 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertTrue(Assert.java:52)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188)
> Results :
> Failed tests:
>   TestCapacityScheduler.testKillAllAppsInQueue:2188 null
> Tests run: 49, Failures: 1, Errors: 0, Skipped: 0
> {noformat}






[jira] [Commented] (YARN-5479) FairScheduler: Scheduling performance improvement

2016-08-15 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420796#comment-15420796
 ] 

sandflee commented on YARN-5479:


will do, thx

> FairScheduler: Scheduling performance improvement
> -
>
> Key: YARN-5479
> URL: https://issues.apache.org/jira/browse/YARN-5479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: He Tianyi
>Assignee: He Tianyi
>
> Currently the ResourceManager uses a single thread to handle async events for 
> scheduling. As the number of nodes grows, more events need to be processed in 
> time in FairScheduler. Also, the increased number of applications & queues 
> slows down the processing of each single event. 
> There are two cases in which slow processing of nodeUpdate events is problematic:
> A. global throughput is lower than the number of nodes per heartbeat round. 
> This keeps resources from being allocated, because of the inefficiency.
> B. global throughput meets the need, but in some of these rounds the events of 
> some nodes cannot get processed before the next heartbeat. This brings 
> inefficiency in handling burst requests (i.e. a newly submitted MapReduce 
> application cannot get all of its tasks launched soon, even given enough 
> resources).
> Pretty sure some people will encounter the problem eventually once a single 
> cluster is scaled to several K nodes (even with {{assignmultiple}} enabled).
> This issue proposes several optimizations to the performance of the 
> FairScheduler {{nodeUpdate}} method. To be specific:
> A. trading off fairness for efficiency, queue & app sorting can be skipped 
> (or should this be called 'delayed sorting'?). We can either start another 
> dedicated thread to do the sorting & updating, or actually perform the sorting 
> only after the current result has been used several times (say, sort once in 
> every 100 calls).
> B. performing calculations on {{Resource}} instances is expensive, since at 
> least 2 objects ({{ResourceImpl}} and its proto builder) are created each time 
> (using 'immutable' APIs). The overhead can be eliminated with a lightweight 
> implementation of Resource which does not instantiate a builder until 
> necessary, because most instances are used as intermediate results in the 
> scheduler instead of being exchanged via IPC. Also, {{createResource}} uses 
> reflection, which can be replaced by a plain {{new}} (for scheduler usage 
> only). Furthermore, perhaps we could 'intern' resources to avoid allocation.
> C. other minor changes: such as moving the {{updateRootMetrics}} call to 
> {{update}}, making root queue metrics eventually consistent (which may satisfy 
> most needs), or introducing counters in {{getResourceUsage}} and changing the 
> resource usage incrementally instead of recalculating it each time.
> With A and B, I was looking at a 4x improvement in a cluster with 2K nodes.
> Suggestions? Opinions?






[jira] [Commented] (YARN-5479) FairScheduler: Scheduling performance improvement

2016-08-13 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419963#comment-15419963
 ] 

sandflee commented on YARN-5479:


it seems there is no need to compute minShare / the needy flags / minShareRatio / 
useToWeightRatio in every comparator#compare call; we could snapshot these before 
doing the real sort.
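
A small self-contained sketch of that idea (toy types, nothing from FairScheduler): snapshot the values the comparator needs once per element and sort on the frozen copies, so compare() does no recomputation and the keys cannot change mid-sort.
{code:title=snapshotting comparator inputs before sorting (illustrative sketch only)}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SnapshotSortSketch {

  /** Stand-in for a schedulable whose usage may be costly to read and may change. */
  static class App {
    final String name;
    volatile long usedMemory;        // mutated by other threads in a real scheduler
    App(String name, long usedMemory) { this.name = name; this.usedMemory = usedMemory; }
  }

  /** Immutable snapshot of the values the comparator needs, taken once per app. */
  static class Snapshot {
    final App app;
    final long usedMemory;
    Snapshot(App app, long usedMemory) { this.app = app; this.usedMemory = usedMemory; }
  }

  public static void main(String[] args) {
    List<App> apps = List.of(new App("a", 4096), new App("b", 1024), new App("c", 2048));

    // 1. snapshot the comparison inputs once per element, before sorting
    List<Snapshot> snapshots = new ArrayList<>();
    for (App app : apps) {
      snapshots.add(new Snapshot(app, app.usedMemory));
    }

    // 2. sort on the frozen values: compare() does no recomputation, and the keys
    //    cannot change mid-sort, so the comparator stays consistent for the whole sort
    snapshots.sort(Comparator.comparingLong(s -> s.usedMemory));

    for (Snapshot s : snapshots) {
      System.out.println(s.app.name + " " + s.usedMemory);   // prints b, c, a
    }
  }
}
{code}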

> FairScheduler: Scheduling performance improvement
> -
>
> Key: YARN-5479
> URL: https://issues.apache.org/jira/browse/YARN-5479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: He Tianyi
>Assignee: He Tianyi
>
> Currently ResourceManager uses a single thread to handle async events for 
> scheduling. As number of nodes grows, more events need to be processed in 
> time in FairScheduler. Also, increased number of applications & queues slows 
> down processing of each single event. 
> There are two cases that slow processing of nodeUpdate events is problematic:
> A. global throughput is lower than number of nodes through heartbeat rounds. 
> This keeps resource from being allocated since the inefficiency.
> B. global throughput meets the need, but for some of these rounds, events of 
> some nodes cannot get processed before next heartbeat. This brings 
> inefficiency handling burst requests (i.e. newly submitted MapReduce 
> application cannot get its all task launched soon given enough resource).
> Pretty sure some people will encounter the problem eventually after a single 
> cluster is scaled to several K of nodes (even with {{assignmultiple}} 
> enabled).
> This issue proposes to perform several optimization towards performance in 
> FairScheduler {{nodeUpdate}} method. To be specific:
> A. trading off fairness with efficiency, queue & app sorting can be skipped 
> (or should this be called 'delayed sorting'?). we can either start another 
> dedicated thread to do the sorting & updating, or actually perform sorting 
> after current result have been used several times (say sort once in every 100 
> calls.)
> B. performing calculation on {{Resource}} instances is expensive, since at 
> least 2 objects ({{ResourceImpl}} and its proto builder) is created each time 
> (using 'immutable' apis). the overhead can be eliminated with a 
> light-weighted implementation of Resource, which do not instantiate a builder 
> until necessary, because most instances are used as intermediate result in 
> scheduler instead of being exchanged via IPC. Also, {{createResource}} is 
> using reflection, which can be replaced by a plain {{new}} (for scheduler 
> usage only). furthermore, perhaps we could 'intern' resource to avoid 
> allocation.
> C. other minor changes: such as move {{updateRootMetrics}} call to 
> {{update}}, making root queue metrics eventual consistent (which may 
> satisfies most of the needs). or introduce counters to {{getResourceUsage}} 
> and make changing of resource incrementally instead of recalculate each time.
> With A and B, I was looking at 4 times improvement in a cluster with 2K nodes.
> Suggestions? Opinions?






[jira] [Commented] (YARN-4743) ResourceManager crash because TimSort

2016-08-12 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419674#comment-15419674
 ] 

sandflee commented on YARN-4743:


bq. I think the root cause is NodeAvailableResourceComparator is not transitive.
agree, but do we really need the comparator to be transitive? Enforcing that may 
reduce performance greatly.
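
For reference, a tiny self-contained example (unrelated to the YARN comparators) of the kind of contract violation TimSort complains about: two pairs compare as equal while the outer pair does not, and java.util sorts are allowed to fail on such comparators.
{code:title=what a contract-violating comparator looks like (toy example)}
import java.util.Comparator;

public class TransitivityDemo {
  public static void main(String[] args) {
    // "treat values that differ by at most 1 as equal": a classic way to
    // violate the Comparator contract
    Comparator<Integer> sloppy =
        (x, y) -> Math.abs(x - y) <= 1 ? 0 : Integer.compare(x, y);

    int a = 0, b = 1, c = 2;
    System.out.println(sloppy.compare(a, b));  // 0  (a "equal to" b)
    System.out.println(sloppy.compare(b, c));  // 0  (b "equal to" c)
    System.out.println(sloppy.compare(a, c));  // -1 (a < c): the contract is broken
    // java.util.Arrays.sort / Collections.sort (TimSort) assume the contract holds;
    // with comparators like this they may throw
    // "Comparison method violates its general contract!" on some inputs.
  }
}
{code}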


> ResourceManager crash because TimSort
> -
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Zephyr Guo
>Assignee: Yufei Gu
> Attachments: YARN-4743-cdh5.4.7.patch
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>  at java.util.TimSort.mergeHi(TimSort.java:868)
>  at java.util.TimSort.mergeAt(TimSort.java:485)
>  at java.util.TimSort.mergeCollapse(TimSort.java:410)
>  at java.util.TimSort.sort(TimSort.java:214)
>  at java.util.TimSort.sort(TimSort.java:173)
>  at java.util.Arrays.sort(Arrays.java:659)
>  at java.util.Collections.sort(Collections.java:217)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>  at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify the {{Resource}} while we are sorting 
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator comparator = policy.getComparator();
> writeLock.lock();
> try {
>   Collections.sort(runnableApps, comparator);
> } finally {
>   writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
>   ..
>   boolean s1Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>       s1.getResourceUsage(), minShare1);
>   boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>       s2.getResourceUsage(), minShare2);
>   minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>       / Resources.max(RESOURCE_CALCULATOR, null, minShare1, ONE).getMemory();
>   minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>       / Resources.max(RESOURCE_CALCULATOR, null, minShare2, ONE).getMemory();
>   ..
> {code}
> {{getResourceUsage}} will return current Resource. The current Resource is 
> unstable. 
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
> // Here the getPreemptedResources() always return zero, except in
> // a preemption round
> return Resources.subtract(getCurrentConsumption(), 
> getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
> return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ..
> Resources.addTo(currentConsumption, rmContainer.getContainer()
>   .getResource());
> ..
>   }
> {code}
> I suggest using a stable Resource in the comparator.
> Is there something wrong with my reasoning?






[jira] [Commented] (YARN-5393) [Umbrella] Optimize YARN tests runtime

2016-08-12 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419643#comment-15419643
 ] 

sandflee commented on YARN-5393:


YARN-5375 aims to make waitForState drain all events, to reduce test failures; 
also, once all events are really drained there is no need to sleep any more.

> [Umbrella] Optimize YARN tests runtime 
> ---
>
> Key: YARN-5393
> URL: https://issues.apache.org/jira/browse/YARN-5393
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Vinod Kumar Vavilapalli
>
> When I originally merged MAPREDUCE-279 into Hadoop, *all* of YARN tests used 
> to take 10 mins with pretty good coverage.
> Now TestRMRestart alone takes that much time - we haven't been that great at 
> writing pointed, short tests.
> Time for an initiative to optimize YARN tests. And even after that, if it 
> takes too long, we go the MAPREDUCE-670 route.






[jira] [Commented] (YARN-5483) Optimize RMAppAttempt#pullJustFinishedContainers

2016-08-10 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416202#comment-15416202
 ] 

sandflee commented on YARN-5483:


thanks [~jlowe] [~templedf]  [~jianhe] and [~rohithsharma] for the review and 
commit!

> Optimize RMAppAttempt#pullJustFinishedContainers
> 
>
> Key: YARN-5483
> URL: https://issues.apache.org/jira/browse/YARN-5483
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.6.0
>Reporter: sandflee
>Assignee: sandflee
> Fix For: 2.8.0, 2.6.5, 2.7.4
>
> Attachments: YARN-5483-branch-2.6.patch, 
> YARN-5483-branch-2.6.patch.02, YARN-5483-branch-2.7.patch, 
> YARN-5483-branch-2.7.patch.02, YARN-5483.01.patch, YARN-5483.02.patch, 
> YARN-5483.03.patch, YARN-5483.04.patch, jprofiler-cpu.png
>
>
> about 1000 apps were running on the cluster; JProfiler found that 
> pullJustFinishedContainers costs too much CPU.






[jira] [Updated] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource

2016-08-09 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-5453:
---
Attachment: YARN-5453.03.patch

> FairScheduler#update may skip update demand resource of child queue/app if 
> current demand reached maxResource
> -
>
> Key: YARN-5453
> URL: https://issues.apache.org/jira/browse/YARN-5453
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5453.01.patch, YARN-5453.02.patch, 
> YARN-5453.03.patch
>
>
> {code}
>   demand = Resources.createResource(0);
>   for (FSQueue childQueue : childQueues) {
>     childQueue.updateDemand();
>     Resource toAdd = childQueue.getDemand();
>     demand = Resources.add(demand, toAdd);
>     demand = Resources.componentwiseMin(demand, maxRes);
>     if (Resources.equals(demand, maxRes)) {
>       break;
>     }
>   }
> {code}
> if a single queue's demand resource exceeds maxRes, the demand of the remaining 
> queues will not be updated.






[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource

2016-08-09 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414452#comment-15414452
 ] 

sandflee commented on YARN-5453:


agree, updated the patch as you suggested.
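
For context, the early {{break}} in the loop quoted below stops calling updateDemand() on the remaining children once the running total reaches maxRes. One possible shape of the fix (a sketch only, not necessarily what the attached patch does) is to keep updating every child and cap only the aggregate:
{code:title=one possible shape of the fix (sketch only)}
demand = Resources.createResource(0);
for (FSQueue childQueue : childQueues) {
  // always let every child refresh its own demand, even after the cap is hit,
  // so no child queue or app is left with a stale demand value
  childQueue.updateDemand();
  Resource toAdd = childQueue.getDemand();
  demand = Resources.add(demand, toAdd);
}
// only the aggregate reported by this queue is capped
demand = Resources.componentwiseMin(demand, maxRes);
{code}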

> FairScheduler#update may skip update demand resource of child queue/app if 
> current demand reached maxResource
> -
>
> Key: YARN-5453
> URL: https://issues.apache.org/jira/browse/YARN-5453
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-5453.01.patch, YARN-5453.02.patch
>
>
> {code}
>   demand = Resources.createResource(0);
>   for (FSQueue childQueue : childQueues) {
>     childQueue.updateDemand();
>     Resource toAdd = childQueue.getDemand();
>     demand = Resources.add(demand, toAdd);
>     demand = Resources.componentwiseMin(demand, maxRes);
>     if (Resources.equals(demand, maxRes)) {
>       break;
>     }
>   }
> {code}
> if a single queue's demand resource exceeds maxRes, the demand of the remaining 
> queues will not be updated.





