[jira] [Commented] (YARN-9594) Unknown event arrived at ContainerScheduler: EventType: RECOVERY_COMPLETED

2019-06-10 Thread lujie (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860583#comment-16860583
 ] 

lujie commented on YARN-9594:
-

ping->

> Unknown event arrived at ContainerScheduler: EventType: RECOVERY_COMPLETED
> --
>
> Key: YARN-9594
> URL: https://issues.apache.org/jira/browse/YARN-9594
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Major
> Attachments: YARN-9594_1.patch
>
>
> It seems that we are missing a break in this switch-case:
> {code:java}
> case RECOVERY_COMPLETED:
>   startPendingContainers(maxOppQueueLength <= 0);
>   metrics.setQueuedContainers(queuedOpportunisticContainers.size(),
>  queuedGuaranteedContainers.size());
> //break;missed
> default:
>   LOG.error("Unknown event arrived at ContainerScheduler: "
> + event.toString());
> {code}
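The fix should just be to terminate the case before it falls through to the default
branch. A minimal sketch of the corrected fragment (the attached YARN-9594_1.patch may
differ in details):
{code:java}
case RECOVERY_COMPLETED:
  startPendingContainers(maxOppQueueLength <= 0);
  metrics.setQueuedContainers(queuedOpportunisticContainers.size(),
      queuedGuaranteedContainers.size());
  break;  // stop here instead of falling through to the error branch
default:
  LOG.error("Unknown event arrived at ContainerScheduler: "
      + event.toString());
{code}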






[jira] [Commented] (YARN-9612) Support using ip to register NodeID

2019-06-10 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860561#comment-16860561
 ] 

zhoukang commented on YARN-9612:


Thanks [~tangzhankun]. I think using the service name will make maintenance more 
difficult.

> Support using ip to register NodeID
> ---
>
> Key: YARN-9612
> URL: https://issues.apache.org/jira/browse/YARN-9612
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Priority: Major
>
> In an environment like Kubernetes, we should support using the IP when registering 
> the NodeID with the RM, since the hostname will be the pod name, which cannot be 
> resolved by the Kubernetes DNS.






[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860540#comment-16860540
 ] 

Tao Yang commented on YARN-9598:


Thanks [~cheersyang] for the response. 
{quote}
How can we make sure a big container request does not get starved in such a case? 
Maybe a way to improve this is to swap the reserved container on NMs
{quote}
I think an improved preemption policy should take on this responsibility.
Considering that there is still some dispute about disabling re-reservation when 
multi-node placement is enabled, perhaps we can assume for now that the harm of 
re-reservation can be ignored and that the problem can be solved by an improved 
node-sorting policy? I will remove the related changes from the patch if there are 
no objections.

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservation when multi-node placement is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look up the following 
> candidates for this app when multi-node placement is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved container 
> in LeafQueue#allocateFromReservedContainer. There are two problems here:
>  ** The node of the reserved container should be taken as the candidate instead of 
> all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it is just skipped, which breaks the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> an allocation or reservation proposal on these nodes, which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.






[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-10 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860499#comment-16860499
 ] 

Abhishek Modi commented on YARN-9608:
-

[~wjlei] With this jira we are also tracking applications whose containers ran 
on the node before it was put into DECOMMISSIONING state. Previously, the node 
life-cycle also depended on application run time, but it only considered 
applications whose containers were running on the node at the moment it was 
moved to DECOMMISSIONING state.

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of nodes when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to DECOMMISSIONING state.
>  # It keeps its own track of running apps. We can get this directly from 
> RMNode.






[jira] [Comment Edited] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Juanjuan Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859740#comment-16859740
 ] 

Juanjuan Tian  edited comment on YARN-9598 at 6/11/19 1:58 AM:
---

Hi Tao,
{noformat}
disable re-reservation can only make the scheduler skip reserving the same 
container repeatedly and try to allocate on other nodes, it won't affect normal 
scheduling for this app and later apps. Thoughts?{noformat}
For example, there are 10 nodes (h1, h2, ..., h9, h10) in the cluster, each with 8G 
of memory, and two queues A and B, each configured with 50% capacity.

Firstly, 10 jobs (each requesting 6G of resources) are submitted to queue A, and 
each of the 10 nodes gets one container allocated.

Afterwards, another job JobB requesting 3G of resources is submitted to queue B, 
and one 3G container is reserved on node h1. If we disable re-reservation in this 
case, then even though the scheduler can look at other nodes, since 
shouldAllocOrReserveNewContainer is false there will still be no other 
reservations, and JobB will still get stuck. 
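The arithmetic behind this example, spelled out:
{noformat}
Free space per node after the 6G allocation: 8G - 6G = 2G
JobB's request:                              3G > 2G on every node
=> no node can satisfy JobB outright, so its container can only be reserved
   (here on h1); with shouldAllocOrReserveNewContainer == false no reservation
   is attempted on any other node, and JobB stays stuck.
{noformat}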


was (Author: jutia):
Hi Tao,
{noformat}
disable re-reservation can only make the scheduler skip reserving the same 
container repeatedly and try to allocate on other nodes, it won't affect normal 
scheduling for this app and later apps. Thoughts?{noformat}
for example, there are 10 nodes(h1,h2,...h9,h10), each has 8G memory in 
cluster, and two queues A,B, each is configured with 50% capacity.

firstly there are 10 jobs (each requests 6G respurce) is submited to queue A, 
and each node of the 10 nodes will have a contianer allocated.

Afterwards,  another job JobB which requests 3G resource is submited to queue 
B, and there will be one container with 3G size reserved on node h1, if we 
disable re-reservation, in this case, even scheduler can look up other nodes, 
since the shouldAllocOrReserveNewContainer is false, there is still on other 
reservations, and JobB will still get stuck. 

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems about reservation when multi-node enabled:
>  # As discussed in YARN-9576, re-reservation proposal may be always generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation in unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let scheduler have a chance to look up follow candidates 
> for this app when multi-node enabled.
>  # Scheduler iterates all nodes and try to allocate for reserved container in 
> LeafQueue#allocateFromReservedContainer. Here there are two problems:
>  ** The node of reserved container should be taken as candidates instead of 
> all nodes when calling FiCaSchedulerApp#assignContainers, otherwise later 
> scheduler may generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** Assignment returned by FiCaSchedulerApp#assignContainers could never be 
> null even if it's just skipped, it will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise scheduler may generate 
> allocation or reservation proposal on these node which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.






[jira] [Comment Edited] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-10 Thread jialei weng (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860470#comment-16860470
 ] 

jialei weng edited comment on YARN-9608 at 6/11/19 1:58 AM:


This solution provides an idea to extend the life-cycle of node-local data to the 
whole application running time. A small question here: if the application is a 
long-running job, will the node decommission also take longer and rely on the 
time-out? [~abmodi] Please correct me if I misunderstand.


was (Author: wjlei):
{color:#33}This solution provides an idea to extend life-cycle of 
{color:#33}node local data to the whole application running time. A small 
question here, if the application is long running job, the node decommission 
time will also take longer? And rely on the time-out? Please correct me if I 
misunderstand.{color}{color}

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of nodes when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to DECOMMISSIONING state.
>  # It keeps its own track of running apps. We can get this directly from 
> RMNode.






[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-10 Thread jialei weng (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860470#comment-16860470
 ] 

jialei weng commented on YARN-9608:
---

This solution provides an idea to extend the life-cycle of node-local data to the 
whole application running time. A small question here: if the application is a 
long-running job, will the node decommission also take longer and rely on the 
time-out? Please correct me if I misunderstand.

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of nodes when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to DECOMMISSIONING state.
>  # It keeps its own track of running apps. We can get this directly from 
> RMNode.






[jira] [Commented] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860417#comment-16860417
 ] 

zhenzhao wang commented on YARN-9616:
-

I have seen this issue in 2.9 and 2.6. More checking is needed to determine whether 
the problem exists in the latest version.

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 2.9.2, 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> YARN will unpack archive files and some other files based on the file type 
> and configuration. E.g. 
>  If I started an MR job with -archive one.zip, then one.zip will be 
> unpacked during download. Let's say there are file1 and file2 inside one.zip. 
> The files kept on the local disk will then be 
> /disk3/yarn/local/filecache/352/one.zip/file1 
> and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache 
> uploader cannot upload one.zip to the shared cache, as it was removed during 
> localization. The following error is thrown:
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-9616:

Affects Version/s: 2.8.3
   2.9.2

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 2.9.2, 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> YARN will unpack archive files and some other files based on the file type 
> and configuration. E.g. 
>  If I started an MR job with -archive one.zip, then one.zip will be 
> unpacked during download. Let's say there are file1 and file2 inside one.zip. 
> The files kept on the local disk will then be 
> /disk3/yarn/local/filecache/352/one.zip/file1 
> and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache 
> uploader cannot upload one.zip to the shared cache, as it was removed during 
> localization. The following error is thrown:
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-9616:

Affects Version/s: 2.8.5

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> YARN will unpack archive files and some other files based on the file type 
> and configuration. E.g. 
>  If I started an MR job with -archive one.zip, then one.zip will be 
> unpacked during download. Let's say there are file1 and file2 inside one.zip. 
> The files kept on the local disk will then be 
> /disk3/yarn/local/filecache/352/one.zip/file1 
> and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache 
> uploader cannot upload one.zip to the shared cache, as it was removed during 
> localization. The following error is thrown:
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Created] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)
zhenzhao wang created YARN-9616:
---

 Summary: Shared Cache Manager Failed To Upload Unpacked Resources
 Key: YARN-9616
 URL: https://issues.apache.org/jira/browse/YARN-9616
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhenzhao wang
Assignee: zhenzhao wang


YARN will unpack archive files and some other files based on the file type and 
configuration. E.g. 
 If I started an MR job with -archive one.zip, then one.zip will be 
unpacked during download. Let's say there are file1 and file2 inside one.zip. The 
files kept on the local disk will then be 
/disk3/yarn/local/filecache/352/one.zip/file1 
and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader 
cannot upload one.zip to the shared cache, as it was removed during localization. 
The following error is thrown:

{code:java}
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
 Exception while uploading the file dict.zip
java.io.FileNotFoundException: File 
/disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

{code}
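For illustration only, a small standalone check that mirrors the failure described 
above (the class name is hypothetical and the paths are the example paths from this 
description; after localization only the unpacked directory remains, so opening 
one.zip/one.zip fails):
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UnpackedResourceCheck {
  public static void main(String[] args) throws IOException {
    FileSystem localFs = FileSystem.getLocal(new Configuration());
    // After localization the archive itself is gone; only the unpacked directory remains.
    Path localized = new Path("/disk3/yarn/local/filecache/352/one.zip");
    System.out.println("localized resource is a directory: "
        + localFs.getFileStatus(localized).isDirectory());
    // The uploader then tries to open <localized dir>/one.zip, which no longer exists,
    // producing the FileNotFoundException shown in the log above.
    localFs.open(new Path(localized, "one.zip"));
  }
}
{code}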








[jira] [Commented] (YARN-9593) Updating scheduler conf with comma in config value fails

2019-06-10 Thread Anthony Hsu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860396#comment-16860396
 ] 

Anthony Hsu commented on YARN-9593:
---

Thanks, [~jhung]. I don't have the bandwidth at the moment, but I'm glad we've agreed 
on the approach. I think this would be a good starter task.

> Updating scheduler conf with comma in config value fails
> 
>
> Key: YARN-9593
> URL: https://issues.apache.org/jira/browse/YARN-9593
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0, 3.2.0, 3.1.2
>Reporter: Anthony Hsu
>Priority: Major
>
> For example:
> {code:java}
> $ yarn schedulerconf -update "root.gridops:acl_administer_queue=user1,user2 
> group1,group2"
> Specify configuration key value as confKey=confVal.{code}
> This fails because there is a comma in the config value and the SchedConfCLI 
> splits on commas first, expecting each split to be a k=v pair.
> {noformat}
> void globalUpdates(String args, SchedConfUpdateInfo updateInfo) {
>   if (args == null) {
>     return;
>   }
>   HashMap<String, String> globalUpdates = new HashMap<>();
>   for (String globalUpdate : args.split(",")) {
>     putKeyValuePair(globalUpdates, globalUpdate);
>   }
>   updateInfo.setGlobalParams(globalUpdates);
> }{noformat}
> Cc: [~jhung]
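For illustration only (the thread above agrees on an approach but does not spell it 
out here): one way to tolerate commas inside values is to let callers escape them as 
"\," and split only on unescaped commas. The class and method names below are 
hypothetical, not part of SchedConfCLI:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class GlobalUpdateParser {
  static Map<String, String> parse(String args) {
    Map<String, String> updates = new HashMap<>();
    if (args == null) {
      return updates;
    }
    // Split on commas that are not preceded by a backslash.
    for (String pair : args.split("(?<!\\\\),")) {
      String kv = pair.replace("\\,", ",");  // un-escape commas inside the value
      int idx = kv.indexOf('=');
      if (idx <= 0) {
        throw new IllegalArgumentException(
            "Specify configuration key value as confKey=confVal.");
      }
      updates.put(kv.substring(0, idx), kv.substring(idx + 1));
    }
    return updates;
  }

  public static void main(String[] unused) {
    // Prints {root.gridops:acl_administer_queue=user1,user2 group1,group2}
    System.out.println(parse(
        "root.gridops:acl_administer_queue=user1\\,user2 group1\\,group2"));
  }
}
{code}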






[jira] [Commented] (YARN-9569) Auto-created leaf queues do not honor cluster-wide min/max memory/vcores

2019-06-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860372#comment-16860372
 ] 

Hudson commented on YARN-9569:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16714 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16714/])
YARN-9569. Auto-created leaf queues do not honor cluster-wide min/max (sumasai: 
rev 9191e08f0ad4ebc2a3b776c4cc71d0fc5c053beb)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerAutoCreatedQueueBase.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerAutoQueueCreation.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractManagedParentQueue.java


> Auto-created leaf queues do not honor cluster-wide min/max memory/vcores
> 
>
> Key: YARN-9569
> URL: https://issues.apache.org/jira/browse/YARN-9569
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 3.2.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
> Attachments: YARN-9569.001.patch, YARN-9569.002.patch
>
>
> Auto-created leaf queues do not honor cluster-wide settings for maximum 
> CPU/vcores allocation.
> To reproduce:
>  # Set auto-create-child-queue.enabled=true for a parent queue.
>  # Set leaf-queue-template.maximum-allocation-mb=16384.
>  # Set yarn.resource-types.memory-mb.maximum-allocation=16384 in 
> resource-types.xml
>  # Launch a YARN app with a container requesting 16 GB RAM.
>  
> This scenario should work, but instead you get an error similar to this:
> {code:java}
> java.lang.IllegalArgumentException: Queue maximum allocation cannot be larger 
> than the cluster setting for queue root.auto.test max allocation per queue: 
>  cluster setting:    {code}
>  
> This seems to be caused by this code in 
> ManagedParentQueue.getLeafQueueConfigs:
> {code:java}
> CapacitySchedulerConfiguration leafQueueConfigTemplate = new
> CapacitySchedulerConfiguration(new Configuration(false), false);{code}
>  
> This initializes a new leaf queue configuration that does not read 
> resource-types.xml (or any other config). Later, this 
> CapacitySchedulerConfiguration instance calls 
> ResourceUtils.fetchMaximumAllocationFromConfig()  from its 
> getMaximumAllocationPerQueue() method and passes itself as the configuration 
> to use. Since the resource types are not present, ResourceUtils falls back to 
> compiled-in defaults of 8GB RAM, 4 cores.
>  
> I was able to work around this with a custom AutoCreatedQueueManagementPolicy 
> implementation which does something like this in init() and reinitialize():
> {code:java}
> for (Map.Entry<String, String> entry : this.scheduler.getConfiguration()) {
> if (entry.getKey().startsWith("yarn.resource-types")) {
>   parentQueue.getLeafQueueTemplate().getLeafQueueConfigs()
> .set(entry.getKey(), entry.getValue());
>   }
> }
> {code}
> However, this is obviously a very hacky way to solve the problem.
> I can submit a proper patch if someone can provide some direction as to the 
> best way to proceed.
>  






[jira] [Commented] (YARN-9569) Auto-created leaf queues do not honor cluster-wide min/max memory/vcores

2019-06-10 Thread Suma Shivaprasad (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860349#comment-16860349
 ] 

Suma Shivaprasad commented on YARN-9569:


Committed to trunk. Thanks [~ccondit]

> Auto-created leaf queues do not honor cluster-wide min/max memory/vcores
> 
>
> Key: YARN-9569
> URL: https://issues.apache.org/jira/browse/YARN-9569
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 3.2.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
> Attachments: YARN-9569.001.patch, YARN-9569.002.patch
>
>
> Auto-created leaf queues do not honor cluster-wide settings for maximum 
> CPU/vcores allocation.
> To reproduce:
>  # Set auto-create-child-queue.enabled=true for a parent queue.
>  # Set leaf-queue-template.maximum-allocation-mb=16384.
>  # Set yarn.resource-types.memory-mb.maximum-allocation=16384 in 
> resource-types.xml
>  # Launch a YARN app with a container requesting 16 GB RAM.
>  
> This scenario should work, but instead you get an error similar to this:
> {code:java}
> java.lang.IllegalArgumentException: Queue maximum allocation cannot be larger 
> than the cluster setting for queue root.auto.test max allocation per queue: 
>  cluster setting:    {code}
>  
> This seems to be caused by this code in 
> ManagedParentQueue.getLeafQueueConfigs:
> {code:java}
> CapacitySchedulerConfiguration leafQueueConfigTemplate = new
> CapacitySchedulerConfiguration(new Configuration(false), false);{code}
>  
> This initializes a new leaf queue configuration that does not read 
> resource-types.xml (or any other config). Later, this 
> CapacitySchedulerConfiguration instance calls 
> ResourceUtils.fetchMaximumAllocationFromConfig()  from its 
> getMaximumAllocationPerQueue() method and passes itself as the configuration 
> to use. Since the resource types are not present, ResourceUtils falls back to 
> compiled-in defaults of 8GB RAM, 4 cores.
>  
> I was able to work around this with a custom AutoCreatedQueueManagementPolicy 
> implementation which does something like this in init() and reinitialize():
> {code:java}
> for (Map.Entry<String, String> entry : this.scheduler.getConfiguration()) {
> if (entry.getKey().startsWith("yarn.resource-types")) {
>   parentQueue.getLeafQueueTemplate().getLeafQueueConfigs()
> .set(entry.getKey(), entry.getValue());
>   }
> }
> {code}
> However, this is obviously a very hacky way to solve the problem.
> I can submit a proper patch if someone can provide some direction as to the 
> best way to proceed.
>  






[jira] [Commented] (YARN-9613) Avoid remote lookups for RegistryDNS domain

2019-06-10 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860316#comment-16860316
 ] 

Eric Yang commented on YARN-9613:
-

[~billie.rinaldi] If RegistryDNS is designated as the authoritative DNS server for 
a domain, then RegistryDNS doesn't need to perform a forward lookup for records in 
the Hadoop domain. I think we can introduce hadoop.registry.dns.soa.lookup=false. 
If this option is set to true, RegistryDNS will perform an upstream lookup for 
queries within hadoop.registry.dns.domain-name.
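A rough sketch of how such a flag could gate forwarding. The property names come from 
the comment above (hadoop.registry.dns.soa.lookup is only a proposal at this point); 
the class and method names are hypothetical:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class RegistryDnsLookupPolicy {
  /** Returns true if the query should be forwarded to upstream DNS servers. */
  static boolean shouldForwardUpstream(Configuration conf, String queryName) {
    String domain = conf.get("hadoop.registry.dns.domain-name", "");
    boolean soaLookup = conf.getBoolean("hadoop.registry.dns.soa.lookup", false);
    boolean inRegistryDomain = !domain.isEmpty()
        && queryName.toLowerCase().endsWith("." + domain.toLowerCase());
    // If we are authoritative for the domain and the proposed soa.lookup flag is off,
    // never recurse upstream for names inside that domain.
    return !(inRegistryDomain && !soaLookup);
  }
}
{code}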

> Avoid remote lookups for RegistryDNS domain
> ---
>
> Key: YARN-9613
> URL: https://issues.apache.org/jira/browse/YARN-9613
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: Billie Rinaldi
>Priority: Major
>
> A typical setup for RegistryDNS is for an upstream DNS server to forward DNS 
> queries matching the hadoop.registry.dns.domain-name to RegistryDNS. If the 
> RegistryDNS lookup gets a non-zero DNS RCODE, RegistryDNS performs a remote 
> lookup in upstream DNS servers. For bad queries, this can result in a loop 
> when the upstream DNS server forwards the query back to RegistryDNS.
> To solve this problem, we should avoid performing remote lookups for queries 
> within hadoop.registry.dns.domain-name, which are expected to be handled by 
> RegistryDNS. We may also want to evaluate whether we should add a 
> configuration property that allows the user to disable remote lookups 
> entirely for RegistryDNS, for installations where RegistryDNS is set up as 
> the last DNS server in a chain of DNS servers.






[jira] [Commented] (YARN-9471) Cleanup in TestLogAggregationIndexFileController

2019-06-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860269#comment-16860269
 ] 

Hudson commented on YARN-9471:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16711 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16711/])
YARN-9471. Cleanup in TestLogAggregationIndexFileController. Contributed 
(weichiu: rev e94e6435842c5b9dc0f5fe681e0829d33dd5b24e)
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/TestLogAggregationIndexFileController.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/TestLogAggregationIndexedFileController.java


> Cleanup in TestLogAggregationIndexFileController
> 
>
> Key: YARN-9471
> URL: https://issues.apache.org/jira/browse/YARN-9471
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9471.001.patch, YARN-9471.002.patch
>
>
> {{TestLogAggregationIndexFileController}} class can be cleaned up a bit:
> - bad javadoc link
> - should be renamed to TestLogAggregationIndex *ed* FileController
> - some private class members can be removed
> - static fields from Assert can be imported
> - {{StringBuilder}} can be removed from {{logMessage}}






[jira] [Created] (YARN-9615) Add dispatcher metrics to RM

2019-06-10 Thread Jonathan Hung (JIRA)
Jonathan Hung created YARN-9615:
---

 Summary: Add dispatcher metrics to RM
 Key: YARN-9615
 URL: https://issues.apache.org/jira/browse/YARN-9615
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Jonathan Hung
Assignee: Jonathan Hung


It'd be good to have counts/processing times for each event type in RM async 
dispatcher and scheduler async dispatcher.
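A minimal sketch of the kind of per-event-type bookkeeping this asks for (illustrative 
only; the class and method names are hypothetical and not the eventual patch):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class DispatcherEventMetrics {
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

  /** Record one handled event of the given type and how long it took to process. */
  public void record(Enum<?> eventType, long processingNanos) {
    String key = eventType.getDeclaringClass().getSimpleName() + "." + eventType.name();
    counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    totalNanos.computeIfAbsent(key, k -> new LongAdder()).add(processingNanos);
  }

  public long getCount(Enum<?> eventType) {
    String key = eventType.getDeclaringClass().getSimpleName() + "." + eventType.name();
    LongAdder c = counts.get(key);
    return c == null ? 0 : c.sum();
  }
}
{code}
The dispatcher's event loop would call record() around each handler invocation, and 
the values could then be exposed through the metrics system.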






[jira] [Commented] (YARN-9593) Updating scheduler conf with comma in config value fails

2019-06-10 Thread Jonathan Hung (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860209#comment-16860209
 ] 

Jonathan Hung commented on YARN-9593:
-

Yeah [~erwaman], that seems reasonable. Are you interested in taking this up?

> Updating scheduler conf with comma in config value fails
> 
>
> Key: YARN-9593
> URL: https://issues.apache.org/jira/browse/YARN-9593
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0, 3.2.0, 3.1.2
>Reporter: Anthony Hsu
>Priority: Major
>
> For example:
> {code:java}
> $ yarn schedulerconf -update "root.gridops:acl_administer_queue=user1,user2 
> group1,group2"
> Specify configuration key value as confKey=confVal.{code}
> This fails because there is a comma in the config value and the SchedConfCLI 
> splits on commas first, expecting each split to be a k=v pair.
> {noformat}
> void globalUpdates(String args, SchedConfUpdateInfo updateInfo) {
>   if (args == null) {
>     return;
>   }
>   HashMap<String, String> globalUpdates = new HashMap<>();
>   for (String globalUpdate : args.split(",")) {
>     putKeyValuePair(globalUpdates, globalUpdate);
>   }
>   updateInfo.setGlobalParams(globalUpdates);
> }{noformat}
> Cc: [~jhung]






[jira] [Created] (YARN-9614) Support configurable container hostname formats for YARN services

2019-06-10 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created YARN-9614:


 Summary: Support configurable container hostname formats for YARN 
services
 Key: YARN-9614
 URL: https://issues.apache.org/jira/browse/YARN-9614
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Billie Rinaldi


The hostname format used by YARN services is currently 
instance.service.user.domain. We could allow this hostname format to be 
configurable (with some restrictions).






[jira] [Created] (YARN-9613) Avoid remote lookups for RegistryDNS domain

2019-06-10 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created YARN-9613:


 Summary: Avoid remote lookups for RegistryDNS domain
 Key: YARN-9613
 URL: https://issues.apache.org/jira/browse/YARN-9613
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: Billie Rinaldi


A typical setup for RegistryDNS is for an upstream DNS server to forward DNS 
queries matching the hadoop.registry.dns.domain-name to RegistryDNS. If the 
RegistryDNS lookup gets a non-zero DNS RCODE, RegistryDNS performs a remote 
lookup in upstream DNS servers. For bad queries, this can result in a loop when 
the upstream DNS server forwards the query back to RegistryDNS.

To solve this problem, we should avoid performing remote lookups for queries 
within hadoop.registry.dns.domain-name, which are expected to be handled by 
RegistryDNS. We may also want to evaluate whether we should add a configuration 
property that allows the user to disable remote lookups entirely for 
RegistryDNS, for installations where RegistryDNS is set up as the last DNS 
server in a chain of DNS servers.






[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860157#comment-16860157
 ] 

Weiwei Yang commented on YARN-9598:
---

Thanks for bringing this up and for the discussions. It looks like the discussion 
has diverged somewhat, so let's make sure we understand the problem we want to 
resolve here.

If I understand correctly, [~jutia] was observing the issue that re-reservations 
are made on a single node because the policy always returns the same order. 
Actually, this is not the only issue: this policy may also create a hot-spot node 
when multiple threads place allocations on the same ordered nodes. I think we need 
to improve the policy; one possible solution, as I previously commented, is to 
shuffle nodes per score-range. BTW, [~jutia], are you already using this policy in 
your cluster?

The issue [~Tao Yang] raised is also valid: re-reservations caused by a lot of 
small asks across lots of nodes (when the cluster is busy) will cause big requests 
to starve. This issue should be reproducible with SLS. I did a quick look at the 
patch [~Tao Yang] uploaded, but I also have a concern about disabling 
re-reservation. How can we make sure a big container request does not get starved 
in such a case? Maybe a way to improve this is to swap the reserved container on 
NMs, e.g., if a container is already reserved somewhere else, we can swap that spot 
with another, bigger container that has no reservation yet. Just a random thought.

 

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservation when multi-node placement is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look up the following 
> candidates for this app when multi-node placement is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved container 
> in LeafQueue#allocateFromReservedContainer. There are two problems here:
>  ** The node of the reserved container should be taken as the candidate instead of 
> all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it is just skipped, which breaks the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> an allocation or reservation proposal on these nodes, which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.






[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA

2019-06-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859968#comment-16859968
 ] 

Hadoop QA commented on YARN-9605:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
27s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 31s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
58s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
26s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
32s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  4m 
37s{color} | {color:red} root in the patch failed. {color} |
| {color:red}-1{color} | {color:red} cc {color} | {color:red}  4m 37s{color} | 
{color:red} root in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  4m 37s{color} 
| {color:red} root in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 58s{color} | {color:orange} root: The patch generated 22 new + 22 unchanged 
- 0 fixed = 44 total (was 22) {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
35s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  2m 
18s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
34s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  8m  
8s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
47s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 36s{color} 
| {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 83m  
0s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}179m  6s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9605 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12971310/YARN-9605.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  

[jira] [Commented] (YARN-9471) Cleanup in TestLogAggregationIndexFileController

2019-06-10 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859965#comment-16859965
 ] 

Szilard Nemeth commented on YARN-9471:
--

Hi [~adam.antal]!
Thanks for this patch! I really like these kinds of test-cleanup refactors.
+1 (non-binding) for the latest patch!

> Cleanup in TestLogAggregationIndexFileController
> 
>
> Key: YARN-9471
> URL: https://issues.apache.org/jira/browse/YARN-9471
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9471.001.patch, YARN-9471.002.patch
>
>
> {{TestLogAggregationIndexFileController}} class can be cleaned up a bit:
> - bad javadoc link
> - should be renamed to TestLogAggregationIndex *ed* FileController
> - some private class members can be removed
> - static fields from Assert can be imported
> - {{StringBuilder}} can be removed from {{logMessage}}






[jira] [Commented] (YARN-8499) ATS v2 Generic TimelineStorageMonitor

2019-06-10 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859957#comment-16859957
 ] 

Szilard Nemeth commented on YARN-8499:
--

Hi [~Prabhu Joseph]!
Checked your changes with the 012 patch.
+1 (non-binding) for the latest patch! Thanks!

> ATS v2 Generic TimelineStorageMonitor
> -
>
> Key: YARN-8499
> URL: https://issues.apache.org/jira/browse/YARN-8499
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Sunil Govindan
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: atsv2
> Attachments: YARN-8499-001.patch, YARN-8499-002.patch, 
> YARN-8499-003.patch, YARN-8499-004.patch, YARN-8499-005.patch, 
> YARN-8499-006.patch, YARN-8499-007.patch, YARN-8499-008.patch, 
> YARN-8499-009.patch, YARN-8499-010.patch, YARN-8499-011.patch, 
> YARN-8499-012.patch
>
>
> Post YARN-8302, HBase connection issues are handled in ATSv2. However, this 
> could be made generic by introducing an API in the storage interface and 
> implementing it in each storage backend as per the store semantics.
>  
> cc [~rohithsharma] [~vinodkv] [~vrushalic]






[jira] [Commented] (YARN-9537) Add configuration to disable AM preemption

2019-06-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859934#comment-16859934
 ] 

Hadoop QA commented on YARN-9537:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 21s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  1s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 83m 41s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}133m 53s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
|   | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppAttempt |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9537 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12971304/YARN-9537.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 01e6f9598133 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / fcfe7a3 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/24253/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 

[jira] [Updated] (YARN-9611) ApplicationHistoryServer related testcases failing

2019-06-10 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-9611:

Component/s: test

> ApplicationHistoryServer related testcases failing
> --
>
> Key: YARN-9611
> URL: https://issues.apache.org/jira/browse/YARN-9611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7217-001.patch, YARN-9611-001.patch
>
>
> *TestMRTimelineEventHandling.testMRTimelineEventHandling fails.*
> {code:java}
> ERROR] 
> testMRTimelineEventHandling(org.apache.hadoop.mapred.TestMRTimelineEventHandling)
>   Time elapsed: 46.337 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[AM_STAR]TED> but was:<[JOB_SUBMIT]TED>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling(TestMRTimelineEventHandling.java:147)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> *TestJobHistoryEventHandler.testTimelineEventHandling* 
> {code}
> [ERROR] 
> testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler)
>   Time elapsed: 5.858 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:597)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> 

[jira] [Commented] (YARN-9611) ApplicationHistoryServer related testcases failing

2019-06-10 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859929#comment-16859929
 ] 

Prabhu Joseph commented on YARN-9611:
-

[~eyang] Can you review this Jira when you get time? It fixes the failing 
testcases related to ApplicationHistoryServer after HADOOP-16314. 

> ApplicationHistoryServer related testcases failing
> --
>
> Key: YARN-9611
> URL: https://issues.apache.org/jira/browse/YARN-9611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7217-001.patch, YARN-9611-001.patch
>
>
> *TestMRTimelineEventHandling.testMRTimelineEventHandling fails.*
> {code:java}
> ERROR] 
> testMRTimelineEventHandling(org.apache.hadoop.mapred.TestMRTimelineEventHandling)
>   Time elapsed: 46.337 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[AM_STAR]TED> but was:<[JOB_SUBMIT]TED>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling(TestMRTimelineEventHandling.java:147)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> *TestJobHistoryEventHandler.testTimelineEventHandling* 
> {code}
> [ERROR] 
> testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler)
>   Time elapsed: 5.858 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:597)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> 

[jira] [Commented] (YARN-9611) ApplicationHistoryServer related testcases failing

2019-06-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859897#comment-16859897
 ] 

Hadoop QA commented on YARN-9611:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 50s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests 
{color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
17s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 21s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests 
{color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
5s{color} | {color:green} hadoop-yarn-server-tests in the patch passed. {color} 
|
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 53m 26s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9611 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12971312/YARN-9611-001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 835db4efc63b 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / fcfe7a3 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24256/testReport/ |
| Max. process+thread count | 615 (vs. ulimit of 1) |
| modules | 

[jira] [Commented] (YARN-9611) ApplicationHistoryServer related testcases failing

2019-06-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859890#comment-16859890
 ] 

Hadoop QA commented on YARN-9611:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 35s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests 
{color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 10s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: 
The patch generated 1 new + 28 unchanged - 0 fixed = 29 total (was 28) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 56s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests 
{color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
11s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
59s{color} | {color:green} hadoop-yarn-server-tests in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
21s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 45m 53s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9611 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12971296/MAPREDUCE-7217-001.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 29b114b1b166 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / fcfe7a3 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| checkstyle | 

[jira] [Commented] (YARN-9612) Support using ip to register NodeID

2019-06-10 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859883#comment-16859883
 ] 

Zhankun Tang commented on YARN-9612:


[~cane], thanks for mentioning this. Per my understanding, the RM pod's service 
name in k8s can be used to register the NM?

> Support using ip to register NodeID
> ---
>
> Key: YARN-9612
> URL: https://issues.apache.org/jira/browse/YARN-9612
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Priority: Major
>
> In environments like k8s, we should support using the IP when registering the 
> NodeID with the RM, since the hostname will be the pod name, which cannot be 
> resolved by the DNS of k8s.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9612) Support using ip to register NodeID

2019-06-10 Thread zhoukang (JIRA)
zhoukang created YARN-9612:
--

 Summary: Support using ip to register NodeID
 Key: YARN-9612
 URL: https://issues.apache.org/jira/browse/YARN-9612
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhoukang


In environments like k8s, we should support using the IP when registering the 
NodeID with the RM, since the hostname will be the pod name, which cannot be 
resolved by the DNS of k8s.
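
As a rough illustration only (not a patch), a sketch along these lines could 
gate the behaviour behind a configuration flag; the key name 
yarn.nodemanager.register-with-ip and the helper class are hypothetical, while 
NodeId.newInstance and Configuration are existing Hadoop/YARN APIs.
{code:java}
// Sketch only: the property name below is hypothetical.
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeId;

public final class NodeIdResolver {
  /** Hypothetical switch: register the NM with its IP instead of its hostname. */
  public static final String NM_REGISTER_WITH_IP =
      "yarn.nodemanager.register-with-ip";

  public static NodeId buildNodeId(Configuration conf, String hostname, int port)
      throws UnknownHostException {
    if (conf.getBoolean(NM_REGISTER_WITH_IP, false)) {
      // The pod IP stays routable even when the pod name is not resolvable in DNS.
      return NodeId.newInstance(InetAddress.getLocalHost().getHostAddress(), port);
    }
    return NodeId.newInstance(hostname, port);
  }

  private NodeIdResolver() {
  }
}
{code}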



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9611) ApplicationHistoryServer related testcases failing

2019-06-10 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-9611:

Attachment: YARN-9611-001.patch

> ApplicationHistoryServer related testcases failing
> --
>
> Key: YARN-9611
> URL: https://issues.apache.org/jira/browse/YARN-9611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7217-001.patch, YARN-9611-001.patch
>
>
> *TestMRTimelineEventHandling.testMRTimelineEventHandling fails.*
> {code:java}
> ERROR] 
> testMRTimelineEventHandling(org.apache.hadoop.mapred.TestMRTimelineEventHandling)
>   Time elapsed: 46.337 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[AM_STAR]TED> but was:<[JOB_SUBMIT]TED>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling(TestMRTimelineEventHandling.java:147)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> *TestJobHistoryEventHandler.testTimelineEventHandling* 
> {code}
> [ERROR] 
> testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler)
>   Time elapsed: 5.858 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:597)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> 

[jira] [Updated] (YARN-9611) ApplicationHistoryServer related testcases failing

2019-06-10 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-9611:

Component/s: timelineserver

> ApplicationHistoryServer related testcases failing
> --
>
> Key: YARN-9611
> URL: https://issues.apache.org/jira/browse/YARN-9611
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7217-001.patch
>
>
> *TestMRTimelineEventHandling.testMRTimelineEventHandling fails.*
> {code:java}
> ERROR] 
> testMRTimelineEventHandling(org.apache.hadoop.mapred.TestMRTimelineEventHandling)
>   Time elapsed: 46.337 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[AM_STAR]TED> but was:<[JOB_SUBMIT]TED>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling(TestMRTimelineEventHandling.java:147)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> *TestJobHistoryEventHandler.testTimelineEventHandling* 
> {code}
> [ERROR] 
> testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler)
>   Time elapsed: 5.858 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:597)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> 

[jira] [Updated] (YARN-9611) ApplicationHistoryServer related testcases failing

2019-06-10 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-9611:

Summary: ApplicationHistoryServer related testcases failing  (was: 
TestMRTimelineEventHandling.testMRTimelineEventHandling fails)

> ApplicationHistoryServer related testcases failing
> --
>
> Key: YARN-9611
> URL: https://issues.apache.org/jira/browse/YARN-9611
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7217-001.patch
>
>
> *TestMRTimelineEventHandling.testMRTimelineEventHandling fails.*
> {code:java}
> ERROR] 
> testMRTimelineEventHandling(org.apache.hadoop.mapred.TestMRTimelineEventHandling)
>   Time elapsed: 46.337 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[AM_STAR]TED> but was:<[JOB_SUBMIT]TED>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling(TestMRTimelineEventHandling.java:147)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> *TestJobHistoryEventHandler.testTimelineEventHandling* 
> {code}
> [ERROR] 
> testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler)
>   Time elapsed: 5.858 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:597)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> 

[jira] [Moved] (YARN-9611) TestMRTimelineEventHandling.testMRTimelineEventHandling fails

2019-06-10 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph moved MAPREDUCE-7217 to YARN-9611:


Affects Version/s: (was: 3.3.0)
   3.3.0
  Key: YARN-9611  (was: MAPREDUCE-7217)
  Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> TestMRTimelineEventHandling.testMRTimelineEventHandling fails
> -
>
> Key: YARN-9611
> URL: https://issues.apache.org/jira/browse/YARN-9611
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7217-001.patch
>
>
> *TestMRTimelineEventHandling.testMRTimelineEventHandling fails.*
> {code:java}
> ERROR] 
> testMRTimelineEventHandling(org.apache.hadoop.mapred.TestMRTimelineEventHandling)
>   Time elapsed: 46.337 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[AM_STAR]TED> but was:<[JOB_SUBMIT]TED>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling(TestMRTimelineEventHandling.java:147)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> *TestJobHistoryEventHandler.testTimelineEventHandling* 
> {code}
> [ERROR] 
> testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler)
>   Time elapsed: 5.858 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:597)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> 

[jira] [Updated] (YARN-9537) Add configuration to disable AM preemption

2019-06-10 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated YARN-9537:
---
Attachment: (was: YARN-9537.001.patch)

> Add configuration to disable AM preemption
> --
>
> Key: YARN-9537
> URL: https://issues.apache.org/jira/browse/YARN-9537
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: zhoukang
>Priority: Major
>
> In this issue, I will add a configuration to support disabling AM preemption.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9537) Add configuration to disable AM preemption

2019-06-10 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated YARN-9537:
---
Attachment: YARN-9537.001.patch

> Add configuration to disable AM preemption
> --
>
> Key: YARN-9537
> URL: https://issues.apache.org/jira/browse/YARN-9537
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: zhoukang
>Priority: Major
> Attachments: YARN-9537.001.patch
>
>
> In this issue, I will add a configuration to support disabling AM preemption.
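
As a hedged illustration of what the new switch might look like (the property 
name below is a placeholder, not necessarily the one in the attached patch):
{code:java}
// Sketch only: a hypothetical FairScheduler property to exempt AM containers
// from preemption; the actual key added by the patch may differ.
import org.apache.hadoop.conf.Configuration;

public final class AmPreemptionConfig {
  public static final String AM_PREEMPTION_ENABLED =
      "yarn.scheduler.fair.am-preemption.enabled";

  /**
   * Default true keeps the current behaviour; false would skip AM containers
   * when the preemption policy selects victims.
   */
  public static boolean isAmPreemptionEnabled(Configuration conf) {
    return conf.getBoolean(AM_PREEMPTION_ENABLED, true);
  }

  private AmPreemptionConfig() {
  }
}
{code}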



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859810#comment-16859810
 ] 

Tao Yang edited comment on YARN-9598 at 6/10/19 8:14 AM:
-

As I commented 
[above|https://issues.apache.org/jira/browse/YARN-9598?focusedCommentId=16859709=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16859709],
 re-reservation is harmful in multi-node scenarios: it can make a low-priority 
app hold much more resource than it needs, which won't be released until all of 
its needs are satisfied; this is inefficient for cluster utilization and can 
block requirements from high-priority apps.
I think we should discuss this further. A simple way is to add a configuration 
so that users can decide whether to enable or disable it themselves; if 
re-reservation is enabled, a node-sorting policy that puts nodes with reserved 
containers at the back of the sorted nodes is also needed. 
Thoughts? 
cc: [~cheersyang]
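
A minimal sketch of the node-sorting idea above, assuming 
SchedulerNode#getReservedContainer() as the signal and leaving aside how it 
would be wired into the configured multi-node sorting policy; this is not part 
of the attached patch:
{code:java}
// Sketch only: order candidate nodes so that nodes already holding a reserved
// container are tried last, instead of repeatedly re-reserving on them.
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode;

public final class ReservedLastNodeSorter {
  public static void sortReservedLast(List<? extends SchedulerNode> candidates) {
    // Nodes without a reservation sort to the front (key 0), reserved nodes last (key 1).
    Comparator<SchedulerNode> reservedLast = Comparator.comparingInt(
        node -> node.getReservedContainer() == null ? 0 : 1);
    candidates.sort(reservedLast);
  }

  private ReservedLastNodeSorter() {
  }
}
{code}
Within the CapacityScheduler this would presumably be plugged into the 
configured multi-node lookup policy rather than invoked ad hoc.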


was (Author: tao yang):
As I commented 
[above|https://issues.apache.org/jira/browse/YARN-9598?focusedCommentId=16859709=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16859709],
 re-reservation is harmful in multi-nodes scenarios, it can make a low-priority 
app get much more resources than needs which won't be released util all the 
needs satisfied, it's inefficient for the cluster utilization and can block 
requirements from high-priority apps.
I think we should have a further discuss about this, a simple way is to add a 
configuration to control enable/disable which can be decided by users 
themselves, and a node-sorting policy which can put nodes with reserved 
containers in the back of sorting nodes if re-reservation enabled. Thoughts? 
cc: [~cheersyang]

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems about reservation when multi-node placement is 
> enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look up further 
> candidates for this app when multi-node placement is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. Here there are two problems:
>  ** The node of the reserved container should be taken as the candidate instead 
> of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it is just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes, which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859810#comment-16859810
 ] 

Tao Yang commented on YARN-9598:


As I commented 
[above|https://issues.apache.org/jira/browse/YARN-9598?focusedCommentId=16859709=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16859709],
 re-reservation is harmful in multi-node scenarios: it can make a low-priority 
app hold much more resource than it needs, which won't be released until all of 
its needs are satisfied; this is inefficient for cluster utilization and can 
block requirements from high-priority apps.
I think we should discuss this further. A simple way is to add a configuration 
so that users can decide whether to enable or disable it themselves, and a 
node-sorting policy that puts nodes with reserved containers at the back of 
the sorted nodes if re-reservation is enabled. Thoughts? 
cc: [~cheersyang]

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems about reservation when multi-node placement is 
> enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look up further 
> candidates for this app when multi-node placement is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. Here there are two problems:
>  ** The node of the reserved container should be taken as the candidate instead 
> of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it is just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes, which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Juanjuan Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859799#comment-16859799
 ] 

Juanjuan Tian  commented on YARN-9598:
--

   "inter-queue preemption can't happened because of resource fragmentation 
while cluster resource still have 20GB available memory, right?" I will think 
the answer is yes. 

I agree "it's not re-reservation's business but can be worked around by it".  
re-reservation can results in many reservation on many nodes, and then finally 
trigger preemption, it's a workround for preemption not smart enough. So I 
think we should reconsider the re-reservation logic in this patch.

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems about reservation when multi-node placement is 
> enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look up further 
> candidates for this app when multi-node placement is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. Here there are two problems:
>  ** The node of the reserved container should be taken as the candidate instead 
> of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it is just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes, which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859790#comment-16859790
 ] 

Tao Yang edited comment on YARN-9598 at 6/10/19 7:38 AM:
-

It's weird to hear that preemption should depend on excess reservations. 
I think inter-queue preemption can't happen because of resource fragmentation 
while the cluster still has 20GB of available memory, right? That's indeed a 
problem in the community's current preemption logic. If so, I think it's not 
re-reservation's business but can be worked around by it, and re-reservation 
would hardly help with this in a large cluster.


was (Author: tao yang):
It's weird to hear that preemption should depends on excess reservations. 
I think inter-queue preemption can't happened because of resource fragmentation 
while cluster resource still have 20GB available memory, right? That's indeed a 
problem in current preemption logic of community. If it is, I think it's no 
re-reservation's business but can be worked around by it, and re-reservation 
may hardly help for this in a large cluster.

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems about reservation when multi-node placement is 
> enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look up further 
> candidates for this app when multi-node placement is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. Here there are two problems:
>  ** The node of the reserved container should be taken as the candidate instead 
> of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it is just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes, which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859790#comment-16859790
 ] 

Tao Yang commented on YARN-9598:


It's weird to hear that preemption should depend on excess reservations. 
I think you mean inter-queue preemption can't happen because of resource fragmentation 
while the cluster still has 20GB of available memory, right? That's indeed a 
problem in the community's current preemption logic. If so, I think it's not 
re-reservation's business, even though it can be worked around by re-reservation, and 
re-reservation may hardly help with this in a large cluster.

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservations when multi-node is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look at the following 
> candidates for this app when multi-node is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. There are two problems here:
>  ** The node of the reserved container should be taken as the only candidate 
> instead of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it's just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Juanjuan Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859776#comment-16859776
 ] 

Juanjuan Tian  commented on YARN-9598:
--

Hi, [~Tao Yang], just like you said, there will always be only one reserved 
container when re-reservation is disabled, and thus even when inter-queue 
preemption is enabled in the cluster, preemption will not happen. But if we can 
reserve several containers, preemption can be triggered (when 
yarn.resourcemanager.monitor.capacity.preemption.additional_res_balance_based_on_reserved_containers
 is set to true).
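A toy calculation may help illustrate the point. This is a deliberately simplified model with hypothetical numbers, not the actual ProportionalCapacityPreemptionPolicy logic: it only shows that counting reserved containers toward queue B's demand gives the policy something to reclaim from the over-used queue A, while a demand of zero leaves nothing to preempt.

{code:java}
// Simplified model (hypothetical numbers): reserved containers counted toward
// queue B's demand give the preemption policy a non-zero amount to reclaim.
public class PreemptionDemandSketch {
  public static void main(String[] args) {
    int guaranteeA = 40;          // queue A's guarantee (queue B is also guaranteed 40GB)
    int usedA = 60, usedB = 0;    // A is 20GB over its guarantee, B uses nothing yet
    int reservedB = 3;            // one 3GB container reserved for B

    int demandWithoutReserved = usedB;              // 0GB visible demand
    int demandWithReserved = usedB + reservedB;     // 3GB visible demand

    int overA = Math.max(0, usedA - guaranteeA);    // 20GB over guarantee
    System.out.println("to preempt for B without counting reservations: "
        + Math.min(overA, demandWithoutReserved) + "GB");   // 0GB -> nothing happens
    System.out.println("to preempt for B counting reservations: "
        + Math.min(overA, demandWithReserved) + "GB");      // 3GB -> preemption kicks in
  }
}
{code}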

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservations when multi-node is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look at the following 
> candidates for this app when multi-node is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. There are two problems here:
>  ** The node of the reserved container should be taken as the only candidate 
> instead of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it's just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859760#comment-16859760
 ] 

Tao Yang commented on YARN-9598:


Hi, [~jutia].
In your example, queue A has been allocated 60GB, leaving only 2GB free on every 
node; when queue B needs a 3GB container, the scheduler may reserve one container on 
one node. That sounds unrelated to whether re-reservation is enabled. I think 
it's about resource fragmentation, and a simple way to solve the problem is 
inter-queue preemption. If inter-queue preemption is disabled in your cluster, 
there may be several reserved containers after many rounds of scheduling 
when re-reservation is enabled, and there will always be only one reserved 
container when re-reservation is disabled; that's the main difference between them. 
In either case there will eventually be an allocation: the reserved container will be 
unreserved or fulfilled once some node has enough resource (for example, when a 
container allocated on it finishes).
Please correct me if I am wrong.
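For reference, the arithmetic behind this fragmentation scenario, using the numbers from the example in this thread (a plain sketch, not scheduler code):

{code:java}
// The fragmentation scenario from the example above, as plain arithmetic:
// plenty of free memory cluster-wide, but no single node can host the request.
public class FragmentationSketch {
  public static void main(String[] args) {
    int nodes = 10;
    int nodeCapacityGB = 8;
    int allocatedPerNodeGB = 6;   // one 6GB container from queue A on each node
    int requestGB = 3;            // queue B asks for a 3GB container

    int freePerNodeGB = nodeCapacityGB - allocatedPerNodeGB;   // 2GB per node
    int clusterFreeGB = nodes * freePerNodeGB;                 // 20GB in total

    boolean fitsOnAnyNode = freePerNodeGB >= requestGB;        // false
    System.out.println("cluster free = " + clusterFreeGB + "GB, free per node = "
        + freePerNodeGB + "GB, 3GB request fits on a single node: " + fitsOnAnyNode);
    // Result: the scheduler can only reserve the container (or rely on
    // preemption) until some node frees at least 3GB.
  }
}
{code}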

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservations when multi-node is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look at the following 
> candidates for this app when multi-node is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. There are two problems here:
>  ** The node of the reserved container should be taken as the only candidate 
> instead of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it's just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Juanjuan Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859689#comment-16859689
 ] 

Juanjuan Tian  edited comment on YARN-9598 at 6/10/19 6:45 AM:
---

Hi Tao, 
 # "As discussed in YARN-9576, a re-reservation proposal may always be generated 
on the same node and break the scheduling for this app and later apps. I think 
re-reservation is unnecessary and we can replace it with LOCALITY_SKIPPED to 
let the scheduler have a chance to look at the following candidates for this app when 
multi-node is enabled."

           For this, if re-reservation is disabled, 
shouldAllocOrReserveNewContainer may return false in most cases, and thus even 
if the scheduler has a chance to look at other candidates, it may not assign 
containers.

   2.  After this patch, since the assignment returned by 
FiCaSchedulerApp#assignContainers can never be null even if it's just 
skipped, then even if only one of the candidates has been reserved for a 
container, allocateFromReservedContainer will still never return null, so it still 
breaks the normal scheduling process.

So I'm wondering if we can just handle this case like single-node, and change the 
logic in CapacityScheduler#allocateContainersOnMultiNodes{color:#d04437} like 
below{color}

     !image-2019-06-10-11-37-44-975.png!

   

/*
 * New behavior, allocate containers considering multiple nodes
 */
private CSAssignment allocateContainersOnMultiNodes(
    {color:#d04437}FiCaSchedulerNode schedulerNode{color}) {

  // Backward compatible way to make sure previous behavior which allocation
  // driven by node heartbeat works.
  if (getNode(schedulerNode.getNodeID()) != schedulerNode) {
    LOG.error("Trying to schedule on a removed node, please double check.");
    return null;
  }

  // Assign new containers...
  // 1. Check for reserved applications
  // 2. Schedule if there are no reservations
  RMContainer reservedRMContainer = schedulerNode.getReservedContainer();
  {color:#d04437}if (reservedRMContainer != null) {{color}
    allocateFromReservedContainer(schedulerNode, false, reservedRMContainer);
  }

  // Do not schedule if there are any reservations to fulfill on the node
  if (schedulerNode.getReservedContainer() != null) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping scheduling since node " + schedulerNode.getNodeID()
          + " is reserved by application " + schedulerNode.getReservedContainer()
              .getContainerId().getApplicationAttemptId());
    }
    return null;
  }

  {color:#d04437}PlacementSet ps = getCandidateNodeSet(schedulerNode);{color}

  // When this time look at multiple nodes, try schedule if the
  // partition has any available resource or killable resource
  if (getRootQueue().getQueueCapacities().getUsedCapacity(
      ps.getPartition()) >= 1.0f && preemptionManager.getKillableResource(
      CapacitySchedulerConfiguration.ROOT, ps.getPartition()) == Resources
          .none()) {

 

 


was (Author: jutia):
Hi Tao, 
 # As discussed in YARN-9576, re-reservation proposal may be always generated 
on the same node and break the scheduling for this app and later apps. I think 
re-reservation is unnecessary and we can replace it with LOCALITY_SKIPPED to 
let scheduler have a chance to look up follow candidates for this app when 
multi-node enabled.                

           for this, if re-reservation is disabled, the 
shouldAllocOrReserveNewContainer may return false in most cases, and thus even 
scheduler has a change to look up other candidates, it may not assign 
containers.

   2.  After this patch, since Assignment returned by 
FiCaSchedulerApp#assignContainers could never be null even if it's just 
skipped, thus, even only one of the candidates has been reserved for a 
contianer, the allocateFromReservedContainer will still never be null, it still 
breaks normal scheduler process.

So I'm wondering why we just handle this case like sing-node, and change th 
logic in CapacityScheduler#allocateContainersOnMultiNodes{color:#d04437} like 
below{color}

     !image-2019-06-10-11-37-44-975.png!

   

/*
 * New behavior, allocate containers considering multiple nodes
 */
 private CSAssignment allocateContainersOnMultiNodes(
 {color:#d04437}FiCaSchedulerNode schedulerNode{color}) {

// Backward compatible way to make sure previous behavior which allocation
 // driven by node heartbeat works.
 if (getNode(schedulerNode.getNodeID()) != schedulerNode)

{ LOG.error("Trying to schedule on a removed node, please double check."); 
return null; }

// Assign new containers...
 // 1. Check for reserved applications
 // 2. Schedule if there are no reservations
 RMContainer reservedRMContainer = schedulerNode.getReservedContainer();
 {color:#d04437}if (reservedRMContainer != null) {{color}
 allocateFromReservedContainer(schedulerNode, false, reservedRMContainer);
 }

// Do not schedule if there are any reservations to fulfill on the node
 if (schedulerNode.getReservedContainer() != null) {
 if 

[jira] [Comment Edited] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Juanjuan Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859740#comment-16859740
 ] 

Juanjuan Tian  edited comment on YARN-9598 at 6/10/19 6:36 AM:
---

Hi Tao,
{noformat}
disable re-reservation can only make the scheduler skip reserving the same 
container repeatedly and try to allocate on other nodes, it won't affect normal 
scheduling for this app and later apps. Thoughts?{noformat}
For example, there are 10 nodes (h1, h2, ... h9, h10) in the cluster, each with 8GB 
memory, and two queues A and B, each configured with 50% capacity.

First, 10 jobs (each requesting 6GB of resource) are submitted to queue A, 
and each of the 10 nodes will have a container allocated.

Afterwards, another job JobB which requests 3GB of resource is submitted to queue 
B, and one container of 3GB will be reserved on node h1. If we 
disable re-reservation, in this case, even though the scheduler can look at other nodes, 
since shouldAllocOrReserveNewContainer is false there are still no other 
reservations, and JobB will still get stuck. 


was (Author: jutia):
Hi Tao,

{ }

disable re-reservation can only make the scheduler skip reserving the same 
container repeatedly and try to allocate on other nodes, it won't affect normal 
scheduling for this app and later apps. Thoughts?

{}

 

for example, there are 10 nodes(h1,h2,...h9,h10), each has 8G memory in 
cluster, and two queues A,B, each is configured with 50% capacity.

firstly there are 10 jobs (each requests 6G respurce) is submited to queue A, 
and each node of the 10 nodes will have a contianer allocated.

Afterwards,  another job JobB which requests 3G resource is submited to queue 
B, and there will be one container with 3G size reserved on node h1

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservations when multi-node is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look at the following 
> candidates for this app when multi-node is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. There are two problems here:
>  ** The node of the reserved container should be taken as the only candidate 
> instead of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it's just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Juanjuan Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859740#comment-16859740
 ] 

Juanjuan Tian  commented on YARN-9598:
--

Hi Tao,
{noformat}
disable re-reservation can only make the scheduler skip reserving the same 
container repeatedly and try to allocate on other nodes, it won't affect normal 
scheduling for this app and later apps. Thoughts?{noformat}
For example, there are 10 nodes (h1, h2, ... h9, h10) in the cluster, each with 8GB 
memory, and two queues A and B, each configured with 50% capacity.

First, 10 jobs (each requesting 6GB of resource) are submitted to queue A, 
and each of the 10 nodes will have a container allocated.

Afterwards, another job JobB which requests 3GB of resource is submitted to queue 
B, and one container of 3GB will be reserved on node h1.

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservations when multi-node is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be generated 
> on the same node and break the scheduling for this app and later apps. I 
> think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look at the following 
> candidates for this app when multi-node is enabled.
>  # The scheduler iterates all nodes and tries to allocate for the reserved 
> container in LeafQueue#allocateFromReservedContainer. There are two problems here:
>  ** The node of the reserved container should be taken as the only candidate 
> instead of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the 
> scheduler may later generate a reservation-fulfilled proposal on another node, 
> which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers can never be 
> null even if it's just skipped, which will break the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) \{ return assignment;}"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate, otherwise the scheduler may generate 
> allocation or reservation proposals on these nodes which will always be 
> rejected in FiCaScheduler#commonCheckContainerAllocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9608) DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2019-06-10 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859738#comment-16859738
 ] 

Abhishek Modi commented on YARN-9608:
-

[~subru] [~elgoiri] [~giovanni.fumarola] could you please review it. Thanks.

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9608.001.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications 
> and triggers decommission of a node when all the applications that ran on the 
> node complete. This Jira proposes to solve the following problems:
>  # DecommissioningNodesWatcher skips tracking application containers on a 
> particular node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to DECOMMISSIONING state.
>  # It keeps its own record of running apps. We can get this directly from 
> RMNode (see the sketch below).
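A rough sketch of the second point, using hypothetical stand-in interfaces (not the actual RMNode/DecommissioningNodesWatcher APIs) and assuming the node object can report its own running applications:

{code:java}
import java.util.List;

// Hypothetical stand-ins: the watcher asks the node for its live application
// list instead of maintaining a separate set of its own.
interface TrackedNode {
  List<String> getRunningApps();   // stands in for RMNode's view of running apps
}

public class DecommissionCheckSketch {
  // A node is ready to be decommissioned only once nothing is running on it.
  static boolean readyToDecommission(TrackedNode node) {
    return node.getRunningApps().isEmpty();
  }

  public static void main(String[] args) {
    TrackedNode node = () -> List.of("application_1560000000000_0001");
    System.out.println("ready: " + readyToDecommission(node));   // false
  }
}
{code}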



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9610) HeartbeatCallBack in FederationInterceptor should clear AMRMToken in response from UAM before adding to asyncResponseSink

2019-06-10 Thread Morty Zhong (JIRA)
Morty Zhong created YARN-9610:
-

 Summary: HeartbeatCallBack in FederationInterceptor should clear 
AMRMToken in response from UAM before adding to asyncResponseSink 
 Key: YARN-9610
 URL: https://issues.apache.org/jira/browse/YARN-9610
 Project: Hadoop YARN
  Issue Type: Bug
  Components: amrmproxy, federation
Affects Versions: 3.1.2
 Environment: In federation, `allocate` is async; the response from each RM 
is cached in `asyncResponseSink`.

The final allocate response is merged from all RMs' allocate responses. The merge 
will throw an exception when the AMRMToken from a UAM response is not null.

But setting the AMRMToken from the UAM response to null is not done within the scope of 
the lock, so there is a chance that the merge sees a UAM response whose AMRMToken is not null.

So we should clear the token before adding the response to asyncResponseSink; 
a sketch of the reordered flow follows the snippet below.

 

 
{code:java}
synchronized (asyncResponseSink) {
  List<AllocateResponse> responses = null;
  if (asyncResponseSink.containsKey(subClusterId)) {
    responses = asyncResponseSink.get(subClusterId);
  } else {
    responses = new ArrayList<>();
    asyncResponseSink.put(subClusterId, responses);
  }
  responses.add(response);
  // Notify main thread about the response arrival
  asyncResponseSink.notifyAll();
}
...
if (this.isUAM && response.getAMRMToken() != null) {
  Token<AMRMTokenIdentifier> newToken = ConverterUtils
      .convertFromYarn(response.getAMRMToken(), (Text) null);
  // Do not further propagate the new amrmToken for UAM
  response.setAMRMToken(null);
...{code}
Reporter: Morty Zhong
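
A minimal sketch of the ordering proposed above, using hypothetical stand-in types rather than the actual FederationInterceptor classes: the UAM AMRMToken is cleared before the response is published to the sink, so the merging thread can never observe a non-null token.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the real AllocateResponse / FederationInterceptor,
// kept just detailed enough to show the proposed ordering.
public class ResponseSinkSketch {
  static class Response {
    private String amrmToken;                 // stands in for the UAM AMRMToken
    String getAMRMToken() { return amrmToken; }
    void setAMRMToken(String token) { amrmToken = token; }
  }

  private final Map<String, List<Response>> asyncResponseSink = new HashMap<>();

  void onHeartbeatResponse(String subClusterId, Response response, boolean isUAM) {
    // Clear the token *before* the response becomes visible to the merging thread.
    if (isUAM && response.getAMRMToken() != null) {
      // ... persist or propagate the new token here if needed ...
      response.setAMRMToken(null);
    }
    synchronized (asyncResponseSink) {
      asyncResponseSink
          .computeIfAbsent(subClusterId, k -> new ArrayList<>())
          .add(response);
      // Notify the main thread about the response arrival
      asyncResponseSink.notifyAll();
    }
  }
}
{code}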






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org