[jira] [Commented] (YARN-1151) Ability to configure auxiliary services from HDFS-based JAR files

2021-03-16 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302759#comment-17302759
 ] 

Haibo Chen commented on YARN-1151:
--

Cherry-picked to branch-2.10.

> Ability to configure auxiliary services from HDFS-based JAR files
> -
>
> Key: YARN-1151
> URL: https://issues.apache.org/jira/browse/YARN-1151
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.1.0-beta, 2.9.0
>Reporter: john lilley
>Assignee: Xuan Gong
>Priority: Major
>  Labels: auxiliary-service, yarn
> Fix For: 3.2.0, 3.1.1, 2.10.2
>
> Attachments: YARN-1151.1.patch, YARN-1151.2.patch, YARN-1151.3.patch, 
> YARN-1151.4.patch, YARN-1151.5.patch, YARN-1151.6.patch, 
> YARN-1151.branch-2.poc.2.patch, YARN-1151.branch-2.poc.3.patch, 
> YARN-1151.branch-2.poc.patch, [YARN-1151] [Design] Configure auxiliary 
> services from HDFS-based JAR files.pdf
>
>
> I would like to install an auxiliary service in Hadoop YARN without actually 
> installing files/services on every node in the system.  Discussions on the 
> user@ list indicate that this is not easily done.  The reason we want an 
> auxiliary service is that our application has some persistent-data components 
> that are not appropriate for HDFS.  In fact, they are somewhat analogous to 
> the mapper output of MapReduce's shuffle, which is what led me to 
> auxiliary-services in the first place.  It would be much easier if we could 
> just place our service's JARs in HDFS.






[jira] [Updated] (YARN-1151) Ability to configure auxiliary services from HDFS-based JAR files

2021-03-16 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-1151:
-
Fix Version/s: 2.10.2

> Ability to configure auxiliary services from HDFS-based JAR files
> -
>
> Key: YARN-1151
> URL: https://issues.apache.org/jira/browse/YARN-1151
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.1.0-beta, 2.9.0
>Reporter: john lilley
>Assignee: Xuan Gong
>Priority: Major
>  Labels: auxiliary-service, yarn
> Fix For: 3.2.0, 3.1.1, 2.10.2
>
> Attachments: YARN-1151.1.patch, YARN-1151.2.patch, YARN-1151.3.patch, 
> YARN-1151.4.patch, YARN-1151.5.patch, YARN-1151.6.patch, 
> YARN-1151.branch-2.poc.2.patch, YARN-1151.branch-2.poc.3.patch, 
> YARN-1151.branch-2.poc.patch, [YARN-1151] [Design] Configure auxiliary 
> services from HDFS-based JAR files.pdf
>
>
> I would like to install an auxiliary service in Hadoop YARN without actually 
> installing files/services on every node in the system.  Discussions on the 
> user@ list indicate that this is not easily done.  The reason we want an 
> auxiliary service is that our application has some persistent-data components 
> that are not appropriate for HDFS.  In fact, they are somewhat analogous to 
> the mapper output of MapReduce's shuffle, which is what led me to 
> auxiliary-services in the first place.  It would be much easier if we could 
> just place our service's JARs in HDFS.






[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10

2021-03-16 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10698:
--
Attachment: YARN-10698.branch-2.10.00.patch

> Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
> -
>
> Key: YARN-10698
> URL: https://issues.apache.org/jira/browse/YARN-10698
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10698.branch-2.10.00.patch
>
>







[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10

2021-03-16 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10698:
--
Target Version/s: 2.10.2

> Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
> -
>
> Key: YARN-10698
> URL: https://issues.apache.org/jira/browse/YARN-10698
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
>







[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10

2021-03-16 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10698:
--
Affects Version/s: 2.10.1

> Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
> -
>
> Key: YARN-10698
> URL: https://issues.apache.org/jira/browse/YARN-10698
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
>







[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10

2021-03-16 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10698:
--
Summary: Backport YARN-1151 (load auxiliary service from HDFS archives) to 
branch-2.10  (was: Backport YARN-1151 (load auxiliary service from HDFS 
archives) to branch-2)

> Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
> -
>
> Key: YARN-10698
> URL: https://issues.apache.org/jira/browse/YARN-10698
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
>







[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2

2021-03-16 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10698:
--
Target Version/s:   (was: 2.10.2)

> Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2
> --
>
> Key: YARN-10698
> URL: https://issues.apache.org/jira/browse/YARN-10698
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
>







[jira] [Created] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2

2021-03-16 Thread Haibo Chen (Jira)
Haibo Chen created YARN-10698:
-

 Summary: Backport YARN-1151 (load auxiliary service from HDFS 
archives) to branch-2
 Key: YARN-10698
 URL: https://issues.apache.org/jira/browse/YARN-10698
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Haibo Chen
Assignee: Haibo Chen









[jira] [Commented] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-25 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291263#comment-17291263
 ] 

Haibo Chen commented on YARN-10651:
---

I updated the patch to add some logging. 

As for a unit test, the key condition to trigger this is that the scheduler 
thread must process a node update event (sent while the node was still healthy) 
only after the corresponding node has transitioned to the DECOMMISSIONING state 
(see the diagram for the event ordering), which only happens in a very busy 
cluster.

There isn't anything we can use right now in a unit test to artificially slow 
down the scheduler thread, wait for the node to reach DECOMMISSIONING, and only 
then allow it to process the node update.

 

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10651.00.patch, YARN-10651.01.patch, event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}






[jira] [Updated] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-25 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10651:
--
Attachment: YARN-10651.01.patch

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10651.00.patch, YARN-10651.01.patch, event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}






[jira] [Updated] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-24 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10651:
--
Attachment: YARN-10651.00.patch

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10651.00.patch, event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}






[jira] [Updated] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-24 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10651:
--
Affects Version/s: 2.10.0
   2.10.1

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}






[jira] [Commented] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-24 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290641#comment-17290641
 ] 

Haibo Chen commented on YARN-10651:
---

When a node update scheduler event is processed by the scheduler thread, the 
node might already have turned unhealthy and been taken to the DECOMMISSIONING 
state, in which case the scheduler would generate a 
NodeResourceUpdateSchedulerEvent. If there is already a 
NodeRemovedSchedulerEvent in the scheduler event queue (because the node was 
unhealthy), then the scheduler thread would first process the 
NodeRemovedSchedulerEvent, removing the SchedulerNode, and then process the 
NodeResourceUpdateSchedulerEvent, which currently assumes the SchedulerNode is 
still there.



The attached diagram shows the sequence of events triggering this.
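
To make the failure mode concrete, below is a minimal, self-contained sketch of 
the kind of null guard this race calls for in the resource-update path. The 
class and field names are hypothetical stand-ins for illustration only; this is 
not the actual AbstractYarnScheduler code nor the attached patch.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: "SketchScheduler" and its "nodes" map stand in for
// the scheduler and its NodeId -> SchedulerNode map; all names are hypothetical.
class SketchScheduler {
  private final Map<String, SchedulerNode> nodes = new ConcurrentHashMap<>();

  // Handles a NODE_RESOURCE_UPDATE-style event. If a node-removed event was
  // already processed for the same node (the DECOMMISSIONING race described
  // above), the lookup returns null and an unguarded dereference would throw
  // the NullPointerException seen in the stack trace.
  void updateNodeResource(String nodeId, int newMemoryMb) {
    SchedulerNode node = nodes.get(nodeId);
    if (node == null) {
      // Node already removed; log and skip instead of crashing the dispatcher.
      System.out.println("Skipping resource update for removed node " + nodeId);
      return;
    }
    node.setTotalMemoryMb(newMemoryMb);
  }

  static class SchedulerNode {
    private int totalMemoryMb;
    void setTotalMemoryMb(int mb) { this.totalMemoryMb = mb; }
  }
}
{code}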

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}






[jira] [Updated] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-24 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10651:
--
Attachment: event_seq.jpg

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}






[jira] [Commented] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-24 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290636#comment-17290636
 ] 

Haibo Chen commented on YARN-10651:
---

Relevant RM log:

 
{code:java}
6553854:2021-02-24 17:06:33,934 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6553856:2021-02-24 17:06:33,935 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
xxx.linkedin.com:8041 Node Transitioned from RUNNING to UNHEALTHY

6667464:2021-02-24 17:06:43,316 INFO 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
decommission node xxx.linkedin.com:8041 with state UNHEALTHY

6667894:2021-02-24 17:06:43,344 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
xxx.linkedin.com:8041 in DECOMMISSIONING.
6667896:2021-02-24 17:06:43,344 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
xxx.linkedin.com:8041 Node Transitioned from UNHEALTHY to DECOMMISSIONING
 
6674223:2021-02-24 17:06:44,019 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6685460:2021-02-24 17:06:45,021 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6694638:2021-02-24 17:06:46,021 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6708206:2021-02-24 17:06:46,482 INFO 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: No action for 
node xxx.linkedin.com:8041 with state DECOMMISSIONING
6713019:2021-02-24 17:06:47,064 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6722017:2021-02-24 17:06:48,022 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6731628:2021-02-24 17:06:49,024 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6743847:2021-02-24 17:06:50,063 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6753586:2021-02-24 17:06:51,026 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6762950:2021-02-24 17:06:52,028 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6772642:2021-02-24 17:06:53,081 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 reported UNHEALTHY with details: ERROR -> /dev/sdi - 
modules.DISK FAILED | OK -> User:yarn,modules.CPU PASS,modules.RAM 
PASS,modules.PROCESSES PASS,modules.NET PASS,modules.TMP_FULL 
PASS,modules.CGROUP PASS
6781739:2021-02-24 17:06:54,033 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node 
xxx.linkedin.com:8041 

[jira] [Created] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-24 Thread Haibo Chen (Jira)
Haibo Chen created YARN-10651:
-

 Summary: CapacityScheduler crashed with NPE in 
AbstractYarnScheduler.updateNodeResource() 
 Key: YARN-10651
 URL: https://issues.apache.org/jira/browse/YARN-10651
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Haibo Chen
Assignee: Haibo Chen


{code:java}
2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154){code}

at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
at java.lang.Thread.run(Thread.java:748)
2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
Exiting, bbye..






[jira] [Updated] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-24 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10651:
--
Description: 
{code:java}
2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
at java.lang.Thread.run(Thread.java:748)
2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
Exiting, bbye..{code}

  was:
{code:java}
2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154){code}

at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
at java.lang.Thread.run(Thread.java:748)
2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
Exiting, bbye..


> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}






[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-28 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Attachment: YARN-10467.02.patch

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.02.patch, YARN-10467.branch-2.10.00.patch, 
> YARN-10467.branch-2.10.01.patch, YARN-10467.branch-2.10.02.patch, 
> YARN-10467.branch-2.10.03.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.
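
For illustration, here is a minimal, hypothetical model of the lifecycle 
described above (the class and method names are invented for this sketch and 
are not the RM's actual code): entries are only evicted from the node-side set 
during an AM-RM heartbeat, so if the AM exits first, they are never evicted.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the cleanup path described above; not actual RM code.
class CompletedContainerTracker {
  // Stand-in for RMNodeImpl.completedContainers
  private final Set<String> completedOnNode = new HashSet<>();
  // Stand-in for RMAppAttemptImpl.finishedContainersSentToAM
  private final Set<String> sentToAM = new HashSet<>();

  // The NM reports a container as completed: it is added immediately.
  void onContainerCompleted(String containerId) {
    completedOnNode.add(containerId);
  }

  // Step 2 runs only on an AM-RM heartbeat: containers acknowledged to the AM
  // are recorded and then cleaned up from the node-side set. If the AM has
  // already exited, this method is never called again and the remaining
  // entries in completedOnNode are leaked.
  void onAmHeartbeat(Set<String> justFinishedContainers) {
    sentToAM.addAll(justFinishedContainers);
    completedOnNode.removeAll(sentToAM);
  }
}
{code}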






[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-28 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Attachment: YARN-10467.branch-2.10.03.patch

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch, YARN-10467.branch-2.10.03.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.






[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-28 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1738#comment-1738
 ] 

Haibo Chen commented on YARN-10467:
---

Thanks for catching this, [~Jim_Brennan]! I'll quickly update the patch to 
address this.

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.






[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Fix Version/s: (was: 2.10.2)

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.






[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Affects Version/s: 3.2.1

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.






[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Affects Version/s: 3.0.3
   3.1.4

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.






[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Attachment: YARN-10467.branch-2.10.02.patch

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.






[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221766#comment-17221766
 ] 

Haibo Chen commented on YARN-10467:
---

Thanks for the review, [~jhung]. I updated the branch-2.10 patch to get rid of 
these files.

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.






[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221561#comment-17221561
 ] 

Haibo Chen commented on YARN-10467:
---

Updated the patches to address checkstyle issues. The unit test failures seem 
unrelated to this patch.

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap 
> is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB 
> out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in 
> contrast, there are only 161,601 RMContainerImpl objects, which represent 
> the number of active containers that the RM is still tracking. Inspecting 
> some ContainerIdPBImpl objects shows that they belong to applications that 
> finished long ago. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. 
> The cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being 
> added to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned Step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon the success of 
> all its tasks without waiting for notification of the completion of every 
> container, or the AM may just die suddenly (e.g. OOM). Spark and other 
> frameworks may behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Attachment: YARN-10467.branch-2.10.01.patch

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap was 
> occupied by {{RMNodeImpl.completedContainers}}, which accounted for 19 GB out 
> of 24.3 GB.  There were over 86 million ContainerIdPBImpl objects; in 
> contrast, there were only 161,601 RMContainerImpl objects, which represent the 
> number of active containers that the RM is still tracking.  Inspecting some 
> ContainerIdPBImpl objects showed that they belonged to applications that had 
> long finished. This indicates some sort of memory leak of ContainerIdPBImpl 
> objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed 
> from RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all their 
> tasks without waiting for notification of the completion of every container, 
> or the AM may just die suddenly (e.g., OOM).  Spark and other frameworks 
> likely behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Attachment: YARN-10467.01.patch

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap was 
> occupied by {{RMNodeImpl.completedContainers}}, which accounted for 19 GB out 
> of 24.3 GB.  There were over 86 million ContainerIdPBImpl objects; in 
> contrast, there were only 161,601 RMContainerImpl objects, which represent the 
> number of active containers that the RM is still tracking.  Inspecting some 
> ContainerIdPBImpl objects showed that they belonged to applications that had 
> long finished. This indicates some sort of memory leak of ContainerIdPBImpl 
> objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed 
> from RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all their 
> tasks without waiting for notification of the completion of every container, 
> or the AM may just die suddenly (e.g., OOM).  Spark and other frameworks 
> likely behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-26 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Attachment: YARN-10467.00.patch

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.branch-2.10.00.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap was 
> occupied by {{RMNodeImpl.completedContainers}}, which accounted for 19 GB out 
> of 24.3 GB.  There were over 86 million ContainerIdPBImpl objects; in 
> contrast, there were only 161,601 RMContainerImpl objects, which represent the 
> number of active containers that the RM is still tracking.  Inspecting some 
> ContainerIdPBImpl objects showed that they belonged to applications that had 
> long finished. This indicates some sort of memory leak of ContainerIdPBImpl 
> objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed 
> from RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all their 
> tasks without waiting for notification of the completion of every container, 
> or the AM may just die suddenly (e.g., OOM).  Spark and other frameworks 
> likely behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-26 Thread Haibo Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-10467:
--
Attachment: YARN-10467.branch-2.10.00.patch

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10467.branch-2.10.00.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap was 
> occupied by {{RMNodeImpl.completedContainers}}, which accounted for 19 GB out 
> of 24.3 GB.  There were over 86 million ContainerIdPBImpl objects; in 
> contrast, there were only 161,601 RMContainerImpl objects, which represent the 
> number of active containers that the RM is still tracking.  Inspecting some 
> ContainerIdPBImpl objects showed that they belonged to applications that had 
> long finished. This indicates some sort of memory leak of ContainerIdPBImpl 
> objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps.
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed 
> from RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all their 
> tasks without waiting for notification of the completion of every container, 
> or the AM may just die suddenly (e.g., OOM).  Spark and other frameworks 
> likely behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-19 Thread Haibo Chen (Jira)
Haibo Chen created YARN-10467:
-

 Summary: ContainerIdPBImpl objects can be leaked in 
RMNodeImpl.completedContainers
 Key: YARN-10467
 URL: https://issues.apache.org/jira/browse/YARN-10467
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.10.0
Reporter: Haibo Chen
Assignee: Haibo Chen


In one of our recent heap analyses, we found that the majority of the heap was 
occupied by {{RMNodeImpl.completedContainers}}, which accounted for 19 GB out of 
24.3 GB.  There were over 86 million ContainerIdPBImpl objects; in contrast, 
there were only 161,601 RMContainerImpl objects, which represent the number of 
active containers that the RM is still tracking.  Inspecting some 
ContainerIdPBImpl objects showed that they belonged to applications that had 
long finished. This indicates some sort of memory leak of ContainerIdPBImpl 
objects in RMNodeImpl.

 

Right now, when a container is reported by an NM as completed, it is immediately 
added to RMNodeImpl.completedContainers and later cleaned up after the AM has 
been notified of its completion in the AM-RM heartbeat. The cleanup can be 
broken into a few steps.
 * Step 1: the completed container is first added to 
RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added to 
{{RMNodeImpl.completedContainers}}).
 * Step 2: during the AM-RM heartbeat, the container is removed from 
RMAppAttemptImpl.justFinishedContainers and added to 
RMAppAttemptImpl.finishedContainersSentToAM.

Once a completed container gets added to 
RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned up 
from {{RMNodeImpl.completedContainers}}.

 

However, if the AM exits (regardless of failure or success) before some recently 
completed containers can be added to 
RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there won't 
be any future AM-RM heartbeat to perform the aforementioned step 2. Hence, these 
objects stay in RMNodeImpl.completedContainers forever.

We have observed in MR that AMs can decide to exit upon success of all their 
tasks without waiting for notification of the completion of every container, or 
the AM may just die suddenly (e.g., OOM).  Spark and other frameworks likely 
behave similarly.
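
To make the cleanup path above concrete, here is a minimal, self-contained Java 
sketch of the three collections and the two steps. The class and method names 
are hypothetical stand-ins for illustration; this is not the actual RM code.

{noformat}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompletedContainerLeakSketch {

  // Stand-in for RMNodeImpl.completedContainers
  private final Set<String> completedContainers = new HashSet<>();
  // Stand-ins for the two RMAppAttemptImpl collections
  private final List<String> justFinishedContainers = new ArrayList<>();
  private final List<String> finishedContainersSentToAM = new ArrayList<>();

  // NM reports a completed container.
  void onContainerCompleted(String containerId) {
    completedContainers.add(containerId);
    // Step 1 (asynchronous in the real RM): queue it for the AM.
    justFinishedContainers.add(containerId);
  }

  // Step 2: only runs while the AM is still heartbeating.
  void onAmHeartbeat() {
    finishedContainersSentToAM.addAll(justFinishedContainers);
    justFinishedContainers.clear();
    // Only containers that reached finishedContainersSentToAM are ever
    // removed from completedContainers.
    completedContainers.removeAll(finishedContainersSentToAM);
    finishedContainersSentToAM.clear();
  }

  public static void main(String[] args) {
    CompletedContainerLeakSketch rm = new CompletedContainerLeakSketch();
    rm.onContainerCompleted("container_1");
    rm.onAmHeartbeat();               // container_1 is cleaned up
    rm.onContainerCompleted("container_2");
    // AM exits here (success or OOM): no further heartbeats, so container_2
    // stays in completedContainers forever -- the leak described above.
    System.out.println(rm.completedContainers); // [container_2]
  }
}
{noformat}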



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8213) Add Capacity Scheduler performance metrics

2020-03-27 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069102#comment-17069102
 ] 

Haibo Chen commented on YARN-8213:
--

Thanks for the backport, [~jhung]. +1 on the 2.10 patch.

> Add Capacity Scheduler performance metrics
> --
>
> Key: YARN-8213
> URL: https://issues.apache.org/jira/browse/YARN-8213
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8213-branch-2.10.001.patch, YARN-8213.001.patch, 
> YARN-8213.002.patch, YARN-8213.003.patch, YARN-8213.004.patch, 
> YARN-8213.005.patch
>
>
> Currently, tuning CS performance is not straightforward because of the lack of 
> metrics. Right now we only have {{QueueMetrics}}, which mostly tracks 
> queue-level resource counters. This proposes adding CS metrics to collect and 
> display more fine-grained performance metrics.
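
As a rough illustration of the kind of fine-grained timing the proposal is 
after, the sketch below aggregates per-operation scheduler latencies. It is 
hypothetical (plain JDK types, made-up class name) and does not mirror the 
CapacitySchedulerMetrics class that the patch actually adds.

{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class SchedulerPerfMetricsSketch {
  private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  // Record how long one scheduler operation (e.g. "allocate", "nodeUpdate") took.
  public void record(String op, long nanos) {
    totalNanos.computeIfAbsent(op, k -> new LongAdder()).add(nanos);
    counts.computeIfAbsent(op, k -> new LongAdder()).increment();
  }

  // Average latency in microseconds for one operation type.
  public double avgMicros(String op) {
    long n = counts.getOrDefault(op, new LongAdder()).sum();
    return n == 0 ? 0.0 : totalNanos.get(op).sum() / 1000.0 / n;
  }

  public static void main(String[] args) {
    SchedulerPerfMetricsSketch m = new SchedulerPerfMetricsSketch();
    long start = System.nanoTime();
    // ... scheduler work would happen here ...
    m.record("allocate", System.nanoTime() - start);
    System.out.printf("allocate avg: %.2f us%n", m.avgMicros("allocate"));
  }
}
{noformat}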



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-23 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065116#comment-17065116
 ] 

Haibo Chen commented on YARN-10200:
---

+1 on 001 patch pending the checkstyle fix.

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch
>
>
> It would be useful to persist this so we can track containers processed by the RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10039) Allow disabling app submission from REST endpoints

2019-12-18 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999400#comment-16999400
 ] 

Haibo Chen commented on YARN-10039:
---

I see. +1 on the patch.

> Allow disabling app submission from REST endpoints
> --
>
> Key: YARN-10039
> URL: https://issues.apache.org/jira/browse/YARN-10039
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10039.001.patch
>
>
> Introduce a configuration which allows disabling /apps/new-application and 
> /apps POST endpoints. 
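
For illustration, such a guard could look roughly like the sketch below. The 
property name, default, and exception type are assumptions made for the example, 
not necessarily what the patch introduces.

{noformat}
import org.apache.hadoop.conf.Configuration;

public class SubmissionGuardSketch {
  // Assumed property name, for illustration only.
  static final String APP_SUBMIT_ENABLED =
      "yarn.webapp.enable-rest-app-submissions";

  private final boolean submissionsEnabled;

  public SubmissionGuardSketch(Configuration conf) {
    this.submissionsEnabled = conf.getBoolean(APP_SUBMIT_ENABLED, true);
  }

  // Called at the top of the /apps/new-application and /apps POST handlers.
  void checkSubmissionAllowed() {
    if (!submissionsEnabled) {
      throw new UnsupportedOperationException(
          "App submission via the REST API is disabled on this cluster");
    }
  }
}
{noformat}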



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10039) Allow disabling app submission from REST endpoints

2019-12-17 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998668#comment-16998668
 ] 

Haibo Chen commented on YARN-10039:
---

[~jhung] Shall we disable all REST endpoints that update/change cluster state 
(e.g. updateSchedulerConfiguration)?

> Allow disabling app submission from REST endpoints
> --
>
> Key: YARN-10039
> URL: https://issues.apache.org/jira/browse/YARN-10039
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10039.001.patch
>
>
> Introduce a configuration which allows disabling /apps/new-application and 
> /apps POST endpoints. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label

2019-09-23 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936262#comment-16936262
 ] 

Haibo Chen commented on YARN-9730:
--

I see. Thanks for the clarification, [~jhung]. There are some minor conflicts 
with the 02 patch.  Jenkins build should be able to verify that change.  +1 on 
the 02 patch pending Jenkins.

> Support forcing configured partitions to be exclusive based on app node label
> -
>
> Key: YARN-9730
> URL: https://issues.apache.org/jira/browse/YARN-9730
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9730.001.patch, YARN-9730.002.patch
>
>
> Use case: queue X has all of its workload in a non-default (exclusive) 
> partition P (by setting the app submission context's node label to P). A node 
> in partition Q != P heartbeats to the RM. The Capacity Scheduler loops through 
> every application in X, and every scheduler key in this application, and fails 
> to allocate each time since the app's requested label and the node's label 
> don't match. This causes huge performance degradation when the number of apps 
> in X is large.
> To fix the issue, allow the RM to configure partitions as "forced-exclusive". 
> If partition P is "forced-exclusive", then:
>  * 1a. If an app sets its submission context's node label to P, all its 
> resource requests will be overridden to P.
>  * 1b. If an app sets its submission context's node label to Q, any of its 
> resource requests whose labels are P will be overridden to Q.
>  * 2. In the scheduler, we add apps with node label expression P to a separate 
> data structure. When a node in partition P heartbeats to the scheduler, we 
> only try to schedule apps in this data structure. When a node in partition Q 
> heartbeats to the scheduler, we schedule the rest of the apps as normal.
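
A small sketch of rules 1a/1b above, under the assumption that the 
forced-exclusive partitions are known as a set of label names. The class and 
method names are illustrative only, not the patch's actual code.

{noformat}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ForcedExclusiveLabelSketch {

  // Returns the label a resource request should effectively use, given the
  // app's submission-context label and the set of forced-exclusive partitions.
  static String effectiveRequestLabel(String appLabel, String requestLabel,
      Set<String> forcedExclusive) {
    if (forcedExclusive.contains(appLabel)) {
      // Rule 1a: the app targets forced-exclusive P, so every request goes to P.
      return appLabel;
    }
    if (forcedExclusive.contains(requestLabel)) {
      // Rule 1b: the request targets forced-exclusive P but the app does not,
      // so the request is overridden to the app's own label Q.
      return appLabel;
    }
    return requestLabel;
  }

  public static void main(String[] args) {
    Set<String> forced = new HashSet<>(Arrays.asList("P"));
    System.out.println(effectiveRequestLabel("P", "default", forced)); // P (1a)
    System.out.println(effectiveRequestLabel("Q", "P", forced));       // Q (1b)
    System.out.println(effectiveRequestLabel("Q", "R", forced));       // R
  }
}
{noformat}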



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9824) Fall back to configured queue ordering policy class name

2019-09-10 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926885#comment-16926885
 ] 

Haibo Chen commented on YARN-9824:
--

The 001 patch looks good to me. +1 from me.

> Fall back to configured queue ordering policy class name
> 
>
> Key: YARN-9824
> URL: https://issues.apache.org/jira/browse/YARN-9824
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9824.001.patch
>
>
> Currently this is how configured queue ordering policy is determined:
> {noformat}
> if (policyType.trim().equals(QUEUE_UTILIZATION_ORDERING_POLICY)) {
>   // Doesn't respect priority
>   qop = new PriorityUtilizationQueueOrderingPolicy(false);
> } else if (policyType.trim().equals(
> QUEUE_PRIORITY_UTILIZATION_ORDERING_POLICY)) {
>   qop = new PriorityUtilizationQueueOrderingPolicy(true);
> } else {
>   String message =
>   "Unable to construct queue ordering policy=" + policyType + " queue="
>   + queue;
>   throw new YarnRuntimeException(message);
> } {noformat}
> If we want to enable a policy which is not QUEUE_UTILIZATION_ORDERING_POLICY 
> or QUEUE_PRIORITY_UTILIZATION_ORDERING_POLICY, it requires a code change here 
> to add a keyword for that policy.
> It'd be easier if the admin could configure a class name here instead of 
> requiring a keyword.
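
One way to read the proposal, sketched against the snippet quoted above and 
reusing its types and constants, so this is an illustrative fragment rather 
than the actual patch: keep the two keywords and fall back to loading the 
configured value as a fully qualified class name.

{noformat}
private QueueOrderingPolicy createPolicy(String policyType, String queue) {
  String type = policyType.trim();
  if (type.equals(QUEUE_UTILIZATION_ORDERING_POLICY)) {
    // Doesn't respect priority
    return new PriorityUtilizationQueueOrderingPolicy(false);
  } else if (type.equals(QUEUE_PRIORITY_UTILIZATION_ORDERING_POLICY)) {
    return new PriorityUtilizationQueueOrderingPolicy(true);
  }
  // Fallback: treat the configured value as a class name and instantiate it.
  try {
    Class<? extends QueueOrderingPolicy> clazz =
        Class.forName(type).asSubclass(QueueOrderingPolicy.class);
    return clazz.getDeclaredConstructor().newInstance();
  } catch (ReflectiveOperationException | ClassCastException e) {
    String message = "Unable to construct queue ordering policy=" + policyType
        + " queue=" + queue;
    throw new YarnRuntimeException(message);
  }
}
{noformat}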



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9806) TestNMSimulator#testNMSimulator fails in branch-2

2019-09-03 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921616#comment-16921616
 ] 

Haibo Chen commented on YARN-9806:
--

+1 on the patch. Thanks for the fix, [~jhung]!

> TestNMSimulator#testNMSimulator fails in branch-2
> -
>
> Key: YARN-9806
> URL: https://issues.apache.org/jira/browse/YARN-9806
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9806-branch-2.001.patch
>
>
> {noformat}java.lang.AssertionError: expected:<10240> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.sls.nodemanager.TestNMSimulator.testNMSimulator(TestNMSimulator.java:92)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at org.junit.runners.Suite.runChild(Suite.java:127)
>   at org.junit.runners.Suite.runChild(Suite.java:26)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:379)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:340)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:413){noformat}
> This appears fixed in YARN-7929. We only need the bit in TestNMSimulator 
> though. This jira is to track getting this bit in branch-2.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label

2019-08-28 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918149#comment-16918149
 ] 

Haibo Chen commented on YARN-9730:
--

Thanks for the patch, [~jhung]! Trying to understand the enforced-exclusive 
partition concept: is a partition exclusive in the sense that only applications 
with their appSubmissionContext node label set to that partition will have 
access to the resources within, and only within, that partition (and apps 
without the partition as their appSubmissionContext node label will not be 
given access)?

Is the newly introduced SchedulerAppAttempt.nodeLabelExpression the same as 
SchedulerAppAttempt.appAMNodePartitionName? If so, we can reuse 
appAMNodePartition. The notion of a node label expression for an app would 
probably not make much sense for apps that are not submitted to an enforced 
partition, because they can span multiple partitions.

 

> Support forcing configured partitions to be exclusive based on app node label
> -
>
> Key: YARN-9730
> URL: https://issues.apache.org/jira/browse/YARN-9730
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9730.001.patch, YARN-9730.002.patch
>
>
> Use case: queue X has all of its workload in a non-default (exclusive) 
> partition P (by setting the app submission context's node label to P). A node 
> in partition Q != P heartbeats to the RM. The Capacity Scheduler loops through 
> every application in X, and every scheduler key in this application, and fails 
> to allocate each time since the app's requested label and the node's label 
> don't match. This causes huge performance degradation when the number of apps 
> in X is large.
> To fix the issue, allow the RM to configure partitions as "forced-exclusive". 
> If partition P is "forced-exclusive", then:
>  * 1a. If an app sets its submission context's node label to P, all its 
> resource requests will be overridden to P.
>  * 1b. If an app sets its submission context's node label to Q, any of its 
> resource requests whose labels are P will be overridden to Q.
>  * 2. In the scheduler, we add apps with node label expression P to a separate 
> data structure. When a node in partition P heartbeats to the scheduler, we 
> only try to schedule apps in this data structure. When a node in partition Q 
> heartbeats to the scheduler, we schedule the rest of the apps as normal.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9770) Create a queue ordering policy which picks child queues with equal probability

2019-08-27 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917058#comment-16917058
 ] 

Haibo Chen commented on YARN-9770:
--

Thanks [~jhung] for the patch. The patch looks good to me overall. I have two 
minor comments.

1) Can we rename FairQueueOrderingPolicy to RandomQueueOrderingPolicy to reduce 
cognitive load, as the notion of fairness is already used in FairScheduler with 
a different meaning?

2) In the constructor of RandomIterator, given that we assume the swap operation 
is efficient and we only ever pass in an ArrayList, how about we restrict the 
parameter type to ArrayList?

The checkstyle issue can also be addressed.
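
For reference, a generic sketch of the RandomIterator idea discussed in comment 
2: iterate an ArrayList in uniformly random order using O(1) in-place swaps 
(a Fisher-Yates style walk). This is illustrative only, not the code in the 
patch.

{noformat}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Random;

public class RandomIteratorSketch<T> implements Iterator<T> {
  private final ArrayList<T> items;   // restricted to ArrayList so swaps are O(1)
  private final Random random = new Random();
  private int remaining;

  public RandomIteratorSketch(ArrayList<T> items) {
    this.items = items;
    this.remaining = items.size();
  }

  @Override
  public boolean hasNext() {
    return remaining > 0;
  }

  @Override
  public T next() {
    if (remaining == 0) {
      throw new NoSuchElementException();
    }
    int pick = random.nextInt(remaining);
    // Fisher-Yates step: swap the picked element into the consumed tail.
    T value = items.get(pick);
    items.set(pick, items.get(remaining - 1));
    items.set(remaining - 1, value);
    remaining--;
    return value;
  }
}
{noformat}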

> Create a queue ordering policy which picks child queues with equal probability
> --
>
> Key: YARN-9770
> URL: https://issues.apache.org/jira/browse/YARN-9770
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9770.001.patch, YARN-9770.002.patch
>
>
> Ran some simulations with the default queue_utilization_ordering_policy:
> An underutilized queue which receives an application with many (thousands of) 
> resource requests will hog scheduler allocations for a long time (on the order 
> of a minute). In the meantime, apps are getting submitted to all other queues, 
> which increases activeUsers in those queues and drops their user limit to 
> small values if minimum-user-limit-percent is configured to a small value 
> (e.g. 10%).
> To avoid this issue, we assign to queues with equal probability, so that no 
> queue goes without allocations for a long time.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9438) launchTime not written to state store for running applications

2019-08-27 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917051#comment-16917051
 ] 

Haibo Chen commented on YARN-9438:
--

+1 on the latest 004 patch

> launchTime not written to state store for running applications
> --
>
> Key: YARN-9438
> URL: https://issues.apache.org/jira/browse/YARN-9438
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9438-branch-2.001.patch, 
> YARN-9438-branch-2.002.patch, YARN-9438.001.patch, YARN-9438.002.patch, 
> YARN-9438.003.patch, YARN-9438.004.patch
>
>
> launchTime is only saved to the state store after the application finishes, 
> so if a restart happens, any running applications will have launchTime set to 
> -1 (since this is the default timestamp of the recovery event).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9438) launchTime not written to state store for running applications

2019-08-20 Thread Haibo Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911679#comment-16911679
 ] 

Haibo Chen commented on YARN-9438:
--

Thanks for the patch, [~jhung]. I have one question on the patch. The app launch 
time is saved in the state store synchronously in the RMAppImpl state transition 
when an attempt is launched. This will block the RM dispatcher thread, which may 
not be ideal. Do you think we can do it asynchronously instead?
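
A minimal sketch of the asynchronous alternative, assuming a dedicated executor 
is acceptable. The class and method names are hypothetical; this is not the 
RM's actual state-store API.

{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncLaunchTimeWriterSketch {
  private final ExecutorService storeExecutor =
      Executors.newSingleThreadExecutor();

  // Called from the RMAppImpl state transition; returns immediately so the
  // RM dispatcher thread is not blocked on the state-store write.
  public void persistLaunchTimeAsync(String appId, long launchTime) {
    storeExecutor.submit(() -> {
      // In the real RM this would be a state-store update for the app.
      System.out.printf("persisting launchTime=%d for %s%n", launchTime, appId);
    });
  }
}
{noformat}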

> launchTime not written to state store for running applications
> --
>
> Key: YARN-9438
> URL: https://issues.apache.org/jira/browse/YARN-9438
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9438-branch-2.001.patch, 
> YARN-9438-branch-2.002.patch, YARN-9438.001.patch, YARN-9438.002.patch, 
> YARN-9438.003.patch
>
>
> launchTime is only saved to the state store after the application finishes, 
> so if a restart happens, any running applications will have launchTime set to 
> -1 (since this is the default timestamp of the recovery event).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9559) Create AbstractContainersLauncher for pluggable ContainersLauncher logic

2019-08-06 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9559:
-
Fix Version/s: 3.3.0

> Create AbstractContainersLauncher for pluggable ContainersLauncher logic
> 
>
> Key: YARN-9559
> URL: https://issues.apache.org/jira/browse/YARN-9559
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9559.001.patch, YARN-9559.002.patch, 
> YARN-9559.003.patch, YARN-9559.004.patch, YARN-9559.005.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9559) Create AbstractContainersLauncher for pluggable ContainersLauncher logic

2019-08-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901450#comment-16901450
 ] 

Haibo Chen commented on YARN-9559:
--

The unit test failure is reported at YARN-5857, independent of the change here. 
Committing to trunk soon.

> Create AbstractContainersLauncher for pluggable ContainersLauncher logic
> 
>
> Key: YARN-9559
> URL: https://issues.apache.org/jira/browse/YARN-9559
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9559.001.patch, YARN-9559.002.patch, 
> YARN-9559.003.patch, YARN-9559.004.patch, YARN-9559.005.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9559) Create AbstractContainersLauncher for pluggable ContainersLauncher logic

2019-08-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901373#comment-16901373
 ] 

Haibo Chen commented on YARN-9559:
--

+1 on the latest patch pending Jenkins.

> Create AbstractContainersLauncher for pluggable ContainersLauncher logic
> 
>
> Key: YARN-9559
> URL: https://issues.apache.org/jira/browse/YARN-9559
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9559.001.patch, YARN-9559.002.patch, 
> YARN-9559.003.patch, YARN-9559.004.patch, YARN-9559.005.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup

2019-07-19 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889074#comment-16889074
 ] 

Haibo Chen commented on YARN-9668:
--

+1 on the latest branch-2 patch.

> UGI conf doesn't read user overridden configurations on RM and NM startup
> -
>
> Key: YARN-9668
> URL: https://issues.apache.org/jira/browse/YARN-9668
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9668-branch-2.001.patch, 
> YARN-9668-branch-2.002.patch, YARN-9668-branch-3.2.001.patch, 
> YARN-9668.001.patch, YARN-9668.002.patch, YARN-9668.003.patch
>
>
> Similar to HADOOP-15150. Configs overridden through e.g. -D or -conf are not 
> passed to the UGI conf on RM or NM startup.
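
The general pattern implied by the fix, sketched below under the assumption 
that startup parses generic options before initializing security; this is not 
the exact RM/NM startup code.

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.GenericOptionsParser;

public class UgiConfStartupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // GenericOptionsParser applies -D key=value and -conf <file> overrides.
    conf = new GenericOptionsParser(conf, args).getConfiguration();
    // Without this call, UGI lazily creates its own Configuration and never
    // sees the overrides -- the behavior described in this JIRA.
    UserGroupInformation.setConfiguration(conf);
    System.out.println("hadoop.security.authentication = "
        + conf.get("hadoop.security.authentication"));
  }
}
{noformat}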



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9646) DistributedShell tests failed to bind to a local host name

2019-07-16 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886578#comment-16886578
 ] 

Haibo Chen commented on YARN-9646:
--

[~HappyRay] I have fixed the minor import issues along with my commit. It has 
now been merged to trunk. Thanks for your contribution!

> DistributedShell tests failed to bind to a local host name
> --
>
> Key: YARN-9646
> URL: https://issues.apache.org/jira/browse/YARN-9646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.4
>Reporter: Ray Yang
>Assignee: Ray Yang
>Priority: Major
> Attachments: YARN-9646.00.patch
>
>
> When running the integration test 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell#testDSShellWithoutDomain
> at home
> The following error happened:
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
>  
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:327)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:99)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:447)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setupInternal(TestDistributedShell.java:91)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setup(TestDistributedShell.java:71)
> …
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
> at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
> at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
> at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.*ResourceTrackerService.serviceStart*(ResourceTrackerService.java:163)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:976)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1017)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1013)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1013)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1053)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:319)
> ... 31 more
> Caused by: java.net.BindException: Problem binding to 
> [ruyang-mn3.linkedin.biz:0]java.net.BindException: Can't assign requested 
> address; For more details see:  [http://wiki.apache.org/hadoop/BindException]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:721)
> at 

[jira] [Updated] (YARN-9646) DistributedShell tests failed to bind to a local host name

2019-07-16 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9646:
-
Summary: DistributedShell tests failed to bind to a local host name  (was: 
Yarn miniYarn cluster tests failed to bind to a local host name)

> DistributedShell tests failed to bind to a local host name
> --
>
> Key: YARN-9646
> URL: https://issues.apache.org/jira/browse/YARN-9646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.4
>Reporter: Ray Yang
>Assignee: Ray Yang
>Priority: Major
> Attachments: YARN-9646.00.patch
>
>
> When running the integration test 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell#testDSShellWithoutDomain
> at home
> The following error happened:
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
>  
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:327)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:99)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:447)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setupInternal(TestDistributedShell.java:91)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setup(TestDistributedShell.java:71)
> …
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
> at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
> at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
> at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.*ResourceTrackerService.serviceStart*(ResourceTrackerService.java:163)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:976)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1017)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1013)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1013)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1053)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:319)
> ... 31 more
> Caused by: java.net.BindException: Problem binding to 
> [ruyang-mn3.linkedin.biz:0]java.net.BindException: Can't assign requested 
> address; For more details see:  [http://wiki.apache.org/hadoop/BindException]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:721)
> at org.apache.hadoop.ipc.Server.bind(Server.java:494)
> at 

[jira] [Commented] (YARN-9646) Yarn miniYarn cluster tests failed to bind to a local host name

2019-07-15 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885353#comment-16885353
 ] 

Haibo Chen commented on YARN-9646:
--

Thanks [~ste...@apache.org] for the clarification. Agreed that the 
MiniYARNCluster is fussy, as I have seen other issues with it in the past. I 
believe this change will be an improvement at least. +1 on the change pending 
the Jenkins report, given it has been a few weeks since it was submitted. 

I have attached the patch from the git pull request to trigger the Jenkins 
build.

> Yarn miniYarn cluster tests failed to bind to a local host name
> ---
>
> Key: YARN-9646
> URL: https://issues.apache.org/jira/browse/YARN-9646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.4
>Reporter: Ray Yang
>Assignee: Ray Yang
>Priority: Major
> Attachments: YARN-9646.00.patch
>
>
> When running the integration test 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell#testDSShellWithoutDomain
> at home
> The following error happened:
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
>  
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:327)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:99)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:447)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setupInternal(TestDistributedShell.java:91)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setup(TestDistributedShell.java:71)
> …
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
> at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
> at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
> at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.*ResourceTrackerService.serviceStart*(ResourceTrackerService.java:163)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:976)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1017)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1013)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1013)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1053)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:319)
> ... 31 more
> Caused by: java.net.BindException: Problem binding to 
> [ruyang-mn3.linkedin.biz:0]java.net.BindException: Can't assign requested 
> address; For more details see:  [http://wiki.apache.org/hadoop/BindException]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> 

[jira] [Updated] (YARN-9646) Yarn miniYarn cluster tests failed to bind to a local host name

2019-07-15 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9646:
-
Attachment: YARN-9646.00.patch

> Yarn miniYarn cluster tests failed to bind to a local host name
> ---
>
> Key: YARN-9646
> URL: https://issues.apache.org/jira/browse/YARN-9646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.4
>Reporter: Ray Yang
>Assignee: Ray Yang
>Priority: Major
> Attachments: YARN-9646.00.patch
>
>
> When running the integration test 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell#testDSShellWithoutDomain
> at home
> The following error happened:
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
>  
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:327)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:99)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:447)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setupInternal(TestDistributedShell.java:91)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setup(TestDistributedShell.java:71)
> …
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
> at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
> at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
> at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.*ResourceTrackerService.serviceStart*(ResourceTrackerService.java:163)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:976)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1017)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1013)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1013)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1053)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:319)
> ... 31 more
> Caused by: java.net.BindException: Problem binding to 
> [ruyang-mn3.linkedin.biz:0]java.net.BindException: Can't assign requested 
> address; For more details see:  [http://wiki.apache.org/hadoop/BindException]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:721)
> at org.apache.hadoop.ipc.Server.bind(Server.java:494)
> at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:715)
> at 

[jira] [Commented] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup

2019-07-11 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883374#comment-16883374
 ] 

Haibo Chen commented on YARN-9668:
--

[~jhung] Do you intend to fix this for branch-2 as well? The branch-3.2 patch does 
not apply cleanly to branch-2.

> UGI conf doesn't read user overridden configurations on RM and NM startup
> -
>
> Key: YARN-9668
> URL: https://issues.apache.org/jira/browse/YARN-9668
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9668-branch-3.2.001.patch, YARN-9668.001.patch, 
> YARN-9668.002.patch, YARN-9668.003.patch
>
>
> Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not 
> passed to the UGI conf on RM or NM startup.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup

2019-07-11 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883326#comment-16883326
 ] 

Haibo Chen commented on YARN-9668:
--

+1 on the latest patch. Committing it shortly.

> UGI conf doesn't read user overridden configurations on RM and NM startup
> -
>
> Key: YARN-9668
> URL: https://issues.apache.org/jira/browse/YARN-9668
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9668.001.patch, YARN-9668.002.patch, 
> YARN-9668.003.patch
>
>
> Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not 
> passed to the UGI conf on RM or NM startup.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9646) Yarn miniYarn cluster tests failed to bind to a local host name

2019-07-09 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881455#comment-16881455
 ] 

Haibo Chen commented on YARN-9646:
--

[~ste...@apache.org] Do you still have any concerns?

> Yarn miniYarn cluster tests failed to bind to a local host name
> ---
>
> Key: YARN-9646
> URL: https://issues.apache.org/jira/browse/YARN-9646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.4
>Reporter: Ray Yang
>Assignee: Ray Yang
>Priority: Major
>
> When running the integration test 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell#testDSShellWithoutDomain
> at home
> The following error happened:
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
>  
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:327)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:99)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:447)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setupInternal(TestDistributedShell.java:91)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.setup(TestDistributedShell.java:71)
> …
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.BindException: Problem binding to [ruyang-mn3.linkedin.biz:0] 
> java.net.BindException: Can't assign requested address; For more details see: 
>  [http://wiki.apache.org/hadoop/BindException]
> at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
> at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
> at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.*ResourceTrackerService.serviceStart*(ResourceTrackerService.java:163)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:976)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1017)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1013)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1013)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1053)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:319)
> ... 31 more
> Caused by: java.net.BindException: Problem binding to 
> [ruyang-mn3.linkedin.biz:0]java.net.BindException: Can't assign requested 
> address; For more details see:  [http://wiki.apache.org/hadoop/BindException]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:721)
> at org.apache.hadoop.ipc.Server.bind(Server.java:494)
> at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:715)
> at 

[jira] [Commented] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup

2019-07-08 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880717#comment-16880717
 ] 

Haibo Chen commented on YARN-9668:
--

+1 pending Jenkins.

> UGI conf doesn't read user overridden configurations on RM and NM startup
> -
>
> Key: YARN-9668
> URL: https://issues.apache.org/jira/browse/YARN-9668
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9668.001.patch, YARN-9668.002.patch
>
>
> Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not 
> passed to the UGI conf on RM or NM startup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup

2019-07-08 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880618#comment-16880618
 ] 

Haibo Chen commented on YARN-9668:
--

Thanks for the patch, [~jhung]. I have a minor comment.

Can we use ExpectedException instead? nodemanager.stop() can then be wrapped in a 
finally block.
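
A minimal sketch of that structure, for illustration only (JUnit 4; the test name, 
exception type and configuration setup are placeholders, not taken from the patch):
{code:java}
// Hypothetical shape of the suggested test structure.
@Rule
public ExpectedException expected = ExpectedException.none();

@Test
public void testStartupWithOverriddenConf() throws Exception {
  Configuration conf = new YarnConfiguration();
  NodeManager nodeManager = new NodeManager();
  try {
    // The rule verifies that the expected exception is eventually thrown.
    expected.expect(YarnRuntimeException.class);
    nodeManager.init(conf);
    nodeManager.start();
  } finally {
    // Always stop the NM, whether or not the expected exception was thrown.
    nodeManager.stop();
  }
}
{code}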

> UGI conf doesn't read user overridden configurations on RM and NM startup
> -
>
> Key: YARN-9668
> URL: https://issues.apache.org/jira/browse/YARN-9668
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9668.001.patch
>
>
> Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not 
> passed to the UGI conf on RM or NM startup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2019-05-14 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839715#comment-16839715
 ] 

Haibo Chen commented on YARN-2194:
--

+1 on the patch pending Jenkins.

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch, 
> YARN-2194-4.patch, YARN-2194-5.patch, YARN-2194-6.patch, YARN-2194-7.patch, 
> YARN-2194-branch-2.7.001.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9529) Log correct cpu controller path on error while initializing CGroups.

2019-05-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834122#comment-16834122
 ] 

Haibo Chen commented on YARN-9529:
--

+1 on the patch. Committed to trunk, branch-3.2/1/0.x and branch-2.

> Log correct cpu controller path on error while initializing CGroups.
> 
>
> Key: YARN-9529
> URL: https://issues.apache.org/jira/browse/YARN-9529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: 2.10.0, 3.3.0
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.1.3, 3.2.2
>
> Attachments: YARN-9529.001.patch
>
>
> The base cpu controller path is logged instead of the hadoop cgroup path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9529) Log correct cpu controller path on error

2019-05-06 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9529:
-
Affects Version/s: 3.2.0

> Log correct cpu controller path on error
> 
>
> Key: YARN-9529
> URL: https://issues.apache.org/jira/browse/YARN-9529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: 2.10.0
> Attachments: YARN-9529.001.patch
>
>
> The base cpu controller path is logged instead of the hadoop cgroup path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9529) Log correct cpu controller path on error

2019-05-06 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9529:
-
Labels: 2.10.0 3.3.0  (was: 2.10.0)

> Log correct cpu controller path on error
> 
>
> Key: YARN-9529
> URL: https://issues.apache.org/jira/browse/YARN-9529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: 2.10.0, 3.3.0
> Attachments: YARN-9529.001.patch
>
>
> The base cpu controller path is logged instead of the hadoop cgroup path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9529) Log correct cpu controller path on error while initializing CGroups.

2019-05-06 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9529:
-
Summary: Log correct cpu controller path on error while initializing 
CGroups.  (was: Log correct cpu controller path on error)

> Log correct cpu controller path on error while initializing CGroups.
> 
>
> Key: YARN-9529
> URL: https://issues.apache.org/jira/browse/YARN-9529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: 2.10.0, 3.3.0
> Attachments: YARN-9529.001.patch
>
>
> The base cpu controller path is logged instead of the hadoop cgroup path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9529) Log correct cpu controller path on error

2019-05-06 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9529:
-
Component/s: nodemanager

> Log correct cpu controller path on error
> 
>
> Key: YARN-9529
> URL: https://issues.apache.org/jira/browse/YARN-9529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: 2.10.0
> Attachments: YARN-9529.001.patch
>
>
> The base cpu controller path is logged instead of the hadoop cgroup path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9529) Log correct cpu controller path on error

2019-05-06 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9529:
-
Affects Version/s: 2.10.0

> Log correct cpu controller path on error
> 
>
> Key: YARN-9529
> URL: https://issues.apache.org/jira/browse/YARN-9529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9529.001.patch
>
>
> The base cpu controller path is logged instead of the hadoop cgroup path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9529) Log correct cpu controller path on error

2019-05-06 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9529:
-
Labels: 2.10.0  (was: )

> Log correct cpu controller path on error
> 
>
> Key: YARN-9529
> URL: https://issues.apache.org/jira/browse/YARN-9529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: 2.10.0
> Attachments: YARN-9529.001.patch
>
>
> The base cpu controller path is logged instead of the hadoop cgroup path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9111) NM crashes because Fair scheduler promotes a container that has not been pulled by AM

2018-12-11 Thread Haibo Chen (JIRA)
Haibo Chen created YARN-9111:


 Summary: NM crashes because Fair scheduler promotes a container 
that has not been pulled by AM
 Key: YARN-9111
 URL: https://issues.apache.org/jira/browse/YARN-9111
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: fairscheduler, nodemanager
Affects Versions: YARN-1011
Reporter: Haibo Chen


{code:java}
2018-10-19 22:34:35,052 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
Error in dispatcher thread
 java.lang.NullPointerException
 at 
org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:323)
 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.handle(ContainerManagerImpl.java:1649)
 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.handle(ContainerManagerImpl.java:185)
 at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
 at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
 at java.lang.Thread.run(Thread.java:748)
 2018-10-19 22:34:35,054 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Exiting, bbye..
 2018-10-19 22:34:35,059 DEBUG org.apache.hadoop.service.AbstractService: 
Service: NodeManager entered state STOPPED{code}
 

 
When a container is allocated by the RM to an application, its container token is 
not generated until the AM pulls that container from the RM.

However, if the scheduler decides to promote that container before it is pulled 
by the AM, there is no container token to work with.

The current code does not update/generate the container token in that case. When 
the container promotion is sent to the NM to process, the NM crashes with an NPE.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9110) Fair Scheduler promotion does not update container execution type when the application is killed

2018-12-11 Thread Haibo Chen (JIRA)
Haibo Chen created YARN-9110:


 Summary: Fair Scheduler promotion does not update container 
execution type when the application is killed
 Key: YARN-9110
 URL: https://issues.apache.org/jira/browse/YARN-9110
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: fairscheduler
Affects Versions: YARN-1011
Reporter: Haibo Chen


When the application that a container belongs to gets killed right before the 
container is promoted, the resource bookkeeping is updated to reflect the 
promotion, but the container token is not. Hence, when the 
container is released, it is still seen as an OPPORTUNISTIC container. 

We need to make container promotion, including the resource bookkeeping and the 
token update, atomic to avoid this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9008) Extend YARN distributed shell with file localization feature

2018-12-11 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718096#comment-16718096
 ] 

Haibo Chen commented on YARN-9008:
--

+1 on the latest patch. Checking it in shortly.

> Extend YARN distributed shell with file localization feature
> 
>
> Key: YARN-9008
> URL: https://issues.apache.org/jira/browse/YARN-9008
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9008-001.patch, YARN-9008-002.patch, 
> YARN-9008-003.patch, YARN-9008-004.patch, YARN-9008-005.patch, 
> YARN-9008-006.patch, YARN-9008-007.patch
>
>
> YARN distributed shell is a very handy tool to test various features of YARN.
> However, it lacks support for file localization - that is, you define files 
> in the command line that you wish to be localized remotely. This can be 
> extremely useful in certain scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9051) Integrate multiple CustomResourceTypesConfigurationProvider implementations into one

2018-12-11 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717858#comment-16717858
 ] 

Haibo Chen commented on YARN-9051:
--

The unit test failures and license issue are unrelated. +1 on the latest patch 
and checking it in shortly.

> Integrate multiple CustomResourceTypesConfigurationProvider implementations 
> into one
> 
>
> Key: YARN-9051
> URL: https://issues.apache.org/jira/browse/YARN-9051
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-9051.001.patch, YARN-9051.002.patch, 
> YARN-9051.003.patch
>
>
> CustomResourceTypesConfigurationProvider (extends LocalConfigurationProvider) 
> has 5 implementations on trunk nowadays.
> These could be integrated into 1 common class.
> Also, 
> {{org.apache.hadoop.yarn.util.resource.TestResourceUtils#addNewTypesToResources}}
>  has similar functionality so this can be considered as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9008) Extend YARN distributed shell with file localization feature

2018-12-10 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715459#comment-16715459
 ] 

Haibo Chen commented on YARN-9008:
--

{quote}I took that from {{Client.java}}, where it's also called {{appname}}. 
Shall I re-name it anyway?
{quote}
 Let's keep it then.  I was not aware of that.
{quote}I think "--lib" would imply that we deal with jar files. Since it's a 
somewhat generic YARN application
{quote}
Fair enough.
{quote}Unfortunately that piece of code is located in a {{forEach}} lambda and 
a {{run()}} method which cannot be declared to throw {{IOException}}.
{quote}
I see, that makes sense. But we could do better with UncheckedIOException, which 
is more specific. I think we can also get rid of the RuntimeException outside of 
the lambda in ApplicationMaster.java:
{code:java}
    FileSystem fs;
    try {
  fs = FileSystem.get(conf);
    } catch (IOException e) {
  throw new RuntimeException("Cannot get FileSystem", e);
    }
{code}
IllegalArgumentException does not sound like a good fit in Client.java when files 
are not readable or do not exist. UncheckedIOException can help there too.
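
For illustration only, the same snippet rewritten with java.io.UncheckedIOException 
(not the actual patch):
{code:java}
FileSystem fs;
try {
  fs = FileSystem.get(conf);
} catch (IOException e) {
  // Keeps the IO nature of the failure visible without forcing the
  // enclosing lambda or run() method to declare IOException.
  throw new UncheckedIOException("Cannot get FileSystem", e);
}
{code}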

 

Please also address the license issue with the two newly added text files for 
tests.

> Extend YARN distributed shell with file localization feature
> 
>
> Key: YARN-9008
> URL: https://issues.apache.org/jira/browse/YARN-9008
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9008-001.patch, YARN-9008-002.patch, 
> YARN-9008-003.patch, YARN-9008-004.patch, YARN-9008-005.patch
>
>
> YARN distributed shell is a very handy tool to test various features of YARN.
> However, it lacks support for file localization - that is, you define files 
> in the command line that you wish to be localized remotely. This can be 
> extremely useful in certain scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9087) Improve logging for initialization of Resource plugins

2018-12-10 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9087:
-
Fix Version/s: 3.3.0

> Improve logging for initialization of Resource plugins
> --
>
> Key: YARN-9087
> URL: https://issues.apache.org/jira/browse/YARN-9087
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9087.001.patch, YARN-9087.002.patch
>
>
> The patch includes the following enhancements for logging: 
> - Logging the initialization of resource handlers in 
> {{LinuxContainerExecutor#init}}
> - Logging the initialization of resource plugins in 
> {{ResourcePluginManager#initialize}}
> - Added toString to {{ResourceHandlerChain}}
> - Added toString to all subclasses of {{ResourcePlugin}}, 
> as they are printed in {{ResourcePluginManager#initialize}}
> - Added toString to all subclasses of {{ResourceHandler}}, 
> as they are printed as fields in {{LinuxContainerExecutor#init}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9087) Better logging for initialization of Resource plugins

2018-12-10 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715401#comment-16715401
 ] 

Haibo Chen commented on YARN-9087:
--

+1 on the latest patch.

> Better logging for initialization of Resource plugins
> -
>
> Key: YARN-9087
> URL: https://issues.apache.org/jira/browse/YARN-9087
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9087.001.patch, YARN-9087.002.patch
>
>
> The patch includes the following enhancements for logging: 
> - Logging the initialization of resource handlers in 
> {{LinuxContainerExecutor#init}}
> - Logging the initialization of resource plugins in 
> {{ResourcePluginManager#initialize}}
> - Added toString to {{ResourceHandlerChain}}
> - Added toString to all subclasses of {{ResourcePlugin}}, 
> as they are printed in {{ResourcePluginManager#initialize}}
> - Added toString to all subclasses of {{ResourceHandler}}, 
> as they are printed as fields in {{LinuxContainerExecutor#init}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9087) Improve logging for initialization of Resource plugins

2018-12-10 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9087:
-
Summary: Improve logging for initialization of Resource plugins  (was: 
Better logging for initialization of Resource plugins)

> Improve logging for initialization of Resource plugins
> --
>
> Key: YARN-9087
> URL: https://issues.apache.org/jira/browse/YARN-9087
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9087.001.patch, YARN-9087.002.patch
>
>
> The patch includes the following enhancements for logging: 
> - Logging the initialization of resource handlers in 
> {{LinuxContainerExecutor#init}}
> - Logging the initialization of resource plugins in 
> {{ResourcePluginManager#initialize}}
> - Added toString to {{ResourceHandlerChain}}
> - Added toString to all subclasses of {{ResourcePlugin}}, 
> as they are printed in {{ResourcePluginManager#initialize}}
> - Added toString to all subclasses of {{ResourceHandler}}, 
> as they are printed as fields in {{LinuxContainerExecutor#init}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8738) FairScheduler configures maxResources or minResources as negative, the value parse to a positive number.

2018-12-10 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715375#comment-16715375
 ] 

Haibo Chen commented on YARN-8738:
--

+1 on the latest patch. Checking it in shortly.

> FairScheduler configures maxResources or minResources as negative, the value 
> parse to a positive number.
> 
>
> Key: YARN-8738
> URL: https://issues.apache.org/jira/browse/YARN-8738
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Sen Zhao
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8738.001.patch, YARN-8738.002.patch, 
> YARN-8738.003.patch
>
>
> If maxResources or minResources is configured as a negative number, the value 
> will be positive after parsing.
> If this is a problem, I will fix it. If not, the way 
> FairSchedulerConfiguration#parseNewStyleResource parses negative numbers should 
> be made consistent with parseOldStyleResource.
> cc:[~templedf], [~leftnoteasy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8738) FairScheduler should not parse negative maxResources or minResources values as positive

2018-12-10 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-8738:
-
Summary: FairScheduler should not parse negative maxResources or 
minResources values as positive  (was: FairScheduler configures maxResources or 
minResources as negative, the value parse to a positive number.)

> FairScheduler should not parse negative maxResources or minResources values 
> as positive
> ---
>
> Key: YARN-8738
> URL: https://issues.apache.org/jira/browse/YARN-8738
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Sen Zhao
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8738.001.patch, YARN-8738.002.patch, 
> YARN-8738.003.patch
>
>
> If maxResources or minResources is configured as a negative number, the value 
> will be positive after parsing.
> If this is a problem, I will fix it. If not, the way 
> FairSchedulerConfiguration#parseNewStyleResource parses negative numbers should 
> be made consistent with parseOldStyleResource.
> cc:[~templedf], [~leftnoteasy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9008) Extend YARN distributed shell with file localization feature

2018-12-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712143#comment-16712143
 ] 

Haibo Chen commented on YARN-9008:
--

Thanks [~pbacsko] for the patch!  A few minor comments

1) We are missing one unit test for uploading a non-existent file and one for a 
directory.

2) The new command-line option 'appname' should probably be renamed to 
'app_name' for the sake of consistency with the other options.

3) All IOExceptions are wrapped in a RuntimeException, but I am not sure what 
benefit that provides over just directly throwing IOException.

4) I notice 2.9.1 is included in the affected versions. Do you intend to backport 
this into branch-2? If so, we should not use the Stream API, which is only 
supported in Java 8.

5) The relative path of a file is composed of the app_name, the appId and the file 
name. We have two copies of the same code, in ApplicationMaster and in Client. 
If only one copy is changed in the future, the feature would break. Can we 
centralize this in one place? (See the sketch after this list.)

6) 'localized_files' leans very much toward the implementation details. The 
MapReduce job client can add lib files at submission time, which are under the 
hood uploaded to HDFS and localized for access. We have almost the same idea 
here. What do you think of renaming it to 'lib'?
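
For point 5, a hypothetical shared helper could look like this (class, method and 
parameter names are made up for illustration, not from the patch):
{code:java}
// E.g. in a small utility class used by both Client and ApplicationMaster,
// so the remote path layout is defined in exactly one place.
public static String getRelativeUploadPath(String appName, String appId,
    String fileName) {
  return appName + Path.SEPARATOR + appId + Path.SEPARATOR + fileName;
}
{code}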

> Extend YARN distributed shell with file localization feature
> 
>
> Key: YARN-9008
> URL: https://issues.apache.org/jira/browse/YARN-9008
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.9.1, 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9008-001.patch, YARN-9008-002.patch, 
> YARN-9008-003.patch, YARN-9008-004.patch
>
>
> YARN distributed shell is a very handy tool to test various features of YARN.
> However, it lacks support for file localization - that is, you define files 
> in the command line that you wish to be localized remotely. This can be 
> extremely useful in certain scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9087) Better logging for initialization of Resource plugins

2018-12-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712108#comment-16712108
 ] 

Haibo Chen commented on YARN-9087:
--

Thanks [~snemeth] for the patch. A few nits about the patch.

1) Not sure why the logging in ContainerScheduler is removed. I think we should 
keep it. ContainerScheduler will try to bootstrap cgroups if they have not 
been initialized elsewhere.

2) All the toString() methods hard-code the class name. We can use 
XXX.class.getName() instead, in case the class is renamed (see the sketch after 
these comments).

3) IMO, only immutable fields that are initialized in the constructor should be 
included in the toString(). It may be confusing/misleading if the value changes 
later after toString() is called.
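
For 2) and 3), an illustrative toString() along those lines (the class and field 
names are made up, not from the patch):
{code:java}
@Override
public String toString() {
  // Use the class literal instead of a hard-coded string so renames are
  // picked up automatically, and only print constructor-initialized fields.
  return SomeResourcePlugin.class.getName() + "{pluginName=" + pluginName + "}";
}
{code}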

> Better logging for initialization of Resource plugins
> -
>
> Key: YARN-9087
> URL: https://issues.apache.org/jira/browse/YARN-9087
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9087.001.patch
>
>
> The patch includes the following enhancements for logging: 
> - Logging the initialization of resource handlers in 
> {{LinuxContainerExecutor#init}}
> - Logging the initialization of resource plugins in 
> {{ResourcePluginManager#initialize}}
> - Added toString to {{ResourceHandlerChain}}
> - Added toString to all subclasses of {{ResourcePlugin}}, 
> as they are printed in {{ResourcePluginManager#initialize}}
> - Added toString to all subclasses of {{ResourceHandler}}, 
> as they are printed as fields in {{LinuxContainerExecutor#init}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8738) FairScheduler configures maxResources or minResources as negative, the value parse to a positive number.

2018-12-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712039#comment-16712039
 ] 

Haibo Chen commented on YARN-8738:
--

Thanks [~snemeth] for the elaboration. What I meant is that, because only 
findPercentage() throws AllocationConfigurationException per the comment in the 
code, we can throw AllocationConfigurationException for negative values as well, 
and we can update the message passed to createConfigException to say that the 
value must not be negative. Note that in findPercentage() we have different 
messages for different types of issues. The message should be sufficient 
to tell the user what exactly the issue is.
{code:java}
  private static ConfigurableResource parseNewStyleResource(String value,
  long missing) throws AllocationConfigurationException {


} catch (AllocationConfigurationException ex) {
    // This only comes from findPercentage()
    throw createConfigException(value, "The "
    + "resource values must all be percentages. \""
    + resourceValue + "\" is either not a number or does not "
    + "include the '%' symbol.", ex);
  }
    }
    return configurableResource;

  }{code}
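For illustration, a check of that kind could look roughly like this (the variable 
names are placeholders, not the actual patch):
{code:java}
// Inside parseNewStyleResource(), before the percentage handling.
if (parsedValue < 0) {
  throw new AllocationConfigurationException(
      "Invalid resource configuration value: \"" + resourceValue
          + "\" must not be negative.");
}
{code}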

> FairScheduler configures maxResources or minResources as negative, the value 
> parse to a positive number.
> 
>
> Key: YARN-8738
> URL: https://issues.apache.org/jira/browse/YARN-8738
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Sen Zhao
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8738.001.patch, YARN-8738.002.patch
>
>
> If maxResources or minResources is configured as a negative number, the value 
> will be positive after parsing.
> If this is a problem, I will fix it. If not, the way 
> FairSchedulerConfiguration#parseNewStyleResource parses negative numbers should 
> be made consistent with parseOldStyleResource.
> cc:[~templedf], [~leftnoteasy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9035) Allow better troubleshooting of FS container assignments and lack of container assignments

2018-12-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711943#comment-16711943
 ] 

Haibo Chen commented on YARN-9035:
--

Thanks [~snemeth] for the patch. This would be very helpful when it comes to 
debugging scheduler decisions, like you said. 

I think the current approach of creating new objects to represent Assignment or 
Validation results (AMShareLimitCheckResult) is a bit too heavy, given that 
scheduling is executed very often and we should do things efficiently. I am in 
favor of simply doing if (isOverAMShareLimit() && LOG.isDebugEnabled()) \{ 
LOG.debug(...); }
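
For illustration, the lightweight pattern above would look roughly like this (the 
log message and the appAttempt accessor are made up):
{code:java}
if (isOverAMShareLimit() && LOG.isDebugEnabled()) {
  // Guarded debug statement: no extra objects are created unless DEBUG is on.
  LOG.debug("Not assigning a container to " + getApplicationAttemptId()
      + " because the maxAMShare limit would be exceeded");
}
{code}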

Plus, we would need to turn on debug logging for the new classes in order to get 
the debug logs, which is extra work with no gain.

> Allow better troubleshooting of FS container assignments and lack of 
> container assignments
> --
>
> Key: YARN-9035
> URL: https://issues.apache.org/jira/browse/YARN-9035
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9035.001.patch
>
>
> The call chain started from {{FairScheduler.attemptScheduling}}, to 
> {{FSQueue}} (parent / leaf).assignContainer and down to 
> {{FSAppAttempt#assignContainer}} has many calls and has many potential 
> conditions where {{Resources.none()}} can be returned, meaning container is 
> not allocated.
>  A bunch of these empty assignments do not come with a debug log statement, 
> so it's very hard to tell which condition led the {{FairScheduler}} to a 
> decision where containers are not allocated.
>  On top of that, in many places, it's also difficult to tell why a 
> container was allocated to an app attempt.
> The goal is to have a common place (i.e. a class) that does all the 
> logging, so users can conveniently control all the logs if they are curious 
> why (and why not) container assignments happened.
>  Also, it would be handy if readers of the log could easily decide which 
> {{AppAttempt}} is the log record created for, in other words: every log 
> record should include the ID of the application / app attempt, if possible.
>  
> Details of implementation: 
>  As most of the already in-place debug messages were protected by a condition 
> that checks whether the debug level is enabled on loggers, I followed a 
> similar pattern. All the relevant log messages are created with the class 
> {{ResourceAssignment}}. 
>  This class is a wrapper for the assigned {{Resource}} object and has a 
> single logger, so clients should use its helper methods to create log 
> records. There is a helper method called {{shouldLogReservationActivity}} 
> that checks if DEBUG or TRACE level is activated on the logger. 
>  See the javadoc on this class for further information.
>  
> {{ResourceAssignment}} is also responsible for adding the app / app attempt ID 
> to every log record (with some exceptions).
>  A couple of check classes are introduced: they are responsible for running and 
> storing the results of checks that are prerequisites of a successful container 
> allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9035) Allow better troubleshooting of FS container assignments and lack of container assignments

2018-12-06 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711944#comment-16711944
 ] 

Haibo Chen commented on YARN-9035:
--

[~wilfreds] probably has a much better idea of what is preferred from a 
support-ability perspective.

> Allow better troubleshooting of FS container assignments and lack of 
> container assignments
> --
>
> Key: YARN-9035
> URL: https://issues.apache.org/jira/browse/YARN-9035
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9035.001.patch
>
>
> The call chain started from {{FairScheduler.attemptScheduling}}, to 
> {{FSQueue}} (parent / leaf).assignContainer and down to 
> {{FSAppAttempt#assignContainer}} has many calls and has many potential 
> conditions where {{Resources.none()}} can be returned, meaning container is 
> not allocated.
>  A bunch of these empty assignments do not come with a debug log statement, 
> so it's very hard to tell which condition led the {{FairScheduler}} to a 
> decision where containers are not allocated.
>  On top of that, in many places, it's also difficult to tell why a 
> container was allocated to an app attempt.
> The goal is to have a common place (i.e. a class) that does all the 
> logging, so users can conveniently control all the logs if they are curious 
> why (and why not) container assignments happened.
>  Also, it would be handy if readers of the log could easily decide which 
> {{AppAttempt}} is the log record created for, in other words: every log 
> record should include the ID of the application / app attempt, if possible.
>  
> Details of implementation: 
>  As most of the already in-place debug messages were protected by a condition 
> that checks whether the debug level is enabled on loggers, I followed a 
> similar pattern. All the relevant log messages are created with the class 
> {{ResourceAssignment}}. 
>  This class is a wrapper for the assigned {{Resource}} object and has a 
> single logger, so clients should use its helper methods to create log 
> records. There is a helper method called {{shouldLogReservationActivity}} 
> that checks if DEBUG or TRACE level is activated on the logger. 
>  See the javadoc on this class for further information.
>  
> {{ResourceAssignment}} is also responsible for adding the app / app attempt ID 
> to every log record (with some exceptions).
>  A couple of check classes are introduced: they are responsible for running and 
> storing the results of checks that are prerequisites of a successful container 
> allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9025) Make TestFairScheduler#testChildMaxResources more reliable, as it is flaky now

2018-12-05 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710803#comment-16710803
 ] 

Haibo Chen commented on YARN-9025:
--

+1 on the latest patch. Checking it in shortly.

> Make TestFairScheduler#testChildMaxResources more reliable, as it is flaky now
> --
>
> Key: YARN-9025
> URL: https://issues.apache.org/jira/browse/YARN-9025
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9025.001.patch, YARN-9025.002.patch
>
>
> While making the code patch for YARN-8059, I came across a flaky test; see 
> this link: 
> https://builds.apache.org/job/PreCommit-YARN-Build/22412/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> This is the error message: 
> {code:java}
> [ERROR] Tests run: 108, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 19.37 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> [ERROR] 
> testChildMaxResources(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
>  Time elapsed: 0.164 s <<< FAILURE!
> java.lang.AssertionError: App 1 is not running with the correct number of 
> containers expected:<2> but was:<0>
>  at org.junit.Assert.fail(Assert.java:88){code}
> So the thing is, even if we had 8 node updates, due to the nature of how we 
> handle the events, it can happen that no container is allocated for the 
> application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9025) TestFairScheduler#testChildMaxResources is flaky

2018-12-05 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen reassigned YARN-9025:


Assignee: Szilard Nemeth  (was: Haibo Chen)

> TestFairScheduler#testChildMaxResources is flaky
> 
>
> Key: YARN-9025
> URL: https://issues.apache.org/jira/browse/YARN-9025
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9025.001.patch, YARN-9025.002.patch
>
>
> While making the code patch for YARN-8059, I came across a flaky test; see 
> this link: 
> https://builds.apache.org/job/PreCommit-YARN-Build/22412/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> This is the error message: 
> {code:java}
> [ERROR] Tests run: 108, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 19.37 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> [ERROR] 
> testChildMaxResources(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
>  Time elapsed: 0.164 s <<< FAILURE!
> java.lang.AssertionError: App 1 is not running with the correct number of 
> containers expected:<2> but was:<0>
>  at org.junit.Assert.fail(Assert.java:88){code}
> So the thing is, even if we had 8 node updates, due to the nature of how we 
> handle the events, it can happen that no container is allocated for the 
> application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9025) TestFairScheduler#testChildMaxResources is flaky

2018-12-05 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen reassigned YARN-9025:


Assignee: Haibo Chen  (was: Szilard Nemeth)

> TestFairScheduler#testChildMaxResources is flaky
> 
>
> Key: YARN-9025
> URL: https://issues.apache.org/jira/browse/YARN-9025
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Szilard Nemeth
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-9025.001.patch, YARN-9025.002.patch
>
>
> While making the code patch for YARN-8059, I came across a flaky test; see 
> this link: 
> https://builds.apache.org/job/PreCommit-YARN-Build/22412/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> This is the error message: 
> {code:java}
> [ERROR] Tests run: 108, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 19.37 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> [ERROR] 
> testChildMaxResources(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
>  Time elapsed: 0.164 s <<< FAILURE!
> java.lang.AssertionError: App 1 is not running with the correct number of 
> containers expected:<2> but was:<0>
>  at org.junit.Assert.fail(Assert.java:88){code}
> So the thing is, even if we had 8 node updates, due to the nature of how we 
> handle the events, it can happen that no container is allocated for the 
> application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9025) TestFairScheduler#testChildMaxResources is flaky

2018-12-05 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9025:
-
Summary: TestFairScheduler#testChildMaxResources is flaky  (was: Make 
TestFairScheduler#testChildMaxResources more reliable, as it is flaky now)

> TestFairScheduler#testChildMaxResources is flaky
> 
>
> Key: YARN-9025
> URL: https://issues.apache.org/jira/browse/YARN-9025
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9025.001.patch, YARN-9025.002.patch
>
>
> While making the code patch for YARN-8059, I came across a flaky test; see 
> this link: 
> https://builds.apache.org/job/PreCommit-YARN-Build/22412/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> This is the error message: 
> {code:java}
> [ERROR] Tests run: 108, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 19.37 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> [ERROR] 
> testChildMaxResources(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
>  Time elapsed: 0.164 s <<< FAILURE!
> java.lang.AssertionError: App 1 is not running with the correct number of 
> containers expected:<2> but was:<0>
>  at org.junit.Assert.fail(Assert.java:88){code}
> So the thing is, even if we had 8 node updates, due to the nature of how we 
> handle the events, it can happen that no container is allocated for the 
> application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9051) Integrate multiple CustomResourceTypesConfigurationProvider implementations into one

2018-12-05 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710796#comment-16710796
 ] 

Haibo Chen commented on YARN-9051:
--

Thanks [~snemeth] for the patch. Looks good overall. Some nits:

1) TestTaskAttempt and TestSchedulerUtils.java have some shuffling of import 
statements that is unrelated to this change. Let's revert those changes. (If 
your IDE does this automatically, make sure to turn it off.)

2) CustomResourceTypesConfigurationProvider.DEFAULT_UNIT is misleading because 
'k' is not always used as the unit when no unit is provided; e.g. in public 
static void initResourceTypes(String... resourceTypes), the unit is left 
empty.

3) Given that CustomResourceTypesConfigurationProvider now calls ResourceUtils 
to set or reset resource types globally, it is no longer just an implementation 
of LocalConfigurationProvider. I think we can rename it to 
CustomResourceTypeConfigurationUtils.

> Integrate multiple CustomResourceTypesConfigurationProvider implementations 
> into one
> 
>
> Key: YARN-9051
> URL: https://issues.apache.org/jira/browse/YARN-9051
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-9051.001.patch
>
>
> CustomResourceTypesConfigurationProvider (extends LocalConfigurationProvider) 
> currently has 5 implementations on trunk.
> These could be integrated into one common class.
> Also, 
> {{org.apache.hadoop.yarn.util.resource.TestResourceUtils#addNewTypesToResources}}
>  has similar functionality so this can be considered as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9019) Ratio calculation of ResourceCalculator implementations could return NaN

2018-12-05 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710731#comment-16710731
 ] 

Haibo Chen commented on YARN-9019:
--

+1, checking in shortly.

> Ratio calculation of ResourceCalculator implementations could return NaN
> 
>
> Key: YARN-9019
> URL: https://issues.apache.org/jira/browse/YARN-9019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9019.001.patch
>
>
> Found out that ResourceCalculator.ratio (with implementors 
> DefaultResourceCalculator and DominantResourceCalculator) can produce NaN 
> (Not-A-Number) as a result.
> This is because [IEEE 754|http://grouper.ieee.org/groups/754/] defines {{1.0 
> / 0.0}} as Infinity and {{-1.0 / 0.0}} as -Infinity and {{0.0 / 0.0}} as NaN, 
> see here: [https://stackoverflow.com/a/14138032/1106893] 
> I think it's very dangerous that NaN can be returned from ratio 
> calculations, as this could have side effects.
> When both the numerator and the denominator are zero, I think ratio should 
> return 0 instead (see the sketch below).
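
To make the arithmetic concrete, here is a minimal standalone sketch (not the 
actual ResourceCalculator code; the class name is made up for the example) of 
a ratio helper that guards against the 0.0 / 0.0 case:

{code:java}
public final class SafeRatioSketch {

  // Computes numerator / denominator, but returns 0 when both operands are
  // zero instead of the NaN that plain floating-point division would yield.
  // (A zero denominator with a non-zero numerator still yields Infinity,
  // which is a separate question from the one raised in this issue.)
  public static float ratio(long numerator, long denominator) {
    if (numerator == 0 && denominator == 0) {
      // 0.0f / 0.0f == Float.NaN under IEEE 754; treat this as "no usage".
      return 0.0f;
    }
    return (float) numerator / denominator;
  }

  public static void main(String[] args) {
    System.out.println(0.0f / 0.0f);   // NaN
    System.out.println(ratio(0, 0));   // 0.0
    System.out.println(ratio(5, 10));  // 0.5
  }
}
{code}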



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8985) Improve debug log in FSParentQueue when assigning container

2018-12-05 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-8985:
-
Summary: Improve debug log in FSParentQueue when assigning container  (was: 
FSParentQueue: debug log missing when assigning container)

> Improve debug log in FSParentQueue when assigning container
> ---
>
> Key: YARN-8985
> URL: https://issues.apache.org/jira/browse/YARN-8985
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.3.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
> Attachments: YARN-8985.001.patch, YARN-8985.002.patch
>
>
> Tracking assignments through the queue hierarchy is not possible at DEBUG 
> level because FSParentQueue does not log that a node has been offered to the 
> queue.
> This means that if a parent queue has no leaf queues, it is impossible to 
> track the offer, which leaves a hole in the tracking (a rough sketch follows 
> below).
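
For illustration only, a rough sketch of the kind of guarded DEBUG statement 
the description is asking for; the class and method names below are simplified 
stand-ins written for this example, not the real FSParentQueue code:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Simplified stand-in for a parent queue in the Fair Scheduler hierarchy.
class ParentQueueSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ParentQueueSketch.class);

  private final String queueName;

  ParentQueueSketch(String queueName) {
    this.queueName = queueName;
  }

  // Called when a node is offered to this queue. Logging the offer here makes
  // it possible to follow the assignment path through parent queues at DEBUG
  // level, even when a parent has no leaf queues underneath it.
  void assignContainer(String nodeName) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Node {} offered to parent queue: {}", nodeName, queueName);
    }
    // ... delegate to child queues here ...
  }
}
{code}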



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8985) FSParentQueue: debug log missing when assigning container

2018-12-05 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710723#comment-16710723
 ] 

Haibo Chen commented on YARN-8985:
--

+1 and checking it in shortly.

> FSParentQueue: debug log missing when assigning container
> -
>
> Key: YARN-8985
> URL: https://issues.apache.org/jira/browse/YARN-8985
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.3.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
> Attachments: YARN-8985.001.patch, YARN-8985.002.patch
>
>
> Tracking assignments through the queue hierarchy is not possible at DEBUG 
> level because FSParentQueue does not log that a node has been offered to the 
> queue.
> This means that if a parent queue has no leaf queues, it is impossible to 
> track the offer, which leaves a hole in the tracking.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8738) FairScheduler configures maxResources or minResources as negative, the value parse to a positive number.

2018-12-05 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710720#comment-16710720
 ] 

Haibo Chen commented on YARN-8738:
--

Thanks [~snemeth] for the patch. Two nits: 

1) We can replace NegativeResourceDefinitionException with an 
AllocationConfigurationException and update the diagnostic messages in 
parseNewStyleResource() accordingly (a rough sketch follows after these notes).

2) In addition, all the test cases in 
TestFairSchedulerConfigurationNegativeResourceValues can be moved to the 
existing testFairSchedulerConfiguration.
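
A rough sketch of what point 1 could look like; the exception and parser 
classes below are simplified stand-ins, not the actual FairSchedulerConfiguration 
code, and the real patch may differ:

{code:java}
// Stand-in for the allocation-file parsing error type.
class AllocationConfigurationExceptionSketch extends Exception {
  AllocationConfigurationExceptionSketch(String message) {
    super(message);
  }
}

class ResourceValueParserSketch {
  // Parses a single resource value and rejects negatives with a descriptive
  // diagnostic instead of silently turning them into positive numbers.
  static long parseResourceValue(String resourceName, String rawValue)
      throws AllocationConfigurationExceptionSketch {
    long value = Long.parseLong(rawValue.trim());
    if (value < 0) {
      throw new AllocationConfigurationExceptionSketch(
          "Invalid value of " + resourceName + ": " + value
              + ". Negative resource values are not allowed.");
    }
    return value;
  }
}
{code}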

> FairScheduler configures maxResources or minResources as negative, the value 
> parse to a positive number.
> 
>
> Key: YARN-8738
> URL: https://issues.apache.org/jira/browse/YARN-8738
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Sen Zhao
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8738.001.patch, YARN-8738.002.patch
>
>
> If maxResources or minResources is configured as a negative number, the value 
> will be positive after parsing.
> If this is a problem, I will fix it. If not, 
> FairSchedulerConfiguration#parseNewStyleResource should handle negative 
> numbers the same way parseOldStyleResource does.
> cc:[~templedf], [~leftnoteasy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8994) Fix race condition between move app and queue cleanup in Fair Scheduler

2018-12-05 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-8994:
-
Summary: Fix race condition between move app and queue cleanup in Fair 
Scheduler  (was: Fix for race condition in move app and queue cleanup in FS)

> Fix race condition between move app and queue cleanup in Fair Scheduler
> ---
>
> Key: YARN-8994
> URL: https://issues.apache.org/jira/browse/YARN-8994
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8994.001.patch
>
>
> Similar to YARN-8990, and also introduced by YARN-8191, there is a race 
> condition while moving an application. The pre-move check looks up the queue 
> and proceeds once it finds it. The real move then retrieves the queue and 
> does further checks before updating the app and the queues.
> The move uses the retrieved queue object, but the queue could have become 
> empty while the checks are performed. If the cleanup runs at that same time, 
> the app will be moved to a deleted queue and lost (a sketch of the mitigation 
> idea follows below).
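
For illustration, a hypothetical sketch of the mitigation idea: re-resolve and 
validate the target queue inside the same lock that the cleanup takes, so the 
move either succeeds against a live queue or fails cleanly. This is not the 
actual FairScheduler code, and the fix in the attached patch may differ:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class AppMoveSketch {
  private final Object writeLock = new Object();
  private final Map<String, Set<String>> appsByQueue = new HashMap<>();

  // Move an app between queues, re-resolving both queues inside the same
  // lock that the (hypothetical) empty-queue cleanup also takes, so the
  // target cannot be deleted between the check and the update.
  boolean moveApp(String appId, String fromQueue, String toQueue) {
    synchronized (writeLock) {
      Set<String> src = appsByQueue.get(fromQueue);
      Set<String> dst = appsByQueue.get(toQueue);
      if (src == null || dst == null || !src.remove(appId)) {
        // Either queue disappeared (e.g. removed by cleanup) or the app is
        // not in the source queue: fail the move instead of losing the app.
        return false;
      }
      dst.add(appId);
      return true;
    }
  }

  // Cleanup path, taking the same lock as the move.
  void removeEmptyQueues() {
    synchronized (writeLock) {
      appsByQueue.values().removeIf(Set::isEmpty);
    }
  }
}
{code}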



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8994) Fix for race condition in move app and queue cleanup in FS

2018-12-05 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710701#comment-16710701
 ] 

Haibo Chen commented on YARN-8994:
--

+1 on the patch. Committing to trunk shortly.

> Fix for race condition in move app and queue cleanup in FS
> --
>
> Key: YARN-8994
> URL: https://issues.apache.org/jira/browse/YARN-8994
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8994.001.patch
>
>
> Similar to YARN-8990, and also introduced by YARN-8191, there is a race 
> condition while moving an application. The pre-move check looks up the queue 
> and proceeds once it finds it. The real move then retrieves the queue and 
> does further checks before updating the app and the queues.
> The move uses the retrieved queue object, but the queue could have become 
> empty while the checks are performed. If the cleanup runs at that same time, 
> the app will be moved to a deleted queue and lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9066) Deprecate Fair Scheduler min share

2018-11-29 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704353#comment-16704353
 ] 

Haibo Chen commented on YARN-9066:
--

Go for it.

> Deprecate Fair Scheduler min share
> --
>
> Key: YARN-9066
> URL: https://issues.apache.org/jira/browse/YARN-9066
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Haibo Chen
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: Proposal_Deprecate_FS_Min_Share.pdf
>
>
> See the attached docs for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9066) Deprecate Fair Scheduler min share

2018-11-27 Thread Haibo Chen (JIRA)
Haibo Chen created YARN-9066:


 Summary: Deprecate Fair Scheduler min share
 Key: YARN-9066
 URL: https://issues.apache.org/jira/browse/YARN-9066
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.2.0
Reporter: Haibo Chen
 Attachments: Proposal_Deprecate_FS_Min_Share.pdf

See the attached docs for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9066) Deprecate Fair Scheduler min share

2018-11-27 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-9066:
-
Attachment: Proposal_Deprecate_FS_Min_Share.pdf

> Deprecate Fair Scheduler min share
> ---
>
> Key: YARN-9066
> URL: https://issues.apache.org/jira/browse/YARN-9066
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Haibo Chen
>Priority: Major
> Attachments: Proposal_Deprecate_FS_Min_Share.pdf
>
>
> See the attached docs for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9025) Make TestFairScheduler#testChildMaxResources more reliable, as it is flaky now

2018-11-20 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693863#comment-16693863
 ] 

Haibo Chen commented on YARN-9025:
--

[~snemeth] I noticed the existence of MockRM.drainEvents(). Does that help get 
rid of the non-determinism completely?
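
For context, a generic sketch of the drain-then-assert idea behind such a 
helper; the dispatcher below is a hypothetical stand-in written for this 
example, not MockRM or the YARN AsyncDispatcher:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical asynchronous event dispatcher, used only to illustrate why a
// test that asserts right after firing events is flaky, and how a
// drainEvents()-style barrier removes that non-determinism.
class AsyncDispatcherSketch {
  private final ExecutorService worker = Executors.newSingleThreadExecutor();
  private final AtomicInteger pending = new AtomicInteger();

  void dispatch(Runnable event) {
    pending.incrementAndGet();
    worker.execute(() -> {
      try {
        event.run();
      } finally {
        pending.decrementAndGet();
      }
    });
  }

  // Wait until every dispatched event has been fully handled before the test
  // makes assertions about the resulting state.
  void drainEvents() throws InterruptedException {
    while (pending.get() > 0) {
      Thread.sleep(10);
    }
  }

  void shutdown() throws InterruptedException {
    worker.shutdown();
    worker.awaitTermination(10, TimeUnit.SECONDS);
  }
}
{code}

In a test, the equivalent step would be to drain all pending scheduler events 
after the node updates and before asserting on the number of allocated 
containers.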

> Make TestFairScheduler#testChildMaxResources more reliable, as it is flaky now
> --
>
> Key: YARN-9025
> URL: https://issues.apache.org/jira/browse/YARN-9025
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9025.001.patch
>
>
> While working on the patch for YARN-8059, I came across a flaky test; see 
> this link: 
> https://builds.apache.org/job/PreCommit-YARN-Build/22412/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> This is the error message: 
> {code:java}
> [ERROR] Tests run: 108, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 19.37 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> [ERROR] 
> testChildMaxResources(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
>  Time elapsed: 0.164 s <<< FAILURE!
> java.lang.AssertionError: App 1 is not running with the correct number of 
> containers expected:<2> but was:<0>
>  at org.junit.Assert.fail(Assert.java:88){code}
> The issue is that even though there were 8 node updates, because of how the 
> events are handled it can happen that no container is allocated for the 
> application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8992) Fair scheduler can delete a dynamic queue while an application attempt is being added to the queue

2018-11-20 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693855#comment-16693855
 ] 

Haibo Chen commented on YARN-8992:
--

+1. Checking it in shortly

> Fair scheduler can delete a dynamic queue while an application attempt is 
> being added to the queue
> --
>
> Key: YARN-8992
> URL: https://issues.apache.org/jira/browse/YARN-8992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.1
>Reporter: Haibo Chen
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8992.001.patch, YARN-8992.002.patch
>
>
> As discovered in YARN-8990, QueueManager can see a leaf queue being empty 
> while FSLeafQueue.addApp() is called in the middle of  
> {code:java}
> return queue.getNumRunnableApps() == 0 &&
>   leafQueue.getNumNonRunnableApps() == 0 &&
>   leafQueue.getNumAssignedApps() == 0;{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9026) DefaultOOMHandler should mark preempted containers as killed

2018-11-20 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen resolved YARN-9026.
--
Resolution: Invalid

> DefaultOOMHandler should mark preempted containers as killed
> 
>
> Key: YARN-9026
> URL: https://issues.apache.org/jira/browse/YARN-9026
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Haibo Chen
>Priority: Major
>
> DefaultOOMHandler today kills a selected container by sending a kill -9 
> signal to all processes running within the container cgroup.
> The container would then exit with a non-zero code and hence be treated as a 
> failure by the ContainerLaunch threads.
> We should instead mark the containers as killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9026) DefaultOOMHandler should mark preempted containers as killed

2018-11-20 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693638#comment-16693638
 ] 

Haibo Chen commented on YARN-9026:
--

Ah, I was not aware of this.  Indeed, this is already taken care of today. 
Thanks for pointing this out, [~tangzhankun]! Closing this Jira as invalid.

> DefaultOOMHandler should mark preempted containers as killed
> 
>
> Key: YARN-9026
> URL: https://issues.apache.org/jira/browse/YARN-9026
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Haibo Chen
>Priority: Major
>
> DefaultOOMHandler today kills a selected container by sending a kill -9 
> signal to all processes running within the container cgroup.
> The container would then exit with a non-zero code and hence be treated as a 
> failure by the ContainerLaunch threads.
> We should instead mark the containers as killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9026) DefaultOOMHandler should mark preempted containers as killed

2018-11-15 Thread Haibo Chen (JIRA)
Haibo Chen created YARN-9026:


 Summary: DefaultOOMHandler should mark preempted containers as 
killed
 Key: YARN-9026
 URL: https://issues.apache.org/jira/browse/YARN-9026
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.2.1
Reporter: Haibo Chen


DefaultOOMHandler today kills a selected container by sending a kill -9 signal 
to all processes running within the container cgroup.

The container would then exit with a non-zero code and hence be treated as a 
failure by the ContainerLaunch threads.

We should instead mark the containers as killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8992) Fair scheduler can delete a dynamic queue while an application attempt is being added to the queue

2018-11-08 Thread Haibo Chen (JIRA)
Haibo Chen created YARN-8992:


 Summary: Fair scheduler can delete a dynamic queue while an 
application attempt is being added to the queue
 Key: YARN-8992
 URL: https://issues.apache.org/jira/browse/YARN-8992
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.1.1
Reporter: Haibo Chen


QueueManager can see a leaf queue being empty while FSLeafQueue.addApp() is 
called in the middle of  
{code:java}
return queue.getNumRunnableApps() == 0 &&
  leafQueue.getNumNonRunnableApps() == 0 &&
  leafQueue.getNumAssignedApps() == 0;{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup

2018-11-08 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-8990:
-
Summary: Fix fair scheduler race condition in app submit and queue cleanup  
(was: FS: race condition in app submit and queue cleanup)

> Fix fair scheduler race condition in app submit and queue cleanup
> -
>
> Key: YARN-8990
> URL: https://issues.apache.org/jira/browse/YARN-8990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
> Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of dynamic queue deletion in YARN-8191, a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue that is empty or does not exist yet. If the 
> {{AllocationFileLoaderService}} kicks off an update during the processing of 
> the application submit, the queue cleanup will be run first. The application 
> submit first creates the queue and gets a reference back to it.
> Other checks are performed, and as the last action before getting ready to 
> generate an AppAttempt, the queue is updated to show the submitted 
> application ID.
> The time between the queue creation and the queue update to show the submit 
> is long enough for the queue to be removed. The application, however, is lost 
> and will never get any resources assigned (a sketch of the race window 
> follows below).
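
To illustrate the window described above, here is a hypothetical, 
self-contained sketch of the submit-vs-cleanup race; the names are made up for 
this example and this is not the FairScheduler implementation:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical shape of the submit-vs-cleanup window.
class QueueRegistrySketch {
  private final Map<String, List<String>> appsByQueue = new ConcurrentHashMap<>();

  // Submit path: create (or find) the dynamic queue, then record the app.
  void submitApp(String queueName, String appId) {
    List<String> apps =
        appsByQueue.computeIfAbsent(queueName, q -> new CopyOnWriteArrayList<>());
    // ... placement rules, ACL checks, and attempt creation happen here; this
    // is the window in which removeEmptyQueues() can still see an empty
    // queue, delete it, and orphan the application being submitted ...
    apps.add(appId);  // may now be adding to a list that is no longer mapped
  }

  // Cleanup path: drop queues that currently hold no applications.
  void removeEmptyQueues() {
    appsByQueue.values().removeIf(List::isEmpty);
  }
}
{code}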



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8992) Fair scheduler can delete a dynamic queue while an application attempt is being added to the queue

2018-11-08 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680630#comment-16680630
 ] 

Haibo Chen commented on YARN-8992:
--

See [^YARN-8990.002.patch] for the proposed fix.

> Fair scheduler can delete a dynamic queue while an application attempt is 
> being added to the queue
> --
>
> Key: YARN-8992
> URL: https://issues.apache.org/jira/browse/YARN-8992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.1
>Reporter: Haibo Chen
>Priority: Major
>
> As discovered in YARN-8990, QueueManager can see a leaf queue being empty 
> while FSLeafQueue.addApp() is called in the middle of  
> {code:java}
> return queue.getNumRunnableApps() == 0 &&
>   leafQueue.getNumNonRunnableApps() == 0 &&
>   leafQueue.getNumAssignedApps() == 0;{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


