[jira] [Commented] (YARN-9941) Opportunistic scheduler metrics should be reset during fail-over.

2020-07-20 Thread Abhishek Modi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161710#comment-17161710
 ] 

Abhishek Modi commented on YARN-9941:
-

Sure [~BilwaST]. Feel free to take over. Thanks

> Opportunistic scheduler metrics should be reset during fail-over.
> -
>
> Key: YARN-9941
> URL: https://issues.apache.org/jira/browse/YARN-9941
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
>







[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161699#comment-17161699
 ] 

Wangda Tan commented on YARN-10352:
---

Also, we need to systematically handle the node heartbeat interval problem. In 
a cloud environment, nodes can be commissioned and decommissioned frequently, so 
always waiting for the 10-minute timeout may not be good. It would be better to 
improve the logic by preempting containers that were newly allocated (but not 
yet acquired) on an NM which has stopped heartbeating. With this, we can 
proactively relocate containers to different nodes before the 10-minute timeout. 
It can be a follow-up of this Jira. 
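
To make that follow-up concrete, below is a small self-contained sketch of the 
selection step only. Everything in it is illustrative (the class, field and 
state names are not actual YARN types); it just shows the rule: on a node whose 
heartbeat is stale, pick containers that were allocated but never acquired by 
the AM, so they can be released and re-scheduled elsewhere.

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hedged sketch with illustrative types, not actual YARN classes.
class StaleNodeContainerReclaimer {
  enum State { ALLOCATED, ACQUIRED, RUNNING }

  static class ContainerInfo {
    final String id;
    final State state;
    ContainerInfo(String id, State state) { this.id = id; this.state = state; }
  }

  static class NodeInfo {
    final long lastHeartbeatMs;
    final List<ContainerInfo> containers;
    NodeInfo(long lastHeartbeatMs, List<ContainerInfo> containers) {
      this.lastHeartbeatMs = lastHeartbeatMs;
      this.containers = containers;
    }
  }

  /** Containers worth releasing and re-allocating on healthier nodes. */
  static List<ContainerInfo> findReclaimable(Collection<NodeInfo> nodes,
      long nowMs, long heartbeatIntervalMs) {
    List<ContainerInfo> reclaim = new ArrayList<>();
    for (NodeInfo node : nodes) {
      boolean stale = nowMs - node.lastHeartbeatMs > 2 * heartbeatIntervalMs;
      if (!stale) {
        continue;
      }
      for (ContainerInfo c : node.containers) {
        // Allocated but never acquired by the AM: cheap to take back.
        if (c.state == State.ALLOCATED) {
          reclaim.add(c);
        }
      }
    }
    return reclaim;
  }
}
{code}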

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM, so the stopped nodes remain in the RM's active node list until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement keeps assigning containers to those nodes. It 
> needs to exclude nodes that have not heartbeated within the configured 
> heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 
> 1000 ms), similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161696#comment-17161696
 ] 

Wangda Tan commented on YARN-10352:
---

Thanks [~prabhujoseph], 

Then it makes sense, but the original logic is too confusing. I think we should 
clean it up and make the distinction between multi-node vs. single-node 
allocation, and between the CandidateSet and the multi-node sorter, clearer. 

Just one nit: can we reuse this method?
{code:java}
long timeElapsedFromLastHeartbeat =
    Time.monotonicNow() - cached.getLastHeartbeatMonotonicTime();
if (timeElapsedFromLastHeartbeat <= nmHeartbeatInterval * 2) {
{code}
[~ztang], can you help to take a look? 
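
For reference, the shape of the reusable check being suggested could be as 
simple as the following. This is a hedged sketch with illustrative names, not 
the actual CapacityScheduler method:

{code:java}
/**
 * Hedged sketch of a shared staleness check (names are illustrative).
 * A node that has missed roughly two heartbeat intervals is considered
 * stale and should be skipped by both the async scheduling threads and
 * the multi-node sorter.
 */
static boolean heartbeatTooOld(long nowMonotonicMs,
    long lastHeartbeatMonotonicMs, long nmHeartbeatIntervalMs) {
  long elapsed = nowMonotonicMs - lastHeartbeatMonotonicMs;
  return elapsed > 2 * nmHeartbeatIntervalMs;
}
{code}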

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM, so the stopped nodes remain in the RM's active node list until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement keeps assigning containers to those nodes. It 
> needs to exclude nodes that have not heartbeated within the configured 
> heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 
> 1000 ms), similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161682#comment-17161682
 ] 

Prabhu Joseph commented on YARN-10352:
--

[~wangda] For each node in the list returned by 
{{CapacityScheduler#getNodesHeartbeated}}, only the allocation of containers 
already reserved on that node happens.

Allocating or reserving new containers uses the multi-node candidates prepared 
by {{MultiNodeSorter#reSortClusterNodes}} (code snippet below), which passes the 
list to the configured {{MultiNodeLookupPolicy}} to perform sorting in the 
background at every configured sorting interval. {{MultiNodeSortingManager}} 
filters that list when returning it to the {{RegularContainerAllocator#allocate}} 
call.

 
{code:java}
  Map<NodeId, N> nodesByPartition = new HashMap<>();
  List<N> nodes = ((AbstractYarnScheduler) rmContext
      .getScheduler()).getNodeTracker().getNodesPerPartition(label);
  if (nodes != null) {
    nodes.forEach(n -> nodesByPartition.put(n.getNodeID(), n));
    multiNodePolicy.addAndRefreshNodesSet(
        (Collection<N>) nodesByPartition.values(), label);
  }
{code}
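
The fix being discussed, in spirit, filters that candidate list by heartbeat 
freshness before the allocator sees it. Below is a hedged, self-contained 
sketch, not the actual patch; {{SchedulerNodeView}} and the method names are 
illustrative:

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hedged sketch (not the actual patch): keep only nodes whose last heartbeat
// is within twice the configured NM heartbeat interval, mirroring the rule in
// CapacityScheduler#shouldSkipNodeSchedule. SchedulerNodeView is illustrative.
class HeartbeatFreshNodeFilter {
  interface SchedulerNodeView {
    long getLastHeartbeatMonotonicTime();
  }

  static <N extends SchedulerNodeView> List<N> filterFresh(
      Collection<N> candidates, long nowMonotonicMs,
      long nmHeartbeatIntervalMs) {
    List<N> fresh = new ArrayList<>();
    for (N node : candidates) {
      long elapsed = nowMonotonicMs - node.getLastHeartbeatMonotonicTime();
      if (elapsed <= 2 * nmHeartbeatIntervalMs) {
        fresh.add(node);
      }
    }
    return fresh;
  }
}
{code}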

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM, so the stopped nodes remain in the RM's active node list until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement keeps assigning containers to those nodes. It 
> needs to exclude nodes that have not heartbeated within the configured 
> heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 
> 1000 ms), similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161615#comment-17161615
 ] 

Wangda Tan commented on YARN-10352:
---

[~prabhujoseph], I'm trying to understand this logic: why do we have two 
separate pieces of logic to filter out outdated nodes? We have one in 
MultiNodeSortingManager and one in getNodesHeartbeated. I'm wondering whether 
that is really necessary.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM, so the stopped nodes remain in the RM's active node list until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement keeps assigning containers to those nodes. It 
> needs to exclude nodes that have not heartbeated within the configured 
> heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 
> 1000 ms), similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161556#comment-17161556
 ] 

Hadoop QA commented on YARN-10352:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 55s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
45s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
43s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 30s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
53s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}107m 12s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}173m 19s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26296/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10352 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13008033/YARN-10352-003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux c613a5c0e0c2 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 736bed6d6d2 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | 

[jira] [Commented] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-20 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161450#comment-17161450
 ] 

Hudson commented on YARN-10353:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18456 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18456/])
[YARN-10353] Log vcores used and cumulative cpu in containers monitor. 
(ebadger: rev 736bed6d6d20a17b522a0686ca3fd2d97e7e6838)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
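
The proposed format (quoted in the issue description below) boils down to 
appending a vCores ratio and the cumulative CPU milliseconds to the existing 
audit line. A hedged sketch of such a suffix, with illustrative names rather 
than the actual ContainersMonitorImpl code:

{code:java}
// Hedged sketch: appending "vCores:<used>/<assigned> CPU-ms:<total>" to an
// audit-style line, matching the format proposed in the issue description.
// Parameter and method names are illustrative, not ContainersMonitorImpl's.
static String usageSuffix(int vcoresUsed, int vcoresAssigned,
    long cumulativeCpuMs) {
  return String.format(" vCores:%d/%d CPU-ms:%d",
      vcoresUsed, vcoresAssigned, cumulativeCpuMs);
}

// Example: usageSuffix(2, 1, 4180) -> " vCores:2/1 CPU-ms:4180"
{code}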


> Log vcores used and cumulative cpu in containers monitor
> 
>
> Key: YARN-10353
> URL: https://issues.apache.org/jira/browse/YARN-10353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10353.001.patch, YARN-10353.002.patch
>
>
> We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
> Containers Monitor log. It would be useful to also log vcores used vs vcores 
> assigned, and total accumulated CPU time.
> For example, currently we have an audit log that looks like this:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625
> {noformat}
> The proposal is to add two more fields to show vCores and Cumulative CPU ms:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625 vCores:2/1 CPU-ms:4180
> {noformat}
> This is a snippet of a log from one of our clusters running branch-2.8 with a 
> similar change.
> {noformat}
> 2020-07-16 21:00:02,240 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for 
> container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 
> 10 CPU vCores used. Cumulative CPU time: 157410
> 2020-07-16 21:00:02,269 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for 
> container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 113830
> 2020-07-16 21:00:02,298 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for 
> container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 
> 10 CPU vCores used. Cumulative CPU time: 128630
> 2020-07-16 21:00:02,339 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for 
> container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 96060
> 2020-07-16 21:00:02,367 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for 
> container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB 
> physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 
> 10 CPU vCores used. Cumulative CPU time: 116820
> 2020-07-16 21:00:02,396 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for 
> container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB 
> physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 
> 10 CPU vCores used. Cumulative CPU time: 45900
> 2020-07-16 21:00:02,424 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for 
> container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB 
> physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 
> 10 CPU vCores used. Cumulative CPU time: 2572390
> 2020-07-16 21:00:02,456 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for 
> container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 
> GB physical 

[jira] [Commented] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-20 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161448#comment-17161448
 ] 

Jim Brennan commented on YARN-10353:


Thanks [~ebadger]!

> Log vcores used and cumulative cpu in containers monitor
> 
>
> Key: YARN-10353
> URL: https://issues.apache.org/jira/browse/YARN-10353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10353.001.patch, YARN-10353.002.patch
>
>
> We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
> Containers Monitor log. It would be useful to also log vcores used vs vcores 
> assigned, and total accumulated CPU time.
> For example, currently we have an audit log that looks like this:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625
> {noformat}
> The proposal is to add two more fields to show vCores and Cumulative CPU ms:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625 vCores:2/1 CPU-ms:4180
> {noformat}
> This is a snippet of a log from one of our clusters running branch-2.8 with a 
> similar change.
> {noformat}
> 2020-07-16 21:00:02,240 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for 
> container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 
> 10 CPU vCores used. Cumulative CPU time: 157410
> 2020-07-16 21:00:02,269 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for 
> container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 113830
> 2020-07-16 21:00:02,298 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for 
> container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 
> 10 CPU vCores used. Cumulative CPU time: 128630
> 2020-07-16 21:00:02,339 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for 
> container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 96060
> 2020-07-16 21:00:02,367 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for 
> container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB 
> physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 
> 10 CPU vCores used. Cumulative CPU time: 116820
> 2020-07-16 21:00:02,396 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for 
> container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB 
> physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 
> 10 CPU vCores used. Cumulative CPU time: 45900
> 2020-07-16 21:00:02,424 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for 
> container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB 
> physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 
> 10 CPU vCores used. Cumulative CPU time: 2572390
> 2020-07-16 21:00:02,456 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for 
> container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 101210
> {noformat}






[jira] [Updated] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-20 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10353:
---
Fix Version/s: 3.4.0

> Log vcores used and cumulative cpu in containers monitor
> 
>
> Key: YARN-10353
> URL: https://issues.apache.org/jira/browse/YARN-10353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10353.001.patch, YARN-10353.002.patch
>
>
> We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
> Containers Monitor log. It would be useful to also log vcores used vs vcores 
> assigned, and total accumulated CPU time.
> For example, currently we have an audit log that looks like this:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625
> {noformat}
> The proposal is to add two more fields to show vCores and Cumulative CPU ms:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625 vCores:2/1 CPU-ms:4180
> {noformat}
> This is a snippet of a log from one of our clusters running branch-2.8 with a 
> similar change.
> {noformat}
> 2020-07-16 21:00:02,240 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for 
> container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 
> 10 CPU vCores used. Cumulative CPU time: 157410
> 2020-07-16 21:00:02,269 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for 
> container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 113830
> 2020-07-16 21:00:02,298 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for 
> container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 
> 10 CPU vCores used. Cumulative CPU time: 128630
> 2020-07-16 21:00:02,339 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for 
> container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 96060
> 2020-07-16 21:00:02,367 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for 
> container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB 
> physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 
> 10 CPU vCores used. Cumulative CPU time: 116820
> 2020-07-16 21:00:02,396 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for 
> container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB 
> physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 
> 10 CPU vCores used. Cumulative CPU time: 45900
> 2020-07-16 21:00:02,424 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for 
> container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB 
> physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 
> 10 CPU vCores used. Cumulative CPU time: 2572390
> 2020-07-16 21:00:02,456 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for 
> container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 101210
> {noformat}






[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161439#comment-17161439
 ] 

Hadoop QA commented on YARN-10315:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  
1s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 33s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
42s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 88m 
58s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}154m 57s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26295/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10315 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13008026/YARN-10315.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 5f0c653020b2 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 9f407bcc88a |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/26295/testReport/ |
| Max. process+thread count | 837 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 

[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-003.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM, so the stopped nodes remain in the RM's active node list until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement keeps assigning containers to those nodes. It 
> needs to exclude nodes that have not heartbeated within the configured 
> heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 
> 1000 ms), similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Updated] (YARN-10355) Refactor NM ContainerLaunch.java#orderEnvByDependencies

2020-07-20 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-10355:
-
Description: 
The 
{{org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch#orderEnvByDependencies}}
 method and its helper method {{getEnvDependencies}} (together with the 
overrides) are hard to read. Some improvements could be made:
 * use Pattern matching in the overrides of getEnvDependencies instead of 
iterating through the environment variable strings char by char
 * the unit tests contain a lot of repeated code and the test methods are 
generally long - they could be separated into different setup/helper and 
assertion methods

  was:
The 
{{org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch#orderEnvByDependencies}}
 and it's helper method {{getEnvDependencies }}(together with the overrides) is 
hard to read. Some improvements could be made:
 * use Pattern matching in the overrides of getEnvDependencies instead of 
iterating through the environmental variable strings char by char
 * the unit tests contains a lot of repeated code and generally the test 
methods are long - it could be separated into different setup/helper and 
assertion methods


> Refactor NM ContainerLaunch.java#orderEnvByDependencies
> ---
>
> Key: YARN-10355
> URL: https://issues.apache.org/jira/browse/YARN-10355
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Benjamin Teke
>Priority: Minor
>
> The 
> {{org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch#orderEnvByDependencies}}
>  method and its helper method {{getEnvDependencies}} (together with the 
> overrides) are hard to read. Some improvements could be made:
>  * use Pattern matching in the overrides of getEnvDependencies instead of 
> iterating through the environment variable strings char by char
>  * the unit tests contain a lot of repeated code and the test methods are 
> generally long - they could be separated into different setup/helper and 
> assertion methods






[jira] [Created] (YARN-10355) Refactor NM ContainerLaunch.java#orderEnvByDependencies

2020-07-20 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-10355:


 Summary: Refactor NM ContainerLaunch.java#orderEnvByDependencies
 Key: YARN-10355
 URL: https://issues.apache.org/jira/browse/YARN-10355
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Benjamin Teke


The 
{{org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch#orderEnvByDependencies}}
 method and its helper method {{getEnvDependencies}} (together with the 
overrides) are hard to read. Some improvements could be made:
 * use Pattern matching in the overrides of getEnvDependencies instead of 
iterating through the environment variable strings char by char
 * the unit tests contain a lot of repeated code and the test methods are 
generally long - they could be separated into different setup/helper and 
assertion methods
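
To illustrate the first bullet, a minimal regex-based sketch of extracting 
referenced environment variable names is shown below. It is intentionally 
simplified (it ignores escaping, quoting and Windows-style %VAR% syntax, which 
the real getEnvDependencies overrides have to handle), and the names are 
illustrative rather than the actual ContainerLaunch code:

{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hedged sketch: find $VAR and ${VAR} references with a Pattern instead of
// scanning the value char by char. Not the actual ContainerLaunch code.
class EnvDependencySketch {
  private static final Pattern ENV_REF =
      Pattern.compile("\\$\\{?([A-Za-z_][A-Za-z0-9_]*)\\}?");

  static Set<String> getEnvDependencies(String value) {
    Set<String> deps = new HashSet<>();
    if (value == null) {
      return deps;
    }
    Matcher m = ENV_REF.matcher(value);
    while (m.find()) {
      deps.add(m.group(1));   // the referenced variable name
    }
    return deps;
  }
}
{code}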






[jira] [Commented] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-07-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161342#comment-17161342
 ] 

Hadoop QA commented on YARN-9136:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 34m 
59s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} markdownlint {color} | {color:blue}  0m  
0s{color} | {color:blue} markdownlint was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 32m 
 6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
47m 54s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 14s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 99m 44s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26294/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-9136 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13008016/YARN-9136.001.patch |
| Optional Tests | dupname asflicense mvnsite markdownlint |
| uname | Linux 4d3db621685d 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 
11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 9f407bcc88a |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/26294/artifact/out/whitespace-eol.txt
 |
| Max. process+thread count | 308 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/26294/console |
| versions | git=2.17.1 maven=3.6.0 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |


This message was automatically generated.



> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
> Attachments: YARN-9136.001.patch
>
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This was presumably left undocumented unintentionally: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Updated] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-20 Thread Sushil Ks (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-10315:
-
Attachment: YARN-10315.002.patch

> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch, YARN-10315.002.patch
>
>
> When the node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the event when the resource is unchanged.
>  A scheduler node resource update iterates through all the queues, which is 
> costly.
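
The intent can be summarized with a small hedged sketch: only dispatch the 
event when the reported total actually differs from what the RM already tracks. 
The surrounding context (where this guard would live in the RM) is assumed for 
illustration, not taken from the patch:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceOption;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeResourceUpdateEvent;

// Hedged sketch of the idea (not the actual patch): skip the event, and
// therefore the queue-walking scheduler update, when nothing changed.
class ConditionalResourceUpdate {
  static void maybeSendUpdate(RMContext rmContext, RMNode rmNode,
      Resource newTotal, int overCommitTimeoutSecs) {
    if (newTotal.equals(rmNode.getTotalCapability())) {
      return; // resource unchanged: no RMNodeResourceUpdateEvent needed
    }
    rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeResourceUpdateEvent(rmNode.getNodeID(),
            ResourceOption.newInstance(newTotal, overCommitTimeoutSecs)));
  }
}
{code}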






[jira] [Commented] (YARN-10106) Yarn logs CLI filtering by application attempt

2020-07-20 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161306#comment-17161306
 ] 

Adam Antal commented on YARN-10106:
---

+1 from me.

> Yarn logs CLI filtering by application attempt
> --
>
> Key: YARN-10106
> URL: https://issues.apache.org/jira/browse/YARN-10106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Hudáky Márton Gyula
>Priority: Trivial
> Attachments: YARN-10106.001.patch, YARN-10106.002.patch, 
> YARN-10106.003.patch, YARN-10106.004.patch, YARN-10106.005.patch, 
> YARN-10106.006.patch, YARN-10106.007.patch, YARN-10106.008.patch, 
> YARN-10106.009.patch, YARN-10106.010.patch
>
>
> {{ContainerLogsRequest}} got a new parameter in YARN-10101, which is the 
> {{applicationAttempt}} - we can use this new parameter in Yarn logs CLI as 
> well to filter by application attempt.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161295#comment-17161295
 ] 

Hadoop QA commented on YARN-10352:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  2s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
45s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
42s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 32s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 1 new + 97 unchanged - 0 fixed = 98 total (was 97) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 27s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
47s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}107m 16s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}182m 24s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
|  |  node must be non-null but is marked as nullable  At 
MultiNodeSortingManager.java:is marked as nullable  At 
MultiNodeSortingManager.java:[lines 124-125] |
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26293/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10352 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13008000/YARN-10352-002.patch |
| Optional Tests | dupname 

[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-20 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161286#comment-17161286
 ] 

Jim Brennan commented on YARN-10343:


Thanks for the updates and explanations [~epayne]!  I am +1 (non-binding) on 
patch 001.

 

> Legacy RM UI should include labeled metrics for allocated, total, and 
> reserved resources.
> -
>
> Key: YARN-10343
> URL: https://issues.apache.org/jira/browse/YARN-10343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 
> 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch, YARN-10343.001.patch
>
>
> The current legacy RM UI only includes resource metrics for the default 
> partition. If a cluster has labeled nodes, those are not included in the 
> resource metrics for allocated, total, and reserved resources.






[jira] [Commented] (YARN-9136) getNMResourceInfo NodeManager REST API method is not documented

2020-07-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/YARN-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161266#comment-17161266
 ] 

Hudáky Márton Gyula commented on YARN-9136:
---

Patch 1 only contains the description of the API; we still need to add an 
example of its usage.

> getNMResourceInfo NodeManager REST API method is not documented
> ---
>
> Key: YARN-9136
> URL: https://issues.apache.org/jira/browse/YARN-9136
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Hudáky Márton Gyula
>Priority: Major
> Attachments: YARN-9136.001.patch
>
>
> I cannot find documentation for the resources endpoint in NMWebServices: 
> /ws/v1/node/resources/\{resourcename\}
> I looked in the file NodeManagerRest.md for documentation but haven't found 
> any.
> This was presumably left undocumented unintentionally: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRest.md






[jira] [Commented] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-07-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161254#comment-17161254
 ] 

Hadoop QA commented on YARN-10278:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 30m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
1s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 20 new or modified test 
files. {color} |
|| || || || {color:brown} branch-3.3 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 
55s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  9s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} branch-3.3 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
46s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
44s{color} | {color:green} branch-3.3 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 33s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 82 new + 70 unchanged - 42 fixed = 152 total (was 112) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 92m 
26s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}196m 32s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26292/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10278 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13007996/YARN-10278.branch-3.3.001.patch
 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 486b8b8ceb5f 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | branch-3.3 / 5aa9396 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~16.04-b09 |
| checkstyle | 

[jira] [Updated] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-20 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10315:
--
Summary: Avoid sending RMNodeResourceupdate event if resource is same  
(was: Avoid sending RMNodeResoureupdate event if resource is same)

> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch
>
>
> When the node is in DECOMMISSIONING state, the RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the same.
> Scheduler node resource update iterates through all the queues for resource 
> update, which is costly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161163#comment-17161163
 ] 

Prabhu Joseph commented on YARN-10352:
--

Thanks [~bibinchundatt] for the inputs.

1. Removed iterating the nodes from {{ClusterNodeTracker}} and moved the 
filtering logic to {{CapacityScheduler}}.

2. Added filter logic while returning the {{preferrednodeIterator}}.

3. {{reSortClusterNodes}} need not filter, since {{preferrednodeIterator}} 
already applies the same filter at the end. Let me know if this is fine.
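For illustration, a self-contained sketch of the filtering idea discussed here 
(the {{Node}} class, the interval value and the variable names are stand-ins 
for the sketch, not the code in the patch):
{code:java}
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import com.google.common.collect.Iterators;

public class StaleNodeFilterSketch {
  // Minimal stand-in for a scheduler node; the real code would use
  // FiCaSchedulerNode and its last-heartbeat timestamp.
  static class Node {
    final String host;
    final long lastHeartbeatMs;
    Node(String host, long lastHeartbeatMs) {
      this.host = host;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Assume a 1000 ms NM heartbeat interval; skip nodes that are a couple of
    // intervals behind, similar to CapacityScheduler#shouldSkipNodeSchedule.
    long skipNodeIntervalMs = 2 * 1000;

    List<Node> partitionNodes = Arrays.asList(
        new Node("worker0", now - 600_000), // stopped ~10 minutes ago
        new Node("worker1", now - 500));    // heartbeated half a second ago

    // Wrap the partition's node iterator so stale nodes are never offered
    // to the multi-node placement logic.
    Iterator<Node> preferredNodeIterator = Iterators.filter(
        partitionNodes.iterator(),
        node -> now - node.lastHeartbeatMs <= skipNodeIntervalMs);

    preferredNodeIterator.forEachRemaining(
        n -> System.out.println("schedulable node: " + n.host)); // only worker1
  }
}
{code}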





> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch
>
>
> When Node Recovery is enabled, stopping a NM won't unregister it from the RM, 
> so RM Active Nodes will still contain those stopped nodes until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> the nodes which have not heartbeated for the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-002.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch
>
>
> When Node Recovery is enabled, stopping a NM won't unregister it from the RM, 
> so RM Active Nodes will still contain those stopped nodes until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> the nodes which have not heartbeated for the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161114#comment-17161114
 ] 

Bibin Chundatt commented on YARN-10315:
---

Thank you [~Sushil-K-S] for the patch.

Overall the patch looks good to me.

[~adam.antal] any comments?
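For reference, a minimal, self-contained sketch of the kind of guard this fix 
implies (the variable names and the placement of the check are my assumptions, 
not necessarily how the patch does it):
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

public class ResourceUpdateGuardSketch {
  public static void main(String[] args) {
    // Stand-ins for the capability the scheduler already knows about and the
    // capability reported in the latest heartbeat of a DECOMMISSIONING node.
    Resource current = Resource.newInstance(8192, 8);
    Resource reported = Resource.newInstance(8192, 8);

    // Core idea: only raise an RMNodeResourceUpdateEvent (and the resulting
    // scheduler-wide queue update) when the capability actually changed;
    // Resource#equals compares the memory, vcores and extra resource types.
    if (!current.equals(reported)) {
      System.out.println("resource changed -> dispatch RMNodeResourceUpdateEvent");
    } else {
      System.out.println("resource unchanged -> skip the event");
    }
  }
}
{code}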

> Avoid sending RMNodeResoureupdate event if resource is same
> ---
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch
>
>
> When the node is in DECOMMISSIONING state, the RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the same.
> Scheduler node resource update iterates through all the queues for resource 
> update, which is costly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161114#comment-17161114
 ] 

Bibin Chundatt edited comment on YARN-10315 at 7/20/20, 10:17 AM:
--

Thank you [~Sushil-K-S] for the patch.

Overall the patch looks good to me. Please fix the whitespace errors.

[~adam.antal] any comments?


was (Author: bibinchundatt):
Thank you [~Sushil-K-S] for the patch.

Over all patch looks good to me ..

[~adam.antal] any comments 

> Avoid sending RMNodeResoureupdate event if resource is same
> ---
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch
>
>
> When the node is in DECOMMISSIONING state, the RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the same.
> Scheduler node resource update iterates through all the queues for resource 
> update, which is costly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-07-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161100#comment-17161100
 ] 

Hadoop QA commented on YARN-10315:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  
2s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  6s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
45s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
43s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 88m 10s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m 53s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26291/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10315 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13007991/YARN-10315.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux b7b7938e57cf 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 2cec50cf165 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| whitespace | 

[jira] [Updated] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-07-20 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10278:
--
Attachment: YARN-10278.branch-3.3.001.patch

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.branch-3.1.001.patch, YARN-10278.branch-3.1.002.patch, 
> YARN-10278.branch-3.1.003.patch, YARN-10278.branch-3.2.001.patch, 
> YARN-10278.branch-3.2.002.patch, YARN-10278.branch-3.3.001.patch
>
>
> This test framework class mocks a bit too heavily, and simulates CS internal 
> behaviour with the mock methods past the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratch.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes and easier to maintain, or at least documented better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-07-20 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161040#comment-17161040
 ] 

Szilard Nemeth commented on YARN-10278:
---

Hi [~epayne],
Thanks for taking care of reviewing the patches I've uploaded.
It seems like branch-2.10 would involve more work as it has some conflicts.
Also, 2.10 builds are not triggering; I saw the same phenomenon while looking 
at some other jiras.
Do you want to stick with the 2.10 patch?

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.branch-3.1.001.patch, YARN-10278.branch-3.1.002.patch, 
> YARN-10278.branch-3.1.003.patch, YARN-10278.branch-3.2.001.patch, 
> YARN-10278.branch-3.2.002.patch
>
>
> This test framework class mocks a bit too heavily, and simulates CS internal 
> behaviour with the mock methods past the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratch.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes and easier to maintain, or at least documented better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9016) DocumentStore as a backend for ATSv2

2020-07-20 Thread Sushil Ks (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-9016:

Description: 
h1. Document Store for ATSv2

The Document Store for ATSv2 is a framework for plugging in any Document Store 
vendor as a backend for ATSv2, e.g. Azure CosmosDB, MongoDB, ElasticSearch etc.
 * Supports multiple Document Store vendors like CosmosDB, ElasticSearch, 
MongoDB etc. by just adding new configuration properties and writing Document 
Store reader and writer clients.
 * Currently has support for CosmosDB.
 * All writes are async and buffered; the latest document is flushed to the 
store either when the document buffer gets full or periodically at every flush 
interval, in the background, without adding any additional latency to the 
running jobs.
 * All the REST APIs of the Timeline Reader Server are supported.

h3. *How to enable?*

Add the following properties under *yarn-site.xml*
{code:java}
<property>
  <name>yarn.timeline-service.writer.class</name>
  <value>org.apache.hadoop.yarn.server.timelineservice.documentstore.DocumentStoreTimelineWriterImpl</value>
</property>

<property>
  <name>yarn.timeline-service.reader.class</name>
  <value>org.apache.hadoop.yarn.server.timelineservice.documentstore.DocumentStoreTimelineReaderImpl</value>
</property>

<property>
  <name>yarn.timeline-service.document-store.db-name</name>
  <value>YOUR_DATABASE_NAME</value>
</property>
{code}
h3. *Creating DB and Collections for storing documents*

The following config needs to be set inside *yarn-site.xml* for creating the 
database and collections for storing documents.
{code:java}
<property>
  <name>yarn.timeline-service.schema-creator.class</name>
  <value>org.apache.hadoop.yarn.server.timelineservice.documentstore.DocumentStoreCollectionCreator</value>
</property>
{code}
Run the schema creator tool to create the necessary collections.
{code:java}
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator{code}
h3. *Azure CosmosDB*

To use Azure CosmosDB as a DocumentStore for ATSv2, the following additional 
properties under *yarn-site.xml* are required.
{code:java}
<property>
  <name>yarn.timeline-service.document-store-type</name>
  <value>COSMOS_DB</value>
</property>

<property>
  <name>yarn.timeline-service.document-store.cosmos-db.endpoint</name>
  <value>http://YOUR_AZURE_COSMOS_DB_URL:443/</value>
</property>

<property>
  <name>yarn.timeline-service.document-store.cosmos-db.masterkey</name>
  <value>YOUR_AZURE_COSMOS_DB_MASTER_KEY_CREDENTIAL</value>
</property>
{code}

*Testing locally*

In order to test Azure CosmosDB as a DocumentStore locally, install the 
emulator from 
[here|https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator] and 
start it locally. Set the endpoint and master key under *yarn-site.xml* as 
mentioned above and run any example job like DistributedShell. Later you can 
check the data explorer UI of Azure CosmosDB locally to query the documents, or 
even launch the *TimelineReader* locally to fetch/query the data from its REST 
APIs.

  was:
h1. Document Store for ATSv2

               The Document Store for ATSv2 is a framework for plugging in any 
Document Store Vendor as a backend for ATSv2 i.e Azure CosmosDB , MongoDB, 
ElasticSearch etc.
 * Supports multiple Document Store Vendors like CosmosDB, ElasticSearch, 
MongoDB etc by just adding new configurations properties and writing Document 
Store reader and writer clients.
 * Currently has support for CosmosDB.
 * All writes are Async and buffered, latest document would be flushed to the 
store either if the document buffer gets full or periodically at every flush 
interval in background without adding any additional latency to the running 
jobs..
 * All the REST API's of Timeline Reader Server are supported.

h3. *How to enable?*

Add the flowing properties under *yarn-site.xml*
{code:java}


 yarn.timeline-service.writer.class 
 
org.apache.hadoop.yarn.server.timelineservice.storage.documentstore.DocumentStoreTimelineWriterImpl



   yarn.timeline-service.reader.class  
org.apache.hadoop.yarn.server.timelineservice.storage.documentstore.DocumentStoreTimelineReaderImpl


 
   yarn.timeline-service.document-store.db-name   
   YOUR_DATABASE_NAME  
{code}
h3. *Creating DB and Collections for storing documents*

                      The following config needs to be set inside 
*yarn-site.xml* for creating the database and collections for storing documents.
{code:java}


 yarn.timeline-service.schema-creator.class 
 
org.apache.hadoop.yarn.server.timelineservice.documentstore.DocumentStoreCollectionCreator
{code}
            Running the schema creator tool to create the necessary collections.
{code:java}
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator{code}
h3.  *Azure CosmosDB* 

       To use Azure CosmosDB as a DocumentStore for ATSv2, the additional 
properties under *yarn-site.xml* is required..
{code:java}

  
   yarn.timeline-service.document-store-type  
   COSMOS_DB



   yarn.timeline-service.document-store.cosmos-db.endpoint
   

[jira] [Comment Edited] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160936#comment-17160936
 ] 

Bibin Chundatt edited comment on YARN-10352 at 7/20/20, 6:27 AM:
-

[~prabhujoseph]

With the current approach we are iterating through all the nodes in the 
partition twice.

We could filter out the nodes during the {{reSortClusterNodes}} iteration 
rather than creating a list and then iterating over it all again. Thoughts?
One more additional filter on {{preferrednodeIterator}} while querying nodes 
per schedulerKey would reduce the node selection done during the 5 sec sorting 
interval.

Iterators.filter(iterator, 


was (Author: bibinchundatt):
[~prabhujoseph]

With current approach we are iterating through all the nodes 2 times in the 
partition.

We could filter out the nodes during the {{reSortClusterNodes}} iteration that 
creating a list then iterating it all over it again. thoughts ?
 One more additional filter to {{preferrednodeIterator}} while querying nodes 
per schedulerKey would reduce the node selection being done during sorting 
interval of 5 sec.

Iterators.filter(iterator, 

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch
>
>
> When Node Recovery is enabled, stopping a NM won't unregister it from the RM, 
> so RM Active Nodes will still contain those stopped nodes until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> the nodes which have not heartbeated for the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160936#comment-17160936
 ] 

Bibin Chundatt commented on YARN-10352:
---

[~prabhujoseph]

With the current approach we are iterating through all the nodes in the 
partition twice.

We could filter out the nodes during the {{reSortClusterNodes}} iteration 
rather than creating a list and then iterating over it all again. Thoughts?
One more additional filter on {{preferrednodeIterator}} while querying nodes 
per schedulerKey would reduce the node selection done during the 5 sec sorting 
interval.

Iterators.filter(iterator, 

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch
>
>
> When Node Recovery is enabled, stopping a NM won't unregister it from the RM, 
> so RM Active Nodes will still contain those stopped nodes until the NM 
> Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> the nodes which have not heartbeated for the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org