[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2021-04-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316489#comment-17316489
 ] 

Jim Brennan commented on YARN-10475:


[~chaosju] thanks for your comment.  The implementation we provided here is 
using overall cluster utilization vs node utilization to adjust the heartbeat 
so that under-utilized nodes get more scheduling opportunities.  Note that this 
feature was developed internally on branch-2 before the global scheduler was 
added.   It has worked well to help keep our nodes more evenly utilized. 

I think that other metrics for scaling the heartbeat are definitely worth 
exploring, which is why we filed [YARN-10478] to make it pluggable.  That would 
be a good place to make suggestions for alternate approaches.
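
For illustration only, here is a minimal sketch of the kind of utilization-based 
scaling described above.  The method and parameter names and the min/max clamping 
are assumptions made for this sketch, not the code from the YARN-10475 patch.
{code:java}
// Sketch only (assumed names/bounds): scale the RM-NM heartbeat interval by the
// ratio of this node's CPU utilization to the overall cluster CPU utilization.
public final class HeartbeatScalingSketch {
  private HeartbeatScalingSketch() { }

  static long computeHeartbeatIntervalMs(float nodeCpuUtil, float clusterCpuUtil,
      long baseIntervalMs, long minIntervalMs, long maxIntervalMs) {
    if (clusterCpuUtil <= 0.0f) {
      return baseIntervalMs; // no cluster-wide data yet; fall back to the default
    }
    // Over-utilized node => ratio > 1 => longer interval (fewer scheduling
    // opportunities); under-utilized node => ratio < 1 => shorter interval.
    float ratio = nodeCpuUtil / clusterCpuUtil;
    long scaled = (long) (baseIntervalMs * ratio);
    return Math.max(minIntervalMs, Math.min(maxIntervalMs, scaled));
  }
}
{code}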


> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger, faster nodes on our cluster were under-utilized compared to 
> smaller, slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316478#comment-17316478
 ] 

Jim Brennan commented on YARN-10702:


Thanks again [~ebadger]!  I put up additional patches for branch-3.2 and 
branch-3.1. 


> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.1.006.patch, 
> YARN-10702-branch-3.2.006.patch, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-07 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702-branch-3.2.006.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.1.006.patch, 
> YARN-10702-branch-3.2.006.patch, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-07 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702-branch-3.1.006.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.1.006.patch, 
> YARN-10702-branch-3.2.006.patch, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315597#comment-17315597
 ] 

Jim Brennan commented on YARN-10702:


The failed unit test for branch-3.3 is unrelated.  Looks like it was fixed in 
[YARN-10337], which was only committed to trunk.


> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315196#comment-17315196
 ] 

Jim Brennan commented on YARN-10702:


Thanks [~ebadger]!  I have put up a patch for branch-3.3.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-05 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702-branch-3.3.006.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314912#comment-17314912
 ] 

Jim Brennan commented on YARN-10702:


Not going to fix these checkstyle issues, because the new variables match the 
pattern for the others.
{noformat}

./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java:66:
rmEventProcCPUAvg;:5: Variable 'rmEventProcCPUAvg' must be private and have 
accessor methods. [VisibilityModifier]
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java:68:
rmEventProcCPUMax;:5: Variable 'rmEventProcCPUMax' must be private and have 
accessor methods. [VisibilityModifier]
 {noformat}

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, YARN-10702.006.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314651#comment-17314651
 ] 

Jim Brennan commented on YARN-10702:


patch 006 addresses the checkstyle/spotbug issues.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, YARN-10702.006.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-04 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.006.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, YARN-10702.006.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314086#comment-17314086
 ] 

Jim Brennan commented on YARN-10702:


Patch 005 adds a configuration property for this:
{noformat}
<property>
  <description>
    Resource manager dispatcher thread cpu monitor sampling rate.
    Units are samples per minute.  This controls how often to sample
    the cpu utilization of the resource manager dispatcher thread.
    The cpu utilization is displayed on the RM UI as scheduler busy %.
    Set this to zero to disable the dispatcher thread monitor.  Defaults
    to 60 samples per minute.
  </description>
  <name>yarn.dispatcher.cpu-monitor.samples-per-min</name>
  <value>60</value>
</property>
{noformat}
If it is disabled by setting this property to zero, the UI shows "N/A" for the 
Scheduler Busy value, to distinguish it from 0, which is a valid avg cpu usage 
for the thread on a lightly loaded cluster.
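
As a rough illustration of how such a monitor can work (a hedged sketch using 
java.lang.management, not the actual YARN-10702 code; class and field names are 
assumptions), the dispatcher thread's CPU time can be sampled periodically and 
turned into a busy percentage:
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch only: sample a target thread's CPU time via ThreadMXBean and report the
// fraction of wall-clock time it was on-CPU over each sampling interval.
public class ThreadCpuSamplerSketch implements Runnable {
  private final long targetThreadId;
  private final long sampleIntervalMs;
  private volatile float busyPercent; // last computed "scheduler busy %"

  public ThreadCpuSamplerSketch(long targetThreadId, int samplesPerMinute) {
    this.targetThreadId = targetThreadId;
    this.sampleIntervalMs = 60_000L / Math.max(1, samplesPerMinute);
  }

  public float getBusyPercent() {
    return busyPercent;
  }

  @Override
  public void run() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long lastCpuNs = mx.getThreadCpuTime(targetThreadId); // -1 if unsupported
    long lastWallNs = System.nanoTime();
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(sampleIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      long cpuNs = mx.getThreadCpuTime(targetThreadId);
      long wallNs = System.nanoTime();
      if (cpuNs >= 0 && lastCpuNs >= 0 && wallNs > lastWallNs) {
        busyPercent = 100.0f * (cpuNs - lastCpuNs) / (wallNs - lastWallNs);
      }
      lastCpuNs = cpuNs;
      lastWallNs = wallNs;
    }
  }
}
{code}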

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.005.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307114#comment-17307114
 ] 

Jim Brennan commented on YARN-10697:


Thanks for the update [~BilwaST]!  I am +1 on patch 003.  [~epayne], [~jhung], 
if there are no objections I will commit this later today.


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, YARN-10697.002.patch, 
> YARN-10697.003.patch, image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306489#comment-17306489
 ] 

Jim Brennan edited comment on YARN-10697 at 3/22/21, 7:02 PM:
--

Thanks for the update [~BilwaST]!
(edited) patch 002 looks mostly good, but can you please rename getResources()? 
 There is already a public Resource.getResources(), and the two functions are 
completely different.
Maybe the private one should be called getFormattedString()?   The new public 
one could also be getFormattedString().




was (Author: jim_brennan):
Thanks for the update [~BilwaST]!  +1 patch 002 looks good to me.


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, YARN-10697.002.patch, 
> image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306489#comment-17306489
 ] 

Jim Brennan commented on YARN-10697:


Thanks for the update [~BilwaST]!  +1 patch 002 looks good to me.


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, YARN-10697.002.patch, 
> image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305110#comment-17305110
 ] 

Jim Brennan commented on YARN-10702:


Thanks for the suggestions [~gandras]!  I agree this should be configurable.  I 
will put up a new patch with those changes.

I don't think the new thread has a significant impact.  I wasn't trying to 
measure that, but when I was looking at an RM recently where the dispatcher 
thread was very busy, the monitoring thread did not appear to be a significant 
factor; it was popping up as using less than 10% of a single CPU for brief 
periods of time, IIRC.  I'll have to take a closer look.  But I think making the 
sampling rate configurable is a good idea.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304941#comment-17304941
 ] 

Jim Brennan commented on YARN-10697:


{quote}
So can we introduce a new method in Resource.java which can print it in 
MB|GB|TB?
{quote}
[~BilwaST] I think that is a good suggestion.  There are places where this 
format would be nice.
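
For example, a helper along these lines would render a memory value as MB, GB, or 
TB (illustrative only; the method name, rounding, and cutoffs are assumptions, not 
the committed API):
{code:java}
// Sketch only: format a memory value given in MB as MB, GB, or TB.
public final class MemoryFormatSketch {
  private MemoryFormatSketch() { }

  static String formatMemoryMB(long memoryMB) {
    if (memoryMB >= 1024L * 1024L) {
      return String.format("%.2f TB", memoryMB / (1024.0 * 1024.0));
    } else if (memoryMB >= 1024L) {
      return String.format("%.2f GB", memoryMB / 1024.0);
    }
    return memoryMB + " MB";
  }
}
{code}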


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.004.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304250#comment-17304250
 ] 

Jim Brennan commented on YARN-10702:


Jumped the gun.  Patch 004 has fixes for the other checkstyle issues.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304240#comment-17304240
 ] 

Jim Brennan commented on YARN-10702:


Thanks for the review [~zhuqi]!  patch 003 fixes the method names as suggested.


> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.003.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303747#comment-17303747
 ] 

Jim Brennan commented on YARN-10697:


[~BilwaST] I agree about the bug in MetricsOverviewTable.render().  Unless I am 
misunderstanding, the else case is improperly using bytes where it should be 
using MB.  I am not sure about the change to Resource.toString() though.  That 
is used in a lot of places and I am not sure if all of those places would 
prefer the terser MB|GB|TB format.  [~epayne], [~jhung] what do you think?


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303724#comment-17303724
 ] 

Jim Brennan commented on YARN-10702:


patch 002 is rebased to current trunk.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.002.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: Scheduler-Busy.png

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: simon-scheduler-busy.png

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303697#comment-17303697
 ] 

Jim Brennan commented on YARN-10702:


Attaching some images of how this looks on the RM legacy UI and also the new 
metrics in simon.
 !Scheduler-Busy.png! 



> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.001.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10702.001.patch
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10702:
--

 Summary: Add cluster metric for amount of CPU used by RM Event 
Processor
 Key: YARN-10702
 URL: https://issues.apache.org/jira/browse/YARN-10702
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


Add a cluster metric to track the cpu usage of the ResourceManager Event 
Processing thread.   This lets us know when the critical path of the RM is 
running out of headroom.
This feature was originally added for us internally by [~nroberts] and we've 
been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300968#comment-17300968
 ] 

Jim Brennan commented on YARN-10588:


Yes.  I think it is ok to do it as a separate Jira.  We may actually want to 
keep {{isAllInvalidDivisor}} as is, and change {{isInvalidDivisor}} to only 
consider the countable resources.

> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch, YARN-10588.004.patch
>
>
> Steps to reproduce:
> Configure below property in resource-types.xml
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job
> In the UI you can see that % Of Queue and % Of Cluster are zero for the 
> submitted application.
>  
> This is because SchedulerApplicationAttempt has the below check for 
> calculating queueUsagePerc and clusterUsagePerc:
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true as gpu resource is 0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10687) Add option to disable/enable free disk space checking and percentage checking for full and not-full disks

2021-03-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300369#comment-17300369
 ] 

Jim Brennan commented on YARN-10687:


Thanks for the updates [~zhuqi]!  I am +1 on patch 004.   I will commit this 
later today.

> Add option to disable/enable free disk space checking and percentage checking 
> for full and not-full disks
> -
>
> Key: YARN-10687
> URL: https://issues.apache.org/jira/browse/YARN-10687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10687.001.patch, YARN-10687.002.patch, 
> YARN-10687.003.patch, YARN-10687.004.patch
>
>
> Now the two options:
> max-disk-utilization-per-disk-percentage
> min-free-space-per-disk-mb
> for the full/not-full disk check are always enabled. I think it's more 
> reasonable to allow enabling or disabling each of them, with the default 
> being all enabled.
>  
> In our clusters, when the disk is very large we want to use 
> min-free-space-per-disk-mb.
> In our clusters, when the disk is very small we want to use 
> max-disk-utilization-per-disk-percentage.
>  
> We should make this more reasonable and less confusing.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10687) Add option to disable/enable free disk space checking and percentage checking for full and not-full disks

2021-03-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299827#comment-17299827
 ] 

Jim Brennan commented on YARN-10687:


Thanks for updating [~zhuqi]!  Can you please add a unit test to 
TestDirectoryCollection?  Also, did you see [~gandras]'s suggestion about 
changing the property names?  e.g., {{disk-utilization-percentage.enabled}}?

> Add option to disable/enable free disk space checking and percentage checking 
> for full and not-full disks
> -
>
> Key: YARN-10687
> URL: https://issues.apache.org/jira/browse/YARN-10687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10687.001.patch, YARN-10687.002.patch
>
>
> Now the two options:
> max-disk-utilization-per-disk-percentage
> min-free-space-per-disk-mb
> for the full/not-full disk check are always enabled. I think it's more 
> reasonable to allow enabling or disabling each of them, with the default 
> being all enabled.
>  
> In our clusters, when the disk is very large we want to use 
> min-free-space-per-disk-mb.
> In our clusters, when the disk is very small we want to use 
> max-disk-utilization-per-disk-percentage.
>  
> We should make this more reasonable and less confusing.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU related metrics.

2021-03-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299638#comment-17299638
 ] 

Jim Brennan commented on YARN-10688:


[~zhuqi] we are very interested in this feature.  [~ebadger] can you take a 
look?


> ClusterMetrics should support GPU related metrics.
> --
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, resourcemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10688.001.patch, image-2021-03-11-15-35-49-625.png
>
>
> Now the ClusterMetrics only supports memory- and Vcore-related metrics.
>  
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>  
>  
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster, we added GPU support, so I think the GPU-related metrics 
> should also be supported by ClusterMetrics.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  
>  
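
Presumably the addition would mirror the existing memory/vcore gauges quoted above; 
a hedged sketch of the field declarations (metric and field names are assumptions, 
not the committed patch):
{code:java}
// Sketch only: GPU counterparts to the existing gauges in ClusterMetrics,
// using the same org.apache.hadoop.metrics2 annotations and gauge type.
@Metric("GPU Utilization") MutableGaugeLong utilizedGPUs;
@Metric("GPU Capability") MutableGaugeLong capabilityGPUs;
{code}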



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10687) Add option to disable/enable free disk space checking and percentage checking for full and not-full disks

2021-03-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299613#comment-17299613
 ] 

Jim Brennan commented on YARN-10687:


Thanks [~zhuqi]!  These are not strictly needed, because the values can be set 
to effectively enable/disable each threshold.  But I agree that having these 
makes the configuration options easier to understand.  In addition to the name 
changes recommended by [~gandras], we should also update the descriptions for 
the threshold properties to indicate that they only apply when the 
corresponding {{enabled}} property is true.
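
A minimal sketch of the kind of guard being discussed is below; the field names 
are placeholders loosely based on the naming suggestion above, not necessarily 
what was committed:
{code:java}
// Sketch only: apply each disk-fullness threshold only when its corresponding
// "enabled" flag is set; a disk is considered full if any enabled check trips.
public class DiskCheckSketch {
  private final boolean utilizationPercentageCheckEnabled;
  private final boolean freeSpaceCheckEnabled;
  private final float maxUtilizationPercent;
  private final long minFreeSpaceMB;

  public DiskCheckSketch(boolean utilEnabled, boolean freeEnabled,
      float maxUtilPercent, long minFreeMB) {
    this.utilizationPercentageCheckEnabled = utilEnabled;
    this.freeSpaceCheckEnabled = freeEnabled;
    this.maxUtilizationPercent = maxUtilPercent;
    this.minFreeSpaceMB = minFreeMB;
  }

  boolean isDiskFull(float usedPercent, long freeSpaceMB) {
    boolean overPercent =
        utilizationPercentageCheckEnabled && usedPercent > maxUtilizationPercent;
    boolean underFreeSpace =
        freeSpaceCheckEnabled && freeSpaceMB < minFreeSpaceMB;
    return overPercent || underFreeSpace;
  }
}
{code}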


> Add option to disable/enable free disk space checking and percentage checking 
> for full and not-full disks
> -
>
> Key: YARN-10687
> URL: https://issues.apache.org/jira/browse/YARN-10687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10687.001.patch
>
>
> Now the two options:
> max-disk-utilization-per-disk-percentage
> min-free-space-per-disk-mb
> for the full/not-full disk check are always enabled. I think it's more 
> reasonable to allow enabling or disabling each of them, with the default 
> being all enabled.
>  
> In our clusters, when the disk is very large we want to use 
> min-free-space-per-disk-mb.
> In our clusters, when the disk is very small we want to use 
> max-disk-utilization-per-disk-percentage.
>  
> We should make this more reasonable and less confusing.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298991#comment-17298991
 ] 

Jim Brennan commented on YARN-10588:


This looks good, but I have one question about 
{{DominantResourceCalculator.isAllInvalidDivisor()}}.
Looking at the {{divide()}} and {{ratio()}} methods for that class, I wonder if 
we should be looping over the first 
{{ResourceUtils.getNumberOfCountableResourceTypes()}} resource types instead of 
all of them.  It may be moot at this point, if all resource types are 
countable.  But if there are non-countable resource types, I think it would be 
incorrect as-is.
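
A hedged sketch of the variant being suggested (not the committed code): treat the 
divisor as invalid only if every countable resource is zero, looping over the first 
getNumberOfCountableResourceTypes() entries rather than all resource types:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

// Sketch only: an isAllInvalidDivisor-style check restricted to countable resources.
public final class InvalidDivisorSketch {
  private InvalidDivisorSketch() { }

  static boolean isAllInvalidDivisor(Resource r) {
    int countable = ResourceUtils.getNumberOfCountableResourceTypes();
    for (int i = 0; i < countable; i++) {
      if (r.getResourceInformation(i).getValue() != 0L) {
        return false; // at least one countable resource can serve as a divisor
      }
    }
    return true;
  }
}
{code}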


> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch, YARN-10588.004.patch
>
>
> Steps to reproduce:
> Configure below property in resource-types.xml
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job
> In the UI you can see that % Of Queue and % Of Cluster are zero for the 
> submitted application.
>  
> This is because SchedulerApplicationAttempt has the below check for 
> calculating queueUsagePerc and clusterUsagePerc:
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true as gpu resource is 0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297701#comment-17297701
 ] 

Jim Brennan commented on YARN-10664:


Thanks for the reviews and the commits [~ebadger]!

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch, 
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this when specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.
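
For illustration, a hedged sketch of what such expansion could look like (not the 
YARN-10664 patch; the class name and regex handling are assumptions): expand 
${VAR} references in an admin-user-env value against an environment map, leaving 
unknown variables untouched.
{code:java}
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: substitute ${VAR} occurrences using the supplied environment map.
public final class AdminEnvExpansionSketch {
  private static final Pattern VAR =
      Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

  private AdminEnvExpansionSketch() { }

  static String expand(String value, Map<String, String> env) {
    Matcher m = VAR.matcher(value);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String replacement = env.getOrDefault(m.group(1), m.group(0));
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }
}
{code}
For example, with JAVA_HOME present in the map, a value like 
"-Xloggc:${JAVA_HOME}/logs/gc.log" would have ${JAVA_HOME} replaced with the 
container's JAVA_HOME path.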



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297522#comment-17297522
 ] 

Jim Brennan commented on YARN-10664:


Thanks [~ebadger]!  I have put up a patch for branch-3.2.  I'm not sure it is 
worth trying to pull back further than that, because we start running into more 
conflicts with other changes that have not been pulled back.


> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch, 
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this when specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664-branch-3.2.004.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch, 
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2021-03-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296093#comment-17296093
 ] 

Jim Brennan commented on YARN-8786:
---

I am ok with closing it. 

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this; the 
> only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295623#comment-17295623
 ] 

Jim Brennan commented on YARN-10664:


Thanks for the review [~ebadger]!  I have put up patch 004 to address the 
concern about expanding the keystore/truststore keys.

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.004.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295589#comment-17295589
 ] 

Jim Brennan commented on YARN-10664:


It would have to be an environment variable that contains a variable denoted 
with the expansion characters, double squigglies.  Looking through 
{{sanitizeEnv()}}, I don't see anything that would potentially include those, 
other than {{NM_ADMIN_USER_ENV}} itself.   If someone was using those in an 
{{NM_ADMIN_USER_ENV}}-defined variable, and depending on that not being 
expanded, that would be a problem.  But I don't think that is likely.

Looking back at the {{call()}} method, there might be a problem there.  There 
is code that adds {{KEYSTORE_PASSWORD_ENV_NAME}} to the environment.  If it is 
possible for that value to have those characters, that might cause an unwanted 
expansion.  Might need to move the setting of those truststore variables after 
the call to {{expandEnvironment}}.
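
For illustration, a minimal, hedged sketch of that ordering concern (the class, method, and variable names here are hypothetical, not the actual {{ContainerLaunch}} code): expand the admin-supplied values first, and only then add secrets verbatim so they are never run through the expansion step.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hedged sketch only; not the actual ContainerLaunch implementation.
public class EnvOrderingSketch {
  // Replace {{VAR}} references in a value with the current value of VAR.
  static String expand(String value, Map<String, String> env) {
    for (Map.Entry<String, String> e : env.entrySet()) {
      value = value.replace("{{" + e.getKey() + "}}", e.getValue());
    }
    return value;
  }

  static Map<String, String> buildEnv(Map<String, String> baseEnv,
                                      Map<String, String> adminEnv,
                                      String keystorePassword) {
    Map<String, String> env = new HashMap<>(baseEnv);
    // 1. Expand admin-supplied values against the environment first.
    for (Map.Entry<String, String> e : adminEnv.entrySet()) {
      env.put(e.getKey(), expand(e.getValue(), env));
    }
    // 2. Only then add secrets verbatim, so a password that happens to contain
    //    "{{" is never subject to expansion.
    env.put("KEYSTORE_PASSWORD", keystorePassword);
    return env;
  }
}
{code}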


> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10665) TestContainerManagerRecovery sometimes fails

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295541#comment-17295541
 ] 

Jim Brennan commented on YARN-10665:


I am going to close this as invalid.  I have not been able to reproduce since 
rebooting my mac, and on further analysis, I believe the hard-coded value of 
49160 was explicitly chosen to be beyond the start of the ephemeral range for 
all platforms.


> TestContainerManagerRecovery sometimes fails
> 
>
> Key: YARN-10665
> URL: https://issues.apache.org/jira/browse/YARN-10665
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10665.001.patch
>
>
> TestContainerManagerRecovery sometimes fails when I run it on the mac because 
> it cannot bind to a port.   I believe this is because it calls getPort with a 
> hard-coded port number (49160) instead of just passing zero.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.003.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295518#comment-17295518
 ] 

Jim Brennan commented on YARN-10664:


Trying to put up patch 003, which is identical to patch 002, to see if it 
triggers precommit builds.


> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10665) TestContainerManagerRecovery sometimes fails

2021-03-03 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10665:
---
Summary: TestContainerManagerRecovery sometimes fails  (was: 
TestContainerManagerRecover sometimes fails)

> TestContainerManagerRecovery sometimes fails
> 
>
> Key: YARN-10665
> URL: https://issues.apache.org/jira/browse/YARN-10665
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
>
> TestContainerManagerRecovery sometimes fails when I run it on the mac because 
> it cannot bind to a port.   I believe this is because it calls getPort with a 
> hard-coded port number (49160) instead of just passing zero.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294032#comment-17294032
 ] 

Jim Brennan commented on YARN-10664:


Thanks for the review [~ebadger]!  I put up patch 002 to remove the 
TestContainerManagerRecovery change.   I have filed [YARN-10665] to address 
that issue.

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10665) TestContainerManagerRecover sometimes fails

2021-03-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10665:
--

 Summary: TestContainerManagerRecover sometimes fails
 Key: YARN-10665
 URL: https://issues.apache.org/jira/browse/YARN-10665
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


TestContainerManagerRecovery sometimes fails when I run it on the mac because 
it cannot bind to a port.   I believe this is because it calls getPort with a 
hard-coded port number (49160) instead of just passing zero.
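
As a rough illustration of the "just passing zero" alternative (a generic Java sketch, not the test code itself), binding to port 0 lets the OS pick a free ephemeral port:

{code:java}
import java.io.IOException;
import java.net.ServerSocket;

// Hedged sketch: let the OS choose a free ephemeral port instead of a
// hard-coded one, so concurrent test runs cannot collide on the same port.
public class EphemeralPortSketch {
  public static void main(String[] args) throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      System.out.println("Bound to free port " + socket.getLocalPort());
    }
  }
}
{code}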





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.002.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.001.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this in specifying different java gc options for java 
> processes running inside yarn containers based on which version of java is 
> being used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10664:
--

 Summary: Allow parameter expansion in NM_ADMIN_USER_ENV
 Key: YARN-10664
 URL: https://issues.apache.org/jira/browse/YARN-10664
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
expansion.  That is, you cannot specify an environment variable such as 
{code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside the 
container.

We have a need for this in specifying different java gc options for java 
processes running inside yarn containers based on which version of java is 
being used.
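
As a rough illustration of the requested expansion (a hedged sketch, not the YARN implementation), turning a {{VAR}} reference in an admin-configured value into a shell-style $VAR reference for the container might look like this:

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hedged sketch only; not the actual YARN expansion code.
public class AdminEnvExpansionSketch {
  private static final Pattern VAR = Pattern.compile("\\{\\{(\\w+)\\}\\}");

  // Rewrite {{VAR}} references into $VAR so the shell expands them at launch.
  static String expandForShell(String value) {
    Matcher m = VAR.matcher(value);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      m.appendReplacement(out, Matcher.quoteReplacement("$" + m.group(1)));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    // e.g. an admin value like "-Djava.home={{JAVA_HOME}}" would become
    // "-Djava.home=$JAVA_HOME" inside the container.
    System.out.println(expandForShell("-Djava.home={{JAVA_HOME}}"));
  }
}
{code}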




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291064#comment-17291064
 ] 

Jim Brennan commented on YARN-10613:


[~epayne], I have committed to trunk and branch-3.3, but the patch does not 
work for branch-3.2.  I can get it to apply, but then compilation fails.  Can 
you put up a patch for branch-3.2, and branch-3.1 if needed?

 

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-10613.branch-2.10.002.patch, 
> YARN-10613.trunk.001.patch, YARN-10613.trunk.002.patch
>
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290520#comment-17290520
 ] 

Jim Brennan commented on YARN-10613:


Thanks for the update [~epayne].  This looks good to me.  +1 on patch 002.

I will wait for the pre-commit build to finish before committing this.

 

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-10613.trunk.001.patch, YARN-10613.trunk.002.patch
>
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290079#comment-17290079
 ] 

Jim Brennan commented on YARN-10613:


Thanks [~epayne]! The patch looks good to me but for two minor issues:
 # In ProportionalCapacityPreemptionPolicy, I think you need to add a {{"\n"}} 
between the new lines.
 # In the new inter-queue test, I think you should explicitly set the property 
to true instead of relying on 
{{DEFAULT_CROSS_QUEUE_PREEMPTION_CONSERVATIVE_DRF}} to be true (line 237).  
Same applies for the first part of the Intra-Queue test, you should explicitly 
set the property to true - it's currently not set at all.

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-10613.trunk.001.patch
>
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289400#comment-17289400
 ] 

Jim Brennan commented on YARN-10613:


I think you should stick with INTRA_QUEUE_PREEMPTION_CONFIG_PREFIX instead of 
adding the new one.
We already have that redundancy for other configs.

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289325#comment-17289325
 ] 

Jim Brennan commented on YARN-10613:


I like your proposal for the more easily distinguished property names.


> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10626) Log resource allocation in NM log at container start time

2021-02-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285299#comment-17285299
 ] 

Jim Brennan commented on YARN-10626:


+1. This looks good to me [~ebadger]!  I agree we don't need a unit test.  I 
will commit today.


> Log resource allocation in NM log at container start time
> -
>
> Key: YARN-10626
> URL: https://issues.apache.org/jira/browse/YARN-10626
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10626.001.patch, YARN-10626.002.patch
>
>
> As far as I can tell, there are no resource allocation logs in the NM log for 
> the various containers that are scheduled. These can be useful when trying to 
> debug what resources were requested vs what resources were actually 
> allocated. This is especially useful when debugging upstream technology 
> changes to make sure that they are correctly interpreting and passing down 
> resource parameters



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5853) TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently on Power

2021-02-11 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-5853.
---
Resolution: Duplicate

This is fixed by YARN-10500

> TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently 
> on Power
> --
>
> Key: YARN-5853
> URL: https://issues.apache.org/jira/browse/YARN-5853
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha1
> Environment: # uname -a
> Linux pts00452-vm10 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 
> 2015 ppc64le ppc64le ppc64le GNU/Linux
> # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: Yussuf Shaikh
>Priority: Major
>
> The test testRMRestartWithExpiredToken fails intermittently with the 
> following error:
> Stacktrace:
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken(TestDelegationTokenRenewer.java:1060)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently

2021-02-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283357#comment-17283357
 ] 

Jim Brennan commented on YARN-10500:


Thanks for the update [~iwasakims]!  I have committed to trunk and will 
cherry-pick to other branches.


> TestDelegationTokenRenewer fails intermittently
> ---
>
> Key: YARN-10500
> URL: https://issues.apache.org/jira/browse/YARN-10500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Masatake Iwasaki
>Priority: Major
>  Labels: flaky-test, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestDelegationTokenRenewer sometimes timeouts.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 83.675 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] 
> testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer)
>   Time elapsed: 30.065 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently

2021-02-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282670#comment-17282670
 ] 

Jim Brennan commented on YARN-10500:


Actually I missed noticing there were some check-style issues.  [~iwasakims], 
can you please fix those?  And while you are at it, there is an unneeded 
{{throws Exception}} on {{testShutdown()}}.  Can you remove that as well?


> TestDelegationTokenRenewer fails intermittently
> ---
>
> Key: YARN-10500
> URL: https://issues.apache.org/jira/browse/YARN-10500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Masatake Iwasaki
>Priority: Major
>  Labels: flaky-test, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestDelegationTokenRenewer sometimes timeouts.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 83.675 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] 
> testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer)
>   Time elapsed: 30.065 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently

2021-02-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282664#comment-17282664
 ] 

Jim Brennan commented on YARN-10500:


+1 Thanks for fixing this [~iwasakims]!   I will commit shortly.


> TestDelegationTokenRenewer fails intermittently
> ---
>
> Key: YARN-10500
> URL: https://issues.apache.org/jira/browse/YARN-10500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Masatake Iwasaki
>Priority: Major
>  Labels: flaky-test, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestDelegationTokenRenewer sometimes timeouts.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 83.675 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] 
> testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer)
>   Time elapsed: 30.065 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-02-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279997#comment-17279997
 ] 

Jim Brennan commented on YARN-10588:


Thanks for reporting this and putting up the patch [~BilwaST]!  I wonder if a 
better fix would be to change {{DominantResourceCalculator.isInvalidDivisor()}} 
to be consistent with  {{DominantResourceCalculator.divide()}}?   Currently it 
returns true if any resource is zero, while {{divide}} is only going to return 
zero if all of the countable ones are zero.

[~epayne] what do you think?
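
A minimal sketch of what that consistency change might look like (hypothetical code, not the actual {{DominantResourceCalculator}}): treat a divisor as invalid only when every countable resource is zero, matching how {{divide}} behaves.

{code:java}
// Hedged sketch only; not the actual DominantResourceCalculator code.
public class InvalidDivisorSketch {
  static boolean isInvalidDivisor(long[] resourceValues) {
    for (long v : resourceValues) {
      if (v != 0) {
        return false; // at least one non-zero resource, division is meaningful
      }
    }
    return true; // every resource is zero, nothing to divide by
  }

  public static void main(String[] args) {
    // A cluster with memory and vcores but zero GPUs should not be "invalid".
    System.out.println(isInvalidDivisor(new long[] {1024, 8, 0})); // false
    System.out.println(isInvalidDivisor(new long[] {0, 0, 0}));    // true
  }
}
{code}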


> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch
>
>
> Steps to reproduce:
> Configure below property in resource-types.xml
> {code:java}
> 
>  yarn.resource-types
>  yarn.io/gpu
>  {code}
> Submit a job
> In UI you can see % Of Queue and % Of Cluster is zero for the submitted 
> application
>  
> This is because in SchedulerApplicationAttempt has below check for 
> calculating queueUsagePerc and clusterUsagePerc
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true as gpu resource is 0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10607) User environment is unable to prepend PATH when mapreduce.admin.user.env also sets PATH

2021-02-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279810#comment-17279810
 ] 

Jim Brennan commented on YARN-10607:


Thanks for the updates [~ebadger].  +1 This looks good to me.
I will commit today.

> User environment is unable to prepend PATH when mapreduce.admin.user.env also 
> sets PATH
> ---
>
> Key: YARN-10607
> URL: https://issues.apache.org/jira/browse/YARN-10607
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10607.001.patch, YARN-10607.002.patch, 
> YARN-10607.003.patch, YARN-10607.004.patch, YARN-10607.004.patch
>
>
> When using the tarball approach to ship relevant Hadoop jars to containers, 
> it is helpful to set {{mapreduce.admin.user.env}} to something like 
> {{PATH=./hadoop-tarball:\{\{PATH\}\}}} to make sure that all of the Hadoop 
> binaries are on the PATH. This way you can call {{hadoop}} instead of 
> {{./hadoop-tarball/hadoop}}. The intention here is to force prepend 
> {{./hadoop-tarball}} and then append the set {{PATH}} afterwards. But if a 
> user would like to override the appended portion of {{PATH}} in their 
> environment, they are unable to do so. This is because {{PATH}} ends up 
> getting parsed twice. Initially it is set via {{mapreduce.admin.user.env}} to 
> {{PATH=./hadoop-tarball:$SYS_PATH}}. In this case {{SYS_PATH}} is what I'll 
> refer to as the normal system path. E.g. {{/usr/local/bin:/usr/bin}}, etc.
> After this, the user env parsing happens. For example, let's say the user 
> sets their {{PATH}} to {{PATH=.:$PATH}}. We have already parsed {{PATH}} from 
> the admin.user.env. Then we go to parse the user environment and find the 
> user also specified {{PATH}}. So {{$PATH}} ends up getting expanded 
> to {{./hadoop-tarball:$SYS_PATH}}, which leads to the user's {{PATH}} being 
> {{PATH=.:./hadoop-tarball:$SYS_PATH}}. We then append this to {{PATH}}, which 
> has already been set in the environment map via the admin.user.env. So we 
> finally end up with 
> {{PATH=./hadoop-tarball:$SYS_PATH:.:./hadoop-tarball:$SYS_PATH}}. 
> This normally isn't a huge deal, but if you want to ship a version of 
> python/perl/etc. that clashes with the one that is already there in 
> {{SYS_PATH}}, you will need to refer to it by its full path, since in the 
> above example, {{.}} doesn't appear until after {{$SYS_PATH}}. This is a pain, 
> and it should be possible to prepend to {{PATH}} to override the 
> system/container {{SYS_PATH}}, even when also forcefully prepending to 
> {{PATH}} with your hadoop tarball.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10607) User environment is unable to prepend PATH when mapreduce.admin.user.env also sets PATH

2021-02-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279200#comment-17279200
 ] 

Jim Brennan commented on YARN-10607:


Thanks [~ebadger].  This looks good overall, but I have one comment.  The unit 
test you added only runs on Windows, which I think is correct, but the code in 
ContainerLaunch.sanitizeEnv() runs on
 windows as well.  I'm not sure this feature really makes sense on windows, and 
in particular this line is certainly not correct for windows:
{noformat}
Apps.addToEnvironment(environment, Environment.PATH.name(),
"$PATH", File.pathSeparator);
{noformat}
My suggestion is to put the force-path code inside a {{!Shell.WINDOWS}} check.
We might want to update the documentation as well to note it is ignored on 
windows.
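
Roughly, the suggested guard (a sketch built around the line quoted above, not the committed patch) would look like:

{code:java}
// Hedged sketch of the suggestion: skip the forced-PATH handling on Windows.
if (!Shell.WINDOWS) {
  Apps.addToEnvironment(environment, Environment.PATH.name(),
      "$PATH", File.pathSeparator);
}
{code}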

> User environment is unable to prepend PATH when mapreduce.admin.user.env also 
> sets PATH
> ---
>
> Key: YARN-10607
> URL: https://issues.apache.org/jira/browse/YARN-10607
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10607.001.patch, YARN-10607.002.patch, 
> YARN-10607.003.patch
>
>
> When using the tarball approach to ship relevant Hadoop jars to containers, 
> it is helpful to set {{mapreduce.admin.user.env}} to something like 
> {{PATH=./hadoop-tarball:\{\{PATH\}\}}} to make sure that all of the Hadoop 
> binaries are on the PATH. This way you can call {{hadoop}} instead of 
> {{./hadoop-tarball/hadoop}}. The intention here is to force prepend 
> {{./hadoop-tarball}} and then append the set {{PATH}} afterwards. But if a 
> user would like to override the appended portion of {{PATH}} in their 
> environment, they are unable to do so. This is because {{PATH}} ends up 
> getting parsed twice. Initially it is set via {{mapreduce.admin.user.env}} to 
> {{PATH=./hadoop-tarball:$SYS_PATH}}. In this case {{SYS_PATH}} is what I'll 
> refer to as the normal system path. E.g. {{/usr/local/bin:/usr/bin}}, etc.
> After this, the user env parsing happens. For example, let's say the user 
> sets their {{PATH}} to {{PATH=.:$PATH}}. We have already parsed {{PATH}} from 
> the admin.user.env. Then we go to parse the user environment and find the 
> user also specified {{PATH}}. So {{$PATH}} ends up getting expanded 
> to {{./hadoop-tarball:$SYS_PATH}}, which leads to the user's {{PATH}} being 
> {{PATH=.:./hadoop-tarball:$SYS_PATH}}. We then append this to {{PATH}}, which 
> has already been set in the environment map via the admin.user.env. So we 
> finally end up with 
> {{PATH=./hadoop-tarball:$SYS_PATH:.:./hadoop-tarball:$SYS_PATH}}. 
> This normally isn't a huge deal, but if you want to ship a version of 
> python/perl/etc. that clashes with the one that is already there in 
> {{SYS_PATH}}, you will need to refer to it by its full path, since in the 
> above example, {{.}} doesn't appear until after {{$SYS_PATH}}. This is a pain, 
> and it should be possible to prepend to {{PATH}} to override the 
> system/container {{SYS_PATH}}, even when also forcefully prepending to 
> {{PATH}} with your hadoop tarball.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10607) User environment is unable to prepend PATH when mapreduce.admin.user.env also sets PATH

2021-02-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279016#comment-17279016
 ] 

Jim Brennan commented on YARN-10607:


Thanks [~ebadger]!  Patch 002 seems to have an extraneous change to 
TestCapacitySchedulerMultiNodes.java.  Any idea where that came from?


> User environment is unable to prepend PATH when mapreduce.admin.user.env also 
> sets PATH
> ---
>
> Key: YARN-10607
> URL: https://issues.apache.org/jira/browse/YARN-10607
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10607.001.patch, YARN-10607.002.patch
>
>
> When using the tarball approach to ship relevant Hadoop jars to containers, 
> it is helpful to set {{mapreduce.admin.user.env}} to something like 
> {{PATH=./hadoop-tarball:\{\{PATH\}\}}} to make sure that all of the Hadoop 
> binaries are on the PATH. This way you can call {{hadoop}} instead of 
> {{./hadoop-tarball/hadoop}}. The intention here is to force prepend 
> {{./hadoop-tarball}} and then append the set {{PATH}} afterwards. But if a 
> user would like to override the appended portion of {{PATH}} in their 
> environment, they are unable to do so. This is because {{PATH}} ends up 
> getting parsed twice. Initially it is set via {{mapreduce.admin.user.env}} to 
> {{PATH=./hadoop-tarball:$SYS_PATH}}. In this case {{SYS_PATH}} is what I'll 
> refer to as the normal system path. E.g. {{/usr/local/bin:/usr/bin}}, etc.
> After this, the user env parsing happens. For example, let's say the user 
> sets their {{PATH}} to {{PATH=.:$PATH}}. We have already parsed {{PATH}} from 
> the admin.user.env. Then we go to parse the user environment and find the 
> user also specified {{PATH}}. So {{$PATH}} ends up getting expanded 
> to {{./hadoop-tarball:$SYS_PATH}}, which leads to the user's {{PATH}} being 
> {{PATH=.:./hadoop-tarball:$SYS_PATH}}. We then append this to {{PATH}}, which 
> has already been set in the environment map via the admin.user.env. So we 
> finally end up with 
> {{PATH=./hadoop-tarball:$SYS_PATH:.:./hadoop-tarball:$SYS_PATH}}. 
> This normally isn't a huge deal, but if you want to ship a version of 
> python/perl/etc. that clashes with the one that is already there in 
> {{SYS_PATH}}, you will need to refer to it by its full path, since in the 
> above example {{.}} doesn't appear until after {{$SYS_PATH}}. This is a pain, 
> and users should be able to prepend to their {{PATH}} to override the 
> system/container {{SYS_PATH}}, even when the hadoop tarball is also being 
> forcefully prepended to {{PATH}}.
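
The duplicated value is easy to reproduce with a small stand-alone sketch of the 
two-pass expansion (illustrative only: this is not the Hadoop launcher code, and 
the class name, the {{expand()}} helper, and the {{SYS_PATH}} stand-in are 
invented for the example):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class PathExpansionSketch {
  // Substitute $PATH with the value already in the env map (or the system
  // default), mimicking how each pass expands variable references.
  static String expand(String value, Map<String, String> env, String sysPath) {
    return value.replace("$PATH", env.getOrDefault("PATH", sysPath));
  }

  public static void main(String[] args) {
    String sysPath = "/usr/local/bin:/usr/bin";   // stand-in for SYS_PATH
    Map<String, String> env = new HashMap<>();

    // Pass 1: admin env (mapreduce.admin.user.env) sets PATH.
    env.put("PATH", expand("./hadoop-tarball:$PATH", env, sysPath));
    // PATH = ./hadoop-tarball:/usr/local/bin:/usr/bin

    // Pass 2: the user's PATH=.:$PATH is expanded against the admin value
    // and then appended to the PATH that is already in the map.
    String userValue = expand(".:$PATH", env, sysPath);
    env.put("PATH", env.get("PATH") + ":" + userValue);

    System.out.println(env.get("PATH"));
    // ./hadoop-tarball:/usr/local/bin:/usr/bin:.:./hadoop-tarball:/usr/local/bin:/usr/bin
  }
}
{code}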



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra-queue preemption to enable/disable conservativeDRF

2021-02-03 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278349#comment-17278349
 ] 

Jim Brennan commented on YARN-10613:


[~epayne] any reason we shouldn't add a property for inter-queue-preemption as 
well, so that both are configurable?

> Config to allow Intra-queue preemption to  enable/disable conservativeDRF
> -
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used (memory, vcores) | User Limit (memory, vcores) |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of the following config property:
> {code:xml}
>   <property>
>     <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
>     <value>true</value>
>   </property>
> {code}
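
To make the effect of the flag concrete, here is a simplified two-resource 
sketch of the decision for the example above (an illustration only, not the 
actual {{CapacitySchedulerPreemptionUtils}} code; the class and method names 
are invented):

{code:java}
public class ConservativeDrfSketch {
  static final class Resource {
    final long memoryGB;
    final long vcores;
    Resource(long memoryGB, long vcores) {
      this.memoryGB = memoryGB;
      this.vcores = vcores;
    }
  }

  static boolean mayPreempt(Resource used, Resource userLimit,
                            boolean conservativeDRF) {
    boolean memoryOver = used.memoryGB > userLimit.memoryGB;
    boolean vcoresOver = used.vcores > userLimit.vcores;
    // Conservative: every major resource must exceed the user limit.
    // Non-conservative: a single resource over the limit is enough.
    return conservativeDRF ? (memoryOver && vcoresOver)
                           : (memoryOver || vcoresOver);
  }

  public static void main(String[] args) {
    Resource used = new Resource(58, 58);    // <58GB, 58 vcores>
    Resource limit = new Resource(30, 300);  // <30GB, 300 vcores>
    System.out.println(mayPreempt(used, limit, true));   // false: no preemption
    System.out.println(mayPreempt(used, limit, false));  // true: preemption allowed
  }
}
{code}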



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Follow up changes for YARN-9833

2021-01-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267945#comment-17267945
 ] 

Jim Brennan commented on YARN-10562:


Thanks [~ebadger]!

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: resourcemanager
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.
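
The view-versus-copy distinction this description hinges on can be reproduced 
outside YARN with a tiny stand-alone sketch (illustrative only: {{localDirs}} 
and the clear step below are stand-ins for what {{checkDirs()}} does, and the 
plain {{ArrayList}} copy plays the role of {{ImmutableList.copyOf()}}):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class ViewVsCopySketch {
  public static void main(String[] args) {
    List<String> localDirs = new CopyOnWriteArrayList<>();
    localDirs.add("/grid/0/yarn/local");

    // Before YARN-9833: callers got an unmodifiable *view* that tracks changes.
    List<String> view = Collections.unmodifiableList(localDirs);
    // After YARN-9833: callers get a snapshot of the current contents.
    List<String> copy = new ArrayList<>(localDirs);

    // checkDirs() clears the list before rebuilding it...
    localDirs.clear();

    // ...so a caller iterating the view at this moment sees an empty list,
    // while a caller holding the copy still sees the good dirs.
    System.out.println("view: " + view);  // view: []
    System.out.println("copy: " + copy);  // copy: [/grid/0/yarn/local]
  }
}
{code}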



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Follow up changes for YARN-9833

2021-01-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266098#comment-17266098
 ] 

Jim Brennan commented on YARN-10562:


Given that the original bug exists in branch-2 as well, I think back-porting to 
branch-2 is a good idea in this case.

 

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: resourcemanager
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264327#comment-17264327
 ] 

Jim Brennan commented on YARN-4589:
---

[~epayne], I have attached a patch for branch-3.2.  I have also verified that 
it applies cleanly to branch-3.1.

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589-branch-3.2.001.patch, YARN-4589.004.patch, 
> YARN-4589.005.patch, YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-4589:
--
Attachment: YARN-4589-branch-3.2.001.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589-branch-3.2.001.patch, YARN-4589.004.patch, 
> YARN-4589.005.patch, YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264306#comment-17264306
 ] 

Jim Brennan commented on YARN-4589:
---

Thanks [~epayne]!  I will put up a patch for branch-3.2.

 

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-12 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-4589:
--
Attachment: YARN-4589.005.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263700#comment-17263700
 ] 

Jim Brennan commented on YARN-4589:
---

patch 005 removes the extra file.

 

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10562) Follow up changes for YARN-9833

2021-01-12 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Summary: Follow up changes for YARN-9833  (was: Alternate fix for 
DirectoryCollection.checkDirs() race)

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263696#comment-17263696
 ] 

Jim Brennan commented on YARN-10562:


Patch 004 replaces the {{CopyOnWriteArrayLists}} with {{ArrayLists}}.  It also 
fixes {{getErroredDirs()}} to use {{ImmutableList.copyOf()}} instead of 
{{Collections.unmodifiableList()}}.

One additional change I made was to change {{getFailedDirs()}} to use 
{{Collections.unmodifiableList()}} instead of {{ImmutableList.copyOf()}}.  
There is no need to make another copy in this case, because 
{{DirectoryCollection.concat()}} was already constructing a new list.
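
Roughly, the getter shapes described in this comment look like the following 
sketch (it omits the read/write locking and uses a plain JDK copy where the 
actual patch uses Guava's {{ImmutableList.copyOf()}}; class and field names 
are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DirectoryGettersSketch {
  private final List<String> errorDirs = new ArrayList<>();
  private final List<String> fullDirs = new ArrayList<>();

  private static List<String> concat(List<String> a, List<String> b) {
    List<String> result = new ArrayList<>(a.size() + b.size());
    result.addAll(a);
    result.addAll(b);
    return result;  // already a brand-new list
  }

  List<String> getFailedDirs() {
    // concat() built a fresh list, so wrapping it read-only adds no extra copy.
    return Collections.unmodifiableList(concat(errorDirs, fullDirs));
  }

  List<String> getErroredDirs() {
    // Defensive copy of a live internal list, so later changes are not visible.
    return new ArrayList<>(errorDirs);
  }
}
{code}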


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-12 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: YARN-10562.004.patch

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263691#comment-17263691
 ] 

Jim Brennan commented on YARN-4589:
---

I don't think I need to add a unit test for this, as it is only adding a log 
message.
I believe the other unit tests are unrelated, but I will double-check.
It looks like I included a stray file - I will remove it and put up another 
patch.


> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.2.patch, 
> YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263462#comment-17263462
 ] 

Jim Brennan commented on YARN-10562:


Thanks for the discussion and comment [~ebadger]!  I agree that probably the 
best approach is to remove the use of CopyOnWriteArrayList and stick with 
simple ArrayLists.  We can preserve the changes made in [YARN-9833] to return 
copies of the lists and fix that issue with {{getErroredDirs()}}.

I don't think the cost of the copies, even for every launch, is really much of 
a concern in the grand scope of things. My inclination is to make the 
minimum changes for this - that is, not rewrite checkDirs() as I've done here - 
it is not nearly as inefficient with ArrayLists as it was with 
CopyOnWriteArrayLists.

I will put up another patch with these changes.  We'll probably want to change 
the Summary as well to indicate this is just a follow-on to [YARN-9833], not an 
alternate solution.  If you'd prefer I file a new Jira instead, let me know.


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262933#comment-17262933
 ] 

Jim Brennan commented on YARN-4589:
---

Forgot to attach the patch.  Doh!


> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.2.patch, 
> YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-11 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-4589:
--
Attachment: YARN-4589.004.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.2.patch, 
> YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: YARN-10562.003.patch

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261403#comment-17261403
 ] 

Jim Brennan commented on YARN-10562:


patch 003 fixes the new checkstyle issues.


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260614#comment-17260614
 ] 

Jim Brennan commented on YARN-10562:


Submitted patch 002 to fix the checkstyle issues and add unit tests that 
[~ebadger] wrote when he was trying to verify this race condition.


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-07 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: YARN-10562.002.patch

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch
>
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2021-01-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260089#comment-17260089
 ] 

Jim Brennan commented on YARN-9833:
---

[~ebadger], [~pbacsko], I filed a new Jira so I could put up the alternate 
solution [YARN-10562].

If we decide not to go with that approach, then I think we should file a 
follow-up Jira to fix the errorDirs issue and, while we are at it, remove the 
use of CopyOnWriteArrayList for these, as it is pretty wasteful to use it now.


> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
>   this.readLock.lock();
>   try {
>     return Collections.unmodifiableList(localDirs);
>   } finally {
>     this.readLock.unlock();
>   }
> }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clear {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
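
For comparison with the snippet above, the copying variant of the getter (the 
fix described here) looks roughly like this, with the same elisions as the 
quoted snippet and assuming Guava's {{ImmutableList}} is imported:

{code:java}
// Same elisions as the snippet above; requires
// import com.google.common.collect.ImmutableList;
List<String> getGoodDirs() {
  this.readLock.lock();
  try {
    // Return a snapshot instead of a view, so a concurrent clear() inside
    // checkDirs() can no longer empty the list under a caller.
    return ImmutableList.copyOf(localDirs);
  } finally {
    this.readLock.unlock();
  }
}
{code}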



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-06 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: (was: YARN-9833.001.patch)

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-06 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10562:
--

 Summary: Alternate fix for DirectoryCollection.checkDirs() race
 Key: YARN-10562
 URL: https://issues.apache.org/jira/browse/YARN-10562
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
related methods were returning an unmodifiable view of the lists. These 
accesses were protected by read/write locks, but because the lists are 
CopyOnWriteArrayLists, subsequent changes to the list, even when done under the 
writelock, were exposed when a caller started iterating the list view. 
CopyOnWriteArrayLists cache the current underlying list in the iterator, so it 
is safe to iterate them even while they are being changed - at least the view 
will be consistent.

The problem was that checkDirs() was clearing the lists and rebuilding them 
from scratch every time, so if a caller called getGoodDirs() just before 
checkDirs cleared it, and then started iterating right after the clear, they 
could get an empty list.

The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
return a copy of the list, which definitely fixes the race condition. The 
disadvantage is that now we create a new copy of these lists every time we 
launch a container. The advantage of using CopyOnWriteArrayList was that the lists 
should rarely ever change, and we can avoid all the copying. Unfortunately, the 
way checkDirs() was written, it guaranteed that it would modify those lists 
multiple times every time.

So this Jira proposes an alternate solution for YARN-9833, which mainly just 
rewrites checkDirs() to minimize the changes to the underlying lists. There are 
still some small windows where a disk will have been added to one list, but not 
yet removed from another if you hit it just right, but I think these should be 
pretty rare and relatively harmless, and in the vast majority of cases I 
suspect only one disk will be moving from one list to another at any time.   
The question is whether this type of inconsistency (which was always there 
before YARN-9833) is an acceptable trade-off for avoiding all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254133#comment-17254133
 ] 

Jim Brennan commented on YARN-10540:


Thanks [~sunilg]!

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Screenshot 2020-12-23 at 
> 8.24.42 PM.png, YARN-10540.001.patch, Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are thrown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253560#comment-17253560
 ] 

Jim Brennan commented on YARN-10540:


Thanks [~ebadger]!  And thanks [~hexiaoqiao],  [~ayushtkn] and [~sunilg] for 
finding and investigating this bug.  Much appreciated.

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are thrown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10542) Node Utilization on UI is misleading if nodes don't report utilization

2020-12-21 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10542:
--

 Summary: Node Utilization on UI is misleading if nodes don't 
report utilization
 Key: YARN-10542
 URL: https://issues.apache.org/jira/browse/YARN-10542
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Jim Brennan
Assignee: Jim Brennan


As reported in YARN-10540, if the ResourceCalculatorPlugin fails to initialize, 
the nodes will report no utilization.  This makes the RM UI misleading, because 
it presents cluster-wide and per-node utilization as 0 instead of indicating 
that it is not being tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253117#comment-17253117
 ] 

Jim Brennan commented on YARN-10540:


I filed YARN-10542 as a follow-up.

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-10540:
--

Assignee: Jim Brennan

I have attached a patch that initializes nodeUtilization to a 
ResourceUtilization of all zeros instead of null.  This fixes the NPE.  In 
cases where the ResourceCalculatorPlugin fails to initialize (as on Mac), the 
UI will still be misleading, always showing 0 utilization.  I will file a 
follow-up Jira to make this better.
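
For illustration, a minimal sketch of the idea behind the patch (a simplified 
excerpt rather than the exact diff; the field lives in NodeResourceMonitorImpl):
{code:java}
// Sketch: start from an all-zero utilization instead of null, so NodeInfo
// and the RM web services never dereference a null value when the
// ResourceCalculatorPlugin is unavailable.
private ResourceUtilization nodeUtilization =
    ResourceUtilization.newInstance(0, 0, 0f);
{code}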

 

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10540:
---
Attachment: YARN-10540.001.patch

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253090#comment-17253090
 ] 

Jim Brennan commented on YARN-10540:


I have manually reproduced this in trunk on a VM by setting the 
ResourceCalculatorPlugin to null.

I am testing out this fix and will put up a patch shortly.  It would fix the 
NPE, but the UI would report zeros in this case.  My proposal would be to 
follow up with another Jira to improve the UI presentation.

 

 

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253022#comment-17253022
 ] 

Jim Brennan commented on YARN-10540:


A simpler fix might be to initialize {{nodeUtilization}} in 
NodeResourceMonitorImpl to {{ResourceUtilization.newInstance(0, 0, 0f)}}.

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253018#comment-17253018
 ] 

Jim Brennan commented on YARN-10540:


Just to clarify, we are seeing this only on Mac in branch-3.2.2?
Or are we seeing it on Mac on all branches?
Or on branch-3.2.2 on all platforms?

The ResourceCalculatorPlugin issue on Mac would explain it failing on Mac for 
all branches.  If it's failing for Mac only on branch-3.2.2, that is 
interesting.




> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253003#comment-17253003
 ] 

Jim Brennan commented on YARN-10540:


Trying to figure out why we might get an NPE in this case.  One option to avoid 
the NPE might be to change SchedulerNode.setNodeUtilization() to check for null 
and either set it to an empty utilization (which is what we initialize it to) 
or just ignore the update.
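
For illustration only, a rough sketch of that null-guard option (signature 
simplified; this is not the change that was ultimately committed):
{code:java}
// SchedulerNode (sketch): ignore null updates so the node keeps the
// default all-zero utilization it was initialized with.
public void setNodeUtilization(ResourceUtilization nodeUtilization) {
  if (nodeUtilization == null) {
    return;
  }
  this.nodeUtilization = nodeUtilization;
}
{code}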


> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249316#comment-17249316
 ] 

Jim Brennan commented on YARN-9833:
---

{quote}
My worry with this is that code changes in the future will incorrectly use 
getGoodDirs or the other methods that expose the private lists from within 
DirectoryCollection.
{quote}
Isn't this how it has been for years?  It was returning an unmodifiableList 
view of the underlying List, so that limits what the caller can do.  
getGoodDirs() and the others just return a read-only List; callers don't have 
to know about the internals.
If we are going to change these to return a copy of the list, we may want to 
reconsider the use of CopyOnWriteArrayList - I'm not sure it is buying us 
anything.
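
To make the view-versus-copy distinction concrete, a self-contained toy example 
(plain Java, not YARN code; the directory names are made up):
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ViewVsCopy {
  public static void main(String[] args) {
    List<String> dirs = new ArrayList<>(Arrays.asList("/grid/0", "/grid/1"));

    // A view is read-only, but it still reflects later mutations of 'dirs'.
    List<String> view = Collections.unmodifiableList(dirs);
    // A copy is a snapshot that later mutations of 'dirs' cannot change.
    List<String> copy = Collections.unmodifiableList(new ArrayList<>(dirs));

    dirs.clear();                    // simulates checkDirs() clearing localDirs
    System.out.println(view.size()); // 0 -- the caller suddenly sees no dirs
    System.out.println(copy.size()); // 2 -- the snapshot is unaffected
  }
}
{code}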


> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249043#comment-17249043
 ] 

Jim Brennan commented on YARN-9833:
---

Thinking about it more over the weekend, I suspect that the reason the 
CopyOnWriteArrayList was used was more for performance than to allow someone to 
hang onto the reference for a long time.  This is ideally a list that doesn't 
change very often, so handing out a view of a copy-on-write array is cheaper 
than making a copy every time we launch a container.  Unfortunately, 
{{checkDirs()}} as written seems to ruin any advantage we've gained by mutating 
the lists every time it runs (and multiple times at that, by first clearing and 
then adding each entry individually).  This is also where the race comes in.

My suggestion would be to fix the {{checkDirs()}} implementation to operate on 
local copies of these lists, and then update each one with a single assignment 
only if it has changed.
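
A rough sketch of that suggestion; the helper and structure are illustrative, 
not the actual DirectoryCollection code:
{code:java}
// Sketch only: build fresh lists off to the side, then publish each one with
// a single reference assignment, so readers never observe a cleared or
// half-rebuilt list.
void checkDirs() {
  List<String> newGoodDirs = new ArrayList<>();
  List<String> newErrorDirs = new ArrayList<>();
  List<String> newFullDirs = new ArrayList<>();
  classifyDirs(newGoodDirs, newErrorDirs, newFullDirs); // hypothetical helper

  this.writeLock.lock();
  try {
    if (!newGoodDirs.equals(localDirs)) {
      localDirs = Collections.unmodifiableList(newGoodDirs);
    }
    if (!newErrorDirs.equals(errorDirs)) {
      errorDirs = Collections.unmodifiableList(newErrorDirs);
    }
    if (!newFullDirs.equals(fullDirs)) {
      fullDirs = Collections.unmodifiableList(newFullDirs);
    }
  } finally {
    this.writeLock.unlock();
  }
}
{code}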

 

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)

2020-12-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248173#comment-17248173
 ] 

Jim Brennan commented on YARN-10494:


[~ccondit], [~ebadger] I am OK with including this for now. 

> CLI tool for docker-to-squashfs conversion (pure Java)
> --
>
> Key: YARN-10494
> URL: https://issues.apache.org/jira/browse/YARN-10494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10494.001.patch, 
> docker-to-squashfs-conversion-tool-design.pdf
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on 
> python2, multiple libraries, squashfs-tools and root access in order to 
> convert Docker images to squashfs images for use with the runc container 
> runtime in YARN.
> *YARN-9943* was created to investigate alternatives, as the response to 
> merging YARN-9564 has not been very positive. This proposal outlines the 
> design for a CLI conversion tool in 100% pure Java that will work out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


