[jira] [Commented] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386458#comment-17386458
 ] 

Jim Brennan commented on YARN-10855:


Thanks [~zhuqi]!


> yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
> -
>
> Key: YARN-10855
> URL: https://issues.apache.org/jira/browse/YARN-10855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2, 2.10.1, 3.4.0, 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10855.001.patch, YARN-10855.002.patch, 
> YARN-10855.003.patch
>
>
> When attempting to retrieve yarn logs via the CLI command, it failed with the 
> following stack trace (on branch-2.10):
> {noformat}
> yarn logs -applicationId application_1591017890475_1049740 > logs
> 20/06/05 19:15:50 INFO client.RMProxy: Connecting to ResourceManager 
> 20/06/05 19:15:51 INFO client.AHSProxy: Connecting to Application History 
> server 
> Exception in thread "main" java.io.EOFException: Cannot seek to negative 
> offset
>   at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1701)
>   at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
>   at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:624)
>   at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
>   at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:503)
>   at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:227)
>   at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:333)
>   at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:367)
> {noformat}
> The problem was that there was a zero-length TFile for one of the containers 
> in the application aggregated log directory in hdfs.  When we removed the 
> zero length file, {{yarn logs}} was able to retrieve the logs.
> A corrupt or zero length TFile for one container should not prevent loading 
> logs for the rest of the application.
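The fix approach described above can be sketched as per-file error isolation: catch the read failure for each container's file and continue, instead of letting one bad file abort the whole dump. This is an illustrative, self-contained sketch (the `LogDumpSketch`, `readOneLog`, and `dumpAllContainerLogs` names are invented for the example and byte arrays stand in for HDFS files; it is not the actual YARN patch):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LogDumpSketch {

    // Reads one "log file"; an empty file triggers EOFException, mimicking
    // the "Cannot seek to negative offset" failure on a zero-length TFile.
    static String readOneLog(byte[] data) throws IOException {
        if (data.length == 0) {
            throw new EOFException("Cannot seek to negative offset");
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[data.length];
            in.readFully(buf);
            return new String(buf);
        }
    }

    // Dumps all containers' logs, skipping (and reporting) unreadable files
    // instead of letting the first failure propagate to the caller.
    static List<String> dumpAllContainerLogs(Map<String, byte[]> files) {
        List<String> logs = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            try {
                logs.add(readOneLog(e.getValue()));
            } catch (IOException ex) {
                System.err.println("Skipping unreadable log " + e.getKey()
                    + ": " + ex.getMessage());
            }
        }
        return logs;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("container_01", "log A".getBytes());
        files.put("container_02", new byte[0]);      // zero-length file
        files.put("container_03", "log C".getBytes());
        // Two logs survive even though container_02's file is empty.
        System.out.println(dumpAllContainerLogs(files));
    }
}
```

With this shape, a corrupt or empty file for one container costs only a warning, and the remaining containers' logs are still returned.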



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382331#comment-17382331
 ] 

Jim Brennan commented on YARN-10855:


patch 003 fixes the checkstyle issues.
[~epayne] can you please review this?


> yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
> -
>
> Key: YARN-10855
> URL: https://issues.apache.org/jira/browse/YARN-10855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2, 2.10.1, 3.4.0, 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10855.001.patch, YARN-10855.002.patch, 
> YARN-10855.003.patch
>
>






[jira] [Updated] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10855:
---
Attachment: YARN-10855.003.patch

> yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
> -
>
> Key: YARN-10855
> URL: https://issues.apache.org/jira/browse/YARN-10855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2, 2.10.1, 3.4.0, 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10855.001.patch, YARN-10855.002.patch, 
> YARN-10855.003.patch
>
>






[jira] [Updated] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10855:
---
Attachment: YARN-10855.002.patch

> yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
> -
>
> Key: YARN-10855
> URL: https://issues.apache.org/jira/browse/YARN-10855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2, 2.10.1, 3.4.0, 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10855.001.patch, YARN-10855.002.patch
>
>






[jira] [Commented] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382114#comment-17382114
 ] 

Jim Brennan commented on YARN-10855:


Thanks for the review and the suggestion [~zhuqi]!  I will update the patch.

> yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
> -
>
> Key: YARN-10855
> URL: https://issues.apache.org/jira/browse/YARN-10855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2, 2.10.1, 3.4.0, 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10855.001.patch
>
>






[jira] [Updated] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-15 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10855:
---
Attachment: YARN-10855.001.patch

> yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
> -
>
> Key: YARN-10855
> URL: https://issues.apache.org/jira/browse/YARN-10855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2, 2.10.1, 3.4.0, 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10855.001.patch
>
>






[jira] [Created] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-15 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10855:
--

 Summary: yarn logs cli fails to retrieve logs if any TFile is 
corrupt or empty
 Key: YARN-10855
 URL: https://issues.apache.org/jira/browse/YARN-10855
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.3.1, 2.10.1, 3.2.2, 3.4.0
Reporter: Jim Brennan


When attempting to retrieve yarn logs via the CLI command, it failed with the 
following stack trace (on branch-2.10):
{noformat}
yarn logs -applicationId application_1591017890475_1049740 > logs
20/06/05 19:15:50 INFO client.RMProxy: Connecting to ResourceManager 
20/06/05 19:15:51 INFO client.AHSProxy: Connecting to Application History 
server 
Exception in thread "main" java.io.EOFException: Cannot seek to negative offset
at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1701)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:624)
at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:503)
at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:227)
at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:333)
at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:367)
{noformat}
The problem was that there was a zero-length TFile for one of the containers in 
the application aggregated log directory in hdfs.  When we removed the zero 
length file, {{yarn logs}} was able to retrieve the logs.

A corrupt or zero length TFile for one container should not prevent loading 
logs for the rest of the application.






[jira] [Assigned] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-15 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-10855:
--

Assignee: Jim Brennan

> yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
> -
>
> Key: YARN-10855
> URL: https://issues.apache.org/jira/browse/YARN-10855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2, 2.10.1, 3.4.0, 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>






[jira] [Commented] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry

2021-07-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380864#comment-17380864
 ] 

Jim Brennan commented on YARN-10456:


Thanks [~epayne]!   The patch looks good and it matches the change we have been 
running with internally.

I am +1 on this and I will commit tomorrow if there are no objections.

> RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics 
> registry
> -
>
> Key: YARN-10456
> URL: https://issues.apache.org/jira/browse/YARN-10456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.3.0, 3.2.1, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10456.001.patch
>
>
> Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped 
> working after we upgraded to 2.10.






[jira] [Assigned] (YARN-10542) Node Utilization on UI is misleading if nodes don't report utilization

2021-07-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-10542:
--

Assignee: (was: Jim Brennan)

> Node Utilization on UI is misleading if nodes don't report utilization
> --
>
> Key: YARN-10542
> URL: https://issues.apache.org/jira/browse/YARN-10542
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Priority: Major
>
> As reported in YARN-10540, if the ResourceCalculatorPlugin fails to 
> initialize, the nodes will report no utilization.  This makes the RM UI 
> misleading, because it presents cluster-wide and per node utilization as 0 
> instead of indicating that it is not being tracked.






[jira] [Updated] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-28 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10834:
---
Fix Version/s: 3.3.2
   3.2.3
   3.4.0

> Intra-queue preemption: apps that don't use defined custom resource won't be 
> preempted.
> ---
>
> Key: YARN-10834
> URL: https://issues.apache.org/jira/browse/YARN-10834
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
> Attachments: YARN-10834.001.patch
>
>
> YARN-8292 added handling of negative resources during the preemption 
> calculation phase. That JIRA hard-coded it so that for inter-(cross-)queue 
> preemption, a single resource in the vector could go negative while 
> calculating ideal assignments and preemptions. It also hard-coded it so that 
> during intra-(in-)queue preemption calculations, no resource could go 
> negative. YARN-10613 made these options configurable.
> However, in clusters where custom resources are defined, apps that don't use 
> the extended resource won't be preempted.
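The configurable behavior described above amounts to a componentwise vector subtraction that either may go negative (the inter-queue case) or is clamped at zero (the intra-queue case). A minimal sketch follows; the `ResourceVectorSketch` and `subtract` names and the plain `long[]` representation are invented for illustration and are not the actual CapacityScheduler code:

```java
import java.util.Arrays;

public class ResourceVectorSketch {

    // avail - used, componentwise; clamp each component at zero unless
    // negatives are allowed (the configurable knob discussed above).
    static long[] subtract(long[] avail, long[] used, boolean allowNegative) {
        long[] out = new long[avail.length];
        for (int i = 0; i < avail.length; i++) {
            long v = avail[i] - used[i];
            out[i] = (allowNegative || v >= 0) ? v : 0;
        }
        return out;
    }

    public static void main(String[] args) {
        // memory, vcores, a custom resource (e.g. GPUs)
        long[] avail = {8192, 4, 0};
        long[] used  = {4096, 2, 2};
        System.out.println(Arrays.toString(subtract(avail, used, true)));
        // prints [4096, 2, -2]
        System.out.println(Arrays.toString(subtract(avail, used, false)));
        // prints [4096, 2, 0]
    }
}
```

The reported symptom corresponds to the clamped (intra-queue) path: once the custom-resource component is pinned at zero, apps that do not use that resource never appear over their ideal share and so are never selected for preemption.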






[jira] [Commented] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370681#comment-17370681
 ] 

Jim Brennan commented on YARN-10834:


[~epayne], I have committed this to trunk - branch-3.2, but the patch does not 
apply to branch-2.10.  Can you provide a patch for 2.10?

 

> Intra-queue preemption: apps that don't use defined custom resource won't be 
> preempted.
> ---
>
> Key: YARN-10834
> URL: https://issues.apache.org/jira/browse/YARN-10834
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10834.001.patch
>
>






[jira] [Commented] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370641#comment-17370641
 ] 

Jim Brennan commented on YARN-10834:


Thanks for finding this and providing a fix!

+1 The patch looks good to me.

> Intra-queue preemption: apps that don't use defined custom resource won't be 
> preempted.
> ---
>
> Key: YARN-10834
> URL: https://issues.apache.org/jira/browse/YARN-10834
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10834.001.patch
>
>






[jira] [Commented] (YARN-10824) Title not set for JHS and NM webpages

2021-06-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365674#comment-17365674
 ] 

Jim Brennan commented on YARN-10824:


Good catch!  One comment on the code:
I'm not sure "{{About the Node}}" is a good title for the node page.  Maybe 
"{{Node Info}}"?

What do you think [~epayne]?

 

> Title not set for JHS and NM webpages
> -
>
> Key: YARN-10824
> URL: https://issues.apache.org/jira/browse/YARN-10824
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rajshree Mishra
>Assignee: Bilwa S T
>Priority: Major
> Attachments: JHS URL.jpg, NM URL.jpg, YARN-10824.001.patch
>
>
> The following issue was reported by one of our internal web security check 
> tools: 
> Passing a title to the jobHistoryServer(jhs) or Nodemanager(nm) pages using a 
> url similar to:
> [https://[hostname]:[jhs_port]/jobhistory/about?title=12345%27%22]
> or 
> [https://[hostname]:[nm_port]/node?title=12345]
> sets the page title to be set to the value mentioned.
> [Image attached]
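The issue above is reflected content: a user-supplied {{title}} query parameter is echoed into the page. One common remedy is to ignore the parameter or HTML-escape it before rendering. The sketch below illustrates the idea only; the `TitleSanitizerSketch`, `escapeHtml`, and `pageTitle` names are invented, and the escape helper is a minimal stand-in, not the Hadoop webapp framework's actual sanitizer:

```java
public class TitleSanitizerSketch {

    // Minimal HTML escaping of the five characters that matter for
    // element and attribute contexts.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Server-controlled fallback wins when nothing is supplied; any
    // caller-supplied title is escaped before it reaches the page.
    static String pageTitle(String requested, String fallback) {
        return (requested == null || requested.isEmpty())
            ? fallback : escapeHtml(requested);
    }

    public static void main(String[] args) {
        System.out.println(pageTitle("12345'\"<script>", "Node Info"));
        // prints 12345&#39;&quot;&lt;script&gt;
        System.out.println(pageTitle(null, "Node Info"));
        // prints Node Info
    }
}
```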






[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363205#comment-17363205
 ] 

Jim Brennan commented on YARN-10767:


Thanks for the update [~dmmkr]! I can see that you changed
{noformat}
  public static String findActiveRMHAId(YarnConfiguration conf) {
YarnConfiguration yarnConf = new YarnConfiguration(conf);
{noformat}
to
{noformat}
  public static String findActiveRMHAId(YarnConfiguration yarnConf) {
{noformat}
Effectively moving the construction of the temporary YarnConfiguration to the 
caller. I see that the other place where this method is called was already 
doing that, so in that sense this makes sense.

I am wondering about the change in behavior for findActiveRMHAId() though. 
Previously, it did not change the conf that was passed in - it made changes in 
a local copy. Now, it will modify the passed in conf whether it succeeds or 
fails, by setting RM_HA_ID.

That is why I suggested changing it to this:
{noformat}
  public static String findActiveRMHAId(Configuration conf) {
YarnConfiguration yarnConf = new YarnConfiguration(conf);
{noformat}
Then you can just use the conf you were passed in.

This does not make any functional difference with the current callers, but it 
could matter to future callers, if they assume findActiveRMHAId won't modify 
the passed in conf.

 

> Yarn Logs Command retrying on Standby RM for 30 times
> -
>
> Key: YARN-10767
> URL: https://issues.apache.org/jira/browse/YARN-10767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10767.001.patch, YARN-10767.002.patch, 
> YARN-10767.003.patch
>
>
> When ResourceManager HA is enabled and the first RM is unavailable, on 
> executing "bin/yarn logs -applicationId  -am 1", we get 
> ConnectionException for connecting to the first RM, the ConnectionException 
> Occurs for 30 times before it tries to connect to the second RM.
>  
> This can be optimized by trying to fetch the logs from the Active RM.






[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361211#comment-17361211
 ] 

Jim Brennan commented on YARN-10767:


[~dmmkr] this looks good, but we should fix the spotbugs issue.  Seems like you 
could either instantiate a new YarnConfiguration from conf, or we could change 
RMHAUtils.findActiveRMHAId() to take a Configuration instead of 
YarnConfiguration.  The first thing it does is make a YarnConfiguration from it 
anyway.


> Yarn Logs Command retrying on Standby RM for 30 times
> -
>
> Key: YARN-10767
> URL: https://issues.apache.org/jira/browse/YARN-10767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10767.001.patch, YARN-10767.002.patch
>
>






[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357388#comment-17357388
 ] 

Jim Brennan commented on YARN-10767:


Thanks for your responses [~dmmkr]!   Based on your reply, I am in favor of 
simplifying the function to just try the action on the active RM, and if it 
fails, throw an exception.




> Yarn Logs Command retrying on Standby RM for 30 times
> -
>
> Key: YARN-10767
> URL: https://issues.apache.org/jira/browse/YARN-10767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10767.001.patch
>
>






[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355955#comment-17355955
 ] 

Jim Brennan commented on YARN-10767:


[~dmmkr], [~BilwaST], I am not familiar with the RM HA code, so it would be 
better to have someone who has worked in this area take a look. 
[~prabhujoseph], [~pbacsko]?

My observations:

I agree with the need for a null check because findActiveRMHAId can return 
null.  In this case though, maybe we just throw an exception.  Maybe 
findActiveRMHAId should actually throw instead of returning null?

I believe findActiveRMHAId is going to contact each RM to see if it is active, 
so won't this have the same time-out issues?  Or is there a different retry 
policy in this case?   Have you tested this solution to verify it resolves the 
problem?

I wonder if using findActiveRMHAId allows us to simplify this?  If we've 
already determined the active RM, do we really need to loop through the others 
if we fail on the one we know to be active?
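The null-check-or-throw idea could be sketched as follows. The names here are illustrative stand-ins (the real lookup is RMHAUtils.findActiveRMHAId), not the actual Hadoop API:

```java
// Sketch of converting a null result from the active-RM lookup into an
// explicit failure instead of falling back to a loop over standby RMs.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ActiveRmSketch {
    // Stand-in for findActiveRMHAId(): returns null when no configured RM
    // reports itself as active, matching the behavior discussed above.
    static String findActiveRMHAId(List<String> rmIds, String activeId) {
        return rmIds.contains(activeId) ? activeId : null;
    }

    // Caller-side pattern: throw on null rather than propagating it.
    static String requireActiveRMHAId(List<String> rmIds, String activeId)
            throws IOException {
        String id = findActiveRMHAId(rmIds, activeId);
        if (id == null) {
            throw new IOException("Could not determine the active RM among " + rmIds);
        }
        return id;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(requireActiveRMHAId(Arrays.asList("rm1", "rm2"), "rm2"));  // rm2
    }
}
```

Whether the throw belongs inside findActiveRMHAId itself or at its caller is exactly the open question above.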


> Yarn Logs Command retrying on Standby RM for 30 times
> -
>
> Key: YARN-10767
> URL: https://issues.apache.org/jira/browse/YARN-10767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10767.001.patch
>
>
> When ResourceManager HA is enabled and the first RM is unavailable, on
> executing "bin/yarn logs -applicationId  -am 1", we get a
> ConnectionException when connecting to the first RM; the ConnectionException
> occurs 30 times before the client tries to connect to the second RM.
>  
> This can be optimized by trying to fetch the logs from the Active RM.






[jira] [Commented] (YARN-7713) Add parallel copying of directories into FSDownload

2021-05-27 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352688#comment-17352688
 ] 

Jim Brennan commented on YARN-7713:
---

I'm not convinced that this is a good idea at all, for several reasons:
 # On our clusters, we rarely ever actually download a directory for 
localization. Based on scanning nodemanager logs on our clusters, the vast 
majority of localized files are individual files or archives. I think in 
general (at least here at Yahoo) it is not recommended to localize directories, 
because there are issues with tracking them - in particular, if they have 
subdirectories, changes in the subdirs will not be noticed. Since localizing 
directories is so rare, I don't think this optimization is worth the added 
complexity (at least in our use cases).
 # I agree with others that just splitting up by file counts alone is probably 
not ideal. File sizes can vary wildly.
 # More threads for localization is not necessarily a good thing. We currently 
have a configurable number of threads for public localizers (defaults to 4), 
plus 1 per container for private localizers. Increasing the number of threads 
running at once increases pressure on the NameNode, and for rotational disks, 
it may actually slow things down locally as well by increasing IOPS. SSD/NVME 
disks could probably handle more simultaneous localizers.
 # I don't like that FSDownload is just firing up some number of threads for 
Directories. I would prefer that the threading be done at a higher level 
(callers of FSDownload).
 # I think a better approach for allowing more threads for localization would 
be to support parallel downloads in the private localizers, as suggested in 
YARN-574. Any solution needs to be configurable.
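Point 4 above (threading at the caller rather than inside FSDownload) could look roughly like this. ParallelLocalizer and download() are hypothetical stand-ins for illustration, not the Hadoop API:

```java
// Caller-level parallelism sketch: the caller owns the thread pool and
// submits one FSDownload-style task per resource, so the thread count is a
// single configurable knob instead of per-directory threads inside FSDownload.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLocalizer {
    // Stand-in for FSDownload.call(): localizes one resource.
    static String download(String resource) {
        return "localized:" + resource;
    }

    static List<String> localizeAll(List<String> resources, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String r : resources) {
                futures.add(pool.submit(() -> download(r)));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(localizeAll(Arrays.asList("a.jar", "b.zip"), 4));
    }
}
```

The pool size here corresponds to the configurable thread counts already used for the public localizers.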

> Add parallel copying of directories into FSDownload
> ---
>
> Key: YARN-7713
> URL: https://issues.apache.org/jira/browse/YARN-7713
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Miklos Szegedi
>Assignee: Christos Karampeazis-Papadakis
>Priority: Major
>  Labels: newbie, pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> YARN currently copies directories sequentially when localizing. This could be 
> improved to do in parallel, since the source blocks are normally on different 
> nodes.






[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.

2021-04-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335017#comment-17335017
 ] 

Jim Brennan commented on YARN-10738:


[~zhuqi], I am not very familiar with the multi-threaded scheduling code - we 
have not started using it yet.  So it would be very helpful if you could 
provide more details about what you are observing in your cluster, and how you 
think this will fix it.  Is your cluster made up of many nodes that are the 
same size, or do you have a mix of different sizes?  If you have any data that 
shows some nodes being more heavily utilized than others, that would be helpful.

Looking at  {{ResourceUsageMultiNodeLookupPolicy}}, it seems to sort by 
allocated resources to a node, so this seems to be trying to ensure we allocate 
more evenly across nodes.  It doesn't consider the relative sizes of the nodes 
though, so in a heterogeneous cluster, I could see it leading to smaller nodes 
being busier than larger nodes.   I wonder if a reverse sort by unallocated 
resources might be more fair, because it would favor nodes that have more room 
for new resource requests, rather than those that currently have fewer 
resources allocated.

Another option to consider would be to have a policy that uses node 
utilization, which should more accurately reflect how busy the node is.

With respect to the policy proposed in this ticket, I am not convinced it will 
help very much?  It's doing the same sort by allocated resources, but just 
adding a shuffle of every 10 nodes.  I'm not sure how much that will help in 
practice on a large cluster.  A rack is usually more than 10 nodes, so it's 
possible the same set of racks will be over-utilized.   Again, it would be 
helpful if you had some before/after data to show how it helps in a real 
cluster.
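The reverse sort by unallocated resources suggested above could be sketched with a comparator like this. Node is a hypothetical stand-in for the scheduler's node type, not the CapacityScheduler classes:

```java
// Sort nodes descending by unallocated (free) resources, so nodes with the
// most room for new requests come first, regardless of absolute node size.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NodeSortDemo {
    static final class Node {
        final String id;
        final long capacity;
        final long allocated;
        Node(String id, long capacity, long allocated) {
            this.id = id;
            this.capacity = capacity;
            this.allocated = allocated;
        }
        long unallocated() { return capacity - allocated; }
    }

    static List<Node> sortByFreeRoom(List<Node> nodes) {
        List<Node> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.<Node>comparingLong(Node::unallocated).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
            new Node("small", 32, 8),    // 24 free
            new Node("large", 128, 64),  // 64 free
            new Node("busy", 64, 60));   // 4 free
        System.out.println(sortByFreeRoom(nodes).get(0).id);  // large
    }
}
```

Note how the large node wins here even though it has the most allocated resources, which is the heterogeneous-cluster behavior discussed above.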


> When multi thread scheduling with multi node, we should shuffle with a gap to 
> prevent hot accessing nodes.
> --
>
> Key: YARN-10738
> URL: https://issues.apache.org/jira/browse/YARN-10738
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, multi-threaded scheduling over multiple nodes is not well balanced.
> In large clusters it causes hot-spot nodes, which can overload those nodes.
> Solution:
> I think we should shuffle the sorted node list (e.g. under the available
> resource sort policy) at an interval.
> This will solve the above problem and avoid hot-spot nodes.






[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.

2021-04-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330998#comment-17330998
 ] 

Jim Brennan commented on YARN-10743:


[~zhuqi] thanks for updating the patch.  I agree it would be good to file a 
jira on the lack of documentation for these policies.

I am +1 on patch 003.  I will commit later today.


> Add a policy for not aggregating for containers which are killed because 
> exceeding container log size limit.
> 
>
> Key: YARN-10743
> URL: https://issues.apache.org/jira/browse/YARN-10743
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10743.001.patch, YARN-10743.002.patch, 
> YARN-10743.003.patch, image-2021-04-20-10-41-01-057.png
>
>
> Since YARN-10471 added support for killing containers whose logs exceed the
> size limit, we should add a policy that skips aggregation for those
> containers, to reduce the pressure on HDFS etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]






[jira] [Commented] (YARN-9594) Fix missing break statement in ContainerScheduler#handle

2021-04-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330918#comment-17330918
 ] 

Jim Brennan commented on YARN-9594:
---

[~xiaoheipangzi] thanks for fixing this.  I took the liberty of pulling it back 
to branch-2.10.
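For reference, the fall-through that this fix addresses can be demonstrated minimally. This is a stand-in event handler, not the actual ContainerScheduler code:

```java
// Without a break, the RECOVERY_COMPLETED case falls through into the
// default branch and logs a spurious "unknown event" error.
public class FallThroughDemo {
    enum Event { RECOVERY_COMPLETED, OTHER }

    static String handle(Event e, boolean withBreak) {
        StringBuilder log = new StringBuilder();
        switch (e) {
            case RECOVERY_COMPLETED:
                log.append("started-pending;");  // the real work of the case
                if (withBreak) {
                    break;  // the fix from YARN-9594
                }
                // falls through without the break
            default:
                log.append("unknown-event;");    // spurious error path
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(handle(Event.RECOVERY_COMPLETED, false)); // started-pending;unknown-event;
        System.out.println(handle(Event.RECOVERY_COMPLETED, true));  // started-pending;
    }
}
```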

> Fix missing break statement in ContainerScheduler#handle
> 
>
> Key: YARN-9594
> URL: https://issues.apache.org/jira/browse/YARN-9594
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9594_1.patch
>
>
> It seems that we miss a break in switch-case
> {code:java}
> case RECOVERY_COMPLETED:
>   startPendingContainers(maxOppQueueLength <= 0);
>   metrics.setQueuedContainers(queuedOpportunisticContainers.size(),
>  queuedGuaranteedContainers.size());
> //break;missed
> default:
>   LOG.error("Unknown event arrived at ContainerScheduler: "
> + event.toString());
> {code}






[jira] [Updated] (YARN-9594) Fix missing break statement in ContainerScheduler#handle

2021-04-23 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9594:
--
Fix Version/s: 2.10.2

> Fix missing break statement in ContainerScheduler#handle
> 
>
> Key: YARN-9594
> URL: https://issues.apache.org/jira/browse/YARN-9594
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3, 2.10.2
>
> Attachments: YARN-9594_1.patch
>
>
> It seems that we miss a break in switch-case
> {code:java}
> case RECOVERY_COMPLETED:
>   startPendingContainers(maxOppQueueLength <= 0);
>   metrics.setQueuedContainers(queuedOpportunisticContainers.size(),
>  queuedGuaranteedContainers.size());
> //break;missed
> default:
>   LOG.error("Unknown event arrived at ContainerScheduler: "
> + event.toString());
> {code}






[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.

2021-04-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330834#comment-17330834
 ] 

Jim Brennan commented on YARN-10743:


Thanks for the patch [~zhuqi]!  The code looks good to me.  Can you please fix 
the checkstyle issues, and also document this policy in the comment in 
LogAggregationContext.  It doesn't look like these policies are documented 
anywhere else?


> Add a policy for not aggregating for containers which are killed because 
> exceeding container log size limit.
> 
>
> Key: YARN-10743
> URL: https://issues.apache.org/jira/browse/YARN-10743
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10743.001.patch, YARN-10743.002.patch, 
> image-2021-04-20-10-41-01-057.png
>
>
> Since YARN-10471 added support for killing containers whose logs exceed the
> size limit, we should add a policy that skips aggregation for those
> containers, to reduce the pressure on HDFS etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]






[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.

2021-04-20 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326052#comment-17326052
 ] 

Jim Brennan commented on YARN-10743:


[~ebadger] I was going to say the same thing.  I'm ok with this as an option, 
since it appears there are cases where it could be helpful. 



> Add a policy for not aggregating for containers which are killed because 
> exceeding container log size limit.
> 
>
> Key: YARN-10743
> URL: https://issues.apache.org/jira/browse/YARN-10743
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10743.001.patch, image-2021-04-20-10-41-01-057.png
>
>
> Since YARN-10471 added support for killing containers whose logs exceed the
> size limit, we should add a policy that skips aggregation for those
> containers, to reduce the pressure on HDFS etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]






[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325332#comment-17325332
 ] 

Jim Brennan commented on YARN-10460:


+1 on the branch-2.10 patch.


> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-2.10.002.patch, 
> YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  <--- This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the 

[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325278#comment-17325278
 ] 

Jim Brennan commented on YARN-10460:


+1 on the branch-3.2 patch.  Looks good to me.


> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  <--- This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in 

[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.

2021-04-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325102#comment-17325102
 ] 

Jim Brennan commented on YARN-10743:


I don't think this is necessary.  The logs may actually be useful in debugging 
why the job is logging so much.

> Add a policy for not aggregating for containers which are killed because 
> exceeding container log size limit.
> 
>
> Key: YARN-10743
> URL: https://issues.apache.org/jira/browse/YARN-10743
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10743.001.patch
>
>
> Since YARN-10471 added support for killing containers whose logs exceed the
> size limit, we should add a policy that skips aggregation for those
> containers, to reduce the pressure on HDFS etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]






[jira] [Resolved] (YARN-10733) TimelineService Hbase tests are failing with timeout error on branch-2.10

2021-04-14 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10733.

Fix Version/s: 2.10.2
   Resolution: Fixed

Thanks [~ahussein], I have committed this to branch-2.10.



> TimelineService Hbase tests are failing with timeout error on branch-2.10
> -
>
> Key: YARN-10733
> URL: https://issues.apache.org/jira/browse/YARN-10733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, timelineserver, yarn
>Affects Versions: 2.10.0
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.10.2
>
> Attachments: 2021-04-12T12-40-21_403-jvmRun1.dump, 
> 2021-04-12T12-40-58_857.dumpstream, 
> org.apache.hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction-output.txt.zip
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:bash}
> 03:54:41 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on 
> project hadoop-yarn-server-timelineservice-hbase-tests: There was a timeout 
> or other error in the fork -> [Help 1]
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] To see the full stack trace of the errors, re-run Maven with 
> the -e switch.
> 03:54:41 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> 03:54:41 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] After correcting the problems, you can resume the build with 
> the command
> 03:54:41 [ERROR]   mvn  -rf 
> :hadoop-yarn-server-timelineservice-hbase-tests
> {code}
> Failure of the tests is due to test unit 
> {{TestHBaseStorageFlowRunCompaction}} getting stuck.
> Upon checking the surefire reports, I found several ClassNotFoundExceptions.
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/CanUnbuffer
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.<init>(StoreFileInfo.java:66)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 33 more
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanUnbuffer
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 51 more
> {code}
> and 
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 10 more
> {code}



[jira] [Commented] (YARN-10733) TimelineService Hbase tests are failing with timeout error on branch-2.10

2021-04-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321237#comment-17321237
 ] 

Jim Brennan commented on YARN-10733:


[~ahussein] reached out to our Hbase team, and they did not have any concerns 
about this.


> TimelineService Hbase tests are failing with timeout error on branch-2.10
> -
>
> Key: YARN-10733
> URL: https://issues.apache.org/jira/browse/YARN-10733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, timelineserver, yarn
>Affects Versions: 2.10.0
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Attachments: 2021-04-12T12-40-21_403-jvmRun1.dump, 
> 2021-04-12T12-40-58_857.dumpstream, 
> org.apache.hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction-output.txt.zip
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:bash}
> 03:54:41 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on 
> project hadoop-yarn-server-timelineservice-hbase-tests: There was a timeout 
> or other error in the fork -> [Help 1]
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] To see the full stack trace of the errors, re-run Maven with 
> the -e switch.
> 03:54:41 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> 03:54:41 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] After correcting the problems, you can resume the build with 
> the command
> 03:54:41 [ERROR]   mvn  -rf 
> :hadoop-yarn-server-timelineservice-hbase-tests
> {code}
> Failure of the tests is due to test unit 
> {{TestHBaseStorageFlowRunCompaction}} getting stuck.
> Upon checking the surefire reports, I found several ClassNotFoundExceptions.
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/CanUnbuffer
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.<clinit>(StoreFileInfo.java:66)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 33 more
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanUnbuffer
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 51 more
> {code}
> and 
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 10 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (YARN-10733) TimelineService Hbase tests are failing with timeout error on branch-2.10

2021-04-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320278#comment-17320278
 ] 

Jim Brennan commented on YARN-10733:


It looks like the fix is to change hbase-compatible-hadoop.version to 2.7.0 
(from 2.5.1).
Seems OK, but I am not sure whether it will break anything.
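For reference, the change would presumably be a one-line version bump in the 
timelineservice-hbase-tests build; the property name comes from the comment 
above, but where it lives in the pom hierarchy is an assumption. The 
{{CanUnbuffer}} interface appears to have first shipped in Hadoop 2.7.0, which 
would explain why the 2.5.1 test classpath produces the NoClassDefFoundError 
in the stack traces above.

```xml
<!-- Hypothetical sketch of the proposed change; the exact pom location
     of this property is an assumption. -->
<properties>
  <!-- Hadoop version used by the HBase mini-cluster during tests.
       2.5.1 predates org.apache.hadoop.fs.CanUnbuffer. -->
  <hbase-compatible-hadoop.version>2.7.0</hbase-compatible-hadoop.version>
</properties>
```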


> TimelineService Hbase tests are failing with timeout error on branch-2.10
> -
>
> Key: YARN-10733
> URL: https://issues.apache.org/jira/browse/YARN-10733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, timelineserver, yarn
>Affects Versions: 2.10.0
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Attachments: 2021-04-12T12-40-21_403-jvmRun1.dump, 
> 2021-04-12T12-40-58_857.dumpstream, 
> org.apache.hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction-output.txt.zip
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:bash}
> 03:54:41 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on 
> project hadoop-yarn-server-timelineservice-hbase-tests: There was a timeout 
> or other error in the fork -> [Help 1]
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] To see the full stack trace of the errors, re-run Maven with 
> the -e switch.
> 03:54:41 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> 03:54:41 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] After correcting the problems, you can resume the build with 
> the command
> 03:54:41 [ERROR]   mvn  -rf 
> :hadoop-yarn-server-timelineservice-hbase-tests
> {code}
> Failure of the tests is due to the test unit 
> {{TestHBaseStorageFlowRunCompaction}} getting stuck.
> Upon checking the surefire reports, I found several ClassNotFoundExceptions.
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/CanUnbuffer
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.<clinit>(StoreFileInfo.java:66)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 33 more
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanUnbuffer
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 51 more
> {code}
> and 
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 10 more
> {code}




[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2021-04-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316489#comment-17316489
 ] 

Jim Brennan commented on YARN-10475:


[~chaosju] thanks for your comment.  The implementation we provided here is 
using overall cluster utilization vs node utilization to adjust the heartbeat 
so that under-utilized nodes get more scheduling opportunities.  Note that this 
feature was developed internally on branch-2 before the global scheduler was 
added.   It has worked well to help keep our nodes more evenly utilized. 

I think that other metrics for scaling the heartbeat are definitely worth 
exploring, which is why we filed [YARN-10478] to make it pluggable.  That would 
be a good place to make suggestions for alternate approaches.


> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.
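As a rough illustration of the mechanism described above, the heartbeat 
interval can be scaled by the ratio of node utilization to cluster-wide 
utilization and clamped to a range. This is only a hedged sketch: the class 
name, parameters, and clamping policy are invented here and the actual 
YARN-10475 implementation differs.

```java
// Hedged sketch of the scaling idea (not the actual YARN-10475 code):
// nodes busier than the cluster average heartbeat less often, idle
// nodes more often, clamped to [min, max].
public final class HeartbeatScaler {
    private final long baseIntervalMs;
    private final long minIntervalMs;
    private final long maxIntervalMs;

    public HeartbeatScaler(long baseIntervalMs, long minIntervalMs,
                           long maxIntervalMs) {
        this.baseIntervalMs = baseIntervalMs;
        this.minIntervalMs = minIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
    }

    /**
     * @param nodeCpu    this node's CPU utilization, 0.0-1.0
     * @param clusterCpu cluster-wide average CPU utilization, 0.0-1.0
     */
    public long nextIntervalMs(float nodeCpu, float clusterCpu) {
        if (clusterCpu <= 0.0f) {
            return baseIntervalMs;  // no cluster signal; use the default
        }
        // Over-utilized node => ratio > 1 => longer interval (slower heartbeats).
        float ratio = nodeCpu / clusterCpu;
        long scaled = (long) (baseIntervalMs * ratio);
        return Math.max(minIntervalMs, Math.min(maxIntervalMs, scaled));
    }

    public static void main(String[] args) {
        HeartbeatScaler s = new HeartbeatScaler(1000, 500, 2000);
        System.out.println(s.nextIntervalMs(0.9f, 0.3f)); // busy node: clamped to 2000
        System.out.println(s.nextIntervalMs(0.1f, 0.4f)); // idle node: clamped to 500
    }
}
```

The clamp keeps a badly skewed ratio from starving a node of scheduling 
opportunities entirely or flooding the RM with heartbeats.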






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316478#comment-17316478
 ] 

Jim Brennan commented on YARN-10702:


Thanks again [~ebadger]!  I put up additional patches for branch-3.2 and 
branch-3.1. 


> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.1.006.patch, 
> YARN-10702-branch-3.2.006.patch, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.
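The underlying measurement can be sketched with the JDK's {{ThreadMXBean}}, 
which exposes cumulative per-thread CPU time; sampling it periodically yields 
a busy percentage for a single thread such as the RM event dispatcher. The 
class below is a hedged, self-contained illustration, not the monitor that 
was actually committed.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hedged sketch of per-thread CPU sampling (not the actual YARN-10702
// monitor): ThreadMXBean reports cumulative CPU nanoseconds per thread
// id, so the delta between samples over wall time gives a busy fraction.
public final class ThreadCpuSampler {
    private final ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    private final long threadId;
    private long lastCpuNs;
    private long lastWallNs;

    public ThreadCpuSampler(long threadId) {
        this.threadId = threadId;
        this.lastCpuNs = mx.getThreadCpuTime(threadId);
        this.lastWallNs = System.nanoTime();
    }

    /** Returns the fraction of one CPU (roughly 0.0-1.0) used since the last call. */
    public double sample() {
        long cpuNs = mx.getThreadCpuTime(threadId);
        long wallNs = System.nanoTime();
        double busy = (double) (cpuNs - lastCpuNs) / (wallNs - lastWallNs);
        lastCpuNs = cpuNs;
        lastWallNs = wallNs;
        return busy;
    }

    public static void main(String[] args) {
        ThreadCpuSampler sampler =
            new ThreadCpuSampler(Thread.currentThread().getId());
        // Burn some CPU on this thread, then sample: the reported busy
        // fraction should be close to 1.0 for the spin period.
        long junk = 0;
        for (int i = 0; i < 50_000_000; i++) { junk += i; }
        System.out.printf("busy ~ %.2f (junk=%d)%n", sampler.sample(), junk);
    }
}
```

On most HotSpot JVMs thread CPU time measurement is supported and enabled by 
default; a production monitor would check {{isThreadCpuTimeSupported()}} and 
handle the -1 sentinel for dead threads.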






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-07 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702-branch-3.2.006.patch







[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-07 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702-branch-3.1.006.patch







[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315597#comment-17315597
 ] 

Jim Brennan commented on YARN-10702:


The failed unit test for branch-3.3 is unrelated.  Looks like it was fixed in 
[YARN-10337], which was only committed to trunk.


> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315196#comment-17315196
 ] 

Jim Brennan commented on YARN-10702:


Thanks [~ebadger]!  I have put up a patch for branch-3.3.







[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-05 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702-branch-3.3.006.patch







[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314912#comment-17314912
 ] 

Jim Brennan commented on YARN-10702:


Not going to fix these checkstyle issues, because the new variables match the 
pattern for the others.
{noformat}
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java:66:5: Variable 'rmEventProcCPUAvg' must be private and have accessor methods. [VisibilityModifier]
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java:68:5: Variable 'rmEventProcCPUMax' must be private and have accessor methods. [VisibilityModifier]
{noformat}

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, YARN-10702.006.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314651#comment-17314651
 ] 

Jim Brennan commented on YARN-10702:


patch 006 addresses the checkstyle/spotbug issues.







[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-04 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.006.patch







[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314086#comment-17314086
 ] 

Jim Brennan commented on YARN-10702:


Patch 005 adds a configuration property for this:
{noformat}
<property>
  <description>
    Resource manager dispatcher thread cpu monitor sampling rate.
    Units are samples per minute.  This controls how often to sample
    the cpu utilization of the resource manager dispatcher thread.
    The cpu utilization is displayed on the RM UI as scheduler busy %.
    Set this to zero to disable the dispatcher thread monitor.  Defaults
    to 60 samples per minute.
  </description>
  <name>yarn.dispatcher.cpu-monitor.samples-per-min</name>
  <value>60</value>
</property>
{noformat}
If it is disabled by setting this property to zero, the UI shows "N/A" for the 
Scheduler Busy value, to distinguish it from 0, which is a valid avg cpu usage 
for the thread on a lightly loaded cluster.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.005.patch







[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307114#comment-17307114
 ] 

Jim Brennan commented on YARN-10697:


Thanks for the update [~BilwaST]!  I am +1 on patch 003.  [~epayne], [~jhung], 
if there are no objections I will commit this later today.


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, YARN-10697.002.patch, 
> YARN-10697.003.patch, image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.






[jira] [Comment Edited] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306489#comment-17306489
 ] 

Jim Brennan edited comment on YARN-10697 at 3/22/21, 7:02 PM:
--

Thanks for the update [~BilwaST]!
(edited) patch 002 looks mostly good, but can you please rename getResources()? 
 There is already a public Resource.getResources(), and the two functions are 
completely different.
Maybe the private one should be called getFormattedString()?   The new public 
one could also be getFormattedString().




was (Author: jim_brennan):
Thanks for the update [~BilwaST]!  +1 patch 002 looks good to me.


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, YARN-10697.002.patch, 
> image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.






[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306489#comment-17306489
 ] 

Jim Brennan commented on YARN-10697:


Thanks for the update [~BilwaST]!  +1 patch 002 looks good to me.


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, YARN-10697.002.patch, 
> image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes.  Also, we should display memory in GB for better 
> readability for the user.






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305110#comment-17305110
 ] 

Jim Brennan commented on YARN-10702:


Thanks for the suggestions [~gandras]!  I agree this should be configurable.  I 
will put up a new patch with those changes.

I don't think the new thread has a significant impact.  I wasn't trying to 
measure that, but when I was looking at an RM recently where the dispatcher 
thread was very busy, the monitoring thread did not appear to be a significant 
factor; it was popping up as using less than 10% of a single CPU for brief 
periods of time, IIRC.  I'll have to take a closer look, but I think making the 
sampling rate configurable is a good idea.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304941#comment-17304941
 ] 

Jim Brennan commented on YARN-10697:


{quote}
So can we introduce a new method in Resource.java which can print it in 
MB|GB|TB?
{quote}
[~BilwaST] I think that is a good suggestion.  There are places where this 
format would be nice.
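A helper of the kind suggested above could render a memory size using the 
largest unit that keeps the value at least 1.  This is a hypothetical sketch 
({{MemoryFormat}} and {{humanReadableMB}} are invented names), not the method 
actually proposed for Resource.java:

```java
import java.util.Locale;

// Hypothetical helper: render a memory size given in MB as MB|GB|TB,
// choosing the largest unit that keeps the value >= 1.
public class MemoryFormat {
    private static final long K = 1024;

    public static String humanReadableMB(long mb) {
        if (mb >= K * K) {
            return String.format(Locale.ROOT, "%.1f TB", mb / (double) (K * K));
        } else if (mb >= K) {
            return String.format(Locale.ROOT, "%.1f GB", mb / (double) K);
        }
        return mb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(humanReadableMB(512));      // 512 MB
        System.out.println(humanReadableMB(8192));     // 8.0 GB
        System.out.println(humanReadableMB(3145728));  // 3.0 TB
    }
}
```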


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes. We should also display memory in GB for better 
> readability.






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.004.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304250#comment-17304250
 ] 

Jim Brennan commented on YARN-10702:


Jumped the gun.  Patch 004 has fixes for the other checkstyle issues.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304240#comment-17304240
 ] 

Jim Brennan commented on YARN-10702:


Thanks for the review [~zhuqi]!  Patch 003 fixes the method names as suggested.


> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-18 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.003.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303747#comment-17303747
 ] 

Jim Brennan commented on YARN-10697:


[~BilwaST] I agree about the bug in MetricsOverviewTable.render().  Unless I am 
misunderstanding, the else case is improperly using bytes where it should be 
using MB.  I am not sure about the change to Resource.toString() though.  That 
is used in a lot of places and I am not sure if all of those places would 
prefer the terser MB|GB|TB format.  [~epayne], [~jhung] what do you think?


> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes. We should also display memory in GB for better 
> readability.






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303724#comment-17303724
 ] 

Jim Brennan commented on YARN-10702:


Patch 002 is rebased onto current trunk.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.002.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: Scheduler-Busy.png

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: simon-scheduler-busy.png

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303697#comment-17303697
 ] 

Jim Brennan commented on YARN-10702:


Attaching some images of how this looks on the RM legacy UI and also the new 
metrics in simon.
 !Scheduler-Busy.png! 



> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10702:
---
Attachment: YARN-10702.001.patch

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10702.001.patch
>
>
> Add a cluster metric to track the cpu usage of the ResourceManager Event 
> Processing thread.   This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.






[jira] [Created] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10702:
--

 Summary: Add cluster metric for amount of CPU used by RM Event 
Processor
 Key: YARN-10702
 URL: https://issues.apache.org/jira/browse/YARN-10702
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


Add a cluster metric to track the cpu usage of the ResourceManager Event 
Processing thread.   This lets us know when the critical path of the RM is 
running out of headroom.
This feature was originally added for us internally by [~nroberts] and we've 
been running with it on production clusters for nearly four years.






[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300968#comment-17300968
 ] 

Jim Brennan commented on YARN-10588:


Yes.  I think it is ok to do it as a separate Jira.  We may actually want to 
keep {{isAllInvalidDivisor}} as is, and change {{isInvalidDivisor}} to only 
consider the countable resources.

> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch, YARN-10588.004.patch
>
>
> Steps to reproduce:
> Configure below property in resource-types.xml
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job
> In UI you can see % Of Queue and % Of Cluster is zero for the submitted 
> application
>  
> This is because in SchedulerApplicationAttempt has below check for 
> calculating queueUsagePerc and clusterUsagePerc
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true as gpu resource is 0






[jira] [Commented] (YARN-10687) Add option to disable/enable free disk space checking and percentage checking for full and not-full disks

2021-03-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300369#comment-17300369
 ] 

Jim Brennan commented on YARN-10687:


Thanks for the updates [~zhuqi]!  I am +1 on patch 004.   I will commit this 
later today.

> Add option to disable/enable free disk space checking and percentage checking 
> for full and not-full disks
> -
>
> Key: YARN-10687
> URL: https://issues.apache.org/jira/browse/YARN-10687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10687.001.patch, YARN-10687.002.patch, 
> YARN-10687.003.patch, YARN-10687.004.patch
>
>
> Currently the two options for the full/not-full disk check,
> max-disk-utilization-per-disk-percentage and
> min-free-space-per-disk-mb, are always enabled.  It would be more
> reasonable to let each be enabled or disabled individually, with both
> enabled by default.
>  
> In our clusters, when a disk is very large we want to use
> min-free-space-per-disk-mb, and when a disk is small we want to use
> max-disk-utilization-per-disk-percentage.
>  
> This would make the configuration clearer and less confusing.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras] 






[jira] [Commented] (YARN-10687) Add option to disable/enable free disk space checking and percentage checking for full and not-full disks

2021-03-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299827#comment-17299827
 ] 

Jim Brennan commented on YARN-10687:


Thanks for updating [~zhuqi]!  Can you please add a unit test to 
TestDirectoryCollection?  Also, did you see [~gandras]'s suggestion about 
changing the property names?  e.g., {{disk-utilization-percentage.enabled}}?

> Add option to disable/enable free disk space checking and percentage checking 
> for full and not-full disks
> -
>
> Key: YARN-10687
> URL: https://issues.apache.org/jira/browse/YARN-10687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10687.001.patch, YARN-10687.002.patch
>
>
> Currently the two options for the full/not-full disk check,
> max-disk-utilization-per-disk-percentage and
> min-free-space-per-disk-mb, are always enabled.  It would be more
> reasonable to let each be enabled or disabled individually, with both
> enabled by default.
>  
> In our clusters, when a disk is very large we want to use
> min-free-space-per-disk-mb, and when a disk is small we want to use
> max-disk-utilization-per-disk-percentage.
>  
> This would make the configuration clearer and less confusing.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras] 






[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU related metrics.

2021-03-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299638#comment-17299638
 ] 

Jim Brennan commented on YARN-10688:


[~zhuqi] we are very interested in this feature.  [~ebadger] can you take a 
look?


> ClusterMetrics should support GPU related metrics.
> --
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, resourcemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10688.001.patch, image-2021-03-11-15-35-49-625.png
>
>
> Currently ClusterMetrics only supports memory- and vcore-related metrics.
>  
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>  
>  
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster we added GPU support, so I think GPU-related metrics 
> should also be supported by ClusterMetrics.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  
>  






[jira] [Commented] (YARN-10687) Add option to disable/enable free disk space checking and percentage checking for full and not-full disks

2021-03-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299613#comment-17299613
 ] 

Jim Brennan commented on YARN-10687:


Thanks [~zhuqi]!  These are not strictly needed, because the values can be set 
to effectively enable/disable each threshold.  But I agree that having these 
makes the configuration options easier to understand.  In addition to the name 
changes recommended by [~gandras], we should also update the descriptions for 
the threshold properties to indicate that they only apply when the 
corresponding {{enabled}} property is true.
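The gating being discussed can be sketched as follows.  This is an 
illustrative sketch with invented names ({{DiskCheck}}, {{isDiskFull}}); it is 
not the patch's actual DirectoryCollection code, only the shape of the check: 
each threshold applies only when its corresponding enabled switch is true.

```java
// Illustrative sketch: each disk-fullness threshold is consulted only when
// its corresponding ".enabled" switch is on.
public class DiskCheck {
    public static boolean isDiskFull(
            float usedPercent, long freeSpaceMB,
            boolean percentageCheckEnabled, float maxUtilizationPercent,
            boolean freeSpaceCheckEnabled, long minFreeSpaceMB) {
        if (percentageCheckEnabled && usedPercent > maxUtilizationPercent) {
            return true;  // utilization threshold applies and is exceeded
        }
        if (freeSpaceCheckEnabled && freeSpaceMB < minFreeSpaceMB) {
            return true;  // free-space floor applies and is violated
        }
        return false;
    }

    public static void main(String[] args) {
        // Percentage check disabled: a 99%-full disk is not flagged.
        System.out.println(isDiskFull(99f, 50_000L, false, 90f, true, 1_000L));
        // Free-space check enabled and below the floor: flagged.
        System.out.println(isDiskFull(50f, 500L, false, 90f, true, 1_000L));
    }
}
```

With both switches defaulting to true, the existing behavior is preserved; a 
cluster with very large disks can turn off the percentage check and rely on 
the free-space floor alone, and vice versa.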


> Add option to disable/enable free disk space checking and percentage checking 
> for full and not-full disks
> -
>
> Key: YARN-10687
> URL: https://issues.apache.org/jira/browse/YARN-10687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10687.001.patch
>
>
> Currently the two options for the full/not-full disk check,
> max-disk-utilization-per-disk-percentage and
> min-free-space-per-disk-mb, are always enabled.  It would be more
> reasonable to let each be enabled or disabled individually, with both
> enabled by default.
>  
> In our clusters, when a disk is very large we want to use
> min-free-space-per-disk-mb, and when a disk is small we want to use
> max-disk-utilization-per-disk-percentage.
>  
> This would make the configuration clearer and less confusing.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras] 






[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298991#comment-17298991
 ] 

Jim Brennan commented on YARN-10588:


This looks good, but I have one question about 
{{DominantResourceCalculator.isAllInvalidDivisor()}}.
Looking at the {{divide()}} and {{ratio()}} methods for that class, I wonder if 
we should be looping over the first 
{{ResourceUtils.getNumberOfCountableResourceTypes()}} resource types instead of 
all of them.  It may be moot at this point, if all resource types are 
countable.  But if there are non-countable resource types, I think it would be 
incorrect as-is.
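The concern above can be sketched with a simplified model.  This is not the 
actual DominantResourceCalculator code; {{DivisorCheck}} and its array-based 
resource representation are invented for illustration.  The point is that the 
loop stops at the number of countable resource types rather than covering 
every resource:

```java
// Simplified model: a divisor made of resource values is "all invalid" only
// if every COUNTABLE resource value is zero. Non-countable trailing entries
// (e.g. an unused gpu type) are excluded from the loop.
public class DivisorCheck {
    public static boolean isAllInvalidDivisor(long[] resourceValues,
                                              int countableTypes) {
        int limit = Math.min(countableTypes, resourceValues.length);
        for (int i = 0; i < limit; i++) {
            if (resourceValues[i] != 0) {
                return false;  // at least one countable resource can divide
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // memory=8192, vcores=4, gpu=0; two countable types -> valid divisor.
        System.out.println(isAllInvalidDivisor(new long[]{8192, 4, 0}, 2));
        // All countable resources zero -> genuinely invalid divisor.
        System.out.println(isAllInvalidDivisor(new long[]{0, 0, 0}, 2));
    }
}
```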


> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch, YARN-10588.004.patch
>
>
> Steps to reproduce:
> Configure below property in resource-types.xml
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job
> In UI you can see % Of Queue and % Of Cluster is zero for the submitted 
> application
>  
> This is because in SchedulerApplicationAttempt has below check for 
> calculating queueUsagePerc and clusterUsagePerc
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true as gpu resource is 0






[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297701#comment-17297701
 ] 

Jim Brennan commented on YARN-10664:


Thanks for the reviews and the commits [~ebadger]!

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch, 
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside the container.
> We need this to specify different java gc options for java processes 
> running inside yarn containers, based on which version of java is being used.






[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297522#comment-17297522
 ] 

Jim Brennan commented on YARN-10664:


Thanks [~ebadger]!  I have put up a patch for branch-3.2.  I'm not sure it is 
worth trying to pull back further than that, because we start running into more 
conflicts with other changes that have not been pulled back.
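The expansion this issue describes, turning a configured {{\{\{VAR\}\}}} 
reference into {{$VAR}} so the shell expands it inside the container, can be 
sketched with a small regex pass.  This is a simplified stand-in with invented 
names ({{ParamExpansion}}, {{expand}}), not the patch's actual implementation:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: rewrite {{VAR}} references in a configured value to
// $VAR so they are expanded by the shell inside the container.
public class ParamExpansion {
    private static final Pattern VAR =
        Pattern.compile("\\{\\{([A-Za-z_][A-Za-z0-9_]*)\\}\\}");

    public static String expand(String value) {
        Matcher m = VAR.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps the literal '$' from being treated
            // as a regex group reference.
            m.appendReplacement(sb, Matcher.quoteReplacement("$" + m.group(1)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("-Djava.home={{JAVA_HOME}} -Xmx2g"));
        // prints: -Djava.home=$JAVA_HOME -Xmx2g
    }
}
```

Deferring expansion to the container's shell is what lets the same admin 
environment setting resolve differently per node, e.g. picking up whichever 
java version the container image provides.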


> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch, 
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside the container.
> We need this to specify different java gc options for java processes 
> running inside yarn containers, based on which version of java is being used.






[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664-branch-3.2.004.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch, 
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Commented] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2021-03-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296093#comment-17296093
 ] 

Jim Brennan commented on YARN-8786:
---

I am ok with closing it. 

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this: the 
> only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])
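If the root cause is indeed a race between containers creating the same shared application directories, the usual remedy is to treat "directory already exists" as success rather than failure. The real fix would live in the native container-executor's C code (checking for EEXIST after mkdir(2)); the Java sketch below, with hypothetical names, just illustrates the pattern:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RaceTolerantMkdir {
  /**
   * Create a directory, treating concurrent creation by another
   * container as success rather than failure.
   */
  public static void ensureDir(Path dir) throws IOException {
    try {
      Files.createDirectories(dir);
    } catch (IOException e) {
      // createDirectories already tolerates an existing directory, but a
      // same-instant race can still surface an exception on some platforms;
      // re-check before propagating so the loser of the race also succeeds.
      if (!Files.isDirectory(dir)) {
        throw e;
      }
    }
  }
}
```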






[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295623#comment-17295623
 ] 

Jim Brennan commented on YARN-10664:


Thanks for the review [~ebadger]!  I have put up patch 004 to address the 
concern about expanding the keystore/truststore keys.

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.004.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295589#comment-17295589
 ] 

Jim Brennan commented on YARN-10664:


It would have to be an environment variable that contains a variable denoted 
with the expansion characters, double curly braces.  Looking through 
{{sanitizeEnv()}}, I don't see anything that would potentially include those, 
other than {{NM_ADMIN_USER_ENV}} itself.  If someone were using those in an 
{{NM_ADMIN_USER_ENV}}-defined variable and depending on them not being 
expanded, that would be a problem.  But I don't think that is likely.

Looking back at the {{call()}} method, there might be a problem there.  There 
is code that adds {{KEYSTORE_PASSWORD_ENV_NAME}} to the environment.  If that 
value can contain those characters, it might cause an unwanted expansion.  We 
might need to move the setting of those truststore variables to after the call 
to {{expandEnvironment}}.
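The ordering concern can be sketched as follows: if expansion runs first and secret values are added afterwards, a password that happens to contain the expansion characters is never rewritten. All names below are hypothetical stand-ins for the real launch-environment code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LaunchEnvOrderSketch {
  /** Hypothetical stand-in for the real expansion step. */
  static void expandEnvironment(Map<String, String> env) {
    env.replaceAll((k, v) -> v.replace("{{", "$").replace("}}", ""));
  }

  /**
   * Order matters: add secret values (e.g. the keystore password) only
   * after expansion, so a password containing "{{" or "}}" passes
   * through untouched.
   */
  static Map<String, String> buildLaunchEnv(String keystorePassword) {
    Map<String, String> env = new LinkedHashMap<>();
    env.put("JAVA_GC_OPTS", "-Xloggc:{{LOG_DIR}}/gc.log");
    expandEnvironment(env);                          // 1. expand admin env
    env.put("KEYSTORE_PASSWORD", keystorePassword);  // 2. then add secrets
    return env;
  }
}
```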


> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Commented] (YARN-10665) TestContainerManagerRecovery sometimes fails

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295541#comment-17295541
 ] 

Jim Brennan commented on YARN-10665:


I am going to close this as invalid.  I have not been able to reproduce it 
since rebooting my Mac, and on further analysis, I believe the hard-coded value 
of 49160 was explicitly chosen to be beyond the start of the ephemeral port 
range on all platforms.


> TestContainerManagerRecovery sometimes fails
> 
>
> Key: YARN-10665
> URL: https://issues.apache.org/jira/browse/YARN-10665
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10665.001.patch
>
>
> TestContainerManagerRecovery sometimes fails when I run it on the Mac because 
> it cannot bind to a port.  I believe this is because it calls getPort with a 
> hard-coded port number (49160) instead of just passing zero.
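The conventional alternative to a hard-coded test port is to bind to port 0 and let the kernel pick a currently free one. A small illustrative helper (not the test's actual code; note there is still a small window where another process could grab the port after the socket closes):

```java
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPortSketch {
  /** Ask the OS for a currently free port instead of hard-coding one. */
  public static int freePort() throws IOException {
    try (ServerSocket s = new ServerSocket(0)) {
      s.setReuseAddress(true);
      // getLocalPort reports the port the kernel actually assigned.
      return s.getLocalPort();
    }
  }
}
```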






[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.003.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295518#comment-17295518
 ] 

Jim Brennan commented on YARN-10664:


Trying to put up patch 003, which is identical to patch 002, to see if it 
triggers precommit builds.


> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch, 
> YARN-10664.003.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Updated] (YARN-10665) TestContainerManagerRecovery sometimes fails

2021-03-03 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10665:
---
Summary: TestContainerManagerRecovery sometimes fails  (was: 
TestContainerManagerRecover sometimes fails)

> TestContainerManagerRecovery sometimes fails
> 
>
> Key: YARN-10665
> URL: https://issues.apache.org/jira/browse/YARN-10665
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
>
> TestContainerManagerRecovery sometimes fails when I run it on the Mac because 
> it cannot bind to a port.  I believe this is because it calls getPort with a 
> hard-coded port number (49160) instead of just passing zero.






[jira] [Commented] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294032#comment-17294032
 ] 

Jim Brennan commented on YARN-10664:


Thanks for the review [~ebadger]!  I put up patch 002 to remove the 
TestContainerManagerRecovery change.   I have filed [YARN-10665] to address 
that issue.

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Created] (YARN-10665) TestContainerManagerRecover sometimes fails

2021-03-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10665:
--

 Summary: TestContainerManagerRecover sometimes fails
 Key: YARN-10665
 URL: https://issues.apache.org/jira/browse/YARN-10665
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


TestContainerManagerRecovery sometimes fails when I run it on the Mac because 
it cannot bind to a port.  I believe this is because it calls getPort with a 
hard-coded port number (49160) instead of just passing zero.








[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.002.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch, YARN-10664.002.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10664:
---
Attachment: YARN-10664.001.patch

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10664.001.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
> expansion.  That is, you cannot specify an environment variable such as 
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside 
> the container.
> We have a need for this to specify different Java GC options for Java 
> processes running inside YARN containers, based on which version of Java is 
> being used.






[jira] [Created] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10664:
--

 Summary: Allow parameter expansion in NM_ADMIN_USER_ENV
 Key: YARN-10664
 URL: https://issues.apache.org/jira/browse/YARN-10664
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
expansion.  That is, you cannot specify an environment variable such as 
{{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside the 
container.

We have a need for this to specify different Java GC options for Java 
processes running inside YARN containers, based on which version of Java is 
being used.







[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291064#comment-17291064
 ] 

Jim Brennan commented on YARN-10613:


[~epayne], I have committed to trunk and branch-3.3, but the patch does not 
work for branch-3.2.  I can get it to apply, but then compilation fails.  Can 
you put up a patch for branch-3.2, and branch-3.1 if needed?

 

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-10613.branch-2.10.002.patch, 
> YARN-10613.trunk.001.patch, YARN-10613.trunk.002.patch
>
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:
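The proposed flag boils down to an all-vs-any comparison against the user limit. A hypothetical sketch (not the actual {{CapacitySchedulerPreemptionUtils}} code), using the memory/vcores example above:

```java
public class ConservativeDrfSketch {
  /**
   * Conservative DRF preempts only when EVERY resource dimension exceeds
   * the user limit; the aggressive variant preempts when ANY does.
   */
  static boolean shouldPreempt(long usedMemMB, long usedVcores,
                               long limitMemMB, long limitVcores,
                               boolean conservativeDRF) {
    boolean memOver = usedMemMB > limitMemMB;
    boolean coresOver = usedVcores > limitVcores;
    return conservativeDRF ? (memOver && coresOver) : (memOver || coresOver);
  }
}
```

With used <58GB, 58 vcores> against a limit of <30GB, 300 vcores>, only memory is over the limit, so the conservative check declines to preempt while the aggressive check would preempt.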






[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290520#comment-17290520
 ] 

Jim Brennan commented on YARN-10613:


Thanks for the update [~epayne].  This looks good to me.  +1 on patch 002.

I will wait for the pre-commit build to finish before committing this.

 

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-10613.trunk.001.patch, YARN-10613.trunk.002.patch
>
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:






[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290079#comment-17290079
 ] 

Jim Brennan commented on YARN-10613:


Thanks [~epayne]! The patch looks good to me except for two minor issues:
 # In ProportionalCapacityPreemptionPolicy, I think you need to add a {{"\n"}} 
between the new lines.
 # In the new inter-queue test, I think you should explicitly set the property 
to true instead of relying on 
{{DEFAULT_CROSS_QUEUE_PREEMPTION_CONSERVATIVE_DRF}} being true (line 237).  
The same applies to the first part of the intra-queue test: you should 
explicitly set the property to true - it's currently not set at all.

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-10613.trunk.001.patch
>
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:






[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289400#comment-17289400
 ] 

Jim Brennan commented on YARN-10613:


I think you should stick with INTRA_QUEUE_PREEMPTION_CONFIG_PREFIX instead of 
adding the new one.
We already have that redundancy for other configs.

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:






[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289325#comment-17289325
 ] 

Jim Brennan commented on YARN-10613:


I like your proposal for the more easily distinguished property names.


> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:
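A hedged sketch of what such a pair of properties might look like in capacity-scheduler.xml: the intra-queue name below is the one proposed earlier on this JIRA, while the inter-queue name is only an illustrative analog (not a committed name), with defaults matching the current hard-coded behavior:

```xml
<!-- Intra-queue preemption: currently hard-coded to conservative (true). -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
  <value>true</value>
</property>
<!-- Hypothetical inter-queue analog: currently hard-coded to false. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.inter-queue-preemption.conservative-drf</name>
  <value>false</value>
</property>
```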






[jira] [Commented] (YARN-10626) Log resource allocation in NM log at container start time

2021-02-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285299#comment-17285299
 ] 

Jim Brennan commented on YARN-10626:


+1. This looks good to me [~ebadger]!  I agree we don't need a unit test.  I 
will commit today.


> Log resource allocation in NM log at container start time
> -
>
> Key: YARN-10626
> URL: https://issues.apache.org/jira/browse/YARN-10626
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10626.001.patch, YARN-10626.002.patch
>
>
> As far as I can tell, there are no resource allocation logs in the NM log for 
> the various containers that are scheduled. These can be useful when trying to 
> debug what resources were requested vs what resources were actually 
> allocated. This is especially useful when debugging upstream technology 
> changes to make sure that they are correctly interpreting and passing down 
> resource parameters






[jira] [Resolved] (YARN-5853) TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently on Power

2021-02-11 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-5853.
---
Resolution: Duplicate

This is fixed by YARN-10500

> TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently 
> on Power
> --
>
> Key: YARN-5853
> URL: https://issues.apache.org/jira/browse/YARN-5853
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha1
> Environment: # uname -a
> Linux pts00452-vm10 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 
> 2015 ppc64le ppc64le ppc64le GNU/Linux
> # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: Yussuf Shaikh
>Priority: Major
>
> The test testRMRestartWithExpiredToken fails intermittently with the 
> following error:
> Stacktrace:
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken(TestDelegationTokenRenewer.java:1060)






[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently

2021-02-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283357#comment-17283357
 ] 

Jim Brennan commented on YARN-10500:


Thanks for the update [~iwasakims]!  I have committed to trunk and will 
cherry-pick to other branches.


> TestDelegationTokenRenewer fails intermittently
> ---
>
> Key: YARN-10500
> URL: https://issues.apache.org/jira/browse/YARN-10500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Masatake Iwasaki
>Priority: Major
>  Labels: flaky-test, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestDelegationTokenRenewer sometimes times out.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 83.675 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] 
> testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer)
>   Time elapsed: 30.065 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}






[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently

2021-02-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282670#comment-17282670
 ] 

Jim Brennan commented on YARN-10500:


Actually, I missed noticing that there were some checkstyle issues, 
[~iwasakims]. Can you please fix those? And while you are at it, there is an 
unneeded {{throws Exception}} on {{testShutdown()}}. Can you remove that as well?


> TestDelegationTokenRenewer fails intermittently
> ---
>
> Key: YARN-10500
> URL: https://issues.apache.org/jira/browse/YARN-10500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Masatake Iwasaki
>Priority: Major
>  Labels: flaky-test, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestDelegationTokenRenewer sometimes times out.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 83.675 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] 
> testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer)
>   Time elapsed: 30.065 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}






[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently

2021-02-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282664#comment-17282664
 ] 

Jim Brennan commented on YARN-10500:


+1 Thanks for fixing this [~iwasakims]!   I will commit shortly.


> TestDelegationTokenRenewer fails intermittently
> ---
>
> Key: YARN-10500
> URL: https://issues.apache.org/jira/browse/YARN-10500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Masatake Iwasaki
>Priority: Major
>  Labels: flaky-test, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestDelegationTokenRenewer sometimes times out.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 83.675 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] 
> testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer)
>   Time elapsed: 30.065 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}






[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-02-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279997#comment-17279997
 ] 

Jim Brennan commented on YARN-10588:


Thanks for reporting this and putting up the patch, [~BilwaST]! I wonder if a 
better fix would be to change {{DominantResourceCalculator.isInvalidDivisor()}} 
to be consistent with {{DominantResourceCalculator.divide()}}. Currently it 
returns true if any resource is zero, while {{divide}} only returns zero if all 
of the countable resources are zero.

[~epayne] what do you think?


> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch
>
>
> Steps to reproduce:
> Configure below property in resource-types.xml
> {code:java}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job
> In the UI you can see that % Of Queue and % Of Cluster are zero for the 
> submitted application.
>  
> This is because {{SchedulerApplicationAttempt}} has the below check for 
> calculating queueUsagePerc and clusterUsagePerc:
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true as gpu resource is 0
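The suggested change can be sketched standalone (resource vectors reduced to plain long arrays; this is not the actual DominantResourceCalculator code): {{isInvalidDivisor()}} would report an invalid divisor only when all countable resources are zero, which matches when {{divide()}} actually returns zero:

```java
// Standalone sketch of the suggested semantics; the real logic lives in
// org.apache.hadoop.util.resource.DominantResourceCalculator.
public class DivisorCheck {

    // Current behavior (simplified): the divisor is invalid if ANY resource is zero.
    static boolean isInvalidDivisorAny(long[] resources) {
        for (long r : resources) {
            if (r == 0) {
                return true;
            }
        }
        return false;
    }

    // Suggested behavior: invalid only if ALL countable resources are zero,
    // matching divide(), which only returns zero in that case.
    static boolean isInvalidDivisorAll(long[] resources) {
        for (long r : resources) {
            if (r != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // <memoryMB, vcores, gpu> for a cluster with GPU configured but none present.
        long[] clusterWithZeroGpu = {100L * 1024, 200L, 0L};
        // Today the zero GPU count marks the whole cluster as an invalid divisor...
        System.out.println(isInvalidDivisorAny(clusterWithZeroGpu));  // true
        // ...while the suggested check would still allow the memory/vcore division.
        System.out.println(isInvalidDivisorAll(clusterWithZeroGpu)); // false
    }
}
```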






[jira] [Commented] (YARN-10607) User environment is unable to prepend PATH when mapreduce.admin.user.env also sets PATH

2021-02-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279810#comment-17279810
 ] 

Jim Brennan commented on YARN-10607:


Thanks for the updates [~ebadger].  +1 This looks good to me.
I will commit today.

> User environment is unable to prepend PATH when mapreduce.admin.user.env also 
> sets PATH
> ---
>
> Key: YARN-10607
> URL: https://issues.apache.org/jira/browse/YARN-10607
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10607.001.patch, YARN-10607.002.patch, 
> YARN-10607.003.patch, YARN-10607.004.patch, YARN-10607.004.patch
>
>
> When using the tarball approach to ship relevant Hadoop jars to containers, 
> it is helpful to set {{mapreduce.admin.user.env}} to something like 
> {{PATH=./hadoop-tarball:\{\{PATH\}\}}} to make sure that all of the Hadoop 
> binaries are on the PATH. This way you can call {{hadoop}} instead of 
> {{./hadoop-tarball/hadoop}}. The intention here is to force prepend 
> {{./hadoop-tarball}} and then append the set {{PATH}} afterwards. But if a 
> user would like to override the appended portion of {{PATH}} in their 
> environment, they are unable to do so. This is because {{PATH}} ends up 
> getting parsed twice. Initially it is set via {{mapreduce.admin.user.env}} to 
> {{PATH=./hadoop-tarball:$SYS_PATH}}. In this case {{SYS_PATH}} is what I'll 
> refer to as the normal system path. E.g. {{/usr/local/bin:/usr/bin}}, etc.
> After this, the user env parsing happens. For example, let's say the user 
> sets their {{PATH}} to {{PATH=.:$PATH}}. We have already parsed {{PATH}} from 
> the admin.user.env. Then we go to parse the user environment and find the 
> user also specified {{PATH}}. So {{$PATH}} ends up getting expanded 
> to {{./hadoop-tarball:$SYS_PATH}}, which leads to the user's {{PATH}} being 
> {{PATH=.:./hadoop-tarball:$SYS_PATH}}. We then append this to {{PATH}}, which 
> has already been set in the environment map via the admin.user.env. So we 
> finally end up with 
> {{PATH=./hadoop-tarball:$SYS_PATH:.:./hadoop-tarball:$SYS_PATH}}. 
> This normally isn't a huge deal, but if you want to ship a version of 
> python/perl/etc. that clashes with the one that is already there in 
> {{SYS_PATH}}, you will need to refer to it by its full path, since in the 
> above example {{.}} doesn't appear until after {{$SYS_PATH}}. This is a pain, 
> and it should be possible to prepend to the user's {{PATH}} to override the 
> system/container {{SYS_PATH}}, even when also forcefully prepending to 
> {{PATH}} with your hadoop tarball.
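The double parse can be illustrated with a toy expander (purely illustrative; the method below is a stand-in, not the actual NM environment-parsing code): the admin env is expanded against the system PATH first, then the user's PATH is expanded against that already-rewritten value and appended to it:

```java
public class PathExpansionSketch {
    // Toy expansion: replace the literal token "$PATH" with the current value.
    static String expand(String spec, String currentPath) {
        return spec.replace("$PATH", currentPath);
    }

    public static void main(String[] args) {
        String sysPath = "/usr/local/bin:/usr/bin";

        // Pass 1: mapreduce.admin.user.env sets PATH=./hadoop-tarball:$PATH,
        // expanded against the system PATH.
        String afterAdmin = expand("./hadoop-tarball:$PATH", sysPath);
        System.out.println(afterAdmin); // ./hadoop-tarball:/usr/local/bin:/usr/bin

        // Pass 2: the user's PATH=.:$PATH is expanded against the ADMIN result,
        // then appended to it, duplicating the tarball and system entries.
        String userExpanded = expand(".:$PATH", afterAdmin);
        String finalPath = afterAdmin + ":" + userExpanded;
        System.out.println(finalPath);
        // ./hadoop-tarball:/usr/local/bin:/usr/bin:.:./hadoop-tarball:/usr/local/bin:/usr/bin
    }
}
```

Note how {{.}} only shows up after the first copy of the system path, so it cannot shadow anything already on it.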






[jira] [Commented] (YARN-10607) User environment is unable to prepend PATH when mapreduce.admin.user.env also sets PATH

2021-02-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279200#comment-17279200
 ] 

Jim Brennan commented on YARN-10607:


Thanks [~ebadger]. This looks good overall, but I have one comment. The unit 
test you added only runs on Windows, which I think is correct, but the code in 
ContainerLaunch.sanitizeEnv() runs on Windows as well. I'm not sure this 
feature really makes sense on Windows, and in particular this line is certainly 
not correct for Windows:
{noformat}
Apps.addToEnvironment(environment, Environment.PATH.name(),
"$PATH", File.pathSeparator);
{noformat}
My suggestion is to put the force-path code inside a {{!Shell.WINDOWS}} check.
We might want to update the documentation as well to note that it is ignored on 
Windows.

> User environment is unable to prepend PATH when mapreduce.admin.user.env also 
> sets PATH
> ---
>
> Key: YARN-10607
> URL: https://issues.apache.org/jira/browse/YARN-10607
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10607.001.patch, YARN-10607.002.patch, 
> YARN-10607.003.patch
>
>
> When using the tarball approach to ship relevant Hadoop jars to containers, 
> it is helpful to set {{mapreduce.admin.user.env}} to something like 
> {{PATH=./hadoop-tarball:\{\{PATH\}\}}} to make sure that all of the Hadoop 
> binaries are on the PATH. This way you can call {{hadoop}} instead of 
> {{./hadoop-tarball/hadoop}}. The intention here is to force prepend 
> {{./hadoop-tarball}} and then append the set {{PATH}} afterwards. But if a 
> user would like to override the appended portion of {{PATH}} in their 
> environment, they are unable to do so. This is because {{PATH}} ends up 
> getting parsed twice. Initially it is set via {{mapreduce.admin.user.env}} to 
> {{PATH=./hadoop-tarball:$SYS_PATH}}. In this case {{SYS_PATH}} is what I'll 
> refer to as the normal system path. E.g. {{/usr/local/bin:/usr/bin}}, etc.
> After this, the user env parsing happens. For example, let's say the user 
> sets their {{PATH}} to {{PATH=.:$PATH}}. We have already parsed {{PATH}} from 
> the admin.user.env. Then we go to parse the user environment and find the 
> user also specified {{PATH}}. So {{$PATH}} ends up getting expanded 
> to {{./hadoop-tarball:$SYS_PATH}}, which leads to the user's {{PATH}} being 
> {{PATH=.:./hadoop-tarball:$SYS_PATH}}. We then append this to {{PATH}}, which 
> has already been set in the environment map via the admin.user.env. So we 
> finally end up with 
> {{PATH=./hadoop-tarball:$SYS_PATH:.:./hadoop-tarball:$SYS_PATH}}. 
> This normally isn't a huge deal, but if you want to ship a version of 
> python/perl/etc. that clashes with the one that is already there in 
> {{SYS_PATH}}, you will need to refer to it by its full path, since in the 
> above example {{.}} doesn't appear until after {{$SYS_PATH}}. This is a pain, 
> and it should be possible to prepend to the user's {{PATH}} to override the 
> system/container {{SYS_PATH}}, even when also forcefully prepending to 
> {{PATH}} with your hadoop tarball.






[jira] [Commented] (YARN-10607) User environment is unable to prepend PATH when mapreduce.admin.user.env also sets PATH

2021-02-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279016#comment-17279016
 ] 

Jim Brennan commented on YARN-10607:


Thanks [~ebadger]!  Patch 002 seems to have an extraneous change to 
TestCapacitySchedulerMultiNodes.java.  Any idea where that came from?


> User environment is unable to prepend PATH when mapreduce.admin.user.env also 
> sets PATH
> ---
>
> Key: YARN-10607
> URL: https://issues.apache.org/jira/browse/YARN-10607
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10607.001.patch, YARN-10607.002.patch
>
>
> When using the tarball approach to ship relevant Hadoop jars to containers, 
> it is helpful to set {{mapreduce.admin.user.env}} to something like 
> {{PATH=./hadoop-tarball:\{\{PATH\}\}}} to make sure that all of the Hadoop 
> binaries are on the PATH. This way you can call {{hadoop}} instead of 
> {{./hadoop-tarball/hadoop}}. The intention here is to force prepend 
> {{./hadoop-tarball}} and then append the set {{PATH}} afterwards. But if a 
> user would like to override the appended portion of {{PATH}} in their 
> environment, they are unable to do so. This is because {{PATH}} ends up 
> getting parsed twice. Initially it is set via {{mapreduce.admin.user.env}} to 
> {{PATH=./hadoop-tarball:$SYS_PATH}}. In this case {{SYS_PATH}} is what I'll 
> refer to as the normal system path. E.g. {{/usr/local/bin:/usr/bin}}, etc.
> After this, the user env parsing happens. For example, let's say the user 
> sets their {{PATH}} to {{PATH=.:$PATH}}. We have already parsed {{PATH}} from 
> the admin.user.env. Then we go to parse the user environment and find the 
> user also specified {{PATH}}. So {{$PATH}} ends up getting expanded 
> to {{./hadoop-tarball:$SYS_PATH}}, which leads to the user's {{PATH}} being 
> {{PATH=.:./hadoop-tarball:$SYS_PATH}}. We then append this to {{PATH}}, which 
> has already been set in the environment map via the admin.user.env. So we 
> finally end up with 
> {{PATH=./hadoop-tarball:$SYS_PATH:.:./hadoop-tarball:$SYS_PATH}}. 
> This normally isn't a huge deal, but if you want to ship a version of 
> python/perl/etc. that clashes with the one that is already there in 
> {{SYS_PATH}}, you will need to refer to it by its full path, since in the 
> above example {{.}} doesn't appear until after {{$SYS_PATH}}. This is a pain, 
> and it should be possible to prepend to the user's {{PATH}} to override the 
> system/container {{SYS_PATH}}, even when also forcefully prepending to 
> {{PATH}} with your hadoop tarball.






[jira] [Commented] (YARN-10613) Config to allow Intra-queue preemption to enable/disable conservativeDRF

2021-02-03 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278349#comment-17278349
 ] 

Jim Brennan commented on YARN-10613:


[~epayne] any reason we shouldn't add a property for inter-queue-preemption as 
well, so that both are configurable?

> Config to allow Intra-queue preemption to  enable/disable conservativeDRF
> -
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of the following config property:
> {code:xml}
> <property>
>   <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
>   <value>true</value>
> </property>
> {code}






[jira] [Commented] (YARN-10562) Follow up changes for YARN-9833

2021-01-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267945#comment-17267945
 ] 

Jim Brennan commented on YARN-10562:


Thanks [~ebadger]!

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: resourcemanager
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.
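The snapshot semantics described above can be demonstrated with a small self-contained sketch (names are illustrative, not the actual DirectoryCollection code): an iterator obtained before the clear keeps seeing the old contents, while a view handed out earlier reflects the clear-and-rebuild, so a caller can observe the empty intermediate state.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch only -- not the actual DirectoryCollection code.
public class CowRaceSketch {

    // Returns {elements seen by a pre-clear iterator, view size after clear}.
    static int[] demo() {
        CopyOnWriteArrayList<String> goodDirs = new CopyOnWriteArrayList<>();
        goodDirs.add("/grid/0");
        goodDirs.add("/grid/1");

        // What getGoodDirs() used to hand out: an unmodifiable VIEW of the live list.
        List<String> view = Collections.unmodifiableList(goodDirs);

        // An iterator taken now is pinned to the current array snapshot.
        Iterator<String> preClear = goodDirs.iterator();

        // checkDirs() clearing the list before rebuilding it from scratch.
        goodDirs.clear();

        int seen = 0;
        while (preClear.hasNext()) { preClear.next(); seen++; }
        return new int[] { seen, view.size() };
    }

    public static void main(String[] args) {
        int[] r = demo();
        // The pre-clear iterator still sees both dirs (iteration stays safe)...
        System.out.println(r[0]); // 2
        // ...but a caller who got the view earlier and starts iterating after
        // the clear observes the empty intermediate state.
        System.out.println(r[1]); // 0
    }
}
```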






[jira] [Commented] (YARN-10562) Follow up changes for YARN-9833

2021-01-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266098#comment-17266098
 ] 

Jim Brennan commented on YARN-10562:


Given that the original bug exists in branch-2 as well, I think back-porting to 
branch-2 is a good idea in this case.

 

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: resourcemanager
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time.   The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is acceptable in exchange for eliminating all the copying.
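The snapshot semantics described above can be sketched in a few lines: a CopyOnWriteArrayList iterator captures the backing array at creation time, so a concurrent clear-and-rebuild (as checkDirs() does) is invisible to an iteration already in flight. This is an illustrative standalone sketch, not code from DirectoryCollection; the names (goodDirs, the example paths) are hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CowSnapshotDemo {
    public static void main(String[] args) {
        List<String> goodDirs = new CopyOnWriteArrayList<>();
        goodDirs.add("/disk1");
        goodDirs.add("/disk2");

        // The iterator captures the array backing the list at creation time.
        Iterator<String> it = goodDirs.iterator();

        // Simulate checkDirs() clearing and rebuilding the list concurrently.
        goodDirs.clear();
        goodDirs.add("/disk3");

        // The iterator still sees the original two entries: a consistent view.
        int seen = 0;
        while (it.hasNext()) {
            it.next();
            seen++;
        }
        System.out.println("iterator saw " + seen + " entries; list now has "
                + goodDirs.size());
        // prints: iterator saw 2 entries; list now has 1
    }
}
```

The race reported in YARN-9833 arises when a caller obtains the list (or a view of it) but has not yet created the iterator when the clear happens: an iterator created at that instant snapshots an empty array. Minimizing modifications in checkDirs(), as proposed here, shrinks that window without copying the list on every container launch.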



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264327#comment-17264327
 ] 

Jim Brennan commented on YARN-4589:
---

[~epayne], I have attached a patch for branch-3.2.  I have also verified that 
it applies cleanly to branch-3.1.

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589-branch-3.2.001.patch, YARN-4589.004.patch, 
> YARN-4589.005.patch, YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.




