[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Ahuja updated YARN-10207:
-----------------------------------
    Attachment:     (was: YARN-10063.004.patch)

> CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated
> logs on the JobHistoryServer Web UI
> -----------------------------------------------------------------------------
>
>                 Key: YARN-10207
>                 URL: https://issues.apache.org/jira/browse/YARN-10207
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Siddharth Ahuja
>            Assignee: Siddharth Ahuja
>            Priority: Major
>         Attachments: YARN-10207.001.patch, YARN-10207.002.patch,
>                      YARN-10207.003.patch, YARN-10207.004.patch
>
>
> File descriptor leaks are observed coming from the JobHistoryServer process
> while it tries to render a "corrupted" aggregated log on the JHS Web UI.
> The issue was reproduced using the following steps:
> # Ran a sample Hadoop MR Pi job; it had the id
> application_1582676649923_0026.
> # Copied an aggregated log file from HDFS to the local FS:
> {code}
> hdfs dfs -get /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Updated the TFile metadata at the bottom of this file with some junk to
> corrupt the file:
> *Before:*
> {code}
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP
> {code}
> *After:*
> {code}
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah
> {code}
> Notice "blah" (junk) added at the very end.
> # Removed the existing aggregated log file, which needs to be replaced by
> our modified copy from step 3 (otherwise HDFS will refuse to place a file
> with the same name as one that already exists):
> {code}
> hdfs dfs -rm -r -f /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Uploaded the corrupted aggregated file back to HDFS:
> {code}
> hdfs dfs -put _8041 /tmp/logs/systest/logs/application_1582676649923_0026
> {code}
> # Visited the HistoryServer Web UI.
> # Clicked on job_1582676649923_0026.
> # Clicked on the "logs" link against the AM (assuming the AM ran on nm_hostname).
> # Reviewed the JHS logs; the following exception will be seen:
> {code}
> 2020-03-24 20:03:48,484 ERROR org.apache.hadoop.yarn.webapp.View: Error getting logs for job_1582676649923_0026
> java.io.IOException: Not a valid BCFile.
>         at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
>         at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
>         at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
>         at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:588)
>         at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.TFileAggregatedLogsBlock.render(TFileAggregatedLogsBlock.java:111)
>         at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController.renderAggregatedLogsBlock(LogAggregationTFileController.java:341)
>         at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:117)
>         at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>         at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>         at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>         at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>         at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
>         at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
>         at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
>         at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>         at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
>         at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.logs(HsController.java:202)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at
> {code}
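The corruption step above (appending junk bytes after the TFile trailer) can also be done programmatically. A minimal Java sketch; the file name is illustrative, not from the issue:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CorruptTrailer {
    public static void main(String[] args) throws IOException {
        // Append junk to the end of a local copy of the aggregated log,
        // which corrupts the BCFile/TFile metadata stored in the trailer
        // and makes BCFile$Magic.readAndVerify fail on the next read.
        Path log = Paths.get(args.length > 0 ? args[0] : "aggregated.log");
        Files.write(log, "blah".getBytes(), StandardOpenOption.APPEND);
    }
}
```

After running this against the local copy, `hdfs dfs -put` of the file back into the aggregation directory reproduces the rendering failure described above.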
[jira] [Commented] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074957#comment-17074957 ]

Siddharth Ahuja commented on YARN-10207:
----------------------------------------
Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings (see the attached screenshot). I updated the code slightly so that I fixed up the indentation as per the guidelines and also in some cases prevented the issue altogether. Let me know if this resolves your comment. Thanks again!
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Ahuja updated YARN-10207:
-----------------------------------
    Attachment:     (was: Indentation settings.png)
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074957#comment-17074957 ]

Siddharth Ahuja edited comment on YARN-10207 at 4/3/20, 11:02 PM:
------------------------------------------------------------------
Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings. I updated the code slightly so that I fixed up the indentation as per the guidelines and also in some cases prevented the issue altogether. Let me know if this resolves your comment. Thanks again!

was (Author: sahuja):
Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings (see the attached screenshot). I updated the code slightly so that I fixed up the indentation as per the guidelines and also in some cases prevented the issue altogether. Let me know if this resolves your comment. Thanks again!
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Ahuja updated YARN-10207:
-----------------------------------
    Attachment: YARN-10207.001.patch
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ]

Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:50 AM:
------------------------------------------------------------------
Hi [~adam.antal], thanks for your comments.

The leak happens when AggregatedLogFormat.LogReader is being instantiated; specifically, when the TFile.Reader creation inside AggregatedLogFormat.LogReader's constructor fails because a corrupted file was passed in (see the stack trace above). The FSDataInputStream opened for the reader is never closed, and that is what causes the leak.

The caller, TFileAggregatedLogsBlock.render(...), does try to clean up the reader in its finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153); however, it assumes the reader was created successfully. In our case the reader never gets created, because construction itself fails on the corrupted log.

The fix, therefore, is to catch any IOException inside the AggregatedLogFormat.LogReader constructor itself, close all the relevant entities including the FSDataInputStream, and rethrow the exception to the caller (TFileAggregatedLogsBlock.render) so that it can catch and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures that we don't leak connections wherever the reader fails to instantiate (= new AggregatedLogFormat.LogReader).
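The close-on-constructor-failure pattern described above can be sketched in plain Java. The class and stream names here are simplified stand-ins, not the actual Hadoop types:

```java
import java.io.Closeable;
import java.io.IOException;

/** Stand-in for FSDataInputStream: records whether close() was called. */
class TrackingStream implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}

/** Simplified stand-in for AggregatedLogFormat.LogReader. */
class LogReaderSketch {
    private final TrackingStream in;

    LogReaderSketch(TrackingStream stream, boolean corrupt) throws IOException {
        this.in = stream;
        try {
            // Stands in for "new TFile.Reader(...)", which throws on a
            // corrupted file before the reader is fully constructed.
            if (corrupt) {
                throw new IOException("Not a valid BCFile.");
            }
        } catch (IOException e) {
            // Close the underlying stream before propagating: the caller's
            // finally clause never sees a reader object, so it cannot do
            // this cleanup for us.
            stream.close();
            throw e;
        }
    }
}
```

Before the fix, the stream stayed open whenever construction failed; with the catch block, the descriptor is released even though the caller never obtains a reader to close.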
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml:

{code}
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
  <value>/tmp/ifilelogs</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
  <value>ifilelogs</value>
</property>
{code}

Like the earlier scenario, I corrupted the IFile (aggregated log in HDFS) and tried to render it in the JHS Web UI; however, no leaks were found in this case. This is the call flow: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(...). An IOException is encountered inside that try block, but notice the finally clause here: https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. It cleans up the socket connection by closing the FSDataInputStream. You will notice that this is a different call stack from the TFile case, as there is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently. Regardless, thanks to that finally clause, the connection is cleaned up and there are no CLOSE_WAIT leaks when a corrupted log file is encountered.
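The finally-based cleanup that keeps the IFile path leak-free can be sketched like this, again with simplified stand-in types rather than the real Hadoop classes:

```java
import java.io.Closeable;
import java.io.IOException;

/** Stand-in for FSDataInputStream: records whether close() was called. */
class TrackedInput implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}

class IndexedLogsMetaSketch {
    /**
     * Mirrors the shape of loadIndexedLogsMeta: open a stream, parse it,
     * and close the stream in a finally clause so that a parse failure on
     * a corrupted file cannot leak the descriptor.
     */
    static void loadMeta(TrackedInput in, boolean corrupt) throws IOException {
        try {
            // Stands in for reading and validating the IFile metadata.
            if (corrupt) {
                throw new IOException("invalid log meta");
            }
        } finally {
            in.close();  // runs on both the success and the failure path
        }
    }
}
```

Because the stream is closed in the same method that opened it, this path never depends on the caller for cleanup, which is why no CLOSE_WAIT sockets accumulate here even on corrupted input.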
(One bad thing here is that only a WARN log is presented to the user in the JHS logs when rendering fails for IFile logs, and no stack trace from the exception is logged at https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136, as the exception is simply swallowed inside the catch{} clause. This may warrant a separate JIRA.)

As part of this fix, I looked for any other occurrences of "new TFile.Reader" that might cause connection leaks somewhere else. I found two:
# TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and
# FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691

1 is not an issue because FSDataInputStream is getting closed inside finally{} clause here:
[jira] [Commented] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071621#comment-17071621 ]

Siddharth Ahuja commented on YARN-10207:
----------------------------------------
Fixing up checkstyle warnings as per https://builds.apache.org/job/PreCommit-YARN-Build/25787/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.txt.
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Attachment: YARN-10207.002.patch > CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated > logs on the JobHistoryServer Web UI > - > > Key: YARN-10207 > URL: https://issues.apache.org/jira/browse/YARN-10207 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10207.001.patch, YARN-10207.002.patch > > > File descriptor leaks are observed coming from the JobHistoryServer process > while it tries to render a "corrupted" aggregated log on the JHS Web UI. > Issue reproduced using the following steps: > # Ran a sample Hadoop MR Pi job, it had the id - > application_1582676649923_0026. > # Copied an aggregated log file from HDFS to local FS: > {code} > hdfs dfs -get > /tmp/logs/systest/logs/application_1582676649923_0026/_8041 > {code} > # Updated the TFile metadata at the bottom of this file with some junk to > corrupt the file : > *Before:* > {code} > > ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP > {code} > *After:* > {code} > > ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah > {code} > Notice "blah" (junk) added at the very end. 
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ] Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:36 AM: -- Hi [~adam.antal], thanks for your comments. The leak happens while AggregatedLogFormat.LogReader is being instantiated: when the TFile.Reader created inside the LogReader constructor fails on a corrupted file (see the stack trace above), the FSDataInputStream that was already opened is never closed, and that is what leaks. The caller, TFileAggregatedLogsBlock.render(…), does try to clean up the reader in its finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), but it assumes the reader was created successfully. In our case the reader never gets created, because construction itself fails on the corrupted log. The fix, therefore, is to catch any IOException inside the AggregatedLogFormat.LogReader constructor itself, close all the relevant resources (including the FSDataInputStream), and rethrow the exception to the caller (TFileAggregatedLogsBlock.render) so that it can catch and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures we don't leak connections wherever the reader fails to instantiate (i.e. new AggregatedLogFormat.LogReader).
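The cleanup-on-failed-construction pattern described above can be sketched in plain Java. This is a simplified analogue with hypothetical TrackingStream/LeakFreeReader classes, not the real Hadoop types:

```java
import java.io.Closeable;
import java.io.IOException;

// Stand-in for FSDataInputStream: records whether close() was called.
class TrackingStream implements Closeable {
    boolean closed = false;
    @Override public void close() { closed = true; }
}

// Stand-in for AggregatedLogFormat.LogReader: if construction fails partway,
// close what was already opened before rethrowing, so the caller never leaks
// the stream even though it has no reader reference to clean up.
class LeakFreeReader implements Closeable {
    private final TrackingStream in;

    LeakFreeReader(TrackingStream in, boolean corrupt) throws IOException {
        this.in = in;
        try {
            if (corrupt) {
                // stands in for TFile.Reader failing on a bad BCFile magic
                throw new IOException("Not a valid BCFile.");
            }
        } catch (IOException e) {
            in.close();   // the essential cleanup the original constructor lacked
            throw e;      // rethrow so the caller can catch and log it
        }
    }

    @Override public void close() { in.close(); }
}

public class ReaderLeakDemo {
    public static void main(String[] args) {
        TrackingStream s = new TrackingStream();
        try {
            new LeakFreeReader(s, true);
        } catch (IOException expected) {
            // construction failed, but the stream was still closed
        }
        System.out.println("closed=" + s.closed);
    }
}
```

Running the demo prints `closed=true`: the stream is released even though the reader was never fully constructed.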
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml:
{code}
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
  <value>/tmp/ifilelogs</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
  <value>ifilelogs</value>
</property>
{code}
As in the earlier scenario, I corrupted the IFile (the aggregated log in HDFS) and tried to render it in the JHS Web UI; however, no leaks were found in this case. The call flow is: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…). An IOException is encountered inside that try block, but note the finally clause here -> https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900 - it cleans up the socket connection by closing the FSDataInputStream. Note that this is a different call stack from the TFile case, as there is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently. Regardless, thanks to that finally clause, the connection is cleaned up and there are no CLOSE_WAIT leaks when a corrupted log file is encountered.
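The try/finally idiom that keeps the IFile path leak-free can be sketched as follows. This is a minimal analogue with hypothetical names, not the real LogAggregationIndexedFileController code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class FinallyCloseDemo {
    // Stand-in for FSDataInputStream: records whether close() was called.
    static class TrackedStream extends ByteArrayInputStream {
        boolean closed = false;
        TrackedStream(byte[] buf) { super(buf); }
        @Override public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Analogue of loadIndexedLogsMeta: parsing may throw on a corrupted file,
    // but the finally{} clause closes the stream on every exit path.
    static void loadMeta(TrackedStream in) throws IOException {
        try {
            throw new IOException("corrupted log meta");  // simulated parse failure
        } finally {
            in.close();  // always runs, matching the IFile controller's cleanup
        }
    }

    public static void main(String[] args) {
        TrackedStream in = new TrackedStream(new byte[]{1, 2, 3});
        try {
            loadMeta(in);
        } catch (IOException expected) {
            // the parse failed, but the stream did not leak
        }
        System.out.println("closed=" + in.closed);
    }
}
```

The demo prints `closed=true`, which is why the IFile path shows no CLOSE_WAIT build-up even on corrupted files.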
(One bad thing here is that when rendering fails for IFile logs, only a WARN message is presented to the user in the JHS logs, with no stack trace from the exception, because the exception is simply swallowed inside the catch clause here - https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136. This may warrant a separate JIRA.) As part of this fix, I looked for any other occurrences of "new TFile.Reader" that may cause connection leaks elsewhere. I found two:
# TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and
# FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691
#1 is not an issue because the FSDataInputStream is closed inside the finally clause here:
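If that separate JIRA is filed, the gist of the improvement is simply to preserve the exception's stack trace in the log instead of dropping it. A hypothetical illustration (not the real IndexedFileAggregatedLogsBlock code):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

public class WarnWithStackDemo {
    // Render a WARN-style message that carries the full stack trace,
    // instead of a bare message with the exception swallowed.
    static String warnMessage(String context, Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return "WARN " + context + System.lineSeparator() + sw;
    }

    public static void main(String[] args) {
        IOException cause = new IOException("Not a valid BCFile.");
        String msg = warnMessage("Error getting logs for the job", cause);
        // The exception class and message survive into the log output.
        System.out.println(msg.contains("java.io.IOException"));
    }
}
```

With most logging frameworks the same effect comes from passing the Throwable as the last argument to the warn call rather than formatting it by hand.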
[jira] [Assigned] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-9355: - Assignee: Umesh (was: Siddharth Ahuja)
> RMContainerRequestor#makeRemoteRequest has confusing log message
>
> Key: YARN-9355
> URL: https://issues.apache.org/jira/browse/YARN-9355
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Assignee: Umesh
> Priority: Trivial
> Labels: newbie, newbie++
>
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor#makeRemoteRequest has this log:
> {code:java}
> if (ask.size() > 0 || release.size() > 0) {
>   LOG.info("getResources() for " + applicationId + ":" + " ask="
>       + ask.size() + " release= " + release.size() + " newContainers="
>       + allocateResponse.getAllocatedContainers().size()
>       + " finishedContainers=" + numCompletedContainers
>       + " resourcelimit=" + availableResources + " knownNMs="
>       + clusterNmCount);
> }
> {code}
> The reason "getResources()" is printed is that org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator#getResources invokes makeRemoteRequest. This is not very informative and is error-prone, as the name of getResources could change over time, leaving the log message outdated. Moreover, it is not a good idea for a method to log the name of its caller.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
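One possible direction for the fix, sketched with hypothetical wording (this is not a committed patch): build the log line around the allocation event itself instead of embedding the caller's method name, so renaming getResources() cannot make the message lie.

```java
public class AllocateLogDemo {
    // Hypothetical replacement for the "getResources() for ..." message:
    // describe the allocate round-trip, with no caller method name baked in.
    static String allocateLogLine(String applicationId, int ask, int release,
            int newContainers, int finishedContainers,
            String resourceLimit, int knownNMs) {
        return "Allocation update for " + applicationId + ": ask=" + ask
            + " release=" + release + " newContainers=" + newContainers
            + " finishedContainers=" + finishedContainers
            + " resourcelimit=" + resourceLimit + " knownNMs=" + knownNMs;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(allocateLogLine("application_1582676649923_0026",
            2, 0, 1, 3, "<memory:8192, vCores:8>", 4));
    }
}
```

The message stays accurate no matter which method calls makeRemoteRequest.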
[jira] [Commented] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070557#comment-17070557 ] Siddharth Ahuja commented on YARN-5277: --- Thank you for the tool suggestion [~brahmareddy]! Kindly allow me some time to set this up internally and put out a formal patch, and I will update the JIRA. Thanks again for your kind help.
> when localizers fail due to resource timestamps being out, provide more diagnostics
>
> Key: YARN-5277
> URL: https://issues.apache.org/jira/browse/YARN-5277
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Siddharth Ahuja
> Priority: Major
>
> When an NM fails a resource D/L as the timestamps are wrong, there's not much info, just two long values.
> It would be good to also include the local time values, *and the current wall time*. These are the things people need to know when trying to work out what went wrong.
[jira] [Comment Edited] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070651#comment-17070651 ] Siddharth Ahuja edited comment on YARN-9355 at 3/30/20, 2:50 AM: - Hi [~ykabusalah], I am not sure how this got pulled out from under my name; I would have expected to be checked with first. I have recently started working through the JIRAs created by [~snemeth] at my work, and was going to pick this one up in the near future. If you still want to carry on with it, please go ahead, but in future I would appreciate checking with the assignee before taking over an issue.
[jira] [Commented] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070651#comment-17070651 ] Siddharth Ahuja commented on YARN-9355: --- Hi [~ykabusalah], not sure how this got pulled from under my name. I would have expected a check with me before to do that. I have recently started to get going on the JIRAs that were created by [~snemeth] at my work, as such, was gonna work this one in the near future. If you do still want to carry on with this one, please go ahead but I would appreciate checking it with the assignee before to just nick it in future. > RMContainerRequestor#makeRemoteRequest has confusing log message > > > Key: YARN-9355 > URL: https://issues.apache.org/jira/browse/YARN-9355 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Yousef Abu-Salah >Priority: Trivial > Labels: newbie, newbie++
[jira] [Assigned] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-9355: - Assignee: Siddharth Ahuja (was: Yousef Abu-Salah) > RMContainerRequestor#makeRemoteRequest has confusing log message > > > Key: YARN-9355 > URL: https://issues.apache.org/jira/browse/YARN-9355 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Trivial > Labels: newbie, newbie++
[jira] [Commented] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070679#comment-17070679 ] Siddharth Ahuja commented on YARN-9355: --- Thanks [~ykabusalah], no worries! > RMContainerRequestor#makeRemoteRequest has confusing log message > > > Key: YARN-9355 > URL: https://issues.apache.org/jira/browse/YARN-9355 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Yousef Abu-Salah >Priority: Trivial > Labels: newbie, newbie++
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: Issue reproduced using the following steps: # Ran a sample Hadoop MR Pi job, it had the id - application_1582676649923_0026. # Copied an aggregated log file from HDFS to local FS: {code} hdfs dfs -get /tmp/logs/systest/logs/application_1582676649923_0026/_8041 {code} # Updated the TFile metadata at the bottom of this file with some junk to corrupt the file : *Before:* {code} ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP {code} *After:* {code} ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah {code} Notice "blah" (junk) added at the very end. # Remove the existing aggregated log file that will need to be replaced by our modified copy from step 3 (as otherwise HDFS will prevent it from placing the file with the same name as it already exists): {code} hdfs dfs -rm -r -f /tmp/logs/systest/logs/application_1582676649923_0026/_8041 {code} # Upload the corrupted aggregated file back to HDFS: {code} hdfs dfs -put _8041 /tmp/logs/systest/logs/application_1582676649923_0026 {code} # Visit HistoryServer Web UI # Click on job_1582676649923_0026 # Click on "logs" link against the AM (assuming the AM ran on nm_hostname) # Review the JHS logs, following exception will be seen: {code} 2020-03-24 20:03:48,484 ERROR org.apache.hadoop.yarn.webapp.View: Error getting logs for job_1582676649923_0026 java.io.IOException: Not a valid BCFile. 
at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927) at org.apache.hadoop.io.file.tfile.BCFile$Reader.(BCFile.java:628) at org.apache.hadoop.io.file.tfile.TFile$Reader.(TFile.java:804) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.(AggregatedLogFormat.java:588) at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.TFileAggregatedLogsBlock.render(TFileAggregatedLogsBlock.java:111) at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController.renderAggregatedLogsBlock(LogAggregationTFileController.java:341) at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:117) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.logs(HsController.java:202) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:162) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287) at 
com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875) at
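The CLOSE_WAIT leak described in this issue follows a generic Java resource-handling pattern. The sketch below uses simplified stand-in types, not Hadoop's actual reader classes: when a reader's constructor throws on a corrupted header (the "Not a valid BCFile." path above), the stream opened beforehand must still be closed, otherwise its underlying socket lingers in CLOSE_WAIT.

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative stand-in code, not Hadoop's implementation.
public class LeakDemo {

    // Stand-in for an HDFS input stream backed by a socket.
    static class TrackedStream implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Mimics rendering a log: the header check may throw on corruption.
    // The finally block closes the stream on both the success and the
    // failure path, which is the general fix pattern for this class of leak.
    static void renderWithCleanup(TrackedStream in, boolean corrupt)
            throws IOException {
        try {
            if (corrupt) {
                throw new IOException("Not a valid BCFile.");
            }
            // ... read and render the aggregated log ...
        } finally {
            in.close();
        }
    }
}
```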
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: (repeats the reproduction steps and stack trace quoted above)
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: (repeats the reproduction steps and stack trace quoted above)
[jira] [Assigned] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10207: -- Assignee: Siddharth Ahuja > CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated > logs on the JobHistoryServer Web UI > - > > Key: YARN-10207 > URL: https://issues.apache.org/jira/browse/YARN-10207 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major
[jira] [Created] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
Siddharth Ahuja created YARN-10207: -- Summary: CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI Key: YARN-10207 URL: https://issues.apache.org/jira/browse/YARN-10207 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Siddharth Ahuja (description repeats the reproduction steps and stack trace quoted above)
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: File descriptor leaks are observed coming from the JobHistoryServer process while it tries to render a "corrupted" aggregated log on the JHS Web UI. (The rest of the description repeats the reproduction steps and stack trace quoted above.)
[jira] [Updated] (YARN-9996) Code cleanup in QueueAdminConfigurationMutationACLPolicy
[ https://issues.apache.org/jira/browse/YARN-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-9996: -- Attachment: YARN-9996-branch-3.2.001.patch > Code cleanup in QueueAdminConfigurationMutationACLPolicy > > > Key: YARN-9996 > URL: https://issues.apache.org/jira/browse/YARN-9996 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.2.001.patch, YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.3.001.patch, YARN-9996.001.patch > > > Method 'isMutationAllowed' contains many uses of substring and lastIndexOf. > These could be extracted and simplified. > Some logging could also be added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
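As an illustration only (a hypothetical helper, not the actual patch to QueueAdminConfigurationMutationACLPolicy), the repeated substring/lastIndexOf calls for deriving a parent queue path could be pulled into one named method:

```java
// Hypothetical helper illustrating the kind of extraction the JIRA
// suggests; the real cleanup lives in the YARN ACL policy class.
public class QueuePathUtil {

    // "root.a.b" -> "root.a"; returns null for a path with no parent.
    static String parentQueueOf(String queuePath) {
        int lastDot = queuePath.lastIndexOf('.');
        return lastDot < 0 ? null : queuePath.substring(0, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(parentQueueOf("root.a.b")); // prints root.a
    }
}
```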
[jira] [Commented] (YARN-9996) Code cleanup in QueueAdminConfigurationMutationACLPolicy
[ https://issues.apache.org/jira/browse/YARN-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091002#comment-17091002 ] Siddharth Ahuja commented on YARN-9996: --- Thank you [~snemeth]! > Code cleanup in QueueAdminConfigurationMutationACLPolicy > > > Key: YARN-9996 > URL: https://issues.apache.org/jira/browse/YARN-9996 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.4.0 > > Attachments: YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.3.001.patch, YARN-9996.001.patch > > > Method 'isMutationAllowed' contains many uses of substring and lastIndexOf. > These could be extracted and simplified. > Also, some logging could be added as well.
[jira] [Comment Edited] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064176#comment-17064176 ] Siddharth Ahuja edited comment on YARN-10075 at 3/22/20, 9:23 AM: -- Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Removed a commented out line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. was (Author: sahuja): Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. 
Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Cleaned up an unused commented line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. 
> historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. > Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and instantiate the HistoryClientService as follows:
[jira] [Comment Edited] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064176#comment-17064176 ] Siddharth Ahuja edited comment on YARN-10075 at 3/22/20, 9:24 AM: -- Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Removed a commented out line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class as it was near the code that was being cleaned up in 2. was (Author: sahuja): Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. 
Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Removed a commented out line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. 
> historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. > Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and
[jira] [Updated] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10075: --- Attachment: YARN-10075.001.patch > historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. 
> Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and instantiate the HistoryClientService as follows: > {code} > @VisibleForTesting > protected HistoryClientService createHistoryClientService() { > return new HistoryClientService((HistoryContext)jobHistoryService, > this.jhsDTSecretManager); > } > {code}
[jira] [Commented] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064176#comment-17064176 ] Siddharth Ahuja commented on YARN-10075: Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Cleaned up an unused commented line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. 
> historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. 
> Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and instantiate the HistoryClientService as follows: > {code} > @VisibleForTesting > protected HistoryClientService createHistoryClientService() { > return new HistoryClientService((HistoryContext)jobHistoryService, > this.jhsDTSecretManager); > } > {code}
[jira] [Commented] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064473#comment-17064473 ] Siddharth Ahuja commented on YARN-10001: Hi [~snemeth], I have added explanations for methods that have no implementation - _checkVersion, storeVersion_ and that return a null (i.e. methods that do nothing) - _getCurrentVersion, getConfStoreVersion, getLogs, getConfirmedConfHistory._ Kindly let me know if you are ok with the descriptions (+cc [~wilfreds]). > Add explanation of unimplemented methods in InMemoryConfigurationStore > -- > > Key: YARN-10001 > URL: https://issues.apache.org/jira/browse/YARN-10001 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10001.001.patch > >
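As a rough illustration of the kind of explanation being added (the class name, method shapes, and wording below are assumptions for the sketch, not the actual patch text), a deliberately-no-op method in an in-memory store might be documented like this:

```java
public class InMemoryStoreSketch {
    /**
     * An in-memory store keeps no persistent state, so there is no mutation
     * log to return and nothing to replay after a restart.
     *
     * @return null, always: no log is kept by this store.
     */
    public Object getLogs() {
        return null;
    }

    /**
     * Versioning is meaningless for a store whose contents vanish on
     * restart, so no version information is ever recorded.
     *
     * @return null, always: no version information exists.
     */
    public String getConfStoreVersion() {
        return null;
    }
}
```

The point of the change is that the "return null" is a documented design decision rather than an apparent omission.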
[jira] [Comment Edited] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064845#comment-17064845 ] Siddharth Ahuja edited comment on YARN-10001 at 3/23/20, 2:40 PM: -- This was the output from the earlier build: {code} -1 overall | Vote |Subsystem | Runtime | Comment | 0 | reexec | 0m 46s | Docker mode activated. | | || Prechecks | +1 | @author | 0m 0s | The patch does not contain any @author | | || tags. | -1 | test4tests | 0m 0s | The patch doesn't appear to include | | || any new or modified tests. Please | | || justify why no new tests are needed for | | || this patch. Also please list what | | || manual steps were performed to verify | | || this patch. | | || trunk Compile Tests | +1 | mvninstall | 21m 48s | trunk passed | +1 | compile | 0m 45s | trunk passed | +1 | checkstyle | 0m 35s | trunk passed | +1 | mvnsite | 0m 47s | trunk passed | +1 |shadedclient | 15m 31s | branch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 35s | trunk passed | +1 | javadoc | 0m 30s | trunk passed | | || Patch Compile Tests | +1 | mvninstall | 0m 43s | the patch passed | +1 | compile | 0m 38s | the patch passed | +1 | javac | 0m 38s | the patch passed | -0 | checkstyle | 0m 27s | | | || hadoop-yarn-project/hadoop-yarn/hadoop-y | | || arn-server/hadoop-yarn-server-resourcema | | || nager: The patch generated 7 new + 1 | | || unchanged - 0 fixed = 8 total (was 1) | +1 | mvnsite | 0m 41s | the patch passed | +1 | whitespace | 0m 0s | The patch has no whitespace issues. | +1 |shadedclient | 14m 22s | patch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 40s | the patch passed | +1 | javadoc | 0m 26s | the patch passed | | || Other Tests | +1 |unit | 103m 21s | hadoop-yarn-server-resourcemanager in | | || the patch passed. | +1 | asflicense | 0m 25s | The patch does not generate ASF | | || License warnings. 
| | | 164m 49s | {code} Note that the changes for this JIRA are only related to comments for methods, therefore, no new tests were added or modified (they don't need to). was (Author: sahuja): This was the output from the earlier build: {code} -1 overall | Vote |Subsystem | Runtime | Comment | 0 | reexec | 0m 46s | Docker mode activated. | | || Prechecks | +1 | @author | 0m 0s | The patch does not contain any @author | | || tags. | -1 | test4tests | 0m 0s | The patch doesn't appear to include | | || any new or modified tests. Please | | || justify why no new tests are needed for | | || this patch. Also please list what | | || manual steps were performed to verify | | || this patch. | | || trunk Compile Tests | +1 | mvninstall | 21m 48s | trunk passed | +1 | compile | 0m 45s | trunk passed | +1 | checkstyle | 0m 35s | trunk passed | +1 | mvnsite | 0m 47s | trunk passed | +1 |shadedclient | 15m 31s | branch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 35s | trunk passed | +1 | javadoc | 0m 30s | trunk passed | | || Patch Compile Tests | +1 | mvninstall | 0m 43s | the patch passed | +1 | compile | 0m 38s | the patch passed | +1 | javac | 0m 38s | the patch passed | -0 | checkstyle | 0m 27s | | |
[jira] [Commented] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064845#comment-17064845 ] Siddharth Ahuja commented on YARN-10001: This was the output from the earlier build: {code} -1 overall | Vote |Subsystem | Runtime | Comment | 0 | reexec | 0m 46s | Docker mode activated. | | || Prechecks | +1 | @author | 0m 0s | The patch does not contain any @author | | || tags. | -1 | test4tests | 0m 0s | The patch doesn't appear to include | | || any new or modified tests. Please | | || justify why no new tests are needed for | | || this patch. Also please list what | | || manual steps were performed to verify | | || this patch. | | || trunk Compile Tests | +1 | mvninstall | 21m 48s | trunk passed | +1 | compile | 0m 45s | trunk passed | +1 | checkstyle | 0m 35s | trunk passed | +1 | mvnsite | 0m 47s | trunk passed | +1 |shadedclient | 15m 31s | branch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 35s | trunk passed | +1 | javadoc | 0m 30s | trunk passed | | || Patch Compile Tests | +1 | mvninstall | 0m 43s | the patch passed | +1 | compile | 0m 38s | the patch passed | +1 | javac | 0m 38s | the patch passed | -0 | checkstyle | 0m 27s | | | || hadoop-yarn-project/hadoop-yarn/hadoop-y | | || arn-server/hadoop-yarn-server-resourcema | | || nager: The patch generated 7 new + 1 | | || unchanged - 0 fixed = 8 total (was 1) | +1 | mvnsite | 0m 41s | the patch passed | +1 | whitespace | 0m 0s | The patch has no whitespace issues. | +1 |shadedclient | 14m 22s | patch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 40s | the patch passed | +1 | javadoc | 0m 26s | the patch passed | | || Other Tests | +1 |unit | 103m 21s | hadoop-yarn-server-resourcemanager in | | || the patch passed. | +1 | asflicense | 0m 25s | The patch does not generate ASF | | || License warnings. 
| | | 164m 49s | {code} Note that the changes for this JIRA only add comments, so no new or modified tests are needed. > Add explanation of unimplemented methods in InMemoryConfigurationStore > -- > > Key: YARN-10001 > URL: https://issues.apache.org/jira/browse/YARN-10001 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10001.001.patch, YARN-10001.002.patch > >
[jira] [Commented] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064836#comment-17064836 ] Siddharth Ahuja commented on YARN-10001: Found checkstyle warnings coming from https://builds.apache.org/job/PreCommit-YARN-Build/25734/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt. Imported the checkstyle rules from https://github.com/apache/hadoop/tree/trunk/hadoop-build-tools/src/main/resources/checkstyle/ into IntelliJ and reproduced the same warnings there, so I should be covered for future patches. Fixed them all and am delivering the new patch now. > Add explanation of unimplemented methods in InMemoryConfigurationStore > -- > > Key: YARN-10001 > URL: https://issues.apache.org/jira/browse/YARN-10001 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10001.001.patch > >
[jira] [Commented] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065269#comment-17065269 ] Siddharth Ahuja commented on YARN-5277: --- Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation in terms of non-test source code. However, I did have a question regarding the Junit code coverage tool -> _Clover_ . I tried to run the following command: {code} mvn test -Pclover {code} but it resulted in the following error: {code} Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users/sidtheadmin/.clover.license'. -> [Help 1] that I tried to run to see if we are already covering the impacted code through Junit testing or not. I used the following command to run it: {code} I could try and supply a clover license through : {code} mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license] {code} as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt, however, I need the clover.license. I somehow found a link where I could get that potentially - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license but as I am not a committer, I don't have the credentials (I get asked for username/password). As such, can you kindly help me with a clover license? I am really interesting in getting this so that I know if we already have an existing test method in the test class that already covers what I am trying to modify and hence, I can just update that method. If it is not covered yet, then, I will have to write up a new junit test for that. Thanks in advance for your kind assistance! 
> when localizers fail due to resource timestamps being out, provide more > diagnostics > --- > > Key: YARN-5277 > URL: https://issues.apache.org/jira/browse/YARN-5277 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Siddharth Ahuja >Priority: Major > > When an NM fails a resource D/L as the timestamps are wrong, there's not much > info, just two long values. > It would be good to also include the local time values, *and the current wall > time*. These are the things people need to know when trying to work out what > went wrong
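A sketch of the diagnostics requested above: alongside the two raw long values, also print them (and the current wall time) as human-readable dates so a clock or timestamp mismatch is obvious at a glance. The method name and message wording here are illustrative only, not the actual patch:

```java
import java.util.Date;

public class TimestampDiagnostics {
    // Builds a diagnostic message that shows each epoch-millis value both
    // raw and as a local date, plus the current wall time for comparison.
    public static String describeMismatch(long expectedTs, long actualTs, long nowTs) {
        return "Resource changed on src filesystem - expected timestamp "
            + expectedTs + " (" + new Date(expectedTs) + "), was "
            + actualTs + " (" + new Date(actualTs) + "); current wall time: "
            + nowTs + " (" + new Date(nowTs) + ")";
    }
}
```

With this shape, an operator can immediately see whether the mismatch is a stale resource or a skewed clock, instead of decoding two bare longs.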
[jira] [Comment Edited] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065269#comment-17065269 ] Siddharth Ahuja edited comment on YARN-5277 at 3/24/20, 2:54 AM: - Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation in terms of non-test source code. However, I did have a question regarding the Junit code coverage tool -> _Clover_ . I tried to run the following command: {code} mvn test -Pclover {code} but it resulted in the following error: {code} Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users//.clover.license'. -> [Help 1] that I tried to run to see if we are already covering the impacted code through Junit testing or not. I used the following command to run it: {code} I could try and supply a clover license through : {code} mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license] {code} as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt, however, I need the clover.license. I somehow found a link where I could get that potentially - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license but as I am not a committer, I don't have the credentials (I get asked for username/password). As such, can you kindly help me with a clover license? I am really interesting in getting this so that I know if we already have an existing test method in the test class that already covers what I am trying to modify and hence, I can just update that method. If it is not covered yet, then, I will have to write up a new junit test for that. Thanks in advance for your kind assistance! was (Author: sahuja): Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation in terms of non-test source code. 
However, I did have a question regarding the Junit code coverage tool -> _Clover_ . I tried to run the following command: {code} mvn test -Pclover {code} but it resulted in the following error: {code} Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users/sidtheadmin/.clover.license'. -> [Help 1] that I tried to run to see if we are already covering the impacted code through Junit testing or not. I used the following command to run it: {code} I could try and supply a clover license through : {code} mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license] {code} as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt, however, I need the clover.license. I somehow found a link where I could get that potentially - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license but as I am not a committer, I don't have the credentials (I get asked for username/password). As such, can you kindly help me with a clover license? I am really interesting in getting this so that I know if we already have an existing test method in the test class that already covers what I am trying to modify and hence, I can just update that method. If it is not covered yet, then, I will have to write up a new junit test for that. Thanks in advance for your kind assistance! > when localizers fail due to resource timestamps being out, provide more > diagnostics > --- > > Key: YARN-5277 > URL: https://issues.apache.org/jira/browse/YARN-5277 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Siddharth Ahuja >Priority: Major > > When an NM fails a resource D/L as the timestamps are wrong, there's not much > info, just two long values. 
> It would be good to also include the local time values, *and the current wall > time*. These are the things people need to know when trying to work out what > went wrong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065269#comment-17065269 ] Siddharth Ahuja edited comment on YARN-5277 at 3/24/20, 2:56 AM:
-
Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation for the non-test source code. However, I have a question regarding the JUnit code coverage tool, _Clover_, which I tried to run to check whether the impacted code is already covered by JUnit tests. I ran:
{code}
mvn test -Pclover
{code}
but it resulted in the following error:
{code}
Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users//.clover.license'. -> [Help 1]
{code}
I could try to supply a Clover license through:
{code}
mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license]
{code}
as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt; however, I need the clover.license file. I found a link where I could potentially get it - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license - but as I am not a committer, I don't have the credentials (I get asked for a username/password). As such, could you kindly help me with a Clover license? I am really interested in this so that I can tell whether an existing test method already covers what I am trying to modify, in which case I can simply update that method; if it is not covered yet, I will write a new JUnit test for it. I don't want to be reviewing multiple existing test methods to work out whether something is covered, as that approach is not robust. Thanks in advance for your kind assistance!
[jira] [Commented] (YARN-10000) Code cleanup in FSSchedulerConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114592#comment-17114592 ] Siddharth Ahuja commented on YARN-10000:
Hi [~BilwaST], thanks for checking; however, I intend to work on all of my JIRAs in the near future.
> Code cleanup in FSSchedulerConfigurationStore
> -
>
> Key: YARN-10000
> URL: https://issues.apache.org/jira/browse/YARN-10000
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Assignee: Siddharth Ahuja
> Priority: Minor
>
> Some things could be improved:
> * In initialize: PathFilter can be replaced with lambda
> * initialize is long, could be split into smaller methods
> * In method 'format': for-loop can be replaced with foreach
> * There's a variable with a typo: lastestConfigPath
> * Add explanation of unimplemented methods
> * Abstract Filesystem operations away more
> * Bad logging: Format string is combined with exception logging.
> {code:java}
> LOG.info("Failed to write config version at {}", configVersionFile, e);
> {code}
> * Interestingly phrased log messages like "write temp capacity configuration
> fail" and "write temp capacity configuration successfully, schedulerConfigFile="
> * Method "writeConfigurationToFileSystem" could be private
> * Any other code quality improvements
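The first cleanup items (single-method interface as a lambda, index-based for-loop as a foreach) can be illustrated with a self-contained sketch. This is an analogy, not the actual FSSchedulerConfigurationStore code: it uses java.util.function.Predicate as a stand-in for Hadoop's org.apache.hadoop.fs.PathFilter, and made-up file names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class LambdaCleanupSketch {
    // Before: anonymous implementation of a single-method interface
    // (analogous to Hadoop's PathFilter) plus an index-based for-loop.
    static List<String> filterOld(List<String> paths) {
        Predicate<String> filter = new Predicate<String>() {
            @Override
            public boolean test(String p) {
                return p.endsWith(".xml");
            }
        };
        List<String> out = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            if (filter.test(paths.get(i))) {
                out.add(paths.get(i));
            }
        }
        return out;
    }

    // After: a lambda and an enhanced for-loop; behaviour is identical.
    static List<String> filterNew(List<String> paths) {
        Predicate<String> filter = p -> p.endsWith(".xml");
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            if (filter.test(p)) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("capacity-scheduler.xml", "tmp.bak");
        System.out.println(filterOld(paths)); // [capacity-scheduler.xml]
        System.out.println(filterNew(paths)); // [capacity-scheduler.xml]
    }
}
```

Both shapes compile to the same behaviour; the lambda/foreach form is simply shorter and more idiomatic for modern Java.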
[jira] [Assigned] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10416: -- Assignee: Siddharth Ahuja > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Commented] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189076#comment-17189076 ] Siddharth Ahuja commented on YARN-10416: No tests required as the updates are only javadoc-related. > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > Attachments: YARN-10416.001.patch > > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Updated] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10416: --- Attachment: YARN-10416.001.patch > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > Attachments: YARN-10416.001.patch > > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Updated] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10416: --- Attachment: (was: YARN-10416.001.patch) > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > Attachments: YARN-10416.001.patch > > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Commented] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188991#comment-17188991 ] Siddharth Ahuja commented on YARN-10416:
* Fixed up the overall method description,
* Added explanations for the individual params,
* The {{updateRequests}} param's explanation was incorrectly set to the return type:
{code}
updateRequests - @return the Allocation for the application
{code}
Fixed this so that updateRequests has its own explanation and the return type is moved onto its own line.
> Typos in YarnScheduler#allocate method's doc comment
>
> Key: YARN-10416
> URL: https://issues.apache.org/jira/browse/YARN-10416
> Project: Hadoop YARN
> Issue Type: Bug
> Components: docs
> Reporter: Wanqiang Ji
> Assignee: Siddharth Ahuja
> Priority: Minor
> Labels: newbie
>
> {code:java}
> /**
> * The main api between the ApplicationMaster and the Scheduler.
> * The ApplicationMaster is updating his future resource requirements
> * and may release containers he doens't need.
> */
> {code}
> doens't correct to doesn't
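A corrected doc comment along the lines described might read as follows. This is a hypothetical sketch, not the patched Hadoop source: the parameter names and the stub body are made up for illustration; only the typo fix and the @return-on-its-own-line structure come from the comment above.

```java
// Hypothetical sketch of the corrected YarnScheduler#allocate javadoc.
// The stub below is NOT the real Hadoop interface; parameter names are
// illustrative assumptions.
public class AllocateJavadocSketch {

    /**
     * The main api between the ApplicationMaster and the Scheduler.
     * The ApplicationMaster may update its future resource requirements
     * and release containers it doesn't need.
     *
     * @param appAttemptId   the application attempt issuing this allocate call
     * @param ask            new or updated resource requests
     * @param release        containers the ApplicationMaster no longer needs
     * @param updateRequests requests to update already-allocated containers
     * @return the Allocation for the application
     */
    static String allocate(String appAttemptId, String ask, String release,
                           String updateRequests) {
        return "Allocation for " + appAttemptId; // placeholder body
    }

    public static void main(String[] args) {
        System.out.println(allocate("appattempt_0001", "", "", ""));
    }
}
```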
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183823#comment-17183823 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:45 AM:
-
Testing done on the platform:
*+1. Test Jstack collection for a non-RUNNING app:+*
a. Ensure there is a YARN application that is already present from a previous run and is NOT currently RUNNING.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the non-running app. The Jstack button should be visible.
c. Click on the Jstack button. An error message should be displayed -> "Jstack cannot be collected for an application that is not running.", because it is not possible to collect a jstack for a non-running application as it has no running containers.
*+2. Test Jstack collection for a RUNNING app:+*
a. Ensure there is a YARN application that is currently in RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down that shows the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel whose header shows the selected container and attempt id should be shown, along with the stdout logs for this container containing its thread dump.
f. Repeat step e. for another container. A thread dump should be captured and visible in the panel containing the stdout logs.
g. Go back and repeat step e. for the container that was first selected. Notice that two thread dumps are now present in the stdout logs, with the latest thread dump appearing later in the logs.
*+3. Error checking - Jstack fetch attempt for a container that is not running due to a killed application:+*
a. Kill the currently RUNNING application using: yarn application -kill ,
b. Now try selecting a container from the drop-down containing the container listing. Jstack collection is not possible, hence the error is displayed -> "Jstack fetch failed for container: due to: “Trying to signal an absent container ”".
*+4. Error checking - Jstack fetch attempt for a container while RMs/NMs are not available:+*
a. Ensure there is a YARN application that is currently in RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down that shows the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel whose header shows the selected container and attempt id should be shown, along with the stdout logs for this container containing its thread dump.
f. Stop the ResourceManager/s.
g. Select a different container from the drop-down list. An error should be displayed -> "Jstack fetch failed for container: due to: “Error: Not able to connect to YARN!”".
h. Restart the ResourceManager/s.
i. Repeat steps a. to e.
j. Stop the NodeManager/s.
k. Select a different container from the drop-down list. An error should be displayed -> "Logs fetch failed for container: due to: “Error: Not able to connect to YARN!”".
l. Start the NodeManager/s back up.
*+5. Check that the latest (and the ONLY) running app attempt id is displayed:+*
a. Ensure there is a YARN application that is currently in RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Now, run the following command to terminate the currently running AM: yarn container -signal GRACEFUL_SHUTDOWN
e. Run the following command to check the currently running app_attempt_id: yarn applicationattempt -list application_1598288770104_0003
[jira] [Commented] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja commented on YARN-1806:
---
This JIRA implements a feature: the addition of a "Jstack" button on the ResourceManager Web UI's individual application page, accessible by visiting RM Web UI -> Applications -> clicking on the application link (so the breadcrumb would be "Home / Applications / App [app_id] / Jstack"), to trigger thread dumps for the running YARN containers of a currently running application attempt. The thread dumps are captured as part of the stdout logs of the selected container and displayed as-is by querying the NodeManager node on which the container is running.
As part of this feature, two panels are implemented. The first panel displays two drop-downs, the first one showing the currently running app attempt id and a "None" option (similar to the "Logs" functionality). Once an attempt is selected, a second drop-down appears in the same panel, listing the currently running containers for that application attempt id. Once you select a container id from this second drop-down, another panel opens just below (again similar to the "Logs" functionality) that shows the selected attempt id and container in its header, with the container's stdout logs displayed containing the thread dump that was triggered when the container was selected.
The following sets of API calls are made:
API calls made when the Jstack button is clicked:
1. http://:8088/ws/v1/cluster/apps/ -> Get application info, e.g. the app state, from the RM.
2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from the RM, e.g. to check whether the app attempt state is RUNNING ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]). If the application is not RUNNING, an error is displayed based on the info from 1. above. If the application is RUNNING, then by checking the application attempt info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only, based on the info from 2. above.
API calls made when the app attempt is selected from the drop-down:
3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> Get the list of running containers for the currently running app attempt from the RM.
API calls made when the container is selected from the drop-down:
4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> The RM (which eventually reaches the NM through the NM heartbeat) sends a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3, and it generates a thread dump that is captured in the stdout logs of the container.
5. http://:8042/ws/v1/node/containerlogs//stdout -> The NM that is running the selected container serves the stdout logs of this running container, which contain the thread dump produced by the call above.
> webUI update to allow end users to request thread dump
> --
>
> Key: YARN-1806
> URL: https://issues.apache.org/jira/browse/YARN-1806
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Ming Ma
> Assignee: Siddharth Ahuja
> Priority: Major
>
> Both the individual container page and containers page will support this. After
> an end user clicks on the request link, they can follow it to get to the stdout page
> for the thread dump content.
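The call sequence above can be sketched by assembling the endpoint URLs. The hostnames, ports, ids and user name below are placeholder assumptions (the archived text dropped the originals), and no HTTP request is actually issued; only the URL shapes come from the comment above.

```java
// Sketch only: builds the REST endpoint URLs described above. Hostnames,
// ports, ids and the user name are placeholder assumptions; nothing is
// sent over the network.
public class JstackUrlSketch {
    static final String RM = "http://rm-host:8088"; // ResourceManager web address (assumed)
    static final String NM = "http://nm-host:8042"; // NodeManager web address (assumed)

    static String appInfo(String appId) {                               // step 1
        return RM + "/ws/v1/cluster/apps/" + appId;
    }
    static String appAttempts(String appId) {                           // step 2
        return RM + "/ws/v1/cluster/apps/" + appId + "/appattempts";
    }
    static String attemptContainers(String appId, String attemptId) {   // step 3
        return RM + "/ws/v1/cluster/apps/" + appId + "/appattempts/"
                + attemptId + "/containers";
    }
    static String threadDumpSignal(String containerId, String user) {   // step 4
        return RM + "/ws/v1/cluster/containers/" + containerId
                + "/signal/OUTPUT_THREAD_DUMP?user.name=" + user;
    }
    static String stdoutLogs(String containerId) {                      // step 5
        return NM + "/ws/v1/node/containerlogs/" + containerId + "/stdout";
    }

    public static void main(String[] args) {
        System.out.println(threadDumpSignal("container_0001", "systest"));
        System.out.println(stdoutLogs("container_0001"));
    }
}
```

Step 4 goes to the RM (which relays the signal via the NM heartbeat), while step 5 goes directly to the NM web port that hosts the container's logs.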
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:48 AM: - This JIRA implements a feature for the addition of a "*Jstack*" button on the ResourceManager Web UI's individual application page accessible by visiting RM Web UI -> Applications -> Click on (So, the breadcrumb would be {{Home / Applications / App [app_id] / Jstack}}) to trigger thread dumps for running YARN containers for a currently running application attempt. The thread dumps are captured as part of the stdout logs for the selected container and displayed as-is by querying the NodeManager node on which this container ran on. As part of this feature, there are 2 panels implemented. The first panel displays two drop-downs, the first one displaying the currently running app attempt id and a "None" option (similar to "Logs" functionality). Once this is selected, it goes on to display another drop-down in the same panel that contains a listing of currently running containers for this application attempt id. Once you select a container id from this second drop-down, another Panel is opened just below (again this is similar to the "Logs" functionality) that shows the selected attempt id and the container as the header with container's stdout logs also being displayed containing the thread dump that was triggered when the container was selected. Following sets of API calls are made: +API calls made when the Jstack button is clicked:+ 1. http://:8088/ws/v1/cluster/apps/ -> Get application info e.g. app state from RM, 2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from RM, e.g. to get the app attempt state to see if it is RUNNING or not ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]). If the application is not RUNNING, then, there will be an error displayed for that based on info from 1. above. 
If the application is RUNNING, then, by checking the application attempts info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only. This is based on the info from 2. above. +API calls made when the app attempt is selected from the drop-down:+ 3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> This is to get the list of running containers for the currently running app attempt from the RM. +API calls made when the container is selected from the drop-down:+ 4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> This is for RM (that eventually calls NM through NM heartbeat) to send a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3 and it generates a thread dump that are captured in the stdout logs of the container. 5. http://:8042/ws/v1/node/containerlogs//stdout -> This is for the NM that is running the selected container to acquire the stdout logs from this running container that contains the thread dump by the above call. was (Author: sahuja): This JIRA implements a feature for the addition of a "*Jstack*" button on the ResourceManager Web UI's individual application page accessible by visiting RM Web UI -> Applications -> Click on (So, the breadcrumb would be {{Home / Applications / App [app_id] / Jstack}}) to trigger thread dumps for running YARN containers for a currently running application attempt. The thread dumps are captured as part of the stdout logs for the selected container and displayed as-is by querying the NodeManager node on which this container ran on. As part of this feature, there are 2 panels implemented. The first panel displays two drop-downs, the first one displaying the currently running app attempt id and a "None" option (similar to "Logs" functionality). 
Once this is selected, it goes on to display another drop-down in the same panel that contains a listing of currently running containers for this application attempt id. Once you select a container id from this second drop-down, another Panel is opened just below (again this is similar to the "Logs" functionality) that shows the selected attempt id and the container as the header with container's stdout logs also being displayed containing the thread dump that was triggered when the container was selected. Following sets of API calls are made: +API calls made when the Jstack button is clicked:+ 1. http://:8088/ws/v1/cluster/apps/ -> Get application info e.g. app state from RM, 2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from RM, e.g. to get the app attempt state to see if it is RUNNING or not
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183823#comment-17183823 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:47 AM: - Testing done on the platform: *+1. Test Jstack collection for non-RUNNING app:+* a. Ensure there is a YARN application that is already present from a previous run and is NOT currently RUNNING. b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the non-running app. Jstack button should be visible. c. Click on Jstack button. Error message should be displayed -> "Jstack cannot be collected for an application that is not running." because it is not possible to collect Jstack for a non-running application as it has no running containers. *+2. Test for Jstack collection for a RUNNING app:+* a. Ensure there is a YARN application that is currently in RUNNING state, b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the running app. Jstack button should be visible. c. Click on Jstack button. A new Jstack panel with a drop-down that has the options - "None" and "" should be shown, d. Select the currently running app attempt from the drop-down. A new drop-down that shows currently running containers for this app attempt should be shown in the drop-down panel, e. Select a container from this drop-down. A new panel with the header that shows the selected container and select attempt-id should be shown along with Stdout logs for this container containing the thread dump from this container. f. Repeat step e. from above for another container. A thread dump should be captured and visible in the panel containing the stdout logs. g. Go back and repeat step e. for the same container that was first selected. Notice that 2 thread dumps are now present in the stdout logs with the latest thread dump shown later in the stdout logs. *+3. 
Error checking - Jstack fetch attempt for a container that is not running due to killed application:+* a. Kill the currently RUNNING application using: yarn application -kill , b. Now try selecting a container from the drop-down containing containers listing. Jstack collection is not possible and hence the error is displayed -> "Jstack fetch failed for container: due to: “Trying to signal an absent container ”. *+4. Error checking - Jstack fetch attempt for a container while RMs/NMs not available:+* a. Ensure there is a YARN application that is currently in RUNNING state, b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the running app. Jstack button should be visible. c. Click on Jstack button. A new Jstack panel with a drop-down that has the options - "None" and "" should be shown, d. Select the currently running app attempt from the drop-down. A new drop-down that shows currently running containers for this app attempt should be shown in the drop-down panel, e. Select a container from this drop-down. A new panel with the header that shows the selected container and select attempt-id should be shown along with Stdout logs for this container containing the thread dump from this container. f. Stop the ResourceManager/s. g. Select a different container from the drop-down list. An error should be displayed -> "Jstack fetch failed for container: due to: “Error: Not able to connect to YARN!”". h. Restart the ResourceManager/s. i. Repeat steps a. until e. j. Stop NodeManager/s. k. Select a different container from the drop-down list. An error should be displayed -> "Logs fetch failed for container: due to: “Error: Not able to connect to YARN!”". l. Start back the NodeManager/s. *+5. Check latest (and the ONLY) running app attempt id is displayed:+* a. Ensure there is a YARN application that is currently in RUNNING state, b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the running app. 
Jstack button should be visible. c. Click on Jstack button. A new Jstack panel with a drop-down that has the options - "None" and "" should be shown, d. Now, run the following command to terminate the currently running AM: yarn container -signal GRACEFUL_SHUTDOWN e. Run the following command to check the currently running app_attempt_id: yarn applicationattempt -list application_1598288770104_0003
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:43 AM:

This JIRA implements a "*Jstack*" button on the ResourceManager Web UI's individual application page, accessible via RM Web UI -> Applications -> Click on the application link (the breadcrumb would be {{Home / Applications / App [app_id] / Jstack}}), to trigger thread dumps for the running YARN containers of a currently running application attempt. The thread dumps are captured in the stdout logs of the selected container and displayed as-is by querying the NodeManager node on which that container ran.

Two panels are implemented as part of this feature. The first panel displays two drop-downs: the first lists the currently running app attempt id along with a "None" option (similar to the "Logs" functionality). Once an attempt is selected, a second drop-down appears in the same panel listing the currently running containers for that application attempt id. Once a container id is selected from this second drop-down, another panel opens just below (again, similar to the "Logs" functionality), showing the selected attempt id and container as the header, along with the container's stdout logs containing the thread dump that was triggered when the container was selected.

The following API calls are made:

+API calls made when the Jstack button is clicked:+
1. http://:8088/ws/v1/cluster/apps/ -> Get application info, e.g. app state, from the RM.
2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from the RM, e.g. the app attempt state, to see whether it is RUNNING or not ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]).
If the application is not RUNNING, an error is displayed based on the info from 1. above. If the application is RUNNING, then, by checking the application attempts info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only, based on the info from 2. above.

+API calls made when the app attempt is selected from the drop-down:+
3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> Get the list of running containers for the currently running app attempt from the RM.

+API calls made when the container is selected from the drop-down:+
4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> The RM (which eventually reaches the NM through the NM heartbeat) sends a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3; it generates a thread dump that is captured in the stdout logs of the container.
5. http://:8042/ws/v1/node/containerlogs//stdout -> The NM running the selected container serves the stdout logs, which contain the thread dump produced by the above call.
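The drop-down population logic walked through above can be sketched in a few lines. This is an illustrative sketch only, not code from this patch: the {{rm-host}} hostname, the {{appAttemptState}} element (the state exposed by YARN-10381) and the exact response wrapper are assumptions based on the ResourceManager REST API documentation.

```python
# Illustrative sketch of the RM REST flow behind the Jstack button.
# "rm-host", the "appAttemptState" element and the response wrapper are
# assumptions, not code from this patch.

RM = "http://rm-host:8088"

def threaddump_endpoints(app_id, attempt_id=None, container_id=None, user=None):
    """Build the REST URLs used at each step of the flow."""
    urls = {
        "app_info": f"{RM}/ws/v1/cluster/apps/{app_id}",              # call 1
        "attempts": f"{RM}/ws/v1/cluster/apps/{app_id}/appattempts",  # call 2
    }
    if attempt_id is not None:
        # call 3: containers of the selected attempt
        urls["containers"] = f"{urls['attempts']}/{attempt_id}/containers"
    if container_id and user:
        # call 4: ask the RM to signal the container (SIGQUIT -> thread dump)
        urls["signal"] = (f"{RM}/ws/v1/cluster/containers/{container_id}"
                          f"/signal/OUTPUT_THREAD_DUMP?user.name={user}")
    return urls

def running_attempt_id(appattempts_response):
    """Return the id of the single RUNNING attempt; there can be several
    attempts overall, but only the RUNNING one is offered in the drop-down."""
    for attempt in appattempts_response["appAttempts"]["appAttempt"]:
        if attempt.get("appAttemptState") == "RUNNING":
            return attempt["id"]
    return None
```

A UI client would issue call 2, feed the payload through something like {{running_attempt_id}} to populate the first drop-down, and then move on to calls 3-5 as selections are made.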
[jira] [Commented] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183823#comment-17183823 ] Siddharth Ahuja commented on YARN-1806:

Testing done on the platform:

1. Test Jstack collection for a non-RUNNING app:
a. Ensure there is a YARN application present from a previous run that is NOT currently RUNNING.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the non-running app. The Jstack button should be visible.
c. Click on the Jstack button. The error message "Jstack cannot be collected for an application that is not running." should be displayed, because a Jstack cannot be collected for a non-running application: it has no running containers.

2. Test Jstack collection for a RUNNING app:
a. Ensure there is a YARN application currently in the RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down listing the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel should open whose header shows the selected container and attempt-id, along with the stdout logs for this container containing its thread dump.
f. Repeat step e. for another container. A thread dump should be captured and visible in the panel containing the stdout logs.
g. Go back and repeat step e. for the container that was first selected. Two thread dumps should now be present in the stdout logs, with the latest thread dump appearing later in the logs.

3. Error checking - Jstack fetch attempt for a container that is not running due to a killed application:
a. Kill the currently RUNNING application using: yarn application -kill ,
b. Now try selecting a container from the containers drop-down. Jstack collection is not possible, hence the error "Jstack fetch failed for container: due to: “Trying to signal an absent container ”" is displayed.

4. Error checking - Jstack fetch attempt for a container while RMs/NMs are not available:
a. Ensure there is a YARN application currently in the RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down listing the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel should open whose header shows the selected container and attempt-id, along with the stdout logs for this container containing its thread dump.
f. Stop the ResourceManager/s.
g. Select a different container from the drop-down list. The error "Jstack fetch failed for container: due to: “Error: Not able to connect to YARN!”" should be displayed.
h. Restart the ResourceManager/s.
i. Repeat steps a. through e.
j. Stop the NodeManager/s.
k. Select a different container from the drop-down list. The error "Logs fetch failed for container: due to: “Error: Not able to connect to YARN!”" should be displayed.
l. Start the NodeManager/s back up.

5. Check that the latest (and ONLY) running app attempt id is displayed:
a. Ensure there is a YARN application currently in the RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Now, run the following command to terminate the currently running AM: yarn container -signal GRACEFUL_SHUTDOWN
e. Run the following command to check the currently running app_attempt_id: yarn applicationattempt -list application_1598288770104_0003 f.
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183919#comment-17183919 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 10:36 AM:

Submitting the initial patch for your review [~akhilpb].

was (Author: sahuja): Submitting the initial patch.

> webUI update to allow end users to request thread dump
> --
>
> Key: YARN-1806
> URL: https://issues.apache.org/jira/browse/YARN-1806
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Ming Ma
> Assignee: Siddharth Ahuja
> Priority: Major
> Attachments: YARN-1806.001.patch
>
> Both the individual container page and the containers page will support this. After
> an end user clicks on the request link, they can follow it to the stdout page
> for the thread dump content.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja edited comment on YARN-1806 at 8/26/20, 1:24 AM:

This JIRA implements a "*Threaddump*" button on the ResourceManager Web UI's individual application page, accessible via RM Web UI -> Applications -> Click on the application link (the breadcrumb would be {{Home / Applications / App [app_id] / Threaddump}}), to trigger thread dumps for the running YARN containers of a currently running application attempt. The thread dumps are captured in the stdout logs of the selected container and displayed as-is by querying the NodeManager node on which that container ran.

Two panels are implemented as part of this feature. The first panel displays two drop-downs: the first lists the currently running app attempt id along with a "None" option (similar to the "Logs" functionality). Once an attempt is selected, a second drop-down appears in the same panel listing the currently running containers for that application attempt id. Once a container id is selected from this second drop-down, another panel opens just below (again, similar to the "Logs" functionality), showing the selected attempt id and container as the header, along with the container's stdout logs containing the thread dump triggered when the container was selected.

The following API calls are made:

+API calls made when the _Threaddump_ button is clicked:+
{code}
1. http://:8088/ws/v1/cluster/apps/ -> Get application info, e.g. app state, from the RM.
2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from the RM, e.g. the app attempt state, to see whether it is RUNNING or not ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]).
{code}
If the application is not RUNNING, an error is displayed based on the info from 1. above. If the application is RUNNING, then, by checking the application attempts info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only, based on the info from 2. above.

+API calls made when the app attempt is selected from the drop-down:+
{code}
3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> Get the list of running containers for the currently running app attempt from the RM.
{code}

+API calls made when the container is selected from the drop-down:+
{code}
4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> The RM (which eventually reaches the NM through the NM heartbeat) sends a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3; it generates a thread dump that is captured in the stdout logs of the container.
5. http://:8042/ws/v1/node/containerlogs//stdout -> The NM running the selected container serves the stdout logs, which contain the thread dump produced by the above call.
{code}
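The OUTPUT_THREAD_DUMP signal command described in call 4 boils down to a kill -3. A minimal local sketch of the same operation follows; the pid lookup is out of scope and hypothetical. On POSIX systems SIGQUIT is signal number 3, and a JVM responds to it by printing a full thread dump to stdout rather than exiting, which is why the dump ends up in the container's stdout log.

```python
# SIGQUIT == kill -3: delivered to a JVM, it triggers a thread dump on
# stdout, which YARN captures in the container's stdout log.
# Sketch only; discovering container_pid is out of scope here.
import os
import signal

def request_thread_dump(container_pid: int) -> None:
    """Deliver SIGQUIT to the container's JVM process, mirroring what the
    NodeManager does for the OUTPUT_THREAD_DUMP signal command."""
    os.kill(container_pid, signal.SIGQUIT)  # equivalent to: kill -3 <pid>
```

Because SIGQUIT is non-fatal to a JVM, the dump can be requested repeatedly; each request appends a fresh dump to the same stdout log, which matches the behaviour observed in the testing steps where two dumps appear after two requests.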
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170462#comment-17170462 ] Siddharth Ahuja edited comment on YARN-10381 at 8/4/20, 12:11 AM:

Thanks [~BilwaST], I've fixed up the tests. Thanks [~prabhujoseph], indeed, the docs need to be updated too, thanks for the reminder. I am working on an update.

was (Author: sahuja): Thanks [~BilwaST], I've fixed up the tests. Thanks [~prabhujoseph], indeed, need to update the docs too, thanks for reminding. I am ready with the update, however, having some compilation failures on trunk probably coming from a different jira so I will wait before the next patch is uploaded.

> Send out application attempt state along with other elements in the
> application attempt object returned from appattempts REST API call
> --
>
> Key: YARN-10381
> URL: https://issues.apache.org/jira/browse/YARN-10381
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn-ui-v2
> Affects Versions: 3.3.0
> Reporter: Siddharth Ahuja
> Assignee: Siddharth Ahuja
> Priority: Minor
> Attachments: YARN-10381.001.patch, YARN-10381.002.patch
>
> The [ApplicationAttempts RM REST API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API]:
> {code}
> http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts
> {code}
> returns a collection of Application Attempt objects, where each application
> attempt object contains elements like id, nodeId, startTime etc.
> This JIRA has been raised to send out the Application Attempt state as well, as part of
> the application attempt information returned from this REST API call.
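From a client's perspective, the value of exposing the attempt state in the appattempts response is that a single call answers "which attempt is RUNNING?". A hedged sketch of such a consumer follows; {{id}} and {{startTime}} are documented elements, while the {{appAttemptState}} element name is an assumption for illustration, so the code tolerates older responses that lack it.

```python
# Sketch of consuming /ws/v1/cluster/apps/{appid}/appattempts once the
# attempt state is included. "appAttemptState" is an assumed element name;
# responses predating this change simply lack it.

def summarize_attempts(payload):
    """Map attempt id -> state from an appattempts REST response."""
    summary = {}
    for attempt in payload.get("appAttempts", {}).get("appAttempt", []):
        summary[attempt["id"]] = attempt.get("appAttemptState", "UNKNOWN")
    return summary
```

Without the new element, a UI such as the Threaddump panel in YARN-1806 would need extra per-attempt calls (or app-level state) to decide which attempt to offer in its drop-down.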
[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170462#comment-17170462 ] Siddharth Ahuja commented on YARN-10381: Thanks [~BilwaST], I've fixed up the tests. Thanks [~prabhujoseph], indeed, need to update the docs too, thanks for reminding. I am ready with the update, however, having some compilation failures on trunk probably coming from a different jira so I will wait before the next patch is uploaded. > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Attachment: YARN-10381.003.patch > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch, > YARN-10381.003.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171200#comment-17171200 ] Siddharth Ahuja commented on YARN-10381: Thanks [~prabhujoseph]! > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10381.001.patch, YARN-10381.002.patch, > YARN-10381.003.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Attachment: YARN-10381.002.patch > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169159#comment-17169159 ] Siddharth Ahuja edited comment on YARN-10381 at 7/31/20, 9:34 PM: -- Before this change, the following REST API call to RM: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596230988596_0001/appattempts?_=1596231029706 {code} produced the following output: {code} 1 1596231023017 0 container_1596230988596_0001_01_01 localhost:8042 localhost:61871 http://localhost:8042/node/containerlogs/container_1596230988596_0001_01_01/sidtheadmin appattempt_1596230988596_0001_01 null {code} Notice above that there is no state element for the application attempt. Update for this jira (my change) involves adding appAttemptState to AppAttemptInfo object. Tested this on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. Thanks to [~prabhujoseph] for the hint. 
was (Author: sahuja): Before this change, the following REST API call to RM: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596230988596_0001/appattempts?_=1596231029706 {code} produced the following output: {code} 1 1596231023017 0 container_1596230988596_0001_01_01 localhost:8042 localhost:61871 http://localhost:8042/node/containerlogs/container_1596230988596_0001_01_01/sidtheadmin appattempt_1596230988596_0001_01 null {code} Notice above that there is no state element for the application attempt. Update for this jira (my change) involves adding appAttemptState to AppAttemptInfo object. Tested this on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. 
> Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169159#comment-17169159 ] Siddharth Ahuja edited comment on YARN-10381 at 7/31/20, 9:34 PM: -- Before this change, the following REST API call to RM: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596230988596_0001/appattempts?_=1596231029706 {code} produced the following output: {code} 1 1596231023017 0 container_1596230988596_0001_01_01 localhost:8042 localhost:61871 http://localhost:8042/node/containerlogs/container_1596230988596_0001_01_01/sidtheadmin appattempt_1596230988596_0001_01 null {code} Notice above that there is no state element for the application attempt. Update for this jira (my change) involves adding appAttemptState to AppAttemptInfo object. Tested this on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. 
was (Author: sahuja): Added appAttemptState to AppAttemptInfo object and tested on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169159#comment-17169159 ] Siddharth Ahuja edited comment on YARN-10381 at 7/31/20, 9:14 PM: -- Added appAttemptState to AppAttemptInfo object and tested on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. was (Author: sahuja): Added appAttemptState to AppAttemptInfo object and tested on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null *RUNNING* {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. 
> Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
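[Editor's note] The REST responses quoted in the comments above lost their XML tags in the mail archive. As a rough illustration of what YARN-10381 exposes, here is a hedged Python sketch that reads the attempt state out of a sample JSON appattempts response; the field names follow the ResourceManager REST API docs, but the sample values and the exact `appAttemptState` element name are assumptions based on the patch description, not captured output:

```python
import json

# Hypothetical appattempts response, modeled on the RM REST API docs;
# "appAttemptState" is the element this JIRA adds. All values are made up.
sample = json.loads("""
{
  "appAttempts": {
    "appAttempt": [
      {
        "id": 1,
        "startTime": 1596229888259,
        "containerId": "container_1596229056065_0002_01_000001",
        "nodeHttpAddress": "localhost:8042",
        "nodeId": "localhost:54250",
        "logsLink": "http://localhost:8042/node/containerlogs/...",
        "appAttemptId": "appattempt_1596229056065_0002_000001",
        "appAttemptState": "RUNNING"
      }
    ]
  }
}
""")

def attempt_states(response):
    """Return {appAttemptId: state} for every attempt in the response."""
    attempts = response["appAttempts"]["appAttempt"]
    # Older RMs without this patch omit the state; surface that explicitly.
    return {a["appAttemptId"]: a.get("appAttemptState", "UNKNOWN")
            for a in attempts}

print(attempt_states(sample))  # -> {'appattempt_1596229056065_0002_000001': 'RUNNING'}
```

In a live cluster the same dictionary would be built from a GET against /ws/v1/cluster/apps/{appid}/appattempts rather than a canned string.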
[jira] [Assigned] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-1806: - Assignee: Siddharth Ahuja > webUI update to allow end users to request thread dump > -- > > Key: YARN-1806 > URL: https://issues.apache.org/jira/browse/YARN-1806 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ming Ma >Assignee: Siddharth Ahuja >Priority: Major > > Both the individual container page and the containers page will support this. After > the end user clicks on the request link, they can follow it to get to the stdout page > for the thread dump content. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
Siddharth Ahuja created YARN-10381: -- Summary: Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call Key: YARN-10381 URL: https://issues.apache.org/jira/browse/YARN-10381 Project: Hadoop YARN Issue Type: Improvement Reporter: Siddharth Ahuja The [ApplicationAttempts RM REST API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] : {code} http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts {code} returns a collection of Application Attempt objects, where each application attempt object contains elements like id, nodeId, startTime etc. This JIRA has been raised to send out Application Attempt state as well as part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10381: -- Assignee: Siddharth Ahuja > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Component/s: yarn-ui-v2 > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Affects Version/s: 3.3.0 > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9454) Add detailed log about list applications command
[ https://issues.apache.org/jira/browse/YARN-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-9454: - Assignee: Siddharth Ahuja > Add detailed log about list applications command > > > Key: YARN-9454 > URL: https://issues.apache.org/jira/browse/YARN-9454 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > > When a user lists YARN applications with the RM admin CLI, we have one audit > log here > (https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L924) > However, a more extensive logging could be added. > This is the call chain, when such a list command got executed (from bottom to > top): > {code:java} > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl#getApplications(java.util.Set, > java.util.EnumSet, > java.util.Set) > ApplicationCLI.listApplications(Set, EnumSet, > Set) (org.apache.hadoop.yarn.client.cli) > ApplicationCLI.run(String[]) (org.apache.hadoop.yarn.client.cli) > {code} > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications: > This is the place that fits perfectly for adding a more detailed log message > about the request or the response (or both). > In my opinion, a trace (or debug) level log would be great at the end of this > method, logging the whole response, so any potential issues with the code can > be troubleshot more easily. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
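[Editor's note] YARN-9454 above concerns Java's ClientRMService, but the guarded debug/trace logging pattern it asks for is language-agnostic. A minimal Python sketch of the idea (all names hypothetical, not the actual Hadoop code): log the whole response only when the level is enabled, so the potentially large message is never formatted in production:

```python
import logging

logger = logging.getLogger("ClientRMService")

def get_applications(applications, states=None):
    """Hypothetical stand-in for ClientRMService#getApplications."""
    response = [app for app in applications
                if states is None or app.get("state") in states]
    # Guarded logging, per the JIRA's suggestion: only build/emit the full
    # response message when debug-level output is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("getApplications response: %s", response)
    return response

apps = [{"id": "application_1", "state": "RUNNING"},
        {"id": "application_2", "state": "FINISHED"}]
print(get_applications(apps, states={"RUNNING"}))
```

The equivalent Java change would use LOG.isDebugEnabled() (or SLF4J parameterized logging) at the end of getApplications.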
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251020#comment-17251020 ] Siddharth Ahuja commented on YARN-10528: Thank you [~snemeth]! Please take your time. > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10528.001.patch, maxAMShare for root.users (parent > queue) has no effect as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue, however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
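[Editor's note] The validation gap described above (the parser accepts maxAMShare under a parent queue) can be sketched as follows. This is an assumed illustration of the missing check, not the actual AllocationFileQueueParser logic, and the queue names are made up; it mirrors how the existing parser already rejects reservation settings on parent queues:

```python
import xml.etree.ElementTree as ET

# Hypothetical fair-scheduler allocation file: root.users is a parent queue
# (it has a child), yet it sets maxAMShare, which only leaf queues honor.
ALLOC = """
<allocations>
  <queue name="users" type="parent">
    <maxAMShare>1.0</maxAMShare>
    <queue name="alice"/>
  </queue>
</allocations>
"""

def find_invalid_max_am_share(alloc_xml):
    """Return names of parent queues that illegally set maxAMShare."""
    root = ET.fromstring(alloc_xml)
    bad = []
    for queue in root.iter("queue"):
        # A queue is a parent if declared type="parent" or if it nests queues.
        is_parent = (queue.get("type") == "parent"
                     or queue.find("queue") is not None)
        if is_parent and queue.find("maxAMShare") is not None:
            bad.append(queue.get("name"))
    return bad

print(find_invalid_max_am_share(ALLOC))  # -> ['users']
```

The fix proposed by the JIRA would make the allocation loader fail fast on such a file instead of silently ignoring the setting.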
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253159#comment-17253159 ] Siddharth Ahuja commented on YARN-10528: Hi [~snemeth], trunk and 3.3 are all good, whereas, test failures coming from 3.2 and 3.1 are not related to my changes. As such, I believe I am good here. Please feel free to review when you get a chance, thanks! > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10528-branch-3.1.001.patch, > YARN-10528-branch-3.2.001.patch, YARN-10528-branch-3.3.001.patch, > YARN-10528.001.patch, YARN-10528.002.patch, maxAMShare for root.users (parent > queue) has no effect as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. 
Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue, however, it still gladly accepts {{maxAMShare}}. This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10545) Improve the readability of diagnostics log in yarn-ui2 web page.
[ https://issues.apache.org/jira/browse/YARN-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10545: -- Assignee: Siddharth Ahuja > Improve the readability of diagnostics log in yarn-ui2 web page. > > > Key: YARN-10545 > URL: https://issues.apache.org/jira/browse/YARN-10545 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Reporter: akiyamaneko >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: Diagnostics shows unreadble.png > > > If the diagnostic log in yarn-ui2 has multiple lines, line breaks and spaces > will not be displayed, which is hard to read. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.1.001.patch
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252632#comment-17252632 ] Siddharth Ahuja commented on YARN-10528: Hey [~snemeth], nice catch! Indeed, if the source code no longer throws the exception even under a test setup designed to trigger it (because maxAMShare is defined inside a parent queue), the tests would still pass incorrectly, since there is no fail(...) in the test logic to verify that the exception was actually thrown. Such tests would fail to catch a bad change to the source code that stops the exception from being raised. I have gone ahead and updated the tests as per your suggestion. Regarding the backport to earlier branches, I will apply the patch to them now and run the JUnits; once they pass, I will upload the patches for the respective branches as well. Thanks again for reviewing!
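The testing concern raised in the comment above, that an expected-exception test passes silently once the exception stops being thrown, can be sketched in plain Java (this is not the actual JUnit test from the patch; the parse method below is a stand-in for the allocation-file parser):

```java
// Stand-in for the allocation-file parser: throws when a parent queue
// defines maxAMShare, mirroring the behavior the real tests verify.
public class ExpectedExceptionPattern {

    static void parse(boolean parentDefinesMaxAMShare) {
        if (parentDefinesMaxAMShare) {
            throw new IllegalArgumentException(
                "maxAMShare is only valid for leaf queues");
        }
    }

    // Returns true only when parse() really threw. The "return false"
    // after the call plays the role of JUnit's fail(...): without such a
    // marker, a regression that stops throwing would pass unnoticed.
    static boolean exceptionWasThrown() {
        try {
            parse(true);
            return false; // reached only if the guard regressed
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(exceptionWasThrown()); // true
    }
}
```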
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.2.001.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528.002.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.3.001.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.3.001.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: (was: YARN-10528-branch-3.3.001.patch)
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528.001.patch
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250694#comment-17250694 ] Siddharth Ahuja commented on YARN-10528: The above failures have nothing to do with my patch; I will wait until the issues with the pre-commit build are fixed for other deliveries here - https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/.
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: (was: YARN-10528.001.patch)
[jira] [Created] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
Siddharth Ahuja created YARN-10528: -- Summary: maxAMShare should only be accepted for leaf queues, not parent queues Key: YARN-10528 URL: https://issues.apache.org/jira/browse/YARN-10528 Project: Hadoop YARN Issue Type: Bug Reporter: Siddharth Ahuja
[jira] [Assigned] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10528: -- Assignee: Siddharth Ahuja
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Description: Based on [Hadoop documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], it is clear that {{maxAMShare}} property can only be used for *leaf queues*. This is similar to the {{reservation}} setting. However, existing code only ensures that the reservation setting is not accepted for "parent" queues (see https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 and https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) but it is missing the checks for {{maxAMShare}}. Due to this, it is currently possible to have an allocation similar to below: {code} 1.0 drf * * 1.0 drf 1.0 drf 1.0 fair {code} where {{maxAMShare}} is 1.0f meaning, it is possible allocate 100% of the queue's resources for Application Masters. Notice above that root.users is a parent queue, however, it still gladly accepts {{maxAMShare}}. This is contrary to the documentation and in fact, it is very misleading because the child queues like root.users. actually do not inherit this setting at all and they still go on and use the default of 0.5 instead of 1.0, see the attached screenshot as an example. was: Based on [Hadoop documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], it is clear that {{maxAMShare}} property can only be used for *leaf queues*. This is similar to the {{reservation}} setting. 
However, existing code only ensures that the reservation setting is not accepted for "parent" queues (see https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 and https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) but it is missing the checks for {{maxAMShare}}. Due to this, it is currently possible to have an allocation similar to below: {code} 1.0 drf * * 1.0 drf 1.0 drf 1.0 fair {code} where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the queue's resources for Application Masters. Notice above that root.users is a parent queue; however, it still gladly accepts {{maxAMShare}}. This is contrary to the documentation and, in fact, it is very misleading because the child queues like root.users. actually do not inherit this setting at all and they still go on and use the default of 0.5 instead of 1.0; see the attached screenshot as an example. > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: maxAMShare for root.users (parent queue) has no effect > as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. 
> However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and >
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: maxAMShare for root.users (parent queue) has no effect as child queue does not inherit it.png > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: maxAMShare for root.users (parent queue) has no effect > as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is currently > possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue; however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
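The {code} allocation snippets in the YARN-10528 description above lost their XML markup in the mail archive, leaving only element values. As a hedged reconstruction (element names follow the documented FairScheduler allocation-file format; the exact queue layout and values are assumptions, not the reporter's original file), the problematic configuration could look like:

{code}
<?xml version="1.0"?>
<allocations>
  <queueMaxAMShareDefault>1.0</queueMaxAMShareDefault>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
  <queue name="users" type="parent">
    <weight>1.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
    <!-- Accepted by the parser today even though root.users is a parent
         queue; child queues do not inherit it and fall back to the 0.5
         default, which is what the attached screenshot shows. -->
    <maxAMShare>1.0</maxAMShare>
  </queue>
</allocations>
{code}

The point of the bug report is that the parser should reject {{maxAMShare}} here, exactly as it already rejects {{reservation}} on parent queues.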
[jira] [Comment Edited] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250153#comment-17250153 ] Siddharth Ahuja edited comment on YARN-10528 at 12/16/20, 7:52 AM: --- I have made the behaviour similar to the {{reservation}} element in code. Performed the following testing on the single node cluster: Have FS XML as follows: {code} 1.0 drf * * 1.0 drf 1.0 drf 0.76 <- root.users is a parent queue with maxAMShare set. This should not be possible. 1.0 drf 1.0 drf 1.0 drf 1.0 drf fair 0.75 {code} Refresh YARN queues and observe the RM logs: {code} % bin/yarn rmadmin -refreshQueues {code} {code} 2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128) at java.lang.Thread.run(Thread.java:748) 2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409) at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120) at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966) {code} Now, update FS XML such that {{maxAMShare}} is not set for root.users but set for a parent queue which is not explicitly tagged as one with "type=parent": {code} 1.0 drf * * 1.0 drf 
1.0
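The behaviour tested above mirrors the existing {{reservation}} check in AllocationFileQueueParser. A hedged sketch of such a guard (method and parameter names are illustrative, not the actual YARN-10528 patch; the real parser throws AllocationConfigurationException, while IllegalArgumentException keeps this sketch self-contained) could look like:

```java
// Illustrative parent-queue guard, modelled on the reservation check
// described above. Not the actual Hadoop code.
public class ParentQueueGuard {
    /**
     * Rejects a leaf-only property (such as maxAMShare) on a parent queue,
     * reproducing the error text seen in the RM logs above.
     */
    public static void checkLeafOnlySetting(String queueName, boolean isParent,
            String settingName, boolean settingPresent) {
        if (isParent && settingPresent) {
            throw new IllegalArgumentException(
                "The configuration settings for " + queueName + " are invalid. "
                + "A queue element that contains child queue elements or that "
                + "has the type='parent' attribute cannot also include a "
                + settingName + " element.");
        }
    }
}
```

Leaf queues such as root.users.alice would pass the guard untouched; only queues with children (or type='parent') trigger the exception during {{rmadmin -refreshQueues}}.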
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528.001.patch > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10528.001.patch, maxAMShare for root.users (parent > queue) has no effect as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue; however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example.
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250153#comment-17250153 ] Siddharth Ahuja commented on YARN-10528: I have made the behaviour similar to the reservation element in code. Performed the following testing on the single node cluster: Have FS XML as follows: {code} 1.0 drf * * 1.0 drf 1.0 drf 0.76 <- root.users is a parent queue with maxAMShare set. This should not be possible. 1.0 drf 1.0 drf 1.0 drf 1.0 drf fair 0.75 {code} Refresh YARN queues and observe the RM logs: {code} % bin/yarn rmadmin -refreshQueues {code} {code} 2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128) at java.lang.Thread.run(Thread.java:748) 2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409) at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120) at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966) {code} Now, update FS XML such that maxAMShare is not set for root.users but set for a parent queue which is not explicitly tagged as one with "type=parent": {code} 1.0 drf * * 1.0 drf 1.0 
drf 1.0 drf
[jira] [Comment Edited] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250153#comment-17250153 ] Siddharth Ahuja edited comment on YARN-10528 at 12/16/20, 7:51 AM: --- I have made the behaviour similar to the {{reservation}} element in code. Performed the following testing on the single node cluster: Have FS XML as follows: {code} 1.0 drf * * 1.0 drf 1.0 drf 0.76 <- root.users is a parent queue with maxAMShare set. This should not be possible. 1.0 drf 1.0 drf 1.0 drf 1.0 drf fair 0.75 {code} Refresh YARN queues and observe the RM logs: {code} % bin/yarn rmadmin -refreshQueues {code} {code} 2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128) at java.lang.Thread.run(Thread.java:748) 2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409) at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120) at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966) {code} Now, update FS XML such that {{maxAMShare}} is not set for root.users but set for a parent queue which is not explicitly tagged as one with "type=parent": {code} 1.0 drf * * 1.0 drf 
1.0
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: (was: YARN-10528.001.patch) > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: maxAMShare for root.users (parent queue) has no effect > as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue; however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example.
[jira] [Commented] (YARN-10552) Eliminate code duplication in SLSCapacityScheduler and SLSFairScheduler
[ https://issues.apache.org/jira/browse/YARN-10552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277991#comment-17277991 ] Siddharth Ahuja commented on YARN-10552: Hey [~snemeth], thanks a lot for the de-duplication here! A few comments from my side: # SLSSchedulerCommons - Can we please explicitly assign a default value for the declared fields like metricsOn etc., and not rely on Java to assign one, as a matter of good programming style. # Class variables - metricsOn & schedulerMetrics could be marked private in SLSSchedulerCommons, and new getters should be defined that could be invoked within the individual scheduler classes instead of referring to them directly from a separate object. # The "Tracker" seems to be common to both schedulers, so we could move the declaration & initialization to the common SLSSchedulerCommons, implement getTracker() here that returns the tracker object and keep getTracker() in the individual schedulers (we have to, thanks to SchedulerWrapper) and just return the tracker by calling schedulerCommons.getTracker(). # //metrics off, //metrics on comments inside handle() in SLSSchedulerCommons don't seem to add much value, so let's just remove them. # appQueueMap was not present in SLSFairScheduler before (it was in SLSCapacityScheduler); however, from https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L163, it seems that the super class of the schedulers - https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java#L159 has this already. As such, do we really need to define a new map as a common map at all in SLSSchedulerCommons, or can we somehow reuse the super class's map? It might need some code updates though. 
# In regards to the above point, considering SLSFairScheduler did not previously have any of the following code in handle() method: {code} AppAttemptRemovedSchedulerEvent appRemoveEvent = (AppAttemptRemovedSchedulerEvent) schedulerEvent; appQueueMap.remove(appRemoveEvent.getApplicationAttemptID()); } else if (schedulerEvent.getType() == SchedulerEventType.APP_ATTEMPT_ADDED && schedulerEvent instanceof AppAttemptAddedSchedulerEvent) { AppAttemptAddedSchedulerEvent appAddEvent = (AppAttemptAddedSchedulerEvent) schedulerEvent; SchedulerApplication app = (SchedulerApplication) scheduler.getSchedulerApplications().get(appAddEvent.getApplicationAttemptId() .getApplicationId()); appQueueMap.put(appAddEvent.getApplicationAttemptId(), app.getQueue() .getQueueName()); {code} Do you think this was a bug that wasn't earlier identified? > Eliminate code duplication in SLSCapacityScheduler and SLSFairScheduler > --- > > Key: YARN-10552 > URL: https://issues.apache.org/jira/browse/YARN-10552 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-10552.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
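Review points 2 and 3 above can be sketched as follows. This is a hedged illustration only: the class and method names follow the patch discussion (SLSSchedulerCommons, Tracker, getTracker()), but the stand-in Tracker type and all signatures are assumptions, not the actual Hadoop SLS code.

```java
// Shared SLS state kept private in the commons class and exposed via getters,
// so the individual schedulers stop referring to fields of a separate object.
public class SLSSchedulerCommonsSketch {
    public static class Tracker { }  // stand-in for the real SLS Tracker type

    private boolean metricsOn = false;              // explicit default (point 1)
    private final Tracker tracker = new Tracker();  // owned here, not per scheduler

    public boolean isMetricsOn() { return metricsOn; }
    public Tracker getTracker() { return tracker; }
}
```

Each scheduler's getTracker() (still required by SchedulerWrapper) would then simply delegate with return schedulerCommons.getTracker();.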
[jira] [Updated] (YARN-10123) Error message around yarn app -stop/start can be improved to highlight that an implementation at framework level is needed for the stop/start functionality to work
[ https://issues.apache.org/jira/browse/YARN-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10123: --- Attachment: YARN-10123.branch-3.2.001.patch > Error message around yarn app -stop/start can be improved to highlight that > an implementation at framework level is needed for the stop/start > functionality to work > --- > > Key: YARN-10123 > URL: https://issues.apache.org/jira/browse/YARN-10123 > Project: Hadoop YARN > Issue Type: Improvement > Components: client, documentation >Affects Versions: 3.2.1 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10123.001.patch, YARN-10123.branch-3.2.001.patch > > > A "stop" on a YARN application fails with the below error: > {code} > # yarn app -stop application_1581294743321_0002 -appTypes SPARK > 20/02/10 06:24:27 INFO client.RMProxy: Connecting to ResourceManager at > c3224-node2.squadron.support.hortonworks.com/172.25.34.128:8050 > 20/02/10 06:24:27 INFO client.AHSProxy: Connecting to Application History > server at c3224-node2.squadron.support.hortonworks.com/172.25.34.128:10200 > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK > at > org.apache.hadoop.yarn.client.api.AppAdminClient.createAppAdminClient(AppAdminClient.java:76) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:579) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:123) > {code} > From > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L76, > it seems that this is because user does not have the setting: > {code} > yarn.application.admin.client.class.SPARK > {code} > set up in their client 
configuration. > However, even if this setting is present, we still need to have an > implementation available for the application type. From my internal > discussions - Jobs don't have a notion of stop / resume functionality at YARN > level. If some apps like Spark need it, it has to be implemented at those > framework's level. > Therefore, the above error message is a bit misleading in that, even if > "yarn.application.admin.client.class.SPARK" is supplied (or for that matter - > yarn.application.admin.client.class.MAPREDUCE), if there is no implementation > actually available underneath to handle the stop/start functionality then, we > will fail again, albeit with a different error here: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L85. > As such, maybe this error message can be potentially improved to say > something like: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK. Please ensure the App admin > client class actually exists within SPARK to handle this functionality. > {code} > or something similar. > Further, documentation around "-stop" and "-start" options will need to be > improved here -> > https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app > as it does not mention anything about having an implementation at the > framework level for the YARN stop/start command to succeed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
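The error-message improvement proposed above can be sketched as a small helper. This is purely illustrative: the class and method names are hypothetical, not the actual AppAdminClient change, and only the message text comes from the issue description.

```java
// Hedged sketch of the clearer message suggested in the issue above: point
// out that a framework-level implementation is required, not just the config.
public class AppAdminClientMessages {
    public static String missingClientClass(String appType) {
        return "App admin client class name not specified for type " + appType
            + ". Please ensure the App admin client class actually exists within "
            + appType + " to handle this functionality.";
    }
}
```

With this, the SPARK example in the description would fail with guidance instead of a bare "class name not specified".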
[jira] [Updated] (YARN-10123) Error message around yarn app -stop/start can be improved to highlight that an implementation at framework level is needed for the stop/start functionality to work
[ https://issues.apache.org/jira/browse/YARN-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10123: --- Attachment: YARN-10123.branch-3.3.001.patch > Error message around yarn app -stop/start can be improved to highlight that > an implementation at framework level is needed for the stop/start > functionality to work > --- > > Key: YARN-10123 > URL: https://issues.apache.org/jira/browse/YARN-10123 > Project: Hadoop YARN > Issue Type: Improvement > Components: client, documentation >Affects Versions: 3.2.1 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10123.001.patch, YARN-10123.branch-3.2.001.patch, > YARN-10123.branch-3.3.001.patch > > > A "stop" on a YARN application fails with the below error: > {code} > # yarn app -stop application_1581294743321_0002 -appTypes SPARK > 20/02/10 06:24:27 INFO client.RMProxy: Connecting to ResourceManager at > c3224-node2.squadron.support.hortonworks.com/172.25.34.128:8050 > 20/02/10 06:24:27 INFO client.AHSProxy: Connecting to Application History > server at c3224-node2.squadron.support.hortonworks.com/172.25.34.128:10200 > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK > at > org.apache.hadoop.yarn.client.api.AppAdminClient.createAppAdminClient(AppAdminClient.java:76) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:579) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:123) > {code} > From > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L76, > it seems that this is because user does not have the setting: > {code} > yarn.application.admin.client.class.SPARK > 
{code} > set up in their client configuration. > However, even if this setting is present, we still need to have an > implementation available for the application type. From my internal > discussions - jobs don't have a notion of stop/resume functionality at the YARN > level. If some apps like Spark need it, it has to be implemented at those > frameworks' level. > Therefore, the above error message is a bit misleading in that, even if > "yarn.application.admin.client.class.SPARK" is supplied (or, for that matter, > yarn.application.admin.client.class.MAPREDUCE), if there is no implementation > actually available underneath to handle the stop/start functionality, then we > will fail again, albeit with a different error here: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L85. > As such, this error message can potentially be improved to say > something like: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK. Please ensure the App admin > client class actually exists within SPARK to handle this functionality. > {code} > or something similar. > Further, documentation around "-stop" and "-start" options will need to be > improved here -> > https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app > as it does not mention anything about needing an implementation at the > framework level for the YARN stop/start command to succeed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
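The two-step failure described above can be illustrated with a client-side configuration sketch. The class name below is a made-up placeholder (no such class ships with Hadoop or Spark), which is precisely the point of the issue: supplying the property alone is not enough without a real AppAdminClient implementation behind it.

```xml
<!-- Hypothetical yarn-site.xml fragment, for illustration only.
     org.example.SparkAppAdminClient is an invented class name: with the
     property absent, createAppAdminClient fails at AppAdminClient.java#L76
     ("class name not specified"); with the property present but the class
     missing from the client classpath, it fails again at #L85 instead. -->
<property>
  <name>yarn.application.admin.client.class.SPARK</name>
  <value>org.example.SparkAppAdminClient</value>
</property>
```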
[jira] [Assigned] (YARN-10770) container-executor permission is wrong in SecureContainer.md
[ https://issues.apache.org/jira/browse/YARN-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10770: -- Assignee: Siddharth Ahuja > container-executor permission is wrong in SecureContainer.md > > > Key: YARN-10770 > URL: https://issues.apache.org/jira/browse/YARN-10770 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Akira Ajisaka >Assignee: Siddharth Ahuja >Priority: Major > Labels: newbie > > {noformat} > The `container-executor` program must be owned by `root` and have the > permission set `---sr-s---`. > {noformat} > It should be 6050 {noformat}---Sr-s---{noformat}
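The corrected permission string can be verified mechanically. The snippet below (illustrative only, using a throwaway temp file rather than the real binary) shows that mode 6050 renders exactly as `---Sr-s---`:

```python
import os
import stat
import tempfile

# Create a throwaway file and give it mode 6050:
# setuid (4000) + setgid (2000) + group r-x (050).
# Owner has no execute bit, so setuid displays as a capital 'S';
# group does have execute, so setgid displays as a lowercase 's';
# "other" gets no access at all.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o6050)

print(stat.filemode(os.stat(path).st_mode))  # -> ---Sr-s---
os.remove(path)

# On a real cluster the equivalent (requires root; path and group
# illustrative) would be:
#   chown root:hadoop $HADOOP_HOME/bin/container-executor
#   chmod 6050 $HADOOP_HOME/bin/container-executor
```

This also makes the original doc's `---sr-s---` visibly wrong: a lowercase `s` in the owner position would imply the owner-execute bit is set, which mode 6050 does not grant.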
[jira] [Updated] (YARN-10770) container-executor permission is wrong in SecureContainer.md
[ https://issues.apache.org/jira/browse/YARN-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10770: --- Attachment: YARN-10770.001.patch
[jira] [Updated] (YARN-10770) container-executor permission is wrong in SecureContainer.md
[ https://issues.apache.org/jira/browse/YARN-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10770: --- Attachment: (was: YARN-10770.001.patch)
[jira] [Updated] (YARN-10839) queueMaxAppsDefault when set blindly caps the root queue's maxRunningApps setting to this value ignoring any individually overridden maxRunningApps setting for child queues
[ https://issues.apache.org/jira/browse/YARN-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10839: --- Component/s: yarn > queueMaxAppsDefault when set blindly caps the root queue's maxRunningApps > setting to this value ignoring any individually overridden maxRunningApps > setting for child queues in FairScheduler > > > Key: YARN-10839 > URL: https://issues.apache.org/jira/browse/YARN-10839 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.5, 3.3.1 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > > [queueMaxAppsDefault|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format] > sets the default running app limit for queues (including the root queue), > which can be overridden by individual child queues through the maxRunningApps > setting. > Consider a simple FairScheduler XML as follows: > {code}
> <?xml version="1.0"?>
> <allocations>
>   <queue name="root">
>     <weight>1.0</weight>
>     <schedulingPolicy>drf</schedulingPolicy>
>     <aclSubmitApps>*</aclSubmitApps>
>     <aclAdministerApps>*</aclAdministerApps>
>     <!-- name of this unnamed child queue assumed to be "default" -->
>     <queue name="default">
>       <weight>1.0</weight>
>       <schedulingPolicy>drf</schedulingPolicy>
>     </queue>
>     <queue name="A">
>       <maxResources>1024000 mb, 1000 vcores</maxResources>
>       <maxRunningApps>15</maxRunningApps>
>       <weight>2.0</weight>
>       <schedulingPolicy>drf</schedulingPolicy>
>     </queue>
>     <queue name="B">
>       <maxResources>512000 mb, 500 vcores</maxResources>
>       <maxRunningApps>10</maxRunningApps>
>       <weight>1.0</weight>
>       <schedulingPolicy>drf</schedulingPolicy>
>     </queue>
>   </queue>
>   <queueMaxAppsDefault>3</queueMaxAppsDefault>
>   <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
> </allocations>
> {code} > Here: > * {{queueMaxAppsDefault}} defaults every queue to 3 {{maxRunningApps}}. > * The root queue does not have any maxRunningApps limit set. > * maxRunningApps for the child queues is 15 for root.A and 10 for root.B. > From the above, if users want to submit jobs to root.B, they are (incorrectly) > capped to 3, not 10, because the root queue (parent) itself is capped to 3 > by the queueMaxAppsDefault setting. > Users thus observe their apps stuck in the ACCEPTED state. > Either the above FairScheduler XML should have been rejected by the > ResourceManager, or the root queue should have been capped to the maximum > maxRunningApps setting defined for a leaf queue. 
> Possible solution -> If the root queue has no maxRunningApps set and > queueMaxAppsDefault is set to a lower value than the maxRunningApps of an > individual leaf queue, then the root queue should implicitly be capped to > the latter, instead of queueMaxAppsDefault.
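The proposed resolution can be sketched in a few lines. `effective_root_max_running_apps` is a hypothetical helper invented for illustration; it is not actual FairScheduler code, just a model of the rule suggested above.

```python
# Hypothetical sketch of the proposed rule (not actual FairScheduler code):
# if the root queue carries no explicit maxRunningApps, its effective cap
# should be the larger of queueMaxAppsDefault and the highest explicit
# maxRunningApps among the leaf queues, so an individually raised leaf
# limit is not silently nullified by the default.
def effective_root_max_running_apps(root_max, queue_max_apps_default, leaf_limits):
    """root_max: explicit root limit or None; leaf_limits: explicit leaf limits."""
    if root_max is not None:
        # An explicit root limit always wins, as today.
        return root_max
    return max([queue_max_apps_default, *leaf_limits])

# With the allocation file from the description (root.A=15, root.B=10,
# queueMaxAppsDefault=3), root would be capped at 15 instead of 3:
print(effective_root_max_running_apps(None, 3, [15, 10]))  # -> 15
```

Under the current behaviour the same inputs yield 3, which is what leaves apps submitted to root.A or root.B stuck in ACCEPTED.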
[jira] [Updated] (YARN-10839) queueMaxAppsDefault when set blindly caps the root queue's maxRunningApps setting to this value ignoring any individually overridden maxRunningApps setting for child queues
[ https://issues.apache.org/jira/browse/YARN-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10839: --- Labels: scheduler (was: )