[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Ahuja updated YARN-10207:
-----------------------------------
    Attachment:     (was: YARN-10063.004.patch)

> CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated
> logs on the JobHistoryServer Web UI
> -----------------------------------------------------------------------------
>
>                 Key: YARN-10207
>                 URL: https://issues.apache.org/jira/browse/YARN-10207
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Siddharth Ahuja
>            Assignee: Siddharth Ahuja
>            Priority: Major
>         Attachments: YARN-10207.001.patch, YARN-10207.002.patch,
>                      YARN-10207.003.patch, YARN-10207.004.patch
>
>
> File descriptor leaks are observed coming from the JobHistoryServer process
> while it tries to render a "corrupted" aggregated log on the JHS Web UI.
> The issue was reproduced using the following steps:
> # Ran a sample Hadoop MR Pi job; it had the id
> application_1582676649923_0026.
> # Copied an aggregated log file from HDFS to the local FS:
> {code}
> hdfs dfs -get /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Updated the TFile metadata at the bottom of this file with some junk to
> corrupt the file:
> *Before:*
> {code}
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP
> {code}
> *After:*
> {code}
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah
> {code}
> Notice "blah" (junk) added at the very end.
> # Removed the existing aggregated log file, which needs to be replaced by
> our modified copy from step 3 (otherwise HDFS will refuse to place a file
> with the same name as one that already exists):
> {code}
> hdfs dfs -rm -r -f /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Uploaded the corrupted aggregated file back to HDFS:
> {code}
> hdfs dfs -put _8041 /tmp/logs/systest/logs/application_1582676649923_0026
> {code}
> # Visited the HistoryServer Web UI.
> # Clicked on job_1582676649923_0026.
> # Clicked on the "logs" link against the AM (assuming the AM ran on nm_hostname).
> # Reviewed the JHS logs; the following exception will be seen:
> {code}
> 2020-03-24 20:03:48,484 ERROR org.apache.hadoop.yarn.webapp.View: Error getting logs for job_1582676649923_0026
> java.io.IOException: Not a valid BCFile.
>         at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
>         at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
>         at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
>         at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:588)
>         at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.TFileAggregatedLogsBlock.render(TFileAggregatedLogsBlock.java:111)
>         at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController.renderAggregatedLogsBlock(LogAggregationTFileController.java:341)
>         at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:117)
>         at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>         at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>         at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>         at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>         at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
>         at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
>         at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
>         at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>         at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
>         at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.logs(HsController.java:202)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at
> {code}
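The corruption step above (appending junk bytes after the TFile trailer) can also be done programmatically. A minimal Java sketch; the file name is illustrative, not from the issue:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CorruptTrailer {
    public static void main(String[] args) throws IOException {
        // Append junk to the end of a local copy of the aggregated log,
        // which corrupts the BCFile/TFile metadata stored in the trailer
        // and makes BCFile$Magic.readAndVerify fail on the next read.
        Path log = Paths.get(args.length > 0 ? args[0] : "aggregated.log");
        Files.write(log, "blah".getBytes(), StandardOpenOption.APPEND);
    }
}
```

After running this against the local copy, `hdfs dfs -put` of the file back into the aggregation directory reproduces the rendering failure described above.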
[jira] [Commented] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074957#comment-17074957 ]

Siddharth Ahuja commented on YARN-10207:
----------------------------------------
Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings (see the attached screenshot). I updated the code slightly so that I fixed up the indentation as per the guidelines and also in some cases prevented the issue altogether. Let me know if this resolves your comment. Thanks again!
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Ahuja updated YARN-10207:
-----------------------------------
    Attachment:     (was: Indentation settings.png)
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074957#comment-17074957 ]

Siddharth Ahuja edited comment on YARN-10207 at 4/3/20, 11:02 PM:
------------------------------------------------------------------
Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings. I updated the code slightly so that I fixed up the indentation as per the guidelines and also in some cases prevented the issue altogether. Let me know if this resolves your comment. Thanks again!

was (Author: sahuja):
Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings (see the attached screenshot). I updated the code slightly so that I fixed up the indentation as per the guidelines and also in some cases prevented the issue altogether. Let me know if this resolves your comment. Thanks again!
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Ahuja updated YARN-10207:
-----------------------------------
    Attachment: YARN-10207.001.patch
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ]

Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:50 AM:
------------------------------------------------------------------
Hi [~adam.antal], thanks for your comments.

The leak happens when AggregatedLogFormat.LogReader is being instantiated; specifically, when the TFile.Reader creation inside AggregatedLogFormat.LogReader's constructor fails because a corrupted file was passed in (see the stack trace above). The FSDataInputStream opened for the reader is never closed, and that is what causes the leak.

The caller, TFileAggregatedLogsBlock.render(...), does try to clean up the reader in its finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153); however, it assumes the reader was created successfully. In our case the reader never gets created, because construction itself fails on the corrupted log.

The fix, therefore, is to catch any IOException inside the AggregatedLogFormat.LogReader constructor itself, close all the relevant entities including the FSDataInputStream, and rethrow the exception to the caller (TFileAggregatedLogsBlock.render) so that it can catch and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures that we don't leak connections wherever the reader fails to instantiate (= new AggregatedLogFormat.LogReader).
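The close-on-constructor-failure pattern described above can be sketched in plain Java. The class and stream names here are simplified stand-ins, not the actual Hadoop types:

```java
import java.io.Closeable;
import java.io.IOException;

/** Stand-in for FSDataInputStream: records whether close() was called. */
class TrackingStream implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}

/** Simplified stand-in for AggregatedLogFormat.LogReader. */
class LogReaderSketch {
    private final TrackingStream in;

    LogReaderSketch(TrackingStream stream, boolean corrupt) throws IOException {
        this.in = stream;
        try {
            // Stands in for "new TFile.Reader(...)", which throws on a
            // corrupted file before the reader is fully constructed.
            if (corrupt) {
                throw new IOException("Not a valid BCFile.");
            }
        } catch (IOException e) {
            // Close the underlying stream before propagating: the caller's
            // finally clause never sees a reader object, so it cannot do
            // this cleanup for us.
            stream.close();
            throw e;
        }
    }
}
```

Before the fix, the stream stayed open whenever construction failed; with the catch block, the descriptor is released even though the caller never obtains a reader to close.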
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml:

{code}
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
  <value>/tmp/ifilelogs</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
  <value>ifilelogs</value>
</property>
{code}

Like the earlier scenario, I corrupted the IFile (aggregated log in HDFS) and tried to render it in the JHS Web UI; however, no leaks were found in this case. This is the call flow: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(...). An IOException is encountered inside that try block, but notice the finally clause here: https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. It cleans up the socket connection by closing the FSDataInputStream. You will notice that this is a different call stack from the TFile case, as there is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently. Regardless, thanks to that finally clause, the connection is cleaned up and there are no CLOSE_WAIT leaks when a corrupted log file is encountered.
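The finally-based cleanup that keeps the IFile path leak-free can be sketched like this, again with simplified stand-in types rather than the real Hadoop classes:

```java
import java.io.Closeable;
import java.io.IOException;

/** Stand-in for FSDataInputStream: records whether close() was called. */
class TrackedInput implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}

class IndexedLogsMetaSketch {
    /**
     * Mirrors the shape of loadIndexedLogsMeta: open a stream, parse it,
     * and close the stream in a finally clause so that a parse failure on
     * a corrupted file cannot leak the descriptor.
     */
    static void loadMeta(TrackedInput in, boolean corrupt) throws IOException {
        try {
            // Stands in for reading and validating the IFile metadata.
            if (corrupt) {
                throw new IOException("invalid log meta");
            }
        } finally {
            in.close();  // runs on both the success and the failure path
        }
    }
}
```

Because the stream is closed in the same method that opened it, this path never depends on the caller for cleanup, which is why no CLOSE_WAIT sockets accumulate here even on corrupted input.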
(One bad thing here is that only a WARN log is presented to the user in the JHS logs when rendering fails for IFile logs, and no stack trace from the exception is logged at https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136, as the exception is simply swallowed inside the catch{} clause. This may warrant a separate JIRA.)

As part of this fix, I looked for any other occurrences of "new TFile.Reader" that might cause connection leaks somewhere else. I found two:
# TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and
# FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691

1 is not an issue because FSDataInputStream is getting closed inside finally{} clause here:
[jira] [Commented] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071621#comment-17071621 ]

Siddharth Ahuja commented on YARN-10207:
----------------------------------------
Fixing up checkstyle warnings as per https://builds.apache.org/job/PreCommit-YARN-Build/25787/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.txt.
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Attachment: YARN-10207.002.patch > CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated > logs on the JobHistoryServer Web UI > - > > Key: YARN-10207 > URL: https://issues.apache.org/jira/browse/YARN-10207 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10207.001.patch, YARN-10207.002.patch > > > File descriptor leaks are observed coming from the JobHistoryServer process > while it tries to render a "corrupted" aggregated log on the JHS Web UI. > Issue reproduced using the following steps: > # Ran a sample Hadoop MR Pi job, it had the id - > application_1582676649923_0026. > # Copied an aggregated log file from HDFS to local FS: > {code} > hdfs dfs -get > /tmp/logs/systest/logs/application_1582676649923_0026/_8041 > {code} > # Updated the TFile metadata at the bottom of this file with some junk to > corrupt the file : > *Before:* > {code} > > ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP > {code} > *After:* > {code} > > ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah > {code} > Notice "blah" (junk) added at the very end. 
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ] Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:36 AM: -- Hi [~adam.antal], thanks for your comments. The leak happens while AggregatedLogFormat.LogReader is being instantiated: when the TFile.Reader created inside the LogReader constructor fails on a corrupted file (see the stack trace above), the FSDataInputStream that was already opened is never closed, and that is what leaks. The caller, TFileAggregatedLogsBlock.render(…), does try to clean up the reader in its finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), but it assumes the reader was created successfully. In our case the reader never gets created, because construction itself fails on the corrupted log. The fix, therefore, is to catch any IOException inside the AggregatedLogFormat.LogReader constructor itself, close all the relevant resources (including the FSDataInputStream), and rethrow the exception to the caller (TFileAggregatedLogsBlock.render) so that it can catch and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures we don't leak connections wherever the reader fails to instantiate (i.e. new AggregatedLogFormat.LogReader).
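The cleanup-on-failed-construction pattern described above can be sketched in plain Java. This is a simplified analogue with hypothetical TrackingStream/LeakFreeReader classes, not the real Hadoop types:

```java
import java.io.Closeable;
import java.io.IOException;

// Stand-in for FSDataInputStream: records whether close() was called.
class TrackingStream implements Closeable {
    boolean closed = false;
    @Override public void close() { closed = true; }
}

// Stand-in for AggregatedLogFormat.LogReader: if construction fails partway,
// close what was already opened before rethrowing, so the caller never leaks
// the stream even though it has no reader reference to clean up.
class LeakFreeReader implements Closeable {
    private final TrackingStream in;

    LeakFreeReader(TrackingStream in, boolean corrupt) throws IOException {
        this.in = in;
        try {
            if (corrupt) {
                // stands in for TFile.Reader failing on a bad BCFile magic
                throw new IOException("Not a valid BCFile.");
            }
        } catch (IOException e) {
            in.close();   // the essential cleanup the original constructor lacked
            throw e;      // rethrow so the caller can catch and log it
        }
    }

    @Override public void close() { in.close(); }
}

public class ReaderLeakDemo {
    public static void main(String[] args) {
        TrackingStream s = new TrackingStream();
        try {
            new LeakFreeReader(s, true);
        } catch (IOException expected) {
            // construction failed, but the stream was still closed
        }
        System.out.println("closed=" + s.closed);
    }
}
```

Running the demo prints `closed=true`: the stream is released even though the reader was never fully constructed.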
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml:
{code}
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
  <value>/tmp/ifilelogs</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
  <value>ifilelogs</value>
</property>
{code}
As in the earlier scenario, I corrupted the IFile (the aggregated log in HDFS) and tried to render it in the JHS Web UI; however, no leaks were found in this case. The call flow is: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…). An IOException is encountered inside that try block, but note the finally clause here -> https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900 - it cleans up the socket connection by closing the FSDataInputStream. Note that this is a different call stack from the TFile case, as there is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently. Regardless, thanks to that finally clause, the connection is cleaned up and there are no CLOSE_WAIT leaks when a corrupted log file is encountered.
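The try/finally idiom that keeps the IFile path leak-free can be sketched as follows. This is a minimal analogue with hypothetical names, not the real LogAggregationIndexedFileController code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class FinallyCloseDemo {
    // Stand-in for FSDataInputStream: records whether close() was called.
    static class TrackedStream extends ByteArrayInputStream {
        boolean closed = false;
        TrackedStream(byte[] buf) { super(buf); }
        @Override public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Analogue of loadIndexedLogsMeta: parsing may throw on a corrupted file,
    // but the finally{} clause closes the stream on every exit path.
    static void loadMeta(TrackedStream in) throws IOException {
        try {
            throw new IOException("corrupted log meta");  // simulated parse failure
        } finally {
            in.close();  // always runs, matching the IFile controller's cleanup
        }
    }

    public static void main(String[] args) {
        TrackedStream in = new TrackedStream(new byte[]{1, 2, 3});
        try {
            loadMeta(in);
        } catch (IOException expected) {
            // the parse failed, but the stream did not leak
        }
        System.out.println("closed=" + in.closed);
    }
}
```

The demo prints `closed=true`, which is why the IFile path shows no CLOSE_WAIT build-up even on corrupted files.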
(One bad thing here is that when rendering fails for IFile logs, only a WARN message is presented to the user in the JHS logs, with no stack trace from the exception, because the exception is simply swallowed inside the catch clause here - https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136. This may warrant a separate JIRA.) As part of this fix, I looked for any other occurrences of "new TFile.Reader" that may cause connection leaks elsewhere. I found two:
# TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and
# FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691
#1 is not an issue because the FSDataInputStream is closed inside the finally clause here:
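If that separate JIRA is filed, the gist of the improvement is simply to preserve the exception's stack trace in the log instead of dropping it. A hypothetical illustration (not the real IndexedFileAggregatedLogsBlock code):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

public class WarnWithStackDemo {
    // Render a WARN-style message that carries the full stack trace,
    // instead of a bare message with the exception swallowed.
    static String warnMessage(String context, Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return "WARN " + context + System.lineSeparator() + sw;
    }

    public static void main(String[] args) {
        IOException cause = new IOException("Not a valid BCFile.");
        String msg = warnMessage("Error getting logs for the job", cause);
        // The exception class and message survive into the log output.
        System.out.println(msg.contains("java.io.IOException"));
    }
}
```

With most logging frameworks the same effect comes from passing the Throwable as the last argument to the warn call rather than formatting it by hand.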
[jira] [Assigned] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-9355: - Assignee: Umesh (was: Siddharth Ahuja)
> RMContainerRequestor#makeRemoteRequest has confusing log message
>
> Key: YARN-9355
> URL: https://issues.apache.org/jira/browse/YARN-9355
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Assignee: Umesh
> Priority: Trivial
> Labels: newbie, newbie++
>
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor#makeRemoteRequest has this log:
> {code:java}
> if (ask.size() > 0 || release.size() > 0) {
>   LOG.info("getResources() for " + applicationId + ":" + " ask="
>       + ask.size() + " release= " + release.size() + " newContainers="
>       + allocateResponse.getAllocatedContainers().size()
>       + " finishedContainers=" + numCompletedContainers
>       + " resourcelimit=" + availableResources + " knownNMs="
>       + clusterNmCount);
> }
> {code}
> The reason "getResources()" is printed is that org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator#getResources invokes makeRemoteRequest. This is not very informative and is error-prone, as the name of getResources could change over time, leaving the log message outdated. Moreover, it is not a good idea for a method to log the name of its caller.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
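One possible direction for the fix, sketched with hypothetical wording (this is not a committed patch): build the log line around the allocation event itself instead of embedding the caller's method name, so renaming getResources() cannot make the message lie.

```java
public class AllocateLogDemo {
    // Hypothetical replacement for the "getResources() for ..." message:
    // describe the allocate round-trip, with no caller method name baked in.
    static String allocateLogLine(String applicationId, int ask, int release,
            int newContainers, int finishedContainers,
            String resourceLimit, int knownNMs) {
        return "Allocation update for " + applicationId + ": ask=" + ask
            + " release=" + release + " newContainers=" + newContainers
            + " finishedContainers=" + finishedContainers
            + " resourcelimit=" + resourceLimit + " knownNMs=" + knownNMs;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(allocateLogLine("application_1582676649923_0026",
            2, 0, 1, 3, "<memory:8192, vCores:8>", 4));
    }
}
```

The message stays accurate no matter which method calls makeRemoteRequest.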
[jira] [Commented] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070557#comment-17070557 ] Siddharth Ahuja commented on YARN-5277: --- Thank you for the tool suggestion [~brahmareddy]! Kindly allow me some time to set this up internally and put out a formal patch, and I will update the JIRA. Thanks again for your kind help.
> when localizers fail due to resource timestamps being out, provide more diagnostics
>
> Key: YARN-5277
> URL: https://issues.apache.org/jira/browse/YARN-5277
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Siddharth Ahuja
> Priority: Major
>
> When an NM fails a resource D/L as the timestamps are wrong, there's not much info, just two long values.
> It would be good to also include the local time values, *and the current wall time*. These are the things people need to know when trying to work out what went wrong.
[jira] [Comment Edited] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070651#comment-17070651 ] Siddharth Ahuja edited comment on YARN-9355 at 3/30/20, 2:50 AM: - Hi [~ykabusalah], I am not sure how this got pulled out from under my name; I would have expected to be checked with first. I have recently started working through the JIRAs created by [~snemeth] at my work, and was going to pick this one up in the near future. If you still want to carry on with it, please go ahead, but in future I would appreciate checking with the assignee before taking over an issue.
[jira] [Commented] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070651#comment-17070651 ] Siddharth Ahuja commented on YARN-9355: --- Hi [~ykabusalah], not sure how this got pulled from under my name. I would have expected a check with me before to do that. I have recently started to get going on the JIRAs that were created by [~snemeth] at my work, as such, was gonna work this one in the near future. If you do still want to carry on with this one, please go ahead but I would appreciate checking it with the assignee before to just nick it in future. > RMContainerRequestor#makeRemoteRequest has confusing log message > > > Key: YARN-9355 > URL: https://issues.apache.org/jira/browse/YARN-9355 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Yousef Abu-Salah >Priority: Trivial > Labels: newbie, newbie++
[jira] [Assigned] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-9355: - Assignee: Siddharth Ahuja (was: Yousef Abu-Salah) > RMContainerRequestor#makeRemoteRequest has confusing log message > > > Key: YARN-9355 > URL: https://issues.apache.org/jira/browse/YARN-9355 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Trivial > Labels: newbie, newbie++
[jira] [Commented] (YARN-9355) RMContainerRequestor#makeRemoteRequest has confusing log message
[ https://issues.apache.org/jira/browse/YARN-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070679#comment-17070679 ] Siddharth Ahuja commented on YARN-9355: --- Thanks [~ykabusalah], no worries! > RMContainerRequestor#makeRemoteRequest has confusing log message > > > Key: YARN-9355 > URL: https://issues.apache.org/jira/browse/YARN-9355 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Yousef Abu-Salah >Priority: Trivial > Labels: newbie, newbie++
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: Issue reproduced using the following steps: # Ran a sample Hadoop MR Pi job, it had the id - application_1582676649923_0026. # Copied an aggregated log file from HDFS to local FS: {code} hdfs dfs -get /tmp/logs/systest/logs/application_1582676649923_0026/_8041 {code} # Updated the TFile metadata at the bottom of this file with some junk to corrupt the file : *Before:* {code} ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP {code} *After:* {code} ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah {code} Notice "blah" (junk) added at the very end. # Remove the existing aggregated log file that will need to be replaced by our modified copy from step 3 (as otherwise HDFS will prevent it from placing the file with the same name as it already exists): {code} hdfs dfs -rm -r -f /tmp/logs/systest/logs/application_1582676649923_0026/_8041 {code} # Upload the corrupted aggregated file back to HDFS: {code} hdfs dfs -put _8041 /tmp/logs/systest/logs/application_1582676649923_0026 {code} # Visit HistoryServer Web UI # Click on job_1582676649923_0026 # Click on "logs" link against the AM (assuming the AM ran on nm_hostname) # Review the JHS logs, following exception will be seen: {code} 2020-03-24 20:03:48,484 ERROR org.apache.hadoop.yarn.webapp.View: Error getting logs for job_1582676649923_0026 java.io.IOException: Not a valid BCFile. 
at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927) at org.apache.hadoop.io.file.tfile.BCFile$Reader.(BCFile.java:628) at org.apache.hadoop.io.file.tfile.TFile$Reader.(TFile.java:804) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.(AggregatedLogFormat.java:588) at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.TFileAggregatedLogsBlock.render(TFileAggregatedLogsBlock.java:111) at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController.renderAggregatedLogsBlock(LogAggregationTFileController.java:341) at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:117) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.logs(HsController.java:202) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:162) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287) at 
com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875) at
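The CLOSE_WAIT leak described in this issue follows a generic Java resource-handling pattern. The sketch below uses simplified stand-in types, not Hadoop's actual reader classes: when a reader's constructor throws on a corrupted header (the "Not a valid BCFile." path above), the stream opened beforehand must still be closed, otherwise its underlying socket lingers in CLOSE_WAIT.

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative stand-in code, not Hadoop's implementation.
public class LeakDemo {

    // Stand-in for an HDFS input stream backed by a socket.
    static class TrackedStream implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Mimics rendering a log: the header check may throw on corruption.
    // The finally block closes the stream on both the success and the
    // failure path, which is the general fix pattern for this class of leak.
    static void renderWithCleanup(TrackedStream in, boolean corrupt)
            throws IOException {
        try {
            if (corrupt) {
                throw new IOException("Not a valid BCFile.");
            }
            // ... read and render the aggregated log ...
        } finally {
            in.close();
        }
    }
}
```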
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: (repeats the reproduction steps and stack trace quoted above)
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: (repeats the reproduction steps and stack trace quoted above)
[jira] [Assigned] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10207: -- Assignee: Siddharth Ahuja > CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated > logs on the JobHistoryServer Web UI > - > > Key: YARN-10207 > URL: https://issues.apache.org/jira/browse/YARN-10207 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major
[jira] [Created] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
Siddharth Ahuja created YARN-10207: -- Summary: CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI Key: YARN-10207 URL: https://issues.apache.org/jira/browse/YARN-10207 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Siddharth Ahuja (description repeats the reproduction steps and stack trace quoted above)
[jira] [Updated] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10207: --- Description: File descriptor leaks are observed coming from the JobHistoryServer process while it tries to render a "corrupted" aggregated log on the JHS Web UI. (The rest of the description repeats the reproduction steps and stack trace quoted above.)
[jira] [Updated] (YARN-9996) Code cleanup in QueueAdminConfigurationMutationACLPolicy
[ https://issues.apache.org/jira/browse/YARN-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-9996: -- Attachment: YARN-9996-branch-3.2.001.patch > Code cleanup in QueueAdminConfigurationMutationACLPolicy > > > Key: YARN-9996 > URL: https://issues.apache.org/jira/browse/YARN-9996 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.2.001.patch, YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.3.001.patch, YARN-9996.001.patch > > > Method 'isMutationAllowed' contains many uses of substring and lastIndexOf. > These could be extracted and simplified. > Some logging could also be added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
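As an illustration only (a hypothetical helper, not the actual patch to QueueAdminConfigurationMutationACLPolicy), the repeated substring/lastIndexOf calls for deriving a parent queue path could be pulled into one named method:

```java
// Hypothetical helper illustrating the kind of extraction the JIRA
// suggests; the real cleanup lives in the YARN ACL policy class.
public class QueuePathUtil {

    // "root.a.b" -> "root.a"; returns null for a path with no parent.
    static String parentQueueOf(String queuePath) {
        int lastDot = queuePath.lastIndexOf('.');
        return lastDot < 0 ? null : queuePath.substring(0, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(parentQueueOf("root.a.b")); // prints root.a
    }
}
```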
[jira] [Commented] (YARN-9996) Code cleanup in QueueAdminConfigurationMutationACLPolicy
[ https://issues.apache.org/jira/browse/YARN-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091002#comment-17091002 ] Siddharth Ahuja commented on YARN-9996: --- Thank you [~snemeth]! > Code cleanup in QueueAdminConfigurationMutationACLPolicy > > > Key: YARN-9996 > URL: https://issues.apache.org/jira/browse/YARN-9996 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.4.0 > > Attachments: YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.2.001.patch, > YARN-9996-branch-3.3.001.patch, YARN-9996.001.patch > > > Method 'isMutationAllowed' contains many uses of substring and lastIndexOf. > These could be extracted and simplified. > Also, some logging could be added as well.
[jira] [Comment Edited] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064176#comment-17064176 ] Siddharth Ahuja edited comment on YARN-10075 at 3/22/20, 9:23 AM: -- Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Removed a commented out line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. was (Author: sahuja): Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. 
Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Cleaned up an unused commented line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. 
> historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. > Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and instantiate the HistoryClientService as follows:
[jira] [Comment Edited] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064176#comment-17064176 ] Siddharth Ahuja edited comment on YARN-10075 at 3/22/20, 9:24 AM: -- Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Removed a commented out line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class as it was near the code that was being cleaned up in 2. was (Author: sahuja): Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. 
Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Removed a commented out line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. 
> historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. > Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and
[jira] [Updated] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10075: --- Attachment: YARN-10075.001.patch > historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. 
> Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and instantiate the HistoryClientService as follows: > {code} > @VisibleForTesting > protected HistoryClientService createHistoryClientService() { > return new HistoryClientService((HistoryContext)jobHistoryService, > this.jhsDTSecretManager); > } > {code}
[jira] [Commented] (YARN-10075) historyContext doesn't need to be a class attribute inside JobHistoryServer
[ https://issues.apache.org/jira/browse/YARN-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064176#comment-17064176 ] Siddharth Ahuja commented on YARN-10075: Just uploaded a patch that does the following: # Removed "protected" attribute - _historyContext_ from JobHistoryServer. Only usage of historyContext in the class was to be passed in as an argument during the instantiation of the HistoryClientService and nothing else. Therefore, it is now cleaned up and the HistoryClientService is now instantiated by casting the jobHistoryService with HistoryContext. # One test class - _TestJHSSecurity_ was found to be abusing this protected attribute during the creation of a jobHistoryServer inside this test class. The historyContext attribute was being referenced directly (bad) inside createHistoryClientService method during creation of the mock job history server. In fact, the only use of implementing this helper method seems to be passing in the "custom" jhsDTSecretManager (JHSDelegationTokenSecretManager) during the creation of the history client service. However, this is not required because jobHistoryServer.init(conf) will result in the same due to the serviceInit() call within JobHistoryServer that will call createHistoryClientService() which will end up using the custom jhsDTSecretManager created just earlier (createJHSSecretManager(...,...) happens before createHistoryClientService()). # Cleaned up an unused commented line - _final JobHistoryServer jobHistoryServer = jhServer;_ from the test class. 
> historyContext doesn't need to be a class attribute inside JobHistoryServer > --- > > Key: YARN-10075 > URL: https://issues.apache.org/jira/browse/YARN-10075 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10075.001.patch > > > "historyContext" class attribute at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > is assigned a cast of another class attribute - "jobHistoryService" - > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131, > however it does not need to be stored separately because it is only ever > used once in the class, and that too as an argument while instantiating the > HistoryClientService class at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L155. 
> Therefore, we could just delete the lines at > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L67 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/JobHistoryServer.java#L131 > completely and instantiate the HistoryClientService as follows: > {code} > @VisibleForTesting > protected HistoryClientService createHistoryClientService() { > return new HistoryClientService((HistoryContext)jobHistoryService, > this.jhsDTSecretManager); > } > {code}
[jira] [Commented] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064473#comment-17064473 ] Siddharth Ahuja commented on YARN-10001: Hi [~snemeth], I have added explanations for methods that have no implementation - _checkVersion, storeVersion_ and that return a null (i.e. methods that do nothing) - _getCurrentVersion, getConfStoreVersion, getLogs, getConfirmedConfHistory._ Kindly let me know if you are ok with the descriptions (+cc [~wilfreds]). > Add explanation of unimplemented methods in InMemoryConfigurationStore > -- > > Key: YARN-10001 > URL: https://issues.apache.org/jira/browse/YARN-10001 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10001.001.patch > >
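As a rough illustration of the kind of explanation being added (the class name, method shapes, and wording below are assumptions for the sketch, not the actual patch text), a deliberately-no-op method in an in-memory store might be documented like this:

```java
public class InMemoryStoreSketch {
    /**
     * An in-memory store keeps no persistent state, so there is no mutation
     * log to return and nothing to replay after a restart.
     *
     * @return null, always: no log is kept by this store.
     */
    public Object getLogs() {
        return null;
    }

    /**
     * Versioning is meaningless for a store whose contents vanish on
     * restart, so no version information is ever recorded.
     *
     * @return null, always: no version information exists.
     */
    public String getConfStoreVersion() {
        return null;
    }
}
```

The point of the change is that the "return null" is a documented design decision rather than an apparent omission.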
[jira] [Comment Edited] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064845#comment-17064845 ] Siddharth Ahuja edited comment on YARN-10001 at 3/23/20, 2:40 PM: -- This was the output from the earlier build: {code} -1 overall | Vote |Subsystem | Runtime | Comment | 0 | reexec | 0m 46s | Docker mode activated. | | || Prechecks | +1 | @author | 0m 0s | The patch does not contain any @author | | || tags. | -1 | test4tests | 0m 0s | The patch doesn't appear to include | | || any new or modified tests. Please | | || justify why no new tests are needed for | | || this patch. Also please list what | | || manual steps were performed to verify | | || this patch. | | || trunk Compile Tests | +1 | mvninstall | 21m 48s | trunk passed | +1 | compile | 0m 45s | trunk passed | +1 | checkstyle | 0m 35s | trunk passed | +1 | mvnsite | 0m 47s | trunk passed | +1 |shadedclient | 15m 31s | branch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 35s | trunk passed | +1 | javadoc | 0m 30s | trunk passed | | || Patch Compile Tests | +1 | mvninstall | 0m 43s | the patch passed | +1 | compile | 0m 38s | the patch passed | +1 | javac | 0m 38s | the patch passed | -0 | checkstyle | 0m 27s | | | || hadoop-yarn-project/hadoop-yarn/hadoop-y | | || arn-server/hadoop-yarn-server-resourcema | | || nager: The patch generated 7 new + 1 | | || unchanged - 0 fixed = 8 total (was 1) | +1 | mvnsite | 0m 41s | the patch passed | +1 | whitespace | 0m 0s | The patch has no whitespace issues. | +1 |shadedclient | 14m 22s | patch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 40s | the patch passed | +1 | javadoc | 0m 26s | the patch passed | | || Other Tests | +1 |unit | 103m 21s | hadoop-yarn-server-resourcemanager in | | || the patch passed. | +1 | asflicense | 0m 25s | The patch does not generate ASF | | || License warnings. 
| | | 164m 49s | {code} Note that the changes for this JIRA are only related to comments for methods, therefore, no new tests were added or modified (they don't need to). was (Author: sahuja): This was the output from the earlier build: {code} -1 overall | Vote |Subsystem | Runtime | Comment | 0 | reexec | 0m 46s | Docker mode activated. | | || Prechecks | +1 | @author | 0m 0s | The patch does not contain any @author | | || tags. | -1 | test4tests | 0m 0s | The patch doesn't appear to include | | || any new or modified tests. Please | | || justify why no new tests are needed for | | || this patch. Also please list what | | || manual steps were performed to verify | | || this patch. | | || trunk Compile Tests | +1 | mvninstall | 21m 48s | trunk passed | +1 | compile | 0m 45s | trunk passed | +1 | checkstyle | 0m 35s | trunk passed | +1 | mvnsite | 0m 47s | trunk passed | +1 |shadedclient | 15m 31s | branch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 35s | trunk passed | +1 | javadoc | 0m 30s | trunk passed | | || Patch Compile Tests | +1 | mvninstall | 0m 43s | the patch passed | +1 | compile | 0m 38s | the patch passed | +1 | javac | 0m 38s | the patch passed | -0 | checkstyle | 0m 27s | | |
[jira] [Commented] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064845#comment-17064845 ] Siddharth Ahuja commented on YARN-10001: This was the output from the earlier build: {code} -1 overall | Vote |Subsystem | Runtime | Comment | 0 | reexec | 0m 46s | Docker mode activated. | | || Prechecks | +1 | @author | 0m 0s | The patch does not contain any @author | | || tags. | -1 | test4tests | 0m 0s | The patch doesn't appear to include | | || any new or modified tests. Please | | || justify why no new tests are needed for | | || this patch. Also please list what | | || manual steps were performed to verify | | || this patch. | | || trunk Compile Tests | +1 | mvninstall | 21m 48s | trunk passed | +1 | compile | 0m 45s | trunk passed | +1 | checkstyle | 0m 35s | trunk passed | +1 | mvnsite | 0m 47s | trunk passed | +1 |shadedclient | 15m 31s | branch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 35s | trunk passed | +1 | javadoc | 0m 30s | trunk passed | | || Patch Compile Tests | +1 | mvninstall | 0m 43s | the patch passed | +1 | compile | 0m 38s | the patch passed | +1 | javac | 0m 38s | the patch passed | -0 | checkstyle | 0m 27s | | | || hadoop-yarn-project/hadoop-yarn/hadoop-y | | || arn-server/hadoop-yarn-server-resourcema | | || nager: The patch generated 7 new + 1 | | || unchanged - 0 fixed = 8 total (was 1) | +1 | mvnsite | 0m 41s | the patch passed | +1 | whitespace | 0m 0s | The patch has no whitespace issues. | +1 |shadedclient | 14m 22s | patch has no errors when building and | | || testing our client artifacts. | +1 |findbugs | 1m 40s | the patch passed | +1 | javadoc | 0m 26s | the patch passed | | || Other Tests | +1 |unit | 103m 21s | hadoop-yarn-server-resourcemanager in | | || the patch passed. | +1 | asflicense | 0m 25s | The patch does not generate ASF | | || License warnings. 
| | | 164m 49s | {code} Note that the changes for this JIRA only add comments, so no new or modified tests are needed. > Add explanation of unimplemented methods in InMemoryConfigurationStore > -- > > Key: YARN-10001 > URL: https://issues.apache.org/jira/browse/YARN-10001 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10001.001.patch, YARN-10001.002.patch > >
[jira] [Commented] (YARN-10001) Add explanation of unimplemented methods in InMemoryConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064836#comment-17064836 ] Siddharth Ahuja commented on YARN-10001: Found checkstyle warnings coming from https://builds.apache.org/job/PreCommit-YARN-Build/25734/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt. Imported the checkstyle rules from https://github.com/apache/hadoop/tree/trunk/hadoop-build-tools/src/main/resources/checkstyle/ into IntelliJ and reproduced the same warnings there, so I should be covered for future patches. Fixed them all and am delivering the new patch now. > Add explanation of unimplemented methods in InMemoryConfigurationStore > -- > > Key: YARN-10001 > URL: https://issues.apache.org/jira/browse/YARN-10001 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10001.001.patch > >
[jira] [Commented] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065269#comment-17065269 ] Siddharth Ahuja commented on YARN-5277: --- Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation in terms of non-test source code. However, I did have a question regarding the Junit code coverage tool -> _Clover_ . I tried to run the following command: {code} mvn test -Pclover {code} but it resulted in the following error: {code} Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users/sidtheadmin/.clover.license'. -> [Help 1] that I tried to run to see if we are already covering the impacted code through Junit testing or not. I used the following command to run it: {code} I could try and supply a clover license through : {code} mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license] {code} as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt, however, I need the clover.license. I somehow found a link where I could get that potentially - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license but as I am not a committer, I don't have the credentials (I get asked for username/password). As such, can you kindly help me with a clover license? I am really interesting in getting this so that I know if we already have an existing test method in the test class that already covers what I am trying to modify and hence, I can just update that method. If it is not covered yet, then, I will have to write up a new junit test for that. Thanks in advance for your kind assistance! 
> when localizers fail due to resource timestamps being out, provide more > diagnostics > --- > > Key: YARN-5277 > URL: https://issues.apache.org/jira/browse/YARN-5277 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Siddharth Ahuja >Priority: Major > > When an NM fails a resource D/L as the timestamps are wrong, there's not much > info, just two long values. > It would be good to also include the local time values, *and the current wall > time*. These are the things people need to know when trying to work out what > went wrong
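A sketch of the diagnostics requested above: alongside the two raw long values, also print them (and the current wall time) as human-readable dates so a clock or timestamp mismatch is obvious at a glance. The method name and message wording here are illustrative only, not the actual patch:

```java
import java.util.Date;

public class TimestampDiagnostics {
    // Builds a diagnostic message that shows each epoch-millis value both
    // raw and as a local date, plus the current wall time for comparison.
    public static String describeMismatch(long expectedTs, long actualTs, long nowTs) {
        return "Resource changed on src filesystem - expected timestamp "
            + expectedTs + " (" + new Date(expectedTs) + "), was "
            + actualTs + " (" + new Date(actualTs) + "); current wall time: "
            + nowTs + " (" + new Date(nowTs) + ")";
    }
}
```

With this shape, an operator can immediately see whether the mismatch is a stale resource or a skewed clock, instead of decoding two bare longs.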
[jira] [Comment Edited] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065269#comment-17065269 ] Siddharth Ahuja edited comment on YARN-5277 at 3/24/20, 2:54 AM: - Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation in terms of non-test source code. However, I did have a question regarding the Junit code coverage tool -> _Clover_ . I tried to run the following command: {code} mvn test -Pclover {code} but it resulted in the following error: {code} Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users//.clover.license'. -> [Help 1] that I tried to run to see if we are already covering the impacted code through Junit testing or not. I used the following command to run it: {code} I could try and supply a clover license through : {code} mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license] {code} as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt, however, I need the clover.license. I somehow found a link where I could get that potentially - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license but as I am not a committer, I don't have the credentials (I get asked for username/password). As such, can you kindly help me with a clover license? I am really interesting in getting this so that I know if we already have an existing test method in the test class that already covers what I am trying to modify and hence, I can just update that method. If it is not covered yet, then, I will have to write up a new junit test for that. Thanks in advance for your kind assistance! was (Author: sahuja): Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation in terms of non-test source code. 
However, I did have a question regarding the Junit code coverage tool -> _Clover_ . I tried to run the following command: {code} mvn test -Pclover {code} but it resulted in the following error: {code} Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users/sidtheadmin/.clover.license'. -> [Help 1] that I tried to run to see if we are already covering the impacted code through Junit testing or not. I used the following command to run it: {code} I could try and supply a clover license through : {code} mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license] {code} as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt, however, I need the clover.license. I somehow found a link where I could get that potentially - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license but as I am not a committer, I don't have the credentials (I get asked for username/password). As such, can you kindly help me with a clover license? I am really interesting in getting this so that I know if we already have an existing test method in the test class that already covers what I am trying to modify and hence, I can just update that method. If it is not covered yet, then, I will have to write up a new junit test for that. Thanks in advance for your kind assistance! > when localizers fail due to resource timestamps being out, provide more > diagnostics > --- > > Key: YARN-5277 > URL: https://issues.apache.org/jira/browse/YARN-5277 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Siddharth Ahuja >Priority: Major > > When an NM fails a resource D/L as the timestamps are wrong, there's not much > info, just two long values. 
> It would be good to also include the local time values, *and the current wall > time*. These are the things people need to know when trying to work out what > went wrong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-5277) when localizers fail due to resource timestamps being out, provide more diagnostics
[ https://issues.apache.org/jira/browse/YARN-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065269#comment-17065269 ] Siddharth Ahuja edited comment on YARN-5277 at 3/24/20, 2:56 AM:
-
Hi [~aajisaka], I am working on this JIRA and have a potential fix/implementation for the non-test source code. However, I have a question regarding the JUnit code coverage tool, _Clover_, which I tried to run to check whether the impacted code is already covered by JUnit tests. I ran:
{code}
mvn test -Pclover
{code}
but it resulted in the following error:
{code}
Failed to execute goal com.atlassian.maven.plugins:maven-clover2-plugin:3.3.0:setup (clover-setup) on project hadoop-main: Failed to load resource as file [/Users//.clover.license]: Could not find resource '/Users//.clover.license'. -> [Help 1]
{code}
I could try to supply a Clover license through:
{code}
mvn test -Pclover [-DcloverLicenseLocation=${user.name}/.clover.license]
{code}
as per https://svn.apache.org/repos/asf/hadoop/common/branches/MR-4327/BUILDING.txt; however, I need the clover.license file. I found a link where I could potentially get it - https://svn.apache.org/repos/private/committers/donated-licenses/clover/2.6.x/clover.license - but as I am not a committer, I don't have the credentials (I get asked for a username/password). As such, could you kindly help me with a Clover license? I am really interested in this so that I can tell whether an existing test method already covers what I am trying to modify, in which case I can simply update that method; if it is not covered yet, I will write a new JUnit test for it. I don't want to be reviewing multiple existing test methods to work out whether something is covered, as that approach is not robust. Thanks in advance for your kind assistance!
[jira] [Commented] (YARN-10000) Code cleanup in FSSchedulerConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114592#comment-17114592 ] Siddharth Ahuja commented on YARN-10000:
Hi [~BilwaST], thanks for checking; however, I intend to work on all of my JIRAs in the near future.
> Code cleanup in FSSchedulerConfigurationStore
> -
>
> Key: YARN-10000
> URL: https://issues.apache.org/jira/browse/YARN-10000
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Assignee: Siddharth Ahuja
> Priority: Minor
>
> Some things could be improved:
> * In initialize: PathFilter can be replaced with lambda
> * initialize is long, could be split into smaller methods
> * In method 'format': for-loop can be replaced with foreach
> * There's a variable with a typo: lastestConfigPath
> * Add explanation of unimplemented methods
> * Abstract Filesystem operations away more
> * Bad logging: Format string is combined with exception logging.
> {code:java}
> LOG.info("Failed to write config version at {}", configVersionFile, e);
> {code}
> * Interestingly phrased log messages like "write temp capacity configuration
> fail" and "write temp capacity configuration successfully, schedulerConfigFile="
> * Method "writeConfigurationToFileSystem" could be private
> * Any other code quality improvements
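The first cleanup items (single-method interface as a lambda, index-based for-loop as a foreach) can be illustrated with a self-contained sketch. This is an analogy, not the actual FSSchedulerConfigurationStore code: it uses java.util.function.Predicate as a stand-in for Hadoop's org.apache.hadoop.fs.PathFilter, and made-up file names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class LambdaCleanupSketch {
    // Before: anonymous implementation of a single-method interface
    // (analogous to Hadoop's PathFilter) plus an index-based for-loop.
    static List<String> filterOld(List<String> paths) {
        Predicate<String> filter = new Predicate<String>() {
            @Override
            public boolean test(String p) {
                return p.endsWith(".xml");
            }
        };
        List<String> out = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            if (filter.test(paths.get(i))) {
                out.add(paths.get(i));
            }
        }
        return out;
    }

    // After: a lambda and an enhanced for-loop; behaviour is identical.
    static List<String> filterNew(List<String> paths) {
        Predicate<String> filter = p -> p.endsWith(".xml");
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            if (filter.test(p)) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("capacity-scheduler.xml", "tmp.bak");
        System.out.println(filterOld(paths)); // [capacity-scheduler.xml]
        System.out.println(filterNew(paths)); // [capacity-scheduler.xml]
    }
}
```

Both shapes compile to the same behaviour; the lambda/foreach form is simply shorter and more idiomatic for modern Java.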
[jira] [Assigned] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10416: -- Assignee: Siddharth Ahuja > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Commented] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189076#comment-17189076 ] Siddharth Ahuja commented on YARN-10416: No tests required as the updates are only javadoc-related. > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > Attachments: YARN-10416.001.patch > > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Updated] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10416: --- Attachment: YARN-10416.001.patch > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > Attachments: YARN-10416.001.patch > > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Updated] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10416: --- Attachment: (was: YARN-10416.001.patch) > Typos in YarnScheduler#allocate method's doc comment > > > Key: YARN-10416 > URL: https://issues.apache.org/jira/browse/YARN-10416 > Project: Hadoop YARN > Issue Type: Bug > Components: docs >Reporter: Wanqiang Ji >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie > Attachments: YARN-10416.001.patch > > > {code:java} > /** > * The main api between the ApplicationMaster and the Scheduler. > * The ApplicationMaster is updating his future resource requirements > * and may release containers he doens't need. > */ > {code} > > doens't correct to doesn't
[jira] [Commented] (YARN-10416) Typos in YarnScheduler#allocate method's doc comment
[ https://issues.apache.org/jira/browse/YARN-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188991#comment-17188991 ] Siddharth Ahuja commented on YARN-10416:
* Fixed up the overall method description,
* Added explanations for the individual params,
* The {{updateRequests}} param's explanation was incorrectly set to the return type:
{code}
updateRequests - @return the Allocation for the application
{code}
Fixed this so that updateRequests has its own explanation and the return type is moved onto its own line.
> Typos in YarnScheduler#allocate method's doc comment
>
> Key: YARN-10416
> URL: https://issues.apache.org/jira/browse/YARN-10416
> Project: Hadoop YARN
> Issue Type: Bug
> Components: docs
> Reporter: Wanqiang Ji
> Assignee: Siddharth Ahuja
> Priority: Minor
> Labels: newbie
>
> {code:java}
> /**
> * The main api between the ApplicationMaster and the Scheduler.
> * The ApplicationMaster is updating his future resource requirements
> * and may release containers he doens't need.
> */
> {code}
> doens't correct to doesn't
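A corrected doc comment along the lines described might read as follows. This is a hypothetical sketch, not the patched Hadoop source: the parameter names and the stub body are made up for illustration; only the typo fix and the @return-on-its-own-line structure come from the comment above.

```java
// Hypothetical sketch of the corrected YarnScheduler#allocate javadoc.
// The stub below is NOT the real Hadoop interface; parameter names are
// illustrative assumptions.
public class AllocateJavadocSketch {

    /**
     * The main api between the ApplicationMaster and the Scheduler.
     * The ApplicationMaster may update its future resource requirements
     * and release containers it doesn't need.
     *
     * @param appAttemptId   the application attempt issuing this allocate call
     * @param ask            new or updated resource requests
     * @param release        containers the ApplicationMaster no longer needs
     * @param updateRequests requests to update already-allocated containers
     * @return the Allocation for the application
     */
    static String allocate(String appAttemptId, String ask, String release,
                           String updateRequests) {
        return "Allocation for " + appAttemptId; // placeholder body
    }

    public static void main(String[] args) {
        System.out.println(allocate("appattempt_0001", "", "", ""));
    }
}
```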
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183823#comment-17183823 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:45 AM:
-
Testing done on the platform:
*+1. Test Jstack collection for a non-RUNNING app:+*
a. Ensure there is a YARN application that is already present from a previous run and is NOT currently RUNNING.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the non-running app. The Jstack button should be visible.
c. Click on the Jstack button. An error message should be displayed -> "Jstack cannot be collected for an application that is not running.", because it is not possible to collect a jstack for a non-running application as it has no running containers.
*+2. Test Jstack collection for a RUNNING app:+*
a. Ensure there is a YARN application that is currently in RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down that shows the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel whose header shows the selected container and attempt id should be shown, along with the stdout logs for this container containing its thread dump.
f. Repeat step e. for another container. A thread dump should be captured and visible in the panel containing the stdout logs.
g. Go back and repeat step e. for the container that was first selected. Notice that two thread dumps are now present in the stdout logs, with the latest thread dump appearing later in the logs.
*+3. Error checking - Jstack fetch attempt for a container that is not running due to a killed application:+*
a. Kill the currently RUNNING application using: yarn application -kill ,
b. Now try selecting a container from the drop-down containing the container listing. Jstack collection is not possible, hence the error is displayed -> "Jstack fetch failed for container: due to: “Trying to signal an absent container ”".
*+4. Error checking - Jstack fetch attempt for a container while RMs/NMs are not available:+*
a. Ensure there is a YARN application that is currently in RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down that shows the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel whose header shows the selected container and attempt id should be shown, along with the stdout logs for this container containing its thread dump.
f. Stop the ResourceManager/s.
g. Select a different container from the drop-down list. An error should be displayed -> "Jstack fetch failed for container: due to: “Error: Not able to connect to YARN!”".
h. Restart the ResourceManager/s.
i. Repeat steps a. to e.
j. Stop the NodeManager/s.
k. Select a different container from the drop-down list. An error should be displayed -> "Logs fetch failed for container: due to: “Error: Not able to connect to YARN!”".
l. Start the NodeManager/s back up.
*+5. Check that the latest (and the ONLY) running app attempt id is displayed:+*
a. Ensure there is a YARN application that is currently in RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Now, run the following command to terminate the currently running AM: yarn container -signal GRACEFUL_SHUTDOWN
e. Run the following command to check the currently running app_attempt_id: yarn applicationattempt -list application_1598288770104_0003
[jira] [Commented] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja commented on YARN-1806:
---
This JIRA implements a feature: the addition of a "Jstack" button on the ResourceManager Web UI's individual application page, accessible by visiting RM Web UI -> Applications -> clicking on the application link (so the breadcrumb would be "Home / Applications / App [app_id] / Jstack"), to trigger thread dumps for the running YARN containers of a currently running application attempt. The thread dumps are captured as part of the stdout logs of the selected container and displayed as-is by querying the NodeManager node on which the container is running.
As part of this feature, two panels are implemented. The first panel displays two drop-downs, the first one showing the currently running app attempt id and a "None" option (similar to the "Logs" functionality). Once an attempt is selected, a second drop-down appears in the same panel, listing the currently running containers for that application attempt id. Once you select a container id from this second drop-down, another panel opens just below (again similar to the "Logs" functionality) that shows the selected attempt id and container in its header, with the container's stdout logs displayed containing the thread dump that was triggered when the container was selected.
The following sets of API calls are made:
API calls made when the Jstack button is clicked:
1. http://:8088/ws/v1/cluster/apps/ -> Get application info, e.g. the app state, from the RM.
2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from the RM, e.g. to check whether the app attempt state is RUNNING ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]). If the application is not RUNNING, an error is displayed based on the info from 1. above. If the application is RUNNING, then by checking the application attempt info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only, based on the info from 2. above.
API calls made when the app attempt is selected from the drop-down:
3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> Get the list of running containers for the currently running app attempt from the RM.
API calls made when the container is selected from the drop-down:
4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> The RM (which eventually reaches the NM through the NM heartbeat) sends a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3, and it generates a thread dump that is captured in the stdout logs of the container.
5. http://:8042/ws/v1/node/containerlogs//stdout -> The NM that is running the selected container serves the stdout logs of this running container, which contain the thread dump produced by the call above.
> webUI update to allow end users to request thread dump
> --
>
> Key: YARN-1806
> URL: https://issues.apache.org/jira/browse/YARN-1806
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Ming Ma
> Assignee: Siddharth Ahuja
> Priority: Major
>
> Both the individual container page and containers page will support this. After
> an end user clicks on the request link, they can follow it to get to the stdout page
> for the thread dump content.
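The call sequence above can be sketched by assembling the endpoint URLs. The hostnames, ports, ids and user name below are placeholder assumptions (the archived text dropped the originals), and no HTTP request is actually issued; only the URL shapes come from the comment above.

```java
// Sketch only: builds the REST endpoint URLs described above. Hostnames,
// ports, ids and the user name are placeholder assumptions; nothing is
// sent over the network.
public class JstackUrlSketch {
    static final String RM = "http://rm-host:8088"; // ResourceManager web address (assumed)
    static final String NM = "http://nm-host:8042"; // NodeManager web address (assumed)

    static String appInfo(String appId) {                               // step 1
        return RM + "/ws/v1/cluster/apps/" + appId;
    }
    static String appAttempts(String appId) {                           // step 2
        return RM + "/ws/v1/cluster/apps/" + appId + "/appattempts";
    }
    static String attemptContainers(String appId, String attemptId) {   // step 3
        return RM + "/ws/v1/cluster/apps/" + appId + "/appattempts/"
                + attemptId + "/containers";
    }
    static String threadDumpSignal(String containerId, String user) {   // step 4
        return RM + "/ws/v1/cluster/containers/" + containerId
                + "/signal/OUTPUT_THREAD_DUMP?user.name=" + user;
    }
    static String stdoutLogs(String containerId) {                      // step 5
        return NM + "/ws/v1/node/containerlogs/" + containerId + "/stdout";
    }

    public static void main(String[] args) {
        System.out.println(threadDumpSignal("container_0001", "systest"));
        System.out.println(stdoutLogs("container_0001"));
    }
}
```

Step 4 goes to the RM (which relays the signal via the NM heartbeat), while step 5 goes directly to the NM web port that hosts the container's logs.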
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:48 AM: - This JIRA implements a feature for the addition of a "*Jstack*" button on the ResourceManager Web UI's individual application page accessible by visiting RM Web UI -> Applications -> Click on (So, the breadcrumb would be {{Home / Applications / App [app_id] / Jstack}}) to trigger thread dumps for running YARN containers for a currently running application attempt. The thread dumps are captured as part of the stdout logs for the selected container and displayed as-is by querying the NodeManager node on which this container ran on. As part of this feature, there are 2 panels implemented. The first panel displays two drop-downs, the first one displaying the currently running app attempt id and a "None" option (similar to "Logs" functionality). Once this is selected, it goes on to display another drop-down in the same panel that contains a listing of currently running containers for this application attempt id. Once you select a container id from this second drop-down, another Panel is opened just below (again this is similar to the "Logs" functionality) that shows the selected attempt id and the container as the header with container's stdout logs also being displayed containing the thread dump that was triggered when the container was selected. Following sets of API calls are made: +API calls made when the Jstack button is clicked:+ 1. http://:8088/ws/v1/cluster/apps/ -> Get application info e.g. app state from RM, 2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from RM, e.g. to get the app attempt state to see if it is RUNNING or not ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]). If the application is not RUNNING, then, there will be an error displayed for that based on info from 1. above. 
If the application is RUNNING, then, by checking the application attempts info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only. This is based on the info from 2. above. +API calls made when the app attempt is selected from the drop-down:+ 3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> This is to get the list of running containers for the currently running app attempt from the RM. +API calls made when the container is selected from the drop-down:+ 4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> This is for RM (that eventually calls NM through NM heartbeat) to send a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3 and it generates a thread dump that are captured in the stdout logs of the container. 5. http://:8042/ws/v1/node/containerlogs//stdout -> This is for the NM that is running the selected container to acquire the stdout logs from this running container that contains the thread dump by the above call. was (Author: sahuja): This JIRA implements a feature for the addition of a "*Jstack*" button on the ResourceManager Web UI's individual application page accessible by visiting RM Web UI -> Applications -> Click on (So, the breadcrumb would be {{Home / Applications / App [app_id] / Jstack}}) to trigger thread dumps for running YARN containers for a currently running application attempt. The thread dumps are captured as part of the stdout logs for the selected container and displayed as-is by querying the NodeManager node on which this container ran on. As part of this feature, there are 2 panels implemented. The first panel displays two drop-downs, the first one displaying the currently running app attempt id and a "None" option (similar to "Logs" functionality). 
Once this is selected, it goes on to display another drop-down in the same panel that contains a listing of currently running containers for this application attempt id. Once you select a container id from this second drop-down, another Panel is opened just below (again this is similar to the "Logs" functionality) that shows the selected attempt id and the container as the header with container's stdout logs also being displayed containing the thread dump that was triggered when the container was selected. Following sets of API calls are made: +API calls made when the Jstack button is clicked:+ 1. http://:8088/ws/v1/cluster/apps/ -> Get application info e.g. app state from RM, 2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from RM, e.g. to get the app attempt state to see if it is RUNNING or not
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183823#comment-17183823 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:47 AM: - Testing done on the platform: *+1. Test Jstack collection for non-RUNNING app:+* a. Ensure there is a YARN application that is already present from a previous run and is NOT currently RUNNING. b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the non-running app. Jstack button should be visible. c. Click on Jstack button. Error message should be displayed -> "Jstack cannot be collected for an application that is not running." because it is not possible to collect Jstack for a non-running application as it has no running containers. *+2. Test for Jstack collection for a RUNNING app:+* a. Ensure there is a YARN application that is currently in RUNNING state, b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the running app. Jstack button should be visible. c. Click on Jstack button. A new Jstack panel with a drop-down that has the options - "None" and "" should be shown, d. Select the currently running app attempt from the drop-down. A new drop-down that shows currently running containers for this app attempt should be shown in the drop-down panel, e. Select a container from this drop-down. A new panel with the header that shows the selected container and select attempt-id should be shown along with Stdout logs for this container containing the thread dump from this container. f. Repeat step e. from above for another container. A thread dump should be captured and visible in the panel containing the stdout logs. g. Go back and repeat step e. for the same container that was first selected. Notice that 2 thread dumps are now present in the stdout logs with the latest thread dump shown later in the stdout logs. *+3. 
Error checking - Jstack fetch attempt for a container that is not running due to killed application:+* a. Kill the currently RUNNING application using: yarn application -kill , b. Now try selecting a container from the drop-down containing containers listing. Jstack collection is not possible and hence the error is displayed -> "Jstack fetch failed for container: due to: “Trying to signal an absent container ”. *+4. Error checking - Jstack fetch attempt for a container while RMs/NMs not available:+* a. Ensure there is a YARN application that is currently in RUNNING state, b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the running app. Jstack button should be visible. c. Click on Jstack button. A new Jstack panel with a drop-down that has the options - "None" and "" should be shown, d. Select the currently running app attempt from the drop-down. A new drop-down that shows currently running containers for this app attempt should be shown in the drop-down panel, e. Select a container from this drop-down. A new panel with the header that shows the selected container and select attempt-id should be shown along with Stdout logs for this container containing the thread dump from this container. f. Stop the ResourceManager/s. g. Select a different container from the drop-down list. An error should be displayed -> "Jstack fetch failed for container: due to: “Error: Not able to connect to YARN!”". h. Restart the ResourceManager/s. i. Repeat steps a. until e. j. Stop NodeManager/s. k. Select a different container from the drop-down list. An error should be displayed -> "Logs fetch failed for container: due to: “Error: Not able to connect to YARN!”". l. Start back the NodeManager/s. *+5. Check latest (and the ONLY) running app attempt id is displayed:+* a. Ensure there is a YARN application that is currently in RUNNING state, b. Visit ResourceManager Web UI -> Applications -> Click on application_id link for the running app. 
Jstack button should be visible. c. Click on Jstack button. A new Jstack panel with a drop-down that has the options - "None" and "" should be shown, d. Now, run the following command to terminate the currently running AM: yarn container -signal GRACEFUL_SHUTDOWN e. Run the following command to check the currently running app_attempt_id: yarn applicationattempt -list application_1598288770104_0003
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 7:43 AM:

This JIRA implements a "*Jstack*" button on the ResourceManager Web UI's individual application page, accessible via RM Web UI -> Applications -> Click on the application link (the breadcrumb would be {{Home / Applications / App [app_id] / Jstack}}), to trigger thread dumps for the running YARN containers of a currently running application attempt. The thread dumps are captured in the stdout logs of the selected container and displayed as-is by querying the NodeManager node on which that container ran.

Two panels are implemented as part of this feature. The first panel displays two drop-downs: the first lists the currently running app attempt id along with a "None" option (similar to the "Logs" functionality). Once an attempt is selected, a second drop-down appears in the same panel listing the currently running containers for that application attempt id. Once a container id is selected from this second drop-down, another panel opens just below (again, similar to the "Logs" functionality), showing the selected attempt id and container as the header, along with the container's stdout logs containing the thread dump that was triggered when the container was selected.

The following API calls are made:

+API calls made when the Jstack button is clicked:+
1. http://:8088/ws/v1/cluster/apps/ -> Get application info, e.g. app state, from the RM.
2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from the RM, e.g. the app attempt state, to see whether it is RUNNING or not ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]).
If the application is not RUNNING, an error is displayed based on the info from 1. above. If the application is RUNNING, then, by checking the application attempts info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only, based on the info from 2. above.

+API calls made when the app attempt is selected from the drop-down:+
3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> Get the list of running containers for the currently running app attempt from the RM.

+API calls made when the container is selected from the drop-down:+
4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> The RM (which eventually reaches the NM through the NM heartbeat) sends a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3; it generates a thread dump that is captured in the stdout logs of the container.
5. http://:8042/ws/v1/node/containerlogs//stdout -> The NM running the selected container serves the stdout logs, which contain the thread dump produced by the above call.
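The drop-down population logic walked through above can be sketched in a few lines. This is an illustrative sketch only, not code from this patch: the {{rm-host}} hostname, the {{appAttemptState}} element (the state exposed by YARN-10381) and the exact response wrapper are assumptions based on the ResourceManager REST API documentation.

```python
# Illustrative sketch of the RM REST flow behind the Jstack button.
# "rm-host", the "appAttemptState" element and the response wrapper are
# assumptions, not code from this patch.

RM = "http://rm-host:8088"

def threaddump_endpoints(app_id, attempt_id=None, container_id=None, user=None):
    """Build the REST URLs used at each step of the flow."""
    urls = {
        "app_info": f"{RM}/ws/v1/cluster/apps/{app_id}",              # call 1
        "attempts": f"{RM}/ws/v1/cluster/apps/{app_id}/appattempts",  # call 2
    }
    if attempt_id is not None:
        # call 3: containers of the selected attempt
        urls["containers"] = f"{urls['attempts']}/{attempt_id}/containers"
    if container_id and user:
        # call 4: ask the RM to signal the container (SIGQUIT -> thread dump)
        urls["signal"] = (f"{RM}/ws/v1/cluster/containers/{container_id}"
                          f"/signal/OUTPUT_THREAD_DUMP?user.name={user}")
    return urls

def running_attempt_id(appattempts_response):
    """Return the id of the single RUNNING attempt; there can be several
    attempts overall, but only the RUNNING one is offered in the drop-down."""
    for attempt in appattempts_response["appAttempts"]["appAttempt"]:
        if attempt.get("appAttemptState") == "RUNNING":
            return attempt["id"]
    return None
```

A UI client would issue call 2, feed the payload through something like {{running_attempt_id}} to populate the first drop-down, and then move on to calls 3-5 as selections are made.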
[jira] [Commented] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183823#comment-17183823 ] Siddharth Ahuja commented on YARN-1806:

Testing done on the platform:

1. Test Jstack collection for a non-RUNNING app:
a. Ensure there is a YARN application present from a previous run that is NOT currently RUNNING.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the non-running app. The Jstack button should be visible.
c. Click on the Jstack button. The error message "Jstack cannot be collected for an application that is not running." should be displayed, because a Jstack cannot be collected for a non-running application: it has no running containers.

2. Test Jstack collection for a RUNNING app:
a. Ensure there is a YARN application currently in the RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down listing the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel should open whose header shows the selected container and attempt-id, along with the stdout logs for this container containing its thread dump.
f. Repeat step e. for another container. A thread dump should be captured and visible in the panel containing the stdout logs.
g. Go back and repeat step e. for the container that was first selected. Two thread dumps should now be present in the stdout logs, with the latest thread dump appearing later in the logs.

3. Error checking - Jstack fetch attempt for a container that is not running due to a killed application:
a. Kill the currently RUNNING application using: yarn application -kill ,
b. Now try selecting a container from the containers drop-down. Jstack collection is not possible, hence the error "Jstack fetch failed for container: due to: “Trying to signal an absent container ”" is displayed.

4. Error checking - Jstack fetch attempt for a container while RMs/NMs are not available:
a. Ensure there is a YARN application currently in the RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Select the currently running app attempt from the drop-down. A new drop-down listing the currently running containers for this app attempt should appear in the panel.
e. Select a container from this drop-down. A new panel should open whose header shows the selected container and attempt-id, along with the stdout logs for this container containing its thread dump.
f. Stop the ResourceManager/s.
g. Select a different container from the drop-down list. The error "Jstack fetch failed for container: due to: “Error: Not able to connect to YARN!”" should be displayed.
h. Restart the ResourceManager/s.
i. Repeat steps a. through e.
j. Stop the NodeManager/s.
k. Select a different container from the drop-down list. The error "Logs fetch failed for container: due to: “Error: Not able to connect to YARN!”" should be displayed.
l. Start the NodeManager/s back up.

5. Check that the latest (and ONLY) running app attempt id is displayed:
a. Ensure there is a YARN application currently in the RUNNING state.
b. Visit ResourceManager Web UI -> Applications -> Click on the application_id link for the running app. The Jstack button should be visible.
c. Click on the Jstack button. A new Jstack panel with a drop-down that has the options "None" and "" should be shown.
d. Now, run the following command to terminate the currently running AM: yarn container -signal GRACEFUL_SHUTDOWN
e. Run the following command to check the currently running app_attempt_id: yarn applicationattempt -list application_1598288770104_0003 f.
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183919#comment-17183919 ] Siddharth Ahuja edited comment on YARN-1806 at 8/25/20, 10:36 AM:

Submitting the initial patch for your review [~akhilpb].

was (Author: sahuja): Submitting the initial patch.

> webUI update to allow end users to request thread dump
> --
>
> Key: YARN-1806
> URL: https://issues.apache.org/jira/browse/YARN-1806
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Ming Ma
> Assignee: Siddharth Ahuja
> Priority: Major
> Attachments: YARN-1806.001.patch
>
> Both the individual container page and the containers page will support this. After
> an end user clicks on the request link, they can follow it to the stdout page
> for the thread dump content.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183819#comment-17183819 ] Siddharth Ahuja edited comment on YARN-1806 at 8/26/20, 1:24 AM:

This JIRA implements a "*Threaddump*" button on the ResourceManager Web UI's individual application page, accessible via RM Web UI -> Applications -> Click on the application link (the breadcrumb would be {{Home / Applications / App [app_id] / Threaddump}}), to trigger thread dumps for the running YARN containers of a currently running application attempt. The thread dumps are captured in the stdout logs of the selected container and displayed as-is by querying the NodeManager node on which that container ran.

Two panels are implemented as part of this feature. The first panel displays two drop-downs: the first lists the currently running app attempt id along with a "None" option (similar to the "Logs" functionality). Once an attempt is selected, a second drop-down appears in the same panel listing the currently running containers for that application attempt id. Once a container id is selected from this second drop-down, another panel opens just below (again, similar to the "Logs" functionality), showing the selected attempt id and container as the header, along with the container's stdout logs containing the thread dump triggered when the container was selected.

The following API calls are made:

+API calls made when the _Threaddump_ button is clicked:+
{code}
1. http://:8088/ws/v1/cluster/apps/ -> Get application info, e.g. app state, from the RM.
2. http://:8088/ws/v1/cluster/apps//appattempts -> Get application attempt info from the RM, e.g. the app attempt state, to see whether it is RUNNING or not ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]).
{code}
If the application is not RUNNING, an error is displayed based on the info from 1. above. If the application is RUNNING, then, by checking the application attempts info for this app (there can be more than one app attempt), we display the application attempt id for the RUNNING attempt only, based on the info from 2. above.

+API calls made when the app attempt is selected from the drop-down:+
{code}
3. http://:8088/ws/v1/cluster/apps//appattempts//containers -> Get the list of running containers for the currently running app attempt from the RM.
{code}

+API calls made when the container is selected from the drop-down:+
{code}
4. http://:8088/ws/v1/cluster/containers//signal/OUTPUT_THREAD_DUMP?user.name= -> The RM (which eventually reaches the NM through the NM heartbeat) sends a SIGQUIT signal to the container process for the selected container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This is essentially a kill -3; it generates a thread dump that is captured in the stdout logs of the container.
5. http://:8042/ws/v1/node/containerlogs//stdout -> The NM running the selected container serves the stdout logs, which contain the thread dump produced by the above call.
{code}
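The OUTPUT_THREAD_DUMP signal command described in call 4 boils down to a kill -3. A minimal local sketch of the same operation follows; the pid lookup is out of scope and hypothetical. On POSIX systems SIGQUIT is signal number 3, and a JVM responds to it by printing a full thread dump to stdout rather than exiting, which is why the dump ends up in the container's stdout log.

```python
# SIGQUIT == kill -3: delivered to a JVM, it triggers a thread dump on
# stdout, which YARN captures in the container's stdout log.
# Sketch only; discovering container_pid is out of scope here.
import os
import signal

def request_thread_dump(container_pid: int) -> None:
    """Deliver SIGQUIT to the container's JVM process, mirroring what the
    NodeManager does for the OUTPUT_THREAD_DUMP signal command."""
    os.kill(container_pid, signal.SIGQUIT)  # equivalent to: kill -3 <pid>
```

Because SIGQUIT is non-fatal to a JVM, the dump can be requested repeatedly; each request appends a fresh dump to the same stdout log, which matches the behaviour observed in the testing steps where two dumps appear after two requests.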
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170462#comment-17170462 ] Siddharth Ahuja edited comment on YARN-10381 at 8/4/20, 12:11 AM:

Thanks [~BilwaST], I've fixed up the tests. Thanks [~prabhujoseph], indeed, the docs need to be updated too, thanks for the reminder. I am working on an update.

was (Author: sahuja): Thanks [~BilwaST], I've fixed up the tests. Thanks [~prabhujoseph], indeed, need to update the docs too, thanks for reminding. I am ready with the update, however, having some compilation failures on trunk probably coming from a different jira so I will wait before the next patch is uploaded.

> Send out application attempt state along with other elements in the
> application attempt object returned from appattempts REST API call
> --
>
> Key: YARN-10381
> URL: https://issues.apache.org/jira/browse/YARN-10381
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn-ui-v2
> Affects Versions: 3.3.0
> Reporter: Siddharth Ahuja
> Assignee: Siddharth Ahuja
> Priority: Minor
> Attachments: YARN-10381.001.patch, YARN-10381.002.patch
>
> The [ApplicationAttempts RM REST API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API]:
> {code}
> http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts
> {code}
> returns a collection of Application Attempt objects, where each application
> attempt object contains elements like id, nodeId, startTime etc.
> This JIRA has been raised to send out the Application Attempt state as well, as part of
> the application attempt information returned from this REST API call.
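From a client's perspective, the value of exposing the attempt state in the appattempts response is that a single call answers "which attempt is RUNNING?". A hedged sketch of such a consumer follows; {{id}} and {{startTime}} are documented elements, while the {{appAttemptState}} element name is an assumption for illustration, so the code tolerates older responses that lack it.

```python
# Sketch of consuming /ws/v1/cluster/apps/{appid}/appattempts once the
# attempt state is included. "appAttemptState" is an assumed element name;
# responses predating this change simply lack it.

def summarize_attempts(payload):
    """Map attempt id -> state from an appattempts REST response."""
    summary = {}
    for attempt in payload.get("appAttempts", {}).get("appAttempt", []):
        summary[attempt["id"]] = attempt.get("appAttemptState", "UNKNOWN")
    return summary
```

Without the new element, a UI such as the Threaddump panel in YARN-1806 would need extra per-attempt calls (or app-level state) to decide which attempt to offer in its drop-down.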
[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170462#comment-17170462 ] Siddharth Ahuja commented on YARN-10381: Thanks [~BilwaST], I've fixed up the tests. Thanks [~prabhujoseph], indeed, need to update the docs too, thanks for reminding. I am ready with the update, however, having some compilation failures on trunk probably coming from a different jira so I will wait before the next patch is uploaded. > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Attachment: YARN-10381.003.patch > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch, > YARN-10381.003.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171200#comment-17171200 ] Siddharth Ahuja commented on YARN-10381: Thanks [~prabhujoseph]! > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10381.001.patch, YARN-10381.002.patch, > YARN-10381.003.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Attachment: YARN-10381.002.patch > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169159#comment-17169159 ] Siddharth Ahuja edited comment on YARN-10381 at 7/31/20, 9:34 PM: -- Before this change, the following REST API call to RM: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596230988596_0001/appattempts?_=1596231029706 {code} produced the following output: {code} 1 1596231023017 0 container_1596230988596_0001_01_01 localhost:8042 localhost:61871 http://localhost:8042/node/containerlogs/container_1596230988596_0001_01_01/sidtheadmin appattempt_1596230988596_0001_01 null {code} Notice above that there is no state element for the application attempt. Update for this jira (my change) involves adding appAttemptState to AppAttemptInfo object. Tested this on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. Thanks to [~prabhujoseph] for the hint. 
was (Author: sahuja): Before this change, the following REST API call to RM: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596230988596_0001/appattempts?_=1596231029706 {code} produced the following output: {code} 1 1596231023017 0 container_1596230988596_0001_01_01 localhost:8042 localhost:61871 http://localhost:8042/node/containerlogs/container_1596230988596_0001_01_01/sidtheadmin appattempt_1596230988596_0001_01 null {code} Notice above that there is no state element for the application attempt. Update for this jira (my change) involves adding appAttemptState to AppAttemptInfo object. Tested this on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. 
> Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169159#comment-17169159 ] Siddharth Ahuja edited comment on YARN-10381 at 7/31/20, 9:34 PM: -- Before this change, the following REST API call to RM: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596230988596_0001/appattempts?_=1596231029706 {code} produced the following output: {code} 1 1596231023017 0 container_1596230988596_0001_01_01 localhost:8042 localhost:61871 http://localhost:8042/node/containerlogs/container_1596230988596_0001_01_01/sidtheadmin appattempt_1596230988596_0001_01 null {code} Notice above that there is no state element for the application attempt. Update for this jira (my change) involves adding appAttemptState to AppAttemptInfo object. Tested this on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. 
was (Author: sahuja): Added appAttemptState to AppAttemptInfo object and tested on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169159#comment-17169159 ] Siddharth Ahuja edited comment on YARN-10381 at 7/31/20, 9:14 PM: -- Added appAttemptState to AppAttemptInfo object and tested on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null RUNNING {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. was (Author: sahuja): Added appAttemptState to AppAttemptInfo object and tested on single node cluster by visiting http://localhost:8088/ui2 and inspecting the REST API call: {code} http://localhost:8088/ws/v1/cluster/apps/application_1596229056065_0002/appattempts?_=1596229900909 {code} in browser: {code} 1 1596229888259 0 container_1596229056065_0002_01_01 localhost:8042 localhost:54250 http://localhost:8042/node/containerlogs/container_1596229056065_0002_01_01/sidtheadmin appattempt_1596229056065_0002_01 null *RUNNING* {code} It can be seen from above that the response contains appAttemptState which is RUNNING for a currently running attempt. I did not find any specific tests for any attributes e.g. logsLink etc. Considering this is just a minor update, not sure if any junit testing is required. 
> Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
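[Editor's note] The REST responses quoted in the comments above lost their XML tags in the mail archive. As a rough illustration of what YARN-10381 exposes, here is a hedged Python sketch that reads the attempt state out of a sample JSON appattempts response; the field names follow the ResourceManager REST API docs, but the sample values and the exact `appAttemptState` element name are assumptions based on the patch description, not captured output:

```python
import json

# Hypothetical appattempts response, modeled on the RM REST API docs;
# "appAttemptState" is the element this JIRA adds. All values are made up.
sample = json.loads("""
{
  "appAttempts": {
    "appAttempt": [
      {
        "id": 1,
        "startTime": 1596229888259,
        "containerId": "container_1596229056065_0002_01_000001",
        "nodeHttpAddress": "localhost:8042",
        "nodeId": "localhost:54250",
        "logsLink": "http://localhost:8042/node/containerlogs/...",
        "appAttemptId": "appattempt_1596229056065_0002_000001",
        "appAttemptState": "RUNNING"
      }
    ]
  }
}
""")

def attempt_states(response):
    """Return {appAttemptId: state} for every attempt in the response."""
    attempts = response["appAttempts"]["appAttempt"]
    # Older RMs without this patch omit the state; surface that explicitly.
    return {a["appAttemptId"]: a.get("appAttemptState", "UNKNOWN")
            for a in attempts}

print(attempt_states(sample))  # -> {'appattempt_1596229056065_0002_000001': 'RUNNING'}
```

In a live cluster the same dictionary would be built from a GET against /ws/v1/cluster/apps/{appid}/appattempts rather than a canned string.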
[jira] [Assigned] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-1806: - Assignee: Siddharth Ahuja > webUI update to allow end users to request thread dump > -- > > Key: YARN-1806 > URL: https://issues.apache.org/jira/browse/YARN-1806 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ming Ma >Assignee: Siddharth Ahuja >Priority: Major > > Both the individual container page and the containers page will support this. After > the end user clicks on the request link, they can follow it to get to the stdout page > for the thread dump content. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
Siddharth Ahuja created YARN-10381: -- Summary: Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call Key: YARN-10381 URL: https://issues.apache.org/jira/browse/YARN-10381 Project: Hadoop YARN Issue Type: Improvement Reporter: Siddharth Ahuja The [ApplicationAttempts RM REST API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] : {code} http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts {code} returns a collection of Application Attempt objects, where each application attempt object contains elements like id, nodeId, startTime etc. This JIRA has been raised to send out Application Attempt state as well as part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10381: -- Assignee: Siddharth Ahuja > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Component/s: yarn-ui-v2 > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10381: --- Affects Version/s: 3.3.0 > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9454) Add detailed log about list applications command
[ https://issues.apache.org/jira/browse/YARN-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-9454: - Assignee: Siddharth Ahuja > Add detailed log about list applications command > > > Key: YARN-9454 > URL: https://issues.apache.org/jira/browse/YARN-9454 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Major > > When a user lists YARN applications with the RM admin CLI, we have one audit > log here > (https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L924) > However, a more extensive logging could be added. > This is the call chain, when such a list command got executed (from bottom to > top): > {code:java} > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl#getApplications(java.util.Set, > java.util.EnumSet, > java.util.Set) > ApplicationCLI.listApplications(Set, EnumSet, > Set) (org.apache.hadoop.yarn.client.cli) > ApplicationCLI.run(String[]) (org.apache.hadoop.yarn.client.cli) > {code} > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications: > This is the place that fits perfectly for adding a more detailed log message > about the request or the response (or both). > In my opinion, a trace (or debug) level log would be great at the end of this > method, logging the whole response, so any potential issues with the code can > be troubleshot more easily. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
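[Editor's note] YARN-9454 above concerns Java's ClientRMService, but the guarded debug/trace logging pattern it asks for is language-agnostic. A minimal Python sketch of the idea (all names hypothetical, not the actual Hadoop code): log the whole response only when the level is enabled, so the potentially large message is never formatted in production:

```python
import logging

logger = logging.getLogger("ClientRMService")

def get_applications(applications, states=None):
    """Hypothetical stand-in for ClientRMService#getApplications."""
    response = [app for app in applications
                if states is None or app.get("state") in states]
    # Guarded logging, per the JIRA's suggestion: only build/emit the full
    # response message when debug-level output is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("getApplications response: %s", response)
    return response

apps = [{"id": "application_1", "state": "RUNNING"},
        {"id": "application_2", "state": "FINISHED"}]
print(get_applications(apps, states={"RUNNING"}))
```

The equivalent Java change would use LOG.isDebugEnabled() (or SLF4J parameterized logging) at the end of getApplications.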
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251020#comment-17251020 ] Siddharth Ahuja commented on YARN-10528: Thank you [~snemeth]! Please take your time. > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10528.001.patch, maxAMShare for root.users (parent > queue) has no effect as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue, however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
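[Editor's note] The validation gap described above (the parser accepts maxAMShare under a parent queue) can be sketched as follows. This is an assumed illustration of the missing check, not the actual AllocationFileQueueParser logic, and the queue names are made up; it mirrors how the existing parser already rejects reservation settings on parent queues:

```python
import xml.etree.ElementTree as ET

# Hypothetical fair-scheduler allocation file: root.users is a parent queue
# (it has a child), yet it sets maxAMShare, which only leaf queues honor.
ALLOC = """
<allocations>
  <queue name="users" type="parent">
    <maxAMShare>1.0</maxAMShare>
    <queue name="alice"/>
  </queue>
</allocations>
"""

def find_invalid_max_am_share(alloc_xml):
    """Return names of parent queues that illegally set maxAMShare."""
    root = ET.fromstring(alloc_xml)
    bad = []
    for queue in root.iter("queue"):
        # A queue is a parent if declared type="parent" or if it nests queues.
        is_parent = (queue.get("type") == "parent"
                     or queue.find("queue") is not None)
        if is_parent and queue.find("maxAMShare") is not None:
            bad.append(queue.get("name"))
    return bad

print(find_invalid_max_am_share(ALLOC))  # -> ['users']
```

The fix proposed by the JIRA would make the allocation loader fail fast on such a file instead of silently ignoring the setting.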
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253159#comment-17253159 ] Siddharth Ahuja commented on YARN-10528: Hi [~snemeth], trunk and 3.3 are all good, whereas, test failures coming from 3.2 and 3.1 are not related to my changes. As such, I believe I am good here. Please feel free to review when you get a chance, thanks! > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10528-branch-3.1.001.patch, > YARN-10528-branch-3.2.001.patch, YARN-10528-branch-3.3.001.patch, > YARN-10528.001.patch, YARN-10528.002.patch, maxAMShare for root.users (parent > queue) has no effect as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. 
Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue, however, it still gladly accepts {{maxAMShare}}. This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10545) Improve the readability of diagnostics log in yarn-ui2 web page.
[ https://issues.apache.org/jira/browse/YARN-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10545: -- Assignee: Siddharth Ahuja > Improve the readability of diagnostics log in yarn-ui2 web page. > > > Key: YARN-10545 > URL: https://issues.apache.org/jira/browse/YARN-10545 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Reporter: akiyamaneko >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: Diagnostics shows unreadble.png > > > If the diagnostic log in yarn-ui2 has multiple lines, line breaks and spaces > will not be displayed, which is hard to read. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.1.001.patch
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252632#comment-17252632 ] Siddharth Ahuja commented on YARN-10528: Hey [~snemeth], nice catch! Indeed, if the source code no longer throws the exception even under a test setup designed to trigger it (because maxAMShare is defined inside a parent queue), the tests would still pass incorrectly, since there is no fail(...) in the test logic to verify that the exception was actually thrown. Such tests would fail to catch a bad change to the source code that stops the exception from being raised. I have gone ahead and updated the tests as per your suggestion. Regarding the backport to earlier branches, I will apply the patch to them now and run the JUnits; once they pass, I will upload the patches for the respective branches as well. Thanks again for reviewing!
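The testing concern raised in the comment above, that an expected-exception test passes silently once the exception stops being thrown, can be sketched in plain Java (this is not the actual JUnit test from the patch; the parse method below is a stand-in for the allocation-file parser):

```java
// Stand-in for the allocation-file parser: throws when a parent queue
// defines maxAMShare, mirroring the behavior the real tests verify.
public class ExpectedExceptionPattern {

    static void parse(boolean parentDefinesMaxAMShare) {
        if (parentDefinesMaxAMShare) {
            throw new IllegalArgumentException(
                "maxAMShare is only valid for leaf queues");
        }
    }

    // Returns true only when parse() really threw. The "return false"
    // after the call plays the role of JUnit's fail(...): without such a
    // marker, a regression that stops throwing would pass unnoticed.
    static boolean exceptionWasThrown() {
        try {
            parse(true);
            return false; // reached only if the guard regressed
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(exceptionWasThrown()); // true
    }
}
```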
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.2.001.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528.002.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.3.001.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528-branch-3.3.001.patch
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: (was: YARN-10528-branch-3.3.001.patch)
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528.001.patch
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250694#comment-17250694 ] Siddharth Ahuja commented on YARN-10528: The above failures have nothing to do with my patch; I will wait until the issues with the pre-commit build are fixed for other deliveries here - https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/.
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: (was: YARN-10528.001.patch)
[jira] [Created] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
Siddharth Ahuja created YARN-10528: -- Summary: maxAMShare should only be accepted for leaf queues, not parent queues Key: YARN-10528 URL: https://issues.apache.org/jira/browse/YARN-10528 Project: Hadoop YARN Issue Type: Bug Reporter: Siddharth Ahuja
[jira] [Assigned] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10528: -- Assignee: Siddharth Ahuja
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Description: Based on [Hadoop documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], it is clear that {{maxAMShare}} property can only be used for *leaf queues*. This is similar to the {{reservation}} setting. However, existing code only ensures that the reservation setting is not accepted for "parent" queues (see https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 and https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) but it is missing the checks for {{maxAMShare}}. Due to this, it is currently possible to have an allocation similar to below: {code} 1.0 drf * * 1.0 drf 1.0 drf 1.0 fair {code} where {{maxAMShare}} is 1.0f meaning, it is possible allocate 100% of the queue's resources for Application Masters. Notice above that root.users is a parent queue, however, it still gladly accepts {{maxAMShare}}. This is contrary to the documentation and in fact, it is very misleading because the child queues like root.users. actually do not inherit this setting at all and they still go on and use the default of 0.5 instead of 1.0, see the attached screenshot as an example. was: Based on [Hadoop documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], it is clear that {{maxAMShare}} property can only be used for *leaf queues*. This is similar to the {{reservation}} setting. 
However, existing code only ensures that the reservation setting is not accepted for "parent" queues (see https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 and https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) but it is missing the checks for {{maxAMShare}}. Due to this, it is currently possible to have an allocation similar to below: {code} 1.0 drf * * 1.0 drf 1.0 drf 1.0 fair {code} where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the queue's resources for Application Masters. Notice above that root.users is a parent queue; however, it still gladly accepts {{maxAMShare}}. This is contrary to the documentation and, in fact, it is very misleading because the child queues like root.users. actually do not inherit this setting at all and they still go on and use the default of 0.5 instead of 1.0; see the attached screenshot as an example. > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: maxAMShare for root.users (parent queue) has no effect > as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. 
> However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and >
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: maxAMShare for root.users (parent queue) has no effect as child queue does not inherit it.png > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: maxAMShare for root.users (parent queue) has no effect > as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is currently > possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue; however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
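The {code} allocation snippets in the YARN-10528 description above lost their XML markup in the mail archive, leaving only element values. As a hedged reconstruction (element names follow the documented FairScheduler allocation-file format; the exact queue layout and values are assumptions, not the reporter's original file), the problematic configuration could look like:

{code}
<?xml version="1.0"?>
<allocations>
  <queueMaxAMShareDefault>1.0</queueMaxAMShareDefault>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
  <queue name="users" type="parent">
    <weight>1.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
    <!-- Accepted by the parser today even though root.users is a parent
         queue; child queues do not inherit it and fall back to the 0.5
         default, which is what the attached screenshot shows. -->
    <maxAMShare>1.0</maxAMShare>
  </queue>
</allocations>
{code}

The point of the bug report is that the parser should reject {{maxAMShare}} here, exactly as it already rejects {{reservation}} on parent queues.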
[jira] [Comment Edited] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250153#comment-17250153 ] Siddharth Ahuja edited comment on YARN-10528 at 12/16/20, 7:52 AM: --- I have made the behaviour similar to the {{reservation}} element in code. Performed the following testing on the single node cluster: Have FS XML as follows: {code} 1.0 drf * * 1.0 drf 1.0 drf 0.76 <- root.users is a parent queue with maxAMShare set. This should not be possible. 1.0 drf 1.0 drf 1.0 drf 1.0 drf fair 0.75 {code} Refresh YARN queues and observe the RM logs: {code} % bin/yarn rmadmin -refreshQueues {code} {code} 2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128) at java.lang.Thread.run(Thread.java:748) 2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409) at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120) at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966) {code} Now, update FS XML such that {{maxAMShare}} is not set for root.users but set for a parent queue which is not explicitly tagged as one with "type=parent": {code} 1.0 drf * * 1.0 drf 
1.0
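The behaviour tested above mirrors the existing {{reservation}} check in AllocationFileQueueParser. A hedged sketch of such a guard (method and parameter names are illustrative, not the actual YARN-10528 patch; the real parser throws AllocationConfigurationException, while IllegalArgumentException keeps this sketch self-contained) could look like:

```java
// Illustrative parent-queue guard, modelled on the reservation check
// described above. Not the actual Hadoop code.
public class ParentQueueGuard {
    /**
     * Rejects a leaf-only property (such as maxAMShare) on a parent queue,
     * reproducing the error text seen in the RM logs above.
     */
    public static void checkLeafOnlySetting(String queueName, boolean isParent,
            String settingName, boolean settingPresent) {
        if (isParent && settingPresent) {
            throw new IllegalArgumentException(
                "The configuration settings for " + queueName + " are invalid. "
                + "A queue element that contains child queue elements or that "
                + "has the type='parent' attribute cannot also include a "
                + settingName + " element.");
        }
    }
}
```

Leaf queues such as root.users.alice would pass the guard untouched; only queues with children (or type='parent') trigger the exception during {{rmadmin -refreshQueues}}.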
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: YARN-10528.001.patch > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-10528.001.patch, maxAMShare for root.users (parent > queue) has no effect as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue; however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example.
[jira] [Commented] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250153#comment-17250153 ] Siddharth Ahuja commented on YARN-10528: I have made the behaviour similar to the reservation element in code. Performed the following testing on the single node cluster: Have FS XML as follows: {code} 1.0 drf * * 1.0 drf 1.0 drf 0.76 <- root.users is a parent queue with maxAMShare set. This should not be possible. 1.0 drf 1.0 drf 1.0 drf 1.0 drf fair 0.75 {code} Refresh YARN queues and observe the RM logs: {code} % bin/yarn rmadmin -refreshQueues {code} {code} 2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128) at java.lang.Thread.run(Thread.java:748) 2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409) at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120) at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966) {code} Now, update FS XML such that maxAMShare is not set for root.users but set for a parent queue which is not explicitly tagged as one with "type=parent": {code} 1.0 drf * * 1.0 drf 1.0 
drf 1.0 drf
[jira] [Comment Edited] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250153#comment-17250153 ] Siddharth Ahuja edited comment on YARN-10528 at 12/16/20, 7:51 AM: --- I have made the behaviour similar to the {{reservation}} element in code. Performed the following testing on the single node cluster: Have FS XML as follows: {code} 1.0 drf * * 1.0 drf 1.0 drf 0.76 <- root.users is a parent queue with maxAMShare set. This should not be possible. 1.0 drf 1.0 drf 1.0 drf 1.0 drf fair 0.75 {code} Refresh YARN queues and observe the RM logs: {code} % bin/yarn rmadmin -refreshQueues {code} {code} 2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128) at java.lang.Thread.run(Thread.java:748) 2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element. 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409) at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120) at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966) {code} Now, update FS XML such that {{maxAMShare}} is not set for root.users but set for a parent queue which is not explicitly tagged as one with "type=parent": {code} 1.0 drf * * 1.0 drf 
1.0
[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues
[ https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10528: --- Attachment: (was: YARN-10528.001.patch) > maxAMShare should only be accepted for leaf queues, not parent queues > - > > Key: YARN-10528 > URL: https://issues.apache.org/jira/browse/YARN-10528 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: maxAMShare for root.users (parent queue) has no effect > as child queue does not inherit it.png > > > Based on [Hadoop > documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html], > it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. > This is similar to the {{reservation}} setting. > However, existing code only ensures that the reservation setting is not > accepted for "parent" queues (see > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226 > and > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233) > but it is missing the checks for {{maxAMShare}}. Due to this, it is > currently possible to have an allocation similar to below: > {code} > > > > 1.0 > drf > * > * > > 1.0 > drf > > > 1.0 > drf > 1.0 > > > fair > > > > > > > > > {code} > where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the > queue's resources for Application Masters. Notice above that root.users is a > parent queue; however, it still gladly accepts {{maxAMShare}}. 
This is > contrary to the documentation and in fact, it is very misleading because the > child queues like root.users. actually do not inherit this setting at > all and they still go on and use the default of 0.5 instead of 1.0, see the > attached screenshot as an example.
[jira] [Commented] (YARN-10552) Eliminate code duplication in SLSCapacityScheduler and SLSFairScheduler
[ https://issues.apache.org/jira/browse/YARN-10552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277991#comment-17277991 ] Siddharth Ahuja commented on YARN-10552: Hey [~snemeth], thanks a lot for the de-duplication here! A few comments from my side: # SLSSchedulerCommons - Can we please explicitly assign a default value for the declared fields like metricsOn etc., and not rely on Java to assign one, as a matter of good programming style. # Class variables - metricsOn & schedulerMetrics could be marked private in SLSSchedulerCommons, and new getters should be defined that could be invoked within the individual scheduler classes instead of referring to them directly from a separate object. # The "Tracker" seems to be common to both schedulers, so we could move the declaration & initialization to the common SLSSchedulerCommons, implement getTracker() here that returns the tracker object and keep getTracker() in the individual schedulers (we have to, thanks to SchedulerWrapper) and just return the tracker by calling schedulerCommons.getTracker(). # //metrics off, //metrics on comments inside handle() in SLSSchedulerCommons don't seem to add much value, so let's just remove them. # appQueueMap was not present in SLSFairScheduler before (it was in SLSCapacityScheduler); however, from https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L163, it seems that the super class of the schedulers - https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java#L159 has this already. As such, do we really need to define a new map as a common map at all in SLSSchedulerCommons, or can we somehow reuse the super class's map? It might need some code updates though. 
# In regards to the above point, considering SLSFairScheduler did not previously have any of the following code in handle() method: {code} AppAttemptRemovedSchedulerEvent appRemoveEvent = (AppAttemptRemovedSchedulerEvent) schedulerEvent; appQueueMap.remove(appRemoveEvent.getApplicationAttemptID()); } else if (schedulerEvent.getType() == SchedulerEventType.APP_ATTEMPT_ADDED && schedulerEvent instanceof AppAttemptAddedSchedulerEvent) { AppAttemptAddedSchedulerEvent appAddEvent = (AppAttemptAddedSchedulerEvent) schedulerEvent; SchedulerApplication app = (SchedulerApplication) scheduler.getSchedulerApplications().get(appAddEvent.getApplicationAttemptId() .getApplicationId()); appQueueMap.put(appAddEvent.getApplicationAttemptId(), app.getQueue() .getQueueName()); {code} Do you think this was a bug that wasn't earlier identified? > Eliminate code duplication in SLSCapacityScheduler and SLSFairScheduler > --- > > Key: YARN-10552 > URL: https://issues.apache.org/jira/browse/YARN-10552 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-10552.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
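Review points 2 and 3 above can be sketched as follows. This is a hedged illustration only: the class and method names follow the patch discussion (SLSSchedulerCommons, Tracker, getTracker()), but the stand-in Tracker type and all signatures are assumptions, not the actual Hadoop SLS code.

```java
// Shared SLS state kept private in the commons class and exposed via getters,
// so the individual schedulers stop referring to fields of a separate object.
public class SLSSchedulerCommonsSketch {
    public static class Tracker { }  // stand-in for the real SLS Tracker type

    private boolean metricsOn = false;              // explicit default (point 1)
    private final Tracker tracker = new Tracker();  // owned here, not per scheduler

    public boolean isMetricsOn() { return metricsOn; }
    public Tracker getTracker() { return tracker; }
}
```

Each scheduler's getTracker() (still required by SchedulerWrapper) would then simply delegate with return schedulerCommons.getTracker();.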
[jira] [Updated] (YARN-10123) Error message around yarn app -stop/start can be improved to highlight that an implementation at framework level is needed for the stop/start functionality to work
[ https://issues.apache.org/jira/browse/YARN-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10123: --- Attachment: YARN-10123.branch-3.2.001.patch > Error message around yarn app -stop/start can be improved to highlight that > an implementation at framework level is needed for the stop/start > functionality to work > --- > > Key: YARN-10123 > URL: https://issues.apache.org/jira/browse/YARN-10123 > Project: Hadoop YARN > Issue Type: Improvement > Components: client, documentation >Affects Versions: 3.2.1 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10123.001.patch, YARN-10123.branch-3.2.001.patch > > > A "stop" on a YARN application fails with the below error: > {code} > # yarn app -stop application_1581294743321_0002 -appTypes SPARK > 20/02/10 06:24:27 INFO client.RMProxy: Connecting to ResourceManager at > c3224-node2.squadron.support.hortonworks.com/172.25.34.128:8050 > 20/02/10 06:24:27 INFO client.AHSProxy: Connecting to Application History > server at c3224-node2.squadron.support.hortonworks.com/172.25.34.128:10200 > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK > at > org.apache.hadoop.yarn.client.api.AppAdminClient.createAppAdminClient(AppAdminClient.java:76) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:579) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:123) > {code} > From > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L76, > it seems that this is because user does not have the setting: > {code} > yarn.application.admin.client.class.SPARK > {code} > set up in their client 
configuration. > However, even if this setting is present, we still need to have an > implementation available for the application type. From my internal > discussions - Jobs don't have a notion of stop / resume functionality at YARN > level. If some apps like Spark need it, it has to be implemented at those > framework's level. > Therefore, the above error message is a bit misleading in that, even if > "yarn.application.admin.client.class.SPARK" is supplied (or for that matter - > yarn.application.admin.client.class.MAPREDUCE), if there is no implementation > actually available underneath to handle the stop/start functionality then, we > will fail again, albeit with a different error here: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L85. > As such, maybe this error message can be potentially improved to say > something like: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK. Please ensure the App admin > client class actually exists within SPARK to handle this functionality. > {code} > or something similar. > Further, documentation around "-stop" and "-start" options will need to be > improved here -> > https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app > as it does not mention anything about having an implementation at the > framework level for the YARN stop/start command to succeed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
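The error-message improvement proposed above can be sketched as a small helper. This is purely illustrative: the class and method names are hypothetical, not the actual AppAdminClient change, and only the message text comes from the issue description.

```java
// Hedged sketch of the clearer message suggested in the issue above: point
// out that a framework-level implementation is required, not just the config.
public class AppAdminClientMessages {
    public static String missingClientClass(String appType) {
        return "App admin client class name not specified for type " + appType
            + ". Please ensure the App admin client class actually exists within "
            + appType + " to handle this functionality.";
    }
}
```

With this, the SPARK example in the description would fail with guidance instead of a bare "class name not specified".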
[jira] [Updated] (YARN-10123) Error message around yarn app -stop/start can be improved to highlight that an implementation at framework level is needed for the stop/start functionality to work
[ https://issues.apache.org/jira/browse/YARN-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10123: --- Attachment: YARN-10123.branch-3.3.001.patch > Error message around yarn app -stop/start can be improved to highlight that > an implementation at framework level is needed for the stop/start > functionality to work > --- > > Key: YARN-10123 > URL: https://issues.apache.org/jira/browse/YARN-10123 > Project: Hadoop YARN > Issue Type: Improvement > Components: client, documentation >Affects Versions: 3.2.1 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10123.001.patch, YARN-10123.branch-3.2.001.patch, > YARN-10123.branch-3.3.001.patch > > > A "stop" on a YARN application fails with the below error: > {code} > # yarn app -stop application_1581294743321_0002 -appTypes SPARK > 20/02/10 06:24:27 INFO client.RMProxy: Connecting to ResourceManager at > c3224-node2.squadron.support.hortonworks.com/172.25.34.128:8050 > 20/02/10 06:24:27 INFO client.AHSProxy: Connecting to Application History > server at c3224-node2.squadron.support.hortonworks.com/172.25.34.128:10200 > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK > at > org.apache.hadoop.yarn.client.api.AppAdminClient.createAppAdminClient(AppAdminClient.java:76) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:579) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:123) > {code} > From > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L76, > it seems that this is because user does not have the setting: > {code} > yarn.application.admin.client.class.SPARK > 
{code} > set up in their client configuration. > However, even if this setting is present, we still need to have an > implementation available for the application type. From my internal > discussions - jobs don't have a notion of stop/resume functionality at the YARN > level. If some apps like Spark need it, it has to be implemented at those > frameworks' level. > Therefore, the above error message is a bit misleading in that, even if > "yarn.application.admin.client.class.SPARK" is supplied (or, for that matter, > yarn.application.admin.client.class.MAPREDUCE), if there is no implementation > actually available underneath to handle the stop/start functionality, then we > will fail again, albeit with a different error here: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/AppAdminClient.java#L85. > As such, this error message can potentially be improved to say > something like: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: App admin > client class name not specified for type SPARK. Please ensure the App admin > client class actually exists within SPARK to handle this functionality. > {code} > or something similar. > Further, documentation around "-stop" and "-start" options will need to be > improved here -> > https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app > as it does not mention anything about needing an implementation at the > framework level for the YARN stop/start command to succeed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
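The two-step failure described above can be illustrated with a client-side configuration sketch. The class name below is a made-up placeholder (no such class ships with Hadoop or Spark), which is precisely the point of the issue: supplying the property alone is not enough without a real AppAdminClient implementation behind it.

```xml
<!-- Hypothetical yarn-site.xml fragment, for illustration only.
     org.example.SparkAppAdminClient is an invented class name: with the
     property absent, createAppAdminClient fails at AppAdminClient.java#L76
     ("class name not specified"); with the property present but the class
     missing from the client classpath, it fails again at #L85 instead. -->
<property>
  <name>yarn.application.admin.client.class.SPARK</name>
  <value>org.example.SparkAppAdminClient</value>
</property>
```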
[jira] [Assigned] (YARN-10770) container-executor permission is wrong in SecureContainer.md
[ https://issues.apache.org/jira/browse/YARN-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10770: -- Assignee: Siddharth Ahuja > container-executor permission is wrong in SecureContainer.md > > > Key: YARN-10770 > URL: https://issues.apache.org/jira/browse/YARN-10770 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Akira Ajisaka >Assignee: Siddharth Ahuja >Priority: Major > Labels: newbie > > {noformat} > The `container-executor` program must be owned by `root` and have the > permission set `---sr-s---`. > {noformat} > It should be 6050 {noformat}---Sr-s---{noformat}
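The corrected permission string can be verified mechanically. The snippet below (illustrative only, using a throwaway temp file rather than the real binary) shows that mode 6050 renders exactly as `---Sr-s---`:

```python
import os
import stat
import tempfile

# Create a throwaway file and give it mode 6050:
# setuid (4000) + setgid (2000) + group r-x (050).
# Owner has no execute bit, so setuid displays as a capital 'S';
# group does have execute, so setgid displays as a lowercase 's';
# "other" gets no access at all.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o6050)

print(stat.filemode(os.stat(path).st_mode))  # -> ---Sr-s---
os.remove(path)

# On a real cluster the equivalent (requires root; path and group
# illustrative) would be:
#   chown root:hadoop $HADOOP_HOME/bin/container-executor
#   chmod 6050 $HADOOP_HOME/bin/container-executor
```

This also makes the original doc's `---sr-s---` visibly wrong: a lowercase `s` in the owner position would imply the owner-execute bit is set, which mode 6050 does not grant.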
[jira] [Updated] (YARN-10770) container-executor permission is wrong in SecureContainer.md
[ https://issues.apache.org/jira/browse/YARN-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10770: --- Attachment: YARN-10770.001.patch
[jira] [Updated] (YARN-10770) container-executor permission is wrong in SecureContainer.md
[ https://issues.apache.org/jira/browse/YARN-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10770: --- Attachment: (was: YARN-10770.001.patch)
[jira] [Updated] (YARN-10839) queueMaxAppsDefault when set blindly caps the root queue's maxRunningApps setting to this value ignoring any individually overridden maxRunningApps setting for child queues
[ https://issues.apache.org/jira/browse/YARN-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10839: --- Component/s: yarn > queueMaxAppsDefault when set blindly caps the root queue's maxRunningApps > setting to this value ignoring any individually overridden maxRunningApps > setting for child queues in FairScheduler > > > Key: YARN-10839 > URL: https://issues.apache.org/jira/browse/YARN-10839 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.5, 3.3.1 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > > [queueMaxAppsDefault|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format] > sets the default running app limit for queues (including the root queue), > which can be overridden by individual child queues through the maxRunningApps > setting. > Consider a simple FairScheduler XML as follows: > {code}
> <?xml version="1.0"?>
> <allocations>
>   <queue name="root">
>     <weight>1.0</weight>
>     <schedulingPolicy>drf</schedulingPolicy>
>     <aclSubmitApps>*</aclSubmitApps>
>     <aclAdministerApps>*</aclAdministerApps>
>     <!-- name of this unnamed child queue assumed to be "default" -->
>     <queue name="default">
>       <weight>1.0</weight>
>       <schedulingPolicy>drf</schedulingPolicy>
>     </queue>
>     <queue name="A">
>       <maxResources>1024000 mb, 1000 vcores</maxResources>
>       <maxRunningApps>15</maxRunningApps>
>       <weight>2.0</weight>
>       <schedulingPolicy>drf</schedulingPolicy>
>     </queue>
>     <queue name="B">
>       <maxResources>512000 mb, 500 vcores</maxResources>
>       <maxRunningApps>10</maxRunningApps>
>       <weight>1.0</weight>
>       <schedulingPolicy>drf</schedulingPolicy>
>     </queue>
>   </queue>
>   <queueMaxAppsDefault>3</queueMaxAppsDefault>
>   <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
> </allocations>
> {code} > Here: > * {{queueMaxAppsDefault}} defaults every queue to 3 {{maxRunningApps}}. > * The root queue does not have any maxRunningApps limit set. > * maxRunningApps for the child queues is 15 for root.A and 10 for root.B. > From the above, if users want to submit jobs to root.B, they are (incorrectly) > capped to 3, not 10, because the root queue (parent) itself is capped to 3 > by the queueMaxAppsDefault setting. > Users thus observe their apps stuck in the ACCEPTED state. > Either the above FairScheduler XML should have been rejected by the > ResourceManager, or the root queue should have been capped to the maximum > maxRunningApps setting defined for a leaf queue. 
> Possible solution -> If the root queue has no maxRunningApps set and > queueMaxAppsDefault is set to a lower value than the maxRunningApps of an > individual leaf queue, then the root queue should implicitly be capped to > the latter, instead of queueMaxAppsDefault.
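The proposed resolution can be sketched in a few lines. `effective_root_max_running_apps` is a hypothetical helper invented for illustration; it is not actual FairScheduler code, just a model of the rule suggested above.

```python
# Hypothetical sketch of the proposed rule (not actual FairScheduler code):
# if the root queue carries no explicit maxRunningApps, its effective cap
# should be the larger of queueMaxAppsDefault and the highest explicit
# maxRunningApps among the leaf queues, so an individually raised leaf
# limit is not silently nullified by the default.
def effective_root_max_running_apps(root_max, queue_max_apps_default, leaf_limits):
    """root_max: explicit root limit or None; leaf_limits: explicit leaf limits."""
    if root_max is not None:
        # An explicit root limit always wins, as today.
        return root_max
    return max([queue_max_apps_default, *leaf_limits])

# With the allocation file from the description (root.A=15, root.B=10,
# queueMaxAppsDefault=3), root would be capped at 15 instead of 3:
print(effective_root_max_running_apps(None, 3, [15, 10]))  # -> 15
```

Under the current behaviour the same inputs yield 3, which is what leaves apps submitted to root.A or root.B stuck in ACCEPTED.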
[jira] [Updated] (YARN-10839) queueMaxAppsDefault when set blindly caps the root queue's maxRunningApps setting to this value ignoring any individually overridden maxRunningApps setting for child queues
[ https://issues.apache.org/jira/browse/YARN-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja updated YARN-10839: --- Labels: scheduler (was: )