[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077325#comment-17077325 ]

Szilard Nemeth edited comment on YARN-10207 at 4/7/20, 3:10 PM:

[~sahuja], Committed to trunk and branch-3.3. For branch-3.2, I uploaded a patch ( [^YARN-10207.branch-3.2.001.patch] ) to get Jenkins results.

was (Author: snemeth):
[~sahuja], Committed to trunk and branch-3.3. For branch-3.2, I uploaded a patch (https://issues.apache.org/jira/secure/attachment/12999248/YARN-10207.branch-3.2.001.patch) to get Jenkins results.

> CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated
> logs on the JobHistoryServer Web UI
> -
>
> Key: YARN-10207
> URL: https://issues.apache.org/jira/browse/YARN-10207
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Siddharth Ahuja
> Assignee: Siddharth Ahuja
> Priority: Major
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10207.001.patch, YARN-10207.002.patch,
> YARN-10207.003.patch, YARN-10207.004.patch, YARN-10207.branch-3.2.001.patch
>
>
> File descriptor leaks are observed coming from the JobHistoryServer process
> while it tries to render a "corrupted" aggregated log on the JHS Web UI.
> Issue reproduced using the following steps:
> # Ran a sample Hadoop MR Pi job; it had the id
> application_1582676649923_0026.
> # Copied an aggregated log file from HDFS to the local FS:
> {code}
> hdfs dfs -get /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Updated the TFile metadata at the bottom of this file with some junk to
> corrupt the file:
> *Before:*
> {code}
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP
> {code}
> *After:*
> {code}
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah
> {code}
> Notice "blah" (junk) added at the very end.
> # Removed the existing aggregated log file so it could be replaced by the
> modified copy from step 3 (HDFS would otherwise refuse to place a file with
> the same name as one that already exists):
> {code}
> hdfs dfs -rm -r -f /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Uploaded the corrupted aggregated file back to HDFS:
> {code}
> hdfs dfs -put _8041 /tmp/logs/systest/logs/application_1582676649923_0026
> {code}
> # Visited the HistoryServer Web UI.
> # Clicked on job_1582676649923_0026.
> # Clicked on the "logs" link against the AM (assuming the AM ran on nm_hostname).
> # Reviewed the JHS logs; the following exception is seen:
> {code}
> 2020-03-24 20:03:48,484 ERROR org.apache.hadoop.yarn.webapp.View: Error getting logs for job_1582676649923_0026
> java.io.IOException: Not a valid BCFile.
> at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
> at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
> at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
> at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:588)
> at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.TFileAggregatedLogsBlock.render(TFileAggregatedLogsBlock.java:111)
> at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController.renderAggregatedLogsBlock(LogAggregationTFileController.java:341)
> at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:117)
> at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
> at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> ...
> {code}
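The corruption step in the repro above can be sketched in code as well. The following is a minimal, self-contained illustration (not the actual repro commands): it appends junk bytes to the tail of a local file, which is exactly what invalidates the TFile's trailing metadata. The file name is a hypothetical stand-in.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of step 3 of the repro: appending junk to the end of a
// TFile-format aggregated log corrupts its trailing metadata section.
public class CorruptTail {
    public static void main(String[] args) throws IOException {
        // Stand-in for the downloaded aggregated log file.
        Path log = Files.createTempFile("aggregated-log", ".tfile");
        Files.write(log, new byte[]{1, 2, 3, 4}); // placeholder log bytes

        // Open in append mode and write "blah", as in the repro.
        try (FileOutputStream out = new FileOutputStream(log.toFile(), true)) {
            out.write("blah".getBytes());
        }
        System.out.println(Files.size(log)); // 4 original + 4 junk bytes
    }
}
```

Putting such a file back under the aggregation directory is then enough to make TFile.Reader fail with "Not a valid BCFile." when the JHS renders it.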
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074957#comment-17074957 ]

Siddharth Ahuja edited comment on YARN-10207 at 4/3/20, 11:04 PM:

Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings. I updated the code slightly so that the indentation now follows the guidelines, and in some cases the issue is prevented altogether. Let me know if this resolves your comment (not sure how much more I can tweak this). Thanks again!

was (Author: sahuja):
Hey [~adam.antal], thanks again for your review. I went ahead and updated my IDE indentation settings. I updated the code slightly so that the indentation now follows the guidelines, and in some cases the issue is prevented altogether. Let me know if this resolves your comment. Thanks again!
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ]

Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:50 AM:

Hi [~adam.antal], thanks for your comments.

The leak happens when AggregatedLogFormat.LogReader is instantiated: the TFile.Reader creation inside the AggregatedLogFormat.LogReader constructor fails because a corrupted file was passed in (see the stacktrace above), and the FSDataInputStream that was already opened is never closed. The caller, TFileAggregatedLogsBlock.render(…), does try to clean up the reader in its finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), but that assumes the reader was created successfully. In our case the reader never gets created, because construction itself fails on the corrupted log.

The fix, therefore, is to catch any IOException inside the AggregatedLogFormat.LogReader constructor itself, close all the relevant entities including the FSDataInputStream, and rethrow the exception to the caller (TFileAggregatedLogsBlock.render) so that it can catch and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures we don't leak connections wherever the reader fails to instantiate (= new AggregatedLogFormat.LogReader).
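The shape of the fix described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern, not the actual Hadoop code: the class and the tracking stream are stand-ins for AggregatedLogFormat.LogReader and FSDataInputStream.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of the constructor-side fix: if a later construction step throws,
// close what the constructor already acquired before rethrowing, so the
// caller's "reader was never assigned, nothing to clean up" path cannot leak.
public class LogReaderSketch {

    // Stand-in for FSDataInputStream; only records whether close() ran.
    static class TrackingStream implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    final TrackingStream fsDataInputStream;

    LogReaderSketch(TrackingStream in, boolean corrupt) throws IOException {
        this.fsDataInputStream = in;
        try {
            // Stand-in for "new TFile.Reader(...)", which throws
            // "Not a valid BCFile." on a corrupted aggregated log.
            if (corrupt) {
                throw new IOException("Not a valid BCFile.");
            }
        } catch (IOException e) {
            // The fix: release the already-opened stream, then propagate
            // the exception so the caller can catch and log it.
            in.close();
            throw e;
        }
    }

    public static void main(String[] args) {
        TrackingStream in = new TrackingStream();
        try {
            new LogReaderSketch(in, true);
        } catch (IOException expected) {
            // Construction failed, but the stream was still closed.
        }
        System.out.println("stream closed: " + in.closed); // stream closed: true
    }
}
```

Without the catch block, a construction failure would leave `in` open forever, which is the CLOSE_WAIT leak this JIRA fixes.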
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml:

{code}
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
  <value>/tmp/ifilelogs</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
  <value>ifilelogs</value>
</property>
{code}

Like the earlier scenario, I corrupted the IFile (the aggregated log in HDFS) and tried to render it in the JHS Web UI; however, no leaks were found in this case. This is the call flow: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…). An IOException is encountered inside that try block, but notice the finally clause here -> https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. It cleans up the socket connection by closing the FSDataInputStream. Note that this is a different call stack from the TFile case, as there is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently. Regardless, thanks to that finally clause, the connection is cleaned up and there are no CLOSE_WAIT leaks when a corrupted log file is encountered.
(A bad thing here is that only a WARN log is presented to the user in the JHS logs when rendering fails for IFile logs, and no stacktrace from the exception is logged here - https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136 - because the exception is simply swallowed inside the catch{} clause. This may warrant a separate JIRA.)

As part of this fix, I looked for any other occurrences of "new TFile.Reader" that may cause connection leaks elsewhere. I found two:
# TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and
# FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691

1 is not an issue, because the FSDataInputStream is closed inside the finally{} clause here:
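For contrast with the constructor-side fix, the finally{} idiom credited above for the IFile path (and for TFileDumper) can be sketched like this. Again a hypothetical stand-in, not the Hadoop source: the point is only that the close runs on both the success and the failure path.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of the try/finally cleanup that keeps the IFile path leak-free:
// the stream is closed whether or not parsing the (possibly corrupted)
// log metadata throws.
public class FinallyCloseSketch {

    // Stand-in for FSDataInputStream; only records whether close() ran.
    static class TrackingStream implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    static void loadMeta(TrackingStream in, boolean corrupt) throws IOException {
        try {
            // Stand-in for loadIndexedLogsMeta(...) parsing the log.
            if (corrupt) {
                throw new IOException("corrupted aggregated log");
            }
        } finally {
            in.close(); // runs on both success and failure
        }
    }

    public static void main(String[] args) {
        TrackingStream in = new TrackingStream();
        try {
            loadMeta(in, true);
        } catch (IOException expected) {
            // Parsing failed, but the stream was still closed.
        }
        System.out.println("closed: " + in.closed); // closed: true
    }
}
```

Both idioms (catch-close-rethrow in a constructor, close-in-finally in a caller) achieve the same thing; the TFile reader simply had neither before this fix.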
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ] Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:37 AM: -- Hi [~adam.antal], thanks for your comments. The leak happens when AggregatedLogFormat.LogReader is getting instantiated, specifically, when TFile.Reader creation within the AggregatedLogFormat.LogReader's constructor fails due to a corrupted file passed in (see above stacktrace). The fact that FSDataInputStream is not closed out causes the leak. The caller - TFileAggregatedLogsBlock.render(…) does try to cleanup the reader in the finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), however, it assumes that the reader would have been created successfully. However, in our case, the reader never manages to get created because it fails during construction phase itself due to a corrupted log. The fix, therefore, is to catch any IO Exceptions within AggregatedLogFormat.LogReader itself inside the constructor, perform a close of all the relevant entities including FSDataInputStream and throw the exception back to the caller (TFileAggregatedLogsBlock.render) so that it is able to catch it and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures that we don't leak connections etc. wherever the reader fails to instantiate (=new AggregatedLogFormat.LogReader). 
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml: {code} yarn.log-aggregation.file-formats IndexedFormat yarn.log-aggregation.file-controller.IndexedFormat.class org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController yarn.log-aggregation.IndexedFormat.remote-app-log-dir /tmp/ifilelogs yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix ifilelogs {code} Like the earlier scenario, I corrupted the Ifile (aggregated log in HDFS) and tried to render it in JHS Web UI, however, no leaks were found for this case. This is the call flow: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…) IOException is encountered inside this try block, however, notice the finally clause here -> https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. This helps cleaning up the socket connection by closing out FSDataInputStream. You will notice that this is a different call stack to the TFile case as we don't have a call to AggregatedLogFormat.LogReader i.e. it is coded differently. Regardless, thanks to that finally clause, it does end up cleaning the connection and there are no CLOSE_WAIT leaks in case of a corrupted log file being encountered. 
(Bad thing here is that only a WARN log is presented to the user in the JHS logs in case of rendering failing for Tfile logs and there is no stacktrace logged coming from the exception here - https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136 as the exception is just swallowed up inside the catch{} clause. This may warrant a separate JIRA.) As part of this fix, I looked for any occurrences of "new TFile.Reader" that may cause connection leaks somewhere else. I found two : # TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and, # FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691 1 is not an issue because FSDataInputStream is getting closed inside finally{} clause here:
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ] Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:37 AM: -- Hi [~adam.antal], thanks for your comments. The leak happens when AggregatedLogFormat.LogReader is getting instantiated, specifically, when TFile.Reader creation within the AggregatedLogFormat.LogReader's constructor fails due to a corrupted file passed in (see above stacktrace). The fact that FSDataInputStream is not closed out causes the leak. The caller - TFileAggregatedLogsBlock.render(…) does try to cleanup the reader in the finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), however, it assumes that the reader would have been created successfully. However, in our case, the reader never manages to get created because it fails during construction phase itself due to a corrupted log. The fix, therefore, is to catch any IO Exceptions within AggregatedLogFormat.LogReader itself inside the constructor, perform a close of all the relevant entities including FSDataInputStream and throw the exception back to the caller (TFileAggregatedLogsBlock.render) so that it is able to catch it and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures that we don't leak connections etc. wherever the reader fails to instantiate (=new AggregatedLogFormat.LogReader). 
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml: {code} yarn.log-aggregation.file-formats IndexedFormat yarn.log-aggregation.file-controller.IndexedFormat.class org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController yarn.log-aggregation.IndexedFormat.remote-app-log-dir /tmp/ifilelogs yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix ifilelogs {code} Like the earlier scenario, I corrupted the Ifile (aggregated log in HDFS) and tried to render it in JHS Web UI, however, no leaks were found for this case. This is the call flow: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…) IOException is encountered inside this try block, however, notice the finally clause here -> https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. This helps cleaning up the socket connection by closing out FSDataInputStream. You will notice that this is a different call stack to the TFile case as we don't have a call to AggregatedLogFormat.LogReader i.e. it is coded differently. Regardless, thanks to that finally clause, it does end up cleaning the connection and there are no CLOSE_WAIT leaks in case of a corrupted log file being encountered. 
(Bad thing here is that only a WARN log is presented to the user in the JHS logs in case of rendering failing for Tfile logs and there is no stacktrace logged coming from the exception here - https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136 as the exception is just swallowed up inside the catch{} clause. This may warrant a separate JIRA.) As part of this fix, I looked for any occurrences of "new TFile.Reader" that may cause connection leaks somewhere else. I found two : # TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and, # FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691 1 is not an issue because FSDataInputStream is getting closed inside finally{} clause here:
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ] Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:36 AM: -- Hi [~adam.antal], thanks for your comments. The leak happens when AggregatedLogFormat.LogReader is getting instantiated, specifically, when TFile.Reader creation within the AggregatedLogFormat.LogReader's constructor fails due to a corrupted file passed in (see above stacktrace). The fact that FSDataInputStream is not closed out causes the leak. The caller - TFileAggregatedLogsBlock.render(…) does try to cleanup the reader in the finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), however, it assumes that the reader would have been created successfully. However, in our case, the reader never manages to get created because it fails during construction phase itself due to a corrupted log. The fix, therefore, is to catch any IO Exceptions within AggregatedLogFormat.LogReader itself inside the constructor, perform a close of all the relevant entities including FSDataInputStream and throw the exception back to the caller (TFileAggregatedLogsBlock.render) so that it is able to catch it and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures that we don't leak connections etc. wherever the reader fails to instantiate (=new AggregatedLogFormat.LogReader). 
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml: {code} yarn.log-aggregation.file-formats IndexedFormat yarn.log-aggregation.file-controller.IndexedFormat.class org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController yarn.log-aggregation.IndexedFormat.remote-app-log-dir /tmp/ifilelogs yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix ifilelogs {code} Like the earlier scenario, I corrupted the Ifile (aggregated log in HDFS) and tried to render it in JHS Web UI, however, no leaks were found for this case. The call happens in this fashion: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…) IOException is encountered inside this try block, however, notice the finally clause here -> https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. This helps cleaning up the socket connection by closing out FSDataInputStream. You will notice that this is a different call stack to the TFile case as we don't have a call to AggregatedLogFormat.LogReader i.e. it is coded differently. Regardless, thanks to that finally clause, it does end up cleaning the connection and there are no CLOSE_WAIT leaks in case of a corrupted log file being encountered. 
(Bad thing here is that only a WARN log is presented to the user in the JHS logs in case of rendering failing for Tfile logs and there is no stacktrace logged coming from the exception here - https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136 as the exception is just swallowed up inside the catch{} clause. This may warrant a separate JIRA.) As part of this fix, I looked for any occurrences of "new TFile.Reader" that may cause connection leaks somewhere else. I found two : # TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and, # FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691 1 is not an issue because FSDataInputStream is getting closed inside finally{} clause here:
[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542 ] Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:35 AM: --

Hi [~adam.antal], thanks for your comments.

The leak happens when AggregatedLogFormat.LogReader is instantiated: the TFile.Reader created inside the AggregatedLogFormat.LogReader constructor fails because a corrupted file is passed in (see the stacktrace above), and the FSDataInputStream that was already opened is never closed. The caller, TFileAggregatedLogsBlock.render(…), does try to clean up the reader in its finally clause (see https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), but that assumes the reader was created successfully. In our case the reader never exists, because construction itself fails on the corrupted log.

The fix, therefore, is to catch any IOException inside the AggregatedLogFormat.LogReader constructor, close all the relevant resources (including the FSDataInputStream), and rethrow the exception to the caller (TFileAggregatedLogsBlock.render) so that it can catch and log it (https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150). This ensures that we don't leak connections wherever the reader fails to instantiate (= new AggregatedLogFormat.LogReader). 
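To make the shape of the fix concrete, here is a simplified, self-contained sketch of the close-on-construction-failure pattern described above. The class and field names are illustrative stand-ins (FakeInputStream mimics FSDataInputStream), not the actual Hadoop code from the patch:

```java
import java.io.Closeable;
import java.io.IOException;

// Simplified sketch of the fix: if the inner TFile.Reader-style parse fails
// during construction, the already-opened stream must be closed before the
// exception propagates, otherwise the socket stays in CLOSE_WAIT.
class LogReaderSketch implements Closeable {

    /** Illustrative stand-in for FSDataInputStream. */
    static class FakeInputStream implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    private final FakeInputStream fsDataIStream;

    LogReaderSketch(FakeInputStream in, boolean corrupt) throws IOException {
        this.fsDataIStream = in;
        try {
            // Stand-in for "new TFile.Reader(...)", which throws
            // "Not a valid BCFile" on a corrupted aggregated log.
            if (corrupt) {
                throw new IOException("Not a valid BCFile.");
            }
        } catch (IOException e) {
            // The essence of the patch: clean up on construction failure,
            // then rethrow so the caller (TFileAggregatedLogsBlock.render
            // in the real code) can still catch and log the error.
            in.close();
            throw e;
        }
    }

    @Override public void close() { fsDataIStream.close(); }
}
```

With this shape, the caller's finally clause can stay as it is: either the reader exists and its close() releases the stream, or construction failed and the constructor already released it.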
Based on your feedback, I performed functional testing with IndexedFormat (IFile) by setting the following properties inside yarn-site.xml:
{code:xml}
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
  <value>/tmp/ifilelogs</value>
</property>
<property>
  <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
  <value>ifilelogs</value>
</property>
{code}
As in the earlier scenario, I corrupted the IFile (the aggregated log in HDFS) and tried to render it in the JHS Web UI; however, no leaks were observed in this case. The call happens in this fashion: IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…). An IOException is encountered inside that try block, but note the finally clause here -> https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. It cleans up the socket connection by closing the FSDataInputStream. Note that this is a different call stack from the TFile case, as there is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently. Regardless, thanks to that finally clause, the connection does get cleaned up and there are no CLOSE_WAIT leaks when a corrupted log file is encountered. 
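The reason the IFile path is leak-free can be sketched in a few lines. This is an illustrative mock (FakeInputStream stands in for FSDataInputStream, and the method name mirrors loadIndexedLogsMeta), not the real LogAggregationIndexedFileController:

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of the caller-side cleanup in the IFile path: the stream is closed
// in a finally block, so even when parsing the log metadata throws on a
// corrupted file, the connection is released and no CLOSE_WAIT leak occurs.
class IndexedLogsMetaSketch {

    /** Illustrative stand-in for FSDataInputStream. */
    static class FakeInputStream implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    static void loadIndexedLogsMeta(FakeInputStream in, boolean corrupt)
            throws IOException {
        try {
            // Stand-in for reading the IFile metadata, which throws an
            // IOException when the aggregated log is corrupted.
            if (corrupt) {
                throw new IOException("corrupted IFile metadata");
            }
        } finally {
            // Mirrors the finally clause in loadIndexedLogsMeta: the stream
            // is always closed, corrupted file or not.
            in.close();
        }
    }
}
```

The contrast with the TFile case is that here the same method both opens and disposes of the stream, so a single finally clause covers the failure path.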
(A bad thing here is that when rendering fails for IFile logs, only a WARN log is presented to the user in the JHS logs and no stacktrace from the exception is logged, see https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136, as the exception is simply swallowed inside the catch clause. This may warrant a separate JIRA.)

As part of this fix, I also looked for any other occurrences of "new TFile.Reader" that may cause connection leaks elsewhere. I found two:
# TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and
# FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691

Occurrence #1 is not an issue because the FSDataInputStream is getting closed inside a finally clause here:
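For completeness, the general Java idiom that prevents this whole class of leak, when the stream is opened and fully consumed in the same scope, is try-with-resources. This is not part of the patch (the real LogReader must keep its stream open for the reader's lifetime, so it cannot always take this shape); the names below are illustrative:

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative sketch: try-with-resources closes the stream automatically
// even when the body throws, so a corrupted-file IOException cannot leave
// a socket in CLOSE_WAIT. Applicable to scan-and-discard users of
// TFile.Reader such as TFileDumper-style code.
class TryWithResourcesSketch {

    /** Illustrative stand-in for FSDataInputStream. */
    static class FakeInputStream implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Exposed so the cleanup behaviour can be observed from outside.
    static FakeInputStream lastStream;

    static void dump(boolean corrupt) throws IOException {
        lastStream = new FakeInputStream();
        try (FakeInputStream in = lastStream) {
            // Stand-in for constructing and reading a TFile.Reader.
            if (corrupt) {
                throw new IOException("Not a valid BCFile.");
            }
        }
        // in.close() has already run here, whether or not the body threw.
    }
}
```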