[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI

2020-04-07 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077325#comment-17077325
 ] 

Szilard Nemeth edited comment on YARN-10207 at 4/7/20, 3:10 PM:


[~sahuja],
Committed to trunk and branch-3.3.
For branch-3.2, I uploaded a patch ( [^YARN-10207.branch-3.2.001.patch] ) to 
get Jenkins results.


was (Author: snemeth):
[~sahuja],
Committed to trunk and branch-3.3.
For branch-3.2, I uploaded a patch 
(https://issues.apache.org/jira/secure/attachment/12999248/YARN-10207.branch-3.2.001.patch)
 to have a Jenkins results

> CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated 
> logs on the JobHistoryServer Web UI
> -
>
> Key: YARN-10207
> URL: https://issues.apache.org/jira/browse/YARN-10207
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10207.001.patch, YARN-10207.002.patch, 
> YARN-10207.003.patch, YARN-10207.004.patch, YARN-10207.branch-3.2.001.patch
>
>
> File descriptor leaks are observed coming from the JobHistoryServer process 
> while it tries to render a "corrupted" aggregated log on the JHS Web UI.
> Issue reproduced using the following steps:
> # Ran a sample Hadoop MR Pi job; it had the id 
> application_1582676649923_0026.
> # Copied an aggregated log file from HDFS to local FS:
> {code}
> hdfs dfs -get 
> /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Updated the TFile metadata at the bottom of this file with some junk to 
> corrupt the file:
> *Before:*
> {code}
>   
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP
> {code}
> *After:*
> {code}
>   
> ^@^GVERSION*(^@_1582676649923_0026_01_03^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah
> {code}
> Notice "blah" (junk) added at the very end.
> # Remove the existing aggregated log file that needs to be replaced by our 
> modified copy from step 3 (otherwise HDFS will refuse to place a file with 
> the same name as one that already exists):
> {code}
> hdfs dfs -rm -r -f 
> /tmp/logs/systest/logs/application_1582676649923_0026/_8041
> {code}
> # Upload the corrupted aggregated file back to HDFS:
> {code}
> hdfs dfs -put _8041 
> /tmp/logs/systest/logs/application_1582676649923_0026
> {code}
> # Visit HistoryServer Web UI
> # Click on job_1582676649923_0026
> # Click on "logs" link against the AM (assuming the AM ran on nm_hostname)
> # Review the JHS logs, following exception will be seen:
> {code}
>   2020-03-24 20:03:48,484 ERROR org.apache.hadoop.yarn.webapp.View: Error 
> getting logs for job_1582676649923_0026
>   java.io.IOException: Not a valid BCFile.
>   at 
> org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
>   at 
> org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
>   at 
> org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
>   at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:588)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.TFileAggregatedLogsBlock.render(TFileAggregatedLogsBlock.java:111)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController.renderAggregatedLogsBlock(LogAggregationTFileController.java:341)
>   at 
> org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:117)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>   at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>   at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
>   at 
> org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
>   at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>   at 
> 

[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI

2020-04-03 Thread Siddharth Ahuja (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074957#comment-17074957
 ] 

Siddharth Ahuja edited comment on YARN-10207 at 4/3/20, 11:04 PM:
--

Hey [~adam.antal], thanks again for your review. I went ahead and updated my 
IDE indentation settings, and I tweaked the code slightly to fix the 
indentation as per the guidelines and, in some cases, to prevent the issue 
altogether. Let me know if this resolves your comment (not sure how much more I 
can tweak this). Thanks again!


was (Author: sahuja):
Hey [~adam.antal], thanks again for your review. I went ahead and updated my 
IDE indentation settings. I updated the code slightly so that I fixed up the 
indentation as per the guidelines and also in some cases prevented the issue 
altogether. Let me know if this resolves your comment. Thanks again!


[jira] [Comment Edited] (YARN-10207) CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI

2020-03-31 Thread Siddharth Ahuja (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071542#comment-17071542
 ] 

Siddharth Ahuja edited comment on YARN-10207 at 3/31/20, 7:50 AM:
--

Hi [~adam.antal], thanks for your comments.

The leak happens when AggregatedLogFormat.LogReader is instantiated; 
specifically, when the TFile.Reader creation within the 
AggregatedLogFormat.LogReader constructor fails due to a corrupted file being 
passed in (see the stacktrace above).

The leak is caused by the FSDataInputStream never being closed.

The caller, TFileAggregatedLogsBlock.render(…), does try to clean up the reader 
in its finally clause (see 
https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153); 
however, it assumes that the reader was created successfully. In our case, the 
reader is never created because the constructor itself fails on the corrupted 
log.

The fix, therefore, is to catch any IOException inside the 
AggregatedLogFormat.LogReader constructor itself, close all the relevant 
resources (including the FSDataInputStream) and rethrow the exception to the 
caller (TFileAggregatedLogsBlock.render) so that it can still catch and log it 
(https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150).

This ensures that we don't leak connections whenever the reader fails to 
instantiate (= new AggregatedLogFormat.LogReader).
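
To make the intended pattern concrete, here is a minimal sketch (the class and 
field names are simplified placeholders, not the actual 
AggregatedLogFormat.LogReader code):
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.file.tfile.TFile;

// Sketch only: close the underlying stream if TFile.Reader construction fails,
// then rethrow so the caller can still catch and log the error.
class SafeTFileLogReader {
  private final FSDataInputStream fsDataIStream;
  private final TFile.Reader reader;

  SafeTFileLogReader(Configuration conf, Path remoteAppLogFile)
      throws IOException {
    FileContext fc = FileContext.getFileContext(remoteAppLogFile.toUri(), conf);
    FSDataInputStream in = fc.open(remoteAppLogFile);
    try {
      // Throws "java.io.IOException: Not a valid BCFile." for a corrupted log.
      this.reader = new TFile.Reader(in,
          fc.getFileStatus(remoteAppLogFile).getLen(), conf);
    } catch (IOException e) {
      // Without this close, the stream (and its socket) would be left in
      // CLOSE_WAIT - exactly the leak described above.
      IOUtils.closeStream(in);
      throw e;
    }
    this.fsDataIStream = in;
  }
}
{code}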

Based on your feedback, I performed functional testing with IndexedFormat 
(IFile) by setting the following properties inside yarn-site.xml:
{code}
    <property>
        <name>yarn.log-aggregation.file-formats</name>
        <value>IndexedFormat</value>
    </property>
    <property>
        <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
        <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
    </property>
    <property>
        <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
        <value>/tmp/ifilelogs</value>
    </property>
    <property>
        <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
        <value>ifilelogs</value>
    </property>
{code}

As in the earlier scenario, I corrupted the IFile (the aggregated log in HDFS) 
and tried to render it in the JHS Web UI; however, no leaks were observed in 
this case.

This is the call flow:

IndexedFileAggregatedLogsBlock.render() -> 
LogAggregationIndexedFileController.loadIndexedLogsMeta(…)

An IOException is encountered inside this try block; however, note the finally 
clause here -> 
https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900.
 This helps cleaning up the socket connection by closing out FSDataInputStream. 
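
For comparison, the IFile path boils down to the plain try/finally pattern 
around the stream, roughly like this (illustrative names only, not the real 
loadIndexedLogsMeta signature):
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

class IndexedLogsMetaSketch {
  static void readIndexedLogsMeta(FileContext fc, Path remoteLogPath)
      throws IOException {
    FSDataInputStream in = fc.open(remoteLogPath);
    try {
      // Stand-in for parsing the IFile meta; throws IOException on corruption.
      in.readLong();
    } finally {
      // Runs on success and failure alike, so the socket never stays in
      // CLOSE_WAIT.
      IOUtils.closeStream(in);
    }
  }
}
{code}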
 

You will notice that this is a different call stack from the TFile case, since 
there is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently.
Regardless, thanks to that finally clause, the connection does get cleaned up 
and there are no CLOSE_WAIT leaks when a corrupted log file is encountered. 
(The bad part is that only a WARN message is presented to the user in the JHS 
logs when rendering fails for IFile logs, and no stacktrace from the exception 
is logged here - 
https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136
 - as the exception is just swallowed inside the catch{} clause. This may 
warrant a separate JIRA.)
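
If that separate JIRA gets filed, the change would presumably be along these 
lines (sketch only; the logger and the logEntity parameter are placeholders, 
not the actual IndexedFileAggregatedLogsBlock members):
{code}
import java.io.IOException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class WarnWithStacktraceSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(WarnWithStacktraceSketch.class);

  static void logRenderFailure(String logEntity, IOException e) {
    // Passing the exception as the last argument makes the full stacktrace
    // appear in the JHS logs instead of just a bare WARN message.
    LOG.warn("Failed to render aggregated logs for " + logEntity, e);
  }
}
{code}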

As part of this fix, I looked for any other occurrences of "new TFile.Reader" 
that may cause connection leaks somewhere else. I found two:
# TFileDumper, see 
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103,
 and,
# FileSystemApplicationHistoryStore, see 
https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691

Occurrence 1 is not an issue because the FSDataInputStream is closed inside the 
finally{} clause here: 
