[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454800#comment-17454800 ]

Apache Spark commented on SPARK-23607:
--------------------------------------

User 'thejdeep' has created a pull request for this issue:
https://github.com/apache/spark/pull/34829


> Use HDFS extended attributes to store application summary to improve the
> Spark History Server performance
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23607
>                 URL: https://issues.apache.org/jira/browse/SPARK-23607
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 2.3.0
>            Reporter: Ye Zhou
>            Priority: Minor
>              Labels: bulk-closed
>
> Currently, the Spark History Server's checkForLogs thread creates a replay
> task for every log file whose size has changed. The replay task filters out
> most of the log file content and keeps only the application summary:
> applicationId, user, attempt ACLs, start time, and end time. This summary
> data is written to listing.ldb and serves the application list on the SHS
> home page. For a long-running application, the log file (whose name ends
> with "inprogress") is replayed multiple times just to extract this summary.
> This wastes compute and data-reading resources on the SHS and delays the
> application's appearance on the home page.
> Internally we have a patch that uses HDFS extended attributes to speed up
> retrieval of the application summary in the SHS. With this patch, the driver
> writes the application summary into extended attributes as key/value pairs.
> The SHS first tries to read the extended attributes; if that fails, it falls
> back to reading the log file content as usual. The feature can be enabled or
> disabled through configuration.
> This patch has been running fine internally for 4 months, and the
> last-updated timestamp on the SHS stays within 1 minute, matching the
> 1-minute interval we configured. Previously the delay could be as long as
> 30 minutes at our scale, where a large number of Spark applications run
> per day.
> We want to see whether this kind of approach is also acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.


--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
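The write-to-xattr / read-with-fallback flow described in the issue can be sketched as follows. This is a minimal illustrative sketch, not the actual patch: real Spark and SHS code is Scala against the Hadoop FileSystem xattr API, while this sketch stands in with Python's `os.setxattr`/`os.getxattr` on a local file. The attribute name `user.spark.appsummary` and the JSON summary shape are hypothetical; the `SparkListenerApplicationStart` event keys (`App ID`, `User`) follow the Spark event-log JSON format.

```python
import json
import os

XATTR_KEY = "user.spark.appsummary"  # hypothetical attribute name


def write_summary(log_path, summary):
    """Driver side: attach the summary to the log file as an extended attribute."""
    try:
        os.setxattr(log_path, XATTR_KEY, json.dumps(summary).encode())
        return True
    except (OSError, AttributeError):
        # Filesystem (or OS) without xattr support; the reader will fall back.
        return False


def read_summary(log_path):
    """SHS side: prefer the extended attribute; fall back to replaying the log."""
    try:
        return json.loads(os.getxattr(log_path, XATTR_KEY))
    except (OSError, AttributeError):
        pass  # attribute missing or unsupported -> replay the log content as usual
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerApplicationStart":
                return {"appId": event["App ID"], "user": event["User"]}
    return None
```

The fallback path is what makes the feature safe to enable: a reader that cannot find the attribute behaves exactly as the SHS does today, so the xattr is purely an optimization.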
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454797#comment-17454797 ]

Thejdeep Gudivada commented on SPARK-23607:
-------------------------------------------

Posted a preview PR for this; will be adding tests for it.
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106562#comment-17106562 ]

Zirui Li commented on SPARK-23607:
----------------------------------

Hi [~zhouyejoe], wondering whether you have any plan to post the PR? Thanks.
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396193#comment-16396193 ]

Ye Zhou commented on SPARK-23607:
---------------------------------

[~vanzin] Cool. I will post a PR soon. Thanks.
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388230#comment-16388230 ]

Marcelo Vanzin commented on SPARK-23607:
----------------------------------------

I think this is a nice trick to speed things up, even though it only works for HDFS. I have some ideas for a more generic speed-up in this code; I just haven't had the time to sit down and try them out. This could help in the meantime.
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387126#comment-16387126 ]

Ye Zhou commented on SPARK-23607:
---------------------------------

[~vanzin] Any comments? Thanks.