[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2021-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454800#comment-17454800
 ] 

Apache Spark commented on SPARK-23607:
--

User 'thejdeep' has created a pull request for this issue:
https://github.com/apache/spark/pull/34829

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
>  Labels: bulk-closed
>
> Currently, the Spark History Server's checkForLogs thread creates replay
> tasks for log files whose size has changed. Each replay task filters out
> most of the log file content and keeps only the application summary,
> including applicationId, user, attemptACL, start time, and end time. The
> application summary data is written into listing.ldb and serves the
> application list on the SHS home page. For a long-running application, the
> log file whose name ends with "inprogress" is replayed multiple times just
> to extract this application summary. This wastes computing and data-reading
> resources on the SHS and delays applications showing up on the home page.
> Internally we have a patch that uses HDFS extended attributes to speed up
> retrieving the application summary in the SHS. With this patch, the Driver
> writes the application summary information into the extended attributes as
> key/value pairs. The SHS first tries to read from the extended attributes;
> if that fails, it falls back to reading the log file content as usual. The
> feature can be enabled or disabled through configuration.
> This patch has been running fine internally for 4 months, and the last
> updated timestamp on the SHS stays within 1 minute, since we configure the
> refresh interval to 1 minute. Previously the delay could be as long as 30
> minutes at our scale, where a large number of Spark applications run per
> day.
> We want to see whether this kind of approach is also acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.
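
As a rough sketch of the driver-side write described above (the helper name,
attribute names, and log path below are hypothetical, not taken from the
actual patch), the summary fields could be stored with the standard Hadoop
FileSystem xattr API:

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical helper: store each summary field as an extended attribute
    // in the "user." namespace on the application's event log file.
    def writeSummaryXAttrs(fs: FileSystem, logPath: Path, summary: Map[String, String]): Unit = {
      summary.foreach { case (key, value) =>
        fs.setXAttr(logPath, s"user.spark.summary.$key", value.getBytes(StandardCharsets.UTF_8))
      }
    }

    // Example usage with illustrative values.
    val fs = FileSystem.get(new Configuration())
    writeSummaryXAttrs(
      fs,
      new Path("/spark-history/app-20211207120000-0001.inprogress"),
      Map(
        "applicationId" -> "app-20211207120000-0001",
        "user"          -> "spark",
        "startTime"     -> "1638878400000"))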



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2021-12-07 Thread Thejdeep Gudivada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454797#comment-17454797
 ] 

Thejdeep Gudivada commented on SPARK-23607:
---

Posted a preview PR for this; I will be adding tests for it.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
>  Labels: bulk-closed
>
> Currently, the Spark History Server's checkForLogs thread creates replay
> tasks for log files whose size has changed. Each replay task filters out
> most of the log file content and keeps only the application summary,
> including applicationId, user, attemptACL, start time, and end time. The
> application summary data is written into listing.ldb and serves the
> application list on the SHS home page. For a long-running application, the
> log file whose name ends with "inprogress" is replayed multiple times just
> to extract this application summary. This wastes computing and data-reading
> resources on the SHS and delays applications showing up on the home page.
> Internally we have a patch that uses HDFS extended attributes to speed up
> retrieving the application summary in the SHS. With this patch, the Driver
> writes the application summary information into the extended attributes as
> key/value pairs. The SHS first tries to read from the extended attributes;
> if that fails, it falls back to reading the log file content as usual. The
> feature can be enabled or disabled through configuration.
> This patch has been running fine internally for 4 months, and the last
> updated timestamp on the SHS stays within 1 minute, since we configure the
> refresh interval to 1 minute. Previously the delay could be as long as 30
> minutes at our scale, where a large number of Spark applications run per
> day.
> We want to see whether this kind of approach is also acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.
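
A minimal sketch of the read-with-fallback behaviour described above, under
the same assumptions (hypothetical attribute names; replayEventLog stands in
for the history server's existing replay logic):

    import java.nio.charset.StandardCharsets
    import scala.util.Try
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Placeholder standing in for the history server's existing replay-based path.
    def replayEventLog(logPath: Path): Map[String, String] = ???

    // Try the extended attribute first; on any failure (attribute missing, or a
    // filesystem without xattr support) fall back to replaying the log file.
    def readSummaryField(fs: FileSystem, logPath: Path, key: String): String =
      Try(new String(fs.getXAttr(logPath, s"user.spark.summary.$key"), StandardCharsets.UTF_8))
        .getOrElse(replayEventLog(logPath)(key))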



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2020-05-13 Thread Zirui Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106562#comment-17106562
 ] 

Zirui Li commented on SPARK-23607:
--

Hi [~zhouyejoe], I was wondering whether you have any plan to post the PR? Thanks.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
>  Labels: bulk-closed
>
> Currently, the Spark History Server's checkForLogs thread creates replay
> tasks for log files whose size has changed. Each replay task filters out
> most of the log file content and keeps only the application summary,
> including applicationId, user, attemptACL, start time, and end time. The
> application summary data is written into listing.ldb and serves the
> application list on the SHS home page. For a long-running application, the
> log file whose name ends with "inprogress" is replayed multiple times just
> to extract this application summary. This wastes computing and data-reading
> resources on the SHS and delays applications showing up on the home page.
> Internally we have a patch that uses HDFS extended attributes to speed up
> retrieving the application summary in the SHS. With this patch, the Driver
> writes the application summary information into the extended attributes as
> key/value pairs. The SHS first tries to read from the extended attributes;
> if that fails, it falls back to reading the log file content as usual. The
> feature can be enabled or disabled through configuration.
> This patch has been running fine internally for 4 months, and the last
> updated timestamp on the SHS stays within 1 minute, since we configure the
> refresh interval to 1 minute. Previously the delay could be as long as 30
> minutes at our scale, where a large number of Spark applications run per
> day.
> We want to see whether this kind of approach is also acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2018-03-12 Thread Ye Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396193#comment-16396193
 ] 

Ye Zhou commented on SPARK-23607:
-

[~vanzin] Cool. I will post a PR soon. Thanks.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
>
> Currently, the Spark History Server's checkForLogs thread creates replay
> tasks for log files whose size has changed. Each replay task filters out
> most of the log file content and keeps only the application summary,
> including applicationId, user, attemptACL, start time, and end time. The
> application summary data is written into listing.ldb and serves the
> application list on the SHS home page. For a long-running application, the
> log file whose name ends with "inprogress" is replayed multiple times just
> to extract this application summary. This wastes computing and data-reading
> resources on the SHS and delays applications showing up on the home page.
> Internally we have a patch that uses HDFS extended attributes to speed up
> retrieving the application summary in the SHS. With this patch, the Driver
> writes the application summary information into the extended attributes as
> key/value pairs. The SHS first tries to read from the extended attributes;
> if that fails, it falls back to reading the log file content as usual. The
> feature can be enabled or disabled through configuration.
> This patch has been running fine internally for 4 months, and the last
> updated timestamp on the SHS stays within 1 minute, since we configure the
> refresh interval to 1 minute. Previously the delay could be as long as 30
> minutes at our scale, where a large number of Spark applications run per
> day.
> We want to see whether this kind of approach is also acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.
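
The enable/disable switch mentioned in the description might look roughly like
the following; the configuration key and helper names are purely illustrative,
since no actual key is given here:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkConf

    // Placeholders standing in for the two code paths sketched in earlier comments.
    def readSummaryFromXAttrs(fs: FileSystem, logPath: Path): Map[String, String] = ???
    def replayEventLog(logPath: Path): Map[String, String] = ???

    // Hypothetical flag name; the description does not give the actual key.
    val conf = new SparkConf()
    val useXAttrs = conf.getBoolean("spark.history.fs.eventLog.xattr.enabled", defaultValue = false)

    def loadSummary(fs: FileSystem, logPath: Path): Map[String, String] =
      if (useXAttrs) readSummaryFromXAttrs(fs, logPath)  // xattr-first path
      else replayEventLog(logPath)                       // current replay-only behaviour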



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2018-03-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388230#comment-16388230
 ] 

Marcelo Vanzin commented on SPARK-23607:


I think this is a nice trick to speed things up, even though it only works for
HDFS. I have some ideas for a more generic speed-up in this code; I just
haven't had the time to sit down and try them out, but this could help in the
meantime.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the Spark History Server's checkForLogs thread creates replay
> tasks for log files whose size has changed. Each replay task filters out
> most of the log file content and keeps only the application summary,
> including applicationId, user, attemptACL, start time, and end time. The
> application summary data is written into listing.ldb and serves the
> application list on the SHS home page. For a long-running application, the
> log file whose name ends with "inprogress" is replayed multiple times just
> to extract this application summary. This wastes computing and data-reading
> resources on the SHS and delays applications showing up on the home page.
> Internally we have a patch that uses HDFS extended attributes to speed up
> retrieving the application summary in the SHS. With this patch, the Driver
> writes the application summary information into the extended attributes as
> key/value pairs. The SHS first tries to read from the extended attributes;
> if that fails, it falls back to reading the log file content as usual. The
> feature can be enabled or disabled through configuration.
> This patch has been running fine internally for 4 months, and the last
> updated timestamp on the SHS stays within 1 minute, since we configure the
> refresh interval to 1 minute. Previously the delay could be as long as 30
> minutes at our scale, where a large number of Spark applications run per
> day.
> We want to see whether this kind of approach is also acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2018-03-05 Thread Ye Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387126#comment-16387126
 ] 

Ye Zhou commented on SPARK-23607:
-

[~vanzin] Any comments? Thanks.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the Spark History Server's checkForLogs thread creates replay
> tasks for log files whose size has changed. Each replay task filters out
> most of the log file content and keeps only the application summary,
> including applicationId, user, attemptACL, start time, and end time. The
> application summary data is written into listing.ldb and serves the
> application list on the SHS home page. For a long-running application, the
> log file whose name ends with "inprogress" is replayed multiple times just
> to extract this application summary. This wastes computing and data-reading
> resources on the SHS and delays applications showing up on the home page.
> Internally we have a patch that uses HDFS extended attributes to speed up
> retrieving the application summary in the SHS. With this patch, the Driver
> writes the application summary information into the extended attributes as
> key/value pairs. The SHS first tries to read from the extended attributes;
> if that fails, it falls back to reading the log file content as usual. The
> feature can be enabled or disabled through configuration.
> This patch has been running fine internally for 4 months, and the last
> updated timestamp on the SHS stays within 1 minute, since we configure the
> refresh interval to 1 minute. Previously the delay could be as long as 30
> minutes at our scale, where a large number of Spark applications run per
> day.
> We want to see whether this kind of approach is also acceptable to the
> community. Please comment. If so, I will post a pull request for the
> changes. Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org