[jira] [Created] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

Ye Zhou (JIRA) Mon, 05 Mar 2018 17:41:29 -0800

Ye Zhou created SPARK-23607:
-------------------------------

             Summary: Use HDFS extended attributes to store application summary 
to improve the Spark History Server performance
                 Key: SPARK-23607
                 URL: https://issues.apache.org/jira/browse/SPARK-23607
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, Web UI
    Affects Versions: 2.3.0
            Reporter: Ye Zhou
             Fix For: 2.4.0



Currently, the checkForLogs thread in the Spark History Server (SHS) creates a 
replay task for every log file whose size has changed. Each replay task filters 
out most of the log content and keeps only the application summary: 
applicationId, user, attemptACL, start time, and end time. The summary is 
written to listing.ldb and is used to serve the application list on the SHS 
home page. For a long-running application, the log file whose name ends with 
"inprogress" is replayed many times just to obtain this summary. This wastes 
compute and I/O on the SHS and delays applications from showing up on the home 
page.

Internally, we have a patch that uses HDFS extended attributes to speed up 
retrieval of the application summary in SHS. With this patch, the driver writes 
the application summary into extended attributes as key/value pairs, and SHS 
tries to read the summary from them. If that read fails, SHS falls back to 
reading the log file content as usual. The feature can be enabled or disabled 
through configuration.
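
The read-with-fallback flow described above can be sketched roughly as follows. 
This is only an illustration, not the actual patch: FakeFs is an in-memory 
stand-in for a file system with extended attributes, and the attribute name 
SUMMARY_XATTR, the JSON encoding, and the functions write_summary and 
load_summary are all hypothetical. On real HDFS the equivalent calls would be 
FileSystem.setXAttr and FileSystem.getXAttr, and attribute names carry a 
namespace prefix such as "user.".

```python
import json
from typing import Callable, Dict, Optional, Tuple

# Hypothetical attribute name, chosen for this sketch only.
SUMMARY_XATTR = "user.spark.appSummary"


class FakeFs:
    """In-memory stand-in for a file system that supports extended attributes."""

    def __init__(self) -> None:
        self._xattrs: Dict[str, Dict[str, bytes]] = {}

    def set_xattr(self, path: str, name: str, value: bytes) -> None:
        self._xattrs.setdefault(path, {})[name] = value

    def get_xattr(self, path: str, name: str) -> Optional[bytes]:
        return self._xattrs.get(path, {}).get(name)


def write_summary(fs: FakeFs, path: str, summary: dict) -> None:
    """Driver side: store the application summary on the log file itself."""
    fs.set_xattr(path, SUMMARY_XATTR, json.dumps(summary).encode("utf-8"))


def load_summary(
    fs: FakeFs,
    path: str,
    replay: Callable[[str], dict],
    enabled: bool = True,
) -> Tuple[dict, str]:
    """SHS side: prefer the extended attribute, fall back to replaying the log.

    Returns the summary and the source it came from ("xattr" or "replay").
    """
    if enabled:
        try:
            raw = fs.get_xattr(path, SUMMARY_XATTR)
            if raw is not None:
                return json.loads(raw.decode("utf-8")), "xattr"
        except (OSError, ValueError):
            # Unreadable or corrupt attribute: fall through to the replay path.
            pass
    return replay(path), "replay"
```

With this shape, a log file that has no summary attribute (for example, one 
written by a driver without the patch) still goes through the existing replay 
path, so the two mechanisms can coexist.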

We have been running this patch internally for 4 months without problems, and 
the last-updated timestamp on SHS stays within 1 minute, matching the 1-minute 
interval we configure. Previously we saw delays of up to 30 minutes at our 
scale, where a large number of Spark applications run per day.

We would like to know whether this approach is acceptable to the community. 
Please comment. If so, I will post a pull request with the changes. Thanks.



