[ 
https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21571:
------------------------------------
    Description: 
We have noticed that history server logs are sometimes never cleaned up.  The 
current history server logic *ONLY* cleans up history files that are 
completed, since in general it doesn't make sense to clean up in-progress 
history files (after all, the job is presumably still running).  Note that 
in-progress history files would generally not be targeted for clean up anyway, 
assuming the application regularly flushes its logs and the file system 
accurately updates the history log's last modified time and size; while this 
is likely, it is not guaranteed behavior.
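
A minimal sketch of the selection rule described above (this is not the actual 
FsHistoryProvider code, just an illustration of the behavior): only completed 
attempts whose logs are older than spark.history.fs.cleaner.maxAge are ever 
deleted, so an .inprogress log is never a candidate no matter how stale it is.

{code:scala}
import org.apache.hadoop.fs.FileStatus

// Illustrative only: "completed" would come from the provider's own tracking,
// maxAgeMs from spark.history.fs.cleaner.maxAge.
def isCleanupCandidate(status: FileStatus, completed: Boolean, maxAgeMs: Long): Boolean = {
  val expired = System.currentTimeMillis() - status.getModificationTime > maxAgeMs
  completed && expired   // in-progress logs are never deleted, however old they are
}
{code}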

As a consequence of the current clean up logic, combined with unclean 
shutdowns, various file system bugs, earlier Spark bugs, etc., we have 
accumulated thousands of these dead history files associated with jobs that 
are long gone.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw----   3 xxxxxx       ooooooo  14382 2016-09-13 15:40 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq1974_ppppppppppp-8812_110586000000195_dev4384_jjjjjjjjjjjj-53982.zstandard
-rw-rw----   3 xxxx         ooooooo   5933 2016-11-01 20:16 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq2016_ppppppppppp-8812_126507000000673_dev5365_jjjjjjjjjjjj-65313.lz4
-rw-rw----   3 yyy          ooooooo      0 2017-01-19 11:59 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0057_zzzz326_mmmmmmmmm-57863.lz4.inprogress
-rw-rw----   3 xxxxxxxxx    ooooooo      0 2017-01-19 14:17 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0063_zzzz688_mmmmmmmmm-33246.lz4.inprogress
-rw-rw----   3 yyy          ooooooo      0 2017-01-20 10:56 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1030_zzzz326_mmmmmmmmm-45195.lz4.inprogress
-rw-rw----   3 xxxxxxxxxxxx ooooooo  11955 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1314_wwww54_kkkkkkkkkkkkkk-64671.lz4.inprogress
-rw-rw----   3 xxxxxxxxxxxx ooooooo  11958 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1315_wwww1667_kkkkkkkkkkkkkk-58968.lz4.inprogress
-rw-rw----   3 xxxxxxxxxxxx ooooooo  11960 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1316_wwww54_kkkkkkkkkkkkkk-48058.lz4.inprogress

Based on the current logic, clean up candidates are skipped in several cases 
(see the sketch after this list):
1. if a file has 0 bytes, it is completely ignored
2. if a file is in progress and not parseable (the appID can't be extracted), 
it is completely ignored
3. if a file is complete but not parseable (the appID can't be extracted), it 
is completely ignored
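
The following is an illustrative sketch of that skip behavior, not the real 
FsHistoryProvider code; extractAppId stands in for whatever parsing the 
provider actually does.  Entries that are empty or unparseable never enter the 
cleaner's candidate list, so they are kept forever.

{code:scala}
import org.apache.hadoop.fs.FileStatus

// extractAppId is a hypothetical helper: returns the appID if the log can be parsed.
def everConsideredForCleanup(status: FileStatus,
                             extractAppId: FileStatus => Option[String]): Boolean = {
  if (status.getLen == 0) {
    false                                  // case (1): 0-byte file, ignored
  } else {
    extractAppId(status) match {
      case None    => false                // cases (2) and (3): appID can't be extracted, ignored
      case Some(_) => true                 // only parseable, attributable logs are tracked
    }
  }
}
{code}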

To address these edge cases and provide a way to clean out orphaned history 
files, I propose a new configuration option:

spark.history.fs.cleaner.aggressive={true, false}, default is false.

If true, the history server will more aggressively garbage collect history 
files in cases (1), (2) and (3).  Since the default is false, existing 
customers won't be affected unless they explicitly opt in.  Customers who 
accumulate similar garbage over time would then have the option of cleaning it 
up aggressively.  Also note that aggressive clean up may not be appropriate 
for some customers, for example if they have long-running jobs that exceed the 
cleaner.maxAge time frame and/or buggy file systems.
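
A minimal sketch of how the proposed option might be wired in (the flag name 
is the one proposed above and does not exist in Spark today; isExpired and 
looksOrphaned are hypothetical helpers standing for "older than 
cleaner.maxAge" and "matches cases (1)-(3)"):

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem}
import org.apache.spark.SparkConf

def maybeAggressiveClean(conf: SparkConf, fs: FileSystem, status: FileStatus,
                         isExpired: FileStatus => Boolean,
                         looksOrphaned: FileStatus => Boolean): Unit = {
  // Opt-in only: with the default (false) the cleaner behaves exactly as today.
  val aggressive = conf.getBoolean("spark.history.fs.cleaner.aggressive", false)
  if (aggressive && isExpired(status) && looksOrphaned(status)) {
    fs.delete(status.getPath, true)        // also remove 0-byte / unparseable logs
  }
}
{code}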

Would like to get feedback on whether this seems like a reasonable solution.



> Spark history server leaves incomplete or unreadable history files around 
> forever.
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-21571
>                 URL: https://issues.apache.org/jira/browse/SPARK-21571
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.2.0
>            Reporter: Eric Vandenberg
>            Priority: Minor
>


