[
https://issues.apache.org/jira/browse/KYLIN-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835418#comment-17835418
]
pengfei.zhan commented on KYLIN-5789:
-
h1. Design
!KYLIN_5789.png!
h2. Store the root path:
Default Configuration
{code:java}
kylin.engine.spark-conf.spark.history.fs.logDirectory=${kylin.env.hdfs-working-dir}/spark-history
kylin.engine.spark-conf.spark.eventLog.dir=${kylin.env.hdfs-working-dir}/spark-history
kylin.storage.columnar.spark-conf.spark.eventLog.dir=${kylin.env.hdfs-working-dir}/sparder-history
kylin.storage.columnar.spark-conf.spark.eventLog.rolling.enabled=true
kylin.storage.columnar.spark-conf.spark.eventLog.rolling.maxFileSize=100m
{code}
sparder: ${kylin.storage.columnar.spark-conf.spark.eventLog.dir}/hostname_port
build: ${kylin.engine.spark-conf.spark.eventLog.dir}
The Spark history directory for build jobs supports project-level configuration.
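The directory layout above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, the placeholder substitution is a naive single pass, and it assumes that hostname_port is literally the host name and port joined by an underscore.

```java
import java.util.Map;

// Hypothetical helper, not part of Kylin: shows how the two root paths could be derived.
final class EventLogDirs {
    // Replace each ${key} placeholder with its value from props (single pass, no nesting).
    static String resolve(String value, Map<String, String> props) {
        String out = value;
        for (Map.Entry<String, String> e : props.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    // Sparder writes under <eventLog.dir>/<hostname>_<port> (assumed separator: underscore).
    static String sparderDir(String eventLogDir, String hostname, int port) {
        return eventLogDir + "/" + hostname + "_" + port;
    }
}
```

For example, with kylin.env.hdfs-working-dir set to hdfs://ns/kylin, a node node1 started on port 7070 would write Sparder event logs under hdfs://ns/kylin/sparder-history/node1_7070 in this sketch.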
h2. Storage Format
*Sparder:* Related default parameter:
kylin.storage.columnar.spark-conf.spark.eventLog.rolling.enabled=true
Sparder enables rolling event logs by default, which creates a directory for each Spark
application to store its event logs.
The folder name for event logs has the format eventlog_v2_\{appId}.
Each event log file in that folder is named events_\{file_index}_\{appid}_\{timestamp}.
While Sparder is still running, the folder contains an empty file
appstatus_\{appId}.inprogress.
When Sparder finishes normally, the .inprogress suffix is removed.
*Job:* The Spark event log for each build task is saved in a single file, and
the .inprogress suffix marks an event log that is not yet complete.
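The naming conventions in this section can be checked with a few string helpers. The sketch below is illustrative only: the class and method names are hypothetical, and it relies solely on the prefixes and suffix described above.

```java
// Hypothetical helper, not part of Kylin: recognizes the names described in this section.
final class EventLogNames {
    static final String ROLLING_DIR_PREFIX = "eventlog_v2_";
    static final String IN_PROGRESS_SUFFIX = ".inprogress";

    // True for a per-application folder created when rolling is enabled.
    static boolean isRollingDir(String name) {
        return name.startsWith(ROLLING_DIR_PREFIX);
    }

    // Everything after the eventlog_v2_ prefix is treated as the application id.
    static String appIdOfRollingDir(String name) {
        return name.substring(ROLLING_DIR_PREFIX.length());
    }

    // True while the application (or build event log) has not completed.
    static boolean isInProgress(String name) {
        return name.endsWith(IN_PROGRESS_SUFFIX);
    }

    // Name of the status marker file for a running or finished application.
    static String statusFileName(String appId, boolean finished) {
        String base = "appstatus_" + appId;
        return finished ? base : base + IN_PROGRESS_SUFFIX;
    }
}
```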
h2. Cleanup Strategy:
Build task cleanup time threshold:
kylin.garbage.storage.executable-survival-time-threshold, default 30d
Query history cleanup time threshold:
kylin.query.queryhistory.survival-time-threshold, default 30d
h2. Scheduler task
For the query event log, each node running in job or all mode performs the cleanup task
periodically. The global node broadcasts a request to clean up the Sparder
event log to all query nodes
(http://ip:port/kylin/api/system/clean_sparder_eventslogs). Each KE node cleans
up only the Sparder event files under the directory for its own startup port,
that is, ${kylin.storage.columnar.spark-conf.spark.eventLog.dir}/${hostname_port}.
Within that directory, if a folder name starts with eventlog_v2, all files in
that folder are deleted when lastmodifytime < min(configured time threshold,
the end time of the first queryhistory).
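The deletion rule for Sparder event logs can be sketched as a small predicate. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and "configured time threshold" is interpreted as the current time minus kylin.query.queryhistory.survival-time-threshold, which the text above does not spell out.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Kylin: decides whether a Sparder event log folder is expired.
final class SparderEventLogCleanup {
    // cutoff = min(now - survival time, end time of the first queryhistory)
    static Instant cutoff(Instant now, Duration survival, Instant firstQueryHistoryEnd) {
        Instant byThreshold = now.minus(survival);
        return byThreshold.isBefore(firstQueryHistoryEnd) ? byThreshold : firstQueryHistoryEnd;
    }

    // Only folders starting with eventlog_v2 and last modified before the cutoff qualify.
    static boolean shouldDelete(String folderName, Instant lastModified, Instant cutoff) {
        return folderName.startsWith("eventlog_v2") && lastModified.isBefore(cutoff);
    }
}
```

Taking the minimum of the two instants means a folder is kept whenever either the survival window or the oldest retained query history could still refer to it.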
For the build event log, the cleanup iterates through the project-level configuration
of all projects, ${kylin.engine.spark-conf.spark.eventLog.dir}/spark-history.
Files in this directory whose names start with application_ are deleted
when lastmodifytime < min(configured time threshold, the end time of the
earliest build).
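The build-side filter can be sketched in the same way. The names below are hypothetical, the file listing is modeled as a plain map from file name to last-modified time instead of a real file system call, and the cutoff is passed in already computed.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Kylin: selects expired build event logs.
final class BuildEventLogCleanup {
    // Projects may share a directory, so collect each configured directory only once.
    static Set<String> distinctDirs(Map<String, String> eventLogDirByProject) {
        return new TreeSet<>(eventLogDirByProject.values());
    }

    // Return, in sorted order, the application_* files last modified before the cutoff.
    static List<String> expired(Map<String, Instant> lastModifiedByName, Instant cutoff) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Instant> e : lastModifiedByName.entrySet()) {
            if (e.getKey().startsWith("application_") && e.getValue().isBefore(cutoff)) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```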
h2. FastRoutineTool
For the query event log, since the command line tool is directly related to the
port on which KE starts, it cleans up the files in the
${kylin.storage.columnar.spark-conf.spark.eventLog.dir} directory: all files in
folders whose names start with hostname_port are deleted when lastmodifytime <
min(configured time threshold, the end time of the first queryhistory).
For the build event log, the tool iterates through the project-level configuration
of all projects, ${kylin.engine.spark-conf.spark.eventLog.dir}/spark-history.
Files in this directory whose names start with application_ are deleted
when lastmodifytime < min(configured time threshold, the end time of the
earliest build).
h2. RoutineTool
For the query event log, since the command line tool is not directly related to
the port on which KE starts, it cleans up the files in the
${kylin.storage.columnar.spark-conf.spark.eventLog.dir} directory: all files in
folders whose names start with hostname_port are deleted when lastmodifytime <
min(configured time threshold, the end time of the first queryhistory).
For the build event log, the tool iterates through the project-level configuration
of all projects, ${kylin.engine.spark-conf.spark.eventLog.dir}/spark-history.
Files in this directory whose names start with application_ are deleted
when lastmodifytime < min(configured time threshold, the end time of the
earliest build).
> Clean sparder history and spark history automatically
> -
>
> Key: KYLIN-5789
> URL: https://issues.apache.org/jira/browse/KYLIN-5789
>