[
https://issues.apache.org/jira/browse/YARN-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan updated YARN-11188:
------------------------------
Target Version/s: 3.4.0
Affects Version/s: 3.4.0
> Only files belong to the first file controller are removed even if multiple
> log aggregation file controllers are configured
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-11188
> URL: https://issues.apache.org/jira/browse/YARN-11188
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation
> Affects Versions: 3.4.0
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Log aggregation can be configured to have a comma-separated list of file
> controllers.
> The current behaviour only removes files that belong to the first file
> controller.
> This is a problem whenever the configured list of file controllers changes.
> For example, if a user configures IFile as the file controller and later
> changes the configuration to specify multiple file controllers (e.g.
> value = TFile,IFile), then only the first controller is considered and
> only the files belonging to that controller are removed. In this case, files
> written by the TFile controller are removed and the files created with
> the IFile controller are kept.
> This behaviour should be changed so that the files of all configured file
> controllers are removed when multiple file controllers are enabled.
> h2. CODE PATH
> ----
> 1.
> [AggregatedLogDeletionService$LogDeletionTask#run|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108]:
>
> Let's understand what this method does.
> 1.1 An important bit is to see how the value of the field called
> 'retentionMillis' is set. In the constructor of LogDeletionTask, there's an
> incoming parameter called 'retentionSecs' that is just multiplied by 1000 to
> have a millisecond value.
> Let's see where 'retentionSecs' is coming from.
> 1.2
> [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L283]
> sets the value of retentionSecs.
> The config key for this value is 'yarn.log-aggregation.retain-seconds'.
> The javadoc says: "How long to wait before deleting aggregated logs, -1
> disables. Be careful set this too small and you will spam the name node."
> 1.3 Going back to
> [https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108],
> the 'cutOffMillis' value is computed by getting the current time in millis
> minus the retentionMillis.
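The arithmetic from steps 1.1 to 1.3 can be sketched as follows. This is a minimal illustration, not the Hadoop source: the class and method names are hypothetical, and the retention value of 120 seconds is an assumed example.

```java
/** Sketch of the retention arithmetic from steps 1.1 to 1.3 (hypothetical names). */
final class RetentionCutoffSketch {

  /** Step 1.1: the retention in seconds is multiplied by 1000 to get milliseconds. */
  static long toRetentionMillis(long retentionSecs) {
    return retentionSecs * 1000L;
  }

  /** Step 1.3: the cutoff is the current time in millis minus the retention in millis. */
  static long cutOffMillis(long nowMillis, long retentionMillis) {
    return nowMillis - retentionMillis;
  }

  public static void main(String[] args) {
    // Assumed example value for 'yarn.log-aggregation.retain-seconds'.
    long retentionSecs = 120L;
    long now = System.currentTimeMillis();
    long cutoff = cutOffMillis(now, toRetentionMillis(retentionSecs));
    System.out.println("now=" + now + ", cutoff=" + cutoff);
  }
}
```

Anything last modified before the computed cutoff is older than the retention period.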
> 1.4 The main point of this method is to iterate over the files in the remote
> root log dir (field called 'remoteRootLogDir') and to check if it is a
> directory. If so, a new Path is created with that particular directory ([code
> link|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L90-L96]).
> One more important thing to mention: There's a field called 'suffix' that is
> added to the remote root log dir path.
> Let's check how the 'remoteRootLogDir' and 'suffix' fields get their values, as
> this is crucial to understanding how the log dirs are deleted.
> 1.5 remoteRootLogDir is set in the constructor of LogDeletionTask,
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L77].
> The value is returned by calling fileController.getRemoteRootLogDir().
> The LogAggregationFileControllerFactory creates the instance of
> LogAggregationFileController.
> ----
> *The process of determining the log aggregation file controller is quite
> messy, so let me describe it in detail.*
> *There are 2 types of file controllers: LogAggregationIndexedFileController
> and LogAggregationTFileController*
> *There's a testcase called
> [TestLogAggregationFileControllerFactory#testLogAggregationFileControllerFactory|#testLogAggregationFileControllerFactory]
> that shows how the LogAggregationFileControllerFactory is configured.*
> 2.1 First, some important configs:
> 2.1.1 Generic config key for the log aggregation file controller class:
> yarn.log-aggregation.file-controller.<controllerName>.class
> An example real-world config key:
> yarn.log-aggregation.file-controller.IFile.class
> An example real-world config value:
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController
> 2.1.2 Generic config key for the log aggregation file controller's remote app
> log dir:
> yarn.log-aggregation.<controllerName>.remote-app-log-dir
> An example real-world config key:
> yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world config value: /tmp/logs/IFile/
> 2.1.3 Generic config key for the log aggregation file controller's remote app
> log dir suffix:
> yarn.log-aggregation.<controllerName>.remote-app-log-dir-suffix
> An example real-world config key:
> yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> 2.1.4 There's one more config called 'yarn.log-aggregation.file-formats'
> that can store a comma-separated list of file controllers.
> An example value: IFile,TFile
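A small sketch of how the per-controller config keys from 2.1.1 to 2.1.4 are composed from a controller name. The key patterns are the ones listed above. The class, the method names and the trimming of whitespace around list entries are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the per-controller config keys described in 2.1.1 to 2.1.4 (hypothetical helper). */
final class FileControllerKeysSketch {

  static String classKey(String controllerName) {
    return "yarn.log-aggregation.file-controller." + controllerName + ".class";
  }

  static String remoteDirKey(String controllerName) {
    return "yarn.log-aggregation." + controllerName + ".remote-app-log-dir";
  }

  static String suffixKey(String controllerName) {
    return "yarn.log-aggregation." + controllerName + ".remote-app-log-dir-suffix";
  }

  /** Splits the value of 'yarn.log-aggregation.file-formats' into controller names. */
  static List<String> parseFileFormats(String value) {
    List<String> names = new ArrayList<>();
    for (String part : value.split(",")) {
      String trimmed = part.trim(); // assumption: surrounding whitespace is ignored
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    for (String name : parseFileFormats("IFile,TFile")) {
      System.out.println(classKey(name));
      System.out.println(remoteDirKey(name));
      System.out.println(suffixKey(name));
    }
  }
}
```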
> 2.2 Let's examine how the [LogAggregationFileControllerFactory's
> constructor|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L63-L80]
> works.
> 2.2.1 There's [an
> iteration|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L69]
> over file controllers.
> 2.2.2
> The remote app log dir per file controller is [read from the
> config|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L196-L216]
> An example for a config key: yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world value of this config: /tmp/logs/IFile/
> 2.2.3 If the specified remote app log dir is null or empty, the remote dir
> for the particular file controller falls back to the NM's log dir.
> The log dir is either specified by the config
> 'yarn.nodemanager.remote-app-log-dir' or falls back to the default path
> '/tmp/logs'.
> This logic is implemented
> [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L208-L215]
> 2.2.4 Next, the remote app log dir suffix is read
> [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L225-L232].
> Example config key: yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> If the suffix is null or empty, the suffix is read from the config key
> 'yarn.nodemanager.remote-app-log-dir-suffix'; if that is not specified
> either, the default suffix will be 'logs'.
> 2.2.5 Now that we know the remoteDir (/tmp/logs/IFile/) and the suffix (IFile),
> we just concatenate them and add a hyphen in between, so the final value will
> be: target/app-logs/IFile/-IFile [TODO]
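The fallback chain from 2.2.2 to 2.2.4 can be sketched like this. It is a simplified model under stated assumptions: a plain Map stands in for the Hadoop configuration object, and the class and method names are hypothetical. The concatenation from 2.2.5 is deliberately left out because it is still marked as TODO above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a plain Map stands in for the Hadoop configuration object. */
final class RemoteLogDirFallbackSketch {

  /** Returns the first candidate that is neither null nor empty. */
  static String firstNonEmpty(String... candidates) {
    for (String candidate : candidates) {
      if (candidate != null && !candidate.isEmpty()) {
        return candidate;
      }
    }
    throw new IllegalArgumentException("no non-empty candidate");
  }

  /** 2.2.2 and 2.2.3: controller-specific dir, then the NM's dir, then '/tmp/logs'. */
  static String remoteDir(Map<String, String> conf, String controllerName) {
    return firstNonEmpty(
        conf.get("yarn.log-aggregation." + controllerName + ".remote-app-log-dir"),
        conf.get("yarn.nodemanager.remote-app-log-dir"),
        "/tmp/logs");
  }

  /** 2.2.4: controller-specific suffix, then the NM's suffix, then 'logs'. */
  static String suffix(Map<String, String> conf, String controllerName) {
    return firstNonEmpty(
        conf.get("yarn.log-aggregation." + controllerName + ".remote-app-log-dir-suffix"),
        conf.get("yarn.nodemanager.remote-app-log-dir-suffix"),
        "logs");
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("yarn.log-aggregation.IFile.remote-app-log-dir", "/tmp/logs/IFile/");
    System.out.println(remoteDir(conf, "IFile")); // controller-specific value is set
    System.out.println(remoteDir(conf, "TFile")); // nothing set for TFile, falls back
    System.out.println(suffix(conf, "IFile"));    // no suffix configured at all
  }
}
```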
> 2.2.6 The rest of the method reads the log aggregation file controller's
> class name and initializes the controller. This is implemented
> [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L82-L95].
> An example config key for the class:
> 'yarn.log-aggregation.file-controller.IFile.class'
> An example value of this config:
> "org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController"
> 2.2.7 Next, the controller is created by creating a new instance of the class
> with reflection.
> 2.2.8 An important bit is to [initialize the
> controller|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L77]
> 2.2.9 The initialize method [is implemented in
> LogAggregationFileController|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L121-L140],
> which is an abstract base class for the file controllers.
> 2.2.10 The remote root log dir + the suffix [is
> read|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L136-L137]
> by the same config logic as described above.
> 2.2.11 As a final step, the controller instance is [added to the factory's
> controllers
> list|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L78]
> 2.2.3 Now we know how the LogAggregationFileControllerFactory works and how
> it reads the config to create and store the File controller instances.
> Let's jump back to the constructor of
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService.LogDeletionTask#LogDeletionTask.
> The file controller is determined by calling the 'getFileControllerForWrite'
> method on the LogAggregationFileControllerFactory instance,
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L75].
> 2.2.4 [The
> method|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L128]
> is quite simple: it just returns the first element of the list. So if
> multiple log aggregation file controllers were instantiated during the
> initialization (as per the config), the first instance is always
> returned here.
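The selection behaviour described above can be modelled in a few lines. This is a simplified sketch, not the factory's code: controllers are represented by their names only, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Models the "first element wins" selection described above (hypothetical names). */
final class FirstControllerSketch {

  /** Mirrors the described behaviour: always the first configured controller. */
  static String controllerForWrite(List<String> controllers) {
    return controllers.get(0);
  }

  /** The controllers that are never consulted when only the first one is used. */
  static List<String> ignoredControllers(List<String> controllers) {
    return new ArrayList<>(controllers.subList(1, controllers.size()));
  }

  public static void main(String[] args) {
    List<String> configured = Arrays.asList("TFile", "IFile");
    System.out.println("selected: " + controllerForWrite(configured));
    System.out.println("ignored:  " + ignoredControllers(configured));
  }
}
```

With the configuration value TFile,IFile from the description, only TFile is selected, which matches the reported symptom that IFile logs are never cleaned up.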
> ----
> *We need to jump back to steps 1.4 and 1.5, where the files are listed
> with the help of the abstract FileSystem implementation
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L93-L97].*
> *So we know how the values for 'remoteRootLogDir' and 'suffix' are set as
> described in detail above.*
> ----
> 1.6 Let's see what the deleteOldLogDirsFrom method does since this is the
> main call of the loop that lists the log dirs.
> [The
> method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L110-L122]
> is very simple: it accepts a Path as a parameter (which we know is a
> directory), lists the dirs in this main directory, iterates over the
> dirs and [calls
> deleteAppDirLogs|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L120].
> 1.7 The [deleteAppDirLogs
> method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L124-L165]
> is quite messy again.
> 1.7.1 Parameters are:
> cutOffMillis: The 'cutOffMillis' value is computed by getting the current
> time in millis minus the retentionMillis that is coming from the
> configuration.
> If the retention is set to 2 minutes, the calculated time will be NOW-2
> minutes in milliseconds.
> fs: The abstract FileSystem implementation
> rmClient: Not important for us right now
> appDir: The directory to clean up
> 1.7.2 The whole method only does anything useful if the directory's
> modification time < cutOffMillis. In practice, this means that only dirs
> last modified before the cutoff, i.e. older than the retention period, will
> be touched / deleted.
> 1.7.3 If the app is not terminated, we list the directory and try to remove
> the log files. Only log files whose modification time is earlier than the
> cutoff are deleted.
> [This is the
> logic|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L133-L152]
> that implements this.
> 1.7.4 [The other part of the if
> condition|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L152-L160]
> tries to delete the log dir, but only after checking that
> 'shouldDeleteLogDir' returns true.
> 1.7.5 Let's check the method
> [AggregatedLogDeletionService.LogDeletionTask#shouldDeleteLogDir|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L167-L182]:
>
> This is basically the same logic as the other retention period based logic
> that I described above.
> We set shouldDelete to true by default, then set it to false only if the
> modification date of the dir itself is later than the timestamp that is
> defined by the retention period.
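The retention checks from 1.7.2 and 1.7.5 reduce to timestamp comparisons, sketched below as described in the text. The names are hypothetical, and treating a modification time exactly equal to the cutoff as "not old enough" is an assumption derived from the '<' comparison in 1.7.2.

```java
/** Sketch of the retention checks from 1.7.2 and 1.7.5 (hypothetical names). */
final class RetentionCheckSketch {

  /** 1.7.2: an entry is only touched if it was last modified before the cutoff. */
  static boolean isOlderThanCutoff(long modificationTime, long cutOffMillis) {
    return modificationTime < cutOffMillis;
  }

  /** 1.7.5: true by default, false if the dir was modified at or after the cutoff. */
  static boolean shouldDeleteLogDir(long dirModificationTime, long cutOffMillis) {
    boolean shouldDelete = true;
    if (dirModificationTime >= cutOffMillis) { // assumption: equality counts as too recent
      shouldDelete = false;
    }
    return shouldDelete;
  }

  public static void main(String[] args) {
    long cutoff = System.currentTimeMillis() - 120_000L; // assumed 2-minute retention
    long threeMinutesAgo = System.currentTimeMillis() - 180_000L;
    System.out.println(isOlderThanCutoff(threeMinutesAgo, cutoff));
    System.out.println(shouldDeleteLogDir(threeMinutesAgo, cutoff));
  }
}
```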
> ----
> h2. CONCLUSION
> *We just checked the implementation of how the log aggregation file
> controllers are instantiated and configured.*
> *Just by reading the code + the logic, I think reading / parsing the
> configuration is okay.*
> *What really bothers me is how the file controller instance is getting
> created by the factory (step 2.2.3).*
> *If multiple log aggregation file controllers (TFile + IFile) are configured,
> the factory always picks the first (0th) item. This results in the incorrect
> behaviour that only one controller's files are cleaned up.*
> *As the
> [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L277]
> method just creates the LogDeletionTask instance once and schedules it at a
> fixed rate with the help of a Timer, there's no distinction between log
> aggregation file controllers at this level of abstraction, meaning that only
> the LogAggregationFileControllerFactory could return different file controllers.*
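One possible direction for the requested change, sketched under loud assumptions: this is not the actual patch. The per-controller remote dirs are modelled as an ordered map, the TFile path is a made-up example, and all names are hypothetical. The sketch only contrasts "first controller only" with "every configured controller".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Contrasts cleaning the first controller's dir with cleaning all of them. */
final class AllControllersCleanupSketch {

  /** Behaviour as reported: only the first controller's remote dir is considered. */
  static List<String> dirsToCleanFirstOnly(Map<String, String> remoteDirByController) {
    List<String> all = new ArrayList<>(remoteDirByController.values());
    return all.isEmpty() ? all : new ArrayList<>(all.subList(0, 1));
  }

  /** Requested behaviour: the remote dir of every configured controller is considered. */
  static List<String> dirsToCleanAll(Map<String, String> remoteDirByController) {
    return new ArrayList<>(remoteDirByController.values());
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the configured order (TFile first, then IFile).
    Map<String, String> dirs = new LinkedHashMap<>();
    dirs.put("TFile", "/tmp/logs/TFile/"); // hypothetical path
    dirs.put("IFile", "/tmp/logs/IFile/");
    System.out.println("first only: " + dirsToCleanFirstOnly(dirs));
    System.out.println("all:        " + dirsToCleanAll(dirs));
  }
}
```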
--
This message was sent by Atlassian Jira
(v8.20.10#820010)