[
https://issues.apache.org/jira/browse/YARN-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620464#comment-16620464
]
Tarun Parimi commented on YARN-8627:
------------------------------------
Thanks for the review [~rohithsharma].
I tested the nested folder path appid/appid/appid, and this patch handles it
correctly. Only the first appid directory encountered is deleted recursively,
and only after all of its child directories have been checked for
modification time.
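The behaviour described above can be sketched roughly as follows. This is only an illustration on the local filesystem using java.nio.file, not the code from the patch: the class name DoneDirCleanerSketch, the method names, and the "application_" prefix check are assumptions made for the example, and the real store works against the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DoneDirCleanerSketch {

    // True only if every file under dir, at any depth, is older than the retention window.
    static boolean shouldClean(Path dir, long nowMillis, long retainMillis) throws IOException {
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    // Nested directories (including a repeated appid) are tested first.
                    if (!shouldClean(child, nowMillis, retainMillis)) {
                        return false;
                    }
                } else if (nowMillis - Files.getLastModifiedTime(child).toMillis() < retainMillis) {
                    return false;
                }
            }
        }
        return true;
    }

    // Deletes dir and everything below it, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Walks the done directory; returns the number of top-level app directories removed.
    static int cleanLogs(Path dir, long nowMillis, long retainMillis) throws IOException {
        // Snapshot the listing so that deletions do not disturb the iteration.
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path c : stream) {
                children.add(c);
            }
        }
        int deleted = 0;
        for (Path child : children) {
            if (!Files.isDirectory(child)) {
                continue;
            }
            if (child.getFileName().toString().startsWith("application_")) {
                // First appid directory encountered: test the whole subtree, then delete it
                // in one go. The walk never descends into a nested appid on its own.
                if (shouldClean(child, nowMillis, retainMillis)) {
                    deleteRecursively(child);
                    deleted++;
                }
            } else {
                deleted += cleanLogs(child, nowMillis, retainMillis);
            }
        }
        return deleted;
    }
}
```

In this sketch a repeated appid directory is never visited as a separate cleanup candidate, so it cannot be listed after its parent has already been removed.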
I agree that we should try to find the root cause of the repeated directories
being created. I wasn't able to reproduce this locally, so I couldn't dig much
deeper.
I looked at the hadoop fs -ls -R output of /ats/done for the cluster on which
I observed the issue. One thing I noticed is that only the "domainlog" file is
present in these repeated appid directories. Other types such as
summarylog/entitylog are present only in the normal, expected directory
structure. Also, two domainlogs are created with different sizes, and the one
in the repeated directory has a much later modification time (13:00). I am not
sure of the exact scenario that causes this. A sample is below.
{code:java}
drwxrwx--- - appuser hadoop 0 2017-10-16 13:01 /ats/done/1508116310016/0000/000/application_1508116310016_0010
drwxrwx--- - appuser hadoop 0 2017-10-16 12:16 /ats/done/1508116310016/0000/000/application_1508116310016_0010/appattempt_1508116310016_0010_000001
-rw-r----- 3 appuser hadoop 88 2017-10-16 12:20 /ats/done/1508116310016/0000/000/application_1508116310016_0010/appattempt_1508116310016_0010_000001/domainlog-appattempt_1508116310016_0010_000001
-rw-r----- 3 appuser hadoop 92324 2017-10-16 12:22 /ats/done/1508116310016/0000/000/application_1508116310016_0010/appattempt_1508116310016_0010_000001/summarylog-appattempt_1508116310016_0010_000001
drwxrwxrwx - appuser hadoop 0 2017-10-16 13:00 /ats/done/1508116310016/0000/000/application_1508116310016_0010/application_1508116310016_0010
drwxrwxrwx - appuser hadoop 0 2017-10-16 13:00 /ats/done/1508116310016/0000/000/application_1508116310016_0010/application_1508116310016_0010/appattempt_1508116310016_0010_000001
-rw-r----- 3 appuser hadoop 90 2017-10-16 13:00 /ats/done/1508116310016/0000/000/application_1508116310016_0010/application_1508116310016_0010/appattempt_1508116310016_0010_000001/domainlog-appattempt_1508116310016_0010_000001
{code}
> EntityGroupFSTimelineStore hdfs done directory keeps on accumulating
> --------------------------------------------------------------------
>
> Key: YARN-8627
> URL: https://issues.apache.org/jira/browse/YARN-8627
> Project: Hadoop YARN
> Issue Type: Bug
> Components: timelineserver
> Affects Versions: 2.8.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Attachments: YARN-8627.001.patch, YARN-8627.002.patch
>
>
> The EntityLogCleaner thread exits with the following ERROR every time it
> runs.
> {code:java}
> 2018-07-18 19:59:39,837 INFO timeline.EntityGroupFSTimelineStore (EntityGroupFSTimelineStore.java:cleanLogs(462)) - Deleting hdfs://namenode/ats/done/1499684568068/0000/018/application_1499684568068_18268
> 2018-07-18 19:59:39,844 INFO timeline.EntityGroupFSTimelineStore (EntityGroupFSTimelineStore.java:cleanLogs(462)) - Deleting hdfs://namenode/ats/done/1499684568068/0000/018/application_1499684568068_18270
> 2018-07-18 19:59:39,848 ERROR timeline.EntityGroupFSTimelineStore (EntityGroupFSTimelineStore.java:run(899)) - Error cleaning files
> java.io.FileNotFoundException: File hdfs://namenode/ats/done/1499684568068/0000/018/application_1499684568068_18270 does not exist.
>     at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1062)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1069)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1040)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1019)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1015)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1015)
>     at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.shouldCleanAppLogDir(EntityGroupFSTimelineStore.java:480)
> {code}
>
> Each time the thread gets scheduled, a different folder triggers the error.
> As a result, the thread cannot clean all the old done directories, since it
> stops after this error.
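
The failure mode quoted above, where a single missing directory ends the whole cleanup pass, can be illustrated with a small local-filesystem sketch. It is a generic pattern and not the change made in the attached patches: the class TolerantCleanerSketch and its methods are hypothetical, and java.nio.file's NoSuchFileException stands in for the FileNotFoundException thrown by HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TolerantCleanerSketch {

    // Lists a directory; throws NoSuchFileException if it has already been removed.
    static List<Path> list(Path dir) throws IOException {
        List<Path> out = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                out.add(p);
            }
        }
        return out;
    }

    // Visits every candidate directory. Directories that have vanished are recorded
    // and skipped instead of aborting the pass. Returns the list of missing ones.
    static List<Path> visitAll(List<Path> candidates, List<Path> visited) {
        List<Path> missing = new ArrayList<>();
        for (Path dir : candidates) {
            try {
                list(dir);
                visited.add(dir);
            } catch (NoSuchFileException e) {
                missing.add(dir); // already gone: note it and keep going
            } catch (IOException e) {
                throw new UncheckedIOException(e); // other I/O errors still surface
            }
        }
        return missing;
    }
}
```

Handling the missing-directory case per directory keeps the remaining candidates reachable in the same run, whereas an exception that escapes the loop leaves them for a later pass.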