[ 
https://issues.apache.org/jira/browse/YARN-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620464#comment-16620464
 ] 

Tarun Parimi commented on YARN-8627:
------------------------------------

Thanks for the review [~rohithsharma]. 

I tested with a folder path of the form appid/appid/appid, and the patch 
handles it fine. This is because only the first appid directory encountered is 
deleted recursively, after its child directories have been checked for 
modification time. 
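
For illustration, here is a rough sketch of that traversal. It is not the code 
in the patch: it uses java.nio.file on a local directory as a stand-in for the 
HDFS FileSystem API, and the names DoneDirCleanerSketch, cleanDoneDir, isStale 
and deleteRecursively, as well as the "application_" prefix check, are 
assumptions made up for this example. As a simplification, it only looks at the 
modification time of regular files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; local-filesystem stand-in for the HDFS done directory.
final class DoneDirCleanerSketch {

    // Assumption: app directories are recognised by their name prefix.
    static boolean isAppDir(Path p) {
        return Files.isDirectory(p)
            && p.getFileName().toString().startsWith("application_");
    }

    // True if no regular file anywhere under dir was modified at or after
    // cutoffMillis. Nested appid/appid/... levels are covered by the walk.
    static boolean isStale(Path dir, long cutoffMillis) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path f : files) {
            if (Files.getLastModifiedTime(f).toMillis() >= cutoffMillis) {
                return false;
            }
        }
        return true;
    }

    // Deletes dir and everything below it, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Walks down from dir. The first app directory found on each branch is
    // checked as a whole and, if stale, deleted recursively. The walk never
    // descends into an app directory, so a repeated appid directory inside it
    // is not visited again after its parent has been removed.
    // Returns the number of app directories deleted.
    static int cleanDoneDir(Path dir, long cutoffMillis) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir)) {
            for (Path c : ds) {
                children.add(c);
            }
        }
        int deleted = 0;
        for (Path child : children) {
            if (!Files.isDirectory(child)) {
                continue;
            }
            if (isAppDir(child)) {
                if (isStale(child, cutoffMillis)) {
                    deleteRecursively(child);
                    deleted++;
                }
            } else {
                deleted += cleanDoneDir(child, cutoffMillis);
            }
        }
        return deleted;
    }
}
```

Calling DoneDirCleanerSketch.cleanDoneDir(doneRoot, cutoffMillis) on a tree that 
contains application_X/application_X/... removes the outer application_X once, 
provided every file below it is older than the cutoff.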

I agree that we should try to find the root cause of how the repeated 
directories get created. I wasn't able to reproduce this locally, so I wasn't 
able to dig much deeper.

I looked at the hadoop fs -ls -R output of /ats/done for the cluster on which 
I had observed the issue. One thing I noticed is that only the "domainlog" file 
was present in these repeated appid directories. Other types, such as 
summarylog/entitylog, were present only in the normal expected directory 
structure. Also, two domainlog files are created with different sizes, and the 
modification time of the one causing the problem is much later, at 13:00. I am 
not sure of the exact scenario that causes this to happen. A sample is below.

 
{code:java}
drwxrwx--- - appuser hadoop 0 2017-10-16 13:01 
/ats/done/1508116310016/0000/000/application_1508116310016_0010
drwxrwx--- - appuser hadoop 0 2017-10-16 12:16 
/ats/done/1508116310016/0000/000/application_1508116310016_0010/appattempt_1508116310016_0010_000001
-rw-r----- 3 appuser hadoop 88 2017-10-16 12:20 
/ats/done/1508116310016/0000/000/application_1508116310016_0010/appattempt_1508116310016_0010_000001/domainlog-appattempt_1508116310016_0010_000001
-rw-r----- 3 appuser hadoop 92324 2017-10-16 12:22 
/ats/done/1508116310016/0000/000/application_1508116310016_0010/appattempt_1508116310016_0010_000001/summarylog-appattempt_1508116310016_0010_000001
drwxrwxrwx - appuser hadoop 0 2017-10-16 13:00 
/ats/done/1508116310016/0000/000/application_1508116310016_0010/application_1508116310016_0010
drwxrwxrwx - appuser hadoop 0 2017-10-16 13:00 
/ats/done/1508116310016/0000/000/application_1508116310016_0010/application_1508116310016_0010/appattempt_1508116310016_0010_000001
-rw-r----- 3 appuser hadoop 90 2017-10-16 13:00 
/ats/done/1508116310016/0000/000/application_1508116310016_0010/application_1508116310016_0010/appattempt_1508116310016_0010_000001/domainlog-appattempt_1508116310016_0010_000001
 
{code}
 

> EntityGroupFSTimelineStore hdfs done directory keeps on accumulating
> --------------------------------------------------------------------
>
>                 Key: YARN-8627
>                 URL: https://issues.apache.org/jira/browse/YARN-8627
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: timelineserver
>    Affects Versions: 2.8.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>         Attachments: YARN-8627.001.patch, YARN-8627.002.patch
>
>
> The EntityLogCleaner thread exits with the following ERROR every time it 
> runs.  
> {code:java}
> 2018-07-18 19:59:39,837 INFO timeline.EntityGroupFSTimelineStore 
> (EntityGroupFSTimelineStore.java:cleanLogs(462)) - Deleting 
> hdfs://namenode/ats/done/1499684568068/0000/018/application_1499684568068_18268
> 2018-07-18 19:59:39,844 INFO timeline.EntityGroupFSTimelineStore 
> (EntityGroupFSTimelineStore.java:cleanLogs(462)) - Deleting 
> hdfs://namenode/ats/done/1499684568068/0000/018/application_1499684568068_18270
> 2018-07-18 19:59:39,848 ERROR timeline.EntityGroupFSTimelineStore 
> (EntityGroupFSTimelineStore.java:run(899)) - Error cleaning files  
> java.io.FileNotFoundException: File 
> hdfs://namenode/ats/done/1499684568068/0000/018/application_1499684568068_18270
>  does not exist.  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1062)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1069)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1040)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1019)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1015)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1015)
>   at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.shouldCleanAppLogDir(EntityGroupFSTimelineStore.java:480)
>  
> {code}
>  
>  Each time the thread gets scheduled, it is a different folder encountering 
> the error. As a result, the thread is not able to clean all the old done 
> directories, since it stops after this error. 



