Hidayat Teonadi created YARN-8991:
-------------------------------------

             Summary: nodemanager not cleaning blockmgr directories inside 
appcache 
                 Key: YARN-8991
                 URL: https://issues.apache.org/jira/browse/YARN-8991
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: Hidayat Teonadi
         Attachments: yarn-nm-log.txt

Hi, I'm running Spark on YARN with the Spark Shuffle Service enabled. Over 
the lifetime of my Spark Streaming application, the NodeManager (nm) 
appcache folder accumulates blockmgr directories (filled with 
shuffle_*.data files).

Looking at the nm logs, the blockmgr directories do not seem to be part of 
the application's cleanup process. Eventually the disk fills up and the app 
crashes. I have both 
{{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and 
{{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think it's a 
configuration issue.

What is stumping me is that the executor ID listed by Spark during the external 
shuffle block registration doesn't match the executor ID listed in YARN's nm 
log. Maybe this executor ID mismatch explains why the cleanup is not done? 
I'm assuming that blockmgr directories are supposed to be cleaned up?

 
{noformat}
2018-11-05 15:01:21,349 INFO 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered 
executor AppExecId{appId=application_1541045942679_0193, execId=1299} with 
ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42],
 subDirsPerLocalDir=64, 
shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}

 {noformat}
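
For anyone comparing these IDs against the nm log, a minimal Python sketch for pulling the application ID, executor ID, and local directories out of the registration line may help. The pattern is derived only from the single log sample above, so other Spark versions may format the line differently; the names {{REGISTRATION_RE}} and {{parse_registration}} are made up for illustration.

```python
import re

# Pattern based solely on the "Registered executor" sample above; this is an
# assumption about the log format, not a documented Spark contract.
REGISTRATION_RE = re.compile(
    r"AppExecId\{appId=(?P<app_id>application_\d+_\d+),\s*execId=(?P<exec_id>[^}]+)\}"
    r".*?localDirs=\[(?P<local_dirs>[^\]]*)\]",
    re.DOTALL,  # tolerate a log line that was wrapped across several lines
)


def parse_registration(text):
    """Return a list of (app_id, exec_id, local_dirs) tuples found in text."""
    results = []
    for match in REGISTRATION_RE.finditer(text):
        dirs = [d.strip() for d in match.group("local_dirs").split(",") if d.strip()]
        results.append((match.group("app_id"), match.group("exec_id").strip(), dirs))
    return results
```

Running this over the nm log yields one tuple per registration, which can then be compared by hand with the IDs that YARN itself reports for the same application.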
 

This seems similar to https://issues.apache.org/jira/browse/YARN-7070, although 
I'm not sure whether the behavior I'm seeing is specific to Spark's usage.

[https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files]
 has a stopgap solution of cleaning up via cron.
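
A rough sketch of what such a cron-driven cleanup could look like in Python follows. It is not the script from the linked answer. It assumes the directory layout seen in the log line above (usercache/<user>/appcache/<application>/blockmgr-*), and the function names and the age threshold are hypothetical. Removing shuffle files that a running application still needs may break that application, so the sketch defaults to a dry run.

```python
import shutil
import time
from pathlib import Path


def newest_mtime(directory):
    """Newest modification time of the directory or anything beneath it."""
    newest = 0.0
    for path in [directory, *directory.rglob("*")]:
        try:
            newest = max(newest, path.stat().st_mtime)
        except OSError:
            # The entry vanished while we were scanning; ignore it.
            continue
    return newest


def find_stale_blockmgr_dirs(nm_local_dir, max_age_seconds, now=None):
    """Return blockmgr-* directories with no modification in max_age_seconds."""
    now = time.time() if now is None else now
    # Layout assumed from the registration log line, not from YARN documentation.
    pattern = "usercache/*/appcache/application_*/blockmgr-*"
    stale = []
    for path in Path(nm_local_dir).glob(pattern):
        if path.is_dir() and now - newest_mtime(path) > max_age_seconds:
            stale.append(path)
    return sorted(stale)


def remove_dirs(paths, dry_run=True):
    """Delete the given directories, or only report them when dry_run is True."""
    for path in paths:
        if dry_run:
            print(f"would remove {path}")
        else:
            shutil.rmtree(path, ignore_errors=True)
```

A cron job could call {{find_stale_blockmgr_dirs("/mnt1/yarn/nm", 6 * 3600)}} and pass the result to {{remove_dirs}}; the six-hour threshold here is purely illustrative and would need to be longer than any interval over which the application rereads its shuffle data.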

 



