Yes, as Josh said, when an application is started, Spark creates a unique
application-wide folder for its temporary files. Each job in the
application gets a unique shuffle ID and unique file names, so
shuffle stages within an app will not hit name conflicts.
Shuffle files between applications are also kept separate:
DiskBlockManager doesn't need to know the app id; all it needs to do is
create a folder with a unique (UUID-based) name and then put all the
shuffle files into it.
You can see the code in DiskBlockManager below; it creates a bunch of
unique folders when initialized, and these folders are app-specific.
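The quoted code itself didn't survive in the archive, so here is only a rough sketch of the idea, not the actual DiskBlockManager source: create a local directory whose UUID-based name cannot clash with that of any other application. The names `rootDir` and `createAppLocalDir` are hypothetical, chosen for illustration.

```scala
import java.io.{File, IOException}
import java.util.UUID

// Sketch: create a uniquely named local directory for one application's
// shuffle files, similar in spirit to what DiskBlockManager does at init.
def createAppLocalDir(rootDir: File): File = {
  // A UUID-based name guarantees no clash with other apps' directories.
  val dir = new File(rootDir, "spark-local-" + UUID.randomUUID().toString)
  if (!dir.mkdirs() && !dir.isDirectory) {
    throw new IOException("Failed to create dir " + dir)
  }
  dir
}

val appDir = createAppLocalDir(new File(System.getProperty("java.io.tmpdir")))
println(appDir.exists()) // true
```

Because the name is random per process, two applications (or two runs of the same application) writing to the same root never step on each other's shuffle files.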
Sent: Wednesday, March 25, 2015 7:40 PM
To: Saisai Shao; Kannan Rajah
Cc: dev@spark.apache.org
Subject: Re: Understanding shuffle file name conflicts
Hi Jerry & Josh,
It has been a while since I last looked into the Spark core shuffle code,
so maybe I'm wrong here. But the shuffle ID is created along
Hi Kannan,
As far as I know, the shuffle ID in ShuffleDependency is increased each
time, so even if you run the same job twice, the shuffle dependency, and
hence the shuffle ID, will be different. The shuffle file name, which is
composed of (shuffleId + mapId + reduceId), will therefore change, so
there's no name conflict even
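The naming scheme Jerry describes can be sketched as follows. This is a simplified stand-in for Spark's ShuffleBlockId, shown only to illustrate why a bumped shuffle ID avoids collisions between runs.

```scala
// Simplified sketch of how a shuffle file name embeds the shuffle ID:
// because each new ShuffleDependency gets a fresh shuffleId, re-running
// the same job yields different names even for identical map/reduce ids.
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) {
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
}

val firstRun  = ShuffleBlockId(shuffleId = 0, mapId = 3, reduceId = 1).name
val secondRun = ShuffleBlockId(shuffleId = 1, mapId = 3, reduceId = 1).name
println(firstRun)  // shuffle_0_3_1
println(secondRun) // shuffle_1_3_1 -- no conflict with the first run
```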
I am working on SPARK-1529. I ran into an issue with my change where the
same shuffle file was being reused across 2 jobs. Please note this only
happens when I use a hard-coded location for the shuffle files, say
/tmp. It does not happen with the normal code path that uses
DiskBlockManager to pick