You can disable shuffle spill (spark.shuffle.spill <http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior>) if you have enough memory to hold that much data. Otherwise, I believe adding more resources would be your only choice.
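For reference, a minimal sketch of how you could set it (assuming Spark 1.x, where spark.shuffle.spill is still honored; be aware that with spilling disabled, a shuffle larger than available memory will fail with an OOM instead of writing to disk):

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical app name; the only line that matters here is the
  // spark.shuffle.spill setting.
  val conf = new SparkConf()
    .setAppName("NoSpillExample")
    .set("spark.shuffle.spill", "false")  // keep shuffle data in memory
  val sc = new SparkContext(conf)

The same setting can also be passed without code changes, e.g. spark-submit --conf spark.shuffle.spill=false ...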
Thanks
Best Regards

On Thu, Jun 11, 2015 at 9:46 PM, Al M <alasdair.mcbr...@gmail.com> wrote:

> I am using Spark on a machine with limited disk space. I am using it to
> analyze very large (100GB to 1TB per file) data sets stored in HDFS. When I
> analyze these datasets, I will run groups, joins and cogroups. All of these
> operations mean lots of shuffle files written to disk.
>
> Unfortunately what happens is my disk fills up very quickly (I only have
> 40GB free). Then my process dies because I don't have enough space on disk.
> I don't want to write my shuffles to HDFS because it's already pretty full.
> The shuffle files are cleared up between runs, but this doesn't help when a
> single run requires 300GB+ of shuffle disk space.
>
> Is there any way that I can limit the amount of disk space used by my
> shuffles? I could set up a cron job to delete old shuffle files whilst the
> job is still running, but I'm concerned that they are left there for a good
> reason.