Re: Spark not releasing shuffle files in time (with very large heap)
You can also look at the shuffle-file cleanup tricks we do inside of the ALS algorithm in Spark.

--
Twitter: https://twitter.com/holdenkarau
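For reference, the cleanup trick in ALS is periodic checkpointing: checkpointing truncates the RDD lineage, which lets Spark's ContextCleaner garbage-collect the shuffle files of earlier iterations. A minimal sketch using the standard pyspark.ml API (the checkpoint path and parameter values below are illustrative, not from this thread):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-checkpoint-sketch").getOrCreate()

# Checkpoints need a durable location; /tmp/checkpoints is just a placeholder.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

# ALS checkpoints every `checkpointInterval` iterations (default 10).
# Each checkpoint cuts the lineage, so the shuffle files behind the
# truncated portion become eligible for cleanup.
als = ALS(maxIter=20, checkpointInterval=5,
          userCol="user", itemCol="item", ratingCol="rating")
```

The same idea applies to any iterative job: checkpoint (or otherwise drop references to) intermediate RDDs/DataFrames so their shuffle state becomes unreachable.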
Re: Spark not releasing shuffle files in time (with very large heap)
Have you looked at
http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html
and the post mentioned there,
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html ?

Also try compressing the shuffle output (spark.shuffle.compress):
https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization

Thanks,
Vijay
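The compression settings above can be set on the SparkConf. Note that spark.shuffle.compress already defaults to true, so the codec choice is usually the more useful knob. A sketch (app name and codec choice are illustrative):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-compression-sketch")
        # Compress map output files (this is already the default).
        .set("spark.shuffle.compress", "true")
        # Also compress data spilled to disk during shuffles (also default).
        .set("spark.shuffle.spill.compress", "true")
        # lz4 is the default codec; zstd trades CPU for a smaller on-disk footprint.
        .set("spark.io.compression.codec", "zstd"))

sc = SparkContext(conf=conf)
```

Compression shrinks the shuffle files but does not change when they are deleted, so it buys headroom rather than fixing the cleanup timing itself.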
Re: Spark not releasing shuffle files in time (with very large heap)
Got it. I had understood the issue in a different way.
Re: Spark not releasing shuffle files in time (with very large heap)
My issue is that there is not enough pressure on the GC, so GC does not kick in fast enough to delete the shuffle files of previous iterations.

Regards,
Keith.

http://keith-chapman.com
Re: Spark not releasing shuffle files in time (with very large heap)
It would be very difficult to tell without knowing what your application code is doing and what kinds of transformations/actions it performs. In my experience, tuning application code to avoid unnecessary objects reduces pressure on the GC.

On Thu, Feb 22, 2018 at 2:13 AM, Keith Chapman wrote:
> Hi,
>
> I'm benchmarking a Spark application by running it for multiple
> iterations; it's a benchmark that is heavy on shuffle, and I run it on a
> local machine with a very large heap (~200 GB). The system has an SSD.
> After running 3 to 4 iterations I run out of disk space in the /tmp
> directory. On further investigation I found that the shuffle files are
> still around: because the heap is very large, GC has not happened, and
> hence the shuffle files are not deleted. I confirmed this by lowering the
> heap size, and I see GC kicking in more often and the size of /tmp staying
> under control. Is there any way I could configure Spark to handle this
> issue?
>
> One option I have is to make GC run more often by setting
> spark.cleaner.periodicGC.interval to a much lower value. Is there a
> cleaner solution?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
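The periodic-GC option from the original question can be set like this. The interval value below is illustrative (the default is 30min); it helps precisely because the forced JVM GC lets the ContextCleaner notice unreferenced shuffles and delete their files:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("periodic-gc-sketch")
        # Trigger a JVM GC on the driver this often so the ContextCleaner can
        # find dereferenced shuffle state and remove its files (default: 30min).
        .set("spark.cleaner.periodicGC.interval", "5min"))

sc = SparkContext(conf=conf)
```

This is a blunt instrument (a full GC on a ~200 GB heap is not free), so it is worth measuring the pause cost against the disk headroom it recovers.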