You can try:
- Using KryoSerializer
- Enabling RDD compression
- Setting the storage level to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER
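Roughly along these lines (a minimal sketch against the Spark 1.x Scala API; the app name, input path, and choice of MEMORY_AND_DISK_SER are just placeholders, adjust them to your job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Switch to Kryo serialization and compress serialized RDD partitions
    val conf = new SparkConf()
      .setAppName("json-load")   // hypothetical app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")

    val sc = new SparkContext(conf)

    // Hypothetical path; substitute the actual location of your JSON files
    val lines = sc.textFile("hdfs:///path/to/json")

    // Persist in serialized form so cached data takes far less heap than
    // deserialized Java objects (MEMORY_ONLY_SER also works if it all fits)
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

Serialized storage levels trade some CPU for a much smaller cached footprint, which usually helps when repartitioning blows past the executors' memory.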
Thanks
Best Regards

On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard <bradwill...@gmail.com> wrote:
> I have a 10-node cluster with 600 GB of RAM. I'm loading a fairly large
> dataset from JSON files. When I load the dataset it is about 200 GB, but
> it only creates 60 partitions. I'm trying to repartition to 256 to increase
> CPU utilization; however, when I do that it balloons in memory to well over
> 2x the initial size, killing nodes with memory failures.
>
> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
>
> Is this a bug? How can I work around this?
>
> Thanks!