You can try the following (sketched below):

- Using the Kryo serializer (set spark.serializer to org.apache.spark.serializer.KryoSerializer)
- Enabling RDD compression (spark.rdd.compress=true)
- Setting the storage level to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER
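
Here is a minimal sketch of how those settings fit together (Scala, Spark 1.x API; the input path, app name, and partition count are placeholders for illustration, not taken from your setup):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("RepartitionExample")
      // Kryo is usually more compact than the default Java serializer
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Compress serialized RDD blocks held in memory or spilled to disk
      .set("spark.rdd.compress", "true")

    val sc = new SparkContext(conf)

    // Hypothetical path; substitute wherever your JSON files live
    val lines = sc.textFile("hdfs:///data/events/*.json")

    // Repartition, then persist in serialized form so cached blocks stay compact
    val repartitioned = lines.repartition(256)
    repartitioned.persist(StorageLevel.MEMORY_AND_DISK_SER)
    repartitioned.count()

Serialized storage levels trade some CPU for a much smaller in-memory footprint, which should help keep the repartitioned data from blowing past your executors' memory.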


Thanks
Best Regards

On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard <bradwill...@gmail.com> wrote:

> I have a 10-node cluster with 600 GB of RAM. I'm loading a fairly large
> dataset from JSON files. When I load the dataset it is about 200 GB, but it
> only creates 60 partitions. I'm trying to repartition to 256 to increase
> CPU utilization, but when I do that, memory usage balloons to well over 2x
> the initial size, killing nodes with out-of-memory failures.
>
>
> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
>
> Is this a bug? How can I work around this?
>
> Thanks!
>
>
>
