Brad Willard created SPARK-5075:
-----------------------------------

             Summary: Memory Leak when repartitioning SchemaRDD from JSON
                 Key: SPARK-5075
                 URL: https://issues.apache.org/jira/browse/SPARK-5075
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 1.2.0
         Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge
            Reporter: Brad Willard


I'm trying to repartition a JSON dataset for better CPU utilization and save it in Parquet format for better performance. The JSON dataset is about 200 GB.

from pyspark.sql import SQLContext
sql_context = SQLContext(sc)  # sc is the SparkContext from the PySpark shell

rdd = sql_context.jsonFile('s3c://some_path')   # ~200 GB of JSON
rdd = rdd.repartition(256)                      # memory balloons here and is never released
rdd.saveAsParquetFile('hdfs://some_path')

In Ganglia, when the dataset first loads it occupies about 200 GB in memory, which is expected. However, once the repartition starts, memory usage balloons to over 2.5x that and is never released, so any subsequent operations fail with memory errors.

Ganglia memory graph showing the growth:
https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
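A possible workaround I have not yet tested: coalesce() reduces the partition count through a narrow dependency instead of the full shuffle that repartition() performs, so it may avoid the extra memory that is never released. Paths are the same placeholders as above.

from pyspark.sql import SQLContext
sql_context = SQLContext(sc)

rdd = sql_context.jsonFile('s3c://some_path')
# coalesce(256) merges existing partitions without a full shuffle; whether it
# avoids the memory growth seen with repartition(256) is unverified
rdd = rdd.coalesce(256)
rdd.saveAsParquetFile('hdfs://some_path')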



