Hello there,

I have a Spark job that reads 7 Parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in
different stages of execution and produces a 9 GB result Parquet file (about
27 million rows with 165 columns; some columns are map-based, holding
histograms of at most 200 values). The stages are as follows (a rough sketch
of the pipeline appears after the list):
Step 1: Read the data using the DataFrame API
Step 2: Transform the DataFrame to an RDD and apply some transformations,
since some of the columns are converted into histograms (using an empirical
distribution to cap the number of keys) and some behave like a UDAF during
the reduce-by-key step
Step 3: Reduce the result by key so that it can be used for the join in the
next stage
Step 4: Perform a left outer join of this result with the output of a
similar pipeline that runs Steps 1 through 3
Step 5: Reduce the results further and write them to Parquet
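Roughly, the pipeline looks like the sketch below. The column, function and
variable names (toHistogramState, mergeHistogramState, otherReduced,
toOutputRow, outputSchema, the HDFS paths) are placeholders for illustration,
not the actual job code:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Step 1: read one of the input Parquet files with the DataFrame API
    val df = sqlContext.read.parquet("hdfs://other-cluster/path/to/input")

    // Step 2: drop to the RDD API; build capped histograms / UDAF-like
    // aggregation state per row, keyed for the reduce
    val keyed = df.rdd.map { row =>
      val key = row.getAs[String]("id")       // placeholder key column
      (key, toHistogramState(row))            // caps keys via an empirical distribution
    }

    // Step 3: reduce by key so the result can be joined in the next stage
    val reduced = keyed.reduceByKey(mergeHistogramState)

    // Step 4: left outer join with the result of a similar pipeline (Steps 1-3)
    val joined = reduced.leftOuterJoin(otherReduced)

    // Step 5: reduce further and write the final result as Parquet
    val outputRows = joined.map(toOutputRow)
    sqlContext.createDataFrame(outputRows, outputSchema)
      .write.parquet("hdfs://other-cluster/path/to/output")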

With Apache Spark 1.5.2, I am able to run the job with no issues. The
current environment uses 8 nodes with a total of 320 cores and 100 GB of
executor memory per node, with the driver program using 32 GB. The execution
time is approximately 1.2 hours. The Parquet files are read from, and the
result eventually written back to, another HDFS cluster.
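For reference, the resource settings look roughly like the following (the
app name is a placeholder, and the 32 GB driver memory is passed at submit
time via --driver-memory rather than in code):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-histogram-job")      // placeholder app name
      .set("spark.executor.memory", "100g")     // per executor node
      .set("spark.cores.max", "320")            // 8 nodes, 320 cores in total
    val sc = new SparkContext(conf)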

When the same job is executed using Apache Spark 1.6.0, some of the
executors' JVMs get restarted (with a new executor ID). After turning on GC
stats on the executors, the perm-gen appears to max out and the job ends up
showing the symptoms of an out-of-memory condition.
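For what it is worth, the GC stats were turned on via the executor JVM
options, roughly as below (the exact flags may differ, and the MaxPermSize
value is shown only to illustrate the setting that seems to be involved, not
a confirmed fix):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxPermSize=512m")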

Please advise on where to start investigating this issue.

Thanks,
Muthu



