I'm building a Spark job against Spark 1.6.0 / EMR 4.4 in Scala.

I'm attempting to concatenate a number of DataFrame columns and then explode
them into new rows (just using the built-in concat and explode functions).
This works great in my unit tests.
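
Roughly the shape of what I'm doing, simplified and with made-up column
names (here I go through concat_ws and split so that explode has an array
to work on):

      import org.apache.spark.sql.functions.{concat_ws, split, explode}

      // Glue a few columns together with a delimiter, split the result
      // back into an array, and emit one row per element.
      // Column names are placeholders.
      val combined = df.withColumn(
        "combined",
        split(concat_ws("|", df("colA"), df("colB"), df("colC")), "\\|"))
      val exploded =
        combined.withColumn("value", explode(combined("combined")))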

But I get out-of-memory issues when I run against my production data (25 GB
of bzip2-compressed, pipe-delimited text split across about 30 individual
files).

The exception I get is:

ExecutorLostFailure (executor 8 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5
GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.

EMR has set spark.executor.memory = 5120M

Since I don't really mind swapping to disk, I tried fiddling with the memory
settings:

      .set("spark.memory.fraction","0.15")
      .set("spark.memory.storageFraction","0.15")

This does indeed help, but I still get a number of OutOfMemory problems
that kill executors.
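
For reference, the full set of settings I'm experimenting with looks
something like this. I haven't yet tried the overhead bump the error
message suggests; the value below is just a guess:

      import org.apache.spark.SparkConf

      // The fractions I'm already setting, plus a guessed overhead
      // value (in MB) per the "Consider boosting
      // spark.yarn.executor.memoryOverhead" suggestion in the error.
      val conf = new SparkConf()
        .set("spark.memory.fraction", "0.15")
        .set("spark.memory.storageFraction", "0.15")
        .set("spark.yarn.executor.memoryOverhead", "1024")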

I have some very wide DataFrames: once everything is joined together, a
single DataFrame comprises 289 string columns.
I used SizeEstimator to get a read on the size of a single row, and it
clocked in at 2.4 MB, which seems like a lot!
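
For what it's worth, this is roughly how I took that measurement
(joinedDf stands in for my real DataFrame):

      import org.apache.spark.util.SizeEstimator

      // Grab a single Row from the joined DataFrame and estimate its
      // in-memory footprint; this is where the 2.4 MB figure comes from.
      val row = joinedDf.first()
      println(SizeEstimator.estimate(row))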

Is this totally unreasonable? If this were a local process I'd know how to
profile it, but I don't know how to memory-profile a Spark cluster. What
should I do?

One thought that occurs to me is that I could extract just the columns that
I need for the transformation and keep the rest of the columns in one big
string column. Then, when the transformation is done, I can put them back
together. Is this a good idea, or is there a better way around the memory
issue?
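
A rough sketch of that idea, with hypothetical column names ("key",
"colA", "colB" stand in for the columns the transformation actually
needs; everything else gets packed into one delimited string):

      import org.apache.spark.sql.functions.{col, concat_ws, split}

      // Pack the columns the transformation doesn't need into a single
      // delimited string column, leaving a narrow frame to work on.
      val keep = Seq("key", "colA", "colB")
      val rest = df.columns.filterNot(keep.contains)
      val packed = df.select(
        keep.map(col) :+
          concat_ws("\u0001", rest.map(col): _*).as("rest"): _*)

      // ...run the concat/explode transformation on `packed`, then
      // split "rest" back out into its original columns afterwards.
      val unpacked = rest.zipWithIndex.map { case (name, i) =>
        split(col("rest"), "\u0001").getItem(i).as(name)
      }
      val restored = packed.select(col("key") +: unpacked: _*)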

One further thought is that quite a lot of the columns contain empty
strings, but that doesn't seem to make any difference to the row size as
calculated with SizeEstimator.

I thought that if I could store the data as nulls instead of empty strings,
I could save space, so I made the fields Option[String] instead and used
None for the empty strings. The resulting Row was even larger, at 2.6 MB!
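
Concretely, what I tried was along these lines (trimmed down; the real
case class has 289 fields):

      // Represent fields as Option[String], mapping empty strings to
      // None in the hope that the Row would shrink -- it didn't.
      case class Record(colA: Option[String], colB: Option[String])

      def emptyToNone(s: String): Option[String] =
        if (s == null || s.isEmpty) None else Some(s)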

Any words of wisdom would be really appreciated!


