I'm building a Spark job against Spark 1.6.0 / EMR 4.4 in Scala. I'm concatenating a bunch of DataFrame columns and then exploding the result into new rows, using just the built-in concat and explode functions. It works great in my unit test.
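The shape of the transformation is roughly this (a sketch, not my real code; the column names, the delimiter, and the split step are illustrative):

    import org.apache.spark.sql.functions.{col, concat, explode, lit, split}

    // Glue several string columns together with a delimiter...
    val combined = df.withColumn("combined",
      concat(col("colA"), lit("|"), col("colB"), lit("|"), col("colC")))

    // ...then split the combined string apart and emit one row per piece.
    val exploded = combined.withColumn("piece",
      explode(split(col("combined"), "\\|")))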
But I get out-of-memory issues when I run against my production data (25 GB of bzip2-compressed, pipe-delimited text split out into about 30 individual files). The exception I get is:

    ExecutorLostFailure (executor 8 exited caused by one of the running tasks)
    Reason: Container killed by YARN for exceeding memory limits.
    5.5 GB of 5.5 GB physical memory used.
    Consider boosting spark.yarn.executor.memoryOverhead.

EMR has set spark.executor.memory = 5120M.

Since I don't really mind spilling to disk, I tried fiddling with the memory settings:

    .set("spark.memory.fraction", "0.15")
    .set("spark.memory.storageFraction", "0.15")

That does indeed help, but I still get a number of OutOfMemory problems that kill executors.

I have some very wide DataFrames: 289 string columns once everything is joined together. I used SizeEstimator to get a read on the size of a single row, and it clocked in at 2.4 MB, which seems like a lot! Is that totally unreasonable? (A sketch of how I measured it is at the end of this message.) If this were a local process I'd know how to profile it, but I don't know how to memory-profile a Spark cluster. What should I do?

One thought that occurs to me is that I could extract just the columns I need for the transformation and keep the rest of the columns in one big string column; then, when the transformation is done, I could put them back together (rough sketch below). Is this a good idea, or is there a better way around the memory issue?

One further thought: quite a lot of the columns are empty strings, but that doesn't seem to make any difference to the row size reported by SizeEstimator. I figured that if I could insert the data as nulls instead of empty strings I could save space, so I made the fields Option[String] and used None for the empty strings. The resulting Row was even larger: 2.6 MB.

Any words of wisdom would be really appreciated!
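For reference, this is roughly how I measured the row size (a sketch; pulling first() back to the driver is just for illustration):

    import org.apache.spark.util.SizeEstimator

    // Pull a single Row back to the driver and estimate its in-memory size.
    val row = df.first()
    println(s"Estimated row size: ${SizeEstimator.estimate(row)} bytes")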
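And this is the sort of thing I have in mind for the column-extraction idea (a rough sketch; the "needed" column names are made up, and runTransformation is a placeholder for my real concat/explode step):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, concat_ws, split}

    // Columns the transformation actually needs; everything else gets
    // packed into a single string column for the duration of the job.
    // Note: concat_ws silently skips nulls, so the positional unpacking
    // below assumes all the packed columns are non-null strings.
    val needed = Seq("colA", "colB", "colC")
    val rest   = df.columns.filterNot(needed.contains)

    val narrow = df.select(
      (needed.map(col) :+
        concat_ws("\u0001", rest.map(col): _*).as("packed")): _*)

    // Placeholder for the real transformation, which only touches the
    // "needed" columns and carries "packed" along unchanged.
    val transformed: DataFrame = runTransformation(narrow)

    // Afterwards, split "packed" back out into the original columns.
    val unpacked = rest.zipWithIndex.map { case (name, i) =>
      split(col("packed"), "\u0001").getItem(i).as(name)
    }
    val restored =
      transformed.select((col("*") +: unpacked): _*).drop("packed")

Whether the rebuild happens via split (as here) or via a join on a key column probably depends on how the explode multiplies the rows.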