Hi,

By dropping spark.shuffle.file.buffer.kb to 10kb and using Snappy (thanks, Aaron), the job I'm trying to run is no longer OOMEing because of 300k LZF buffers taking up 4gb of RAM.
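For reference, this is roughly how we're applying those two settings -- a minimal sketch in the Spark 0.8 style of setting system properties before creating the SparkContext; the master URL and app name below are just placeholders:

    import org.apache.spark.SparkContext

    object ShuffleConfigSketch {
      def main(args: Array[String]): Unit = {
        // Shrink each shuffle file buffer from the 100kb default down to 10kb.
        System.setProperty("spark.shuffle.file.buffer.kb", "10")
        // Use Snappy instead of the default LZF codec for shuffle/block compression.
        System.setProperty("spark.io.compression.codec",
          "org.apache.spark.io.SnappyCompressionCodec")

        val sc = new SparkContext("spark://master:7077", "shuffle-config-sketch")
        // ... actual job body elided ...
        sc.stop()
      }
    }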
But... now it's OOMEing because the BlockManager is taking ~3.5gb of RAM (~90% of the available heap). Specifically, it's two ConcurrentHashMaps:

* BlockManager.blockInfo has ~1gb retained, AFAICT from ~5.5 million entries of (ShuffleBlockId, (BlockInfo, Long))
* BlockManager's DiskBlockManager.blockToFileSegmentMap has ~2.3gb retained, AFAICT from about the same ~5.5 million entries of (ShuffleBlockId, (FileSegment, Long))

The job stalls about 3,000 tasks into a 7,000-partition shuffle that is loading ~500gb from S3 on 5 m1.large (4gb heap) machines. The job did a few smaller ~50-partition shuffles before this larger one, but nothing crazy. It's an on-demand/EMR cluster, in standalone mode.

Both of these maps are TimeStampedHashMaps, which kind of makes me shudder, but we have the cleaner disabled, which AFAIK is what we want, since we aren't running long-running streaming jobs. And AFAIU, if the hash map did get cleaned up mid-shuffle, lookups would just start failing (which was actually happening for this job on Spark 0.7 and is what prompted us to get around to trying Spark 0.8).

So, I haven't really figured out BlockManager yet--any hints on what we could do here? More machines? Should there really be this many entries in it for a shuffle of this size? I know 5 machines/4gb of RAM isn't a lot, and I could use more if needed, but I just expected the job to go slower, not OOME.

Also, I technically have a heap dump from an m1.xlarge (~15gb of RAM) cluster that also OOMEd on the same job, but I can't open the file on my laptop, so I can't tell whether it OOMEd for this issue or another one (it was not using Snappy, but was using 10kb file buffers, so I'm interested to see what happened to it).

- Stephen
