Hi guys,

> When you do a shuffle from N map partitions to M reduce partitions,
> there are N * M output blocks created and each one is tracked. That's
> why all these per-block overheads are causing you to OOM.
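(In our case, N = M = 18,000, so that would be 18,000 * 18,000 = 324 million tracked blocks.)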
So, I'm peering into a heap dump from an 18,000-partition shuffle and thought I would just mention what I'm seeing.

The problem AFAICT is that when performing a single ShuffleMapTask (or two, since we have two executors), ShuffleBlockManager opens & buffers 18,000 files. (So I believe this is slightly different from the 18,000 * 18,000 = total-blocks-in-BlockManager issue before.) Each DiskBlockObjectWriter takes ~108k: 80k of that is from the SnappyOutputStream (two 30k byte array buffers) and 25k is in the objOut KryoSerializationStream.

At 18,000 DiskBlockObjectWriters * 108k * 2 executors, that's ~3.8gb. We have a 5.5gb Xmx setting (using m1.larges and reserving 2gb for the OS, worker, etc.), but I can't quite tell where the other ~2gb went.

So, this is a naive question, but what if I just turned off compression? I don't really know how much disk space it's saving, but these buffers take up a non-trivial amount of space.

Tangentially, there are 2.8 million kryo.Registration objects, which is 13% of the total objects. I understand that each file gets its own Kryo stream, which probably gets its own set of config objects. That makes sense in the small, but 18,000 Kryo streams (36,000 counting both executors) leads to a lot of these. It's not the core issue, but just mentioning it.

So, I've already got the job rerunning with fewer partitions, but thought I'd send this as an FYI in case it sparks any potential optimizations.

Thanks!

- Stephen
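P.S. In case it's useful, here is roughly what I had in mind by "turning off compression". Just a sketch, assuming the spark.shuffle.compress property behaves the way the docs describe (the master URL and app name below are placeholders); I haven't actually tried it yet:

    import org.apache.spark.SparkContext

    // Sketch: disable compression of shuffle map output files, so each
    // DiskBlockObjectWriter should skip wrapping its output in a
    // SnappyOutputStream (and its two ~30k byte array buffers).
    // The property has to be set before the SparkContext is created.
    System.setProperty("spark.shuffle.compress", "false")

    val sc = new SparkContext("spark://master:7077", "ShuffleTest")

The tradeoff would be larger shuffle files on disk in exchange for dropping those per-writer buffers, which is why I was asking how much space the compression is actually saving.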
