Hi guys,

> When you do a shuffle from N map partitions to M reduce partitions,
> there are N * M output blocks created and each one is tracked. That's
> why all these per-block overheads are causing you to OOM.
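(In our case, N = M = 18,000, so that would be 18,000 * 18,000 = 324 million tracked blocks.)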
So, I'm peering into a heap dump from an 18,000-partition shuffle and thought I would just mention what I'm seeing.

The problem AFAICT is that when performing a single ShuffleMapTask (or two, since we have two executors), ShuffleBlockManager opens & buffers 18,000 files. (So I believe this is slightly different from the 18,000 * 18,000 = total-blocks-in-BlockManager issue before.) Each DiskBlockObjectWriter takes ~108k: 80k of that is from the SnappyOutputStream (two 30k byte array buffers) and 25k is in the objOut KryoSerializationStream.

At 18,000 DiskBlockObjectWriters * 108k * 2 executors, that's ~3.8gb. We have a 5.5gb Xmx setting (using m1.larges and reserving 2gb for the OS, worker, etc.), but I can't quite tell where the other ~2gb went.

So, this is a naive question, but what if I just turned off compression? I don't really know how much disk space it's saving, but these buffers take up a non-trivial amount of space.

Tangentially, there are 2.8 million kryo.Registration objects, which is 13% of the total objects. I understand that each file gets its own Kryo stream, which probably gets its own set of config objects. That makes sense in the small, but 18,000 Kryo streams (36,000 counting both executors) leads to a lot of these. It's not the core issue, but just mentioning it.

So, I've already got the job rerunning with fewer partitions, but thought I'd send this as an FYI in case it sparks any potential optimizations.

Thanks!

- Stephen
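P.S. In case it's useful, here is roughly what I had in mind by "turning off compression". Just a sketch, assuming the spark.shuffle.compress property behaves the way the docs describe (the master URL and app name below are placeholders); I haven't actually tried it yet:

    import org.apache.spark.SparkContext

    // Sketch: disable compression of shuffle map output files, so each
    // DiskBlockObjectWriter should skip wrapping its output in a
    // SnappyOutputStream (and its two ~30k byte array buffers).
    // The property has to be set before the SparkContext is created.
    System.setProperty("spark.shuffle.compress", "false")

    val sc = new SparkContext("spark://master:7077", "ShuffleTest")

The tradeoff would be larger shuffle files on disk in exchange for dropping those per-writer buffers, which is why I was asking how much space the compression is actually saving.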
