Off the top of my head, there are a few constant-factor optimizations that we could make to the blockInfo data structure that might reduce its per-entry memory usage:
- Key the blockInfo HashMap by BlockId.name instead of BlockId.

- Exploit hierarchy/structure when storing info on shuffle blocks, e.g. a Map
  from ShuffleId -> MapId -> Array indexed by ReduceId -> Info. This reduces
  the number of entries stored in TimestampedHashMap, where each entry has a
  bit of overhead to create the map entry objects and record the timestamps.
  With this scheme, all block info for a particular shuffle will be evicted at
  the same time, reducing the number of objects that the cleaner thread needs
  to scan. (A rough sketch of this layout follows the quoted message below.)

- We might be able to eliminate the "pending" and "failed" boolean fields by
  storing negative values in the "size" field.

On Sat, Oct 26, 2013 at 11:43 AM, Stephen Haberman <[email protected]> wrote:

> Hi,
>
> By dropping spark.shuffle.file.buffer.kb to 10k and using Snappy
> (thanks, Aaron), the job I'm trying to run is no longer OOMEing because
> of 300k LZF buffers taking up 4g of RAM.
>
> But...now it's OOMEing because BlockManager is taking ~3.5gb of RAM
> (which is ~90% of the available heap).
>
> Specifically, it's two ConcurrentHashMaps:
>
> * BlockManager.blockInfo has ~1gb retained, AFAICT from ~5.5 million
> entries of (ShuffleBlockId, (BlockInfo, Long))
>
> * BlockManager's DiskBlockManager.blockToFileSegmentMap has ~2.3gb
> retained, AFAICT from about the same ~5.5 million entries of
> (ShuffleBlockId, (FileSegment, Long)).
>
> The job stalls about 3,000 tasks through a 7,000-partition shuffle that
> is loading ~500gb from S3 on 5 m1.large (4gb heap) machines. The job
> did a few smaller ~50-partition shuffles before this larger one, but
> nothing crazy. It's an on-demand/EMR cluster, in standalone mode.
>
> Both of these maps are TimeStampedHashMaps, which kind of makes me
> shudder, but we have the cleaner disabled which AFAIK is what we want,
> because we aren't running long-running streaming jobs. And AFAIU if the
> hash map did get cleaned up mid-shuffle, lookups would just start
> failing (which was actually happening for this job on Spark 0.7 and is
> what prompted us to get around to trying Spark 0.8).
>
> So, I haven't really figured out BlockManager yet--any hints on what we
> could do here? More machines? Should there really be this many entries
> in it for a shuffle of this size?
>
> I know 5 machines/4gb of RAM isn't a lot, and I could use more if
> needed, but I just expected the job to go slower, not OOME.
>
> Also, I technically have a heap dump from a m1.xlarge (~15gb of RAM)
> cluster that also OOMEd on the same job, but I can't open the file on
> my laptop, so I can't tell if it was OOMEing for this issue or another
> one (it was not using snappy, but using 10kb file buffers, so I'm
> interested to see what happened to it.)
>
> - Stephen
>
>
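
To make the second and third suggestions a bit more concrete, here is a rough
sketch of the kind of layout I have in mind. The class and method names
(ShuffleBlockTable, register, markCommitted, etc.) are made up purely for
illustration and are not existing BlockManager APIs; this is just one way the
nesting and the negative-size encoding could fit together:

import scala.collection.mutable

// Sketch only: ShuffleBlockTable and its methods are hypothetical names, not
// existing Spark classes. Block state is folded into the size value:
//   size >= 0 -> committed with that size, -1 -> pending, -2 -> failed.
class ShuffleBlockTable {

  private case class ShuffleEntry(
      timestamp: Long,                        // one timestamp per shuffle, not per block
      maps: mutable.Map[Int, Array[Long]])    // mapId -> sizes indexed by reduceId

  // shuffleId -> entry; the cleaner scans one object per shuffle instead of
  // one TimestampedHashMap entry per (shuffle, map, reduce) block.
  private val shuffles = mutable.Map[Int, ShuffleEntry]()

  def register(shuffleId: Int, mapId: Int, numReduces: Int): Unit = {
    val entry = shuffles.getOrElseUpdate(
      shuffleId,
      ShuffleEntry(System.currentTimeMillis(), mutable.Map.empty[Int, Array[Long]]))
    entry.maps.getOrElseUpdate(mapId, Array.fill(numReduces)(-1L)) // all pending
  }

  def markCommitted(shuffleId: Int, mapId: Int, reduceId: Int, size: Long): Unit =
    shuffles(shuffleId).maps(mapId)(reduceId) = size

  def markFailed(shuffleId: Int, mapId: Int, reduceId: Int): Unit =
    shuffles(shuffleId).maps(mapId)(reduceId) = -2L

  // Dropping a shuffle evicts all of its per-block info at once.
  def removeShuffle(shuffleId: Int): Unit = shuffles.remove(shuffleId)
}

The per-reduce Array[Long] replaces one HashMap entry per ShuffleBlockId, so
the per-block cost drops to roughly one long, and eviction happens per shuffle
rather than per block. Whether BlockInfo can really lose its pending/failed
fields this way is something I'd want to double-check against the current code.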
