Off the top of my head, there are a few constant-factor optimizations that we could make to the blockInfo data structure that might reduce its per-entry memory usage:
- Key the blockInfo HashMap by BlockId.name instead of BlockId.

- Exploit hierarchy/structure when storing info on shuffle blocks, e.g. a Map
  from ShuffleId -> MapId -> Array indexed by ReduceId -> Info. This reduces
  the number of entries stored in TimestampedHashMap, where each entry has a
  bit of overhead to create the map entry objects and record the timestamps.
  With this scheme, all block info for a particular shuffle will be evicted at
  the same time, reducing the number of objects that the cleaner thread needs
  to scan. (A rough sketch of this layout follows the quoted message below.)

- We might be able to eliminate the "pending" and "failed" boolean fields by
  storing negative values in the "size" field.

On Sat, Oct 26, 2013 at 11:43 AM, Stephen Haberman <[email protected]> wrote:

> Hi,
>
> By dropping spark.shuffle.file.buffer.kb to 10k and using Snappy
> (thanks, Aaron), the job I'm trying to run is no longer OOMEing because
> of 300k LZF buffers taking up 4g of RAM.
>
> But...now it's OOMEing because BlockManager is taking ~3.5gb of RAM
> (which is ~90% of the available heap).
>
> Specifically, it's two ConcurrentHashMaps:
>
> * BlockManager.blockInfo has ~1gb retained, AFAICT from ~5.5 million
> entries of (ShuffleBlockId, (BlockInfo, Long))
>
> * BlockManager's DiskBlockManager.blockToFileSegmentMap has ~2.3gb
> retained, AFAICT from about the same ~5.5 million entries of
> (ShuffleBlockId, (FileSegment, Long)).
>
> The job stalls about 3,000 tasks through a 7,000-partition shuffle that
> is loading ~500gb from S3 on 5 m1.large (4gb heap) machines. The job
> did a few smaller ~50-partition shuffles before this larger one, but
> nothing crazy. It's an on-demand/EMR cluster, in standalone mode.
>
> Both of these maps are TimeStampedHashMaps, which kind of makes me
> shudder, but we have the cleaner disabled which AFAIK is what we want,
> because we aren't running long-running streaming jobs. And AFAIU if the
> hash map did get cleaned up mid-shuffle, lookups would just start
> failing (which was actually happening for this job on Spark 0.7 and is
> what prompted us to get around to trying Spark 0.8).
>
> So, I haven't really figured out BlockManager yet--any hints on what we
> could do here? More machines? Should there really be this many entries
> in it for a shuffle of this size?
>
> I know 5 machines/4gb of RAM isn't a lot, and I could use more if
> needed, but I just expected the job to go slower, not OOME.
>
> Also, I technically have a heap dump from a m1.xlarge (~15gb of RAM)
> cluster that also OOMEd on the same job, but I can't open the file on
> my laptop, so I can't tell if it was OOMEing for this issue or another
> one (it was not using snappy, but using 10kb file buffers, so I'm
> interested to see what happened to it.)
>
> - Stephen
>
>
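
To make the second and third suggestions a bit more concrete, here is a rough
sketch of the kind of layout I have in mind. The class and method names
(ShuffleBlockTable, register, markCommitted, etc.) are made up purely for
illustration and are not existing BlockManager APIs; this is just one way the
nesting and the negative-size encoding could fit together:

import scala.collection.mutable

// Sketch only: ShuffleBlockTable and its methods are hypothetical names, not
// existing Spark classes. Block state is folded into the size value:
//   size >= 0 -> committed with that size, -1 -> pending, -2 -> failed.
class ShuffleBlockTable {

  private case class ShuffleEntry(
      timestamp: Long,                        // one timestamp per shuffle, not per block
      maps: mutable.Map[Int, Array[Long]])    // mapId -> sizes indexed by reduceId

  // shuffleId -> entry; the cleaner scans one object per shuffle instead of
  // one TimestampedHashMap entry per (shuffle, map, reduce) block.
  private val shuffles = mutable.Map[Int, ShuffleEntry]()

  def register(shuffleId: Int, mapId: Int, numReduces: Int): Unit = {
    val entry = shuffles.getOrElseUpdate(
      shuffleId,
      ShuffleEntry(System.currentTimeMillis(), mutable.Map.empty[Int, Array[Long]]))
    entry.maps.getOrElseUpdate(mapId, Array.fill(numReduces)(-1L)) // all pending
  }

  def markCommitted(shuffleId: Int, mapId: Int, reduceId: Int, size: Long): Unit =
    shuffles(shuffleId).maps(mapId)(reduceId) = size

  def markFailed(shuffleId: Int, mapId: Int, reduceId: Int): Unit =
    shuffles(shuffleId).maps(mapId)(reduceId) = -2L

  // Dropping a shuffle evicts all of its per-block info at once.
  def removeShuffle(shuffleId: Int): Unit = shuffles.remove(shuffleId)
}

The per-reduce Array[Long] replaces one HashMap entry per ShuffleBlockId, so
the per-block cost drops to roughly one long, and eviction happens per shuffle
rather than per block. Whether BlockInfo can really lose its pending/failed
fields this way is something I'd want to double-check against the current code.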
