Hi,

Does anybody have any pointers on this one?

Regards
Sumit Chawla

On Tue, Mar 6, 2018 at 8:58 AM, Chawla,Sumit <[email protected]> wrote:

> No, this is the only stack trace I get. I have tried DEBUG but didn't
> notice much of a log change.
>
> Yes, I have tried bumping MaxDirectMemorySize to get rid of this error.
> It does work if I throw 4G+ memory at it. However, I am trying to
> understand this behavior so that I can set this number to an
> appropriate value.
>
> Regards
> Sumit Chawla
>
>
> On Tue, Mar 6, 2018 at 8:07 AM, Vadim Semenov <[email protected]> wrote:
>
>> Do you have a trace? i.e. what's the source of `io.netty.*` calls?
>>
>> And have you tried bumping `-XX:MaxDirectMemorySize`?
>>
>> On Tue, Mar 6, 2018 at 12:45 AM, Chawla,Sumit <[email protected]>
>> wrote:
>>
>>> Hi All
>>>
>>> I have a job which processes a large dataset. All items in the dataset
>>> are unrelated. To save on cluster resources, I process these items in
>>> chunks. Since chunks are independent of each other, I start and shut
>>> down the Spark context for each chunk. This allows me to keep the DAG
>>> smaller and not retry the entire DAG in case of failures. This
>>> mechanism used to work fine with Spark 1.6. Now that we have moved to
>>> 2.2, the job has started failing with OutOfDirectMemoryError.
>>>
>>> 2018-03-03 22:00:59,687 WARN [rpc-server-48-1]
>>> server.TransportChannelHandler
>>> (TransportChannelHandler.java:exceptionCaught(78))
>>> - Exception in connection from /10.66.73.27:60374
>>>
>>> io.netty.util.internal.OutOfDirectMemoryError: failed to allocate
>>> 8388608 byte(s) of direct memory (used: 1023410176, max: 1029177344)
>>>   at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:506)
>>>   at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:460)
>>>   at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
>>>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:690)
>>>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237)
>>>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:213)
>>>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
>>>   at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>>>   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>>>   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>>>   at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
>>>   at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>>>   at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>>>   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
>>>
>>> I got some clue about what is causing this from
>>> https://github.com/netty/netty/issues/6343. However, I am not able to
>>> add up the numbers to see what is causing 1 GB of direct memory to
>>> fill up.
>>>
>>> Output from jmap
>>>
>>>    7:  22230  1422720  io.netty.buffer.PoolSubpage
>>>   12:   1370   804640  io.netty.buffer.PoolSubpage[]
>>>   41:   3600   144000  io.netty.buffer.PoolChunkList
>>>   98:   1440    46080  io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache
>>>  113:    300    40800  io.netty.buffer.PoolArena$HeapArena
>>>  114:    300    40800  io.netty.buffer.PoolArena$DirectArena
>>>  192:    198    15840  io.netty.buffer.PoolChunk
>>>  274:    120     8320  io.netty.buffer.PoolThreadCache$MemoryRegionCache[]
>>>  406:    120     3840  io.netty.buffer.PoolThreadCache$NormalMemoryRegionCache
>>>  422:     72     3552  io.netty.buffer.PoolArena[]
>>>  458:     30     2640  io.netty.buffer.PooledUnsafeDirectByteBuf
>>>  500:     36     2016  io.netty.buffer.PooledByteBufAllocator
>>>  529:     32     1792  io.netty.buffer.UnpooledUnsafeHeapByteBuf
>>>  589:     20     1440  io.netty.buffer.PoolThreadCache
>>>  630:     37     1184  io.netty.buffer.EmptyByteBuf
>>>  703:     36      864  io.netty.buffer.PooledByteBufAllocator$PoolThreadLocalCache
>>>  852:     22      528  io.netty.buffer.AdvancedLeakAwareByteBuf
>>>  889:     10      480  io.netty.buffer.SlicedAbstractByteBuf
>>>  917:      8      448  io.netty.buffer.UnpooledHeapByteBuf
>>> 1018:     20      320  io.netty.buffer.PoolThreadCache$1
>>> 1305:      4      128  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
>>> 1404:      1       80  io.netty.buffer.PooledUnsafeHeapByteBuf
>>> 1473:      3       72  io.netty.buffer.PoolArena$SizeClass
>>> 1529:      1       64  io.netty.buffer.AdvancedLeakAwareCompositeByteBuf
>>> 1541:      2       64  io.netty.buffer.CompositeByteBuf$Component
>>> 1568:      1       56  io.netty.buffer.CompositeByteBuf
>>> 1896:      1       32  io.netty.buffer.PoolArena$SizeClass[]
>>> 2042:      1       24  io.netty.buffer.PooledUnsafeDirectByteBuf$1
>>> 2046:      1       24  io.netty.buffer.UnpooledByteBufAllocator
>>> 2051:      1       24  io.netty.buffer.PoolThreadCache$MemoryRegionCache$1
>>> 2078:      1       24  io.netty.buffer.PooledHeapByteBuf$1
>>> 2135:      1       24  io.netty.buffer.PooledUnsafeHeapByteBuf$1
>>> 2302:      1       16  io.netty.buffer.ByteBufUtil$1
>>> 2769:      1       16  io.netty.util.internal.__matchers__.io.netty.buffer.ByteBufMatcher
>>>
>>> My driver machine has 32 CPUs, and I currently have 15 machines in my
>>> cluster. At the moment, the error happens while processing the 5th or
>>> 6th chunk. I suspect the error depends on the number of executors and
>>> would happen earlier if we added more executors.
>>>
>>> I am trying to come up with an explanation of what is filling up the
>>> direct memory and how to quantify it as a factor of the number of
>>> executors. Our cluster is a shared cluster, and we need to understand
>>> how much driver memory to allocate for most of the jobs.
>>>
>>> Regards
>>> Sumit Chawla
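
As a rough aid for the "add up the numbers" question in the quoted message, the figures in the OutOfDirectMemoryError line can be checked with a few lines of Python. This is only a sketch: the three constants are copied from the log line in the thread, the helper name `summarize` is made up for illustration, and treating the failed 8388608-byte request as the size of one pool chunk is an assumption based on the `newChunk` frame in the trace, not something the trace states.

```python
# Figures copied from the OutOfDirectMemoryError message in the thread.
REQUESTED = 8_388_608      # bytes the allocator failed to obtain
USED = 1_023_410_176       # direct memory already reserved
MAX = 1_029_177_344        # direct memory limit reported in the message

MIB = 1024 * 1024


def summarize(requested: int, used: int, limit: int) -> dict:
    """Hypothetical helper: restate the error message in MiB and blocks.

    Assumes every earlier reservation had the same size as the failed
    request. That is a guess based on the newChunk frame in the trace.
    """
    return {
        "requested_mib": requested / MIB,
        "used_mib": used / MIB,
        "limit_mib": limit / MIB,
        "headroom_bytes": limit - used,
        "blocks_in_use": used // requested,
        "used_is_whole_blocks": used % requested == 0,
        "fits": used + requested <= limit,
    }


if __name__ == "__main__":
    for key, value in summarize(REQUESTED, USED, MAX).items():
        print(f"{key}: {value}")
```

With these inputs the used figure is exactly 122 blocks of 8 MiB (976 MiB), the limit is 981.5 MiB, and the remaining 5.5 MiB of headroom is smaller than one more block, so the next request cannot fit. This only restates the error message; it does not show which allocators or arenas hold those blocks, or how the count relates to the number of executors.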
