[ https://issues.apache.org/jira/browse/YARN-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144360#comment-16144360 ]
Jason Lowe commented on YARN-7110:
----------------------------------

Looks like Varun was driving that effort but may be busy with other work. Feel free to ping him on that JIRA for the current status. It makes more sense to keep the discussion there, where there is already earlier discussion, a draft of a proposed design, and many people watching that ticket. Splitting the discussion across that ticket and this one does not make sense, so I am closing this as a duplicate.

As for the urgent need: we ran into something similar and fixed the Spark shuffle handler itself. That is the urgent fix you need today. Migrating the handler out of the NM into a separate process does not really solve this particular issue; if the Spark shuffle handler's memory is going to explode, it only changes what explodes with it. It would be nice if the failure destroyed just the Spark handler instead of the NM process, but a cluster running mostly Spark is still hosed if none of the shuffle handlers are running.

The NM supports a work-preserving restart, so you could also consider placing your NMs under supervision so they are restarted if they crash. When doing this you will probably want to set yarn.nodemanager.recovery.supervised=true to inform the NM that it can rely on something to restart it in a timely manner if it goes down due to an error. This is not as preferable as fixing the problem in the Spark shuffle handler directly, but it is an option to help your situation in the short term.
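For reference, enabling work-preserving restart under supervision is a yarn-site.xml change along these lines. This is a sketch, not from this thread; the recovery directory path is an example and must point at a local directory that survives NM restarts:

```xml
<!-- yarn-site.xml: NM work-preserving restart with external supervision -->
<property>
  <!-- persist NM state so containers survive an NM restart -->
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- local path where the NM stores its recovery state (example path) -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <!-- tell the NM an external supervisor (e.g. systemd or monit) will
       restart it promptly if it exits on an error -->
  <name>yarn.nodemanager.recovery.supervised</name>
  <value>true</value>
</property>
```

With recovery.supervised set, the NM assumes it will be restarted quickly and does not clean up running containers on an unexpected exit, so only enable it when a supervisor is actually in place.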
> NodeManager always crash for spark shuffle service out of memory
> ----------------------------------------------------------------
>
>                 Key: YARN-7110
>                 URL: https://issues.apache.org/jira/browse/YARN-7110
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: YunFan Zhou
>            Priority: Critical
>         Attachments: screenshot-1.png
>
>
> The NM often crashes due to the Spark shuffle service. I saw many error log
> messages before the NM crashed:
> {noformat}
> 2017-08-28 16:14:20,521 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=791888824460, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/data11/hadoopdata/nodemanager/local/usercache/map_loc/appcache/application_1502793246072_2171283/blockmgr-11e2d625-8db1-477c-9365-4f6d0a7d1c48/10/shuffle_0_6_0.data, offset=27063401500, length=64785602}} to /10.93.91.17:18958; closing connection
> java.io.IOException: Broken pipe
> 	at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
> 	at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
> 	at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
> 	at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
> 	at org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96)
> 	at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:92)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254)
> 	at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761)
> 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:317)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:519)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> 	at java.lang.Thread.run(Thread.java:745)
> 2017-08-28 16:14:20,523 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result RpcResponse{requestId=7652091066050104512, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to /10.93.91.17:18958; closing connection
> {noformat}
> Eventually there were too many *Finalizer* objects in the *NM* process, causing the OOM.
> !screenshot-1.png!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)