[
https://issues.apache.org/jira/browse/YARN-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144360#comment-16144360
]
Jason Lowe commented on YARN-7110:
----------------------------------
Looks like Varun was driving that effort but may be busy with other work. Feel
free to ping him on that JIRA for the current status. It makes more sense to
keep the discussion there, where there is already earlier discussion, a draft of
a proposed design, and many people watching that ticket. Splitting the
discussion across that ticket and this one does not make sense. Closing this as
a duplicate.
As for the urgent need, we ran into something similar and fixed the Spark
shuffle handler. That's the urgent fix you need today. Migrating it out of
the NM to a separate process doesn't really solve this particular issue. If
the Spark shuffle handler's memory is going to explode, it just changes what
explodes with it. It would be nice if it only took down the Spark handler
instead of the NM process, but a cluster running mostly Spark is still hosed if
none of the shuffle handlers are running. The NM supports a work-preserving
restart, so you could also consider placing your NMs under supervision so they
are restarted if they crash. When doing this you will probably want to set
yarn.nodemanager.recovery.supervised=true to inform the NM that it can rely on
something to restart it in a timely manner if it goes down due to an error.
This is not as good as fixing the problem in the Spark shuffle handler directly,
but it is an option to help your situation in the short term.
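For reference, a minimal yarn-site.xml sketch of that supervised, work-preserving
restart setup might look like the following. Treat it as a sketch, not a drop-in
config: the recovery directory and the NM port below are placeholders, and the
exact property set should be checked against the NM restart documentation for
your Hadoop version.
{noformat}
<!-- Sketch only: enable work-preserving NM recovery under external supervision. -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Local directory where the NM keeps its recovery state (placeholder path). -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <!-- Tell the NM it can assume an external supervisor will restart it promptly. -->
  <name>yarn.nodemanager.recovery.supervised</name>
  <value>true</value>
</property>
<property>
  <!-- Pin the NM to a fixed port so a restarted NM comes back on the same address (placeholder port). -->
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
{noformat}
The supervisor itself can be whatever your deployment already uses to keep
daemons up (systemd, monit, etc.); the supervised property only tells the NM
that such a supervisor exists.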
> NodeManager always crash for spark shuffle service out of memory
> ----------------------------------------------------------------
>
> Key: YARN-7110
> URL: https://issues.apache.org/jira/browse/YARN-7110
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: YunFan Zhou
> Priority: Critical
> Attachments: screenshot-1.png
>
>
> The NM often crashes due to the Spark shuffle service. I saw many error log
> messages before the NM crashed:
> {noformat}
> 2017-08-28 16:14:20,521 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=791888824460, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/data11/hadoopdata/nodemanager/local/usercache/map_loc/appcache/application_1502793246072_2171283/blockmgr-11e2d625-8db1-477c-9365-4f6d0a7d1c48/10/shuffle_0_6_0.data, offset=27063401500, length=64785602}} to /10.93.91.17:18958; closing connection
> java.io.IOException: Broken pipe
>     at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
>     at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
>     at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
>     at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
>     at org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96)
>     at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:92)
>     at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254)
>     at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
>     at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761)
>     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:317)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:519)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     at java.lang.Thread.run(Thread.java:745)
> 2017-08-28 16:14:20,523 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result RpcResponse{requestId=7652091066050104512, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to /10.93.91.17:18958; closing connection
> {noformat}
> Ultimately, there are too many *Finalizer* objects in the *NM* process, which
> causes the OOM.
> !screenshot-1.png!