[ https://issues.apache.org/jira/browse/YARN-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144360#comment-16144360 ]

Jason Lowe commented on YARN-7110:
----------------------------------

Looks like Varun was driving that effort but may be busy with other work.  Feel 
free to ping him on that JIRA for the current status.  It makes more sense to 
keep the discussion there, where there is already earlier discussion, a draft 
of a proposed design, and many people watching that ticket.  Splitting the 
discussion across that ticket and this one does not make sense.  Closing this 
as a duplicate.

As for the urgent need, we ran into something similar and fixed the Spark 
shuffle handler.  That's the urgent fix you need today.  Migrating it out of 
the NM to a separate process doesn't really solve this particular issue.  If 
the Spark shuffle handler's memory is going to explode, it just changes what 
explodes with it.  It would be nice if it just destroyed the Spark handler 
instead of the NM process, but a cluster running mostly Spark is still hosed if 
none of the shuffle handlers are running.  The NM supports a work-preserving 
restart, so you could also consider placing your NMs under supervision so they 
are restarted if they crash.  When doing this you will probably want to set 
yarn.nodemanager.recovery.supervised=true to inform the NM that it can rely on 
something to restart it in a timely manner if it goes down due to an error.  
Not as preferable as fixing the problem in the Spark shuffle handler directly, 
but it is an option to help your situation in the short term.
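
For reference, a minimal yarn-site.xml sketch of the work-preserving restart 
settings mentioned above; the recovery directory path is only an illustrative 
example and should point at stable local storage on each node.

{noformat}
<!-- yarn-site.xml (sketch): enable NM work-preserving restart under supervision -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- local state store kept across restarts; example path only -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <!-- tells the NM an external supervisor will restart it promptly after a crash -->
  <name>yarn.nodemanager.recovery.supervised</name>
  <value>true</value>
</property>
{noformat}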


> NodeManager always crash for spark shuffle service out of memory
> ----------------------------------------------------------------
>
>                 Key: YARN-7110
>                 URL: https://issues.apache.org/jira/browse/YARN-7110
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: YunFan Zhou
>            Priority: Critical
>         Attachments: screenshot-1.png
>
>
> The NM often crashes due to the Spark shuffle service.  I saw many error log 
> messages before the NM crashed:
> {noformat}
> 2017-08-28 16:14:20,521 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=791888824460, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/data11/hadoopdata/nodemanager/local/usercache/map_loc/appcache/application_1502793246072_2171283/blockmgr-11e2d625-8db1-477c-9365-4f6d0a7d1c48/10/shuffle_0_6_0.data, offset=27063401500, length=64785602}} to /10.93.91.17:18958; closing connection
> java.io.IOException: Broken pipe
>         at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
>         at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
>         at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
>         at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
>         at org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96)
>         at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:92)
>         at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254)
>         at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
>         at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761)
>         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:317)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:519)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-08-28 16:14:20,523 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result RpcResponse{requestId=7652091066050104512, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to /10.93.91.17:18958; closing connection
> {noformat}
> Eventually, too many *Finalizer* objects accumulate in the *NM* process, 
> causing the OOM.
> !screenshot-1.png!
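
A heap histogram of the NM process is one way to confirm the *Finalizer* 
buildup described above; a sketch, assuming JDK tools are available on the 
node and <nm-pid> is the NodeManager JVM's pid:

{noformat}
# Class histogram of the NM heap; a large and growing count of
# java.lang.ref.Finalizer instances points at the leak reported here.
jmap -histo <nm-pid> | grep Finalizer
{noformat}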



