[ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-30512.
-----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

This could be pulled back into branch-2.X as well.

> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30512
>                 URL: https://issues.apache.org/jira/browse/SPARK-30512
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.0.0
>            Reporter: Chandni Singh
>            Priority: Major
>             Fix For: 3.0.0
>
>
> We have been seeing a large number of SASL authentication RPC requests 
> time out with the external shuffle service.
>  The issue and all the analysis we did are described here:
>  [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to the netty pipeline and found that even 
> channel registration is delayed by 30 seconds.
>  In the Spark external shuffle service, the boss event loop group and the 
> worker event loop group are the same, which causes this delay.
> {code:java}
>     EventLoopGroup bossGroup = NettyUtils.createEventLoop(
>       ioMode, conf.serverThreads(), conf.getModuleName() + "-server");
>     EventLoopGroup workerGroup = bossGroup;
>     bootstrap = new ServerBootstrap()
>       .group(bossGroup, workerGroup)
>       .channel(NettyUtils.getServerChannelClass(ioMode))
>       .option(ChannelOption.ALLOCATOR, allocator)
>       .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> When load on the shuffle service increases, the worker threads are busy 
> with existing channels, so registering new channels is delayed.
> The fix is simple: I created a dedicated boss event loop group with 1 
> thread.
> {code:java}
>     EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
>       conf.getModuleName() + "-boss");
>     EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode,
>       conf.serverThreads(), conf.getModuleName() + "-server");
>     bootstrap = new ServerBootstrap()
>       .group(bossGroup, workerGroup)
>       .channel(NettyUtils.getServerChannelClass(ioMode))
>       .option(ChannelOption.ALLOCATOR, allocator)
>       .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> This fixed the issue.
>  We just need 1 thread in the boss group because there is only a single 
> server bootstrap.
>  
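
The effect described above can be reproduced without Netty. The following is a minimal sketch that uses plain JDK thread pools as a stand-in for event loop groups. It is an analogy, not Spark or Netty code. The class name BossGroupSketch, the method acceptRunsWhileWorkersBusy, the pool size of 2, and the 500 ms wait are all hypothetical choices made for illustration. With a shared pool, the "accept" task queues behind busy worker tasks. With a dedicated single-thread boss pool, it runs right away.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BossGroupSketch {

    /**
     * Saturates a small worker pool, then submits an "accept" task either to
     * the same pool (shared) or to a dedicated single-thread pool (boss).
     * Returns true if the accept task ran while every worker was still busy.
     */
    static boolean acceptRunsWhileWorkersBusy(boolean dedicatedBoss) {
        int workerThreads = 2; // hypothetical size, stands in for conf.serverThreads()
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        ExecutorService boss =
            dedicatedBoss ? Executors.newSingleThreadExecutor() : workers;

        CountDownLatch workersBusy = new CountDownLatch(workerThreads);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch accepted = new CountDownLatch(1);
        try {
            // Occupy every worker thread, as existing channels would under load.
            for (int i = 0; i < workerThreads; i++) {
                workers.execute(() -> {
                    workersBusy.countDown();
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            workersBusy.await();

            // Stand-in for registering a newly accepted channel.
            boss.execute(accepted::countDown);

            // Wait a bounded time for the accept task while workers stay blocked.
            return accepted.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // unblock the workers so the pools can drain
            workers.shutdown();
            if (boss != workers) {
                boss.shutdown();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("shared group, accept ran while busy: "
            + acceptRunsWhileWorkersBusy(false));
        System.out.println("dedicated boss, accept ran while busy: "
            + acceptRunsWhileWorkersBusy(true));
    }
}
```

Under these assumptions the shared case should report false and the dedicated case true. A real Netty event loop multiplexes many channels on one thread rather than blocking on a single task, so the delay in production comes from long-running channel work on the loop that also owns the server channel, but the queuing effect is the same in kind.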



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
