[ https://issues.apache.org/jira/browse/SPARK-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439614#comment-16439614 ]

BELUGA BEHR commented on SPARK-23981:
-------------------------------------

Or perhaps lower the per-block logging to DEBUG and produce a single overall 
log message if the fetches cannot be completed.
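A minimal sketch of that suggestion: record each failed block quietly, then emit one summary line once the fetch attempt gives up. The class and method names here are illustrative only, not Spark's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative aggregator: per-block failures are buffered (and would be
// logged at DEBUG), while a single ERROR summary is produced at the end.
class AggregatedFailureReporter {
    private final List<String> failedBlockIds = new ArrayList<>();

    // Record one failed block; per-block detail goes to DEBUG, not ERROR.
    void onBlockFailure(String blockId, Throwable cause) {
        failedBlockIds.add(blockId);
    }

    // One summary line, emitted once when the whole fetch attempt fails.
    String summarize(String host, int port) {
        return "Failed to get " + failedBlockIds.size()
                + " block(s) from " + host + ":" + port;
    }
}
```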

> ShuffleBlockFetcherIterator - Spamming Logs
> -------------------------------------------
>
>                 Key: SPARK-23981
>                 URL: https://issues.apache.org/jira/browse/SPARK-23981
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> If a remote host shuffle service fails, Spark Executors produce a huge amount 
> of logging.
> {code:java}
> 2018-04-10 20:24:44,834 INFO  [Block Fetch Retry-1] 
> shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - 
> Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms
> 2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] 
> storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to 
> get block(s) from myhost.local:7337
> java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
>       at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>       at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>       at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>       at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>       at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>       at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused: 
> myhost.local/12.13.14.15:7337
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>       at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>       at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>       at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>       ... 1 more
> {code}
>  
> We can see from the code that if a block fetch fails, a "listener" is 
> notified once for each block. From the error messages above, 1753 blocks 
> were being fetched. Since the remote host has become unavailable, all of 
> them fail, and an error is logged for every single block.
>  
> {code:java|title=RetryingBlockFetcher.java}
>       if (shouldRetry(e)) {
>         initiateRetry();
>       } else {
>         for (String bid : blockIdsToFetch) {
>           listener.onBlockFetchFailure(bid, e);
>         }
>       }
> {code}
> {code:java|title=ShuffleBlockFetcherIterator.scala}
>     override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
>         logError(s"Failed to get block(s) from 
> ${req.address.host}:${req.address.port}", e)
>         results.put(new FailureFetchResult(BlockId(blockId), address, e))
>       }
> {code}
> The result is 1753 ERROR-level stack traces in the log, all printing the 
> same message:
> {quote}Failed to get block(s) from myhost.local:7337
>  ...
> {quote}
> Perhaps it would be better if the {{onBlockFetchFailure}} method signature 
> were changed to accept an entire Collection of block IDs instead of being 
> invoked one block at a time.
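A rough sketch of that proposed signature change, assuming a bulk callback replaces the per-block loop: one call (and one log line) per failed request, regardless of how many blocks it covered. The listener class below is hypothetical, not the actual Spark interface.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical bulk listener: the whole collection of failed block IDs is
// delivered in a single callback, so the error is recorded exactly once.
class BulkFailureListener {
    final List<String> errorLog = new ArrayList<>();

    // Proposed bulk variant of onBlockFetchFailure.
    void onBlockFetchFailure(Collection<String> blockIds, Throwable e,
                             String host, int port) {
        errorLog.add("Failed to get " + blockIds.size()
                + " block(s) from " + host + ":" + port);
    }
}
```

With this shape, the `for (String bid : blockIdsToFetch)` loop in RetryingBlockFetcher collapses to a single listener call, and the 1753 identical stack traces become one.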



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
