BELUGA BEHR created SPARK-23981:
-----------------------------------

             Summary: ShuffleBlockFetcherIterator - Spamming Logs
                 Key: SPARK-23981
                 URL: https://issues.apache.org/jira/browse/SPARK-23981
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 2.3.0
            Reporter: BELUGA BEHR


If a remote host shuffle service fails, Spark Executors produce a huge amount 
of logging.
{code:java}
2018-04-10 20:24:44,834 INFO  [Block Fetch Retry-1] 
shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - 
Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms

2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] 
storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to 
get block(s) from myhost.local:7337
java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
        at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused: 
myhost.local/12.13.14.15:7337
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 more
{code}
 

We can see from the code, that if a block fetch fails, a "listener" is updated 
once for each block. From the error messages previously, it can be seen that 
1753 blocks were being fetched. However, since the remote host has become 
unavailable, they all fail and every block is alerted on.

 
{code:java|title=RetryingBlockFetcher.java}
      if (shouldRetry(e)) {
        initiateRetry();
      } else {
        for (String bid : blockIdsToFetch) {
          listener.onBlockFetchFailure(bid, e);
        }
      }
{code}
{code:java|title=ShuffleBlockFetcherIterator.scala}
    override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        logError(s"Failed to get block(s) from 
${req.address.host}:${req.address.port}", e)
        results.put(new FailureFetchResult(BlockId(blockId), address, e))
      }
{code}
So what we get here, is 1753 ERROR stack traces in the logging all printing the 
same message:
{quote}Failed to get block(s) from myhost.local:7337
 ...
{quote}

Perhaps it would be better if the method signature {{onBlockFetchFailure}} was 
changed to accept an entire Collection of block IDs instead of one-by-one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to