[ https://issues.apache.org/jira/browse/SPARK-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439614#comment-16439614 ]
BELUGA BEHR commented on SPARK-23981:
-------------------------------------

Or maybe lower the per-block logging to debug level and produce one overall log message if the fetches cannot be completed.

> ShuffleBlockFetcherIterator - Spamming Logs
> -------------------------------------------
>
>                 Key: SPARK-23981
>                 URL: https://issues.apache.org/jira/browse/SPARK-23981
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> If a remote host's shuffle service fails, Spark executors produce a huge amount of logging.
> {code:java}
> 2018-04-10 20:24:44,834 INFO  [Block Fetch Retry-1] shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms
> 2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to get block(s) from myhost.local:7337
> java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>         at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>         at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused: myhost.local/12.13.14.15:7337
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         ... 1 more
> {code}
>
> We can see from the code that if a block fetch fails, a "listener" is notified once for each block. The error messages above show that 1753 blocks were being fetched. Since the remote host has become unavailable, every one of those fetches fails and every single block is alerted on.
>
> {code:java|title=RetryingBlockFetcher.java}
> if (shouldRetry(e)) {
>   initiateRetry();
> } else {
>   for (String bid : blockIdsToFetch) {
>     listener.onBlockFetchFailure(bid, e);
>   }
> }
> {code}
> {code:java|title=ShuffleBlockFetcherIterator.scala}
> override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
>   logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
>   results.put(new FailureFetchResult(BlockId(blockId), address, e))
> }
> {code}
> So what we get here is 1753 ERROR stack traces in the log, all printing the same message:
> {quote}Failed to get block(s) from myhost.local:7337
> ...
> {quote}
> Perhaps it would be better if the signature of {{onBlockFetchFailure}} were changed to accept an entire collection of block IDs instead of a single ID at a time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
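The proposed signature change could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Spark API: the class name, the helper method, and the message format are made up for the example; only the host/port from the log above and the idea of passing the whole collection of block IDs to the failure callback come from the ticket.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch (NOT the real Spark API): report a whole batch of failed
// block IDs with one aggregate message instead of one ERROR line per block.
public class FetchFailureDemo {

    // Hypothetical listener whose failure callback takes the full collection,
    // as the ticket proposes for onBlockFetchFailure.
    interface BlockFetchingListener {
        void onBlockFetchFailure(List<String> blockIds, Throwable e);
    }

    // Hypothetical helper that builds a single summary line for the batch.
    static String aggregateFailureMessage(List<String> blockIds, String host, int port) {
        return "Failed to get " + blockIds.size() + " block(s) from " + host + ":" + port;
    }

    public static void main(String[] args) {
        List<String> blockIdsToFetch = Arrays.asList("shuffle_0_1_2", "shuffle_0_1_3");
        Throwable cause = new java.io.IOException(
                "Failed to connect to myhost.local/10.11.12.13:7337");

        BlockFetchingListener listener = (blockIds, e) ->
                // One ERROR line for the whole batch instead of 1753 identical traces.
                System.err.println(aggregateFailureMessage(blockIds, "myhost.local", 7337)
                        + ": " + e);

        // Replaces the per-block loop shown in RetryingBlockFetcher.java above.
        listener.onBlockFetchFailure(blockIdsToFetch, cause);
    }
}
```

With this shape, the error and its stack trace would be logged once per failed request rather than once per block, which also addresses the comment's suggestion of a single overall message.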