[jira] [Commented] (SPARK-23981) ShuffleBlockFetcherIterator - Spamming Logs

2018-04-16 Thread BELUGA BEHR (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439614#comment-16439614 ]

BELUGA BEHR commented on SPARK-23981:
-

Or maybe lower the per-block logging to DEBUG and produce one overall log 
message if the fetches cannot be completed.
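
Something along these lines, adapted from the {{RetryingBlockFetcher}} snippet 
quoted below (the slf4j {{logger}} field and the summary message text are 
illustrative only, not a proposed final form):

{code:java}
// Sketch only: emit one ERROR-level summary for the whole failed request,
// while the per-block listener callbacks are kept and their logging can be
// lowered to DEBUG on the listener side.
if (shouldRetry(e)) {
  initiateRetry();
} else {
  // Single summary line instead of one ERROR per block.
  logger.error("Failed to fetch {} outstanding blocks", blockIdsToFetch.size(), e);
  for (String bid : blockIdsToFetch) {
    listener.onBlockFetchFailure(bid, e);  // listener side could log at DEBUG
  }
}
{code}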

> ShuffleBlockFetcherIterator - Spamming Logs
> ---
>
> Key: SPARK-23981
> URL: https://issues.apache.org/jira/browse/SPARK-23981
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: BELUGA BEHR
>Priority: Major
>
> If a remote host shuffle service fails, Spark Executors produce a huge amount 
> of logging.
> {code:java}
> 2018-04-10 20:24:44,834 INFO  [Block Fetch Retry-1] 
> shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - 
> Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms
> 2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] 
> storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to 
> get block(s) from myhost.local:7337
> java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused: 
> myhost.local/12.13.14.15:7337
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   ... 1 more
> {code}
>  
> We can see from the code that if a block fetch fails, a "listener" is 
> notified once for each block. From the error messages above, it can be seen 
> that 1753 blocks were being fetched. However, since the remote host has 
> become unavailable, they all fail and an error is logged for every block.
>  
> {code:java|title=RetryingBlockFetcher.java}
>   if (shouldRetry(e)) {
> initiateRetry();
>   } else {
> for (String bid : blockIdsToFetch) {
>   listener.onBlockFetchFailure(bid, e);
> }
>   }
> {code}
> {code:java|title=ShuffleBlockFetcherIterator.scala}
> override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
>   logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
>   results.put(new FailureFetchResult(BlockId(blockId), address, e))
> }
> {code}
> So what we get here is 1753 ERROR stack traces in the log, all printing the 
> same message:
> {quote}Failed to get block(s) from myhost.local:7337
>  ...
> {quote}
> Perhaps it would be better if the {{onBlockFetchFailure}} method signature 
> were changed to accept an entire collection of block IDs instead of being 
> invoked once per block.






[jira] [Created] (SPARK-23994) Add Host To Blacklist If Shuffle Cannot Complete

2018-04-16 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created SPARK-23994:
---

 Summary: Add Host To Blacklist If Shuffle Cannot Complete
 Key: SPARK-23994
 URL: https://issues.apache.org/jira/browse/SPARK-23994
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Shuffle
Affects Versions: 2.3.0
Reporter: BELUGA BEHR


If a node cannot be reached for shuffling data, add the node to the blacklist 
and retry the current stage.

{code:java}
2018-04-10 20:25:55,065 ERROR [Block Fetch Retry-3] 
shuffle.RetryingBlockFetcher 
(RetryingBlockFetcher.java:fetchAllOutstanding(142)) - Exception while 
beginning fetch of 711 outstanding blocks (after 3 retries)
java.io.IOException: Failed to connect to host.local/10.11.12.13:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused: 
host.local/10.11.12.13:7337
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
{code}
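
As a rough illustration of the requested behavior (this is a standalone sketch, 
not Spark scheduler code; all names are hypothetical):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy example: track hosts whose shuffle service could not be reached so
// that future task placement and shuffle fetches can avoid them, and tell
// the caller whether the current stage should be resubmitted.
public class FetchFailureBlacklist {
  private final Set<String> blacklistedHosts = ConcurrentHashMap.newKeySet();

  /** Record a fetch failure; returns true if the stage should be retried. */
  public boolean onFetchFailure(String host) {
    // Only trigger a stage retry the first time a host is blacklisted.
    return blacklistedHosts.add(host);
  }

  public boolean isBlacklisted(String host) {
    return blacklistedHosts.contains(host);
  }
}
{code}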






[jira] [Created] (SPARK-23981) ShuffleBlockFetcherIterator - Spamming Logs

2018-04-13 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created SPARK-23981:
---

 Summary: ShuffleBlockFetcherIterator - Spamming Logs
 Key: SPARK-23981
 URL: https://issues.apache.org/jira/browse/SPARK-23981
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.3.0
Reporter: BELUGA BEHR


If a remote host shuffle service fails, Spark Executors produce a huge amount 
of logging.
{code:java}
2018-04-10 20:24:44,834 INFO  [Block Fetch Retry-1] 
shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - 
Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms

2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] 
storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to 
get block(s) from myhost.local:7337
java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused: 
myhost.local/12.13.14.15:7337
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
{code}
 

We can see from the code that if a block fetch fails, a "listener" is notified 
once for each block. From the error messages above, it can be seen that 1753 
blocks were being fetched. However, since the remote host has become 
unavailable, they all fail and an error is logged for every block.

 
{code:java|title=RetryingBlockFetcher.java}
  if (shouldRetry(e)) {
initiateRetry();
  } else {
for (String bid : blockIdsToFetch) {
  listener.onBlockFetchFailure(bid, e);
}
  }
{code}
{code:java|title=ShuffleBlockFetcherIterator.scala}
override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
  logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
  results.put(new FailureFetchResult(BlockId(blockId), address, e))
}
{code}
So what we get here is 1753 ERROR stack traces in the log, all printing the 
same message:
{quote}Failed to get block(s) from myhost.local:7337
 ...
{quote}

Perhaps it would be better if the {{onBlockFetchFailure}} method signature were 
changed to accept an entire collection of block IDs instead of being invoked 
once per block.
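
One possible shape for that change, shown here only as an illustration (the 
interface is simplified and the batch method with a default implementation is 
hypothetical, not the existing Spark API):

{code:java}
import java.util.Collection;

// Simplified, hypothetical listener: failures for a whole request are
// reported in one callback, so the receiver can log a single message
// instead of one per block.
interface BlockFetchingListener {
  // Existing per-block callback, kept for compatibility.
  void onBlockFetchFailure(String blockId, Throwable exception);

  // Proposed batch callback; the default falls back to per-block calls
  // for listeners that do not override it.
  default void onBlockFetchFailure(Collection<String> blockIds, Throwable exception) {
    for (String blockId : blockIds) {
      onBlockFetchFailure(blockId, exception);
    }
  }
}
{code}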


