I am using Spark 1.1.1 and seeing an issue that only appears when I run in standalone cluster mode with at least 2 workers; the workers are on separate physical machines. I am performing a simple join on 2 RDDs, and after the join I call first() on the joined RDD (in Scala) to get the first result. When first() runs on Worker A it works fine; when it runs on Worker B I get a 'Fetch Failure' error.

I looked at the stderr log in the work directory for Worker B. It shows the following exception:

    INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 2 remote fetches in 2 ms
    ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(, )
    java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
        at org.apache.spark.network.ConnectionManager$$anon$10$$anonfun$run$15.apply(ConnectionManager.scala:866)
    .....

Worker B is trying to connect to the ConnectionManager for the BlockManager on Worker A. It manages to connect, but the request always times out. When I try to connect via telnet I see the same thing: the connection is established, but I don't get anything back from the host.

I noticed that two other people reported this issue <http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Freezing-while-running-TPC-H-query-5-td14902.html>. Unfortunately there was no meaningful progress there.
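For reference, here is a minimal sketch of the kind of job that triggers the problem. The app name, master URL, and RDD contents are just placeholders, not my real data or cluster addresses:

    import org.apache.spark.{SparkConf, SparkContext}

    object JoinFirstRepro {
      def main(args: Array[String]): Unit = {
        // Standalone cluster master; the two workers run on separate machines.
        val conf = new SparkConf()
          .setAppName("JoinFirstRepro")
          .setMaster("spark://master-host:7077")
        val sc = new SparkContext(conf)

        // Two small pair RDDs standing in for my real inputs.
        val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
        val right = sc.parallelize(Seq((1, "x"), (2, "y"), (3, "z")))

        // The join forces a shuffle; first() then triggers the remote block
        // fetch. When that task is scheduled on Worker B it fails with the
        // 'Fetch Failure' / ack timeout shown in the log above.
        val joined = left.join(right)
        println(joined.first())

        sc.stop()
      }
    }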