Looks like eventually there was some kind of reset or timeout and the tasks have been reassigned. I'm guessing they'll keep failing until they hit the max failure count.
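For reference, here is a minimal sketch of the kind of flow and settings I have in mind. The property names are standard Spark config keys, but the app name, HDFS paths, values, and the map step are just illustrative, not our actual job:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only -- the property names are real Spark settings,
    // the values and paths are hypothetical.
    val conf = new SparkConf()
      .setAppName("SaveAsTextFileRepro")
      // Each task is retried up to this many times before the job is aborted.
      .set("spark.task.maxFailures", "4")
      // Speculation can relaunch a straggling copy of a task that appears hung.
      .set("spark.speculation", "true")

    val sc = new SparkContext(conf)

    // A flow that ends with saveAsTextFile() to HDFS, as in the quoted message.
    sc.textFile("hdfs:///input/records")
      .map(line => line.toLowerCase)   // stand-in for the real transformations
      .saveAsTextFile("hdfs:///output/records")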
The machine it disconnected from was a remote machine, though I've seen similar failures on connections to itself with other problems. The log lines from the remote machine are also below. Any thoughts or guesses would be appreciated!

*"HUNG" WORKER*

14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626)
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
        at sun.nio.ch.IOUtil.read(IOUtil.java:224)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
        at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
        at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found

*REMOTE WORKER*

14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found

On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:

> I have a flow that ends with saveAsTextFile() to HDFS.
>
> It seems all the expected files per partition have been written out, based
> on the number of part files and the file sizes.
>
> But the driver logs show 2 tasks still not completed, with no activity,
> and the worker logs show no activity for those two tasks for a while now.
>
> Has anyone run into this situation? It's happened to me a couple of times
> now.
>
> Thanks.
>
> -- Suren
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io