It looks like there was eventually some kind of reset or timeout, and the
tasks have been reassigned. I'm guessing they'll keep failing until they hit
the max failure count.

The machine it disconnected from was a remote machine, though I've seen
similar failures on connections from a machine to itself, alongside other
problems. The log lines from the remote machine are also below.

Any thoughts or guesses would be appreciated!

*"HUNG" WORKER*

14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626)
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
        at sun.nio.ch.IOUtil.read(IOUtil.java:224)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
        at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
        at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found


*REMOTE WORKER*

14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
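
For anyone trying to line up the two logs above, here is a rough sketch of how the connection events could be tallied per remote peer. It is not part of Spark: the function name summarize_connection_events and both regular expressions are assumptions made for illustration, based only on the line format seen in the excerpts above. A ConnectionManagerId that appears on a wrapped continuation line is attributed to the most recent log header.

```python
import re
from collections import Counter

# Assumed header format, taken from the excerpts above, e.g.
# "14/06/18 19:41:18 WARN network.ReceivingConnection: ..."
HEADER_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) ([\w.$]+):"
)

# Assumed peer format, e.g. "ConnectionManagerId(172.16.25.103,57626)"
PEER_RE = re.compile(r"ConnectionManagerId\(([\d.]+),(\d+)\)")


def summarize_connection_events(lines):
    """Count log records that mention a peer, keyed by (host:port, level).

    Hypothetical helper: it only understands the line shapes shown in
    this thread and ignores everything else (such as stack frames).
    """
    counts = Counter()
    level = None  # level of the most recent log header seen
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            level = header.group(2)
        # A peer may sit on the header line or on a wrapped continuation.
        for host, port in PEER_RE.findall(line):
            if level is not None:
                counts[(f"{host}:{port}", level)] += 1
    return counts
```

Feeding it an open log file, for example summarize_connection_events(open("worker.log")) with a hypothetical file name, and comparing the result between the two workers should show which peers account for the WARN entries.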



On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:

> I have a flow that ends with saveAsTextFile() to HDFS.
>
> It seems all the expected files per partition have been written out, based
> on the number of part files and the file sizes.
>
> But the driver logs show 2 tasks still not completed and no activity,
> and the worker logs have shown no activity for those two tasks for a while now.
>
> Has anyone run into this situation? It's happened to me a couple of times
> now.
>
> Thanks.
>
> -- Suren
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io
>
>


