I have a fairly large job: the input is about 10 GB, but it generates about
500 GB of data over many stages.

Most tasks have completed, but a few stragglers, all on the same
node/executor, are still active (but doing nothing) after about 16 hours.

About 3 to 4 hours in, the tasks that are hanging show the following in
the work logs:

Any idea what config to tweak for this?


14/06/10 18:51:10 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.108,37693)
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
    at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
14/06/10 18:51:10 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:14 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.97,54918)
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
    at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
14/06/10 18:51:14 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
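
To see which peers these resets point at, something like the following rough sketch could tally the "Error reading from connection" warnings per remote host in a log like the one above. The function name `count_connection_errors` is made up for illustration and is not part of Spark; the regular expression only assumes the `ConnectionManagerId(host,port)` format visible in these lines.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt above:
#   "... Error reading from connection to ConnectionManagerId(172.16.25.108,37693)"
ERROR_RE = re.compile(
    r"Error reading from connection to "
    r"ConnectionManagerId\((\d+\.\d+\.\d+\.\d+),(\d+)\)"
)

def count_connection_errors(log_text):
    """Return a Counter mapping peer host -> number of read-error warnings.

    Hypothetical helper for illustration only; not a Spark API.
    """
    # Mail clients and terminals may wrap long log lines, so collapse all
    # whitespace runs into single spaces before matching.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in ERROR_RE.finditer(flat):
        counts[match.group(1)] += 1  # group 1 is the peer IP address
    return counts
```

Calling `count_connection_errors(open(path).read())` on each worker's log (with `path` pointing at that log file) would give a per-peer count, which might show whether the resets cluster around the node that holds the stragglers.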

-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
