I have a somewhat large job (10 GB of input data, but it generates about 500 GB of data after many stages).
Most tasks have completed, but a few stragglers on the same node/executor are still active (but doing nothing) after about 16 hours. At about 3 to 4 hours in, the tasks that are hanging logged the following in the work logs. Any idea what config to tweak for this?

14/06/10 18:51:10 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.108,37693)
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
        at sun.nio.ch.IOUtil.read(IOUtil.java:224)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
        at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
        at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
14/06/10 18:51:10 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:14 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.97,54918)
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
        at sun.nio.ch.IOUtil.read(IOUtil.java:224)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
        at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
        at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
14/06/10 18:51:14 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io <suren.hira...@sociocast.com>
W: www.velos.io