Yes, it happens quite often. I would add this :

-Dspark.worker.timeout=200

It depends a lot on your network/disk performance

When I did my first tries on EC2, all my workers where dropping like flies. Just pushing the timeout should be enough

Guillaume


Has anyone had this happen before?

Roshan


On Tue, Feb 4, 2014 at 9:13 PM, Roshan Nair <ros...@indix.com> wrote:
Hi,

I have a 10-node spark (0.8.1 with cdh) cluster. My job hangs at some point, in that the driver UI is still available at port 4040, but the master UI says the job failed.

The job seems to always hang in a reduceByKey(or reduceByKeyLocally).

Some of my configuration parameters are - 
-Dspark.akka.frameSize=160 -Dspark.akka.timeout=100 -Dspark.akka.askTimeout=30

The job hangs with the driver saying an executor was lost. Within a few seconds, it loses all executors. 
Some of the executors say that a SendingConnectionManagerId was not found. I've noticed that in all cases, the executors don't log for 15 seconds before they log some more and then finish.

Driver log: 
14/02/04 14:12:46 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 3)
14/02/04 14:12:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 7 from BlockManagerMaster.
14/02/04 14:12:46 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor
14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Serialized task 10.0:49 as 6065714 bytes in 9 ms
14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Starting task 10.0:54 as TID 881 on executor 9: node1.staging (PROCESS_LOCAL)
14/02/04 14:12:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 4
14/02/04 14:12:46 INFO scheduler.Stage: Stage 6 is now unavailable on executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 5
14/02/04 14:12:46 INFO client.Client$ClientActor: Executor updated: app-20140204140954-0076/7 is now FAILED (Command exited with code 137)
14/02/04 14:12:46 INFO scheduler.Stage: Stage 1 is now unavailable on executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 6

Last log lines from first executor that was lost:
14/02/04 14:12:28 DEBUG BlockManager: Getting local block rdd_4_22
14/02/04 14:12:28 DEBUG BlockManager: Level for block rdd_4_22 is StorageLevel(false, true, true, 1)
14/02/04 14:12:28 DEBUG BlockManager: Getting block rdd_4_22 from memory
14/02/04 14:12:28 INFO BlockManager: Found block rdd_4_22 locally
14/02/04 14:12:29 DEBUG Executor: Task 876's epoch is 3
14/02/04 14:12:29 DEBUG CacheManager: Looking for partition rdd_4_23
14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_23
14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_23 is StorageLevel(false, true, true, 1)
14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_23 from memory
14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_23 locally
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@469c17ba
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@469c17ba
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@4ed29790
14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@4ed29790
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys

Final few logs from another executor:
14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_79
14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_79 is StorageLevel(false, true, true, 1)
14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_79 from memory
14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_79 locally
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/02/04 14:12:46 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(10.0.15.122,58862)
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:236)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:279)
at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
14/02/04 14:12:46 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys

Thanks in advance!

Roshan



--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05

Reply via email to