I have a Spark cluster on Mesos, and when I run long-running GraphX processing I get a lot of the following two errors; one by one my slaves stop doing any work for the job until it's idle. Any idea what is happening?
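For reference, a minimal sketch of the kind of long-running GraphX job I mean (the master URL, paths, and the choice of PageRank are placeholders, not the actual application):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.GraphLoader

    // Sketch only: master URL, input/output paths, and the PageRank call
    // stand in for the real long-running job.
    object LongRunningGraphJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("long-running-graphx")
          .setMaster("mesos://zk://10.x.x.x:2181/mesos") // placeholder ZK quorum
        val sc = new SparkContext(conf)

        val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges") // placeholder
        // Iterative GraphX work like this runs for hours; the errors below
        // start appearing partway through, and slaves go idle one by one.
        graph.pageRank(tol = 0.0001).vertices
          .saveAsTextFile("hdfs:///path/to/ranks") // placeholder
        sc.stop()
      }
    }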
First type of error message:

    INFO SendingConnection: Initiating connection
    INFO SendingConnection: Connected
    INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@e2e30ea
    INFO ConnectionManager: Removing SendingConnection
    INFO ConnectionManager: Removing ReceivingConnection
    INFO ConnectionManager: Removing SendingConnection
    INFO ConnectionManager: Removing ReceivingConnection
    ERROR ConnectionManager: Corresponding SendingConnection
    ERROR ConnectionManager: Corresponding SendingConnection
    INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@e2e30ea
    java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
    INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@1968265a
    INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@1968265a
    java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
    INFO BlockManager: Removing broadcast 95
    INFO BlockManager: Removing broadcast 96
    INFO BlockManager: Removing broadcast 98
    INFO BlockManager: Removing broadcast 101

Second error message:

    group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
    slave.cpp:508] Slave asked to shut down by master@10...:5050 because 'health check timed out'
    slave.cpp:1406] Asked to shut down framework by master@10...:5050
    slave.cpp:1431] Shutting down framework
    slave.cpp:2878] Shutting down executor
    slave.cpp:3053] Current usage 35.12%. Max allowed age: 3.841638564773842days
    group.cpp:472] ZooKeeper session expired
    detector.cpp:138] Detected a new leader: None
    slave.cpp:582] Lost leading master
    slave.cpp:636] Detecting new master
    group.cpp:313] Group process (group(1)@10...:5051) connected to ZooKeeper
    group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
    group.cpp:385] Trying to create path '/mesos' in ZooKeeper
    detector.cpp:138] Detected a new leader: (id='16')
    slave.cpp:2948] Killing executor
    containerizer.cpp:882] Destroying container
    group.cpp:658] Trying to get '/mesos/info_0000000016' in ZooKeeper
    detector.cpp:426] A new leading master (UPID=master@10...:5050) is detected
    slave.cpp:589] New master detected at master@10...:5050
    slave.cpp:596] Skipping registration because slave is terminating

I'm on Spark 1.1 with Mesos 0.20.1.
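For what it's worth, the only Spark-side settings I can find that govern these control-plane timeouts in 1.1 are the Akka ones; a sketch of the kind of thing I mean (the value is illustrative, not what I'm actually running):

    import org.apache.spark.SparkConf

    // Illustrative only: raising spark.akka.timeout, which in Spark 1.1
    // controls how long the driver/executors wait on control messages
    // (default is 100 seconds) before giving up on a peer.
    val conf = new SparkConf()
      .set("spark.akka.timeout", "300") // seconds

    // On the Mesos side, mesos-master and mesos-slave take
    // --zk_session_timeout (default 10secs, if I remember right), which
    // looks related to the "ZooKeeper session expired" lines above.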