I have a Spark cluster on Mesos, and when I run a long-running GraphX job I get
a lot of the following two kinds of errors. One by one my slaves stop doing any
work for the job until it sits idle. Any idea what is happening?
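
For context, the driver is set up roughly like this (simplified; the ZooKeeper
address and input path are placeholders, and the real algorithm differs):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx._

    object LongRunningGraphXJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("long-running-graphx")
          .setMaster("mesos://zk://10.x.x.x:2181/mesos") // placeholder ZK address
        val sc = new SparkContext(conf)

        // Iterative GraphX algorithm that keeps the slaves busy for hours
        val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges") // placeholder path
        println(graph.connectedComponents().vertices.count())

        sc.stop()
      }
    }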

First type of error message:

INFO SendingConnection: Initiating connection 
INFO SendingConnection: Connected 
INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@e2e30ea
INFO ConnectionManager: Removing SendingConnection 
INFO ConnectionManager: Removing ReceivingConnection 
INFO ConnectionManager: Removing SendingConnection
INFO ConnectionManager: Removing ReceivingConnection 
ERROR ConnectionManager: Corresponding SendingConnection 
ERROR ConnectionManager: Corresponding SendingConnection 
INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@e2e30ea
java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@1968265a
INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@1968265a
java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
INFO BlockManager: Removing broadcast 95
INFO BlockManager: Removing broadcast 96
INFO BlockManager: Removing broadcast 98
INFO BlockManager: Removing broadcast 101
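
From the Spark 1.1 configuration docs, the only knob that looks related to
these dropped connections is spark.core.connection.ack.wait.timeout (default
60 seconds). Would raising it help? A minimal sketch of what I mean; the value
of 600 is just a guess:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.core.connection.ack.wait.timeout", "600") // seconds; 600 is a guess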


Second type of error message:

group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
slave.cpp:508] Slave asked to shut down by master@10...:5050 because 'health check timed out'
slave.cpp:1406] Asked to shut down framework  by master@10...:5050
slave.cpp:1431] Shutting down framework 
slave.cpp:2878] Shutting down executor
slave.cpp:3053] Current usage 35.12%. Max allowed age: 3.841638564773842days
group.cpp:472] ZooKeeper session expired
detector.cpp:138] Detected a new leader: None
slave.cpp:582] Lost leading master
slave.cpp:636] Detecting new master
group.cpp:313] Group process (group(1)@10...:5051) connected to ZooKeeper
group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
group.cpp:385] Trying to create path '/mesos' in ZooKeeper
detector.cpp:138] Detected a new leader: (id='16')
slave.cpp:2948] Killing executor 
containerizer.cpp:882] Destroying container 
group.cpp:658] Trying to get '/mesos/info_0000000016' in ZooKeeper
detector.cpp:426] A new leading master (UPID=master@10...:5050) is detected
slave.cpp:589] New master detected at master@10...:5050
slave.cpp:596] Skipping registration because slave is terminating


I'm on Spark 1.1 with Mesos 0.20.1.
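
One theory I want to rule out is long GC pauses on the executors causing the
ZooKeeper session to expire and the health check to time out, so I'm planning
to turn on GC logging, roughly like this (standard HotSpot flags):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")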


