[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801484#comment-16801484 ]

Moein Hosseini commented on SPARK-15544:
----------------------------------------

[~kabhwan] I did some work on this issue, but it's not well tested yet.

> Bouncing Zookeeper node causes Active spark master to exit
> ----------------------------------------------------------
>
>                 Key: SPARK-15544
>                 URL: https://issues.apache.org/jira/browse/SPARK-15544
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 14.04. Zookeeper 3.4.6 with 3-node quorum
>            Reporter: Steven Lowenthal
>            Priority: Major
>
> Shutting Down a single zookeeper node caused spark master to exit. The master should have connected to a second zookeeper node.
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x154dfc0426b0054, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x254c701f28d0053, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master shutting down. }}
> {code}
> spark-env.sh:
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796825#comment-16796825 ]

Jungtaek Lim commented on SPARK-15544:
--------------------------------------

[~moein7tl] Have you made any progress? If you are no longer working on this, I would like to pick it up.
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771035#comment-16771035 ]

Moein Hosseini commented on SPARK-15544:
----------------------------------------

[~srowen] I've started to work on it. It seems to come from Curator's LeaderLatch.
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771008#comment-16771008 ]

Jungtaek Lim commented on SPARK-15544:
--------------------------------------

I'm interested in this issue, but I think the fix is not just to keep the Master from shutting down when leadership has been revoked. The event handler also has to handle the various situations that can arise in H/A, so it may require understanding how Spark H/A deals with such situations as of now.
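To make the distinction in the comment above concrete, here is a minimal sketch of the two reactions a master could have to connection events. It is a pure simulation, not Spark or Curator code: the state names mirror Curator's ConnectionState values, while `MasterModel`, `exit_on_suspend`, and `replay` are hypothetical names introduced only for illustration.

```python
from enum import Enum


class ConnState(Enum):
    """Connection states, named after Curator's ConnectionState values."""
    CONNECTED = "CONNECTED"
    SUSPENDED = "SUSPENDED"
    RECONNECTED = "RECONNECTED"
    LOST = "LOST"


class MasterModel:
    """Hypothetical model of a master reacting to connection events."""

    def __init__(self, exit_on_suspend):
        # exit_on_suspend=True models the behaviour seen in the logs:
        # a SUSPENDED event alone is enough to revoke leadership.
        self.exit_on_suspend = exit_on_suspend
        self.is_leader = True
        self.running = True

    def on_state_change(self, state):
        if not self.running:
            return
        if state is ConnState.LOST:
            # The session is gone, so another master may already be leader.
            self._revoke()
        elif state is ConnState.SUSPENDED and self.exit_on_suspend:
            self._revoke()

    def _revoke(self):
        self.is_leader = False
        self.running = False


def replay(events, exit_on_suspend):
    """Feed a sequence of events to a fresh model; return whether it survives."""
    master = MasterModel(exit_on_suspend)
    for event in events:
        master.on_state_change(event)
    return master.running
```

Replaying `[ConnState.SUSPENDED, ConnState.RECONNECTED]` (a bounced ZooKeeper node) stops the model when `exit_on_suspend` is true and leaves it running otherwise. Tolerating SUSPENDED is only one possible policy and has a cost: while the connection is suspended the master cannot confirm that it is still the leader, which is presumably part of the H/A handling the comment says needs to be understood first.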
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770444#comment-16770444 ]

Sean Owen commented on SPARK-15544:
-----------------------------------

[~moein7tl] Can you propose a fix? This is pretty much DIY for this type of component.
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770361#comment-16770361 ]

Moein Hosseini commented on SPARK-15544:
----------------------------------------

I've run into this issue on Spark 2.4.0. Let's say I have 3 nodes, each running spark-master and zookeeper, with spark-defaults.conf configured identically on all of them as below:
{code:java}
spark.driver.memory 3g
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url node1:2181,node2:2181,node3:2181
spark.deploy.zookeeper.dir /spark24-ha
{code}
When I restarted node3, which the ALIVE spark master was on, the spark-master on node2 was elected as leader, but the spark-master on node1 exited with the following error.
{code:java}
2019-02-17 14:10:09 INFO ClientCnxn:1098 - Unable to read additional data from server sessionid 0x368f5d5ba050577, likely server has closed socket, closing socket connection and attempting reconnect
2019-02-17 14:10:09 INFO ConnectionStateManager:228 - State change: SUSPENDED
2019-02-17 14:10:10 INFO ZooKeeperLeaderElectionAgent:54 - We have lost leadership
2019-02-17 14:10:10 ERROR Master:70 - Leadership has been revoked -- master shutting down.
{code}
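For reference, the expectation behind this report (one ZooKeeper node out of three can be down without losing the ensemble) follows from majority-quorum arithmetic. The sketch below parses a connection string in the format of `spark.deploy.zookeeper.url` above and computes how many servers may fail. The helper names are hypothetical, and it assumes every entry carries an explicit port and no chroot suffix.

```python
def parse_zk_url(url):
    """Split 'host:port,host:port,...' into a list of (host, port) pairs.

    Assumes every entry has an explicit port and no chroot path.
    """
    servers = []
    for entry in url.split(","):
        host, _, port = entry.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers


def tolerated_failures(ensemble_size):
    """Number of servers that can be down while a majority remains."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority


servers = parse_zk_url("node1:2181,node2:2181,node3:2181")
print(len(servers), "servers;", tolerated_failures(len(servers)), "may fail")
```

With three servers the majority is two, so a single restart leaves a working quorum, and a client holding a session should be able to reconnect to one of the remaining servers.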
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1645#comment-1645 ]

agate commented on SPARK-15544:
-------------------------------

Can we please increase the priority of this bug, since it exists in the latest Spark 2.3.0 too? We have observed this during an upgrade scenario (with Spark 1.6.3) where we have to shut down zookeeper, which has the adverse side effect of the spark master shutting down on other nodes. That is far from ideal.
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440535#comment-16440535 ]

zuotingbing commented on SPARK-15544:
-------------------------------------

cc [~vanzin]
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440511#comment-16440511 ]

zuotingbing commented on SPARK-15544:
-------------------------------------

The same issue still occurs in Spark 2.3.0. See [SPARK-23530|https://issues.apache.org/jira/browse/SPARK-23530]
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094797#comment-16094797 ]

David Kats commented on SPARK-15544:
------------------------------------

Confirming the same issue with Spark 2.1.0 and 2.2.0, Ubuntu 14.04, ZooKeeper 3.4.5:

2017-07-20 12:48:25,151 INFO ClientCnxn: Client session timed out, have not heard from server in 35022ms for sessionid 0x15d5fb6dc7d0009, closing socket connection and attempting reconnect
2017-07-20 12:48:25,254 INFO ConnectionStateManager: State change: SUSPENDED
2017-07-20 12:48:25,268 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
2017-07-20 12:48:25,295 ERROR Master: Leadership has been revoked -- master shutting down.
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762555#comment-15762555 ]

Ed Tyrrill commented on SPARK-15544:
------------------------------------

I'm going to add that this is very easy to reproduce. It happens reliably if you shut down the zookeeper node that is currently the leader. I configured systemd to automatically restart the spark master. The spark master process does start, but the spark master on all three nodes doesn't really work and continually tries to reconnect to zookeeper until I bring the stopped zookeeper node back up. Spark should be able to work with two of the three zookeeper nodes, but instead it logs messages like these every couple of seconds on all three spark master nodes until I bring back the one zookeeper node that I shut down, zk02:

2016-12-19 14:31:10.175 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk01/10.0.xx.xx:. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.176 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk01/10.0.xx.xx:, initiating session
2016-12-19 14:31:10.177 INFO org.apache.zookeeper.ClientCnxn.run - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-12-19 14:31:10.724 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk02/10.0.xx.xx:. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.725 WARN org.apache.zookeeper.ClientCnxn.run - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-12-19 14:31:10.828 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk03/10.0.xx.xx:. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.830 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk03/10.0.xx.xx:, initiating session

Zookeeper itself has selected a new leader, and Kafka, which also uses zookeeper, doesn't have any trouble during this time. Also important to note: if you shut down a non-leader zookeeper node, spark doesn't have any trouble either.
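Since the reproduction above depends on stopping the ZooKeeper node that is currently the leader, it helps to find that node first. A minimal sketch, assuming the `srvr` four-letter command is enabled on the servers (newer ZooKeeper releases require it to be whitelisted) and that they listen on the default client port 2181; the port is an assumption, as it is elided in the logs above, and the helper names are hypothetical.

```python
import socket


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line in `srvr` output, or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port=2181, timeout=2.0):
    """Send `srvr` to one server and return its mode (e.g. leader, follower).

    Returns None if the server is unreachable or does not answer.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the socket after replying
                    break
                chunks.append(data)
    except OSError:
        return None
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling `query_mode` for each of zk01, zk02 and zk03 before the test shows which node to stop in order to hit the leader case described above, and which to stop for the non-leader case that reportedly causes no trouble.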
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762360#comment-15762360 ]

Ed Tyrrill commented on SPARK-15544:

I am experiencing the same problem with Spark 1.6.2 and ZK 3.4.8 on RHEL 7. Any plans on fixing this?
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389189#comment-15389189 ]

Avik Sil commented on SPARK-15544:

I am also seeing the same issue with Spark 1.3.0, Ubuntu 14.04, and ZooKeeper 3.4.5.

We have a 3-node cluster running Spark and ZooKeeper. We also have an automatic restarter service that checks the status of the Spark master every 5 minutes and restarts it if it is not running. So when the master shuts down after its leadership is revoked, the restarter service starts the Spark master again within 5 minutes. But in *a few cases* we don't see an ALIVE Spark master on any of the three nodes, and none of the three nodes logs a "We have gained leadership" message.

From spark-defaults.conf:

spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url 192.168.42.2:28000,192.168.42.3:28000,192.168.42.4:28000
spark.deploy.recoveryDirectory /var/run/sparkmaster/df71911f-a28d-409d-977f-ea2e596ec578/recovery
spark.akka.logLifecycleEvents true
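The "no ALIVE master on any node" condition described above can be detected by polling each master rather than grepping logs. The sketch below assumes the standalone master web UI is reachable on port 8080 and serves a JSON summary at `/json` with a `status` field such as `ALIVE` or `STANDBY`; the port, the helper names, and the host list are assumptions for illustration and are not taken from the reporter's setup.

```python
import json
from urllib.request import urlopen


def parse_status(payload):
    """Extract the "status" field from a master JSON summary, or None."""
    try:
        data = json.loads(payload)
    except ValueError:
        return None
    return data.get("status") if isinstance(data, dict) else None


def fetch_status(host, port=8080, timeout=5.0):
    """Return the status reported by one master, or None if unreachable.

    Port 8080 and the /json path are assumptions about the deployment.
    """
    try:
        with urlopen(f"http://{host}:{port}/json", timeout=timeout) as resp:
            return parse_status(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return None


def alive_masters(statuses):
    """Given {host: status}, return the sorted hosts reporting ALIVE."""
    return sorted(host for host, status in statuses.items() if status == "ALIVE")
```

A check could then build `{h: fetch_status(h) for h in hosts}` for the three nodes and raise an alert when `alive_masters(...)` comes back empty for longer than the restarter interval.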
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314730#comment-15314730 ]

Shixiong Zhu commented on SPARK-15544:

As a workaround, you can write a script to restart the master if it dies.
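A minimal sketch of such a restart script follows, under stated assumptions: the master is detected with `pgrep -f` on the class name `org.apache.spark.deploy.master.Master`, and it is restarted with `sbin/start-master.sh` under `SPARK_HOME`. The function names, the 60-second interval, and the `max_checks` parameter are illustrative choices, not anything prescribed in this issue.

```python
import os
import subprocess
import time

MASTER_CLASS = "org.apache.spark.deploy.master.Master"


def master_running(pattern=MASTER_CLASS):
    """True if some process command line matches the pattern (uses pgrep -f)."""
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def start_master(spark_home):
    """Launch the standalone master with Spark's own start script."""
    script = os.path.join(spark_home, "sbin", "start-master.sh")
    subprocess.run([script], check=False)


def watch(is_running, restart, interval_s=60.0, max_checks=None, sleep=time.sleep):
    """Poll is_running(); call restart() whenever it returns False.

    Runs forever when max_checks is None. Returns the number of restarts
    issued, which is mainly useful when max_checks bounds the loop.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not is_running():
            restart()
            restarts += 1
        checks += 1
        sleep(interval_s)
    return restarts
```

It would be started with something like `watch(master_running, lambda: start_master(os.environ["SPARK_HOME"]))`. Note that, as reported in an earlier comment on this issue, restarting the process does not always result in an ALIVE master, so a watchdog like this covers the exit but not a failed election.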