[jira] [Commented] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master

2020-12-31 Thread Saloni (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256965#comment-17256965
 ] 

Saloni commented on SPARK-33943:


If we increase the timeouts/number of retries, will that resolve the issue, i.e. 
will it ensure that the ZooKeeper LeaderElection Agent is called?

The crux of it boils down to understanding why, after the session is 
successfully established, the LeaderElection Agent is not called.
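
For reference, here is a minimal sketch of how I understand the Master-side 
Curator client gets its timeouts and retry policy (paraphrased from the 
SparkCuratorUtil helper as I recall it, so the exact constants and names are 
assumptions on my part; the 15000 ms connection timeout does match the 
"timeout (15000)" printed in the logs). If these values are indeed hard-coded, 
increasing the retries/timeouts would mean a code change rather than a 
spark-defaults.conf change:
{code:scala}
// Hedged sketch (not the actual Spark source): approximate construction of the
// Curator client used by the standalone Master when spark.deploy.recoveryMode=ZOOKEEPER.
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.retry.ExponentialBackoffRetry

object CuratorTimeoutsSketch {
  // Assumed hard-coded defaults; 15000 ms matches "timeout (15000)" in the logs.
  val ZkConnectionTimeoutMillis = 15000
  val ZkSessionTimeoutMillis = 60000
  val RetryWaitMillis = 5000
  val MaxReconnectAttempts = 3

  def newClient(connectString: String): CuratorFramework = {
    // The connect string comes from spark.deploy.zookeeper.url.
    val client = CuratorFrameworkFactory.newClient(
      connectString,
      ZkSessionTimeoutMillis,
      ZkConnectionTimeoutMillis,
      new ExponentialBackoffRetry(RetryWaitMillis, MaxReconnectAttempts))
    client.start()
    client
  }
}
{code}
Even with longer timeouts, though, the question remains why the LeaderElection 
Agent is not started once the connection does succeed.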


[jira] [Commented] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master

2020-12-31 Thread Saloni (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256962#comment-17256962
 ] 

Saloni commented on SPARK-33943:


As per my understanding, the retries going on in the logs for establishing a 
zookeeper session are for 'Persisting recovery state to ZooKeeper'.
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery 
state to ZooKeeper Initiating client connection, 
connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: 
sessionTimeout=6 watcher=org.apache.curator.ConnectionState
{code}
Once this is successfully established, the ZooKeeper LeaderElection Agent 
should then ideally be called.

The last lines in the log state that a session was successfully created; it 
seems this was for the Persistence Engine (since the connection was initiated 
for it).

 
{code:java}
10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-3:, initiating session 
10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session 
establishment complete on server zookeeper-3:, sessionid = , negotiated 
timeout = 4 
10:05:57.580 INFO org.apache.curator.framework.state.ConnectionStateManager:228 
- State change: CONNECTED
{code}
What I don't understand is why the ZooKeeper LeaderElection Agent was not 
called if the Spark Master was able to connect to the ZooKeeper servers.
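
For context, here is a minimal, self-contained sketch of what the 
leader-election step does once it actually runs. It is based on Curator's 
LeaderLatch recipe, which as far as I understand is what the agent is built 
on; the connect string, ports, and znode path below are placeholders, not 
values from this cluster. The point is that until something starts the latch, 
neither master can ever leave STANDBY:
{code:scala}
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}
import org.apache.curator.retry.ExponentialBackoffRetry

object LeaderElectionSketch {
  def main(args: Array[String]): Unit = {
    // Separate Curator client: the Persistence Engine reaching CONNECTED (as in
    // the log above) does not by itself start this election machinery.
    val zk = CuratorFrameworkFactory.newClient(
      "zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", // placeholder ports
      new ExponentialBackoffRetry(5000, 3))
    zk.start()

    // Election happens under a znode path; "/spark/leader_election" is an assumed example.
    val latch = new LeaderLatch(zk, "/spark/leader_election")
    latch.addListener(new LeaderLatchListener {
      override def isLeader(): Unit = println("Elected leader -> master would leave STANDBY")
      override def notLeader(): Unit = println("Lost leadership -> master goes back to STANDBY")
    })
    latch.start() // if the agent is never invoked, this call never happens
  }
}
{code}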

 


[jira] [Updated] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master

2020-12-30 Thread Saloni (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saloni updated SPARK-33943:
---
Description: 
I have 2 spark masters and 3 zookeepers deployed on my system on separate 
virtual machines. I am using spark in standalone mode.

The services come up online in the below sequence:
 # zookeeper-1
 # sparkmaster-1
 # sparkmaster-2
 # zookeeper-2
 # zookeeper-3

The above sequence leads to both the spark masters running in STANDBY mode.

From the logs, I can see that only after the zookeeper-2 service comes up (i.e. 
2 zookeeper services are up) is the spark master able to create a zookeeper 
session. Until zookeeper-2 is up, it retries session creation. However, even 
after both zookeeper services are up and the Persistence Engine is able to 
successfully connect and create a session, *the ZooKeeper LeaderElection Agent 
is not called*.
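
To narrow this down, one check would be to look from ZooKeeper's side at which 
znodes the master actually created, as sketched below. The paths are 
assumptions based on the documented default spark.deploy.zookeeper.dir=/spark, 
not values read from this cluster; if the recovery-state directory exists but 
no election znode does, that would confirm the LeaderElection Agent never 
started:
{code:scala}
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object ZnodeCheckSketch {
  def main(args: Array[String]): Unit = {
    val zk = CuratorFrameworkFactory.newClient(
      "zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", // placeholder ports
      new ExponentialBackoffRetry(1000, 3))
    zk.start()
    zk.blockUntilConnected()

    // Assumed locations: recovery state and the election latch would live under
    // spark.deploy.zookeeper.dir (default /spark); exact child names may differ.
    for (path <- Seq("/spark/master_status", "/spark/leader_election")) {
      val stat = zk.checkExists().forPath(path)
      println(s"$path exists: ${stat != null}")
    }
    zk.close()
  }
}
{code}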

Logs (spark-master.log):
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery 
state to ZooKeeper Initiating client connection, 
connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: 
sessionTimeout=6 watcher=org.apache.curator.ConnectionState

# Only zookeeper-2 is online #

10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-1:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-1:: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-2:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-2:, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
read additional data from server sessionid 0x0, likely server has closed 
socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-3:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-3:: Connection refused 
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (35297) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
...
...
...
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (55301) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt 
unsuccessful after 105305 (greater than max timeout of 6). Resetting 
connection and trying again with a new connection. 
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed 
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client 
connection, connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: 
sessionTimeout=6 watcher=org.apache.curator.ConnectionState@ 
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread 
shut down for session: 0x0 
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /x/y 
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 

# zookeeper-2, zookeeper-3 are online # 

10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-2:. Will not attempt to authenticate 
using SASL (unknown error) 
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection establ

[jira] [Updated] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master

2020-12-30 Thread Saloni (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saloni updated SPARK-33943:
---
Environment: 
2 Spark Masters KVMs and 3 Zookeeper KVMs.
 Operating System - RHEL 6.10

  was:
2 Spark Masters KVMs and 3 Zookeeper KVMs.
Operating System - RHEL 6.6



[jira] [Updated] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master

2020-12-30 Thread Saloni (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saloni updated SPARK-33943:
---
Description: 
I have 2 spark masters and 3 zookeepers deployed on my system on separate 
virtual machines. The services come up online in the below sequence:
 # zookeeper-1
 # sparkmaster-1
 # sparkmaster-2
 # zookeeper-2
 # zookeeper-3

The above sequence leads to both the spark masters running in STANDBY mode.

From the logs, I can see that only after the zookeeper-2 service comes up (i.e. 
2 zookeeper services are up) is the spark master able to create a zookeeper 
session. Until zookeeper-2 is up, it retries session creation. However, even 
after both zookeeper services are up and the Persistence Engine is able to 
successfully connect and create a session, *the ZooKeeper LeaderElection Agent 
is not called*.

Logs (spark-master.log):
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery 
state to ZooKeeper Initiating client connection, 
connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: 
sessionTimeout=6 watcher=org.apache.curator.ConnectionState

# Only zookeeper-2 is online #

10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-1:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-1:: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-2:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-2:, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
read additional data from server sessionid 0x0, likely server has closed 
socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-3:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-3:: Connection refused 
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (35297) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
...
...
...
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (55301) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt 
unsuccessful after 105305 (greater than max timeout of 6). Resetting 
connection and trying again with a new connection. 
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed 
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client 
connection, connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: 
sessionTimeout=6 watcher=org.apache.curator.ConnectionState@ 
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread 
shut down for session: 0x0 
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /x/y 
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 

# zookeeper-2, zookeeper-3 are online # 

10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-2:. Will not attempt to authenticate 
using SASL (unknown error) 
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-2:, initiating 

[jira] [Created] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master

2020-12-30 Thread Saloni (Jira)
Saloni created SPARK-33943:
--

 Summary: Zookeeper LeaderElection Agent not being called by Spark 
Master
 Key: SPARK-33943
 URL: https://issues.apache.org/jira/browse/SPARK-33943
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs.
Operating System - RHEL 6.6
Reporter: Saloni


I have 2 spark masters and 3 zookeepers deployed on my system on separate 
virtual machines. The services come up online in the below sequence:
 # zookeeper-1
 # sparkmaster-1
 # sparkmaster-2
 # zookeeper-2
 # zookeeper-3

The above sequence leads to both the spark masters running in STANDBY mode.

From the logs, I can see that only after the zookeeper-2 service comes up (i.e. 
2 zookeeper services are up) is the spark master able to create a zookeeper 
session. Until zookeeper-2 is up, it retries session creation. However, even 
after both zookeeper services are up and the Persistence Engine is able to 
successfully connect and create a session, *the ZooKeeper LeaderElection Agent 
is not called*.

Logs (spark-master.log):
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery 
state to ZooKeeper Initiating client connection, 
connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: 
sessionTimeout=6 watcher=org.apache.curator.ConnectionState

# Only zookeeper-2 is online #

10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-1:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-1:: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-2:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-2:, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
read additional data from server sessionid 0x0, likely server has closed 
socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-3:. Will not attempt to authenticate 
using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-3:: Connection refused 
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (35297) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
...
...
...
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) 
and timeout (15000) / elapsed (55301) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt 
unsuccessful after 105305 (greater than max timeout of 6). Resetting 
connection and trying again with a new connection. 
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed 
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client 
connection, connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: 
sessionTimeout=6 watcher=org.apache.curator.ConnectionState@ 
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread 
shut down for session: 0x0 
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /x/y 
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 

# zookeeper-2, zookeeper-3 are online # 

10:05:47.357 INFO org.apache.zookeeper.C